The Community Forums

cPanelResources

Tutorial How To Train SpamAssassin with SA-Learn

Identify more SPAM with SpamAssassin's sa-learn utility.

    Overview
    This tutorial includes instructions for using SpamAssassin's sa-learn utility to identify and catch more SPAM.

    Important Notes
    1. The instructions in this tutorial were tested on a server running CentOS 7.6 and cPanel & WHM version 78. However, the commands and examples provided in this tutorial should work with any supported version of cPanel and WHM as well as any supported version of CentOS, Red Hat, or CloudLinux.

    2. The instructions in this tutorial are intended for use with the maildir email format, in which each mailbox is a directory and each message is a file. This is why sa-learn can be run directly on individual emails and directories. cPanel & WHM uses maildir by default. You can read about maildir versus mdbox in this document.

    Terms to know
    Spam: In simple terms, spam is unsolicited email. It can take many forms, such as junk email, phishing emails, or legitimate emails that we do not want to receive and cannot opt out of. For example, a promotional mailing list that does not offer an opt-out option.
    SpamAssassin: SpamAssassin is an open-source tool to help system administrators fight spam using filters and automated tools to facilitate the scanning of emails before they are delivered to the recipient.
    Sa-learn: The sa-learn tool teaches SpamAssassin about emails that we consider spam but that SpamAssassin lets through as not spam. It trains a Bayesian filter to recognize more 'signs' within emails that should be marked as spam.
    Email Headers: The headers, as the name implies, are the top part of an email. Email clients usually hide most of them. They carry the routing and identifying information of the message, including the sender, recipient, date, and subject. Some headers are mandatory, such as the From and Date headers.
    Email body: The body of an email is the main content of an email message.
    Bayesian Filter: Bayesian spam filtering attempts to classify messages into two categories, "spam" and "not spam", on the basis of machine learning. The sa-learn tool is the primary way to conduct that machine learning (though SpamAssassin is capable of learning some messages automatically under certain conditions). The tool allows users to train the filter on examples of messages that the user knows for certain to be "spam" or "not spam". Once enough good examples of both categories are learned, the Bayesian filter should be able to assign a reasonable probability that a new message is "spam" or "not spam" based on those previous classifications.

    Step 1. Ensure SpamAssassin is enabled and running.
    SpamAssassin depends on the spamd service which must be active at all times to ensure it can scan emails.

    A running spamd process looks like the following:
    Code:
    ps aux|grep spamd
    root      9356  0.0  5.5 225116 104732 ?       Ss   02:51   0:10 /usr/local/cpanel/3rdparty/perl/528/bin/perl -T -w /usr/local/cpanel/3rdparty/bin/spamd --max-spare=1 --max-children=3 --allowed-ips=127.0.0.1,::1 --pidfile=/var/run/spamd.pid --listen=5 --listen=6
    
    If the spamd service is not running, make sure to enable it from the WHM > Service Configuration > Service Manager > Apache SpamAssassin™ interface.
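
    The check above can also be scripted. The following is only a sketch: the function name has_spamd is hypothetical, and it assumes that the spamd command line contains the path to the spamd binary, as it does in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: read a process listing on stdin and succeed if any
# line references the spamd binary. The [s] bracket keeps grep's own
# command line from matching the pattern.
has_spamd() {
    grep -q 'bin/[s]pamd'
}

# Feed it the full command line of every process on the server.
if ps -A -o args | has_spamd; then
    echo "spamd is running"
else
    echo "spamd is not running; enable it in WHM's Service Manager"
fi
```

    Reading the listing from standard input keeps the function easy to try against saved ps output as well as a live server.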

    The sa-learn utility used in this tutorial ships with cPanel & WHM by default:
    Code:
    /usr/local/cpanel/3rdparty/bin/sa-learn --version
    SpamAssassin version 3.4.2
    
    Step 2. Understand the email account directory structure.
    The "/home/USER/mail/DOMAIN.TLD/ACCOUNT/" directory contains many directories and files. We are interested in the following ones:
    Code:
    /home/cptech/mail/cptech.testing/test_2/cur
    /home/cptech/mail/cptech.testing/test_2/new
    /home/cptech/mail/cptech.testing/test_2/.Junk
    
    The "cur" directory contains emails that were read using Webmail or any email client (IMAP).

    The "new" directory contains emails we have not yet opened.

    The ".Junk" directory contains emails that were delivered directly to it (marked as spam) or emails that we manually marked spam from the email client or Webmail.
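
    As a rough illustration of this layout, the sketch below prints the directories for one account and counts the files in each. The function name maildir_dirs is hypothetical. It lists .Junk/cur and .Junk/new because .Junk is itself a maildir folder, which is why the sa-learn command in Step 3 targets ".Junk/{cur,new}".

```shell
#!/bin/sh
# Hypothetical helper: print the maildir directories of one email account.
# Arguments: cPanel user, domain, account (mailbox) name.
maildir_dirs() {
    base="/home/$1/mail/$2/$3"
    printf '%s\n' "$base/cur" "$base/new" "$base/.Junk/cur" "$base/.Junk/new"
}

# Count the message files in each directory, skipping any that do not exist.
maildir_dirs cptech cptech.testing test_2 | while IFS= read -r dir; do
    [ -d "$dir" ] || continue
    count=$(find "$dir" -maxdepth 1 -type f | wc -l)
    echo "$dir: $count file(s)"
done
```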

    Step 3. Using sa-learn
    Important Note:
    Execute sa-learn commands as the cPanel user that owns the email account to ensure the commands function correctly.

    The sa-learn command is simple to use, and there are two main ways to run it. The first is on the Junk directory, which is straightforward. The second is on specific email files.

    A. For example, to teach SpamAssassin to classify as spam the emails that we have already moved to the spam/Junk folder (from the email interface), we would use the following command:
    Code:
    /usr/local/cpanel/3rdparty/bin/sa-learn -p /home/USER/.spamassassin/user_prefs --spam /home/USER/mail/DOMAIN.TLD/ACCOUNT/.Junk/{cur,new}
    
    Where "USER" is the actual cPanel account, ".spamassassin" is a hidden directory within the account's home directory, "DOMAIN.TLD" is the actual domain receiving the spam (for example, "mydomain.com"), and "ACCOUNT" is the email account name.

    Important Note: The "-p" option with "/home/USER/.spamassassin/user_prefs" ensures that existing user settings are honored, such as a whitelisted email account.
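
    The command above relies on the "{cur,new}" brace expansion, which is a bash feature. A minimal sketch of the same idea for a plain sh script follows. The function name learn_junk and the DRY_RUN variable are hypothetical; only the sa-learn path and options come from the command above. As noted earlier, run it as the cPanel user that owns the email account.

```shell
#!/bin/sh
SA_LEARN=/usr/local/cpanel/3rdparty/bin/sa-learn

# Hypothetical helper: train on both subdirectories of an account's .Junk
# folder. Arguments: cPanel user, domain, account name.
# Set DRY_RUN=1 to print the commands instead of running them.
learn_junk() {
    user=$1 domain=$2 account=$3
    for sub in cur new; do
        set -- "$SA_LEARN" -p "/home/$user/.spamassassin/user_prefs" \
            --spam "/home/$user/mail/$domain/$account/.Junk/$sub"
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$*"
        else
            "$@"
        fi
    done
}

# Preview the two commands without touching the Bayes database.
DRY_RUN=1 learn_junk USER DOMAIN.TLD ACCOUNT
```

    Previewing with DRY_RUN=1 first makes it easy to confirm the paths before any messages are learned.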

    B. The second form of the command is used on specific email files. Each email message received on the server is stored on disk as a text file.

    Let us illustrate this with an example. I have sent an email from "test@cptech.testing" to "test_2@cptech.testing". Because it has not been opened yet, the email file is under "/home/cptech/mail/cptech.testing/test_2/new/":

    Code:
    ls -1 /home/cptech/mail/cptech.testing/test_2/new/
    1555510494.M399983P18557.cpanel.novalocal\,S\=1001\,W\=1030
    
    Important Note: Remember the format above "/home/USER/mail/DOMAIN.TLD/ACCOUNT/".
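
    The file name itself carries some information. The sketch below assumes the usual maildir naming as written by Dovecot: the number before the first dot is the delivery time in seconds since the epoch, and "S=" is the message size in bytes. The backslashes in the listing above appear to be shell escaping rather than part of the name, so they are left out here. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the delivery timestamp (epoch seconds), which is
# the text before the first dot of the file name.
maildir_epoch() {
    name=${1##*/}                  # strip any leading directory
    printf '%s\n' "${name%%.*}"
}

# Hypothetical helper: print the size in bytes recorded in the ",S=" field.
maildir_size() {
    printf '%s\n' "$1" | sed -n 's/.*,S=\([0-9][0-9]*\).*/\1/p'
}

f='1555510494.M399983P18557.cpanel.novalocal,S=1001,W=1030'
echo "delivered (epoch): $(maildir_epoch "$f")"
echo "size (bytes): $(maildir_size "$f")"
```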

    Now, I consider the above email to be spam. Why? Because the email says so. Here is part of the text from the above file:
    Code:
    From: test@cptech.testing
    To: test_2@cptech.testing
    Subject: Hello
    Message-ID: <c277ff65b6589db5dd471b924868e08e@cptech.testing>
    X-Sender: test@cptech.testing
    User-Agent: Roundcube Webmail/1.3.7
    
    This email should be marked as spam because it is spam.
    
    The body of the email, which starts right after the "User-Agent" header, has the phrase:
    Code:
    This email should be marked as spam because it is spam.
    
    Now, let us tell SpamAssassin that the above email, and similar ones, should be marked as spam:
    Code:
    /usr/local/cpanel/3rdparty/bin/sa-learn -p /home/cptech/.spamassassin/user_prefs --spam /home/cptech/mail/cptech.testing/test_2/new/1555510494.M399983P18557.cpanel.novalocal\,S\=1001\,W\=1030
    Learned tokens from 1 message(s) (1 message(s) examined)
    
    That is it. The more unwanted emails we feed SpamAssassin, the more accurate it becomes at identifying spam.

    Conversely, if SpamAssassin is falsely marking an email message as spam, we can use the same command with the "--ham" option:
    Code:
    /usr/local/cpanel/3rdparty/bin/sa-learn -p /home/cptech/.spamassassin/user_prefs --ham /home/cptech/mail/cptech.testing/test_2/new/1555510494.M399983P18557.cpanel.novalocal\,S\=1001\,W\=1030
    Learned tokens from 1 message(s) (1 message(s) examined)
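
    Since the spam and ham commands differ only in one option, they can be wrapped in a small function. This is a sketch under assumptions: the name sa_train, its argument order, and the DRY_RUN variable are hypothetical and not part of sa-learn.

```shell
#!/bin/sh
SA_LEARN=/usr/local/cpanel/3rdparty/bin/sa-learn

# Hypothetical wrapper: sa_train MODE PREFS PATH...
# MODE must be "spam" or "ham"; anything else is rejected before sa-learn runs.
# Set DRY_RUN=1 to print the command instead of running it.
sa_train() {
    if [ "$#" -lt 3 ]; then
        echo "usage: sa_train spam|ham PREFS PATH..." >&2
        return 2
    fi
    mode=$1 prefs=$2
    shift 2
    case $mode in
        spam|ham) ;;
        *) echo "sa_train: mode must be spam or ham" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$SA_LEARN -p $prefs --$mode $*"
    else
        "$SA_LEARN" -p "$prefs" "--$mode" "$@"
    fi
}

# Preview the ham command for the example message.
DRY_RUN=1 sa_train ham /home/cptech/.spamassassin/user_prefs \
    '/home/cptech/mail/cptech.testing/test_2/new/1555510494.M399983P18557.cpanel.novalocal,S=1001,W=1030'
```

    According to the sa-learn documentation, learning a message again with the other classification corrects the earlier one, so a mistake can be fixed by rerunning the command with the right option.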
    
    We can see how many tokens were learned by executing the following command:
    Code:
    /usr/local/cpanel/3rdparty/bin/sa-learn --dump magic
    0.000          0          3          0  non-token data: bayes db version
    0.000          0          1          0  non-token data: nspam
    0.000          0          0          0  non-token data: nham
    0.000          0         78          0  non-token data: ntokens
    0.000          0 1555510494          0  non-token data: oldest atime
    0.000          0 1555510494          0  non-token data: newest atime
    0.000          0          0          0  non-token data: last journal sync atime
    0.000          0          0          0  non-token data: last expiry atime
    0.000          0          0          0  non-token data: last expire atime delta
    0.000          0          0          0  non-token data: last expire reduction count
    
    We should pay attention to the following parameters:
    nspam: Number of spam messages examined.
    nham: Number of ham (non-spam) messages examined.
    ntokens: Number of tokens learned.
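
    To pull just these three values out of the dump, a short awk filter can be used. This is a sketch: the function name bayes_counts is hypothetical, and it assumes the column layout shown in the output above, where the value is the third field and the name is the last.

```shell
#!/bin/sh
# Hypothetical helper: read "sa-learn --dump magic" output on stdin and print
# the three counters as name=value lines.
bayes_counts() {
    awk '$NF == "nspam" || $NF == "nham" || $NF == "ntokens" { print $NF "=" $3 }'
}

# On a live server (as the cPanel user):
#   /usr/local/cpanel/3rdparty/bin/sa-learn --dump magic | bayes_counts
# Here it is fed the sample lines from this tutorial instead.
bayes_counts <<'EOF'
0.000          0          3          0  non-token data: bayes db version
0.000          0          1          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0         78          0  non-token data: ntokens
EOF
# prints:
#   nspam=1
#   nham=0
#   ntokens=78
```

    These counters matter because, with SpamAssassin's default settings (bayes_min_spam_num and bayes_min_ham_num), the Bayesian filter is not used for scoring until at least 200 spam and 200 ham messages have been learned.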
    
    Additional Reading
    1. Detailed information about SpamAssassin can be found here: SpamAssassin: Documentation
    2. Detailed information about sa-learn can be found here: https://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html

    Questions/Feedback
    Feel free to click on the Discussion tab to let us know if you have any questions or feedback about the information in this tutorial.