A Protocol for Modular Article Filtering in Usenet

Tim Pierce
Last change: 31 Oct 1996

Abstract

Traditionally, automatic filtering in Usenet newsreaders is accomplished by means of a kill file (so named after the KILL file popularized by rn). Kill files are normally very limited in capability, often being unable to do more than mark articles with a given subject, or from a given author, as having been "read." Even in the more powerful newsreaders, a filtering action (e.g. junking or selecting) can depend on only a single condition being met, making it impossible to write a filtering rule which depends on two or more independent criteria (e.g. a rule which is invoked only when the Subject and From headers match certain patterns). The scoring newsreaders which have recently become popular allow users considerably more power, by assigning numeric scores to articles that match particular patterns. However, even these powerful scoring interfaces cannot perform extremely precise filtering, due to limitations in the way scoring commands are designed and implemented.

There are advantages to making the scoring engine completely independent of the newsreader, as Brad Templeton's NewsClip[tm] package does. Notably, an external filter offers the user more power and flexibility than existing interfaces; it permits cross-platform compatibility for filtering systems; and it reduces the complexity of the newsreader's internal logic. NewsClip falls short on the first point, in that its filters must be written in an idiosyncratic compiled language: though this makes it a more powerful interface, it also makes it less flexible, since the user is bound to a single language and limited to the platforms on which that language is implemented. This proposal does not specify the use of any particular language, thus sacrificing some power but gaining flexibility and simplicity. Such an interface has already been implemented in trn and slrn, and work is proceeding for other newsreaders.

Newsreader Interface

In order to perform modular scoring, the newsreader must spawn an external program, called the article filter or simply the filter, which actually calculates scores for individual articles. By sending messages to this filter, the newsreader advises it when new groups are entered and asks it to calculate scores for individual articles. This filter should be invoked once when the newsreader is started or when the first newsgroup is entered, killed (preferably via a bye message) when the newsreader exits, and not deliberately destroyed at any time during the course of a newsreading session. At the very least, if the filter should ever need to be restarted or reinitialized, the newsreader should do so transparently, rather than require the user to take some explicit action.
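The lifecycle described above might look like this on the newsreader side. This is only a sketch in Perl, assuming the filter is an ordinary child process that reads commands on its standard input and writes replies on its standard output. The names start_filter, send_line, request_scores and stop_filter are hypothetical; IPC::Open2 is a core Perl module.

```perl
use strict;
use warnings;
use IPC::Open2;

# Spawn the filter once and keep both pipe ends for the whole session.
sub start_filter {
    my (@command) = @_;
    my ($from_filter, $to_filter);
    my $pid = open2($from_filter, $to_filter, @command);
    return { pid => $pid, in => $from_filter, out => $to_filter };
}

# Send one protocol line, terminated by CRLF.
sub send_line {
    my ($filter, $line) = @_;
    my $out = $filter->{out};
    print {$out} "$line\r\n";
}

# Ask for scores and collect "number score" lines until "done".
sub request_scores {
    my ($filter) = @_;
    send_line($filter, 'scores');
    my %score;
    my $in = $filter->{in};
    while (defined(my $reply = <$in>)) {
        $reply =~ s/\r?\n\z//;
        last if $reply eq 'done';
        my ($number, $value) = split ' ', $reply;
        $score{$number} = $value if defined $value;
    }
    return \%score;
}

# Shut the filter down: say bye, close the pipe, reap the child.
sub stop_filter {
    my ($filter) = @_;
    send_line($filter, 'bye');
    close $filter->{out};
    waitpid $filter->{pid}, 0;
}
```

A newsreader built this way would call start_filter once, request_scores as often as it needs, and stop_filter on exit, so the user never has to manage the filter process by hand.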

In principle, the newsreader does not even need to spawn a filtering process itself: instead, it could simply open a communications channel with some already-running process, a filtering daemon which would then calculate and return a set of scores. Such a daemon would not even have to run locally, but could be reached remotely over a network. However, for the sake of convenience, this proposal keys articles on their "article numbers," which are unique to a particular news server, rather than on Message-IDs, which are persistent across sites. Thus, the filtering agent must be local to the newsreader's site. Implementors are encouraged to experiment in this area.

All communication between the newsreader and the filter uses only 7-bit ASCII, in accordance with existing news conventions and standards. Each message is a single line of data terminated with a CRLF pair (octal 015, carriage return, followed by octal 012, line feed). Every message sent by the newsreader is one of three commands: newsgroup, art, and scores.
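The framing rules above can be sketched as a small parsing routine. This is an illustration only: the name parse_command is hypothetical, and the sketch assumes that a single space separates the command word from its argument, which the proposal does not state explicitly. A production filter might also choose to tolerate a bare LF.

```perl
use strict;
use warnings;

# Split one raw protocol line into (command, argument).
# Returns an empty list for lines that are not valid messages.
sub parse_command {
    my ($raw) = @_;
    return unless $raw =~ s/\r\n\z//;      # must end in CRLF
    return if $raw =~ /[^\x00-\x7F]/;      # 7-bit ASCII only
    my ($command, $argument) = split / /, $raw, 2;
    return unless defined $command
        && $command =~ /\A(?:newsgroup|art|scores)\z/;
    return ($command, defined $argument ? $argument : '');
}
```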

newsgroup
The newsgroup newsgroup-name command signals that a new newsgroup is being entered. The filter should use this information to select newsgroup-specific scoring routines.
art
The art overview-data command sends one record of overview data to the filter exactly as produced by the output of the XOVER NNTP command. Nearly every message received from the newsreader will be an art command.
scores
Upon receiving a scores command, the filter must return all of the article scores it has calculated to that point. If a particular overview record did not cause the filter to calculate any particular score, the filter should return 0 for that article. The order in which scores are returned is unimportant; the newsreader should be prepared for scores to be returned in any order at all.

The filter sends three messages to the newsreader: skip, article score, and done.

skip
A skip command is sent in response to a newsgroup command when there are no external filters to be run for that newsgroup.
article score
When the filter sends an article score to the newsreader, it sends a line consisting of the article number (the first field in the article's overview record), followed by some whitespace, followed by the numeric score for that article, followed by CRLF. The score may be any integer; this proposal does not limit its size. The filter should send a score for every article it has received, even if that score is zero, but by the same token the newsreader should gracefully handle situations in which the article filter does not return any scoring information.
done
After all scores have been sent to the newsreader, the filter sends a done command, signifying that it has transmitted all scoring data presently available.

(NOTE: An example of a scoring session or two would be helpful at this point.)
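In that spirit, here is a small simulated session. It is not taken from any real newsreader: the overview records are abbreviated to three fields for readability, the scoring rule is invented, and the sketch assumes that scores are discarded when a new newsgroup is entered. The sub name handle_message is hypothetical.

```perl
use strict;
use warnings;

my %score;   # article number => score

# Handle one newsreader message; return the list of reply lines (no CRLF).
sub handle_message {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    if ($line =~ /\Anewsgroup (\S+)\z/) {
        %score = ();                   # assumption: scores are per-group
        return;
    }
    if ($line =~ /\Aart (\d+)\t([^\t]*)/) {
        my ($number, $subject) = ($1, $2);
        # Invented rule: penalize one notorious subject line.
        $score{$number} = ($subject =~ /make money fast/i) ? -100 : 0;
        return;
    }
    if ($line eq 'scores') {
        my @reply = map { "$_ $score{$_}" } sort { $a <=> $b } keys %score;
        return (@reply, 'done');
    }
    return;
}

# A canned session: N = newsreader to filter, F = filter to newsreader.
for my $message (
    "newsgroup misc.test",
    "art 101\tRe: test\tA <a\@example.com>",
    "art 102\tMAKE MONEY FAST\tB <b\@example.com>",
    "scores",
) {
    print "N: $message\n";
    print "F: $_\n" for handle_message("$message\r\n");
}
```

In this run the filter stays silent until the scores command arrives, then answers with the lines "101 0", "102 -100" and "done".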

Implementors of scoring packages should note that the newsreader may request scores at any time while sending overview records, and it may begin sending overview records at any time after requesting scores. Specifically, it may request a score each time it sends an overview record to the filter. Thus, although a filter may conceivably be written so that the score of one article depends on other articles in the newsgroup (i.e. the contents of other overview records), authors should be aware that this behavior will not be reliable in every newsreader implementing this system.
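The following sketch illustrates why such cross-article rules are unreliable. It assumes an invented rule (an article's score is the number of articles seen so far from the same author) and assumes that the newsreader keeps the first score it is given for each article. The name run_session is hypothetical.

```perl
use strict;
use warnings;

# Return the scores a newsreader would receive for @articles.
# $per_article true: scores are requested after every art command.
# $per_article false: scores are requested once, after the last article.
sub run_session {
    my ($per_article, @articles) = @_;   # each article: [number, author]
    my (%seen, %reported);
    for my $i (0 .. $#articles) {
        $seen{ $articles[$i][1] }++;
        next unless $per_article || $i == $#articles;
        # A "scores" request arrives here: report every article received
        # so far that has not been reported yet.
        for my $art (@articles[0 .. $i]) {
            my ($number, $author) = @$art;
            $reported{$number} = $seen{$author}
                unless exists $reported{$number};
        }
    }
    return \%reported;
}

my @articles = ([1, 'alice'], [2, 'alice'], [3, 'bob']);
my $batch    = run_session(0, @articles);
my $stepwise = run_session(1, @articles);
printf "article %d: batch %d, per-article %d\n",
    $_, $batch->{$_}, $stepwise->{$_} for sort keys %$batch;
```

Article 1 receives a score of 2 when scores are requested in one batch but only 1 when they are requested after each record, because the second article by the same author has not yet been seen.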

(The protocol is designed this way specifically to accommodate trn, which in its current incarnation requires an immediate response each time an article score is requested. It is not considered a feature.)

Newsreader authors also should note that it is unwise to request scores before any overview data has been sent, lest the snake bite its own tail.

NOTE: It would be useful here to present some statistics about the speed and efficiency of different scoring interfaces.

Filter Design

This protocol is intentionally designed to be lightweight, to encourage easy integration into existing newsreaders. For the sake of clean and compatible design, as well as end-user ease of use, filtering packages should be written in high-level languages (e.g. Perl, Tcl, Scheme), to make the raw overview data easier to handle and scores easier to manipulate. How this is done is up to the implementors of each package; one advantage of this proposal is that anyone who is not satisfied with the semantics of a particular scoring package may write their own with little effort, without having to modify the newsreader itself, and the redesigned package should work transparently with any newsreader implementing this system.

Filtering packages have already been written in Perl and Tcl. Some details of the Perl implementation are described here, merely as an example of a useful filtering package. It is instructive to note that the same package can be implemented, with little difficulty, in any one of many high-level interpreted languages.

When the Perl article filter is invoked, it first loads a library of functions to assist with article scoring. The most interesting ones (from the user's point of view) are those which actually assign scores to articles:

score_art
Takes two arguments, article and score. Adds the numeric argument score to whatever score article has already accumulated.
select_art
One argument: article. Assigns an extremely high score (currently 10,000) to article, ignoring whatever score this article previously held. Note: subsequent calls to score_art for this article may alter this score.
junk_art
One argument: article. Like select_art, but assigns an extremely high negative score to article.
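The source does not show how these functions are implemented. One plausible sketch follows, assuming that scores live in a hash keyed on article number, that the article record stores that number under a key called number, and that the junk score mirrors the select score at -10,000. The helper current_score is hypothetical and exists only to inspect the result.

```perl
use strict;
use warnings;

my %score;                     # article number => score (hypothetical store)
my $SELECT_SCORE =  10_000;    # value quoted in the text
my $JUNK_SCORE   = -10_000;    # assumption: mirror of the select score

# Add $points to whatever the article has accumulated so far.
sub score_art {
    my ($article, $points) = @_;
    $score{ $article->{number} } += $points;
}

# Overwrite any earlier score with the select score.
sub select_art {
    my ($article) = @_;
    $score{ $article->{number} } = $SELECT_SCORE;
}

# Overwrite any earlier score with the junk score.
sub junk_art {
    my ($article) = @_;
    $score{ $article->{number} } = $JUNK_SCORE;
}

# Look up the current score; unscored articles count as 0.
sub current_score {
    my ($article) = @_;
    return $score{ $article->{number} } // 0;
}
```

Under these assumptions, calling select_art and then score_art with -50 leaves the article at 9,950, which is the behavior the note on select_art warns about.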

The filter also loads a Perl library named global, if one is found. This code defines a global filtering subroutine, global_score, which will be used to calculate a score for every article received, regardless of the newsgroup in which it was found. An example of a global library that may be of interest to many users:

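    # global

    # Penalize crossposting: -5 points for each newsgroup in the Xref header.
    sub global_score {
        my ($a) = @_;

        # tr/ // changes nothing; it returns the number of spaces.
        my $ngs = ($a->{xref} =~ tr/ //);
        score_art($a, $ngs * -5);
    }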

The tr operator, used this way, returns the number of spaces found in a string, in this case the contents of the article's Xref header. Each time the global_score function is called to calculate a score for an article, it will assign that article -5 points for each newsgroup it is crossposted to. The more newsgroups to which an article is crossposted, the lower the score it receives.
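As a worked example of the arithmetic, the sketch below applies the same count to a made-up Xref value. It assumes the stored value holds the server name followed by one group:number pair per newsgroup, without the leading "Xref:" header name; if that prefix were still present, the count would be one higher. The name crosspost_penalty is hypothetical.

```perl
use strict;
use warnings;

# Count crossposted groups the way global_score does: one space per group.
sub crosspost_penalty {
    my ($xref) = @_;
    my $groups = ($xref =~ tr/ //);   # tr returns the number of matches
    return $groups * -5;
}

# Hypothetical Xref value: server name, then one group:number pair per group.
my $xref = 'news.example.com alt.test:101 misc.test:202 news.misc:303';
print crosspost_penalty($xref), "\n";   # three spaces, so -15
```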

Since the user may not wish to apply such draconian measures to articles that appear in news.answers or its sibling groups, the body of global_score can be wrapped in a test:

    unless ($a->{xref} =~ /answers/) {
        my $ngs = ($a->{xref} =~ tr/ //);
        score_art($a, $ngs * -5);
    }

These are merely examples, intended to illustrate how easy it is to write a script which filters articles through a very fine sieve.

After loading the global library, the filter enters a loop which responds continuously to commands sent by the newsreader (on standard input). When a newsgroup command is received, the filter looks for a Perl library of the same name as the newsgroup selected, and loads it. For example, when the filter receives the message newsgroup news.admin.misc, it looks in its search path for a file called news.admin.misc which contains Perl code. This library, if it exists, should define a subroutine named local_score, which subsequently will be used for scoring articles in that newsgroup.
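The library lookup described above could be sketched as follows. The name load_group_library and the form of the search path (a plain list of directories) are assumptions; the real filter's search rules are not given in this document. The sketch also undefines any earlier local_score first, so that a newsgroup without a library does not inherit the previous group's rules.

```perl
use strict;
use warnings;
use File::Spec;

# Look for a file named after the newsgroup in each directory of
# @search_path and load the first one found. Returns the path loaded,
# or an empty list if the group has no library.
sub load_group_library {
    my ($group, @search_path) = @_;
    # Start from a clean slate so the previous group's rules cannot leak.
    undef &main::local_score if defined &main::local_score;
    for my $dir (@search_path) {
        my $path = File::Spec->rel2abs(File::Spec->catfile($dir, $group));
        next unless -f $path;
        do $path;                                  # compile and run the library
        die "cannot compile $path: $@" if $@;
        return $path;
    }
    return;
}
```

After a successful call, the main loop can test defined &main::local_score to decide whether there is a newsgroup-specific routine to run.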

When an art command containing an overview record is received, the filter parses the overview data into an associative array. A reference to this structure is passed to the global_score subroutine and the local_score subroutine in turn, if either of them is defined. Because the global_score function is defined exactly once when the filter is invoked, it will remain constant as long as the filter process remains alive. Since the local_score function is defined differently in each newsgroup-specific library, it is redefined each time the filter receives a newsgroup command and will calculate a score for each article according to the rules for the newsgroup the user is currently reading.
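A sketch of the parsing step follows. It relies on the usual layout of an overview record: tab-separated fields giving the article number, Subject, From, Date, Message-ID, References, byte count and line count, followed by optional headers in "Name: value" form. The hash key names are an assumption chosen to follow the examples in this document (capitalized Subject and From, lowercase xref); the name parse_overview is hypothetical.

```perl
use strict;
use warnings;

# Fixed leading fields of an overview record, in order.
my @FIELDS = qw(number Subject From Date Message-ID References bytes lines);

# Turn one tab-separated overview record into a hash reference.
sub parse_overview {
    my ($record) = @_;
    $record =~ s/\r?\n\z//;
    my @values = split /\t/, $record, -1;   # -1 keeps trailing empty fields
    my %article;
    for my $name (@FIELDS) {
        $article{$name} = shift @values;
    }
    # Any remaining fields are optional headers in "Name: value" form.
    for my $extra (@values) {
        next unless $extra =~ /\A([^:\s]+):\s*(.*)\z/s;
        $article{ lc $1 } = $2;
    }
    return \%article;
}
```

The reference returned here is what would be handed to global_score and local_score in turn.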

When the filter receives a scores command, it returns each score it has calculated (together with the number of the article possessing that score, of course). These lines are printed on standard output, followed by done.

All of this bookkeeping is handled automatically by the script. All the user need do is write the global_score function and each newsgroup's local_score function, assigning scores to articles by means of score_art and related functions. However, one of the strengths of this system is that users who find one filtering package unsatisfactory may write a completely new package with comparative ease. The default package is provided simply as a convenience for users who do not have such needs. An example of what one user's news.admin.net-abuse.misc filter might look like:

    # news.admin.net-abuse.misc

    sub local_score {
        my ($a) = @_;

        junk_art($a) if ($a->{Subject} =~ /forge/i and
                         $a->{From} =~ /boursy/i);
    }

When news.admin.net-abuse.misc is selected, the newsreader sends a newsgroup command to the filter, which loads the news.admin.net-abuse.misc library, overriding any previous definition of local_score.

Conventional newsreaders do not allow the user to make a particular rule contingent upon two independent headers. For example, it is difficult in some scoring newsreaders to write a rule that junks an article if the article's subject contains the word "forge" and the author's name is "boursy", but not if either condition is met by itself. Generally the only way of doing this is to assign a large negative score to articles whose subject contains "forge" and another large negative score to articles written by "boursy", junking articles whose aggregate score is sufficiently negative. This clumsy solution quickly proves impractical if the user wishes to write a large number of rules which depend on such specific conditions, as the independent scoring rules soon start to interfere with one another. By contrast, a system which implements filtering by applying commands written in a high-level language makes such specific filtering almost a trivial matter.
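The interference problem can be made concrete with a small comparison. All the numbers here (the -600 and -500 penalties, the -1000 threshold) and the extra "$$$" rule are invented for illustration, as are the names additive_junks and compound_junks.

```perl
use strict;
use warnings;

my $JUNK_THRESHOLD = -1000;    # hypothetical cutoff for the additive scheme

# Additive scheme: independent rules whose penalties are summed.
sub additive_junks {
    my ($art) = @_;
    my $score = 0;
    $score -= 600 if $art->{Subject} =~ /forge/i;
    $score -= 600 if $art->{From}    =~ /boursy/i;
    $score -= 500 if $art->{Subject} =~ /\$\$\$/;    # an unrelated rule
    return $score <= $JUNK_THRESHOLD ? 1 : 0;
}

# Compound rule: junk only when both conditions hold.
sub compound_junks {
    my ($art) = @_;
    return ($art->{Subject} =~ /forge/i && $art->{From} =~ /boursy/i) ? 1 : 0;
}

my $art = { Subject => 'forged $$$ offers', From => 'someone@example.com' };
printf "additive: %s, compound: %s\n",
    additive_junks($art)  ? 'junk' : 'keep',
    compound_junks($art) ? 'junk' : 'keep';
```

The sample article is not from the targeted author, yet the additive scheme junks it because the unrelated rule pushes the total to -1100; the compound rule keeps it.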

Summary

Traditional newsreading packages provide relatively limited interfaces for filtering articles, and these interfaces are extremely idiosyncratic: each newsreader implements its own ruleset, its own command syntax, and its own interface. A more powerful and flexible means of filtering articles can be achieved, at the cost of a very small amount of efficiency, by "outsourcing" the task of article filtering to an external program. The newsreader sends article data to this external filter in the form of overviews; the filter calculates and returns scores for each of these articles by executing code written by the user. Because the externally spawned filter can be written in any programming language, the user has a potentially unlimited variety of interfaces and potentially unlimited power in determining how to filter articles. At the same time, the interface between the external filter and the newsreader is extremely simple and easy to write, permitting implementation in a matter of hours.

References