CRM114 (program)
CRM114 (full name: "The CRM114 Discriminator") is a program that classifies data using a statistical approach. It is used especially for filtering email spam.
Origin of the name
The name comes from the CRM-114 Discriminator in the Stanley Kubrick movie Dr. Strangelove, a piece of radio equipment designed to filter out messages lacking a specific code prefix.
Operation
Whereas other statistical Bayesian spam filters rely on the frequency of single-word occurrences in email, CRM114 achieves a higher rate of spam recognition by creating hits based upon phrases up to five words in length. These phrases are used to form a Markov Random Field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis[1] gave 99.87% accuracy;[2] Holden[3] and TREC 2005 and 2006[4][5] gave results of better than 99%, with significant variation depending on the particular corpus.
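The phrase-based features described above can be sketched as follows. This is a minimal illustration, not CRM114's actual feature generator: the function name `phrase_features`, the `<skip>` placeholder, and the choice to anchor each phrase on the first word of a five-word window are all assumptions made for the example.

```python
from itertools import combinations

SKIP = "<skip>"  # hypothetical placeholder for a word left out of a phrase


def phrase_features(tokens, window=5):
    """Return phrase features built from sliding windows of `window` words.

    Each feature keeps the first word of a window plus any subset of the
    words that follow it; omitted positions are marked with SKIP.
    """
    features = []
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        anchor, rest = span[0], span[1:]
        for size in range(len(rest) + 1):
            for keep in combinations(range(len(rest)), size):
                parts = [anchor] + [
                    rest[j] if j in keep else SKIP for j in range(len(rest))
                ]
                # Drop trailing placeholders so "a" is not emitted as "a <skip> ...".
                while len(parts) > 1 and parts[-1] == SKIP:
                    parts.pop()
                features.append(" ".join(parts))
    return features


features = phrase_features("a b c d e".split())
```

A full five-word window yields 16 features under this scheme (the anchor word combined with every subset of the four words after it), so a classifier sees both exact phrases and phrases with gaps, rather than isolated words.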
CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of KNN (K-nearest neighbor algorithm) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, an SVM, a classifier based on mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of skip-grams.
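The idea of classifying by mutual compressibility can be illustrated with a short sketch. It does not reproduce CRM114's modified LZ77 algorithm: it substitutes Python's standard `zlib` module (DEFLATE, which is LZ77-based) and the normalized compression distance, and the function names and sample corpora are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: lower means `a` and `b` share more structure."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def classify(text: bytes, corpora: dict) -> str:
    """Return the label of the corpus that compresses best together with `text`."""
    return min(corpora, key=lambda label: ncd(corpora[label], text))


# Hypothetical toy corpora, for illustration only.
corpora = {
    "spam": b"buy cheap pills now " * 20,
    "ham": b"meeting agenda for the quarterly review " * 20,
}
label = classify(b"buy cheap pills now buy cheap pills now", corpora)
```

Text that resembles a corpus adds little to the compressed size when appended to it, so the corpus with the smallest distance is taken as the predicted class.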
The CRM114 algorithms are multi-lingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%.[6]
CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.
At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of the reason is that the CRM114 language syntax is not positional, but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on absolutely identical strings matching to function correctly.
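To show what approximate matching means in practice, the sketch below tests whether a pattern occurs in a text within a bounded number of edits (insertions, deletions, or substitutions). It is a plain dynamic-programming stand-in written for this illustration and does not use TRE or CRM114 syntax; the function name `approx_contains` is hypothetical.

```python
def approx_contains(pattern: str, text: str, max_errors: int) -> bool:
    """Return True if some substring of `text` is within `max_errors` edits of `pattern`."""
    # prev[i] = fewest edits needed to match pattern[:i] against a substring
    # of the text ending at the previous position.
    prev = list(range(len(pattern) + 1))
    if prev[-1] <= max_errors:
        return True  # the pattern is short enough to match the empty string
    for ch in text:
        cur = [0]  # a match may begin at any position in the text
        for i, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[i] + 1,             # extra character in the text
                cur[i - 1] + 1,          # pattern character missing from the text
                prev[i - 1] + (p != ch)  # exact match or substitution
            ))
        if cur[-1] <= max_errors:
            return True
        prev = cur
    return False


found = approx_contains("viagra", "buy v1agra now", max_errors=1)
```

With an error budget of one edit, the obfuscated spelling "v1agra" still matches "viagra", whereas an exact match (`max_errors=0`) would miss it.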
CRM114 has been applied to a number of other tasks, including detection of bots on Twitter and Yahoo,[7][8] and it serves as the first-level filter in the US Dept of Transportation's vehicle defect detection system.[9] It has also been used as a predictive method for classifying fault-prone software modules.[10]
References
- ↑ "The antispam man", March 19, 2007, Cara Garretson, Network World
- ↑ "Bill Yerazunis: Better Than Human", Paul Graham's website
- ↑ Spam Filtering II
- ↑ Spam Track Overview (2005) - TREC 2005
- ↑ Spam Track Overview (2006) - TREC 2006
- ↑ https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf
- ↑ "Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?", Zi Chu, Steven Gianvecchio, Haining Wang, Sushil Jajodia, IEEE Transactions on Dependable and Secure Computing, 2012, vol. 9, pages 811-824, DOI: 10.1109/TDSC.2012.75
- ↑ https://www.usenix.org/legacy/events/sec08/tech/full_papers/gianvecchio/gianvecchio_html/index.html
- ↑ https://www.oig.dot.gov/sites/default/files/NHTSA%20Safety-Related%20Vehicle%20Defects%20-%20Final%20Report%5E6-18-15.pdf
- ↑ https://www.st.cs.uni-saarland.de/edu/softmine2007/Projects/28300004.pdf