Wednesday, May 19, 2004

Latent Semantic Analysis

There is an interesting article about how the spam filtering in the Mac OS X Mail app works. Basically, rather than using filters or rules (or Bayesian filtering), it uses Latent Semantic Analysis: looking at the latent properties of the textual content to classify it against a set of known documents (the corpus). When you move an email from the inbox to the junk folder (or vice versa), it updates the corpus for spam and not-spam. In theory you can use it to classify emails in other ways as well, if you (the user) are willing to do the training. The clustering uses Singular Value Decomposition (closely related to the Karhunen-Loève transform) to create an N-dimensional classification space from the corpus. Then you can use standard statistical classification techniques (Bayes, etc.) to classify new data. I expect that the app ships with some baseline spam / not-spam data and the rest is customization (it seems to work pretty well out of the box). As with all classification problems, using a set of overlapping techniques is crucial, so I'm sure they also do some rule-based filtering / heuristics.
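To make the idea concrete, here is a minimal sketch of LSA-style classification in Python with NumPy. It is not how Mail actually does it: the toy corpus, the function names (fit_lsa, classify) and the choice of k=2 are all made up for illustration, and I have swapped in a simple nearest-centroid rule with cosine similarity where a real filter would more likely use a proper statistical classifier.

```python
import numpy as np

# Hypothetical toy corpus (an assumption for illustration): 1 = spam, 0 = not-spam.
TRAIN = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("win money now cheap offer", 1),
    ("meeting agenda for monday", 0),
    ("project meeting notes attached", 0),
    ("lunch on monday after the meeting", 0),
]

def build_vocab(texts):
    """Map each distinct word to a row index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_counts(text, vocab):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def fit_lsa(train, k=2):
    """Build the term-document matrix, keep the top-k left singular vectors,
    and store one centroid per class in the reduced space."""
    texts = [text for text, _ in train]
    labels = np.array([label for _, label in train])
    vocab = build_vocab(texts)
    # Rows are terms, columns are documents.
    A = np.column_stack([to_counts(text, vocab) for text in texts])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]              # basis of the k-dimensional "semantic" space
    docs_k = U_k.T @ A          # training documents projected into that space
    centroids = {
        label: docs_k[:, labels == label].mean(axis=1) for label in (0, 1)
    }
    return vocab, U_k, centroids

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(text, model):
    """Project a new message into the reduced space and pick the class
    whose centroid is closest by cosine similarity."""
    vocab, U_k, centroids = model
    z = U_k.T @ to_counts(text, vocab)
    return max(centroids, key=lambda label: cosine(z, centroids[label]))

model = fit_lsa(TRAIN, k=2)
print(classify("buy cheap pills", model))       # spam-like message
print(classify("monday meeting notes", model))  # ordinary message
```

On this toy corpus the first message lands on the spam side and the second on the not-spam side. Moving a message to the junk folder would correspond to adding it to TRAIN with label 1 and calling fit_lsa again, which is the "training" part.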
I think a lot of the reason these things work so well comes down to two points:
  • My spam is different from your spam. The solution should be tailored to me, and the active training is what makes that work.
  • To the extent that everyone's spam filter is different, life gets much more difficult for spammers, because they have to simultaneously work around an (effectively infinite) number of constraints. Do computational linguistics people "go bad" and work for spammers?
