Full Metal Blog: Latent Semantic Analysis

There is an interesting article about how spam filtering works in the Mac OS X Mail app. Rather than relying on hand-written filters or rules (or plain Bayesian filtering), it uses Latent Semantic Analysis: it looks at the latent properties of the textual content and classifies each message against a set of known documents (the corpus). When you move an email from your good mail to the junk folder (or vice versa), you change the corpus for spam and not-spam. In theory you could use the same approach to classify email in other ways as well, if you (the user) are willing to do the training.

The clustering uses Singular Value Decomposition (closely related to the Karhunen-Loeve transform) to create an N-dimensional classification space from the corpus. You can then apply standard statistical classification techniques (Bayes, etc.) to classify new data. I expect that the app ships with some baseline spam / not-spam data and the rest is customization (it seems to work pretty well out of the box). As with all classification problems, using a set of overlapping techniques is crucial, so I'm sure they also do some rule-based filtering and heuristics.
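
To make the idea concrete, here is a minimal sketch of LSA-style classification in Python with NumPy. It is not how Mail actually does it; the class name `LsaFilter`, the toy corpus, the use of raw word counts, and the choice of k=2 are all assumptions made for illustration. It also swaps the Bayesian step for something simpler: each new message is folded into the latent space and assigned to the label whose centroid is nearest by cosine similarity.

```python
import numpy as np


def tokenize(text):
    return text.lower().split()


class LsaFilter:
    """Toy LSA classifier (hypothetical): nearest label centroid in a k-dim latent space."""

    def __init__(self, corpus, k=2):
        self.k = k
        self.corpus = list(corpus)  # (text, label) pairs
        self._fit()

    def add(self, text, label):
        """Record a user correction (e.g. a message moved to Junk) and refit."""
        self.corpus.append((text, label))
        self._fit()

    def _fit(self):
        texts = [text for text, _ in self.corpus]
        labels = [label for _, label in self.corpus]
        vocab = sorted({w for text in texts for w in tokenize(text)})
        self.index = {w: i for i, w in enumerate(vocab)}

        # Term-document matrix of raw counts: rows are terms, columns are documents.
        A = np.zeros((len(vocab), len(texts)))
        for j, text in enumerate(texts):
            for w in tokenize(text):
                A[self.index[w], j] += 1

        # Truncated SVD: keep the k largest (non-zero) singular values.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = min(self.k, int((s > 1e-10).sum()))
        self.U, self.s = U[:, :k], s[:k]
        doc_coords = Vt[:k].T  # row j = document j in the latent space

        # One centroid per label, averaged over that label's documents.
        self.centroids = {}
        for label in set(labels):
            rows = [j for j, l in enumerate(labels) if l == label]
            self.centroids[label] = doc_coords[rows].mean(axis=0)

    def classify(self, text):
        q = np.zeros(len(self.index))
        for w in tokenize(text):
            if w in self.index:  # words never seen in the corpus are ignored
                q[self.index[w]] += 1
        if not q.any():
            return None  # no known words, so nothing to compare against

        q_hat = (self.U.T @ q) / self.s  # fold the message into the latent space

        def cosine(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b) / denom if denom > 0 else 0.0

        return max(self.centroids, key=lambda lab: cosine(q_hat, self.centroids[lab]))


# Made-up training data standing in for the shipped baseline corpus.
corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("win money now cheap", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
    ("lunch on monday with project team", "ham"),
]
mail_filter = LsaFilter(corpus)
print(mail_filter.classify("buy cheap pills"))         # spam
print(mail_filter.classify("monday project meeting"))  # ham
```

Calling `mail_filter.add(text, label)` plays the role of dragging a message into the other folder: the corpus changes and the latent space is rebuilt. Refitting from scratch on every correction is only reasonable for a toy like this, and a real filter would presumably weight terms (tf-idf or similar) instead of using raw counts.
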
I think there are two main reasons these things work so well:

My spam is different from your spam. The solution should be tailored to me, and the active training is what makes that work.
To the extent that everyone's spam filter is different, life gets much more difficult for spammers, because they have to simultaneously work around an (effectively infinite) number of constraints. Do computational linguistics people "go bad" and work for spammers?

Wednesday, May 19, 2004

