
Latent semantic analysis can be used to measure the similarity of two documents.

http://en.wikipedia.org/wiki/Latent_semantic_analysis

I found examples in Ruby and Python on this blog, which is unfortunately down at the moment: http://blog.josephwilk.net/ruby/latent-semantic-analysis-in-...
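While that blog is down, here is a minimal sketch of LSA-based similarity in Python, assuming scikit-learn is available (the toy corpus, the rank of 2, and all variable names are my own illustration, not taken from the blog post):

```python
# LSA sketch: TF-IDF term-document matrix -> truncated SVD -> cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat rested on a mat",
    "stock prices fell sharply today",
]

# Build a TF-IDF weighted term-document matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD projects the documents into a low-rank latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity in the latent space measures document similarity.
sims = cosine_similarity(lsa)
```

Here `sims[i, j]` is the latent-space similarity between documents i and j; the first two documents (both about a cat on a mat) come out far more similar to each other than to the third.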



Gregor Heinrich has a good paper on Latent Dirichlet Allocation, which I believe is an extension of Latent Semantic Analysis. It is a model which can be used to group documents based on semantic content. He gives the mathematical details and the "punchlines" for implementation. The model takes as input a collection of documents and outputs a topic label for each word in each document. The documents can be plotted in K-dimensional space, where K is the number of possible topics, by using the proportion of each topic in a document as its coordinates. Documents which are closer to each other have more similar topics. You could then use your favorite clustering algorithm or a simple distance threshold to decide which documents should link to each other.
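The topic-proportions-as-coordinates idea above can be sketched in Python using scikit-learn's LatentDirichletAllocation (a variational implementation, not the Gibbs sampler from the C++ project below; the corpus, K, and the distance threshold are illustrative assumptions):

```python
# LDA sketch: fit topic proportions, then link documents within a
# Euclidean distance threshold in K-dimensional topic space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "football match goal score team",
    "team wins the football cup",
    "interest rates and bank inflation",
    "bank raises interest rates again",
]

# LDA operates on raw word counts, not TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

K = 2  # number of topics (chosen arbitrarily for this toy corpus)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(counts)  # each row: topic proportions, summing to 1

# Treat each row of theta as a point in K-dimensional space; link any
# pair of documents whose distance falls under a chosen threshold.
dist = np.linalg.norm(theta[:, None, :] - theta[None, :, :], axis=-1)
links = [(i, j)
         for i in range(len(docs)) for j in range(i + 1, len(docs))
         if dist[i, j] < 0.5]
```

In practice you would tune K and the threshold (or swap the threshold for a proper clustering algorithm) on a real corpus; this only shows the shape of the computation.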

The paper is here [PDF]: http://www.arbylon.net/publications/text-est.pdf

C++ implementation: http://gibbslda.sourceforge.net/

(Note: I haven't used the C++ implementation.)


That looks even better. I'm going to try this out in my pet project.

Thank you.



