
Latent semantic analysis can be used to measure the similarity of two documents.

http://en.wikipedia.org/wiki/Latent_semantic_analysis

I found examples in Ruby and Python on this blog, which is unfortunately down at the moment: http://blog.josephwilk.net/ruby/latent-semantic-analysis-in-...
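While that blog is down, here is a minimal sketch of LSA-based similarity in Python, assuming scikit-learn is available (the toy corpus, the rank of 2, and all variable names are my own illustration, not taken from the blog post):

```python
# LSA sketch: TF-IDF term-document matrix -> truncated SVD -> cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat rested on a mat",
    "stock prices fell sharply today",
]

# Build a TF-IDF weighted term-document matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD projects the documents into a low-rank latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity in the latent space measures document similarity.
sims = cosine_similarity(lsa)
```

Here `sims[i, j]` is the latent-space similarity between documents i and j; the first two documents (both about a cat on a mat) come out far more similar to each other than to the third.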



Gregor Heinrich has a good paper on Latent Dirichlet Allocation, which I believe is an extension of Latent Semantic Analysis. It is a model which can be used to group documents based on semantic content. He gives the mathematical details and the "punchlines" for implementation. The model takes as input a collection of documents and outputs a topic label for each word in each document. The documents can be plotted in K-dimensional space, where K is the number of possible topics, by using the proportion of each topic in a document as its coordinates. Documents which are closer to each other have more similar topics. You could then use your favorite clustering algorithm or a simple distance threshold to decide which documents should link to each other.
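The topic-proportions-as-coordinates idea above can be sketched in Python using scikit-learn's LatentDirichletAllocation (a variational implementation, not the Gibbs sampler from the C++ project below; the corpus, K, and the distance threshold are illustrative assumptions):

```python
# LDA sketch: fit topic proportions, then link documents within a
# Euclidean distance threshold in K-dimensional topic space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "football match goal score team",
    "team wins the football cup",
    "interest rates and bank inflation",
    "bank raises interest rates again",
]

# LDA operates on raw word counts, not TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

K = 2  # number of topics (chosen arbitrarily for this toy corpus)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(counts)  # each row: topic proportions, summing to 1

# Treat each row of theta as a point in K-dimensional space; link any
# pair of documents whose distance falls under a chosen threshold.
dist = np.linalg.norm(theta[:, None, :] - theta[None, :, :], axis=-1)
links = [(i, j)
         for i in range(len(docs)) for j in range(i + 1, len(docs))
         if dist[i, j] < 0.5]
```

In practice you would tune K and the threshold (or swap the threshold for a proper clustering algorithm) on a real corpus; this only shows the shape of the computation.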

The paper is here [PDF]: http://www.arbylon.net/publications/text-est.pdf

C++ implementation: http://gibbslda.sourceforge.net/

(Note: I haven't used the C++ implementation.)


That looks even better. I'm going to try this out in my pet project.

Thank you.



