Hacker News

What would you be hashing, on a text file? What the text "looks like," e.g. particular density and repeated patterns of alphanumerics?

Or would you be hashing the content, the language, the words used, the meaning of the words?

Stylometry is the study of written style. It's used forensically to determine whether two texts were written by the same person, to identify anonymous/pseudonymous authors, and there's "adversarial stylometry," which is intentionally changing a writing style to make it look like someone else wrote something.

This deck seems to summarize a lot of these points and issues, but offers no "stylometric hash" solution: http://wiki.uni.lu/mine/docs/RDPresentation.ppt

Plagiarism detection is another field that does these sorts of analyses. How many metaphors do you have to change, and synonyms do you have to replace, before it's an original work, or would it always still hash nearby?

Textometry tries to make texts more analyzable. "Wordprints" or "stylometric fingerprints" seem to turn up more search results than "hash".



Hm. Well, I want to hash the pattern of bits in the text file, like a cryptographic hash does (suppose md5 or sha-1 for simplicity's sake).

I'll provide some examples of input and output. These examples happen to contain no linefeeds.

Suppose:

    Hello, World! --> 65a8e27d8879283831b664bd8b7f0ad4
Then I want something like:

    Hello, Worlds! --> 65a8e27d8879283831b664bd8b7f4ad4
...Rather than what md5 currently provides:

    Hello, Worlds! --> d0478649ad1c15f0e623846c3e26ebeb
Basically, I love hashes and I use them all the time, but for many purposes I am only accidentally using the cryptographic features and in fact I would sometimes find it nice if similar inputs had similar outputs.

Said differently: I'd like a hashing algorithm where the Levenshtein distance between any given two outputs correlates with the Levenshtein distance between the corresponding inputs.

I can imagine lots of uses for such a tool. ...But I can't imagine how it could be possible to make one: files (or strings) vary in length, for one thing, but a good hash does not! Of course, I couldn't imagine md5 before I saw it in action, either.
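As a quick illustration of the property being asked for (my own sketch, not part of the thread; the `levenshtein` helper is a standard dynamic-programming implementation), here's how md5 fails it: the inputs above are one edit apart, but their digests are nearly unrelated.

```ruby
require 'digest'

# Plain dynamic-programming Levenshtein distance (illustration only).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index do |ca, i|
    curr = [i + 1]
    b.each_char.with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j] + 1, prev[j + 1] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

h1 = Digest::MD5.hexdigest("Hello, World!")   # "65a8e27d8879283831b664bd8b7f0ad4"
h2 = Digest::MD5.hexdigest("Hello, Worlds!")  # "d0478649ad1c15f0e623846c3e26ebeb"

levenshtein("Hello, World!", "Hello, Worlds!")  # => 1
levenshtein(h1, h2)                             # large -- the avalanche effect at work
```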


Perhaps I'm missing the point, but I came up with a naive and sloppy partial solution: https://gist.github.com/peterc/737d9178f02118f8e315 ... there are some weaknesses to this solution but it does perform similarly to your examples.


Thanks, Peter. I don't speak Ruby. Uh, line 5 contains a small function definition, which is applied to each ...slice of the input string, is that right?

Do you (or anyone else) have time to describe this code in English or pseudocode?

I'll start.

    Let there be a function called hash which takes a
        string named str and an integer named
        length_of_hash (which defaults to 20).
    Slice the string (str) into length_of_hash equally-sized
        pieces?
    Take numeric value of each character (of each slice??)
        and do.. something to it, something involving
        modulo 256, unless it's zero.  Save all the
        results.  
    Express each numeric result as hexadecimal and append
        all those together.  Return it, probably.
Clearly there are bugs in the translation. :)


In short, convert the input string into an array of its character values, group those values into sub-arrays of length 'x', add together the aligned values in each sub-array and modulo each by 256, output the resulting values in hex pairs joined together in a big string.

Example, with a hash length of 4: "helloworld" => [104, 101, 108, 108, 111, 119, 111, 114, 108, 100] => [[104, 101, 108, 108], [111,119,111,114], [108, 100]] => [67, 64, 219, 222] => "4340dbde"

So if one character is different in the string, only one hex value in the output will vary too. One flaw, though: multiple changes spaced at intervals matching the hash "length" all land on the same hex value, so they can mask each other. And you run into the pigeonhole principle if you want a fixed-size hash to track the Levenshtein distance of arbitrarily long inputs. I suspect there are far smarter solutions :-)
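The scheme described above can be sketched in a few lines of Ruby (my reading of the description, not necessarily Peter's exact gist; the name `slice_sum_hash` is made up):

```ruby
# Locality-preserving toy hash: split the input bytes into fixed-size
# slices, sum the aligned byte values across slices mod 256, and print
# each sum as a hex pair. Not remotely cryptographic.
def slice_sum_hash(str, length_of_hash = 20)
  sums = Array.new(length_of_hash, 0)
  str.bytes.each_slice(length_of_hash) do |slice|
    slice.each_with_index { |byte, i| sums[i] = (sums[i] + byte) % 256 }
  end
  sums.map { |v| format('%02x', v) }.join
end

slice_sum_hash("helloworld", 4)  # => "4340dbde"
slice_sum_hash("hellowurld", 4)  # => "4340e1de" -- one char changed, one hex pair changed
```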


Yes, that's possible. You could use a topic clustering algorithm, for example.

Text -> topic_id.


In text analysis you can broadly focus either on style or on topic. Stylometry does the former and typically focuses on highly frequent function words; this technique is good at identifying authors. There are various techniques that focus on topics such as distributional semantic models and bag-of-word models; these techniques are good at clustering documents about the same topics together (regardless of author style). I think that in both cases a text can be reduced by the model to a single vector, which could be considered a hash, but in the end what matters is how well the models perform on a task such as identifying authors or finding documents on a given topic.
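As a toy illustration of the bag-of-words side (my own sketch, not a model from the thread): each text becomes a word-count vector, and cosine similarity clusters same-topic documents together regardless of exact wording.

```ruby
# Toy bag-of-words comparison. Real distributional semantic models use
# much richer features (tf-idf weighting, embeddings, etc.).
def bag_of_words(text)
  text.downcase.scan(/[a-z]+/).tally  # word => count
end

def cosine_similarity(a, b)
  dot = a.sum { |word, count| count * b.fetch(word, 0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |c| c * c }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end

d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat ate the mouse")
d3 = bag_of_words("stocks fell sharply today")

cosine_similarity(d1, d2)  # shared vocabulary: relatively high
cosine_similarity(d1, d3)  # no shared words: 0.0
```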



