Hacker News

What would you be hashing, on a text file? What the text "looks like," e.g. particular density and repeated patterns of alphanumerics?

Or would you be hashing the content, the language, the words used, the meaning of the words?

Stylometry is the study of written style. It's used forensically to determine whether two texts were written by the same person, to identify anonymous/pseudonymous authors, and there's "adversarial stylometry," which is intentionally changing a writing style to make it look like someone else wrote something.

This deck seems to summarize a lot of these points and issues, but offers no "stylometric hash" solution: http://wiki.uni.lu/mine/docs/RDPresentation.ppt

Plagiarism detection is another field that does these sorts of analyses. How many metaphors do you have to change, and synonyms do you have to replace, before it's an original work, or would it always still hash nearby?

Textometry tries to make texts more analyzable. "Wordprints" or "stylometric fingerprints" seem to turn up more search results than "hash".



Hm. Well, I want to hash the pattern of bits in the text file, like a cryptographic hash does (suppose md5 or sha-1 for simplicity's sake).

I'll provide some examples of input and output. These examples happen to contain no linefeeds.

Suppose:

    Hello, World! --> 65a8e27d8879283831b664bd8b7f0ad4
Then I want something like:

    Hello, Worlds! --> 65a8e27d8879283831b664bd8b7f4ad4
...Rather than what md5 currently provides:

    Hello, Worlds! --> d0478649ad1c15f0e623846c3e26ebeb
Basically, I love hashes and I use them all the time, but for many purposes I am only accidentally using the cryptographic features and in fact I would sometimes find it nice if similar inputs had similar outputs.

Said differently: I'd like a hashing algorithm where the Levenshtein distance between any given two outputs correlates with the Levenshtein distance between the corresponding inputs.

I can imagine lots of uses for such a tool. ...But I can't imagine how it could be possible to make one: files (or strings) vary in length, for one thing, but a good hash does not! Of course, I couldn't imagine md5 before I saw it in action, either.
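As a quick illustration of the property being asked for (my own sketch, not part of the thread; the `levenshtein` helper is a standard dynamic-programming implementation), here's how md5 fails it: the inputs above are one edit apart, but their digests are nearly unrelated.

```ruby
require 'digest'

# Plain dynamic-programming Levenshtein distance (illustration only).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index do |ca, i|
    curr = [i + 1]
    b.each_char.with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j] + 1, prev[j + 1] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

h1 = Digest::MD5.hexdigest("Hello, World!")   # "65a8e27d8879283831b664bd8b7f0ad4"
h2 = Digest::MD5.hexdigest("Hello, Worlds!")  # "d0478649ad1c15f0e623846c3e26ebeb"

levenshtein("Hello, World!", "Hello, Worlds!")  # => 1
levenshtein(h1, h2)                             # large -- the avalanche effect at work
```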


Perhaps I'm missing the point, but I came up with a naive and sloppy partial solution: https://gist.github.com/peterc/737d9178f02118f8e315 ... there are some weaknesses to this solution but it does perform similarly to your examples.


Thanks, Peter. I don't speak Ruby. Uh, line 5 contains a small function definition, which is applied to each ...slice of the input string, is that right?

Do you (or anyone else) have time to describe this code in English or pseudocode?

I'll start.

    Let there be a function called hash which takes a
        string named str and an integer named
        length_of_hash (which defaults to 20).
    Slice the string (str) into length_of_hash equally-sized
        pieces?
    Take numeric value of each character (of each slice??)
        and do.. something to it, something involving
        modulo 256, unless it's zero.  Save all the
        results.  
    Express each numeric result as hexadecimal and append
        all those together.  Return it, probably.
Clearly there are bugs in the translation. :)


In short, convert the input string into an array of its character values, group those values into sub-arrays of length 'x', add together the aligned values in each sub-array and modulo each by 256, output the resulting values in hex pairs joined together in a big string.

Example, with a hash length of 4: "helloworld" => [104, 101, 108, 108, 111, 119, 111, 114, 108, 100] => [[104, 101, 108, 108], [111,119,111,114], [108, 100]] => [67, 64, 219, 222] => "4340dbde"

So if one character is different in the string, only one hex value in the output will vary too. One flaw, though: multiple changes spaced at intervals matching the hash "length" all land on the same hex value, so they can mask each other. And you run into the pigeonhole principle if you want a fixed-size hash to track the Levenshtein distance of arbitrarily long inputs. I suspect there are far smarter solutions :-)
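The scheme described above can be sketched in a few lines of Ruby (my reading of the description, not necessarily Peter's exact gist; the name `slice_sum_hash` is made up):

```ruby
# Locality-preserving toy hash: split the input bytes into fixed-size
# slices, sum the aligned byte values across slices mod 256, and print
# each sum as a hex pair. Not remotely cryptographic.
def slice_sum_hash(str, length_of_hash = 20)
  sums = Array.new(length_of_hash, 0)
  str.bytes.each_slice(length_of_hash) do |slice|
    slice.each_with_index { |byte, i| sums[i] = (sums[i] + byte) % 256 }
  end
  sums.map { |v| format('%02x', v) }.join
end

slice_sum_hash("helloworld", 4)  # => "4340dbde"
slice_sum_hash("hellowurld", 4)  # => "4340e1de" -- one char changed, one hex pair changed
```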


Yes, that's possible. You could use a topic clustering algorithm, for example.

Text -> topic_id.


In text analysis you can broadly focus either on style or on topic. Stylometry does the former and typically focuses on highly frequent function words; this technique is good at identifying authors. There are various techniques that focus on topics such as distributional semantic models and bag-of-word models; these techniques are good at clustering documents about the same topics together (regardless of author style). I think that in both cases a text can be reduced by the model to a single vector, which could be considered a hash, but in the end what matters is how well the models perform on a task such as identifying authors or finding documents on a given topic.
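As a toy illustration of the bag-of-words side (my own sketch, not a model from the thread): each text becomes a word-count vector, and cosine similarity clusters same-topic documents together regardless of exact wording.

```ruby
# Toy bag-of-words comparison. Real distributional semantic models use
# much richer features (tf-idf weighting, embeddings, etc.).
def bag_of_words(text)
  text.downcase.scan(/[a-z]+/).tally  # word => count
end

def cosine_similarity(a, b)
  dot = a.sum { |word, count| count * b.fetch(word, 0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |c| c * c }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end

d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat ate the mouse")
d3 = bag_of_words("stocks fell sharply today")

cosine_similarity(d1, d2)  # shared vocabulary: relatively high
cosine_similarity(d1, d3)  # no shared words: 0.0
```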



