Hm... there really is no difference between an email address and SHA-1 (or SHA-2...

ibmthrowaway218 · on April 16, 2014

OK, email me at f234567a360f54c1d31a70936f336bc679ba4f9f (sha1sum of an email address with no trailing carriage return or line feed[1]) and I'll believe you.

1. e.g.

    $ echo -n billg@microsoft.com | sha1sum
    2517e4726f81e16f65eb95cf6446ad35352f566e  -
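
For anyone without sha1sum handy, a rough Python equivalent of the footnote, assuming the address is encoded as UTF-8. It only shows that a trailing newline changes the digest completely; `sha1_hex` is a made-up helper name.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Return the hex SHA-1 digest of text, encoded as UTF-8 (an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

address = "billg@microsoft.com"
without_newline = sha1_hex(address)       # like `echo -n ... | sha1sum`
with_newline = sha1_hex(address + "\n")   # like plain `echo ... | sha1sum`

print(without_newline)
print(with_newline)
print(without_newline == with_newline)    # the two digests differ
```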

tomp · on April 16, 2014

In general, the search space even for email addresses is probably too large for me to crack in a few days, but in the context above, where the author's email was already available online (on her website, in spam databases, in leaked credential datasets, ...), there is hardly any difference. In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.

nthj · on April 16, 2014

So have the customer hash her list with a salt, and you hash your list with the same salt, and everyone goes home for dinner.
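
A minimal sketch of that idea, assuming both sides agree on a salt out of band and normalize addresses the same way. The names `salted_digest` and `shared_addresses`, the salt value, and the example addresses are all made up, and SHA-256 is my pick here rather than anything the thread specifies.

```python
import hashlib

def salted_digest(email: str, salt: bytes) -> str:
    """Hash a normalized address together with the shared salt."""
    normalized = email.strip().lower()  # assumption: both sides normalize identically
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()

def shared_addresses(my_emails, their_digests, salt):
    """Return the addresses in my_emails whose salted digest appears in their_digests."""
    theirs = set(their_digests)
    return [e for e in my_emails if salted_digest(e, salt) in theirs]

salt = b"agreed-out-of-band"  # hypothetical shared salt

# The customer hashes her list and hands over only the digests.
customer_list = ["alice@example.com", "bob@example.org"]
customer_digests = [salted_digest(e, salt) for e in customer_list]

# The other party hashes its own list with the same salt and compares.
complaints = ["Bob@example.org", "carol@example.net"]
print(shared_addresses(complaints, customer_digests, salt))
```

This only hides the addresses from someone who does not know the salt. Anyone holding the salt can still test guesses against the digests.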

ibmthrowaway218 · on April 16, 2014

> In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.

I wonder what the odds are on a hash collision from another email address (including abusing + addressing) that genuinely belongs to another person (rather than just exists) and therefore the resulting hash does not uniquely identify a single person.

shabble · on April 16, 2014

Very, very small.

The 'birthday attack'[0] article covers this pretty well, but if we take the output size of a SHA-1 hash as 160 bits, and assume its outputs are equally distributed[1], then for a brute-force approach (equivalent to a non-maliciously generated accidental collision across all addresses ever) the expected number of hashes before the first collision is:

    sqrt(2**160 * PI/2) ~= 1.5 x10**24

The point where a collision reaches 50% probability is slightly lower, at roughly 1.18 * sqrt(2**160) ~= 1.4 x10**24. (if I understood/got the maths right)

[0] https://en.wikipedia.org/wiki/Birthday_attack

[1] This is the intent of all hash functions, and I don't think there are any fundamental attributes of email addresses that would cause systematic bias in the output.
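
The figure above as a quick Python check. As far as I can tell from the birthday attack article, sqrt(pi/2 * N) approximates the expected number of hashes before the first collision, and sqrt(2 * ln(1/(1-p)) * N) approximates the count needed to reach collision probability p. The function names are mine, and both formulas assume a uniformly distributed hash.

```python
import math

def expected_hashes_until_collision(bits: int) -> float:
    """Approximate expected number of random hashes before the first collision."""
    return math.sqrt(math.pi / 2 * 2 ** bits)

def hashes_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random hashes needed for collision probability p."""
    return math.sqrt(2 * math.log(1 / (1 - p)) * 2 ** bits)

# SHA-1 has a 160-bit output.
print(f"{expected_hashes_until_collision(160):.2e}")
print(f"{hashes_for_collision_probability(160, 0.5):.2e}")
```

Both numbers are on the order of 10**24, which is the point of the comment: an accidental collision between two real addresses is not a practical concern.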

tomp · on April 16, 2014

To put things into perspective:

Approximately, 10^3 = 1000 ~= 1024 = 2^10, 10^2 = 100 ~= 128 = 2^7.

Assume you have 1 billion (10^9) computers, each computer can do 1 billion hashing operations per second. That is 10^18 operations per second combined.

Rounding up, one day has 1 million seconds (10^6), and one year has 1000 (10^3) days. So, we have 10^27 ~= 2^90 operations per year.

100 million years is 10^8 ~= 2^27. So, you have 2^117 operations in 100 million years. Geologically, there was an Extinction Event [1] about every 100 million years (e.g. 66, 200 and 251 million years ago). So, having an (unintentional) hash collision in more than 128 bits (assuming a good hash function that has uniformly distributed hash) is less likely than an event happening within the next second that kills 50% of the Earth's species.

[1] http://en.wikipedia.org/wiki/Extinction_event
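
The same back-of-the-envelope numbers in Python, for anyone who wants to check the powers of two. The variable names are mine; the inputs are the deliberately rounded-up figures from the comment.

```python
import math

computers = 10**9               # 1 billion machines
hashes_per_second_each = 10**9  # 1 billion hashes per second per machine
seconds_per_day = 10**6         # rounded up from 86,400
days_per_year = 10**3           # rounded up from about 365
years = 10**8                   # 100 million years

per_year = computers * hashes_per_second_each * seconds_per_day * days_per_year
total = per_year * years

# log2 of each total gives the exponent to compare against 2^90 and 2^117.
print(f"per year: 2^{math.log2(per_year):.1f}")
print(f"in 100 million years: 2^{math.log2(total):.1f}")
```

The exact exponents come out just under 90 and 117, so the rounded figures are slightly generous to the attacker.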

DEinspanjer · on April 16, 2014

I'm not willing to answer the challenge, but I definitely believe it could be done. If someone were willing to purchase a large list of harvested e-mail addresses and sha1sum them all, it is very likely a commonly used address would show up in it. Now, if the address you used above is actually some single-purpose address similar to what I use for all my online accounts, that would not work, but I believe that very few people use dynamic partial addresses in that way. Not even the simple ones that Gmail provides.

ibmthrowaway218 · on April 16, 2014

If by "dynamic partial addresses" you mean "plus addressing" then, yes, it does use that.

mnarayan01 · on April 16, 2014

> The document then says that in 2011 he sent an email to “hundreds of atheists” with a link to his website and that I had reported him for violating GoDaddy’s policies against spam.

Give it to me in a list along with "hundreds" of red herrings (let's say < 10000), and sure, no problem.

DEinspanjer · on April 16, 2014

If you have the original list of addresses, and you are given a shasum, you can easily determine to which address the sum belongs. The proposals above do not indicate that GoDaddy should provide the sum to the e-mail sender though.
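
A rough sketch of that lookup, assuming unsalted SHA-1 over the UTF-8 bytes of each address. `find_address` and the example addresses are hypothetical.

```python
import hashlib

def sha1_hex(address: str) -> str:
    """Unsalted SHA-1 of the address, hex encoded."""
    return hashlib.sha1(address.encode("utf-8")).hexdigest()

def find_address(target_digest: str, candidates):
    """Return the first candidate whose SHA-1 matches target_digest, else None."""
    for address in candidates:
        if sha1_hex(address) == target_digest:
            return address
    return None

# The sender already holds the full recipient list (made-up addresses here).
recipients = ["alice@example.com", "bob@example.org", "carol@example.net"]
reported = sha1_hex("bob@example.org")  # the digest the sender is shown

print(find_address(reported, recipients))
```

With a list of a few hundred or a few thousand candidates this is one pass over the list, so the hash offers no protection against whoever sent the mail.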

dbpatterson · on April 16, 2014

Umm. Just leaving this here for anyone who doesn't know - the whole point of hashing things like emails or passwords is that reversing the hash is very difficult (read: near impossible). Indeed, once it becomes feasible to do, the hash is no longer considered useful (for this purpose).

So no, given a hash you can't get the email easily. If this were the case, there would be no point in hashing passwords - might as well store them as plain text.

aaron42net · on April 16, 2014

Password hashing algorithms make it a bit harder to guess passwords by doing thousands of iterations ("rounds") of hashing, in addition to adding a random salt to prevent creating a dictionary for common passwords.
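
For contrast, a minimal sketch of salted, iterated password hashing using PBKDF2 from Python's standard library. The iteration count and the helper names `hash_password` and `verify_password` are illustrative assumptions, not a recommendation from this thread.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative; real deployments tune this value

def hash_password(password: str, salt: bytes = None):
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)

salt, key = hash_password("correct horse")
print(verify_password("correct horse", salt, key))
print(verify_password("wrong guess", salt, key))
```

The per-password salt defeats precomputed dictionaries, and the iteration count multiplies the cost of every guess.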

However, e-mail addresses are generally short, human readable, and have a high probability of being at one of a handful of common domains. It would be easy to brute force your way through common e-mail address patterns at common domain names fairly quickly, if they were only protected by a single round of SHA1.

OpenSSL's benchmarking tool claims that one of my servers can do 30 million SHA1s per second given 64 bytes of input each. And we know from Bitcoin that GPUs and FPGAs can do many orders of magnitude faster than that.

How long would it take to get an arbitrary "firstname.lastname@gmail.com" given only its SHA1? The US Census reports that there are about 5,200 common first names and 89,000 common last names, for a total of around 460 million pairs or 15 seconds on my server to try all of them.

I suspect that with some heuristics to favor common e-mail address patterns, guessing at least half of a list of arbitrary e-mail addresses really wouldn't take that long.
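
A toy version of the firstname.lastname search, assuming a single round of unsalted SHA-1. The name lists are tiny stand-ins for the Census lists, and `crack_first_last` is a hypothetical helper; the last lines redo the time estimate from the figures above.

```python
import hashlib
from itertools import product

def crack_first_last(target_digest, first_names, last_names, domain="gmail.com"):
    """Try every first.last@domain combination; return the match or None."""
    for first, last in product(first_names, last_names):
        candidate = f"{first}.{last}@{domain}"
        if hashlib.sha1(candidate.encode("utf-8")).hexdigest() == target_digest:
            return candidate
    return None

# Tiny stand-in lists; the real ones would have about 5,200 and 89,000 entries.
first_names = ["james", "mary", "john"]
last_names = ["smith", "johnson", "williams"]

target = hashlib.sha1(b"mary.johnson@gmail.com").hexdigest()
print(crack_first_last(target, first_names, last_names))

# Time estimate from the comment: all pairs at 30 million SHA-1s per second.
pairs = 5_200 * 89_000
seconds = pairs / 30_000_000
print(pairs, round(seconds, 1))
```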

Xorlev · on April 16, 2014

It isn't so hard if you've seen the email address before. Hashing an email is somewhat of a joke in the industry.

tomp · on April 16, 2014

Of course, that's why everyone suggests hashing passwords with SHA1, no salt. /sarcasm

001spartan · on April 16, 2014

Just because an algorithm isn't suitable for use as a cryptographic hash function doesn't mean it's easy to crack. There's a world of difference.