OK, email me at f234567a360f54c1d31a70936f336bc679ba4f9f (sha1sum of an email address with no trailing carriage return or line feed[1]) and I'll believe you.
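For anyone who wants to reproduce a checksum like the one above, here is a minimal sketch using Python's standard `hashlib`. The address is a made-up placeholder, not the one behind the hash in the challenge; the point is only that a trailing newline changes the digest.

```python
import hashlib


def sha1_of_address(address: str) -> str:
    """Hex SHA-1 of the address bytes exactly as given (nothing is appended)."""
    return hashlib.sha1(address.encode("utf-8")).hexdigest()


# Hypothetical address, for illustration only.
digest = sha1_of_address("alice@example.com")

# The same address with a trailing line feed hashes to a different value.
digest_with_newline = sha1_of_address("alice@example.com\n")
```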
In general, the search space even for email addresses is probably too large for me to crack in a few days, but in the context above, where the author's email was already available online (on her website, in SPAM databases, in leaked credential datasets, ...), there is hardly any difference. In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.
> In any case, if you consider my email address "personally identifiable information", I consider its checksum such information as well.
I wonder what the odds are of a hash collision with another email address (including abusing + addressing) that genuinely belongs to another person (rather than merely existing), so that the resulting hash does not uniquely identify a single person.
The 'birthday attack'[0] article covers this pretty well, but if we take the output size of a SHA-1 hash as 160 bits, and assume its outputs are equally distributed[1], the number of addresses you would expect to hash before an accidental collision (equivalent to a brute-force, non-maliciously generated collision across all addresses ever) is:
sqrt(2**160 * PI/2) ~= 1.5 x 10**24
The point at which a collision has a 50% probability is slightly lower, sqrt(2**160 * 2*ln(2)) ~= 1.4 x 10**24.
(if I understood/got the maths right)
[0] https://en.wikipedia.org/wiki/Birthday_attack
[1] This is the intent of all hash functions, and I don't think there are any fundamental attributes of email addresses that would cause systematic bias in the output
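The estimate above can be checked numerically. This is a rough sketch assuming uniformly distributed 160-bit outputs; the function and variable names are my own. It computes the sqrt(PI/2 * N) figure, which is the expected number of hashes before a first collision, next to the 50% point and the usual exponential approximation for the collision probability.

```python
import math

N = 2 ** 160  # number of possible SHA-1 outputs

# Expected number of uniformly random hashes before the first collision.
expected_until_collision = math.sqrt(math.pi / 2 * N)

# Number of hashes at which a collision has roughly 50% probability.
half_probability_point = math.sqrt(2 * math.log(2) * N)


def collision_probability(n: float, space: int = N) -> float:
    """Approximate P(at least one collision) among n uniformly random hashes."""
    return -math.expm1(-n * (n - 1) / (2 * space))


print(f"{expected_until_collision:.2e}")  # about 1.5e24
print(f"{half_probability_point:.2e}")    # about 1.4e24
```

For comparison, `collision_probability(1e10)` (ten billion addresses) comes out on the order of 1e-29.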
Assume you have 1 billion (10^9) computers, each computer can do 1 billion hashing operations per second. That is 10^18 operations per second combined.
Rounding up generously, one day has 1 million seconds (10^6; the real figure is 86,400), and one year has 1000 (10^3) days. So, we have 10^27 ~= 2^90 operations per year.
100 million years is 10^8 ~= 2^27. So, you have 2^117 operations in 100 million years. Geologically, there was an Extinction Event [1] about every 100 million years (e.g. 66, 200 and 251 million years ago). So, having an (unintentional) hash collision in more than 128 bits (assuming a good hash function that has uniformly distributed hash) is less likely than an event happening within the next second that kills 50% of the Earth's species.
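The back-of-the-envelope numbers above are easy to re-run. This sketch just multiplies out the rounded figures from the comment (the variable names are mine) and converts the totals to powers of two.

```python
import math

computers = 10 ** 9
hashes_per_second_each = 10 ** 9
seconds_per_day = 10 ** 6  # rounded up from 86,400
days_per_year = 10 ** 3    # rounded up from about 365
years = 10 ** 8            # 100 million years

per_year = computers * hashes_per_second_each * seconds_per_day * days_per_year
total = per_year * years

print(math.log2(per_year))  # about 89.7, i.e. roughly 2^90
print(math.log2(total))     # about 116.3, within the 2^117 quoted above
```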
I'm not willing to answer the challenge, but I definitely believe it could be done. If someone were willing to purchase a large list of harvested e-mail addresses and sha1sum them all, it is very likely that a commonly used address would show up in it. Now, if the address you used above is actually some single-purpose address similar to what I use for all my online accounts, that would not work, but I believe that very few people use dynamic partial addresses in that way. Not even the simple ones that Gmail provides.
> The document then says that in 2011 he sent an email to “hundreds of atheists” with a link to his website and that I had reported him for violating GoDaddy’s policies against spam.
Give it to me in a list along with "hundreds" of red-herrings (let's say < 10000), and sure, no problem.
If you have the original list of addresses, and you are given a shasum, you can easily determine to which address the sum belongs. The proposals above do not indicate that GoDaddy should provide the sum to the e-mail sender though.
Umm. Just leaving this here for anyone who doesn't know - the whole point of hashing things like emails or passwords is that reversing the hash is very difficult (read: near impossible). Indeed, once it becomes feasible to do, the hash is no longer considered useful (for this purpose).
So no, given a hash you can't get the email easily. If this were the case, there would be no point in hashing passwords - might as well store them as plain text.
Password hashing algorithms make it a bit harder to guess passwords by doing thousands of iterations ("rounds") of hashing, in addition to adding a random salt to prevent creating a dictionary for common passwords.
However, e-mail addresses are generally short, human readable, and have a high probability of being at one of a handful of common domains. It would be easy to brute force your way through common e-mail address patterns at common domain names fairly quickly, if they were only protected by a single round of SHA1.
OpenSSL's benchmarking tool claims that one of my servers can do 30 million SHA1s per second given 64 bytes of input each. And we know from Bitcoin (which uses SHA-256, but the same hardware argument applies) that GPUs and FPGAs can be orders of magnitude faster than that.
How long would it take to get an arbitrary "firstname.lastname@gmail.com" given only its SHA1? The US Census reports that there are about 5,200 common first names and 89,000 common last names, for a total of around 460 million pairs, or about 15 seconds on my server to try all of them.
I suspect that with some heuristics to favor common e-mail address patterns, guessing at least half of a list of arbitrary e-mail addresses really wouldn't take that long.
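The firstname.lastname estimate above can be sketched in a few lines. The name lists here are tiny made-up stand-ins for the Census lists, `guess_address` is a hypothetical helper, and pure Python will be far slower than the 30 million SHA1s per second quoted for OpenSSL; the sketch only shows the shape of the search and re-checks the arithmetic.

```python
import hashlib
from itertools import product


def sha1_hex(address: str) -> str:
    return hashlib.sha1(address.encode("utf-8")).hexdigest()


def guess_address(target_digest, first_names, last_names, domain="gmail.com"):
    """Try every first.last@domain combination; return the match or None."""
    target = target_digest.lower()
    for first, last in product(first_names, last_names):
        candidate = f"{first}.{last}@{domain}"
        if sha1_hex(candidate) == target:
            return candidate
    return None


# Figures quoted in the comment above.
pairs = 5_200 * 89_000          # 462,800,000 candidate addresses
seconds = pairs / 30_000_000    # a little over 15 seconds at the quoted rate

# Tiny hypothetical name lists, for illustration only.
firsts = ["john", "jane", "alex"]
lasts = ["smith", "doe", "jones"]
found = guess_address(sha1_hex("jane.doe@gmail.com"), firsts, lasts)
```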