Very cool idea. For those who want to try at home, try this (mac and unix users ...

fexl · on Jan 26, 2010

I had to use a different stat command on my Linux system. This worked for me:

find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c

Note that I exclude directories (-type f) to avoid a bias from their usual size of 4096 bytes.

I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.

Each of the other leading digits accounted for between 3% and 9% of the files, with no obvious bias that I could see.

revicon · on Jan 26, 2010

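Limiting to just files is a good idea. If we let find print the sizes itself with -ls and employ our good friend awk to pull out the size column, it cuts the time down significantly, since stat no longer gets launched once per file. This one should work on both OS X and Linux.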

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c

fexl · on Jan 27, 2010

MUCH faster, thanks.

Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
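A minimal sketch of that percentage step, using awk rather than Perl so it stays in the same pipeline style. The function name leading_digit_percent is made up for illustration. It assumes one number per line on stdin and skips anything that does not start with 1-9 (so zero-byte files are ignored).

```shell
# Hypothetical helper (not from the thread): reads one number per line
# on stdin and prints each leading digit 1-9 with its percentage share.
leading_digit_percent() {
  awk '
    # Count only lines that start with a nonzero digit.
    /^[1-9]/ { count[substr($0, 1, 1)]++; total++ }
    END {
      if (total == 0) exit
      for (d = 1; d <= 9; d++)
        printf "%d %.1f%%\n", d, 100 * count[d] / total
    }'
}

# Usage with the file-size pipeline from this thread:
#   find . -type f -ls | awk '{print $7}' | leading_digit_percent
```

This replaces the cut | sort | uniq -c tail of the pipeline, since awk does the counting itself.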

Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":

"Data whose logarithm is uniformly distributed does [follow Benford's Law]."

I could use that to produce demo or test data.
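A rough sketch of such a generator, based only on the quoted rule: draw u uniformly from [0, 1) and output 10^(u * decades), so the base-10 logarithm is uniform. The function name benford_random and its three arguments (count, number of decades, seed) are assumptions for illustration, not an existing module. Values are truncated to integers, which leaves the leading digit unchanged because every value is at least 1.

```shell
# Hypothetical helper (not from the thread): print N integers whose
# base-10 logarithm is roughly uniform on [0, DECADES).
#   $1 = how many numbers (default 1000)
#   $2 = decades to span  (default 6, i.e. values from 1 to 999999)
#   $3 = seed for awk's srand, for repeatable output (default 1)
benford_random() {
  awk -v n="${1:-1000}" -v decades="${2:-6}" -v seed="${3:-1}" '
    BEGIN {
      srand(seed)
      for (i = 0; i < n; i++) {
        u = rand()                        # uniform on [0, 1)
        x = exp(u * decades * log(10))    # 10^(u * decades), always >= 1
        printf "%d\n", int(x)             # truncation keeps the leading digit
      }
    }'
}

# Usage: check the leading-digit counts of 10000 generated values.
#   benford_random 10000 6 42 | cut -c 1 | sort | uniq -c
```

With a whole number of decades, the share of values starting with 1 should come out near log10(2), about 30.1%, and the shares should fall off steadily toward 9. The exact counts depend on the awk implementation's random number generator.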