Tally All Kinds

When there are too many things, try lumping them into categories, then show the kinds ranked by size.

Consider the case of a web server error log. People on the internet will do crazy things. We'll tally messages by collapsing uninteresting variation and then counting everything that is left.

    cat /var/log/httpd/error_log |\
      perl -pe '
        s/\d+/0/g;
        s/(\/home\/httpd\/(html|cgi-bin)\/).*$/$1.../;
        s/(referer: ).*$/$1.../' |\
      perl -e '
        while(<>) {$t{$_}++};
        for(sort keys %t) {print "$t{$_}\t$_"}' |\
      sort -n

Here we show only the counts in double digits or more. The lines are still long, so we have shortened them further for this exhibit.

      10  attempt to invoke directory as script: ...
      17  Symbolic link not allowed: ...
      24  (0)Permission denied: /home/httpd/html/...
      41  Can't sustain current request rate ...
      46  script not found or unable to stat: ...
     122  Can't sustain current request rate ...
     603  script not found or unable to stat: ...
     648  unexpected input for wikiname at ...
     780  File does not exist: ..., referer: ...
    1640  File does not exist: /home/httpd/html/...
    3556  grep: writing output: Broken pipe
    4523  edits disabled due to abuse at ...
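The same tally can be sketched with sed, sort and uniq in place of perl. This is only a sketch under assumptions: tally_kinds is a name we made up, the sample log lines are invented, and sed -E is assumed to be available (it is in GNU and BSD sed).

```shell
# tally_kinds: hypothetical helper, not part of any tool.
# Reads log lines on stdin, collapses uninteresting variation,
# and prints "count<TAB>kind" in ascending order of count.
tally_kinds() {
  sed -E \
    -e 's/[0-9]+/0/g' \
    -e 's#(/home/httpd/(html|cgi-bin)/).*$#\1...#' \
    -e 's/(referer: ).*$/\1.../' |
  sort | uniq -c |
  awk '{c = $1; sub(/^ *[0-9]+ /, ""); print c "\t" $0}' |
  sort -n
}

# Invented sample lines standing in for a real error log.
printf '%s\n' \
  '[client 10.0.0.1] File does not exist: /home/httpd/html/a.gif' \
  '[client 10.0.0.2] File does not exist: /home/httpd/html/b/c.png' \
  '[client 10.0.0.3] script not found or unable to stat: /home/httpd/cgi-bin/x.pl' |
  tally_kinds
```

On these three lines the two missing files fall into one kind with a count of 2, and the script error stays alone with a count of 1. The awk step only tidies the padding that uniq -c puts in front of each count.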

Try experimenting with some natural language text. Here we scrape the Apple, Inc. page from Wikipedia, find paragraphs, and remove embedded tags and numeric references. The last perl prints one lowercase letter or digit per line, ready for the same tally perl we used above.

    curl -s https://en.wikipedia.org/wiki/Apple_Inc. |\
      perl -ne 'print "$1\n" if /<p>(.+).*?<\/p>/' |\
      perl -pe 's/<.*?>//g; s/\[\d+\]//g' |\
      perl -ne 'print lc($1),"\n" while (/(\w)/g)'

We add these rules one at a time while checking the output to be sure we're getting the expected result.
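One way to check a rule before adding the next is to run it on a line whose correct result we already know. A minimal sketch, assuming sed -E is available: strip_markup and the sample paragraph are invented for illustration, and because sed has no non-greedy .*? we substitute <[^>]*> for the tag pattern.

```shell
# strip_markup: hypothetical helper mirroring the two cleanup rules,
# first embedded tags, then numeric references like [12].
strip_markup() {
  sed -E \
    -e 's/<[^>]*>//g' \
    -e 's/\[[0-9]+\]//g'
}

# Rule check on an invented paragraph.
printf '%s\n' 'Apple<sup>[12]</sup> is <b>big</b>.' | strip_markup
```

If the line comes back as plain words with no tags or bracketed numbers, the rules are doing what we expect and the next stage can be added.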

In this one sample we see the following distribution.

      41  z
      58  q
      76  8
      78  7
      90  5
      92  3
      96  4
     103  6
     150  x
     166  9
     166  j
     335  2
     340  k
     375  1
     561  0
     613  v
     834  y
     855  b
     931  w
    1030  g
    1204  f
    1458  m
    1612  u
    2242  c
    2271  h
    2492  p
    2537  d
    2730  l
    3737  r
    3836  s
    4230  n
    4339  i
    4465  o
    4891  t
    5221  a
    7261  e

We recall that \w will match letters, digits and underscore. We see no underscore above so we know none occurred in the article text. A run without the preprocessing would show different counts and surely a few underscores.
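The underscore point is easy to see on a small scale. Here is a sketch that tallies word characters with tr and fold rather than perl; tally_chars is a made-up name, the input string is invented, and the set [:alnum:]_ under LC_ALL=C is our stand-in for \w on plain ASCII text.

```shell
# tally_chars: hypothetical helper, not part of any tool.
# Keeps only ASCII letters, digits and underscore, lowercases them,
# puts one character per line, and prints "count<TAB>char" ascending.
tally_chars() {
  LC_ALL=C tr -cd '[:alnum:]_' |
  LC_ALL=C tr '[:upper:]' '[:lower:]' |
  fold -w 1 |
  sort | uniq -c |
  awk '{print $1 "\t" $2}' |
  sort -n
}

# Invented input containing one underscore.
printf '%s\n' 'Hello, snake_case 42!' | tally_chars
```

The underscore appears in the output with a count of 1, alongside the letters and digits, and e comes last with a count of 3. Text that has been through the tag-stripping step is much less likely to contain one.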