I was reading through High Performance MySQL today, and in the discussion of HASH indexes and how often CRC32 values collide, the authors offhandedly mention that even the relatively small sample in /usr/share/dict/words contains a collision, and show that the colliding words are "codding" and "gnu".
I found this a striking coincidence: of all things, the one CRC32 collision in the 100,000 words packaged with the OS involves a word that is also the collective name of the foundational command-line utilities that make up said OS? Depending on who you ask, part of the very name of the OS? So I wrote a quick Perl script to investigate for myself:
perl -MString::CRC32 -lne 'push @{ $crcs{ crc32($_) } }, $_; END { for my $crc (keys %crcs) { my @words = @{ $crcs{$crc} }; printf "%u: %s (%d words)\n", $crc, join(",", @words), scalar @words if @words > 1; } }' /usr/share/dict/words
And sure enough, out came
1774765869: codding,gnu (2 words)
And nothing else. Quite the coincidence. Unless RMS is secretly purging the wordlist of any other rogue collisions...
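For anyone without String::CRC32 installed, here is a rough Python sketch of the same check. It assumes that zlib.crc32 computes the same standard CRC-32 as the Perl module, and the function names find_collisions and expected_colliding_pairs are made up for this example. It also does a back-of-the-envelope birthday estimate: with n words and 2^32 possible checksums there are n(n-1)/2 pairs, each colliding with probability 1/2^32, which for 100,000 words works out to about 1.16 expected colliding pairs. So finding exactly one collision is about what you would expect; the coincidence is only in which words collide.

```python
import zlib
from collections import defaultdict


def find_collisions(words, hash_fn=lambda w: zlib.crc32(w.encode("utf-8"))):
    """Group words by hash value and keep only the groups with 2+ words."""
    buckets = defaultdict(list)
    for word in words:
        buckets[hash_fn(word)].append(word)
    return {h: ws for h, ws in buckets.items() if len(ws) > 1}


def expected_colliding_pairs(n, bits=32):
    """Birthday estimate: number of pairs divided by number of hash values."""
    return n * (n - 1) / 2 / 2**bits


if __name__ == "__main__":
    # Path taken from the post; the word list differs between systems.
    with open("/usr/share/dict/words", encoding="utf-8", errors="replace") as fh:
        # dict.fromkeys drops duplicate lines while keeping file order.
        words = list(dict.fromkeys(line.rstrip("\n") for line in fh))

    for crc, group in find_collisions(words).items():
        print(f"{crc}: {','.join(group)} ({len(group)} words)")

    print(f"expected colliding pairs for {len(words)} words: "
          f"{expected_colliding_pairs(len(words)):.2f}")
```

The results depend on which word list your distribution ships, so you may see a different set of collisions or none at all.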
submitted by cincodenada to linux