Thursday, 14 February 2013

Out of 98,569 words in my /usr/share/dict/words, one pair has a CRC32 collision. The pair is "codding" and "gnu".


I was reading through High Performance MySQL today. While discussing hash indexes and how often CRC32 values collide, the authors mention offhand that even the relatively small sample in /usr/share/dict/words contains a collision, and show that the colliding words are "codding" and "gnu".


I found this a striking coincidence: of the roughly 100,000 words packaged with the OS, the single CRC32 collision happens to involve a word that is also the collective name of the foundational command-line utilities that make up said OS? Depending on who you ask, part of the very name of the OS? So I wrote a quick Perl script to investigate for myself:



cat /usr/share/dict/words | perl -e 'use String::CRC32; my %crcs; while(<>) { chomp($_); $crc = crc32($_); if(exists $crcs{$crc}) { push(@{ $crcs{$crc} },$_) } else { $crcs{$crc} = [$_]; } } foreach (keys %crcs) { @words = @{$crcs{$_}}; printf "%u: %s (%d word%s)\n", $_,join(",",@words),$#words+1,"s"x($#words<=>0); }' | grep "words)"
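
For readers without String::CRC32 installed, the same grouping idea can be sketched in Python. This is not the script used in this post, just a rough equivalent under a few assumptions: it uses zlib.crc32 from the standard library (the same standard CRC-32 checksum, as far as I know), reads the word list as raw bytes, and drops exact duplicate lines so that a repeated word is not reported as a collision. The function names are made up for illustration.

```python
import os
import zlib
from collections import defaultdict


def crc32_collisions(words):
    """Group byte strings by CRC32; return only checksums shared by 2+ distinct words."""
    groups = defaultdict(list)
    # dict.fromkeys drops exact duplicates while keeping the original order.
    for word in dict.fromkeys(words):
        groups[zlib.crc32(word)].append(word)
    return {crc: ws for crc, ws in groups.items() if len(ws) > 1}


def load_words(path):
    """Read one word per line as raw bytes, stripping the line ending."""
    with open(path, "rb") as fh:
        return [line.rstrip(b"\r\n") for line in fh]


if __name__ == "__main__":
    path = "/usr/share/dict/words"  # assumed location; differs between systems
    if os.path.exists(path):
        for crc, ws in crc32_collisions(load_words(path)).items():
            joined = ",".join(w.decode("utf-8", errors="replace") for w in ws)
            print(f"{crc}: {joined} ({len(ws)} words)")
```

Word lists differ between distributions and versions, so the number of lines and the set of collisions may not match on another machine.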

And sure enough, out came



1774765869: codding,gnu (2 words)



And nothing else. Quite the coincidence. Unless RMS is secretly purging the wordlist of any other rogue collisions...
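
As a rough sanity check on the count (not on which words collide), a birthday-style estimate suggests that one colliding pair is about what a 32-bit checksum should produce on a list this size. The sketch below assumes CRC32 spreads these words uniformly over all 2^32 values, which is only an approximation, and uses a Poisson approximation for the chance of at least one collision. The function names are hypothetical.

```python
import math


def expected_colliding_pairs(n_items, hash_bits=32):
    """Expected number of colliding pairs, assuming a uniformly random hash."""
    pairs = n_items * (n_items - 1) // 2  # number of distinct word pairs
    return pairs / 2 ** hash_bits


def prob_at_least_one(n_items, hash_bits=32):
    """Poisson approximation to the probability of at least one collision."""
    return 1.0 - math.exp(-expected_colliding_pairs(n_items, hash_bits))


n = 98_569  # word count quoted in the title of this post
print(f"expected colliding pairs: {expected_colliding_pairs(n):.2f}")
print(f"P(at least one collision): {prob_at_least_one(n):.2f}")
```

Under those assumptions the estimate comes out at a little over one expected pair, so the count itself is unremarkable. The surprise is only in which word turned up.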



submitted by cincodenada to linux
