Oh, you're back. Good. Now that you have a good understanding of what they do and how, here's my implementation. It uses each pair of bytes out of an MD5 digest, as an individual hash. (For the security geeks out there, yes I know MD5 has been declared too weak to rely on for security, but this is just a simple demo.)
#! /usr/bin/ruby # Dave Thomas Code Kata #5: Bloom Filter # See http://codekata.pragprog.com/2007/01/kata_five_bloom.html # for problem -- but not data, he's taken down the word list. # Solution by Dave Aronson require 'digest/md5' class Bloom_Filter FixnumSize = 32 def initialize # WARNING WILL ROBINSON, DANGER, DANGER! # There's probably some way to tell Ruby "give me this size chunk # of RAM" but I haven't stumbled across it yet. We need 2^16 bits # cuz we're going to take the hash 2 bytes at a time; divide by # FixnumSize cuz that's the storage unit for our bits. @bytes = Array.new((2 ** 16) / FixnumSize, 0) end def add_datum datum hash = Digest::MD5.hexdigest datum # could also convert hash to arbitrary-size integer en masse # and then take byte pairs as binary... this way seemed easier.... (hash.length / 2).times { |pair| set_bit hash[pair*2 .. pair*2+1].hex } end def check_bit bit_num (@bytes[bit_num / FixnumSize] & (1 << (bit_num % FixnumSize))) != 0 end def check_datum datum hash = Digest::MD5.hexdigest datum (hash.length / 2).times do |subhash| return false if ! check_bit hash[subhash * 2 .. subhash * 2 + 1].hex end true end def load_data data data.each { |datum| add_datum datum } end def set_bit bit_num @bytes[bit_num / FixnumSize] |= (1 << bit_num % FixnumSize) end end # TESTS require 'test/unit' class TC_MyTest < Test::Unit::TestCase def setup @words = [ 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten' ] @filter = Bloom_Filter.new @filter.load_data @words end def helper lang_words lang_words.each do |word| assert_equal @words.member?(word), @filter.check_datum(word), "error: checking #{word} failed!" end end def test_finding_same_words helper @words end def test_find_commonality_with_english helper ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',] end # given how the helper works, this is really meaningless, # included just for illustration def test_finding_nothing helper [] end def test_finding_empty helper [''] end def test_finding_space helper [' '] end def test_find_commonality_with_french helper ['zero', 'un', 'deux', 'trois', 'quatre', 'cinq', 'six', 'sept', 'huite', 'neuf', 'dix'] end def test_find_commonality_with_spanish helper ['zero', 'uno', 'dos', 'tres', 'cuatro', 'cinco', 'seis', 'siete', 'ocho', 'neuve', 'dies'] end def test_find_commonality_with_german helper ['zero', 'eins', 'zwei', 'drei', 'fier', 'funf', 'sex', 'sieben', 'ocht', 'neun', 'zehn'] end def test_find_commonality_with_japanese helper ['zero', 'ichi', 'ni', 'san', 'yon', 'go', 'ryoku', 'shichi', 'hachi', 'kyu', 'ju'] end def test_find_commonality_with_multiple helper ['onetwo', 'one two three'] end def test_find_commonality_with_nonsense helper ['ooee', 'ooahah', 'tingtang', 'wallawallabingbang', 'flibbertygibbet', 'asdfasdf', 'qwertyuiop', '@#$%^&*!'] end end
So now, as usual, it's the Peanut Gallery's turn. What would you do differently, and why? Would you use a completely different algorithm? A different way to allocate the memory, and set and check the bits? Would you test it differently? Or is it (ha!) perfect?