Saturday, July 24, 2010

Code Kata Five (Bloom Filter) Solution

   Kata number five is about Bloom Filters.  No, that isn't anything to do with flowers, though of course it could be used for that purpose.  Dave Thomas' description will quite suffice.  If you're not familiar with them, go read it now.  I'll wait.

   Oh, you're back.  Good.  Now that you have a good understanding of what they do and how, here's my implementation.  It uses each pair of bytes out of an MD5 digest, as an individual hash.  (For the security geeks out there, yes I know MD5 has been declared too weak to rely on for security, but this is just a simple demo.)

#! /usr/bin/ruby

# Dave Thomas Code Kata #5: Bloom Filter
# See http://codekata.pragprog.com/2007/01/kata_five_bloom.html
# for problem -- but not data, he's taken down the word list.
# Solution by Dave Aronson


require 'digest/md5'


class Bloom_Filter

  FixnumSize = 32

  def initialize
    # WARNING WILL ROBINSON, DANGER, DANGER!
    # There's probably some way to tell Ruby "give me this size chunk
    # of RAM" but I haven't stumbled across it yet.  We need 2^16 bits
    # cuz we're going to take the hash 2 bytes at a time; divide by
    # FixnumSize cuz that's the storage unit for our bits.
    @bytes = Array.new((2 ** 16) / FixnumSize, 0)
  end

  def add_datum datum
    hash = Digest::MD5.hexdigest datum
    # could also convert hash to arbitrary-size integer en masse
    # and then take byte pairs as binary... this way seemed easier....
    (hash.length / 2).times { |pair| set_bit hash[pair*2 .. pair*2+1].hex }
  end

  def check_bit bit_num
    (@bytes[bit_num / FixnumSize] & (1 << (bit_num % FixnumSize))) != 0
  end

  def check_datum datum
    hash = Digest::MD5.hexdigest datum
    (hash.length / 2).times do |subhash|
      return false if ! check_bit hash[subhash * 2 .. subhash * 2 + 1].hex
    end
    true
  end

  def load_data data
    data.each { |datum| add_datum datum }
  end

  def set_bit bit_num
    @bytes[bit_num / FixnumSize] |= (1 << bit_num % FixnumSize)
  end

end


# TESTS

require 'test/unit'

class TC_MyTest < Test::Unit::TestCase

  def setup
    @words = [ 'zero', 'one', 'two', 'three', 'four', 'five',
                       'six', 'seven', 'eight', 'nine', 'ten' ]
    @filter = Bloom_Filter.new
    @filter.load_data @words
  end

  def helper lang_words
    lang_words.each do |word|
      assert_equal @words.member?(word), @filter.check_datum(word),
                   "error: checking #{word} failed!"
    end
  end

  def test_finding_same_words
    helper @words
  end

  def test_find_commonality_with_english
    helper ['zero', 'one',    'two',    'three', 'four',   'five',
                    'six',    'seven',  'eight', 'nine',   'ten',]
  end

  # given how the helper works, this is really meaningless,
  # included just for illustration
  def test_finding_nothing
    helper []
  end

  def test_finding_empty
    helper ['']
  end

  def test_finding_space
    helper [' ']
  end

  def test_find_commonality_with_french
    helper ['zero', 'un',     'deux',   'trois', 'quatre', 'cinq',
                    'six',    'sept',   'huite', 'neuf',   'dix']
  end

  def test_find_commonality_with_spanish
    helper ['zero', 'uno',   'dos',    'tres',  'cuatro', 'cinco',
                    'seis',  'siete',  'ocho',  'neuve',  'dies']
  end

  def test_find_commonality_with_german
    helper ['zero', 'eins',   'zwei',   'drei',  'fier',   'funf',
                    'sex',    'sieben', 'ocht',  'neun',   'zehn']
  end

  def test_find_commonality_with_japanese
    helper ['zero', 'ichi',   'ni',     'san',   'yon',    'go',
                    'ryoku',  'shichi', 'hachi', 'kyu',    'ju']
  end

  def test_find_commonality_with_multiple
    helper ['onetwo', 'one two three']
  end

  def test_find_commonality_with_nonsense
    helper ['ooee', 'ooahah', 'tingtang', 'wallawallabingbang',
            'flibbertygibbet', 'asdfasdf', 'qwertyuiop', '@#$%^&*!']
  end

end

   So now, as usual, it's the Peanut Gallery's turn.  What would you do differently, and why?  Would you use a completely different algorithm?  A different way to allocate the memory, and set and check the bits?  Would you test it differently?  Or is it (ha!) perfect?