Saturday, July 24, 2010

Code Kata Five (Bloom Filter) Solution

   Kata number five is about Bloom Filters.  No, that isn't anything to do with flowers, though of course it could be used for that purpose.  Dave Thomas' description will quite suffice.  If you're not familiar with them, go read it now.  I'll wait.

   Oh, you're back.  Good.  Now that you have a good understanding of what they do and how, here's my implementation.  It uses each pair of bytes out of an MD5 digest, as an individual hash.  (For the security geeks out there, yes I know MD5 has been declared too weak to rely on for security, but this is just a simple demo.)

#! /usr/bin/ruby

# Dave Thomas Code Kata #5: Bloom Filter
# See http://codekata.pragprog.com/2007/01/kata_five_bloom.html
# for problem -- but not data, he's taken down the word list.
# Solution by Dave Aronson


require 'digest/md5'


class Bloom_Filter

  FixnumSize = 32

  def initialize
    # WARNING WILL ROBINSON, DANGER, DANGER!
    # There's probably some way to tell Ruby "give me this size chunk
    # of RAM" but I haven't stumbled across it yet.  We need 2^16 bits
    # cuz we're going to take the hash 2 bytes at a time; divide by
    # FixnumSize cuz that's the storage unit for our bits.
    @bytes = Array.new((2 ** 16) / FixnumSize, 0)
  end

  def add_datum datum
    hash = Digest::MD5.hexdigest datum
    # could also convert hash to arbitrary-size integer en masse
    # and then take byte pairs as binary... this way seemed easier....
    (hash.length / 2).times { |pair| set_bit hash[pair*2 .. pair*2+1].hex }
  end

  def check_bit bit_num
    (@bytes[bit_num / FixnumSize] & (1 << (bit_num % FixnumSize))) != 0
  end

  def check_datum datum
    hash = Digest::MD5.hexdigest datum
    (hash.length / 2).times do |subhash|
      return false if ! check_bit hash[subhash * 2 .. subhash * 2 + 1].hex
    end
    true
  end

  def load_data data
    data.each { |datum| add_datum datum }
  end

  def set_bit bit_num
    @bytes[bit_num / FixnumSize] |= (1 << bit_num % FixnumSize)
  end

end


# TESTS

require 'test/unit'

class TC_MyTest < Test::Unit::TestCase

  def setup
    @words = [ 'zero', 'one', 'two', 'three', 'four', 'five',
                       'six', 'seven', 'eight', 'nine', 'ten' ]
    @filter = Bloom_Filter.new
    @filter.load_data @words
  end

  def helper lang_words
    lang_words.each do |word|
      assert_equal @words.member?(word), @filter.check_datum(word),
                   "error: checking #{word} failed!"
    end
  end

  def test_finding_same_words
    helper @words
  end

  def test_find_commonality_with_english
    helper ['zero', 'one',    'two',    'three', 'four',   'five',
                    'six',    'seven',  'eight', 'nine',   'ten',]
  end

  # given how the helper works, this is really meaningless,
  # included just for illustration
  def test_finding_nothing
    helper []
  end

  def test_finding_empty
    helper ['']
  end

  def test_finding_space
    helper [' ']
  end

  def test_find_commonality_with_french
    helper ['zero', 'un',     'deux',   'trois', 'quatre', 'cinq',
                    'six',    'sept',   'huite', 'neuf',   'dix']
  end

  def test_find_commonality_with_spanish
    helper ['zero', 'uno',   'dos',    'tres',  'cuatro', 'cinco',
                    'seis',  'siete',  'ocho',  'neuve',  'dies']
  end

  def test_find_commonality_with_german
    helper ['zero', 'eins',   'zwei',   'drei',  'fier',   'funf',
                    'sex',    'sieben', 'ocht',  'neun',   'zehn']
  end

  def test_find_commonality_with_japanese
    helper ['zero', 'ichi',   'ni',     'san',   'yon',    'go',
                    'ryoku',  'shichi', 'hachi', 'kyu',    'ju']
  end

  def test_find_commonality_with_multiple
    helper ['onetwo', 'one two three']
  end

  def test_find_commonality_with_nonsense
    helper ['ooee', 'ooahah', 'tingtang', 'wallawallabingbang',
            'flibbertygibbet', 'asdfasdf', 'qwertyuiop', '@#$%^&*!']
  end

end

   So now, as usual, it's the Peanut Gallery's turn.  What would you do differently, and why?  Would you use a completely different algorithm?  A different way to allocate the memory, and set and check the bits?  Would you test it differently?  Or is it (ha!) perfect?

Saturday, July 17, 2010

Code Kata Four (Data Munging) Solution

   Kata number three isn't a coding exercise, so let's move on to number four: find the minimum temperature spread in a file of weather data, find the minumum win-loss spread in a file of soccer league results... and then refactor the two together, to remove duplication.

   This is an example of the Agile principle of DRY: Don't Repeat Yourself.  If you have the same objective being done in multiple places, there are multiple places where code needs to be changed if a bug is found or the objective is changed.  Likewise, if you suspect from the symptoms, that some particular algorithm is being done wrong, you know exactly where to change it if your code is dry, i.e., it adheres to the DRY principle.  If it isn't... then you're all wet.  (Badum-pum.)

   Actually, it's not so much an Agile principle, as a good basic software development principle that the Agile movement has chosen to emphasize.  Somewhat like how so much of so-called Object Oriented programming is really just good old encapsulation, which was a well-known and commonly-taught software development principle long before OOP got so popular.  Just about the only thing OOP added, other than convenient shorthand for such encapsulation, was polymorphism.  But don't get me started -- I'm a self-starter.  ;-)

   Anyway, here's my Ruby solution to #4, including of course tests, as Bryan "TATFT" Liles will be happy to see.  (What's TATFT stand for?  "Test All The Time".  At least on a G-rated blog it does....)

#! /usr/bin/ruby

# Dave Thomas Code Kata 4, Data Munging.
# See http://codekata.pragprog.com/2007/01/kata_four_data_.html
# for purpose, and data files.
# Solution by Dave Aronson.

# Note: Could do these with File.readlines(file).each, but then
# I'd have to keep track of whether we'd read the header yet.



# Part 1: Weather Data

def min_weather_spread
  file = File.open 'weather.dat', 'r'
  first = file.gets.split[0] until first == 'Dy'
  min_spread = 9999  # unrealistically high value for this application
  while not file.eof?
    parts = file.gets.split
    next if parts.length == 0
    day = parts[0].to_i
    # to_i returns 0 on non-numeric; no exceptions thrown;
    # maybe there should be a String#is_numeric? method....
    break if day == 0
    spread = (parts[1].to_i - parts[2].to_i).abs
    if spread < min_spread
      min_spread = spread
      min_day = day
    end
  end
  puts "Min spread is #{min_spread}, on day #{min_day}"
  min_spread
end


# Part 2: Soccer League Table
# Soccer, football, whatever.  Most of the people reading this will be
# my fellow 'Murrikens, so I'll call it what we call it.

def min_soccer_spread
  file = File.open 'football.dat', 'r'
  first = file.gets.split[0] until first == 'Team'
  min_spread = 9999  # unrealistically high value for this application
  while not file.eof?
    parts = file.gets.split
    # data format is more consistent than weather one, so we can do this;
    # just be aware that in other applications, # of line parts may stay same.
    next if parts.length != 10
    spread = (parts[6].to_i - parts[8].to_i).abs
    if spread < min_spread
      min_spread = spread
      min_team = parts[1]
    end
  end
  puts "Min spread is #{min_spread}, for #{min_team}"
  min_spread
end



# Part 3: DRY Fusion.  Dry up the weather, and dry out the drunken football
# hooligans, er, I mean, soccer fans.  ;-)
#
# Note that the 9999 init, and skipping any where the first column evaluates to
# zero, works fine in THESE specific two cases, but may be problematic in other
# applications.  So why didn't I go ahead and make it fully generalized, such
# as by using the maximum supported number, or a parameter?  Another agile
# acronym: YAGNI.  If you don't grok it, Google it.

def min_spread filename, first_header, name_col, data_col_1, data_col_2
  file = File.open filename, 'r'
  first = file.gets.split[0] until first == first_header
  min_spread = 9999
  while not file.eof?
    parts = file.gets.split
    next if parts[0].to_i == 0
    spread = (parts[data_col_1].to_i - parts[data_col_2].to_i).abs
    if spread < min_spread
      min_spread = spread
      min_item = parts[name_col]
    end
  end
  # Could also have passed in a descriptor here, like 'day' or 'team'....
  puts "Min spread is #{min_spread}, for #{min_item}"
  min_spread
end



# TESTS


require 'test/unit'

class TC_MyTest < Test::Unit::TestCase

  def test_weather_case
    assert_equal min_spread('weather.dat', 'Dy', 0, 1, 2), min_weather_spread,
                 'Error: DRY equivalent does not work same as weather case'
  end

  def test_soccer_case
    assert_equal min_spread('football.dat', 'Team', 1, 6, 8), min_soccer_spread,
                 'Error: DRY equivalent does not work same as soccer case'

  end

end

   So why am I not including the data files?  I want you to go read Dave Thomas' CodeKata blog (and his regular blog as well) too.  Yes, the first hasn't been updated in years and the second in months, but if you're the kind of person who'd read this blog, you'd want to see that one too, and the comments.

Thursday, July 8, 2010

Code Katas

Some of you may be familiar with the term kata.  It comes from martial arts, mainly karate, and means a practice exercise, especially one done over and over to train your "muscle memory".

After enough years programming, you build up similar semi-automatic approaches, like how to organize a program, or thinking imperatively or functionally, or iteratively or recursively.  Code katas help you practice such things, with deliberate emphasis on a particular aspect.  (Actually, IMHO this makes them more like Toastmasters speech projects, but that's another story.)

There are many sources of lists of code katas out there.  Recently, to sharpen my Ruby-fu, I began working through the ones that Dave Thomas, of Pragmatic Programmer fame, put up at http://codekata.pragprog.com/.  The first kata does not involve code.  So, here is my solution to the second one:

#! /usr/bin/ruby

# Dave Thomas Code Kata #2, Karate Chop
# See http://codekata.pragprog.com/2007/01/kata_two_karate.html
# Solution by Dave Aronson

# TODO MAYBE: see if I can do the helper funcs as lambdas or some such....


# version 1: iterative

def chop1 tgt, arr
  # TODO: check arg types
  # TODO: check for empty array
  hi = arr.length - 1
  lo = 0
  while hi >= lo do
     mid = (hi + lo) / 2
     test = arr[mid]
     if test < tgt: lo = mid + 1
     elsif test > tgt: hi = mid - 1
     else return mid
     end
  end
  -1
end



# version 2: recursive and not functional

def chop2 tgt, arr
  chop2_helper tgt, arr, 0, arr.length - 1
end

def chop2_helper tgt, arr, min, max
  slot = (min + max) / 2
  test = arr[slot]
  if test == tgt: slot
  elsif min == max: -1
  elsif test < tgt: chop2_helper tgt, arr, slot + 1, max
  else chop2_helper tgt, arr, min, slot - 1
  end
end



# version 3: recursive and functional, passing array slice.

def chop3 tgt, arr
  # use another func to stop having to calc middle slot over and over
  chop3_pass_mid tgt, arr, arr.length / 2
end

def chop3_pass_mid tgt, arr, mid
  # use another func to stop having to retrieve middle value over and over
  chop3_pass_test tgt, arr, mid, arr[mid]
end

def chop3_pass_test tgt, arr, mid, test
  if test == tgt: mid
  elsif mid == 0: -1
  elsif test < tgt
    chop3_fix_offset mid + 1, chop3(tgt, arr[mid + 1 .. arr.length])
  else chop3 tgt, arr[0 .. mid - 1]
  end
end

# kind of a kluge; TODO MAYBE: do something cleaner with this problem!
def chop3_fix_offset offset, result
  result == -1 ? -1 : result + offset
end



# version 4: recursive and functional, passing whole array.

def chop4 tgt, arr
  chop4_helper tgt, arr, 0, arr.length - 1
end

def chop4_helper tgt, arr, min, max
  # use another func to not have to calc midpoint over and over
  chop4_pass_midpoint tgt, arr, min, max, (min + max) / 2
end

def chop4_pass_midpoint tgt, arr, min, max, midpt
  # use another func to not have to retrieve middle value over and over
  chop4_pass_midval tgt, arr, min, max, midpt, arr[midpt]
end

def chop4_pass_midval tgt, arr, min, max, midpt, midval
  if midval == tgt: midpt
  elsif min == max: -1
  elsif midval < tgt: chop4_helper tgt, arr, midpt + 1, max
  else chop4_helper tgt, arr, min, midpt - 1
  end
end



# TESTS

require 'test/unit'

class TC_MyTest < Test::Unit::TestCase

  def setup
    @arr = [ 0, 1, 2, 3, 5, 8, 13, 21, 34 ]  # fibonacci, btw
  end

  def test_chops
    [:chop1, :chop2, :chop3, :chop4].each do |func|

      # test that each thing in there, is found
      @arr.length.times do |slot|
        tgt = @arr[slot]
        assert_equal slot, send(func, tgt, @arr), "testing #{func.to_s}(#{tgt})"
      end

      # test that a bunch of things that should NOT be in there, are not found
      [@arr[0]-1, @arr[0]-0.1, 4, 33, @arr.last+0.1, @arr.last+1].each do |tgt|
        assert_equal -1, send(func, tgt, @arr), "testing #{func.to_s}(#{tgt})"
      end

    end

  end

end

Note, I'm not saying that this is the best solution, let alone the One True Solution.  Your mileage is practically guaranteed to vary, especially on matters of style.

In fact, I'd like to hear how you would have done it differently -- feedback is how one improves.  What would you have done differently, and why?  Would it be faster, clearer, more canonically Rubyish, or what?  Inquiring minds want to know!

Welcome to Codosaurus!

Greetings, dear readers!  This is my brand-spanking-new coding blog, as opposed to my leadership blog over at http://www.dare2xl.com/.  Dare2XL is mainly about daring to stand up to The Man for what's right, such as choice of language, "wasting" time on developer-level testing, and so on, including many situations having nothing whatsoever to do with software development.  This blog will be about how to choose and use your language, write and run unit and integration tests, and other aspects of software development -- primarily those dealing directly with code, versus, say, requirements definition, or writing documentation.  You know, the fun stuff!  :-)

Why "codosaurus"?  Long story short, let's just say I'm a bit older than most of my current colleagues -- enough so, that I grew up on plain old C, not Java.  But I'm not a dinosaur, as this old mammal continues to learn new tricks.  I learned C++, Java, Python, Javascript (including AJAX), Perl, Ruby (including Rails), and a lot more, in my spare time.  This blog is my way of sharing the overall software engineering experience I've gained over the years -- though often instantiated via specific languages, especially the ones the li'l whippersnappers have probably been exposed to -- or might want to be exposed to.

Now it's your turn.  What advice would you seasoned programmers/engineers/developers/whatever, give to today's up-and-comers?  What questions would you ask of me?  What is the airspeed velocity of an unladen swallow?