Sunday, August 15, 2010

Code Kata Eight: Conflicting Objectives, Part 2 (Make It Fast)

   As you may recall from last time, kata number eight has three parts: write it to be readable, write it to be fast, and write it to be extensible.  I covered readability last time, so now let's go for speed.  (Cue The Distance, by Cake.)

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 2: make it fast

# note that it's no longer in functions, to save the call overhead

# unroll the inner loop again, for speed
one_letter_words = {}
two_letter_words = {}
three_letter_words = {}
four_letter_words = {}
five_letter_words = {}

# this one is still fine as an array, since we just chug through it.
# no need for the extra functionality of a hash, nor the overhead.
six_letter_words = []

start_time = Time.now

File.new(ARGV[0], 'r').each_line do |line|
  word = line.chomp
  case word.length
    # TODO: check if there's a faster way than this... maybe a trie?
    # we don't need the value, just whether the key is there.
    when 1: one_letter_words  [word] = 0
    when 2: two_letter_words  [word] = 0
    when 3: three_letter_words[word] = 0
    when 4: four_letter_words [word] = 0
    when 5: five_letter_words [word] = 0
    when 6: six_letter_words << word
  end
end

# accumulate results, rather than reporting them as we find them;
# also use this to track # of hits
results = []

six_letter_words.each do |word|

  part_1 = word[0 .. 0]
  part_2 = word[1 .. 5]
  # do longer part first to make dropout most likely
  if five_letter_words.has_key? part_2 and one_letter_words.has_key? part_1
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 1]
  part_2 = word[2 .. 5]
  # do longer part first to make dropout most likely
  if four_letter_words.has_key? part_2 and two_letter_words.has_key? part_1
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 2]
  part_2 = word[3 .. 5]
  if three_letter_words.has_key? part_1 and three_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 3]
  part_2 = word[4 .. 5]
  if four_letter_words.has_key? part_1 and two_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 4]
  part_2 = word[5 .. 5]
  if five_letter_words.has_key? part_1 and one_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

end

puts results
puts "#{results.length} hits"

puts "Running time: #{Time.now - start_time} seconds"

   This version is about 1/3 faster than the "readable to Ruby programmers" version.  (And of course hugely faster than the "readable to non-Rubyists" version, with arrays!)  I also surmised that perhaps the "case word.length" part might be inefficient, and used again an array of hashes... but apparently the indexing was sufficiently slow that it was in fact worse than before!

   Now it's your turn again.  Despite the fact that I'm doing these in Ruby, I don't claim to be a Ruby guru, so there may have been some cute Ruby tricks I missed, that could speed this up.  Maybe even some algorithm tweaks.  What would you do differently?