Attack of the Codosaurus!: August 2010

Sunday, August 29, 2010

Kata 8: Conflicting Objectives, Part 3B

Okay, now I've had some time to think about what it means for the code to be "extensible". Offhand I'm thinking it's as in the "open" part of the Open-Closed Principle, which states that a class should be open to extension but closed to modification. That is, you should be able to extend it o do new and different things, but not so much monkey around with the existing functionality and change how that works -- at least in its interactions with the rest of the world. (How it works inside is another story. That's why encapsulation and Information Hiding are such Good Things.)

That got me thinking, just wrap the functionality in a class, arranged such that one could easily override key methods. So, forthwith my class, and a couple of subclasses:

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 3: make it extensible
# part B: take "extensible" as meaning "open to doing additional things"
# or "open to doing the same things in different ways".  In an OO system,
# this generally means "more subclassable".  Remember the open part of the
# open/closed principle: a class should be *open to extension*.


class Word_Manager

  # This could be overridden to have, for instance, a variable
  # number of subparts; here we assume two.  It would also need to
  # override find_combos, and decide @max_whole_len differently.
  def initialize min_len, max_len
    @max_part_len = max_len
    @min_part_len = min_len
    @max_whole_len = @min_part_len + @max_part_len
    @long_words = []
    @short_word_lists = []
    (@min_part_len .. @max_part_len).each do |len|
      @short_word_lists[len] = {}
    end
  end

  def append word
    len = word.length
    if len == @max_whole_len then @long_words << word
    elsif @min_part_len <= len and len <= @max_part_len
      @short_word_lists[len][word] = nil
    end
  end

  def find_combos
    hits = []
    # hmmm, i'm thinking there may be some way to iterate over
    # the long words generically, passing in a block....
    @long_words.each do |word|
      (@min_part_len .. @max_part_len).each do |len|
        part_1 = word[0 .. len - 1]
        part_2 = word[len .. @max_whole_len - 1]
        if has_word? part_1 and has_word? part_2
          hits << "#{part_1} + #{part_2} => #{word}"
        end
      end
    end
    hits
  end

  def has_word? word
    len = word.length
    # note: generally for parts, not wholes
    if len == @max_whole_len then @long_words.include? word
    elsif @min_part_len <= len and len <= @max_part_len
      @short_word_lists[len].has_key? word
    else false
    end
  end

  def read_file file_name
    # a subclass could insert code to unify case, or translate, or . . . .
    File.new(file_name, 'r').each_line { |line| append line.chomp }
  end

end


# Now let's make a couple subclasses.


# one that's not case-sensitive
class Caseless_Word_Manager < Word_Manager
  def append word
    super word.downcase
  end
end


# one that will consider a part to be OK both forward and backward;
# only makes sense if also caseless
class Reversible_Word_Manager < Word_Manager
  def append word
    down = word.downcase
    super down
    # could instead override has_word to check for reversed version;
    # which one makes sense depends on frequency of insert vs. check
    super down.reverse if down.length < @max_whole_len
  end
end


# Now let's see what each one does with our word list!


def helper obj, description
  obj.read_file 'conflict_words.txt'
  hits = obj.find_combos
  puts "#{description} hits:\n#{hits.join "\n"}\n#{hits.length} hits"
end


helper Word_Manager.new(1, 5), 'Normal'
puts
helper Caseless_Word_Manager.new(1, 5), 'Caseless'
puts
helper Reversible_Word_Manager.new(1, 5), 'Reversible'

And while I'm at it... here is the list of words I've been using, so you can test the code yourself without waiting while it chugs through your system wordlist:

A
AT
ATTACK
Reb
ate
atwirl
be
bed
bedpan
bemoan
benighted
bie
big
bigger
boo
booger
but
conic
credible
days
debbie
ed
er
ers
ger
German
i
iconic
in
incredible
inside
insight
intake
irides
istoke
land
lander
man
moan
morrow
night
on
pan
rebate
rides
rub
rubber
side
sight
stoke
tack
tackon
take
takeon
to
todays
tomorrow
tubers
twirl

So now, as always, it's your turn. What does "extensible" mean to you? How would you have made it so?

Sunday, August 22, 2010

Kata 8: Conflicting Objectives, Part 3A

Part 3, much like Part 1, is open to much interpretation. I figure that the sense of "extensible" that Dave Thomas probably meant to drive at, was "easy to add features to", most likely by making subclassing easy, and accepting code blocks, and things like that. I've been giving that some thought, and it makes my head hurt. :-)

So for now, I'm going to "creatively misinterpret" it as "flexible" and "easy to put to many different uses by accepting a number of different ways it can do the job". Here is my ~~extensible~~ flexible version:

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 3: make it extensible
# part A: creatively misinterpret "extensible" as "flexible" :-)


# IDEA to make it more extensible, tho MUCH slower:
# ability to split into larger # of parts?
# could mitigate by also allowing minimum part length.


def make_word_lists file_name, case_sensitive, max_part_len, min_part_len
  long_words = []
  short_word_lists = []

  (min_part_len .. max_part_len).each { |len| short_word_lists[len] = {} }

  max_len = min_part_len + max_part_len

  File.new(file_name, 'r').each_line do |line|
    word = line.chomp
    len = word.length
    word.downcase! if not case_sensitive
    if len == max_len then long_words << word
    elsif min_part_len <= len and len <= max_part_len
      short_word_lists[len][word] = nil
    end
  end
  return [long_words, short_word_lists]
end


def find_combos long_words, short_word_lists, max_part_len, min_part_len
  hits = []
  max_len = min_part_len + max_part_len
  long_words.each do |word|
    (min_part_len .. max_part_len).each do |len|
      part_1 = word[0 .. len - 1]
      part_2 = word[len .. max_len - 1]
      if short_word_lists[len].has_key? part_1 and
         short_word_lists[max_len-len].has_key? part_2
        hits << "#{part_1} + #{part_2} => #{word}"
      end
    end
  end
  hits
end


file_name = 'conflict_words.txt'
max_len = 6
max_part_len = 5
min_part_len = 1
case_sensitive = false

# I'll skip all the overhead of getting and parsing the above as args....

start_time = Time.now
long_words, short_word_lists = make_word_lists file_name, case_sensitive,
                                               max_part_len, min_part_len

hits = find_combos long_words, short_word_lists, max_part_len, min_part_len
puts "hits:\n#{hits.join "\n"}\n#{hits.length} hits"
puts "Running time: #{Time.now - start_time} seconds"

Okay, now I could use a bit of help from the assorted Rubyists out there. Where do you see places I could install hooks on which people can hang code blocks? What do you think would ease subclassing to change the behavior? Or do you think that's the wrong approach entirely, that I should separate the workings out, as in the Strategy or Visitor patterns? I've got too much more important stuff to do today, to give this the thought it deserves right now.

Sunday, August 15, 2010

Code Kata Eight: Conflicting Objectives, Part 2 (Make It Fast)

As you may recall from last time, kata number eight has three parts: write it to be readable, write it to be fast, and write it to be extensible. I covered readability last time, so now let's go for speed. (Cue The Distance, by Cake.)

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 2: make it fast

# note that it's no longer in functions, to save the call overhead

# unroll the inner loop again, for speed
one_letter_words = {}
two_letter_words = {}
three_letter_words = {}
four_letter_words = {}
five_letter_words = {}

# this one is still fine as an array, since we just chug through it.
# no need for the extra functionality of a hash, nor the overhead.
six_letter_words = []

start_time = Time.now

File.new(ARGV[0], 'r').each_line do |line|
  word = line.chomp
  case word.length
    # TODO: check if there's a faster way than this... maybe a trie?
    # we don't need the value, just whether the key is there.
    when 1: one_letter_words  [word] = 0
    when 2: two_letter_words  [word] = 0
    when 3: three_letter_words[word] = 0
    when 4: four_letter_words [word] = 0
    when 5: five_letter_words [word] = 0
    when 6: six_letter_words << word
  end
end

# accumulate results, rather than reporting them as we find them;
# also use this to track # of hits
results = []

six_letter_words.each do |word|

  part_1 = word[0 .. 0]
  part_2 = word[1 .. 5]
  # do longer part first to make dropout most likely
  if five_letter_words.has_key? part_2 and one_letter_words.has_key? part_1
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 1]
  part_2 = word[2 .. 5]
  # do longer part first to make dropout most likely
  if four_letter_words.has_key? part_2 and two_letter_words.has_key? part_1
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 2]
  part_2 = word[3 .. 5]
  if three_letter_words.has_key? part_1 and three_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 3]
  part_2 = word[4 .. 5]
  if four_letter_words.has_key? part_1 and two_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

  part_1 = word[0 .. 4]
  part_2 = word[5 .. 5]
  if five_letter_words.has_key? part_1 and one_letter_words.has_key? part_2
    results << "#{part_1} + #{part_2} => #{word}\n"
  end

end

puts results
puts "#{results.length} hits"

puts "Running time: #{Time.now - start_time} seconds"

This version is about 1/3 faster than the "readable to Ruby programmers" version. (And of course hugely faster than the "readable to non-Rubyists" version, with arrays!) I also surmised that perhaps the "case word.length" part might be inefficient, and used again an array of hashes... but apparently the indexing was sufficiently slow that it was in fact worse than before!

Now it's your turn again. Despite the fact that I'm doing these in Ruby, I don't claim to be a Ruby guru, so there may have been some cute Ruby tricks I missed, that could speed this up. Maybe even some algorithm tweaks. What would you do differently?

Saturday, August 7, 2010

Code Kata Eight: Conflicting Objectives, Part 1

Kata number seven is a non-coding kata, so let's move on to kata number eight. This one is about conflicting objectives. To quote the original:

Write the program three times.

The first time, make program as readable as you can make it.

The second time, optimize the program to run fast fast as you can make it.

The third time, write as extendible a program as you can.

Even that, though, leaves a lot of room for interpretation. For instance, "readable" to whom? I decided to divide this into two audiences. The first is people entirely new to the language (in this case, Ruby) and possibly to programming in general. Ruby has quite an advantage here, being so expressive and easy to pick up. The second is people who do know the language, including common idioms, and presumably programming in general.

Since the two are so radically different, at least the way I did them, I'll give you them both at once. First the one for complete newbies:

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 1: make it readable
# part A: make it readable to a ruby newbie, possibly non-programmer


def make_word_lists file_name

  word_lists = []

  for len in 1 .. 6 do
    word_lists[len] = []
  end

  puts "Reading #{file_name}..."
  dict = File.new file_name, 'r'
  for line in dict do
    word = line.chomp
    if    word.length == 1 then word_lists[1] << word
    elsif word.length == 2 then word_lists[2] << word
    elsif word.length == 3 then word_lists[3] << word
    elsif word.length == 4 then word_lists[4] << word
    elsif word.length == 5 then word_lists[5] << word
    elsif word.length == 6 then word_lists[6] << word
    end
  end
  dict.close

  return word_lists

end



def find_combinations word_lists

  puts 'Finding combinations....'
  
  combinations = 0
  
  for word in word_lists[6] do
  
    part_1 = word[0 .. 0]
    part_2 = word[1 .. 5]
    if word_lists[1].include? part_1 and word_lists[5].include? part_2
      puts "#{part_1} + #{part_2} = #{word}"
      combinations = combinations + 1
    end
  
    part_1 = word[0 .. 1]
    part_2 = word[2 .. 5]
    if word_lists[2].include? part_1 and word_lists[4].include? part_2
      puts "#{part_1} + #{part_2} = #{word}"
      combinations = combinations + 1
    end
  
    part_1 = word[0 .. 2]
    part_2 = word[3 .. 5]
    if word_lists[3].include? part_1 and word_lists[3].include? part_2
      puts "#{part_1} + #{part_2} = #{word}"
      combinations = combinations + 1
    end
  
    part_1 = word[0 .. 3]
    part_2 = word[4 .. 5]
    if word_lists[4].include? part_1 and word_lists[2].include? part_2
      puts "#{part_1} + #{part_2} = #{word}"
      combinations = combinations + 1
    end
  
    part_1 = word[0 .. 4]
    part_2 = word[5 .. 5]
    if word_lists[5].include? part_1 and word_lists[1].include? part_2
      puts "#{part_1} + #{part_2} = #{word}"
      combinations = combinations + 1
    end
  
  end
  
  return combinations

end


start_time = Time.now
word_lists = make_word_lists ARGV[0]
combinations = find_combinations word_lists
puts "We found a total of #{combinations} combinations"

end_time = Time.now
run_time = end_time - start_time

puts "Running time: #{run_time} seconds"

Yes, it's incredibly slow, due to use of simple arrays rather than hashes, since Ruby-newbies, or even more so, programming-newbies, would not be familiar with hashes or other such semi-advanced data structures. I could make it faster by using a binary search function, but that would also have to be in there and count against the readability! Besides, speed is for Part 2. And now the one for Ruby programmers (who are presumably familiar with hashes):

#! /usr/bin/ruby

# Dave Thomas Code Kata #8: Conflicting Objectives
# See http://codekata.pragprog.com/2007/01/kata_eight_conf.html
# Solution by Dave Aronson
# version 1: make it readable
# part B: make it readable to someone who knows Ruby


def make_word_lists file_name

  word_lists = []
  
  for len in 1 .. 6 do word_lists[len] = {} end
  
  dict = File.open file_name, 'r'
  dict.each_line do |line|
    word = line.chomp
    len = word.length
    word_lists[len][word] = nil if len <= 6
  end
  dict.close

  word_lists
end
  

def find_combinations word_lists
  combinations = 0
  for word in word_lists[6].keys.sort do
    for first_part_len in 1 .. 5 do
      part_1 = word[0 .. first_part_len - 1]
      part_2 = word[first_part_len .. 5]
      if word_lists[first_part_len].include? part_1 and
         word_lists[6-first_part_len].include? part_2
        puts "#{part_1} + #{part_2} = #{word}"
        combinations += 1
      end
    end
  end
  combinations
end


start_time = Time.now
word_lists = make_word_lists ARGV[0]
combinations = find_combinations word_lists
puts "We found a total of #{combinations} combinations"
puts "Running time: #{Time.now - start_time} seconds"

So, dear readers, let's hear your input. Did you find any of that poorly readable? What do you think would have been more readable? Should I have coded up a binary search, to make Part A tolerably fast? Or going the other way, do you think I went overboard? (Especially in the horribly verbose Part A, with the inner loop unrolled?)

Sunday, August 1, 2010

Anna Graham?

No, kata number six is about anagrams, not some lady named Anna Graham. Follow the link for more details, but briefly the objective is to load in a bunch of words, say which ones were anagrams of which, what the longest anagrammed word is, and what the largest set of anagrams (i.e., words using the same letters) is. My solution hinges around the load_words method. The implementation, in Ruby, complete with tests of course, is:

#! /usr/bin/ruby

# Dave Thomas Code Kata #6: Anagrams
# See http://codekata.pragprog.com/2007/01/kata_six_anagra.html
# for problem -- but not data, he's taken down the word list.
# Solution by Dave Aronson.
# Note, some of the things that are recalculated upon demand,
# could have been figured out up front once and for all and cached,
# and vice-versa.  Meant to show mix of options.  Real-world choice
# would depend on intended use.


class Anagram_Detector

  def initialize
    @sets = {}
    @longest_word = ''
    @longest_set = []
  end

  attr :sets
  attr :longest_set
  attr :longest_word

  def bad_sets
    @sets.reject { |key, val| val.length > 1 }
  end

  def canonicalize word
    word.split(//).sort.join
  end

  # TODO MAYBE: let user specify whether to
  # show anagrams of longest anagrammed word
  def dump
    puts 'Anagrams:'
    good_sets.each_pair { |key, val| puts "  #{key}: #{val.join ' '}" }
    puts "#{num_words} words in #{num_sets} sets"
    # note that the longest_word will always be the *first* of its set,
    # *in order of insertion*; whether that's alphabetical depends on use.
    # also it may not be the first one of its size, but the first one
    # with a second arrangement (i.e., an anagram) inserted.
    # further words of that same size will not replace it.
    puts "Longest word is #{@longest_word} (and its anagrams)"
    puts "Longest set is #{@longest_set.join ' '}"
    puts 'Others:'
    bad_sets.each_pair { |key, val| puts "  #{key}: #{val}" }
  end

  def good_sets
    @sets.reject { |key, val| val.length == 1 }
  end

  # TODO MAYBE: let user specify options, such as:
  # whether to fold case
  # whether to dump dupes
  # how to handle spaces
  def load_words words
    words.each do |word|
      letters = canonicalize word
      # could alternately use "sets.key? letters" but
      # we need to retrieve what's there (if any) anyway.
      so_far = @sets[letters]
      if so_far == nil
        so_far = []
      # only bother checking length upon finding 1st anagram
      elsif so_far.length == 1 and word.length > @longest_word.length:
        @longest_word = so_far[0] 
      end
      so_far << word
      @longest_set = so_far if so_far.length > @longest_set.length
      @sets[letters] = so_far
    end
  end

  def num_sets
    good_sets.length
  end

  def num_words
    good_sets.inject(0) { |count, set| count += set[1].length }
  end

end



# TESTS

Words = [ 'foo', 'bar', 'baz', 'clang', 'honk', 'tweet',
          'post', 'pots', 'spot', 'stop', 'tops',
          'dog', 'god', 'waster', 'waters', 'rawest' ]

require 'test/unit'

class TC_MyTest < Test::Unit::TestCase

  # WARNING: done by hand, so possibly wrong;
  # must be correct for tests to work correctly!
  # Should be the result for Words, below.
  Anagrammed_By_Hand = {
    'abr'      => [ 'bar' ],
    'abz'      => [ 'baz' ],
    'acgln'    => [ 'clang' ],
    'dgo'      => [ 'dog', 'god' ],
    'eettw'    => [ 'tweet' ],
    'foo'      => [ 'foo' ],
    'hkno'     => [ 'honk' ],
    'opst'     => [ 'post', 'pots', 'spot', 'stop', 'tops' ],
    'aerstw'   => [ 'rawest', 'waster', 'waters' ]
  }

  def setup
    @detector = Anagram_Detector.new
    @detector.load_words Words
  end

  def test_number_of_sets_match
    # beware magic numbers in tests... should find a way around this....
    assert_equal 3, @detector.num_sets
  end

  def test_number_of_words_match
    # beware magic numbers in tests... should find a way around this....
    assert_equal 10, @detector.num_words
  end

  # sort results just in case things got switched around;
  # it's set membership that's important, not order

  def test_keys_match
    assert_equal Anagrammed_By_Hand.keys.sort,
                 @detector.sets.keys.sort
  end

  def test_results_match
    @detector.sets.each_key do |key|
      assert_equal Anagrammed_By_Hand[key].sort,
                   @detector.sets[key].sort
    end
  end

  def test_got_right_longest_set
    assert_equal Anagrammed_By_Hand['opst'].sort,
                 @detector.longest_set.sort
  end

  def test_got_right_longest_word
    assert Anagrammed_By_Hand['aerstw'].include? @detector.longest_word
  end

  # TODO MAYBE: test erroneous conditions

end

detector = Anagram_Detector.new
detector.load_words Words
puts "Dump:"
detector.dump
puts "\nTests:"
# Test::Unit will run the tests AFTER this

Another thing a bit unusual about this one is that I tried to make more use of Ruby's functional-style methods, like inject and reject. (Sorry, Alice's Restaurant fans, I used inspect during informal testing, and select in a previous version, but couldn't find a decent excuse to use detect, and AFAIK there are no common Ruby classes with methods infect or neglect.) Having "grown up" mostly with C, the imperative/procedural style is most intuitive to me. However, IMNSHO, one should always strive to at least consider, or better yet understand, alternatives to The Way We've Always Done It -- and maybe even think up more.

So now, as always, it's your turn. Comments? Questions? Concerns? Character assassinations? Compliments even?