Friday, March 29, 2013

All Right, Break It Up!

   On a recent project, I worked on a very large Ruby on Rails application.  My temporary employer joined the project after it was well underway.  The app was composed of four "portals", three of which were already mostly written, and fairly large and complex.  Our work was mainly (and mine was exclusively) on the fourth, which the client was also working on at the same time.

   An app composed of four clearly defined parts could be written as four, or at least three, separate Rails engines mounted on one app.  Or possibly as four separate apps with a library of shared code, all accessing the same database.  Or if the four parts are relatively small, one application that includes them all.  In this case... it was monolithic, but not of small parts at all.  We repeatedly recommended breaking it up, but the client just didn't want to.  Funny thing about consulting is, the client is always right -- not just even when he's wrong, but especially when he's clearly, provably, absolutely flat-out dead wrong.

   "So what?", you might be wondering.  "What good would breaking it up do, aside from pleasing the ivory-tower purists?"

   The problem is, this has quite an effect on the rate of progress.  Even if the engineering of the code itself is very clean, so you don't have god-objects causing churn and interference in the actual code (as the client did), and you have excellent developers making rapid progress in developing features (as, thankfully, both the client and we did, he said ever so humbly)... look at what happens when someone's just trying to commit a feature.  The skill level of the people involved doesn't matter at all at this stage; it's just pure statistics.

   Suppose you are on a team of about twenty developers, making a web app to run a school, with different portals for teachers, administrators, students, and parents.  You and a colleague pair program on the next highest-priority ticket from the backlog in the issue tracker.  It turns out to be in the parent-portal.  You create a feature-branch, off the master branch.   You get the feature passing the automated acceptance tests given in the user story, pull the latest master, merge it into your feature-branch (resolving any conflicts), and run the whole test suite.  It passes, so you didn't break anything.

   Now what?  Merge into master?  You try... and get rejected by the revision control system because someone committed changes to master while you were testing.  Not surprising, as there are twenty of you, all hard at work, and maybe some of them aren't pairing so there are even more than ten features under active development... and the test suite takes an hour to run.  The changes were in a different portal, so unlikely to interfere... but better safe than sorry.  Nobody wants to be the jerk who broke the build, or even the Continuous Integration server.  So you pull the latest master, merge it into your branch,  and run the test suite again.  Still green, so your changes don't break anything that was merged into master during your previous test run.  Nice to know, but it means that one of the test runs was a total waste of time.  Maybe you used it for something else productive... or not.  Either way, it's nearly certain that your mental house of cards concerning that feature has utterly collapsed, because you've been thinking about something else.

   So now, insert the above paragraph, again.  Maybe you pull some fancy tricks like parallelization and extensive mocking and stubbing of slow services and expensive object creation and so on, and cut the time down to fifteen minutes.  That's still plenty enough time for it to happen again.  And again.  Lather, rinse, repeat, ad nauseam, which gets to feeling like ad infinitum... especially to your client, who is breathing down your neck, waiting for this feature.
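
   To put rough numbers on "plenty enough time", here is a minimal sketch.  It assumes merges land on master at random at a steady average rate (a Poisson model, which is my simplification, not something measured on the project).  The rate of 1.4 merges per hour is invented, picked only so that an hour-long suite gets invalidated about three times in four; the function and constant names are made up for illustration.

```ruby
# Chance that at least one merge lands on master while the suite runs,
# ASSUMING merges arrive at random at a steady average rate (Poisson model).
def interference_probability(merges_per_hour, suite_minutes)
  1.0 - Math.exp(-merges_per_hour * suite_minutes / 60.0)
end

# ASSUMPTION: invented figure, not measured on any real project.
ASSUMED_MERGES_PER_HOUR = 1.4

[60, 15].each do |minutes|
  chance = interference_probability(ASSUMED_MERGES_PER_HOUR, minutes)
  puts format("%2d-minute suite: %.0f%% chance of having to rerun", minutes, chance * 100)
end
# prints:
# 60-minute suite: 75% chance of having to rerun
# 15-minute suite: 30% chance of having to rerun
```

   Under those assumptions, cutting the suite to a quarter of its length still leaves you rerunning it nearly one time in three.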

   Nobody's happy, not him, not your boss, and certainly not you.  You're a good developer, you want to be productive, you believe in testing... but you hate the inane futility of having to do it over and over in vain.

   Now let's consider what it might be like if the app had been broken up into separate engines or apps.  Even assuming a fifth item (an app to mount engines on, or a library of code shared among apps), the size of the codebase you need to deal with at one time is still cut down to 40% (assuming all pieces are of equal size).  You could possibly make it 20%, but let's even assume the worst (within the assumptions already made).  How would that help with this problem?

   First, only 20 to 40% of the changes being worked on (assuming equal distribution of current work) are likely to block your code from being accepted by version control.  (You're probably only working on one of the apps/engines, and maybe the library or main app but usually not.)  Call it 30% to keep it simple.  This means that any given change to the overall system has only 30% of its former probability of making you run the test suite again.

   But wait!  There's more!  (Or rather, there's less!)  The time it takes to run the test suite should also get cut to about 40%.  (Less when you don't change anything the shared code depends on, so you only have to run one portal/app's tests; the same when you do, since you run that plus the shared code's tests; more when you change the shared code, since you run everything.  Call it a wash to make the math easy.  Still certainly a smallish fraction.)
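
   The three cases in that parenthetical can be sketched as a lookup.  Everything here is hypothetical: the component names are borrowed from the school example, "shared" stands for the fifth piece, the scope symbols are my own labels, and all five pieces are assumed to have equal-sized test suites.

```ruby
# Hypothetical components: four portals plus the shared code (or host app).
PORTALS = %w[teachers administrators students parents].freeze
COMPONENTS = (PORTALS + ["shared"]).freeze

# Which suites to run for a change made while working on `portal`.
#   :portal_only       - the change stays entirely inside the portal
#   :portal_and_shared - the change also involves what the shared code relies on
#   :shared_changed    - the shared code itself was changed
def suites_to_run(portal, scope)
  case scope
  when :portal_only       then [portal]
  when :portal_and_shared then [portal, "shared"]
  when :shared_changed    then COMPONENTS.dup
  else raise ArgumentError, "unknown scope: #{scope.inspect}"
  end
end

# Fraction of the old monolithic test time, ASSUMING equal-sized suites.
def test_time_fraction(suites)
  Rational(suites.size, COMPONENTS.size)
end

%i[portal_only portal_and_shared shared_changed].each do |scope|
  suites = suites_to_run("parents", scope)
  percent = (test_time_fraction(suites) * 100).to_i
  puts "#{scope}: #{suites.join(', ')} (#{percent}% of the full run)"
end
```

   That gives 20%, 40%, and 100% of the old run time for the three cases, which is why 40% is a reasonable middle figure as long as changes to the shared code are rare.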

   Not only is that an instant time-savings right there, and not only does it make it easier for you to stay mentally on-task in case some more changes are needed, but:

   Combine these two factors (30% of 40%), and they mean you have only 12% of the original probability of an interfering change happening during your tests, making you run them again.  Suppose the original probability was 75%, so that you had to run them again 3/4 of the time.  Your new probability is 12% of that, or 9%, only about 1 in 11.  Since those odds apply to every test run, including the ones you did as repeats, that means a drastically shortened chain of tests and retests, with shorter links and vastly fewer of them.
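
   To see what that does to the whole chain, here is a small worked sketch.  It assumes each test run independently has the same chance of being invalidated, so the number of runs follows a geometric distribution with mean 1 / (1 - p).  It also takes the hour-long suite from earlier and the post's own simplification of scaling the probability linearly.  The function names are mine.

```ruby
# Expected number of test runs before one "sticks", ASSUMING each run is
# independently invalidated with the same probability (geometric distribution).
def expected_runs(rerun_probability)
  1 / (1 - rerun_probability)
end

# Expected total minutes spent running the suite for one feature.
def expected_test_minutes(rerun_probability, suite_minutes)
  expected_runs(rerun_probability) * suite_minutes
end

old_probability = Rational(3, 4)                           # 75%, as supposed above
combined_factor = Rational(30, 100) * Rational(40, 100)    # 12%
new_probability = old_probability * combined_factor        # 9%

old_suite_minutes = 60                                     # the hour-long suite
new_suite_minutes = old_suite_minutes * Rational(40, 100)  # 24 minutes

puts format("monolith: %.1f runs, %.1f minutes of testing",
            expected_runs(old_probability).to_f,
            expected_test_minutes(old_probability, old_suite_minutes).to_f)
puts format("split up: %.1f runs, %.1f minutes of testing",
            expected_runs(new_probability).to_f,
            expected_test_minutes(new_probability, new_suite_minutes).to_f)
# prints:
# monolith: 4.0 runs, 240.0 minutes of testing
# split up: 1.1 runs, 26.4 minutes of testing
```

   Under those assumptions, the expected testing time per feature drops from four hours to under half an hour.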

   Now let's finally pile on all the other benefits.  Of course better separation of concerns would lead to cleaner design, for easier extension, maintenance, and repair.  But it would also ease some of the other pains we were having.  If people were devoted to just some particular portal, and maybe permitted to mess with the shared code, they could separate the backlogs, and the source repositories, and so on.  That means only 20-40% as much email flooding their inboxes about the latest feature additions, bug reports, pull requests, and so on.  They might even turn their email notifiers back on, so they can find out about real emergencies in a reasonable amount of time.  They might be less tempted to skip testing.  They might be able to develop deep expertise on a piece of the project, instead of knowing just enough about all of it to be dangerous.  They might not abandon the project (or the entire company) in frustration.  And on and on.

   So now it's your turn, dear reader.  Have you worked on a megalithic app, with obvious seams to break it apart at?  Did you do it?  If so, what benefits did it bring -- what pains did it ease, what pleasures did it bring?  If not, why not?  Either way, what eventually happened?