Wednesday, February 21, 2007

Living the sysadmin dream

There's a legendary threat of sysadmins: "Go away, or I will replace you with a very small shell script".

Well, today I got a taste of that.

At work, we've got a new web site almost ready to go. The content is stored the plainest-possible XHTML. That makes it somewhat future-proof. We process that through Kid templates to make it look snazzy.

We wanted to reuse a lot of the content from our existing site, which is a mix of HTML and PHP with some abuse of divs and other incidental wackiness. We knew the error of our ways, and we wanted to get pure, so we hired a contractor to convert our site over. He knew Python. He seemed to have a good grasp of advanced web site architectures.

Boy, were we ever wrong.

Originally expected to take one month, the job ended up taking three. In the process, I had to write a script to do automated testing of the guy's work, because he was screwing up on the mechanical stuff. Problems that can be solved with search and replace, he was choosing to solve by hand, and getting wrong. It was insulting getting work with the same mechanical errors time and time again.

What I had, perhaps naively, expected was that he would write a conversion script, eyeball the results, and fix up any problems.

There were a few files we'd asked him not to convert, but we eventually decided to convert some of them after all. The task fell to me, and it was a real joy to set to it. After all that time spent poring over his work, I'd developed a very clear idea of what I thought he should have done in the first place.

Phase 1: Text Substitution
In this phase, we convert PHP to HTML. In the general case, you can't do this without implementing PHP. But our existing site uses PHP only as a templating system, so there are only a couple of stock phrases to worry about. re.sub handles this nicely.

Another thing we do is remove any ® symbols we encounter, because our house style has changed, and we no longer want to affix this to every mention of our brand.


Phase 2: Tidy HTML
This phase consists of shoving all our files through Tidy, so that they become well-formed, almost-valid HTML. It's wonderful to have a utility like Tidy, because programs can't feel pain.

Phase 3: Ending div abuse

Our existing content uses div classes for headings all over the place. To tackle these problems, we should really operate in the XML domain, so that we avoid parsing mistakes. Now that tidy has made the documents well-formed, we can parse them with ElementTree's HTML parser. Then we process it through a Kid Template with a series of match rules. These turn divs into h2s, and so on. We can also clean up our document title (which has an inexplicable trailing non-breaking spaces)

Phase 4: Whitespace cleanup
The kid processing has fixed the semantics of the document and exported it as XHTML, but we're left with swaths of whitespace and line breaks in unreasonable places. Another run through Tidy fixes that.

Automation is wunnerful
Not only did I manage to get a basic conversion routine in place in about three hours, I actually stayed at long past my notional end-of-shift, because I was enjoying what I was doing. Writing a script to convert files to XHTML is fun. Doing the conversion yourself is not. That's a problem our contractor faced. Not only was he slow when he was actually working, but he had trouble getting motivated to work, so he didn't put in as much time as we wanted him to.

They say that good programmers are 10x more productive than average programmers, and truly excellent programmers are 10x more productive again. I suspect automation is one of the factors in that difference. Unless you stop them, an excellent programmer will find a way to eliminate the boring bits. Even if automating the job takes just as long as doing it by hand, a programmer will be happier and more motivated because the work is much more stimulating.

So today, I replaced a contractor with a reasonably small Python script. I wish I'd done it months ago.