Wednesday, December 21, 2005

Tree transforms on POSIX filesystems and related structures

This post comes from some discussion I had with Denys Duchier about reimplementing the inventory-rewriting step of the bzr merge procedure.

The bzr inventory system has a lot of the same constraints as a POSIX filesystem does. You can't have two files with the same name, at any point. You must never delete a directory before deleting its children. You must always create a directory before putting files inside it.

On the other hand, you can simply replace one inventory with another, updating the whole tree structure in one fell swoop. This is fairly straightforward, so that's what I did.

On the filesystem, there's no such shortcut. But there is a rather elegent way of doing it, one that I learned from reading the the GNU Arch source code.

There are 4 things you can do with a file:
  1. Create it
  2. Delete it
  3. Rename it
  4. Modify it
For regular files and directories, "modify" may mean permission changes. For files, it may mean contents changes. And for symlinks, it may mean target changes. It also includes changing a file from one type to another.

In order to allow operations to happen in the required order, we need to split rename into two operations: rename FROM, and rename TO. This means that we can classify the operations like so:
  1. Name removals (delete, rename FROM)
  2. Name insertions (create, rename TO)
  3. File modifications
If you do all your removals before insertions, you can be certain that insertions will never try to use a name that is pending removal. So as long as you don't try to insert the same name twice, you'll never violate the POSIX requirement of only one file per name.

So for name operations, we need two phases
  1. removals
  2. insertions
Splitting rename into two operations means that, during the removal phase, we need to keep renamed files somewhere else. We can keep them in a temp directory, then move them into place afterward.

It's important to do rename-FROM and delete at the same time, because of the ordering issues. Children must always be removed before parents. If you delete the parent first, your delete will fail. If you rename the parent first, you'll lose track of the child, and be unable to perform your operation on it. Because the inheritance hierarchy may include both deletes and rename-FROMs, you can't do renames or deletes as separate steps. They must be intermingled.

The inverse is true of insertions. If you try to create a child before creating its parents, the operation will fail. If you try to rename a child to its new name, the operation will also fail.

Modifications are largely free of naming concerns. However, there is one kind of modification that can affect the availability of names: file type changes. Turning a file into a directory enables it to have children. Turning a directory into something else makes it impossible for it to have children. Since file type modification requires that removals be done first, and since file type modification is required before insertions are safe, the logical place for it is in the middle.

Unfortunately, this means that you must do contents changes while renamed files are still in the temp dir.

So to recap, the three phases are:
  1. Do deletions and rename-to-temp in child-to-parent order
  2. Do file modifications
  3. Do creations and rename-to-final in parent-to-child order
I hope this is as fascinating to you as it is to me.

Thursday, December 01, 2005

When they say "Turbo", they mean it

I've been fiddling around with TurboGears recently. One web app I've worked on is a service that lets you know when the local bands you like are playing in a calendar format. So that you can say "I feel like seeing a band tonight" and see what's on. Another was a help system.

Well, I've just spent a fun few hours putting together a web front end to my Bugs Everywhere bug tracker. It was definitely a "turbo" experience. Went from zero to "does a lot and looks decent" in mere hours. It was also a "gears" experience, because half the fun was how easily everything fit together. Rock on, Turbogears.

There have been a spate of web interfaces for bzr, and I wonder whether they'd be easier using TurboGears or Django. My guess is yes-- lots easier.

These new web megaframeworks interest me a lot, partly because writing Panoramic Feedback required coming up with one. Yes, PF does templates, data persistence, url dispatch and JavaScript (not to mention having its own XML parser and PDF report generation subsystem). But if I was starting out today, there's no way I'd build our own megaframework. They're easy, they're fun, and they let you focus on what you really want to do.

TurboGears isn't the only megaframework in town, and I wouldn't turn up my nose at Django or Rails. But I'd prefer to write in Python, since that's compatible with most of my software. Ruby seems too similar to Python for it to be worthwile knowing both.

What I especially like about TurboGears is its philosophy of collaboration, and Kid templates. The fact that they're valid XHTML is neat. The fact that they pack the whole expressive power of Python (should you need it) is great. In my bug tracker front end, I'm actually passing list of Bug objects to the display template, and the template is retrieving the members it needs. Nice.

I'd like it if CherryPy had better support for RESTful URLs. SQLObject seems a little buggy to me. And CatWalk will be very nice, when it happens.

But it's already come a very long way.

Tuesday, November 15, 2005

Landed

I'm back from Ubuntu Below Zero. It was a great experience. Lots of stimulating conversations with smart people. Lots of people appreciating what we've done with bzr. Great progress on our future plans.

On the sunday, there was a dinner where we told them something people would never guess about us, and then everyone had to figure out who the facts were related to. I couldn't think of anything else, so I told them I started programming when I was eight. Turns out, that's not very unusual for Canonical employees. It may even make me a late bloomer.

Perhaps what I should have said was, "I've worked on five VCS clients in the past year".
  • Gnu Arch
  • Friendly Arch Interface (My python front-end to tla)
  • Bazaar (Canonical's fork of tla)
  • BaZing (My attempt to implement Martin Pool's idea for a new VCS)
  • Bazaar-NG (Martin Pool implementation of his idea)
Totally nuts.

Of course, they all factor into each other.
Early BaZing code was usable from Fai, the BaZing merge code was merged into bzr, and it took some ideas from tla. Lately, I've adapted Fai's shell mode for use with bzr.

Hopefully, things will settle down in the next few years. Funny how, after being a stagnant backwater for so long, VCSes are thriving now.

Sunday, October 23, 2005

Implementing merge

It seems like I've been implementing merge forever, but it's only been a year.

Merge is a fundamental operation when you have groups of people working together on a single piece of source code. It essentially means "take these other changes and combine them with mine."

The merge tech you'll see in CVS, SVN, Bazaar and Gnu Arch is called "three-way merging", because it looks at three versions of the file. The one you had before any changes were made ("BASE"), the one containing the changes the other person made ("OTHER"), and the one containing the changes you made ("THIS").

If THIS or OTHER (or both) introduced a change, it goes into the final version. But if THIS and OTHER made different changes to the same code, that's considered a conflict.

In CVS, "update" was used to merge other peoples' committed changes with your recent work. In distributed systems, it's typically its own command, and unlike CVS, it's typically invoked after you commit, not before.

My first work on merge was implementing three-way behaviour for baz merge-- at the Baz Code Sprint last November. At the time, baz and tla only supported three-way merging of text, and only when a particular command was used. I was making baz merge follow three-way behaviour.

At that code sprint, Martin Pool introduced his ideas for a new revision control system, now known as Bazaar-NG or bzr. I decided to take a stab at implementing those ideas, myself. My project, BaZing, got as far as being able to do merges and apply Arch changesets. Then I threw in with Martin to work on bzr.

I ported over the merge code to bzr. For a while, that meant a system of shims and adaptors to make one codebase speak to the other. Lately, I've been working on integrating the code better.

Martin implemented the actual text-merging portion of bzr, so we don't depend on diff3. But it generates more conflicts than I think it should, so I've been playing around with my own three-way algorithm.

And I went ahead and integrated weave merge, a technique used by SCCS and (we believe) BitKeeper, into bzr. Again, Martin had written the code, but I hooked it up to bzr's merge tech.

It's important that VCS creators not get hung up on merge. It's only one aspect of usability.

Normal, boring three-way merges are good enough quite a lot of the time. Merge has quite a lot of strange corner cases, but they're not hit all that often. Improvements are welcome, but we will never get it 100% right, because merge doesn't address combining programs, but combining text. Most merge tools don't understand what that text means. What we need, ideally, is artificial intelligence. Since we don't have that, we need tricks that make the program seem intelligent without actually being intelligent.

Merging is an art, not a science. So merge tech is a tar pit. It can take as much time as you're willing to throw at it, and still not be perfect. There are other things bzr needs, like better remote operations, central storage, better Windows support. So at times, it can be frustrating working on merging yet again.

Sunday, October 02, 2005

The Cathedral is bizarre

I can't friggin' believe Larry McVoy. I mean, I just don't understand him.

Here he is, lead designer of a powerful version control system(VCS). For a long time, BitKeeper had very good buzz in the open source world. (And, perhaps, even in the Free Software one.)

You'd think he would be proud. You'd think he'd focus on how to do even better. The last thing I expect from someone who's done great service to Linux is anticompetitive behaviour.

But lately, that's the kind of stuff we've seen. A while back, he cut Linus out because Tridge was writing an open-source Bitkeeper client. How does that work again? Now he's forced Brian O'Sullivan to stop working Mercurial, claiming he fears O'Sullivan will copy Bitkeeper's secret sauce.

Well, last time I checked, BitKeeper was a proprietary, closed-source program. Since Brian can't copy the source code, it can't be an issue of copyright infringement. No, Larry fears that Brian will copy ideas from BitKeeper.

In the first place, isn't that totally wrong? You shouldn't build a better mousetrap if you know how current mousetraps work? Edison has to invent lightbulbs in the dark? The hell?

In the second place, if Larry thinks his ideas are so special, why doesn't he patent them?

One possible reason is that not all the ideas are his own. BK is heavily based on SCCS, a 30 year-old VCS. It uses SCCS files to store its data. From what we can tell, its merge technology is also based on SCCS.

So Larry can base his VCS on someone else's, but Brian can't base his VCS on Larry's? Sure, that seems fair.

Look at the FOSS side of things, and there are no secrets. There's more than a few projects to build a great distributed VCS in progress at the moment, like Bazaar-NG (the one I'm with), Monotone, Codeville, Mercurial, Darcs, SVK, and more. Not only is the code open, but we're always chatting on IRC about merge algorithms, the merits and demerits of GUIDs for files, and other technology. IRC's where I first heard about Larry's latest escapades.

Maybe you think I should be happy that Mercurial's hit a bump in the road? Don't they say the enemy of my enemy is my friend? Maybe they do, but Brian O'Sullivan isn't my enemy, he's a competitor. We both want the better open-source VCSes. Why shouldn't we copy each others' best ideas?

Larry, he could have been another friendly competitor. But with anticompetitive behaviour and his talk of "innovation", he's starting to remind me of another of Free Software's enemies. But Microsoft has Visual Source Safe, which makes them BitKeeper's enemy, too. So I guess the enemy of my enemy is my enemy.