Wdiff Compare – Best
Brendan Scott, July MMIX
[update Dec 2010: click here for my attempt at an online comparison service.]
I have previously commented on the efficacy (or lack of it) of OOo’s compare function. In those earlier posts I reviewed some of the free software comparison solutions that are available. One of the most promising was wdiff. I did an initial review on some sample text but have not, until now, posted on any real-world use of these solutions. This is that post.
I had four text files, each containing a comparatively large agreement (about 10,000 to 12,000 words, or about 30 pages), which I needed to compare. The comparison was ‘live’ in the sense that I needed it in the course of reviewing these documents for a client. Unfortunately, this also means that I’m not at liberty to post comparison examples, since the documents are confidential (the comparisons in the earlier posts were run on a sample clause that I created).
Running Word compare on one pair of agreements was a disaster. Word seems to get to a certain point in the documents and then fall over dramatically (something I have witnessed in other situations). For example, a minor change in one paragraph would cause the whole of the balance of the document (or a large portion of it) to be marked as a change. To get a usable comparison out of Word I needed to run a comparison, extract the section on which it failed, run compare on that section, and so on iteratively. In the end I had to manually extract most of the clauses, either individually or in small groups, and (manually) run a comparison on each, because Word repeatedly failed even on the shortened extracts. This ran to about 20 or 30 separate extractions/comparisons using Word and was, frankly, pretty painful.
On a different pair of contracts I used wdiff. Wdiff had similar problems to Word, in that it got to a certain point and then began marking the whole balance of the document as a change. Wdiff’s raw performance was superior to Word’s in this respect, although the initial output, without any manual intervention, was still not usable in practice. I could have used a similar strategy to the one I used for Word, systematically extracting failed sections and re-comparing them. Wdiff, however, had a significant advantage over Word in that it could be automated pretty easily (I used Python).
I wrote some scripts to compare specific clauses. That is, compare clause x of agreement 1 against clause y of agreement 2, output the results in html markup (see note 1) and join them together. The results were fantastic. Following this strategy of clause by clause comparison, wdiff didn’t get confused and produced a very good markup of the whole of each of the documents. The downside was that it was not fully automated, since I still had to tell the scripts which clauses to compare, but the results justified the comparatively small amount of manual input I needed to make.
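For readers who want to try the clause by clause approach, here is a minimal sketch of what such a script might look like. It is not the script I actually used. The function names (`wdiff_to_html`, `compare_clause`, `compare_agreements`) and the idea of holding each agreement as a dictionary of clause label to clause text are assumptions made for illustration. It relies on wdiff’s default markers, `[-deleted-]` and `{+inserted+}`, and converts them to `<del>` and `<ins>` tags.

```python
import html
import re
import subprocess
import tempfile
from pathlib import Path


def wdiff_to_html(marked):
    """Turn wdiff's default [-deleted-] and {+inserted+} markers into HTML."""
    # Escape &, < and > first so the clause text cannot break the page.
    escaped = html.escape(marked, quote=False)
    escaped = re.sub(r"\[-(.*?)-\]", r"<del>\1</del>", escaped, flags=re.DOTALL)
    escaped = re.sub(r"\{\+(.*?)\+\}", r"<ins>\1</ins>", escaped, flags=re.DOTALL)
    return escaped


def compare_clause(old_text, new_text):
    """Run wdiff on two clause texts and return the marked-up result as HTML."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path = Path(tmp) / "old.txt"
        new_path = Path(tmp) / "new.txt"
        old_path.write_text(old_text, encoding="utf-8")
        new_path.write_text(new_text, encoding="utf-8")
        result = subprocess.run(
            ["wdiff", str(old_path), str(new_path)],
            capture_output=True, text=True, encoding="utf-8",
        )
    # As I understand it, wdiff follows diff's convention:
    # 0 = no differences, 1 = differences found, anything else = trouble.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr)
    return wdiff_to_html(result.stdout)


def compare_agreements(clauses_1, clauses_2, pairs):
    """Compare the listed (x, y) clause pairs and join them into one HTML page.

    clauses_1 and clauses_2 are hypothetical dicts of clause label -> text;
    pairs is the manually chosen list of which clause goes against which.
    """
    sections = []
    for x, y in pairs:
        body = compare_clause(clauses_1[x], clauses_2[y])
        heading = html.escape(f"{x} / {y}")
        sections.append(f"<h3>{heading}</h3>\n<p>{body}</p>")
    return "<html><body>\n" + "\n".join(sections) + "\n</body></html>"
```

The `pairs` list is where the manual input goes: something like `[("1.1", "1.1"), ("1.2", "1.3")]` when a clause has been renumbered between drafts. One caveat with this sketch is that a clause which itself contains the literal text `[-` or `{+` would confuse the conversion step.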
Indeed, the automation allowed me to do some interesting things, like comparing different clauses within the same document (e.g. comparing clause 1.1 against 1.2, 1.3, 1.4 and 1.5) to see how they differed. OpenOffice doesn’t properly render the html which is output though (it doesn’t show strikethrough), so Firefox needs to be used for display.
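The within-document comparison can be sketched in the same way. Again, this is an illustration under assumptions rather than my actual script: `compare_one_to_many` is a hypothetical name, `clauses` is assumed to be a dictionary of clause label to clause text, and `compare` is whatever function you use to produce the marked-up comparison of two texts (for example, a wrapper around wdiff).

```python
import html


def compare_one_to_many(clauses, base, others, compare):
    """Compare one clause against several others from the same document.

    clauses: dict mapping clause label -> clause text (hypothetical layout)
    base:    label of the clause everything else is compared against
    others:  labels of the clauses to compare with the base clause
    compare: function taking (old_text, new_text) and returning HTML markup
    """
    sections = []
    for other in others:
        body = compare(clauses[base], clauses[other])
        title = html.escape(f"{base} against {other}")
        sections.append(f"<h3>{title}</h3>\n<p>{body}</p>")
    # One page with a section per comparison, ready to open in a browser.
    return "<html><body>\n" + "\n".join(sections) + "\n</body></html>"
```

Called as `compare_one_to_many(clauses, "1.1", ["1.2", "1.3", "1.4", "1.5"], compare)`, this gives one page with a section for each comparison, which can then be saved to a file and opened in Firefox.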
On that basis I’m revising my earlier rankings as follows:
wdiff – 8.5 (revised up because it is easy to script and this results in unexpectedly useful benefits)
Word – 7 (revised down because the markup goes feral repeatedly on the long documents I compared)
I did not revisit Docdiff, although I suspect it could also be automated in a similar fashion.