Real World Test – Wdiff Best


Wdiff Compare – Best

Brendan Scott, July MMIX

[update Dec 2010: click here for my attempt at an online comparison service.]

(See also: docdiff post, wdiff post, OOo post)

I have earlier commented on the efficacy (or lack of it) of OOo’s compare function.  In my earlier posts I reviewed some of the comparison solutions which are available and are also free software.  One of the most promising of those was wdiff.  I did an initial review on some sample text but have not, to date, made any post on any real world use of these solutions.  This is that post.

I had four comparatively large agreements, each in a text file (about 10,000-12,000 words, or about 30 pages, each), which I needed to compare.  This comparison was ‘live’ in the sense that it was needed in the course of my review of these documents for a client.  Unfortunately, this also means that I’m not at liberty to post comparison examples since the documents are confidential (the comparisons in the earlier posts were run on a sample clause that I created).

Running Word compare on one pair of agreements was a disaster.  Word seems to get to a certain point in the documents and then fall over dramatically (something I have witnessed in other situations).  For example, minor changes in a paragraph would cause the whole of the balance of the document (or a large portion of it) to be marked as a change.  In order to get a usable comparison with Word I needed to run a comparison, extract the section on which it failed, run compare on that section, and so on iteratively.  In fact, in the end I needed to manually extract most of the clauses, either individually or in small groups, and (manually) run a comparison on each, because Word repeatedly failed even on the shortened extracts.  This ran to about 20 or 30 separate extractions/comparisons using Word and was, frankly, pretty painful.

On a different pair of contracts I used wdiff.  Wdiff had similar problems to Word in that it got to a certain point at which it began marking the whole balance of the document as a change.  Wdiff’s raw performance was superior to Word’s in this respect, although the initial output, without any manual intervention, was still not usable in practice.  I could have used a similar strategy to that for Word of systematically extracting failed sections and re-comparing them.  Wdiff, however, had a significant advantage over Word in that it could be automated pretty easily (I used Python).  I wrote some scripts to compare specific clauses: that is, compare clause x of agreement 1 against clause y of agreement 2, output the results in HTML markup (see note 1) and join them together.  The results were fantastic.  With this clause-by-clause strategy, wdiff didn’t get confused and produced a very good markup of the whole of each of the documents.  The downside was that it was not fully automated (I had to tell the scripts which clauses to compare), but the results justified the comparatively small amount of manual input I needed to make.
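The clause-by-clause approach described above could be sketched roughly as follows. This is not the original script, just a minimal illustration under stated assumptions: clauses are assumed to start at the beginning of a line with a number such as "1." or "12.3", the function names (split_clauses, wdiff_command, wdiff_html, compare_clauses) are hypothetical, and the choice of <del>/<ins> tags is illustrative. The only wdiff features relied on are its -w, -x, -y and -z options, which set the start/end markers for deleted and inserted text.

```python
import html
import re
import subprocess
import tempfile
from pathlib import Path

# Assumption: a clause starts at the beginning of a line with a number
# such as "1." or "12.3". Real agreements may need a different pattern.
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+", re.MULTILINE)


def split_clauses(text):
    """Return {clause_number: clause_text} for a plain-text agreement."""
    matches = list(CLAUSE_START.finditer(text))
    clauses = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clauses[m.group(1)] = text[m.start():end].strip()
    return clauses


def wdiff_command(old_path, new_path):
    """Build a wdiff call that marks deletions and insertions with HTML tags."""
    return ["wdiff",
            "-w", "<del>", "-x", "</del>",   # start/end of deleted text
            "-y", "<ins>", "-z", "</ins>",   # start/end of inserted text
            str(old_path), str(new_path)]


def wdiff_html(old_text, new_text):
    """Run wdiff on two clause texts and return the marked-up result."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path, new_path = Path(tmp, "old.txt"), Path(tmp, "new.txt")
        # Escape first so that only the wdiff markers end up as real tags.
        old_path.write_text(html.escape(old_text) + "\n", encoding="utf-8")
        new_path.write_text(html.escape(new_text) + "\n", encoding="utf-8")
        result = subprocess.run(wdiff_command(old_path, new_path),
                                capture_output=True, encoding="utf-8")
    # wdiff exits with 0 (no differences) or 1 (differences); 2 means trouble.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr)
    return result.stdout


def compare_clauses(doc_a, doc_b, pairs, differ=wdiff_html):
    """Compare the listed (clause_a, clause_b) pairs and join them into one page."""
    sections = []
    for a, b in pairs:
        body = differ(doc_a[a], doc_b[b])
        sections.append(f"<h3>Clause {a} vs clause {b}</h3>\n"
                        f'<div style="white-space: pre-wrap">{body}</div>')
    return "<html><body>\n" + "\n".join(sections) + "\n</body></html>"
```

The list of pairs is the manual input: someone still has to decide that, say, clause 4 of one agreement corresponds to clause 5 of the other. Because each wdiff run only ever sees one clause from each side, a badly matched pair cannot throw off the markup for the rest of the document.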

Indeed, the automation allowed me to do some interesting things, like comparing different clauses within the same document (e.g. comparing clause 1.1 against 1.2, 1.3, 1.4 and 1.5) to see how they differed.  OpenOffice doesn’t properly render the HTML output, though (it doesn’t show strikethrough), so Firefox needs to be used for display.
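Comparing one clause against several others in the same document might look something like the sketch below. To keep the example runnable without wdiff installed, it swaps in Python's standard difflib module for the word-level comparison, so its output will not necessarily match wdiff's. The names word_diff_html and compare_within, the sample clause texts and the <del>/<ins> tags are all illustrative assumptions.

```python
import difflib
import html


def word_diff_html(old, new):
    """Word-level diff of two strings, marked up with <del> and <ins>.

    A stand-in for wdiff built on difflib.SequenceMatcher.
    """
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(html.escape(" ".join(a[i1:i2])))
        if tag in ("delete", "replace"):
            out.append("<del>" + html.escape(" ".join(a[i1:i2])) + "</del>")
        if tag in ("insert", "replace"):
            out.append("<ins>" + html.escape(" ".join(b[j1:j2])) + "</ins>")
    return " ".join(out)


def compare_within(clauses, base, others, differ=word_diff_html):
    """Compare clause `base` against each clause in `others` of one document."""
    sections = []
    for other in others:
        body = differ(clauses[base], clauses[other])
        sections.append(f"<h3>Clause {base} vs clause {other}</h3>\n<p>{body}</p>")
    return "\n".join(sections)


# Hypothetical sample clauses, for illustration only.
sample = {
    "1.1": "The Supplier must deliver the Goods",
    "1.2": "The Supplier may deliver the Goods",
    "1.3": "The Customer must accept the Goods",
}
report = compare_within(sample, "1.1", ["1.2", "1.3"])
```

Laying near-identical clauses side by side like this makes it easy to spot the one word that differs between them, which is hard to do by eye in a 30-page agreement.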

On that basis I’m revising my earlier rankings as follows:

wdiff – 8.5 (revised up because it is easy to script and this results in unexpectedly useful benefits)

Word – 7 (revised down because the markup goes feral repeatedly on the long documents I compared)

I did not revisit Docdiff, although I suspect it could also be automated in a similar fashion.

6 Responses to “Real World Test – Wdiff Best”


  1. 1 Nick 15 July 2009 at 4:52 pm

    Is it not possible to run the documents through something that will produce plain text or html or something and use standard diff tools? Is this just as bad?

  2. 2 brendanscott 15 July 2009 at 9:06 pm

    Hi Nick

    Thanks for the comment. In re diff have a look at the “see also” posts I mention at the top of the post particularly this one:

    Docdiff compare – Good! + Roundup

    In short, line-based diff is as bad as OOo compare, i.e. awful.

  3. 3 Bill 17 July 2009 at 11:30 am

    I’ve had to do some of this type of stuff on huge documents (think 1k pages or more). My solution was to use antiword (which converts the document to ASCII text) and then use the multitude of diff programs that work on ASCII text. This worked well for me (in comparison to Word). I’m sure that you could use converters to HTML, XML, or some other quasi-ASCII formats as well.

    YMMV,

    Bill


