Docdiff compare – Good! + Roundup
(see also: Real World Test)
Uwe (thanks Uwe!) posted a comment to my Wdiff post suggesting I try docdiff. Docdiff is a Ruby script which has a number of output formatting options (and looks prettier than wdiff). Running docdiff on the sample files gives:
1.1 Contractor may invoice [Customer] for ongoing fees quarterly in arrears and other fees monthly in arrears. Contractor must provide [Customer] with a tax invoice in respect of all GST charged. [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.
The cut and paste here has removed the colored highlighting from each add/remove (see the screenshot above for a general indication). I could do without the different colors for different types of addition/deletion. This output looks very similar to the markup provided by wdiff, including the marking of “[Customer]” being replaced by “Customer”. It has also caught “charged” as a deletion/addition when it shouldn’t have; wdiff didn’t make this mistake. However, docdiff has another trick up its sleeve: it can work as a character-based engine with the --char switch. With this on, the output is:
Con tractor may invoice [Customer ] for ongoing fees quarterly in a rrea rs and other fees monthly in arrears. Con tractor must provide [Customer ] with a tax invoice in respect of all GST charged. [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.
The markup is more minimal, and it hasn’t incorrectly caught “charged”. However, it has come up with a strange result I’ve never seen in a markup before: a letter-by-letter change of Contractor to Vendor. It’s a bit unnerving to see something like “Con tractor” appear in a markup. Conceptually, you probably do want the whole word marked as a change, to indicate you’ve stopped calling one of the parties one thing and are now calling it another.
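To make the word-versus-character difference concrete, here is a minimal Python sketch. It uses the standard library's difflib, not docdiff's or wdiff's own engines, so the exact splits it produces are only illustrative. The function name mark_changes, the [-...-] and {+...+} markers, and the shortened sample sentences are my own choices for the example.

```python
from difflib import SequenceMatcher


def mark_changes(old, new, joiner=""):
    """Mark deletions as [-...-] and insertions as {+...+}.

    `old` and `new` are sequences of tokens: lists of words for a
    word-level comparison, or plain strings for a character-level one.
    """
    pieces = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(joiner.join(old[i1:i2]))
        if tag in ("delete", "replace"):
            pieces.append("[-" + joiner.join(old[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + joiner.join(new[j1:j2]) + "+}")
    return joiner.join(pieces)


# Shortened, hypothetical versions of the sample clause.
old = "Contractor must provide [Customer] with a tax invoice"
new = "Vendor must provide Customer with a tax invoice"

# Word level: whole words are the unit of change.
word_out = mark_changes(old.split(), new.split(), joiner=" ")
print(word_out)

# Character level: single characters are the unit of change, so a
# renamed party can be split into several small edits.
char_out = mark_changes(old, new)
print(char_out)
```

At the word level the whole of “Contractor” is marked as replaced by “Vendor”. At the character level any letters the two words happen to share are kept as unchanged text, which is what produces the chopped-up look.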
The Wash Up So Far
Docdiff provides very readable output, albeit subject to some inaccuracy at the word level of resolution. I suspect that it would not be too hard to coax wdiff to produce similar output (using the --start-delete=STRING etc. switches). That said, docdiff also provides a variety of different output formats and the ability to use the (slightly weird) --char option. It’s hard to say which is better.
In an update to my wdiff post, I also tested meld – which performed poorly, albeit better than OOo (sad, but true). A test of meld is unfair though as this is outside its intended domain of application. Everybody loves a ranking, so my current rankings (out of 10) of them are:
Word 8.5 (markup known to go feral at times)
docdiff 7.5 (idiosyncrasies in the operation of the engine are offset by its output and flexibility – revised down a little based on some more testing of the engine)
wdiff 7.5 (I am assuming I can script something relatively easily to produce more visually appealing markup)
This is all subject to the caveat that I haven’t tried these tools (other than Word) on long documents. I may find that they go feral too on long files. The takeaway point though is that there are free software solutions which give acceptable markup performance, at least on .txt files. I have never had the need to compare formatting, so a .txt limitation is no great shakes for me.
I have had a search in YaST for “diff”. It’s hard to tell, but the things it throws up seem to be line based (like diff). If anything looks interesting there I will investigate further.
Finally, improving compare performance has been added as a potential project for Go-OO in Google’s Summer of Code. If you’re interested, here’s your chance to make your mark! (up… errr …. so to speak)
The answer to coaxing HTML markup out of wdiff is:
wdiff --start-delete=\<del\> --end-delete=\<\/del\> --start-insert=\<ins\> --end-insert=\<\/ins\>
wdiff --start-delete=\<del\ style=\"color:red\"\> --end-delete=\<\/del\> --start-insert=\<ins\ style=\"color:blue\"\> --end-insert=\<\/ins\>
Although, it needs more smarts for multi-line text because browsers need <p> tags.
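One way to add those smarts is a small post-processing step. The following is a rough Python sketch, assuming the wdiff output has been captured as a string and that paragraphs in the .txt file are separated by blank lines. The function name wrap_paragraphs is made up for the example, and it does not handle an insertion or deletion that runs across a paragraph break, nor does it escape characters such as & or < in the text itself.

```python
import re


def wrap_paragraphs(marked_up):
    """Wrap each blank-line-separated block of text in <p>...</p>.

    Line breaks inside a block are collapsed to single spaces, since
    a browser treats them as ordinary whitespace anyway.
    """
    blocks = re.split(r"\n\s*\n", marked_up.strip())
    paragraphs = []
    for block in blocks:
        text = " ".join(block.split())
        if text:
            paragraphs.append("<p>" + text + "</p>")
    return "\n".join(paragraphs)


# Hypothetical wdiff output using the <del>/<ins> delimiters above.
sample = (
    "<del>Contractor</del> <ins>Vendor</ins> may invoice\n"
    "for ongoing fees.\n"
    "\n"
    "A second clause.\n"
)
wrapped = wrap_paragraphs(sample)
print(wrapped)
```

The result can be pasted into the body of an HTML page and opened in a browser to check the red and blue markup.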