Wdiff compare – Good!
[update Dec 2010: click here for my attempt at an online comparison service.]
(see also: Real World Test, Docdiff post, OOo compare post)
I previously commented (adversely) on the performance of OpenOffice.org’s compare function. Tridge suggested that I might try wdiff – and I’m pleasantly surprised. When I ran wdiff on the two text files I used in that post it gave me this:
[-1.1Contractor-]{+1.1[Vendor]+} may invoice [-[Customer]-] {+Customer the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer+} for ongoing fees quarterly in [-arrears-] {+advance+} and other fees [-monthly-] in arrears. [-Contractor must-] {+[Vendor] will+} provide [-[Customer]-] {+Customer+} with a tax invoice in respect of all GST charged. [-[Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.-]
Which, if you can decipher the inline formatting instructions, is a better performance than OOo (not that that would be difficult, mind…). In fact, if I’m reading it correctly, it seems to be only slightly worse than the output of Word’s compare: it marks a change within a “word” as a change of the whole word, so [-[Customer]-] {+Customer+} is not a minimal change set; it should be [-[-] and [-]-]. I will try to find some time to test it on longer clauses or whole agreements. A bonus is that its output format is cut-and-paste compatible, albeit a little visually unappealing. It shouldn’t be too much trouble to pipe it through something to pretty it up (the -p option apparently prints overstrikes, but this doesn’t work on my tty; ah, -t is what I need for more visual output on a tty). A second bonus is that -s gives change statistics (I don’t recall ever seeing this in Word).
The other nice thing about wdiff is that it was already installed on my machine! About two (maybe three??) years ago I googled around for something that would do diffs on text files, but the searches only showed diff, kdiff etc. Wdiff apparently dates back to the early 90s so my search terms maybe weren’t the best – or too many other people are using and linking to diff! Anyway, wdiff seems to be part of the standard install for OpenSuSE and I’d had it all along.
And thanks Tridge!
Update:
Tridge says:
There are probably quite a few other similar utilities. Minimal diff algorithms are a nice little computer science task. I'm sure heaps of people have written variants. I expect quite a few CS courses use this sort of thing as an exercise for 2nd year programming students... Improving the algorithms in OOo would be a nice Google summer of code project.
Which I think would be a great idea and a valuable addition to OOo. If anyone’s game to try (either as a student or a supervisor) please drop me an email or lodge a comment and I’ll try to connect you.
Update 2:
Following someone’s suggestion, I have tried meld – am impressed by the interface, but not by the markup. Run on the same two sample text files it marked:
with a tax invoice in respect of all GST charged.
as unchanged, with the balance as deletion/addition. Sadly, this is a better performance than OOo (but is still inadequate).
I pipe the output of wdiff through a short C program I wrote which turns wdiff’s markup into the ANSI terminal codes for colour. So the “old” text is marked with a red background and the “new” text with a green background.
Looks great in a bash terminal.
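For anyone without that C program, the same idea can be sketched in a few lines of Python. This is not the commenter's code, just an illustration under some assumptions: it reads wdiff's default markup ([-old-] and {+new+}) from standard input, and the function name colourise and the colour constants are made up for this example.

```python
import re
import sys

# Standard ANSI escape sequences: red background, green background, reset.
RED_BG = "\033[41m"
GREEN_BG = "\033[42m"
RESET = "\033[0m"

# wdiff's default markers: [-deleted text-] and {+inserted text+}.
# Non-greedy matching keeps each marked span separate; DOTALL lets a
# span run across line breaks.
MARKUP = re.compile(r"\[-(?P<old>.*?)-\]|\{\+(?P<new>.*?)\+\}", re.DOTALL)


def colourise(text):
    """Replace wdiff markers with ANSI background colours (hypothetical helper)."""
    def repl(match):
        if match.group("old") is not None:
            return RED_BG + match.group("old") + RESET
        return GREEN_BG + match.group("new") + RESET
    return MARKUP.sub(repl, text)


if __name__ == "__main__":
    # Intended use: wdiff old.txt new.txt | python colourise.py
    sys.stdout.write(colourise(sys.stdin.read()))
```

The file names old.txt, new.txt and colourise.py in the usage comment are placeholders. Text that itself contains the marker sequences would confuse this simple pattern.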
I never tried it, but this extension looks interesting even if it is not open source:
http://extensions.services.openoffice.org/project/DeltaXMLODTCompare
The program wdiff is a front end to diff for comparing files on a word per word basis. A word is anything between whitespace. This is useful for comparing two texts in which a few words have been changed and for which paragraphs have been refilled. It works by creating two temporary files, one word per line, and then executes diff on these files. It collects the diff output and uses it to produce a nicer display of word differences between the original files.
This is the first paragraph of the first link in your article. As you can see, wdiff is designed to work on a word-by-word basis. It doesn’t try to find a minimal change set on a per-character (or per-bit) basis. Besides, I’m not sure whether this would actually be useful.
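The word-by-word approach in that description can be sketched roughly as follows. This is only an illustration, not wdiff's implementation: it uses Python's difflib.SequenceMatcher in place of temporary files and the external diff program, and the function name word_diff is made up for the example. SequenceMatcher does not promise a minimal edit sequence, so its output may differ from wdiff's on harder inputs.

```python
from difflib import SequenceMatcher


def word_diff(old_text, new_text):
    """Return wdiff-style markup for two texts (hypothetical helper).

    As in the description above, a word is anything between whitespace,
    so the original spacing and line breaks are not preserved.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # Stand-in for running diff on one-word-per-line temporary files.
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
            continue
        if tag in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


if __name__ == "__main__":
    print(word_diff("fees quarterly in arrears", "fees quarterly in advance"))
    # prints: fees quarterly in [-arrears-] {+advance+}
```

Because whole words are the unit of comparison, [Customer] versus Customer comes out as a full-word replacement, which matches the behaviour noted in the post.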
If you’re just comparing plain text files, have a look at ‘meld’ (http://meld.sourceforge.net/). This is a graphical diff and editor utility in one. Works line by line though, I’m afraid.
Jo
docdiff!
I think docdiff is what you are looking for.
It produces a diff file (using HTML markup) in which the differences
are highlighted word by word.
Apart from this, it is part of Debian/Ubuntu.
Uwe Brauer
Hey…
I am a student interested in doing this GSoC project. I did some research into wdiff and found the underlying diff program, and then the algorithm it is based on: the Longest Common Subsequence algorithm. I would really love to work on this, so please try to *connect* me.
thnx….
Hi Akshay
The first place to look is the Go-OO Summer of Code Page:
http://freedesktop.org/wiki/Software/ooo-build/SummerOfCode/2009
I will see if I can dig up some other info for you.
Regards
Brendan
(X)Emacs has the ediff utility, which, at least for files that are not too big, allows
word-wise comparison and navigation between the differences.