Docdiff compare – Good! + Roundup


(see also: Real World Test)

[Screenshot: Docdiff compare of sample clauses]

This is part 3 in my (unfortunately, apparently neverending) search for an adequate markup facility that doesn’t rely on MS Word.  The earlier two installments related to OOo and wdiff.

Uwe (thanks Uwe!) posted a comment to my Wdiff post suggesting I try docdiff.  Docdiff is a Ruby script which has a number of output formatting options (and looks prettier than wdiff).  Running docdiff on the sample files gives:

1.1Contractor 1.1[Vendor] may invoice [Customer] Customer the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer for ongoing fees quarterly in arrears advance and other fees monthly in arrears. Contractor must [Vendor] will provide [Customer] Customer with a tax invoice in respect of all GST charged.  [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice. charged.

The cut and paste here has removed the colored highlighting from each add/remove – see the screenshot above for a general indication – and I could do without the different colors for different types of addition/deletion.  This output looks very similar to the markup provided by wdiff – including the marking of “[Customer]” being replaced by “Customer”.  It has also caught “charged” as a deletion/addition when it shouldn’t have – wdiff didn’t make this mistake.  However, docdiff has another trick up its sleeve – it can work as a character-based engine with the use of the --char switch.  With this on the output is:

1.1Co[Ventractdor] may invoice [Customer] the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer for ongoing fees quarterly in arredvarsnce and other fees monthly in arrears. Co[Ventractdor] mustwill provide [Customer] with a tax invoice in respect of all GST charged. [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.

The markup is more minimal – and it hasn’t incorrectly caught “charged”.  However, it’s come up with a strange result I’ve never seen in a markup before – a letter-by-letter change of Contractor to [Vendor].  It’s a bit unnerving to see something like Co[Ventractdor] appear in a markup.  Conceptually, you probably do want the whole word marked as a change – to indicate you’ve stopped calling one of the parties one thing, and are now calling it another.
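To make the difference between the two resolutions concrete, here is a minimal sketch in Python.  It is not docdiff’s algorithm: it uses the standard library’s difflib as a stand-in, and the function name inline_diff and the [-…-] / {+…+} markers are just choices for this illustration.  It only shows why comparing by word marks a renamed party as one whole-word change, while comparing by character shares any letters the two names happen to have in common.

```python
import difflib
import re


def inline_diff(old, new, resolution="word"):
    """Mark up old -> new inline, using [-deleted-] and {+inserted+}.

    Illustration only: difflib stands in for the real diff engine.
    """
    if resolution == "word":
        # Keep runs of whitespace as their own tokens so the text rejoins exactly.
        a = re.findall(r"\s+|\S+", old)
        b = re.findall(r"\s+|\S+", new)
    else:
        # Character resolution: every character is a token.
        a, b = list(old), list(new)

    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i2 > i1:  # tokens present only in the old text
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j2 > j1:  # tokens present only in the new text
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


old = "Contractor must provide Customer with a tax invoice."
new = "[Vendor] will provide Customer with a tax invoice."
print(inline_diff(old, new, "word"))
print(inline_diff(old, new, "char"))
```

At word resolution the party name is replaced as a unit; at character resolution the letters that Contractor and [Vendor] share are kept as common text, which is where the interleaved look comes from.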

The Wash Up So Far

Docdiff provides very readable output, albeit subject to some inaccuracy at the word level of resolution.  I suspect that it would not be too hard to coax wdiff to produce similar output (using the --start-delete=STRING etc. switches).[1]  That said, docdiff also provides a variety of different output formats and the ability to use the (slightly weird) --char option.  It’s hard to say which is better.

In an update to my wdiff post, I also tested meld – which performed poorly, albeit better than OOo (sad, but true).  A test of meld is unfair though as this is outside its intended domain of application.  Everybody loves a ranking, so my current rankings (out of 10) of them are:

Word 8.5 (markup known to go feral at times)

docdiff 7.5 (idiosyncrasies in the operation of the engine are offset by its output and flexibility – revised down a little based on some more testing of the engine)

wdiff 7.5 (I am assuming I can script something relatively easily to produce more visually appealing markup)

meld 3

OOo 2

diff 2

This is all subject to the caveat that I haven’t tried these tools (other than Word) on long documents.  I may find that they go feral too for long files.  The take away point though is that there are free software solutions which give acceptable markup performance, at least on .txt files.  I have never had the need to compare formatting, so a .txt limitation is no great shakes for me.

I have had a search in YaST for diff.  It’s hard to tell, but the things it throws up seem to be line based (like diff).  If anything looks interesting there I will investigate further.

Finally, improving compare performance has been added as a potential project for Go-OO in Google’s Summer of Code.  If you’re interested, here’s your chance to make your mark! (up… errr …. so to speak)

Notes:

[1]  The answer is:

wdiff --start-delete=\<del\> --end-delete=\<\/del\> --start-insert=\<ins\> --end-insert=\<\/ins\>

or, prettier:

wdiff --start-delete=\<del\ style=\"color:red\"\> --end-delete=\<\/del\> --start-insert=\<ins\ style=\"color:blue\"\> --end-insert=\<\/ins\>

Although it needs more smarts for multi-line text, because browsers need <p> tags to keep the paragraph breaks.
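One way to add those smarts is to post-process wdiff’s output rather than ask wdiff to emit HTML itself.  The sketch below is only an illustration under stated assumptions: it assumes wdiff was run with its default markers ([-deleted-] and {+inserted+}), that paragraphs are separated by blank lines, and that no single change spans a paragraph break.  The function name wdiff_to_html is made up for this example.

```python
import html
import re


def wdiff_to_html(marked):
    """Convert text marked with wdiff's default markers into HTML paragraphs.

    Assumes blank-line-separated paragraphs and changes that do not
    cross a paragraph break (a change that does would need extra handling).
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", marked.strip()):
        # Escape &, < and > in the document text before adding our own tags.
        text = html.escape(block, quote=False)
        text = re.sub(r"\[-(.*?)-\]",
                      r'<del style="color:red">\1</del>', text, flags=re.S)
        text = re.sub(r"\{\+(.*?)\+\}",
                      r'<ins style="color:blue">\1</ins>', text, flags=re.S)
        paragraphs.append("<p>" + text + "</p>")
    return "\n".join(paragraphs)


sample = "Fees are paid in [-arrears-] {+advance+}.\n\nGST & other taxes apply."
print(wdiff_to_html(sample))
```

Escaping first and tagging second means any angle brackets or ampersands in the contract text itself cannot be mistaken for markup.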

15 Responses to “Docdiff compare – Good! + Roundup”


  1. Uwe Brauer 25 March 2009 at 11:23 pm

    Hi Brendan,

    Since you are willing to try a great variety of (markup) formats, I cannot resist recommending latexdiff to you.

    You should wrap the text(s) you want to compare
    with some minimal latex syntax

    (BTW OpenOffice already has a reasonable LaTeX output filter; LyX would be another option)

    like
    \documentclass{article}
    \begin{document}
    this is the text I want to compare.
    \end{document}

    and the same for the second text,
    then run latexdiff (a Perl script)
    on both documents; the result is a third document

    which must be post processed:

    execute latex or, better, pdflatex on it
    to produce a PDF document in which the differences are highlighted nicely.

    latexdiff can use different diff algorithms, some implemented in Perl, or the standard diff.

    BTW tex4ht includes an option to convert the latex file into odf format, I have not tested it for these diff files.

    Uwe

  2. CasTex 26 March 2009 at 6:00 am

    Thanks for this post, I am interested in.

  3. Uwe Brauer 26 March 2009 at 8:27 pm

    Hi

    since the problem of how to compare different versions
    of the same document is important to me I did various tests in the past months and that is why I have one more recommendation, which is, I must admit, the most
    “hackerish” one:

    you will need the (X)emacs editor for this, to be precise I have only tested it with Xemacs and I am not sure whether this functionality is installed in GNU emacs. But on Debian/Ubuntu you just have to install Xemacs, which ships this diff utility, called ediff.

    So what you have to do is to open both files you want to compare and then you do:

    M-x
    ediff-regions-wordwise
    enter

    M-x stands for pressing the alt or meta key and then the x, a prompt appears and you type the above command.

    You will then be asked which buffers (that means “basically” files which are open) you want to compare:
    You have to select with the mouse the text you want to compare and off you go.

    Wordwise differences are nicely highlighted, but *best* of all you can now even _merge_ the differences.

    This is really the only application I am aware of, which allows you to merge word-wise differences. Meld only offers to merge line-wise differences.

    I can send you a screenshot if you want.

  4. brendanscott 26 March 2009 at 9:46 pm

    Uwe,

    It’s been over 15 years since I last used \latex and it was about the same time I stopped using emacs. Don’t make me go back now!

    Merging differences is not a function I’ve ever had a need for.

    Cheers

    Brendan

  5. Uwe Brauer 26 March 2009 at 10:03 pm

    Well, I think a picture says more than 1000 words so I could provide you with 2 screenshots, but on the other hand if you abandoned the ONE TRUE editor…..

    With respect to merging, lucky you; I need that feature
    very much.

    Uwe

  6. brendanscott 26 March 2009 at 10:12 pm

    I thought emacs was way cool when I used it, but if you don’t use it for a while then go back to it, it is just a mass of complexity, you need to relearn everything each time. In vi at least I can remember i and :w :p (I mostly use Kate) for text editing.

    Another thanks – I installed OOo3 a couple of weeks ago, then uninstalled it because it interfered with PyUNO:

    OOo-Python Integration Annoyances

    You’ve tweaked my mind to note I’ve now got compare automated using wdiff (more in another post?) and given OOo’s awful compare performance there is no point automating OOo3 – so I can go ahead and reinstall it.

  7. Aldi 28 March 2009 at 9:43 am

    In the OpenOffice.org wiki, there is a wiki page collecting some information and ideas regarding the features ‘compare” and “track changes” of OpenOffice.org:

    http://wiki.services.openoffice.org/wiki/Track_changes

  8. brendanscott 30 March 2009 at 10:09 pm

    I’ve just had a close encounter with why I don’t use emacs anymore.

    M-x ediff-regions-wordwise
    gets me to select the buffers and then mark the regions. After marking the regions I’m supposed to C-M-c
    Lord only knows what that means.
    The FAQ (which really, should I have to refer to in order to get this command working?) says

    * `M-C-x’: press the x key while holding down both Control and Meta

    * `C-M-x’: a synonym for the above

    I’m assuming my meta is the escape key…

    When I try to do this I throw the KDE System Guard process table up (in order to kill processes – emacs might be a good start?)

  9. Uwe Brauer 30 March 2009 at 11:59 pm

    oops!

    Let’s analyze this

    you typed:

    M-x ediff-regions-wordwise

    Right there are two ways of doing this.

    the old way: (which I don’t use) and that is you press
    FIRST the ESCAPE key and THEN x and THEN you type the command

    the new way: you press the META key (on a standard keyboard this key
    is most of the time called ALT and is to the left of the space bar) and
    WHILE you keep this key pressed you press x, and THEN you type the
    command.

    once you have done this successfully you know where your META key is.

    Now you should type

    C-M-c

    that means:
    press at the SAME time,
    META (which you have just located),
    CONTROL and the key
    c (that is letter c)

    I will send you soon a screen shot.

    Uwe

  10. Arnoud Engelfriet 30 April 2009 at 7:01 pm

    Brendan, have you looked at http://extensions.services.openoffice.org/project/DeltaXMLODTCompare ? According to the comments, it works very well. I have to play with it a bit more but it does seem promising.

  11. Tomer Shalit 9 November 2009 at 7:24 am

    [Note, this comment is advertising, but it is for a web based document comparison, and is therefore related, so I haven’t binned it. As it is a SaaS, it falls into that gray area for freedom. In the absence of an available free implementation, I urge caution].

    Brendan,

    Another solution that has just been launched: a free web-service http://www.CompareMyDocs.com does a multiversion document comparison and provides a GUI to recombine and merge the versions. The site is a showcase of the TextFlow API http://www.texflow.com

    Documents lose inserted images and tables in the beta, but it does a good job with text.



  1. OOo Compare: Inadequate « Brendan Scott’s Weblog Trackback on 26 March 2009 at 9:52 pm
  2. Real World Test – Wdiff Best « Brendan Scott’s Weblog Trackback on 15 July 2009 at 4:03 pm
  3. Wdiff compare – Good! « Brendan Scott’s Weblog Trackback on 15 July 2009 at 4:07 pm
