Posts Tagged 'oooannoyances'

Linux .doc to Text Conversions Inadequate


I have been looking at converting .doc and html files to text on Linux. While the available converters are good enough for indexing purposes, they aren’t that great for actually presenting the output.

update (16/6/11): LibreOffice seems to do numbering export correctly – at least after a quick look.  I need to check in detail.

Column Width

The engines which perform pretty well, like wvText and lynx (via html), have the annoying ‘feature’ of formatting to a column width. This makes it very hard to tell which line breaks are actual paragraph breaks and which are just the formatter wrapping lines. One way around this is to convert with wvHtml and then pass the result through html2text -width 0 (although results are a little haphazard at times). Lynx maxes out at a width of about 990 characters, but there are plenty of paragraphs around which exceed that length. … I take it back about html2text -width 0, which seems to fail more often than it succeeds. However, html2text does seem to honour large width arguments (unlike lynx, which has a magic number limiting the width): -width 20000 seems to work (although if there is a line in the document, it inserts 20000 ‘=’ characters). w3m also looks promising.
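One way to live with the 20000 ‘=’ characters is to post-process the converted text. The following is only a minimal sketch: the function name collapse_rules, the set of rule characters and the ten-character threshold are all assumptions made for illustration, not anything html2text itself provides.

```python
import re

# A "rule" line is assumed to be one character (=, -, _ or *) repeated
# at least ten times and nothing else. This threshold is arbitrary.
RULE = re.compile(r"^([=\-_*])\1{9,}$")


def collapse_rules(text, width=10):
    """Shorten very long rule lines to `width` characters.

    Every other line is passed through unchanged, so one output line
    still corresponds to one paragraph of the source document.
    """
    out = []
    for line in text.splitlines():
        match = RULE.match(line.strip())
        out.append(match.group(1) * width if match else line)
    return "\n".join(out)
```

It would be run over the text produced with the large width argument, before the text is indexed or displayed.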

Numbering

It is possible to automate conversion with OpenOffice (and this seems to be less fragile with version 3 than with earlier versions). However, version 3 can’t import Word outline numbering correctly. It drops the first numbering level, marking it as a heading style (see this, this and this among others). This strikes me as pretty astounding: they have had 10 years to work this stuff out, they have been putting heaps of effort into braindead things like commenting (and, really, whoever introduced the commenting anti-feature in MS Word should be broken on the wheel – maybe not, but only just), and the fix is probably something simple (like setting the right numbering level in the xml style).

OpenOffice 3.1 and earlier also have the problem that they run numbering together with the text without a separator (eg you get something like “1.1This is second level numbering” rather than “1.1 This is second level numbering” or “1.1\tThis is second level numbering”). This, by the way, means that your indexes will be wrong for the first word in any numbered paragraph. This problem has been fixed in 3.2. You might think it would be easy to regex after export to insert a tab or something after leading numbering, until you realise that structures like A, A1, a., (a), 1A, 1a, I and IV are all valid numbering structures, so you would fail on something like “1A A First level number”.
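To make the regex problem concrete, here is a small sketch of the naive approach. The pattern and the function name insert_separator are illustrative assumptions. It handles purely numeric labels, but once the label “1A” is fused with text beginning “A First…” (giving “1AA First level number”) there is nothing in the string to say where the label ends.

```python
import re

# Leading dotted-decimal label (1, 1.1, 10.2.3 ...) that is immediately
# followed by text, with no space, digit or dot in between.
NUMERIC_LABEL = re.compile(r"^(\d+(?:\.\d+)*\.?)(?=[^\s\d.])")


def insert_separator(line):
    """Insert a tab after a leading numeric label that is fused to the text."""
    return NUMERIC_LABEL.sub(lambda m: m.group(1) + "\t", line, count=1)


# Works for purely numeric outline numbering:
print(repr(insert_separator("1.1This is second level numbering")))
# Splits in the wrong place when the label is really "1A":
print(repr(insert_separator("1AA First level number")))
```

Widening the pattern to accept letters does not help, because it would then swallow the first letters of ordinary text as well. Without knowing the numbering scheme of the particular document, the split point is a guess.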

wvText and Abiword don’t get outline numbering right. They do recognise outline numbering, but don’t preserve the numbering symbols (eg numbering running A, B, C will become 1, 2, 3) or decorators (eg “(a)” becomes “a.”). wvText also outputs numbering on a line by itself (probably because the import is going via html?). So “1.1 This is second level numbering” will become something like:

"1.

    1.

        This is second level numbering"

Perhaps this could be scripted back into some sort of sanity, but numbering like A, 1A will still be a problem.
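A rough sketch of what such a script might look like follows. It assumes the output has exactly the shape shown above (each level’s number on its own line as digits followed by a full stop, with blank lines in between), and the function name rejoin_numbering is made up for the example. As noted, it can do nothing about numbering symbols that the converter has already thrown away.

```python
import re

# A line holding nothing but a number and a full stop, eg "    1."
LABEL_ONLY = re.compile(r"^\s*(\d+)\.\s*$")


def rejoin_numbering(text):
    """Fold label-only lines into the paragraph that follows them."""
    out = []
    pending = []  # numbering levels seen since the last text line
    for line in text.splitlines():
        match = LABEL_ONLY.match(line)
        if match:
            pending.append(match.group(1))
        elif not line.strip():
            # Blank lines between a label and its text are dropped;
            # other blank lines are kept as they are.
            if not pending:
                out.append(line)
        elif pending:
            out.append(".".join(pending) + " " + line.strip())
            pending = []
        else:
            out.append(line)
    if pending:  # labels at the very end with no text after them
        out.append(".".join(pending))
    return "\n".join(out)
```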

IBM’s Lotus Symphony does import numbering correctly (hurrah!), including decorators on numbering (eg “(a)” and “1.1”). However, being based on an old version of OpenOffice, it suffers from the export problem of running the numbering together with the text <sigh>. Maybe .doc -> Symphony -> html -> text might work?

Moreover, there seems to be no obvious way to get the pyUno bridge working for Symphony in the way that you can for OpenOffice. How, for example, do you get Symphony to run headless/listen on a port?
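For comparison, with OpenOffice itself the usual arrangement is to start soffice with an -accept option and then connect to that socket from pyUno. The sketch below only builds the command line and the matching UNO URL. The helper names are hypothetical, port 2002 is just a conventional choice, and whether Symphony’s soffice.bin honours the same options is exactly the open question.

```python
def soffice_listen_command(binary="soffice", host="localhost", port=2002):
    """Argument list for starting OpenOffice 3 headless, listening on a socket.

    Passing a list to subprocess avoids shell quoting problems with the
    semicolons in the -accept string.
    """
    accept = "socket,host=%s,port=%d;urp;StarOffice.ServiceManager" % (host, port)
    return [binary, "-headless", "-nologo", "-norestore",
            "-nofirststartwizard", "-accept=" + accept]


def uno_url(host="localhost", port=2002):
    """URL a pyUno client would pass to the UnoUrlResolver's resolve()."""
    return "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext" % (host, port)


print(" ".join(soffice_listen_command()))
print(uno_url())
```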

Yet to be tried are go-oo (which apparently conflicts with openoffice?) and oxygen office….

Note to self:

/opt/ibm/lotus/Symphony/framework/shared/eclipse/plugins/com.ibm.symphony.brand.linux_2.0.0.20100823-2000/program/soffice.bin -nologo -nodefault -norestore -nocrashreport -nofirststartwizard -symphony "-accept=pipe,name=_214862394_Office;urp;StarOffice.ServiceManager"

Real World Test – Wdiff Best

Wdiff Compare – Best

Brendan Scott, July MMIX

[update Dec 2010: click here for my attempt at an online comparison service.]

(See also: docdiff post, wdiff post, OOo post)

I have earlier commented on the efficacy (or lack of it) of OOo’s compare function.  In my earlier posts I reviewed some of the comparison solutions which are available and are also free software.  One of the most promising of those was wdiff.  I did an initial review on some sample text but have not, to date, made any post on any real world use of these solutions.  This is that post.

I had four text files of comparatively large (about 10-12,000 words – or about 30 pages – each) agreements which I needed to compare.  This comparison was ‘live’ in the sense that the comparison was needed in the course of my review of these documents for a client.  Unfortunately, this also means that I’m not at liberty to post comparison examples since the documents are confidential (the comparisons in the earlier posts were run on a sample clause that I created).

Running Word compare on one pair of agreements  was a disaster.  Word seems to get to a certain point in the documents and then fall over dramatically (something I have witnessed in other situations).  For example, minor changes in a paragraph would make the whole of the balance of the document (or a large portion of it) marked as a change.  In order to get a useable comparison with Word I needed to run a comparison, then extract the section on which it failed, then run compare on those sections and so on iteratively. In fact, in the end I needed to manually extract most of the clauses either individually or with a small collection of clauses and (manually) run a comparison because Word repeatedly failed even on the shortened extracts.  This ran to about 20 or 30 separate extractions/comparisons using Word and was, frankly, pretty painful.

On a different pair of contracts I used wdiff. Wdiff had similar problems to Word in that it got to a certain point at which it began marking the whole balance of the document as a change. Wdiff’s raw performance was superior to Word’s in this respect, although the initial output, without any manual intervention, was still not practically useable. I could have used a similar strategy to that for Word of systematically extracting failed sections and re-comparing them. Wdiff, however, had a significant advantage over Word in that it could be automated pretty easily (I used Python). I wrote some scripts to compare specific clauses: that is, compare clause x of agreement 1 against clause y of agreement 2, output the results in html markup (see note 1) and join them together. The results were fantastic. With clause by clause comparison wdiff didn’t get confused and produced a very good markup of the whole of each of the documents. The downside was that it was not fully automated, but the results justified the comparatively small amount of manual input I needed to make.

Indeed, the automation allowed me to do some interesting things like compare different clauses within the same document (eg compare clause 1.1 against 1.2, 1.3, 1.4 and 1.5) to see how they were different. OpenOffice doesn’t properly render the html output though (it doesn’t show strikethrough), so Firefox needs to be used for display.
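The original scripts are not reproduced here, so the following is only a sketch of the general approach under stated assumptions: the clauses have already been extracted and paired up by hand, wdiff is on the PATH, and the names run_wdiff and compare_clauses are invented for the example.

```python
import os
import subprocess
import tempfile
from html import escape

# wdiff switches that replace the default [-...-] {+...+} markers.
WDIFF_HTML_FLAGS = [
    "--start-delete=<del>", "--end-delete=</del>",
    "--start-insert=<ins>", "--end-insert=</ins>",
]


def run_wdiff(old_text, new_text):
    """Compare two strings with wdiff via temporary files."""
    paths = []
    try:
        for text in (old_text, new_text):
            with tempfile.NamedTemporaryFile(
                    "w", suffix=".txt", delete=False, encoding="utf-8") as fh:
                fh.write(text)
                paths.append(fh.name)
        result = subprocess.run(["wdiff", *WDIFF_HTML_FLAGS, *paths],
                                capture_output=True, text=True)
        # wdiff exits with 1 when the files differ; 2 signals real trouble.
        if result.returncode > 1:
            raise RuntimeError(result.stderr)
        return result.stdout
    finally:
        for path in paths:
            os.unlink(path)


def compare_clauses(pairs, differ=run_wdiff):
    """Build one html page from (label, old_clause, new_clause) tuples."""
    sections = []
    for label, old, new in pairs:
        # Escape first so stray < or & in a clause cannot break the page.
        marked = differ(escape(old, quote=False), escape(new, quote=False))
        sections.append("<h3>%s</h3>\n<p>%s</p>" % (escape(label), marked))
    return "<html><body>\n" + "\n".join(sections) + "\n</body></html>"
```

Comparing clauses within a single document is the same call with both texts taken from that document, for example one tuple each for clause 1.1 against 1.2, 1.3 and so on.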

On that basis I’m revising my earlier rankings as follows:

wdiff – 8.5 (revised up because it is easy to script and this results in unexpectedly useful benefits)

Word – 7 (revised down because the markup goes feral repeatedly on the long documents I compared)

I did not revisit Docdiff, although I suspect it could also be automated in a similar fashion.

Learning to H^HDislike OOo-3

Apparently, not only does OOo-3 break python, it also dis-innovates with presentation layouts.  One of the advantages that OOo used to have was to allow you to print handouts on an otherwise blank page.  This allowed plenty of room for taking notes, especially if one moved the slides a little away from their starting points.  OOo-3 apparently “improves” on this by mimicking MS Office’s stupid note taking lines.  As far as I can tell they are surgically implanted, unable to be removed.  So now my handouts are useless to me because the lines are the wrong size for me to write on and take up all the note taking space.

So this is now the second time I am rolling back from OOo-3.  Each time my experience of it has survived less than a day.

Docdiff compare – Good! + Roundup


(see also: Real World Test)

Docdiff compare of sample clauses

This is part 3 in my (unfortunately, apparently neverending) search for an adequate markup facility that doesn’t rely on MS Word.  The earlier two installments related to OOo and wdiff.

Uwe (thanks Uwe!) posted a comment to my Wdiff post suggesting I try docdiff.  Docdiff is a ruby script which has a number of output formatting options (and looks prettier than wdiff).  Running docdiff on the sample files gives:

1.1Contractor 1.1[Vendor] may invoice [Customer] Customer the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer for ongoing fees quarterly in arrears advance and other fees monthly in arrears. Contractor must [Vendor] will provide [Customer] Customer with a tax invoice in respect of all GST charged.  [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice. charged.

The cut and paste here has removed the colored highlighting from each add/remove (see screenshot above for a general indication). I could do without the different colors for different types of addition/deletion. This output looks very similar to the markup provided by wdiff, including the marking of “[Customer]” being replaced by “Customer”. It has also caught “charged” as a deletion/addition when it shouldn’t have; wdiff didn’t make this mistake. However, docdiff has another trick up its sleeve: it can be a character based engine with the use of the --char switch. With this on, the output is:

1.1Co[Ventractdor] may invoice [Customer] the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer for ongoing fees quarterly in arredvarsnce and other fees monthly in arrears. Co[Ventractdor] mustwill provide [Customer] with a tax invoice in respect of all GST charged. [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.

The markup is more minimal – and it hasn’t incorrectly caught “charged”.  However, it’s come up with a strange result I’ve never seen in a mark up before – a letter by letter change of Contractor to Vendor.  It’s a bit unnerving to see something like  Co[Ventractdor] appear in a markup.  Conceptually, you probably do want the whole word marked as a change – to indicate you’ve stopped calling one of the parties one thing, and are now calling it another.
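The difference between word and character resolution can be reproduced with Python’s difflib. This is not docdiff’s algorithm, just a stand-in that shows the same effect: at character level the engine is free to match individual letters that two unrelated words happen to share (such as the “or” at the end of both “Contractor” and “Vendor”), which is what produces the interleaved result. The function names and the wdiff-style markers are choices made for this example.

```python
import difflib


def mark_changes(old_tokens, new_tokens, sep):
    """Render a token-level diff using wdiff-style [-...-] {+...+} markers."""
    out = []
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens,
                                      autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(sep.join(old_tokens[i1:i2]))
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + sep.join(old_tokens[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + sep.join(new_tokens[j1:j2]) + "+}")
    return sep.join(out)


def word_diff(old, new):
    """Whole words are the smallest unit that can change."""
    return mark_changes(old.split(), new.split(), " ")


def char_diff(old, new):
    """Single characters are the smallest unit that can change."""
    return mark_changes(list(old), list(new), "")


print(word_diff("Contractor must provide", "[Vendor] will provide"))
print(char_diff("Contractor must provide", "[Vendor] will provide"))
```

The word-level call marks the party name as one replaced block, which is the conceptually cleaner result for a change of defined term.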

The Wash Up So Far

Docdiff provides very readable output, albeit subject to some inaccuracy at the word level of resolution.  I suspect that it would not be too hard to coax wdiff to produce similar output (using the --start-delete=STRING etc switches).[1]  That said, docdiff also provides a variety of different output formats and the ability to use the (slightly weird) --char option.   It’s hard to say which is better.

In an update to my wdiff post, I also tested meld – which performed poorly, albeit better than OOo (sad, but true).  A test of meld is unfair though as this is outside its intended domain of application.  Everybody loves a ranking, so my current rankings (out of 10) of them are:

Word 8.5 (markup known to go feral at times)

docdiff 7.5 (idiosyncrasies in the operation of the engine are offset by its output and flexibility – revised down a little based on some more testing of the engine)

wdiff 7.5 (I am assuming I can script something relatively easily to produce more visually appealing markup)

meld 3

OOo 2

diff 2

This is all subject to the caveat that I haven’t tried these tools (other than Word) on long documents.  I may find that they go feral too for long files.  The take away point though is that there are free software solutions which give acceptable markup performance, at least on .txt files.  I have never had the need to compare formatting, so a .txt limitation is no great shakes for me.

I have had a search in YAST for diff.  It’s hard to tell, but the things it throws up seem to be line based (like diff).  If anything looks interesting there I will investigate further.

Finally, improving compare performance has been added as a potential project for Go-OO in Google’s Summer of Code.  If you’re interested, here’s your chance to make your mark! (up… errr …. so to speak)

Notes:

[1]  The answer is:

wdiff --start-delete=\<del\> --end-delete=\<\/del\> --start-insert=\<ins\> --end-insert=\<\/ins\>

or, prettier:

wdiff --start-delete=\<del\ style=\"color:red\"\> --end-delete=\<\/del\> --start-insert=\<ins\ style=\"color:blue\"\> --end-insert=\<\/ins\>

Although it needs more smarts for multi-line text, because browsers need <p> tags to show paragraph breaks.
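A minimal sketch of those extra smarts, assuming paragraphs in the wdiff output are separated by blank lines. The function name wrap_paragraphs is made up for the example. It does not handle a single <del> or <ins> that spans a paragraph break, which would need the tag closed and reopened around the break.

```python
import re


def wrap_paragraphs(marked_up):
    """Wrap each blank-line-separated block of wdiff output in <p> tags."""
    paragraphs = re.split(r"\n\s*\n", marked_up.strip())
    return "\n".join(
        # Line breaks inside a paragraph are collapsed to single spaces.
        "<p>%s</p>" % " ".join(paragraph.split())
        for paragraph in paragraphs
        if paragraph.strip()
    )
```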

Wdiff compare – Good!


[update Dec 2010: click here for my attempt at an online comparison service.]

(see also: Real World Test, Docdiff post, OOo compare post)

I previously commented (adversely) on the performance of OpenOffice.org’s compare function.   Tridge suggested that I might try wdiff – and I’m pleasantly surprised.  When I ran wdiff on the two text files I used in that post it gave me this:

[-1.1Contractor-]{+1.1[Vendor]+} may invoice [-[Customer]-] {+Customer the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer+} for ongoing fees quarterly in [-arrears-] {+advance+} and other fees [-monthly-] in arrears.  [-Contractor must-]  {+[Vendor] will+} provide [-[Customer]-] {+Customer+} with a tax invoice in respect of all GST charged.  [-[Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.-]

Which, if you can decipher the inline formatting instructions, is a better performance than OOo (not that that would be difficult mind…). In fact, if I’m reading it correctly, it seems to be only slightly worse than the output of Word’s compare (it marks changes within “words” as a change of a word, so the [-[Customer]-] {+Customer+} is not a minimal change set; it should be [-[-] and [-]-]). I will try to find some time to test it on longer clauses or whole agreements. A bonus is that its output format is cut-and-paste compatible, albeit a little visually unappealing. It shouldn’t be too much trouble to pipe it through something to pretty it up (the -p option apparently prints overstrikes, but this doesn’t work on my tty – ah, -t is what I need for more visual output on a tty). A second bonus is that -s gives change statistics (I don’t recall ever seeing this in Word).
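As an illustration of piping it through something, here is a small sketch that turns wdiff output which has already been captured in the default format into html. The function name wdiff_to_html is an assumption, and the approach is only as reliable as the markers are unambiguous: source text that itself contains “[-” or “-]” will confuse it.

```python
import re
from html import escape

# wdiff's default markers: [-deleted-] and {+inserted+}
DELETED = re.compile(r"\[-(.*?)-\]", re.DOTALL)
INSERTED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def wdiff_to_html(wdiff_output):
    """Convert default wdiff markers to <del> and <ins> tags."""
    # Escape first so literal <, > and & in the text survive as text.
    text = escape(wdiff_output, quote=False)
    text = DELETED.sub(r"<del>\1</del>", text)
    text = INSERTED.sub(r"<ins>\1</ins>", text)
    return text


print(wdiff_to_html("provide [-[Customer]-] {+Customer+} with a tax invoice"))
```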

The other nice thing about wdiff is that it was already installed on my machine!  About two (maybe three??) years ago I googled around for something that would do diffs on text files, but the searches only showed diff, kdiff etc.  Wdiff apparently dates back to the early 90s so my search terms maybe weren’t the best – or too many other people are using and linking to diff!  Anyway, wdiff seems to be part of the standard install for OpenSuSE and I’d had it all along.

And thanks Tridge!

Update:

Tridge says:

There are probably quite a few other similar utilities. Minimal diff
algorithms are a nice little computer science task. I'm sure heaps of
people have written variants. I expect quite a few CS courses use this
sort of thing as an exercise for 2nd year programming students...

Improving the algorithms in OOo would be a nice Google summer of code
project.

Which I think would be a great idea and a valuable addition to OOo.  If anyone’s game to try (either as a student or a supervisor) please drop me an email or lodge a comment and I’ll try to connect you.

Update 2:

Following someone’s suggestion, I have tried meld – am impressed by the interface, but not by the markup.  Run on the same two sample text files it marked:

with a tax invoice in respect of all GST charged.

as unchanged, with the balance as deletion/addition.  Sadly, this is a better performance than OOo (but is still inadequate).

OOo Compare: Inadequate


Update 4: see also my wdiff post and (update 5) my docdiff post and (update 6) my Real World Test post.

OpenOffice.org’s compare function has, historically, performed very poorly for me.  Being able to more or less accurately compare two documents for changes is an essential function for any law practice which does any sort of transactional work.   Without it, OpenOffice will never find a place in legal firms.  Moreover, this functionality is of use to anyone who wants to have changes between versions readily identifiable without the need to have track changes on all of the time.

OpenOffice’s compare performance is inadequate.  By way of a test I took two similar clauses and saved them to separate text files.  I then used OpenOffice’s compare function to show the changes.  This was the output:

Comparison of clause by OpenOffice.org showing markup


Despite there being some commonality between the two clauses – eg “may invoice” are the second and third words of both clauses, OpenOffice simply marked the whole thing as a change.  The clauses are each less than 60 words – a markup should not be difficult.  The output is unhelpful and this sort of output from compare is not unusual for OpenOffice.  It should come as no surprise that things do not improve with longer documents.

By way of example, Word’s compare function gave this output:


Word's compare function operating on the same two sample clauses

This markup shows pretty much an accurate record of the changes which were made.   Text which has not changed is shown as unchanged.   Changed text is shown as a change.  This is not to say that Word does not go markup haywire from time to time, but its track record is vastly superior to OOo’s in my experience.  That said, OOo’s mark up has been so poor for me that the number of times I have experienced it is not that great.

Exactly who uses OOo’s compare function?  What is its intended domain of application?

Update: Richard [thanks Richard] has posted a comment linking to an issue dating to 2005 in the OOo bug tracker.  If you think this should be improved please go vote for the issue.

Update 2: Oh, and based on my experience – any compare based on diff will also be inadequate.  Don’t even consider it.  FWIW I ran them through diff (the output is at least more reader friendly):

< 1.1[Vendor] may invoice Customer the Fees for each Service in accordance with the Payment Terms and, where no relevant time is set out in the Payment Terms, [Vendor] may invoice the Customer for ongoing fees quarterly in advance and other fees in arrears.  [Vendor] will provide Customer with a tax invoice in respect of all GST charged.
---
> 1.1Contractor may invoice [Customer] for ongoing fees quarterly in arrears and other fees monthly in arrears.  Contractor must provide [Customer] with a tax invoice in respect of all GST charged.  [Customer] is not liable to pay any amount in respect of GST except as set out in a valid tax invoice.
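The reason diff does so badly is that it compares whole lines, and in a text export each clause is a single line, so any change inside a clause marks the entire clause as replaced. The sketch below shows the same behaviour with Python’s difflib, used here as a stand-in for diff. The sample sentences are shortened from the clauses above and the function name line_diff is an assumption.

```python
import difflib


def line_diff(old, new):
    """Line-based comparison, the same granularity as diff."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                     lineterm="", n=0))


old = "Contractor must provide [Customer] with a tax invoice in respect of all GST charged."
new = "[Vendor] will provide Customer with a tax invoice in respect of all GST charged."

# Most of the sentence is unchanged, yet the whole line is reported
# as removed and then added again.
for line in line_diff(old, new):
    print(line)
```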

Update 3: One of the comments on a linking site points out this code (which I haven’t tried): http://www.plagiarism.phys.virginia.edu/copyfind.cpp

Update 4: Wdiff seems to be very promising.

Notes:

OOo compare results were the same for both versions of OOo I tried (2.3.something and 3.0.1).

