Linux .doc to Text Conversions Inadequate



I have been having a look at converting .doc and html files to text on Linux. While the available converters are good enough for indexing purposes, they aren’t that great for actually presenting the output.

Update (16/6/11): LibreOffice seems to export numbering correctly, at least after a quick look. I still need to check in detail.

Column Width

The engines which perform pretty well, like wvText and lynx (via html), have the annoying ‘feature’ of formatting to a column width. This means that it is very hard to predict which line breaks are actually paragraph breaks and which are just the formatter splitting lines. Lynx maxes out at a width of about 990 characters, but there are plenty of paragraphs around which exceed that length. One workaround is to convert with wvHtml and then pass the html through html2text. I first thought html2text -width 0 avoided the width problem, but I take that back: it seems to fail more often than it succeeds. However, html2text does seem to honour large width arguments (unlike lynx, which has a magic number limiting the width). A width of 20000 seems to work, although if there is a horizontal line in the document, html2text inserts 20000 ‘=’ characters. w3m also looks promising.
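As a rough sketch of the html2text approach, the following Python wraps the call with a large width and then trims the over-long ‘=’ rules afterwards. The function names and the 72 character rule length are my own choices for illustration, and it assumes the html2text binary that accepts -width is on the path; treat it as a starting point rather than a tested pipeline.

```python
import re
import subprocess
from typing import List

WIDE = 20000  # the width that html2text appeared to honour


def html2text_command(html_path: str, width: int = WIDE) -> List[str]:
    """Build the html2text command line (hypothetical helper)."""
    return ["html2text", "-width", str(width), html_path]


def shorten_rules(text: str, max_len: int = 72) -> str:
    """Collapse any run of '=' longer than max_len down to max_len."""
    pattern = r"={%d,}" % (max_len + 1)
    return re.sub(pattern, "=" * max_len, text)


def html_to_text(html_path: str) -> str:
    """Run html2text with a wide column and tidy up the rules."""
    result = subprocess.run(
        html2text_command(html_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return shorten_rules(result.stdout)
```

With a wide enough column, each paragraph should come out on one line, so a newline in the output can be read as a real paragraph break. Paragraphs longer than the chosen width would still be wrapped.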

Numbering

It is possible to automate conversion with OpenOffice (and it seems to be less fragile with version 3 than with earlier versions). However, version 3 can’t import Word outline numbering correctly. It drops the first numbering level, marking it as a heading style (this has been reported in several places). They have had 10 years to work this stuff out and have been putting heaps of effort into braindead things like commenting (and, really, whoever introduced the commenting anti-feature in MS Word should be broken on the wheel; maybe not, but only just). The fix is probably something simple too (like setting the right numbering level in the xml style). Given all that, not importing outline numbering correctly strikes me as pretty astounding.

OpenOffice 3.1 and earlier also have the problem that they run numbering together with the text without a separator (eg you get something like “1.1This is second level numbering” rather than “1.1 This is second level numbering” or “1.1\tThis is second level numbering”). This, by the way, means that your indexes will be wrong for the first word in any numbered paragraph. This problem has been fixed in 3.2. You might think it would be easy to regex after export to insert a tab or something after leading numbering, until you realise that structures like A, A1, a., (a), 1A, 1a, I, IV are all valid numbering structures, so you would fail on something like “1A A First level number”.
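To make the regex problem concrete, here is a small Python sketch of the naive fix. The pattern and function name are mine, and the pattern deliberately assumes that numbering is only ever dotted digits, which is exactly the assumption that breaks.

```python
import re

# Naive assumption: a number is digits with optional dotted sub-levels,
# immediately followed by the first character of the paragraph text.
NAIVE_NUMBER = re.compile(r"^(\d+(?:\.\d+)*\.?)(?=[^\s\d.])")


def insert_separator(line: str) -> str:
    """Insert a tab after a leading dotted-digit number that is run
    together with the text. Lines that already have a separator are
    left alone."""
    return NAIVE_NUMBER.sub(lambda m: m.group(1) + "\t", line, count=1)


# Works for plain dotted numbering:
print(repr(insert_separator("1.1This is second level numbering")))

# Fails when the number itself is "1A" and the text starts with "A":
# the split lands after "1", and nothing in the line says otherwise.
print(repr(insert_separator("1AA First level number")))
```

The second call splits after the “1” rather than after the “1A”. Widening the pattern to allow a trailing letter would instead eat the first letter of ordinary paragraphs, so the exported text alone doesn’t carry enough information to choose.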

wvText and Abiword don’t get outline numbering right. They do recognise outline numbering, but don’t preserve the numbering symbols (eg numbering running A, B, C will become 1, 2, 3) or decorators (“(a)” becomes “a.”). wvText also outputs numbering on a line by itself (probably because the import is going via html?). So “1.1 This is second level numbering” will become something like:

"1.

    1.

        This is second level numbering"

Perhaps this could be scripted into some sort of sanity, but numbering like A, 1A will still be a problem.
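A minimal sketch of what such a script might look like in Python follows. It assumes the wvText output really does put each level on its own line as a bare “1.” style number, as in the sample above, and the function name is made up for illustration.

```python
import re
from typing import List

# A line holding nothing but a number and a trailing dot, eg "    1."
NUMBER_ONLY = re.compile(r"^\s*(\d+)\.\s*$")


def merge_number_lines(text: str) -> List[str]:
    """Join runs of number-only lines onto the text line that follows."""
    merged: List[str] = []
    pending: List[str] = []  # numbering levels seen but not yet attached
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank lines between a number and its text are dropped;
            # other blank lines are kept as paragraph breaks.
            if not pending:
                merged.append("")
            continue
        match = NUMBER_ONLY.match(raw)
        if match:
            pending.append(match.group(1))
            continue
        if pending:
            merged.append(".".join(pending) + " " + line)
            pending = []
        else:
            merged.append(line)
    # Numbers with no text after them are emitted unchanged.
    merged.extend(number + "." for number in pending)
    return merged


sample = "1.\n\n    1.\n\n        This is second level numbering"
print(merge_number_lines(sample))
```

This only reassembles whatever digits wvText emitted. It can’t restore symbols or decorators that were already lost, and a short paragraph that genuinely consists of just “1.” would be wrongly merged.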

IBM’s Lotus Symphony does import numbering correctly (hurrah!), including decorators on numbering (eg “(a)” and “1.1”). However, being based on an old version of OpenOffice, it suffers from the export problem of running the numbering together with the text <sigh>.  Maybe .doc -> Symphony -> html -> text might work?

Moreover, there seems to be no obvious way to get the pyUno bridge working for Symphony in the way that you can for OpenOffice. How, for example, do you get Symphony to run headless/listen on a port?
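For comparison, the OpenOffice side of this usually looks something like the sketch below. It assumes an office instance has already been started listening on a socket, with something along the lines of soffice -headless "-accept=socket,host=localhost,port=2002;urp;" (the port number is arbitrary, and newer builds want double dashes). It also assumes a Python with the uno module available, and that “Text” is the name of Writer’s plain text export filter. The function names are mine. I have not tried any of this against Symphony.

```python
from pathlib import Path


def uno_url(host: str = "localhost", port: int = 2002) -> str:
    """Connection string matching the -accept option the office was started with."""
    return f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext"


def convert_to_text(src: str, dst: str, host: str = "localhost", port: int = 2002) -> None:
    """Ask a running, listening office instance to save src as plain text."""
    # Imported lazily: these modules only exist where PyUNO is installed.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Resolve the remote component context over the socket.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(uno_url(host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Load the document hidden, export it, and always close it again.
    doc = desktop.loadComponentFromURL(
        Path(src).resolve().as_uri(), "_blank", 0, (prop("Hidden", True),))
    try:
        doc.storeToURL(Path(dst).resolve().as_uri(), (prop("FilterName", "Text"),))
    finally:
        doc.close(True)
```

The open question for Symphony is the first half: whether its soffice.bin can be made to accept a connection like this at all.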

Yet to be tried are Go-oo (which apparently conflicts with OpenOffice?) and OxygenOffice….

Note to self:

/opt/ibm/lotus/Symphony/framework/shared/eclipse/plugins/com.ibm.symphony.brand.linux_2.0.0.20100823-2000/program/soffice.bin -nologo -nodefault -norestore -nocrashreport -nofirststartwizard -symphony "-accept=pipe,name=_214862394_Office;urp;StarOffice.ServiceManager"


5 Responses to “Linux .doc to Text Conversions Inadequate”


  1. 1 Tony 24 September 2010 at 7:16 am

    Interesting analysis. Unfortunately the same is true for office open xml and open document formats conversion to HTML (maybe not what you want exactly).

    I’ve never been able to find a reliable means to create ‘clean’ html from word-processor documents or PDF. The most practical is manual copy/paste – adding ‘mark-up’ again to plain text. It would be great to be able to choose which formatting to convert (e.g. keep: Headings, Bold, Paragraphs, Bullets – remove Fonts, Colors etc).

    Some promising avenues – Koffice had a great HTML export feature a few years ago. (I have not tried it recently). The python project ‘beautifulsoup’ is also very interesting.

  2. 3 Density 20 October 2010 at 9:35 pm

    I got the best results with antiword. Antiword doesn’t use non-breaking spaces or other funny characters, and decorations for tables are minimal, so you can easily work on the files in shell scripts. After removing the placeholders for pictures and tables, even word counts with wc are correct, something that seems to be impossible with wvText.

    http://en.wikipedia.org/wiki/Antiword

  3. 4 Tony 1 November 2010 at 9:57 pm

    Hi again. Some more thoughts…
    The latest version of Koffice does not seem to do import/export as well as the old version.

    Another avenue worth exploring:

    Export document to HTML. Email it to a Gmail account. In Gmail, click on the ‘View’ option and the document opens in a new window/tab. You can save as HTML-only, or perhaps copy-paste as plain text.

    The resulting HTML source is fairly clean and converted to UTF-8.
    Not sure if it solves any or all the issues above with numbering and spaces, but it might be worth investigating.

    Good luck!


  1. 1 Links 22/9/2010: Dell Announces Another Linux-powered Tablet, Facebook Adds to MySQL | Techrights Trackback on 23 September 2010 at 9:23 am
