Linux .doc to Text Conversions Inadequate
Having a look at converting doc and html to text on Linux. While these are good for indexing purposes, they aren’t that great for actually presenting the output.
update (16/6/11): LibreOffice seems to do numbering export correctly – at least after a quick look. I need to check in detail.
The engines which perform pretty well, like wvText and lynx (via html) have the annoying ‘feature’ of formatting to a column width. This means that it is very hard to predict what are actually paragraph breaks and what are just the pager splitting lines. Using wvHtml can then be transformed by html2text -width 0 to avoid this width problem (although results are a little haphazard at times). Lynx maxes out at a width of about 990 characters, but there are plenty of paragraphs around which exceed that length. … I take it back about html2text -width 0, which seems to fail more often than it succeeds. However, it does seem to honour large width arguments (unlike lynx which has a magic number limiting the width) – width 20000 seems to work (although if there is a line, it inserts 20000 ‘=’ characters) – w3m also looks promising.
It is possible to automate conversion with open office (and seems to be less fragile with version 3 than earlier versions). However, version 3 can’t import Word outline numbering correctly. It drops the first numbering level, marking it as a heading style (see this, this and this among others). Given that they have had 10 years to work this stuff out and have been putting heaps of effort into braindead things like commenting (and, really, whoever introduced the commenting anti-feature in MS Word should be broken on the wheel – maybe not, but only just) and it’s probably something simple (like setting the right numbering level in the xml style) not importing outline numbering correctly strikes me as pretty astounding.
Openoffice 3.1 and earlier also has a problem that it runs numbering together with the text without a separator. (eg you get something like: “1.1This is second level numbering” rather than “1.1 This is second level numbering” or “1.1\tThis is second level numbering”). This, by the way, means that your indexes will be wrong for the first word in any numbered paragraph. This problem has been fixed in 3.2. You might think it would be easy to regex after export to insert a tab or something after leading numbering, until you realise that structures like A, A1, a., (a), 1A, 1a, I, IV, are valid numbering structures -> you would fail on something like “1A A First level number”.
wvText and Abiword don’t get outline numbering right. They do recognise outline numbering, but don’t preserve the numbering symbols (eg a numbering running A, B, C will become 1,2 ,3) or decorators “(a)” becomes “a.”. wvText also outputs numbering on a line by itself (probably because the import is going via html?). So 1.1 This is second level numbering will become something like:
"1. 1. This is second level numbering"
Perhaps this could be scriptable into some sort of sanity but numbering like A, 1A will still be a problem.
IBM’s Lotus Symphony does import numbering correctly (hurrah!) – including decorators on numbering (eg “(a)” and “1.1”), however, being based on an old version of open office, suffers from the export problem of running the numbering together with the text <sigh>. Maybe .doc -> Symphony -> html -> text might work?
Moreover, there seems no obvious way to get the pyUno bridge working for Symphony in the way that you can for openoffice. How, for example, do you get Symphony to run headless/listen on a port?
Yet to be tried are go-oo (which apparently conflicts with openoffice?) and oxygen office….
Note to self:
/opt/ibm/lotus/Symphony/framework/shared/eclipse/plugins/com.ibm.symphony.brand.linux_18.104.22.16800823-2000/program/soffice.bin -nologo -nodefault -norestore -nocrashreport -nofirststartwizard -symphony -accept=pipe,name=_214862394_Office;urp;StarOffice.ServiceManager