Posts Tagged 'python'

Announce: Release of pyShotDetect

I have released some Python code for auto-detecting transitions in video files.  If you have a “.dv” video file comprising a number of different scenes (or, technically, shots) these shots can be automatically detected by pyShotDetect, depending on threshold parameters that are set interactively.  Once you are happy with the detection, pyShotDetect will write a metadata file (an old style Kino file) which demarcates each shot.  This file can subsequently be edited in Kino (eg to extract each shot).

It is particularly effective on fade to/through black transitions and cuts.

Load a file:

The high ridges show shot changes which have been auto-detected (sensitivity can be changed by adjusting a threshold).  The spike in the middle shows a fade through black transition.  Right-clicking and choosing "fade to black" here automatically detects the start and finish of the transition:

The transition itself is not exported (red triangle), but this can be changed if you want to keep a record of the transitions themselves.  You don’t have to do each one individually: there is an option to have pyShotDetect find all fade to black transitions for you automatically.
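To illustrate the general idea of threshold-based detection, here is a minimal sketch. It is not pyShotDetect's actual code: the function names, the per-frame "difference score" and "mean brightness" inputs, and the sample numbers are all assumptions made up for illustration. In practice those per-frame values would be computed from the video (for example with OpenCV).

```python
def detect_cuts(diff_scores, threshold):
    """Return indices of frames whose difference score exceeds the threshold."""
    return [i for i, score in enumerate(diff_scores) if score > threshold]


def detect_fades_through_black(brightness, black_level):
    """Return (start, end) frame index pairs for runs of near-black frames."""
    spans = []
    start = None
    for i, value in enumerate(brightness):
        if value <= black_level and start is None:
            start = i                      # a dark run begins here
        elif value > black_level and start is not None:
            spans.append((start, i - 1))   # the dark run ended on the previous frame
            start = None
    if start is not None:                  # dark run continues to the last frame
        spans.append((start, len(brightness) - 1))
    return spans


# Hypothetical per-frame measurements (not real data).
diff_scores = [0.0, 0.2, 9.5, 0.1, 0.3, 0.4, 0.2, 0.3, 0.2, 0.1]
brightness = [120, 118, 90, 92, 40, 4, 2, 5, 45, 95]

cuts = detect_cuts(diff_scores, threshold=5.0)
fades = detect_fades_through_black(brightness, black_level=10)
```

Raising or lowering `threshold` and `black_level` changes how many transitions are reported, which is the kind of tuning the interactive threshold is for.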

The code is written in Python, and has OpenCV, Tkinter and matplotlib as dependencies. It is licensed under GPLv3.

Pick it up at Sourceforge.

Originally posted at brendanscott.wordpress.com.

Inline those Word Comments with iwc

Microsoft Word has a somewhat repugnant feature called “comments”.  The concept of comments is not so bad, but the implementation in Word is definitely annoying.  Why?  Because to read the comments you basically need to be in Word with the document open and either use their awful “bubble” interface (where the comments are shown in the margin), hover your pointer over the (nearly imperceptible) comment markers, or open up the comment pane at the bottom.

The upshot is that printing the doc and reading it in a more comfortable setting is a no-no. If you *do* print it out, the comments are unhelpfully located right at the end of the document, so you need to go back and forth between the text and the comments.

So I wrote myself a short script to inline those damnable comments (GPLv3).  It requires the command line and is written in Python.  If there is a modicum of interest it wouldn’t be much trouble to write a simple GUI frontend.

The process is:

* using Word, “save as” the document to a .txt file (choose ‘unicode – utf8’ but default should also work)

* python iwc.py -i infile.txt -o outfile.txt

* open up outfile.txt with your favourite editor

Be warned – the quid pro quo for doing this is you lose a lot of your formatting :( Try using Word’s compare to see the markup between the original.doc and outfile.txt to show the comments.

Click here (or the link above) to get the script

Linux .doc to Text Conversions Inadequate

I have been having a look at converting .doc and html to text on Linux. While the available converters are good for indexing purposes, they aren’t that great for actually presenting the output.

update (16/6/11): LibreOffice seems to do numbering export correctly – at least after a quick look.  I need to check in detail.

Column Width

The engines which perform pretty well, like wvText and lynx (via html), have the annoying ‘feature’ of formatting to a column width.  This means that it is very hard to predict which line breaks are actually paragraph breaks and which are just the pager splitting lines.  The output of wvHtml can be passed through html2text -width 0 to avoid this width problem (although results are a little haphazard at times). Lynx maxes out at a width of about 990 characters, but there are plenty of paragraphs around which exceed that length. … I take it back about html2text -width 0, which seems to fail more often than it succeeds.  However, it does seem to honour large width arguments (unlike lynx, which has a magic number limiting the width) – width 20000 seems to work (although if there is a line, it inserts 20000 ‘=’ characters). w3m also looks promising.

Numbering

It is possible to automate conversion with OpenOffice (and it seems to be less fragile with version 3 than with earlier versions). However, version 3 can’t import Word outline numbering correctly. It drops the first numbering level, marking it as a heading style (see this, this and this among others). Not importing outline numbering correctly strikes me as pretty astounding, given that they have had 10 years to work this stuff out, have been putting heaps of effort into braindead things like commenting (and, really, whoever introduced the commenting anti-feature in MS Word should be broken on the wheel – maybe not, but only just), and the fix is probably something simple (like setting the right numbering level in the xml style).

OpenOffice 3.1 and earlier also has the problem that it runs numbering together with the text without a separator (eg you get something like: “1.1This is second level numbering” rather than “1.1 This is second level numbering” or “1.1\tThis is second level numbering”). This, by the way, means that your indexes will be wrong for the first word in any numbered paragraph. This problem has been fixed in 3.2. You might think it would be easy to regex after export to insert a tab or something after leading numbering, until you realise that structures like A, A1, a., (a), 1A, 1a, I and IV are all valid numbering structures, so you would fail on something like “1A A First level number”.
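To make the regex problem concrete, here is a small sketch. The pattern and the function name are my own illustration (not something from any of the tools above), and the pattern deliberately makes the naive assumption that numbering is digits separated by dots.

```python
import re

# Naive assumption: numbering looks like "1", "1.1", "2.3.4" and is
# immediately followed by text (no whitespace, dot or digit).
LEADING_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)(?=[^\s.\d])")


def insert_separator(line, sep="\t"):
    """Insert sep between leading dotted-digit numbering and run-together text."""
    return LEADING_NUMBER.sub(lambda m: m.group(1) + sep, line, count=1)


# Works when the numbering really is dotted digits.
good = insert_separator("1.1This is second level numbering")

# Fails when the numbering was "1A" and the text was "A First level number":
# the regex cannot know that the first "A" belongs to the number.
bad = insert_separator("1AA First level number")

# Lines that already have a separator are left alone.
untouched = insert_separator("1.1 This is second level numbering")
```

Here `good` comes out as “1.1\tThis is second level numbering”, but `bad` comes out as “1\tAA First level number” when what was wanted was “1A\tA First level number”.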

wvText and Abiword don’t get outline numbering right. They do recognise outline numbering, but don’t preserve the numbering symbols (eg a numbering running A, B, C will become 1, 2, 3) or decorators (eg “(a)” becomes “a.”). wvText also outputs numbering on a line by itself (probably because the import is going via html?). So 1.1 This is second level numbering will become something like:

"1.

    1.

        This is second level numbering"

Perhaps this could be scriptable into some sort of sanity but numbering like A, 1A will still be a problem.
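A rough sketch of what such a script might look like follows. It assumes (and this is only an assumption based on the one example above) that each numbering level arrives as a line containing nothing but digits and a full stop, and that consecutive marker lines belong to the next line of text. It drops blank lines and cannot recover symbols that were already lost, so A, 1A and friends remain a problem.

```python
import re

# A line that is nothing but a numeric marker such as "1." or "12."
MARKER_ONLY = re.compile(r"^\d+\.$")


def rejoin_numbering(text):
    """Fold runs of marker-only lines into a dotted prefix on the next text line."""
    out = []
    pending = []                      # markers seen since the last text line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                  # blank lines are simply dropped in this sketch
        if MARKER_ONLY.match(line):
            pending.append(line.rstrip("."))
        else:
            if pending:
                line = ".".join(pending) + " " + line
                pending = []
            out.append(line)
    return out                        # trailing markers with no text are discarded


sample = "1.\n\n    1.\n\n        This is second level numbering"
rejoined = rejoin_numbering(sample)
```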

IBM’s Lotus Symphony does import numbering correctly (hurrah!) – including decorators on numbering (eg “(a)” and “1.1”). However, being based on an old version of OpenOffice, it suffers from the export problem of running the numbering together with the text <sigh>.  Maybe .doc -> Symphony -> html -> text might work?

Moreover, there seems no obvious way to get the pyUno bridge working for Symphony in the way that you can for openoffice. How, for example, do you get Symphony to run headless/listen on a port?

Yet to be tried are go-oo (which apparently conflicts with openoffice?) and oxygen office….

Note to self:

/opt/ibm/lotus/Symphony/framework/shared/eclipse/plugins/com.ibm.symphony.brand.linux_2.0.0.20100823-2000/program/soffice.bin -nologo -nodefault -norestore -nocrashreport -nofirststartwizard -symphony -accept=pipe,name=_214862394_Office;urp;StarOffice.ServiceManager

Python packaging and adding menu entries

Well, I am now unofficially an open source programmer.  Having decided to add a hint function to a “GPL v2 or later” python/pygame game called Jools, I discovered I had to rewrite much of the game logic simply to understand it.  One thing led to another and I’ve learnt some pygame and forked Jools, changing the object of the game and adding some new graphics.  The new version has been released as a “GPL v3 or later” game called Symbology.  It is sort of like bejeweled skinned for maths geeks.

Now is where I get to struggle with the packaging of python programs.  Apparently Nobody Expects Python Packaging.  It certainly seems convoluted and I fear that now that I’ve finished the program I will need to re-engineer it to be able to access the data files which seem like they will end up somewhere I don’t want them (or bounded in a nutshell/zip file) if I use setuptools or distutils.

Another problem I face is how to create a download which will automatically install menu entries in relevant service menus (eg for KDE, GNOME and/or win).  At the moment, the install is just a not very smart gzipped tarball.
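For KDE and GNOME, menu entries are normally described by a freedesktop.org “.desktop” file. The sketch below just builds the text of such a file; the function name, the `symbology` command and the icon name are hypothetical placeholders, and it does nothing for Windows.

```python
def desktop_entry(name, exec_cmd, icon, categories=("Game",)):
    """Build the text of a minimal freedesktop.org .desktop menu entry."""
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        "Name=" + name,
        "Exec=" + exec_cmd,           # command used to launch the program
        "Icon=" + icon,               # icon name or path
        "Categories=" + ";".join(categories) + ";",
        "Terminal=false",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: the real launcher command and icon name may differ.
entry = desktop_entry("Symbology", "symbology", "symbology")
```

A per-user install would typically write this text to a file such as `~/.local/share/applications/symbology.desktop`; a system-wide install would need a package or install script with the appropriate permissions.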

Help – Parsing HTML in python?

I’d like to write a python script which takes an arbitrary html page and (tries to) extract(s) and displays its text.  This strikes me as something someone else would have already done, perhaps a million times.  I have tried to do this with minidom, but it (minidom) choked on the one test html I gave it to parse.  I have also tried to download and install lxml – without any success.

Does anyone have any thoughts?  Are these even the right tools?  What should I be looking at?

[Update: have rethought my requirements and html2ascii.py will probably meet a good part of them – see comments]
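For comparison, here is a minimal sketch of text extraction using only the standard library’s tolerant HTML parser, which does not require well-formed markup. It is written for Python 3 (where the module is `html.parser`), it is not what html2ascii.py does, and the class and function names are my own. It is crude: it ignores layout entirely and just joins the text fragments with spaces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Deliberately sloppy markup: unclosed <p> tags and no closing </html>.
sample = ("<html><head><title>T</title><style>p{}</style></head>"
          "<body><p>Hello <b>world</b><p>Unclosed &amp; fine</body>")
text = html_to_text(sample)
```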

[Update 2: Python’s default ascii type string is astoundingly frustrating.  Somewhere, sometime, when you least expect it, it comes up to bite you and you spend an hour or two trying to re-route what you’re doing around some problem with one of the string methods – Arghhhh!!!!]

[Update 3: Essence for understanding unicode: u'\u201c' == ->“<- is a unicode string. '\xe2\x80\x9c' is not a unicode string.  In fact, it’s just a sequence of bytes (written here in hexadecimal). By providing it a context for interpretation it can become a unicode string ('\xe2\x80\x9c'.decode('utf8') == u'\u201c').  If you give it the wrong context for interpretation it will still become a unicode string, just the wrong one: '\xe2\x80\x9c'.decode('cp1252') == u'\xe2\u20ac\u0153', which is just garbage ->â€œ<- when you print the glyphs.]
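The same point, restated as a short runnable sketch in Python 3 syntax, where byte strings (`b'...'`) and text strings are separate types, so the `u` prefix used above is no longer needed:

```python
# Three raw bytes: no meaning until a codec says how to interpret them.
raw = b"\xe2\x80\x9c"

# Right context: as UTF-8 these bytes are one character, a left double quote.
right = raw.decode("utf-8")

# Wrong context: as cp1252 each byte becomes its own character.
wrong = raw.decode("cp1252")

assert right == "\u201c"
assert len(right) == 1
assert wrong == "\xe2\u20ac\u0153"
assert len(wrong) == 3

# Encoding with the same codec used for decoding gets the bytes back either way.
assert right.encode("utf-8") == raw
assert wrong.encode("cp1252") == raw
```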

OOo-Python Integration Annoyances

Have spent a fair bit of time fiddaddling trying to work out why I couldn’t import com.sun.star.beans.

Apparently it’s because I installed OOo-3 a few days ago (I can’t recall why) and OOo-3 doesn’t work with the System version of python.  Not that I can prove this.  I only have this small and ambiguous statement to go by (and my lack of success): “I also tested it with version 3.0 of OpenOffice but it only runs if you use the version of Python that comes with OpenOffice.”

That, of course, and the fact that once I rolled back OOo to 2.3.something-or-other-x64 (which, by the way, was not a pleasant experience in Yast, which uses “gentle persuasion” to move you onto the most recent version – it apparently also refuses to remember that I asked it to lock the ooo versions – grrr) Python could suddenly find the module.

Anyway now:

* my first attempt at Python-OOo scripting works; and

* I can’t get any of the openoffice components to run [correction: I can, they wouldn’t run because I had a headless instance of soffice running?] (remove -headless from the command line to be able to edit docs while scripting); and

* I don’t have any OOo artwork/icons – in fact, some widgets (eg checkboxes) aren’t there either and the toolbars look screwy. [Update – kde integration (file dialogs were all wacky) was fixed by installing the openoffice-kde integration package.  Icons were hard to remedy, as there was nothing in yast to install to fix it.  Had to find the old themes elsewhere and copy their zip files to /usr/lib64/ooo-2.0/share/config/]

Programming Python (e-?)books?

Programming Python (e-?) books? (< Planet LA still can’t get my titles right)

I have been thinking about how to automate some documentation related tasks for lawyers (well, just for one lawyer in particular).  The tasks are largely search and highlight within a document.  I’d assumed that there’d be a wealth of search and indexing solutions which would do this, but apparently there aren’t (or I can’t decipher them).  The solutions out there assume that you want to identify what document the search term is in.  However, I know which document it is in, I just want to know how it is used within the document.
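The kind of “show me how the term is used” search I have in mind is sometimes called keyword-in-context. A minimal sketch of the idea, with a function name and sample text invented purely for illustration:

```python
import re


def keyword_in_context(text, term, width=30):
    """Return (position, snippet) pairs showing each use of term with context."""
    hits = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Mark the matched term with square brackets inside its surroundings.
        snippet = text[start:m.start()] + "[" + m.group(0) + "]" + text[m.end():end]
        hits.append((m.start(), snippet))
    return hits


# Made-up sample text.
doc = "The Licensee must pay the fee. The licensee may terminate on notice."
hits = keyword_in_context(doc, "licensee", width=10)
```

Each hit gives the character offset of the term and a short excerpt around it, which could then be printed as a list or used to drive highlighting.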

I’m now considering doing a customised solution in Python – a language that I know nothing about (well, except for the 6-8 hours of programming I had at the end of last week).  I hate buying books on these topics because I invariably find I don’t actually use them, or use them very little.   Nevertheless I’m looking at buying one or more O’Reilly Python books, particularly Programming Python.  To get three books (learning, programming and cookbook) looks like it will cost about AUD$217 (inc. 2-5 business days’ shipping- roughly 33% of this is shipping).  To get the same three as pdfs will cost about AU$125, and to just get Programming Python as a pdf will cost $75.  Alternatively I could buy the dead tree versions from Dymocks here for… oh, about $300 (programming python = $125). (three books+ three pdfs + shipping = about $260)

Has anyone had any experience with O’Reilly pdfs?  Will a 1596 page pdf be manageable?  Will I end up printing the whole thing out to use it anyway?  (Maybe I’ll end up printing the particular sections I need when I need them?)

Update [more details here]:

I bought Programming Python Third Edition, Learning Python and Python Cookbook as ebooks (pdfs).  The pdf reader interface is at times annoying, but overall they have been a good buy.  Would have bought hard copies of all three, but freight was ridiculous.  Have also bought pocket reference as a hard copy.  Of these three I have used PP3E heaps and heaps, LP a bit, and PC never (despite looking in it regularly for hints).  The pocket reference would have been better as an ebook.  It doesn’t really have enough info to be useful most of the time.

