Documents from Hell

Every so often you get a document, usually one written with a certain software package from a company in Redmond, that looks pretty good but needs some minor property changed, such as the font of the default paragraphs. No problem: you open the “Stylist” in OpenOffice and change the font for “Default”. It does not work. Hmm, the document claims that every paragraph uses the “Default” style. And then you realise that every paragraph has its own font set individually.

It is impossible to clean up that mess in the office software. You can’t click “Clear formatting” for every paragraph, and besides, you would screw up any other formatting, such as bold and italics, too. XML to the rescue! In theory it should be possible to unzip the document and edit the XML. As it happens, the XML, several megabytes of it, is very structured indeed: everything is on one line. Thank you very much. This means most normal text-oriented Unix tools won’t work, because there is no useful delimiter to rely on.
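The "unzip and look at the XML" step can be sketched in a few lines of Python. This is only a minimal illustration: the function name pretty_content is made up for this example, and the member name content.xml is an assumption that holds for OpenDocument-style archives but may differ for other formats. The indented output is meant for reading and searching; the pretty-printer adds whitespace inside paragraphs, so it should not be written back into the document as-is.

```python
import io
import zipfile
from xml.dom import minidom


def pretty_content(source, member="content.xml"):
    """Return one XML member of a zipped office document, indented for reading.

    `source` is a file name or a binary file-like object.
    `member` defaults to content.xml, which is an assumption about the format.
    """
    with zipfile.ZipFile(source) as archive:
        raw = archive.read(member)
    # minidom puts every element on its own line, so line-based tools work again.
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Tiny stand-in archive so the sketch runs without a real document.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("content.xml", "<doc><p>one</p><p>two</p></doc>")
demo.seek(0)

print(pretty_content(demo))
```

With a real file, the call would be something like pretty_content("report.odt"), where the file name is of course hypothetical.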

Luckily I found xmlindent, which nearly does the job. After you have finished editing, you can get back almost the original XML (the one exception being a missing linefeed after the XML declaration) with sed s/\ \ \ \ //g followed by tr -d "\n". Also interesting is Editix, a Java-based XML editor.
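One thing to watch with the sed expression is that it removes every run of four spaces, including any that happen to sit inside the text itself. A slightly more careful way to undo the indentation might look like the sketch below. It assumes the indenter only added leading whitespace and linefeeds, which I have not verified for every case; flatten is a name invented for this example. It also keeps the linefeed after the XML declaration, if there is one.

```python
def flatten(indented):
    """Undo line-based indentation: strip leading blanks, then join the lines.

    Assumes the indenter only added leading whitespace and linefeeds.
    """
    lines = [line.lstrip(" \t") for line in indented.splitlines()]
    if lines and lines[0].startswith("<?xml"):
        # Keep the single linefeed that follows the XML declaration.
        return lines[0] + "\n" + "".join(lines[1:])
    return "".join(lines)


sample = '<?xml version="1.0"?>\n<a>\n    <b>x    y</b>\n</a>\n'
print(flatten(sample))
```

The four spaces between x and y survive here, whereas the global sed substitution would delete them.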

Now I would like to get rid of 10’000 redundant style definitions, which either define bold or italic, or are used to set small caps and bold to designate a subtitle; preferably, those that define a title should be replaced by a real “Heading”. sed -n '/<style:style style:name="P/,/<\/style:style>/p' will give me the whole statements, but what now?
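One possible answer to "what now?" is to stop treating the file as text and let an XML parser do the comparing. The sketch below is an assumption-laden illustration, not a tested cleanup tool: it treats two style elements as redundant when everything except their name is identical, keeps the first one, and points all style-name references at it. The helper names (local, signature, merge_duplicate_styles) are invented here. It matches on local names only, so it does not care which namespace URIs the document declares; the URIs in the sample are the OpenDocument ones, and older formats may declare different ones.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Strip the '{namespace}' part that ElementTree puts in front of names."""
    return name.rsplit("}", 1)[-1]


def signature(elem, skip_name=True):
    """Hashable description of an element, ignoring the style's own name."""
    attrs = tuple(sorted(
        (key, value) for key, value in elem.attrib.items()
        if not (skip_name and local(key) == "name")
    ))
    children = tuple(signature(child, skip_name=False) for child in elem)
    return (elem.tag, attrs, children)


def merge_duplicate_styles(root, prefix="P"):
    """Drop styles that duplicate an earlier one; return {old name: kept name}."""
    parents = {child: parent for parent in root.iter() for child in parent}
    kept = {}     # signature -> name of the first style that has it
    renamed = {}  # redundant style name -> name of the style that replaces it
    for style in list(root.iter()):
        if local(style.tag) != "style":
            continue
        name = next((v for k, v in style.attrib.items() if local(k) == "name"), None)
        if name is None or not name.startswith(prefix):
            continue
        sig = signature(style)
        if sig in kept:
            renamed[name] = kept[sig]
            parents[style].remove(style)
        else:
            kept[sig] = name
    # Point every reference to a removed style at the style that was kept.
    for elem in root.iter():
        for key, value in list(elem.attrib.items()):
            if local(key) == "style-name" and value in renamed:
                elem.set(key, renamed[value])
    return renamed


SAMPLE = (
    '<doc xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"'
    ' xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">'
    '<styles>'
    '<style:style style:name="P1" style:family="paragraph">'
    '<style:text-properties fo:font-weight="bold"/></style:style>'
    '<style:style style:name="P2" style:family="paragraph">'
    '<style:text-properties fo:font-weight="bold"/></style:style>'
    '<style:style style:name="P3" style:family="paragraph">'
    '<style:text-properties fo:font-style="italic"/></style:style>'
    '</styles>'
    '<text:p text:style-name="P1">a</text:p>'
    '<text:p text:style-name="P2">b</text:p>'
    '<text:p text:style-name="P3">c</text:p>'
    '</doc>'
)

root = ET.fromstring(SAMPLE)
renamed = merge_duplicate_styles(root)
print(renamed)  # {'P2': 'P1'}
```

Turning the subtitle styles into a real “Heading” would still need a manual decision about which of the surviving styles mean "title". Note also that ElementTree renames namespace prefixes on output unless they are registered with ET.register_namespace first, so writing the tree back needs some care.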
