Scanning Books on Linux

Monday, March 24th, 2014

I’ve been scanning books for a long time, and the toolchain and workflow have always been somewhat problematic. By now I’ve pretty much figured out what works and what doesn’t.

As a note: all the shell code on this page assumes your files do not have spaces or other weird characters like “;” in their names. If they do, my bicapitalize script can fix that.
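
If you don’t have that script at hand, a minimal stand-in could look like this (assuming bash; replacing spaces with underscores is just an arbitrary choice for illustration, not what bicapitalize actually does):

# replace spaces in filenames with underscores (hypothetical stand-in)
for f in *\ *; do mv -- "$f" "${f// /_}"; done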

Scanner

The first thing you want is a decent scanner, preferably one with an automatic document feeder (ADF). According to the internet, the one to get would be the Fujitsu ScanSnap iX500, since it appears to be the most reliable.

However, that’s not the one I have; mine is an EPSON Perfection V500, a combined flatbed/ADF scanner, which needs iscan with the epkowa backend. It works, but it’s rather slow.

Scanning Software

I mostly use xsane. With the epkowa backend it offers some rather weird choices of dpi, which is why I mostly scan at 200×200 dpi (I would recommend 300×300 dpi, but epkowa does not give me that choice, for some weird reason). Also, I usually scan to png, since that leaves me the most options later on, and png is better suited to text than jpeg.

Of course, never scan black-and-white; always use colour or greyscale. Also, don’t scan to pdf directly: your computer can produce better pdf files than your scanner does, and you would need to tear the files apart for postprocessing anyway.
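
If you prefer the command line over xsane, SANE’s scanimage can do the same job. A rough sketch (the exact option names and mode values depend on your backend, so check scanimage --help for yours):

# scan one page at 200 dpi in colour, then convert to png
scanimage --resolution 200 --mode Color --format=tiff > page.tif
convert page.tif page.png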

Get Images from PDF

If you happen to have your images already inside a pdf, you can easily extract them with pdfimages (from poppler-utils):

pdfimages -j source.pdf prefix

Usually they will come out as (the original) jpeg files, but sometimes you will get .ppm or .pbm files. In that case, just convert them, like so:

for i in *.ppm; do convert $i `basename $i .ppm`.jpg; done

(The convert command is of course from imagemagick, or from graphicsmagick via its graphicsmagick-imagemagick-compat package; plain graphicsmagick calls it gm convert.)

Postprocessing Images

Adjust colour levels/unsharp mask

Depending on how your scans look, you might want to adjust colour levels or apply an unsharp mask first. For that, I’ve written some scheme scripts for gimp:

batch-level.scm
batch-level.sh
batch-unsharp-mask.scm
batch-unsharp-mask.sh

The scheme files belong in your ~/.gimp-2.8/scripts/ directory, the shell scripts somewhere in your path. Yes, they’re for batch-processing images from the command line.
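
In case you wonder what the shell wrappers do: GIMP can run Script-Fu non-interactively with -i -b. A minimal sketch of such a call (the function name and parameters here are illustrative; the real ones are defined in the .scm files above):

gimp -i -b '(batch-unsharp-mask "*.png" 5.0 0.5 0)' -b '(gimp-quit 0)'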

Fix DPI

If the DPI is screwed up, or not the same for every file, you might want to fix that too (without resampling the image):

convert -density 300 -units PixelsPerInch source.jpg target.jpg
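
To fix a whole directory at once, mogrify (convert’s in-place sibling from the same package) does the same thing; note that it overwrites the originals, so keep a backup:

mogrify -density 300 -units PixelsPerInch *.jpg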

Tailor Scans

If your scans basically look good as far as brightness and gamma are concerned, the tool you need is scantailor. With it, you can correct skew, clean up artifacts at the edges and spine shadows, and even somewhat alleviate errors in brightness.

Be sure to use the same dpi for the output as for the input, as scantailor will happily blow up your output at no gain in quality. Also, don’t set the output to black-and-white, because this will most probably produce very ugly tooth patterns everywhere.

You will end up with a load of .tif images in the out folder, which you can either feed to OCR directly or turn into a pdf.

Don’t even try to use unpaper directly. It requires all files to be converted to pnm (a 2MB jpeg turns into a 90MB pnm), and unless your scans are extremely consistent and you give it exactly the right parameters, it will just screw things up.

Create PDF from Images

First we want to convert the tif images to jpeg, since jpeg images can be embedded into a pdf file directly, without conversion to some intermediate format. Most of all, this allows us to use pdfjam (from texlive-extra-utils), which does the job in seconds instead of hours.

for i in *.tif; do convert $i `basename $i .tif`.jpg; done

And then:

pdfjam --rotateoversize false --outfile target.pdf *.jpg

NEVER, ever use convert to create pdf files directly. It will run for minutes to hours at 100% load, fill up all your memory or your disk, and produce huge pdf files.

Create PDF Index

Even if your PDF consists entirely of images, it might still be worthwhile to add an index. You create a file like this:
[ /Title (Title)
/Author (Author)
/Keywords (keywords, comma-separated)
/CreationDate (D:19850101092842)
/ISBN (ISBN)
/Publisher (Publisher)
/DOCINFO pdfmark
[/Page 1 /View [/XYZ null null null] /Title (First page) /OUT pdfmark

And then add it to the PDF with gs:
gs -sDEVICE=pdfwrite -q -dBATCH -dNOPAUSE \
-sOutputFile=target.pdf index.info \
-f source.pdf
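
Bookmarks can also be nested: give a parent entry a /Count with the number of its direct children (negative if it should start out collapsed), like so:

[/Count 2 /Page 1 /Title (Chapter 1) /OUT pdfmark
[/Page 2 /Title (Section 1.1) /OUT pdfmark
[/Page 5 /Title (Section 1.2) /OUT pdfmark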

The upper part, the one with the metadata, is entirely optional, but you really might want to add something like it anyway. There are other options for adding metadata, too (see below).

Another option is jpdfbookmarks; however, it doesn’t seem to be very convenient either.

OCR

The end product you want from this is either a) a PDF (or EPUB) in which the text is real, native text rendered in a system font, not an image of text, or b) a PDF in which the page image is underlaid with text, so that each image of a character sits on top of the (hopefully correctly recognized) character itself.

Sadly, I don’t know of any software on Linux that can do the latter. Unless you want to create an EPUB file, or a PDF which does not contain the original image on top of the text, you need to use OCR software on some other platform. The trouble, of course, is that going all the way (no original image of the page) means your OCR needs to be perfect, as there is no way to re-OCR, and sometimes not even a way to correct the text manually. And of course, the OCR software should retain the layout.

For books, doing a native text version is of course preferred, but for some things, like invoices, you really do need the original image on top of the text.

Apparently, tesseract-ocr now incorporates some code to overlay images on text, but I haven’t tested that. There also seems to be an option involving tesseract and hocr2pdf, but I’m not keen to try it, since ocropus, which can’t do that, has consistently had the better recognition rate, and even that is lower than what commercial solutions achieve.
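
For the record, the untested route would look roughly like this (tesseract 3.03 or newer can write a searchable PDF directly; hocr2pdf comes from the exactimage package; older tesseract versions name the hOCR output page.html instead of page.hocr):

# direct searchable PDF:
tesseract page.tif page pdf
# or via the hOCR detour:
tesseract page.tif page hocr
hocr2pdf -i page.tif -o page.pdf < page.hocr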

Adding metadata to PDF files

I’ve written about this before, and I’ve also written some scripts to handle it. You can do it by hand with exiftool, or you can use my exif-meta, which will do it automatically for a lot of files, based on the file and directory names.

For books, unless your name is “Windows User” and your scientific paper is called “Microsoft Word – Untitled1”, you want to add at least author, title, publishing year and publisher, plus the ISBN if there is one.
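
Doing it by hand with exiftool looks like this (the values are examples, obviously; exiftool keeps a backup copy as target.pdf_original):

exiftool -Title="Some Title" -Author="Some Author" \
  -Publisher="Some Publisher" -CreateDate="2014:03:24 09:28:42" target.pdf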

Needed software

On a decent operating system (Linux) with decent package management (Debian or a derivative), you can do:

apt-get install scantailor graphicsmagick graphicsmagick-imagemagick-compat libimage-exiftool-perl xsane poppler-utils texlive-extra-utils

to get all the packages. The rest is linked in the text.

See also

I’ve found some other related pages you might find interesting: