Archive for the 'EBooks' Category

On Ebooks and pricing

Monday, November 17th, 2014

I’m an avid reader. And I’ve got quite a collection of books. But apart from some pdfs I bought at rpgnow and some I’ve got from humblebundle, I bought exactly one ebook on the internet.

One reason for this is DRM. I can’t stand it, and I vote with my feet. I will never support such a completely consumer-hostile scheme with my money, and I even try to boycott the most egregious abusers of it by not buying their other products either. Amazon, for instance. There are of course other abuses producers can indulge in which might make me want to boycott them, like lobbying for extensions of copyright, or ripping off the academic communities with journals and other such rent-seeking activities. But I won’t go into detail here; if buying a book is not as easy as can be, or the book has antifeatures like DRM, you can stop right now: you’ve already lost.

Another reason is price. I’m well aware that producing an ebook has a lot of the same fixed costs as producing one on paper, but still, ebooks are much, much too expensive. What most people seem to forget is that while ebooks have the same fixed costs (basically the writer and the editor; see Charlie Stross’ blog for details), there are practically no costs associated with the individual file you sell. So the base price of the individual unit, which used to come from printing, stock and distribution, falls away. What remains is the amount for writer, editor, marketing, etc., which can be spread out over as many units as you like.

The question is, where actually is the “right” price for ebooks (and movies, by the way, which also seem to be much too expensive)?

The facts to keep in mind are: Budget and time are constrained on the buyer’s side. For most books, the things that tell stories are the immediate competition: movies and computer games. But basically everything else that entertains people is competing with books too, like forums and blogs and other places of discourse on the internet. And the public domain will also compete with newer books. Also, while people can (and will, regardless of copyright) share ebooks with their friends, they can’t actually resell them and recoup some of the money they spent. So the price needs to be rather low, to compete with all the other offerings, with public domain books, and with used books on paper.

Now, I’ve noticed from my behaviour regarding computer games that I actually re-bought a lot of games I already had, at gog.com or steam. Why? Mostly because they were cheap. With some autumn, summer and christmas sales, I’ve re-bought just about all the computer games I’ve ever had.

This leads me to the conclusion of what the “sweet price” for ebooks is: The price where I can re-buy all the books I’ve ever had on paper or ever read within the space of maybe a year or two. In my case, that’s probably around 2000 books, some of which I’ve gotten for SFR 1 at garage sales, some I borrowed from libraries, some I bought at retail price. The price for all of them together should probably be below SFR 10’000; and with the SFR 1 paper ones as measuring stick, the price per book should probably not be a lot more than that. Say between SFR 1 and 4 per book (one swiss franc, by the way, is slightly more than a dollar nowadays).

Of course, you won’t want to make every book this price; you might want to put prices of popular ones more towards SFR 4 and unpopular ones more towards SFR 1, and of course, for new releases you want to have prices rather towards the prices of the paper version, SFR 15 maybe. Still, even new releases should have prices markedly lower than the paper version, since you really don’t have to take any logistics into account. Plus, you might want to have sales with huge discounts, and bundles, like “all the works of Isaac Asimov for SFR 30”. Take a look at steampowered.com around christmas, and you’ll see what I mean.

Exactly the same thinking can of course be applied to other digital goods, like movies and television series. I suspect the prices there could be a bit higher, maybe something around SFR 3-10 for movies, and maybe SFR 1-4 for single episodes, or SFR 10-20 for whole series. And again, newer ones priced above that, and bundles (“all the James Bond movies for SFR 50”).

The main point here is: There is a price that is so low, people will buy these things just to have them (or even out of fear that “it will cost more after this sale”), regardless of whether they ever get around to reading or watching them. Just for the sake of collecting. Because they remember them, or because they’ve heard about that author and plan on reading something by him some time in the future. It doesn’t even matter if they have already gotten that book from a friend, for free. And, most importantly, it doesn’t matter if they will ever find the time to read it; or even expect to find the time.

Scanning Books on Linux

Monday, March 24th, 2014

I’ve been scanning books for a long time, and it’s always kinda problematic with the toolchain, and with the workflow. Now I’ve pretty much found out what works, and what does not.

As a note: All the shell-code on this page assumes your file names do not contain spaces or other weird characters like “;”. If they do, my bicapitalize can fix that.
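To check a scan directory for such names before running any of the loops on this page, something like the following sketch may help. The function name list_unsafe_names is made up for this example, and the “safe” character set is a deliberately conservative assumption, not a rule from any of the tools mentioned here.

```shell
# Hypothetical helper: print every entry directly inside the given directory
# (default: the current one) whose name contains a character outside the
# conservative set A-Z a-z 0-9 . _ -
# LC_ALL=C keeps the ranges from matching accented letters in some locales.
list_unsafe_names() {
    dir=${1:-.}
    LC_ALL=C find "$dir" -mindepth 1 -maxdepth 1 -name '*[!A-Za-z0-9._-]*' -print
}
```

If list_unsafe_names prints nothing for your directory, the unquoted loops further down should be safe to run there.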

Scanner

The first thing you want to have is a decent scanner, preferably one with an automatic document feeder (ADF). According to the internet, what you need would be the Fujitsu ScanSnap iX500, since it appears to be the most reliable.

However, that’s not the one I have, mine is an EPSON Perfection V500, a combined flatbed/ADF scanner, which needs iscan with the epkowa interpreter. It works, but it’s rather slow.

Scanning Software

I mostly use xsane. With the epkowa-interpreter, it gives me some rather weird choices of dpi, that’s why I mostly scan at 200×200 dpi (I would recommend 300x300dpi, but epkowa does not give me that choice, for some weird reason). Also, I scan to png usually, since this gives me the best choices later on, and is better suited to text than jpeg.

Of course, never scan black-and-white; always colour or greyscale. Also, don’t scan to pdf directly. Your computer can produce better pdf files than your scanner does, and you would need to tear the files apart anyway for postprocessing.

Get Images from PDF

If you happen to have your images already inside a pdf, you can easily extract them with pdfimages (from poppler-utils):

pdfimages -j source.pdf prefix

Usually, they will come out as (the original) jpeg files, but sometimes you will get .ppm or .pbm. In that case, just convert them, something like so:

for i in *.ppm; do convert "$i" "$(basename "$i" .ppm).jpg"; done

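(The convert command is of course from imagemagick. With graphicsmagick, the same command is spelled gm convert, unless a compatibility package provides the plain convert name.)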
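The one-liner above only covers .ppm files. A slightly more defensive sketch that handles both .ppm and .pbm and never overwrites an existing jpeg could look like this. The function name netpbm_to_jpeg and the CONVERT variable are made up for this example; CONVERT merely lets you point the function at a different single-word converter command.

```shell
# Hypothetical helper: convert every .ppm and .pbm file in a directory
# (default: the current one) to .jpg, leaving the originals in place.
# CONVERT is an assumed override; it defaults to ImageMagick's "convert".
netpbm_to_jpeg() {
    dir=${1:-.}
    for f in "$dir"/*.ppm "$dir"/*.pbm; do
        [ -e "$f" ] || continue       # glob did not match anything
        out=${f%.*}.jpg               # same name, extension replaced
        [ -e "$out" ] && continue     # do not overwrite an existing jpeg
        "${CONVERT:-convert}" "$f" "$out" || return 1
    done
    return 0
}
```

Call it as netpbm_to_jpeg (current directory) or netpbm_to_jpeg some/dir, and delete the .ppm and .pbm files once you are happy with the result.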

Postprocessing Images

Adjust colour levels/unsharp mask

Depending on how your scan looks, you might want to change colour-levels or unsharp mask first. For that, I’ve written some scheme-scripts for gimp:

batch-level.scm
batch-level.sh
batch-unsharp-mask.scm
batch-unsharp-mask.sh

The scheme-files belong in your ~/.gimp-2.8/scripts/ directory, the shell-scripts in your path. Yes, they’re for batch-processing images from the commandline.

Fix DPI

If the DPI is screwed, or not the same for every file, you might want to fix that too (without changing the resolution):

convert -density 300 -units PixelsPerInch source.jpg target.jpg

Tailor Scans

If your scans basically look good, as far as brightness and gamma are concerned, the thing you need is scantailor. With it, you can correct skew, artifacts at the edges, spine shadows, and even somewhat alleviate errors in brightness.

Be sure to use the same dpi in the output as in the input, as scantailor will happily blow up your output at no gain of quality. Also, don’t set the output to black-and-white, because this will most probably produce very ugly tooth-patterns everywhere.

You will end up with a load of .tif images in the out-folder, which you can either shove off to OCR directly or turn into a pdf.

Don’t even try to use unpaper directly. It requires all the files converted to pnm (2MB jpeg will give 90MB pnm), and unless your scans are extremely consistent and you give it the right parameters, it will just screw up.

Create PDF from Images

We first want to convert the tif-images to jpeg, as it will be possible to insert them into a pdf file directly, without converting them to some intermediate format. Most of all, this will allow us to do it via pdfjam (from texlive-extra-utils) which will do it in seconds instead of hours.

for i in *.tif; do convert "$i" "$(basename "$i" .tif).jpg"; done

And then:

pdfjam --rotateoversize false --outfile target.pdf *.jpg

NEVER, ever use convert to create pdf-files directly. It will run for minutes to hours at 100% load, fill up all your memory or your disk, and produce huge pdf-files.
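One thing to watch with the pdfjam call above is that *.jpg expands in plain alphabetical order, so page10.jpg sorts before page2.jpg unless the numbers are zero-padded. A minimal sketch for listing the pages in numeric order, assuming GNU sort (for its -V option) and the no-spaces rule from the top of this page; the function name pages_in_order is made up for this example:

```shell
# Hypothetical helper: print the .jpg files of a directory (default: the
# current one), one per line, in "version" order so that page2 comes
# before page10. Relies on GNU sort's -V option.
pages_in_order() {
    dir=${1:-.}
    for f in "$dir"/*.jpg; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V
}
```

It can then stand in for the glob, as in pdfjam --rotateoversize false --outfile target.pdf $(pages_in_order .), which relies on word splitting and therefore only works for names without spaces.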

Create PDF Index

Even if your PDF consists entirely of images, it might still be worthwhile to add an index. You create a file like this:
[ /Title (Title)
/Author (Author)
/Keywords (keywords, comma-separated)
/CreationDate (D:19850101092842)
/ISBN (ISBN)
/Publisher (Publisher)
/DOCINFO pdfmark
[/Page 1 /View [/XYZ null null null] /Title (First page) /OUT pdfmark

And then add it to the PDF with gs:
gs -sDEVICE=pdfwrite -q -dBATCH -dNOPAUSE \
-sOutputFile=target.pdf index.info \
-f source.pdf

The upper part, the one with the metadata, is entirely optional, but you really might want to add something like this. There are some other options for adding metadata (see below).

Another option is jpdfbookmarks, however it doesn’t seem to be very comfortable either.
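Writing the /OUT lines by hand gets tedious for a book with many chapters. As a minimal sketch, the bookmark part of the index file can be generated from a plain list of “page title” lines. The function name make_bookmarks and the input format are assumptions made for this example; the output has the same shape as the hand-written bookmark line above.

```shell
# Hypothetical helper: read lines of the form "<page> <title>" on stdin and
# print one pdfmark bookmark entry per line.
make_bookmarks() {
    while read -r page title; do
        [ -n "$page" ] || continue    # skip empty lines
        # Backslash-escape \ ( and ), which are special in PostScript strings.
        title=$(printf '%s' "$title" | sed 's/[\\()]/\\&/g')
        printf '[/Page %s /View [/XYZ null null null] /Title (%s) /OUT pdfmark\n' "$page" "$title"
    done
}
```

With a file toc.txt containing lines like “5 Chapter 1”, running make_bookmarks < toc.txt >> index.info appends the entries to the index file. The sketch only escapes parentheses and backslashes, so it is best kept to plain ASCII titles.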

OCR

The end product you want with this is either a) a PDF (or EPUB) in which text is really native text and not an image of text, rendered in a system font, or b) a PDF in which the whole image is underlaid with text, in such a way that each image of a character is underlaid with the (hopefully correctly recognized) character itself.

Sadly, I don’t know any software on Linux which can do the latter. Unless you want to create an EPUB file, or a PDF which does not contain the original image on top of the text, you need to use some OCR software on some other platform. The trouble of course is, going all the way (no original image of the page) means your OCR needs to be perfect, as there is no way to re-OCR, or sometimes even no way to correct the text manually. And of course, the OCR software should retain the layout.

For books, doing a native text version is of course preferred, but for some things like invoices, you really do need the original image on top of the text.

Apparently, tesseract-ocr now incorporates some code to overlay images on text, but I haven’t tested that. Also, there seems to be some option with tesseract and hocr2pdf. But I’m not keen to try it, since ocropus, which can’t do that, has consistently had the better recognition rate, and even that one is lower than those of commercial solutions.
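For anyone who wants to try the tesseract route anyway, here is a minimal and untested-on-real-scans sketch. It assumes a tesseract release that accepts pdf as an output config (the form tesseract input outputbase -l lang pdf) and that the matching language data is installed. The function name ocr_to_pdf and the TESSERACT variable are made up for this example.

```shell
# Hypothetical helper: run tesseract on every .tif in a directory (default:
# the current one) and write one PDF per page next to it.
# Arguments: directory, then language code (default: eng).
# TESSERACT is an assumed override for the command name.
ocr_to_pdf() {
    dir=${1:-.}
    lang=${2:-eng}
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue       # glob did not match anything
        # tesseract appends ".pdf" to the output base name by itself
        "${TESSERACT:-tesseract}" "$f" "${f%.tif}" -l "$lang" pdf || return 1
    done
    return 0
}
```

This yields one PDF per page, which still have to be joined into a single file afterwards; whether the recognition quality is good enough is a separate question.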

Adding metadata to PDF files

I’ve written about this, and I’ve also written some scripts to handle this. You can do it by hand, with exiftool, or you can use my exif-meta which will do it automatically, based on the file- and directory-name, for a lot of files.

For books, unless your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”, you want to add at least Author, Title, Publishing Year and Publisher. And the ISBN if you have one.
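Doing it by hand with exiftool can be sketched as follows. The function name set_pdf_meta and the EXIFTOOL variable are made up for this example, and this is not how exif-meta works internally. Author and Title are written as ordinary PDF tags, while the publisher goes into the XMP Dublin Core field, since plain PDF metadata has no standard publisher entry. Publishing year and ISBN are left out of this sketch.

```shell
# Hypothetical helper: set author, title and publisher on one PDF file.
# Arguments: file, author, title, publisher.
# EXIFTOOL is an assumed override for the command name.
set_pdf_meta() {
    file=$1
    author=$2
    title=$3
    publisher=$4
    # -overwrite_original stops exiftool from keeping a *_original backup copy
    "${EXIFTOOL:-exiftool}" -overwrite_original \
        -Author="$author" \
        -Title="$title" \
        -XMP-dc:Publisher="$publisher" \
        "$file"
}
```

For example: set_pdf_meta 1632.pdf 'Eric Flint' '1632' 'Baen Publishing Enterprises'. Running exiftool 1632.pdf afterwards shows what actually ended up in the file.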

Needed software

On a decent operating system (Linux) with a decent package-management (Debian or derivative), you can do:

apt-get install scantailor graphicsmagick exiftool xsane poppler-utils texlive-extra-utils

to get all the packages. The rest is linked in the text.


Life with Calibre

Tuesday, November 26th, 2013

Calibre is undisputedly the number one when it comes to e-book management. It’s HUGE. It’s got a plethora of functions.

And it’s got quirks, design decisions which may not suit your workflow. Certainly a lot of them don’t suit mine.

  • Calibre’s own space. Every document imported into the library ends up copied into some private directory of Calibre, and named according to some /Author/Title/Title scheme. The way I cope with this is to import into Calibre and save to disk again.
  • Metadata on the filesystem. Metadata is stored not within the file, but in a database, and apparently in an opf file next to the book as well. Luckily, Calibre tries to put metadata into the file when saving to disk. So the solution here is the same as above.
  • Name like Yoda, A. When writing files, it renames them to some library sort order, with the article appended at the end. To fix this, there’s a parameter in “Preferences” -> “Tweaks” -> “Control Formatting of Title and Series when used in Templates”, called save_template_title_series_sorting, which needs to be set to strictly_alphabetic.
  • No such Character. There’s a set of characters Calibre does not want in file names. They are the same on all platforms, and while it’s not wise to use asterisks and such on unix filesystems, because they would wreak havoc on shell-processing, they would still work. The only character really not allowed is the “/”. But Calibre also replaces characters that are only a problem on Windows, including desirable critters like “:” and “+”. The way to fix this is to edit
    /usr/lib/calibre/calibre/__init__.py and remove them from _filename_sanitize_unicode.
  • Publishing by the Month. Before the advent of e-books, publishing dates were by definition expressed in years. Copyright law also uses the year only. To get rid of the ridiculous month in the publishing date, go to “Preferences” -> “Tweaks” -> “Control how dates are displayed” and set gui_pubdate_display_format to yyyy.
  • Not unique. As librarians know, in the absence of an ISBN, books are identified by author, title, publishing year and publisher. Now when saving pdf files, Calibre puts in neither the ISBN, nor the publishing year, nor the publisher. Apparently, this is a problem of podofo, which does not know these. Speaking of which:
  • podofail. Sometimes podofo also fails to write some tags. It’s not quite clear when this happens, as none of my pdf files are encrypted, and exiftool can write metadata to them without problems.

Over time, I’ve written a slew of scripts to read and set metadata, these are:

  • epub-meta (c) — A very fast EPUB metadata-viewer based on ebook-tools’ libepub from Ely Levy.
  • epub-rename (perl) — A script to rename epub-files according to the EPUB’s metadata. Needs epub-meta and ebook-meta (from calibre).
  • exif-rename (perl) — A script to rename files according to their EXIF-tags. Tested with PDF, DJVU, M4V, OGM, MKV and MP3
  • exif-meta (perl) — A script to set EXIF/XMP-metatags according to the filename.
  • exif-info (perl) — Displays metadata and textual content of PDF files. Intended as a filter for Midnight Commander.

For further technical information and rants, you might want to read How to Enter EPUB Metadata, Display EPUB metadata and rename EPUB files accordingly, and Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”, also on this blog.

Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”

Saturday, April 20th, 2013

At least that is what I get from the metadata in your publication.

Google finds about 250’000 of these papers. It gets much worse if you only search for documents called “untitled1”. Not just the documents themselves have this meta-information, but all kinds of conversions, to html, and to pdf as well.

Sometimes, to make the whole thing even more ironic, the publisher has added his own information — but neither the title, nor the author.

Yes, metadata is kind of a pet issue for me, and I’ve even written about How to Enter EPUB Metadata, apart from also having written software to fix metadata in PDF and EPUB files (epub-meta/epub-rename and exif-meta/exif-rename; the latter works for PDF, and the name comes from exiftool, although technically the PDF metadata is XMP).

But still, if your paper is worth anything, it is worth being found, and this also means it is worth being provided with accurate meta-information.

Librarians work with an ISBN where there is one. If no ISBN can be found (because the work was published before 1969, or because no ISBN was ever registered), they need the following to correctly identify a work:

  • Author
  • Title
  • Publishing Year
  • Publisher

So you should take care that at least the first three of those are correctly filled in. If you’re doing a paper or book in the course of your work or study and publish it on the internet, consider entering the university or company as publisher.

Display EPUB metadata and rename EPUB files accordingly

Wednesday, January 25th, 2012

I’ve been programming. First a fast replacement for displaying metadata of an EPUB-file:

$ epub-meta -v 1632.epub
File: 1632.epub
Title: 1632
Author: aut: Eric Flint(Flint, Eric )
ID: ISBN (Unspecified:0-671-31972-8)
ID: uuid (uuid_id:b1d2b16c-f68d-4a80-9d02-f6fcac36e1f3)
Subject: Science Fiction
Publisher: Baen Publishing Enterprises
Date: Unspecified: 2000-02-07T05:00:00+00:00
Lang: en
Contrib: bkp: calibre (0.6.51) [http://calibre.kovidgoyal.net](calibre (0.6.51) [http://calibre.kovidgoyal.net])
Meta: calibre:series_index: 1.0
Meta: calibre:timestamp: 2010-05-07T16:59:51.299000+00:00
Meta: cover: 0671578499_Cover
Meta: calibre:series: Ring of Fire
Meta: calibre:user_categories: {}
Meta: calibre:author_link_map: {}

It has commandline switches to select which metadata should be displayed. And it does so very fast: 3ms on my system, as opposed to the 670ms that ebook-meta from Calibre needs. However, it can only display the metadata, not change it. For changing metadata, you still need ebook-meta.

Having the ability to display metadata fast made it possible to rename EPUB-files. Initially, I had the idea to do that in C too, but working with strings is actually quite tedious in C, so I decided on perl. So there’s now also a program called epub-rename, which renames EPUB-files according to their metadata in the format Author - Series SeriesIndex - Title. Moreover, it also has, through ebook-meta, the ability to fix certain issues in metadata-tags. Namely, it can swap inverted Title/Author tags, fix Author-tags which are in the wrong(!) Last, First format, and some more.
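To illustrate the naming pattern only: the following sketch builds a target file name from the four fields. It is not the perl code of epub-rename; the function name epub_target_name is made up, and what to do when a book has no series (here the series part is simply dropped) is an assumption.

```shell
# Hypothetical helper: build a file name in the pattern
# "Author - Series SeriesIndex - Title.epub".
# Arguments: author, series, series index, title.
# Assumption: if the series is empty, the series part is left out entirely.
epub_target_name() {
    author=$1
    series=$2
    index=$3
    title=$4
    if [ -n "$series" ]; then
        printf '%s - %s %s - %s.epub\n' "$author" "$series" "$index" "$title"
    else
        printf '%s - %s.epub\n' "$author" "$title"
    fi
}
```

For the example book above, epub_target_name 'Eric Flint' 'Ring of Fire' 1 1632 prints “Eric Flint - Ring of Fire 1 - 1632.epub”.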

Well, here’s the “--help”:
$ epub-rename --help
Usage: [options] [directory ...]

Options:
-c|--compat
-f|--fix
-h|--help
-t|--title
-r|--rename
-x|--exchange
-v|--verbose

Options:
-c|--compat
Use ebook-meta from calibre instead of epub-meta. Much slower.

-t|--title
Fix title. This means the tag gets sanitized, as it would if
destined for a filename, and then written back to the metadata.
Uses ebook-meta.

-f|--fix
Fix all tags: author, title, and in some cases date. Uses
ebook-meta and touches every file, even those that don't need
fixing. Slow.

-h|--help
Print a brief help message and exit.

-r|--rename
rename files to the pattern "Author - Series SeriesIndex -
Title"

-x|--exchange
changes title for author-tag and vice versa. For all those files
that have the author in the title-field and the title in the
author- field. Uses ebook-meta, thus is slow.

-v|--verbose
Show how all files would be renamed, not just those really
renamed.

And here’s the program itself: epub-meta-0.2.tar.gz. MIT-Licensed. Enjoy.

If you don’t like the spaces, punctuation and UTF-8-characters in the output filenames, I’d recommend another program of mine: bicapitalize.

How to Enter EPUB Metadata

Friday, January 20th, 2012

If you have a library of e-books from different sources (e.g. Baen, Gutenberg, Archive.org, Google Books), you will notice a dismaying plethora of different styles of annotating EPUB-files, sometimes blatantly wrong and in violation of the EPUB standard itself.

So this is a Howto on how to enter these metadata correctly. I’ll mostly cover the program “ebook-meta” (part of Calibre) which is available on about every platform.

Encoding

EPUB metadata is Unicode, and in practice that means UTF-8. Still, your tags are less likely to get messed up if you avoid things like left and right quotes and backquotes. Ideally, only use the plain single quote “'”.

Vocabulary

Try to be consistent in the vocabulary for tags (genres, categories). Sadly, no vocabularies are specified by the standards right now.

Tags

  • Title: This will contain the title as it’s read. Don’t put in the author (yes, seen that). Don’t anticipate sorting by naming it “Title, The”; this is the task of the library program, which should do this. Don’t enter Series and Series Index.
  • Title sort: You don’t need to enter that; at least ebook-meta usually sets this correctly.
  • Author(s): Enter the author as named. Don’t enter the title here (also seen), and don’t enter things like series or title after the author’s name. Don’t anticipate sorting by naming it “Name, First Name”. Enter it in the form “First Middle Last”. If the author’s name is usually used with initials, use these. Don’t enter “John Ronald Reuel Tolkien”, but “J. R. R. Tolkien”. After an initial, enter a dot and a space. If there are several authors, enter all of them; when using “ebook-meta”, separate them with “&”.
  • Author sort: You don’t need to enter that; at least ebook-meta usually sets this correctly.
  • Publisher: This is the original publisher. If you’re preparing an out-of-copyright e-book, don’t enter yourself. Also, don’t anticipate sorting but enter it as given.
  • Languages: At least one language must be set; you can set several if the book is multi-lingual. The language code is the 2-letter ISO code. Apparently ebook-meta ignores localized ones such as “en-gb”.
  • Published: This is the original publishing date. Not the date you’re preparing the e-book!
  • Rights: Enter the year and copyright holder, if applicable, and a license if necessary. Like this: “Copyright 1954 by J. R. R. Tolkien” or “Copyright 2012 by Peter Keel, License CC-By-2.5” or “Public Domain” if the work is not protected by copyright anymore.
  • Identifiers: Here go ISBN or ISSN (for magazines) or UUID. You can put in as many as you like. “ebook-meta” only allows setting the ISBN and a BookID specifically.
  • Comments: This is actually the “Description”-tag, and it’s supposed to hold the blurb which would otherwise go onto the flap or the back of a physical book. And it should not contain HTML-tags. Also, don’t make this too long.
  • Series: This is a Calibre-specific tag, however it’s honored in many e-book-readers, so you really want to use this. Enter the series as spelled. Don’t take sorting into account. Don’t enter any series number.
  • Series Index: Also Calibre-specific, but goes with support for the “Series”-tag. Enter a number here, corresponding to the number in the series.
  • Tags: This one is really the “Subject”-tag. It contains as many tags as you wish on what the book is about. Enter the genre here as well. Enter tags separated by comma. Do NOT enter a blurb here.
  • Category: This is probably the “Type” tag, but support seems to be rather limited. In any case, the genre does NOT go into that, but rather things like “textbook” or “novel”.
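Put together, a call to ebook-meta following the rules above might look like this sketch. The values are taken from the 1632 example shown earlier on this blog; the function name tag_epub and the EBOOK_META variable are made up for this example, and you should check ebook-meta --help on your version before relying on any particular option.

```shell
# Hypothetical helper: write a consistent set of tags into one EPUB file.
# EBOOK_META is an assumed override for the command name.
tag_epub() {
    file=$1
    # Title and author as read, no sort forms; tags separated by commas.
    "${EBOOK_META:-ebook-meta}" "$file" \
        --title '1632' \
        --authors 'Eric Flint' \
        --publisher 'Baen Publishing Enterprises' \
        --language en \
        --date 2000-02-07 \
        --isbn 0-671-31972-8 \
        --series 'Ring of Fire' \
        --index 1 \
        --tags 'Science Fiction'
}
```

Several authors would go into a single --authors argument, separated by “&”, as described under Author(s) above.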