Minecraft: Castle Good Hope Construction Site

November 18th, 2014

I somehow lost interest in this (more on that later), so it’s basically a construction site, but I thought I’d write up how I constructed it.

I was looking around for some more baroque things I could build, maybe a small fortified city or something like that. Initially I came upon the Peter & Paul Fortress in St. Petersburg, but I could not find any useful plan or schematic.

And then I found this: Castle Of Good Hope, Overview. It has a scale, measures around 240×240 meters, and the layout looks gorgeous. The picture has around 2 pixels per meter, so it can be scaled easily. This star-shaped fortress was built between 1666 and 1682 by the Dutch East India Company (VOC).

I loaded the layout into GIMP, fiddled around with the colours so that the black pixels were blacker and thicker, and the colours deeper. Then I made a custom palette with five colours (black, white, orange, blue, green), converted the image from RGB to indexed with that palette, replaced white with alpha, and rotated it 60 degrees counterclockwise (because I wanted the walls of the building in the middle to be straight). Finally I scaled it to 50%, which gave me this: GoodHope
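The scaling arithmetic can be checked quickly. This is only a sketch, assuming that every pixel of the final image becomes one Minecraft block and that one block stands for one meter; the function name is made up for illustration.

```shell
# blocks_for METERS PX_PER_METER SCALE_PERCENT
# Width in pixels of the scaled plan, which is also the width in blocks
# if every pixel becomes exactly one block (an assumption of this sketch).
blocks_for() {
    echo $(( $1 * $2 * $3 / 100 ))
}

blocks_for 240 2 100   # source picture: 480 pixels across
blocks_for 240 2 50    # after scaling to 50%: 240 pixels, so 240 blocks
```

The rotation enlarges the canvas, but it does not change the size of the fortress itself, so the figures above still hold.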

This image I then converted to an mcedit-schematic. There are some free online tools which can do this, some standalone programs and even some mcedit filters. Which gave me this: Floorplan Schematic, which of course is not very precise. This I placed on the ground with mcedit, and then I started building up, while interpolating and correcting.

This is how it looks right now, from a bit above the ground:


And from above:


Including the woollen plan beneath it, and with various sandstone blocks to mark where buildings should go.

Now, if you take a close look (or look at a View of Castle Good Hope or a Closeup of the Gate of Castle Good Hope) you realize what this consists of: basically (granite) cobblestone topped off by red brick! Ugh! That was the moment I slowly started to decide that I would not finish this.

Add to this that I could not find an elevation plan of it (only of the earlier square-shaped fortress) and that various details remained opaque (although the Google Maps view of Castle Good Hope helped a bit; you can see where the stairs are), and I decided to leave it after I’d brought it into some kind of shape.

So if anyone wants to continue this (perhaps someone from Cape Town?), I publish it here. License is public domain, you can do whatever you wish with it. It’s in 1:1 scale, as accurate as possible. The walls are maybe 70% done, although I don’t know if I’ve got the height right, and the topping-off is a bit higher, since I wanted to have half-slabs on top to stop spawning. Also, there’s only grass where it didn’t interfere with the rest of the walls. The Oranje and Leerdam bastions have their gun emplacements, Buuren only two of the five it should have; the rest none yet. And I’m not quite happy with how they look. The entrance lacks everything. The buildings are somewhat staked out, some with approximate height.

The world save is just what you can see in the pictures, the schematic however is the walls only, so you can put it into your world and fill in the rest with whatever you want.

Here they are

On Ebooks and pricing

November 17th, 2014

I’m an avid reader. And I’ve got quite a collection of books. But apart from some pdfs I bought at rpgnow and some I’ve got from humblebundle, I bought exactly one ebook on the internet.

One reason for this is DRM. I can’t stand it, and I vote with my wallet. I will never support such a completely consumer-hostile scheme, not with my money, and I even try to boycott the most egregious abusers of it by not buying their other products as well. Amazon for instance. There are of course some other abuses producers can indulge in which might make me want to boycott them, like lobbying for extensions of copyright or ripping off the academic communities with journals and other such rent-seeking activities. But I won’t go into detail here; if buying a book is not as easy as can be or the book has antifeatures like DRM, you can stop right now, you’ve already lost.

Another reason is price. I’m very well aware that producing an ebook has a lot of the same fixed costs as one on paper has, but still, ebooks are much, much too expensive. The one thing that most people seem to forget is that while ebooks have the same fixed costs (basically the writer and the editor; see Charlie Stross’ blog for details), there are practically no costs associated with the individual file you sell. So the base price of the individual unit, which used to come from printing, stock and distribution, falls away; what remains is the amount for writer, editor, marketing, etc., which can be spread out over as many units as you like.

The question is, where actually is the “right” price for ebooks (and movies, by the way, which also seem to be much too expensive)?

The facts to keep in mind are: Budget and time are constrained on the buyer’s side. For most books, the things that tell stories are the immediate competition: movies and computer games. But basically everything else that entertains people is competing with books, like forums and blogs and other places of discourse on the internet. And the public domain will also compete with newer books. Also, while people can (and will, regardless of copyright) share books with their friends, they can’t actually resell them and recoup some of the money they spent. So the price needs to be rather low, to compete with all the other offerings, with public domain books, and with used books on paper.

Now, I’ve noticed from my behaviour regarding computer games that I actually re-bought a lot of games I already had, at gog.com or steam. Why? Mostly because they were cheap. With some autumn, summer and Christmas sales, I’ve re-bought just about all the computer games I’ve already had.

This leads me to the conclusion of what the “sweet price” for ebooks is: the price at which I can re-buy all the books I’ve ever had on paper or ever read within the space of maybe a year or two. In my case, that’s probably around 2000 books, some of which I got for SFR 1 at garage sales, some I borrowed from libraries, some I bought at retail price. The price for all of them should probably be below SFR 10’000; and with the SFR 1 paper ones as measuring stick, each book should probably not cost a lot more than that. Say between SFR 1 and 4 (one Swiss franc, by the way, is slightly more than a dollar nowadays).
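As a rough check of that range, here is a small sketch using only the figures above (2000 books, SFR 1 to 4 per book, a ceiling of SFR 10’000). The function name is made up.

```shell
# library_cost BOOKS PRICE_PER_BOOK -> total in SFR
library_cost() {
    echo $(( $1 * $2 ))
}

library_cost 2000 1   # 2000
library_cost 2000 4   # 8000, still below the SFR 10'000 ceiling
library_cost 2000 5   # 10000, the per-book price at which the ceiling is reached
```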

Of course, you won’t want to make every book this price; you might want to put prices of popular ones more towards SFR 4 and unpopular ones more towards SFR 1, and of course, for new releases you want to have prices rather towards the prices of the paper version, SFR 15 maybe. Still, even new releases should have prices markedly lower than the paper version, since you really don’t have to take any logistics into account. Plus, you might want to have sales with huge discounts, and bundles, like “all the works of Isaac Asimov for SFR 30”. Take a look at steampowered.com around Christmas, and you’ll see what I mean.

Exactly the same thinking can of course be applied to other digital goods, like movies and television series. I suspect the prices there could be a bit higher, maybe something around SFR 3-10 for movies, and maybe SFR 1-4 for single episodes, or SFR 10-20 for whole series. And again, newer ones priced above that, and bundles (“all the James Bond movies for SFR 50”).

The main point here is: There is a price that is so low, people will buy these things just to have them (or even out of fear that “it will cost more after this sale”), regardless of whether they even get around to reading or watching them. Just for the sake of collecting. Because they remember them, or because they’ve heard about that author and plan on reading something by him some time in the future. It doesn’t even matter if they have already gotten that book from a friend, for free. And, most importantly, it doesn’t matter if they will ever find the time to read it; or even expect to find the time.

Questions for game-system rule makers

November 2nd, 2014

As I’ve been playing some computer RPGs, and read some of the changelogs and wishlists, I noticed some issues relating to history and physics which I will address here, in the hopes it might help game designers to achieve more believable game-(or mostly combat-)systems.

Q: Does a weapon which has the whole mass distributed along its length do more damage on impact than one that centers it on the top?

A: A two-handed sword distributes its maybe 2.5kg along the say 160cm of the blade, with its centre of gravity actually near the hilt. A polearm has a lot of its 2.5kg near the top of its 240cm. It’s clear that the momentum of the polearm upon hitting will be much higher than that of the sword, and thus the damage it can inflict. On stabbing motions however, both weapons will inflict similar damage as this is mostly dependent on the user. The advantage of the sword is of course control, which is much better with the centre of gravity near the user’s hand.

Of course, this is something a lot of games get wrong.

Q: Heavy is bad?

A: For armour, yes. You want the maximum protection at the least weight. Or maybe the weight you can wear in battle, which is around 20-40kg, depending on your size. Heavier armour than that was only worn for tournaments, and there only on horse.

So don’t make your full suits of armour heavier than that. If your game considers size, have it influence the weight of the armour (and the fun of having the player find armour which just doesn’t fit), otherwise make it 30kg (or less for especially good armour…)

For weapons, you want something rather heavy that you can still control with ease, which brings us to…

Q: How much mass and momentum on an elongated hitting device can you control with one hand or with two hands?

A: This depends a bit on the length and centre of gravity, but it’s about 1kg for a 1 meter long thing, and 2.5kg for a 2.5 meter long thing. Which is nicely supported by historical evidence: One-handed weapons tend to have a weight around 1kg with 1 meter length, and all polearms weigh around 2.5kg with a length of 2.5m. Two-handed swords also tend to weigh 2.5kg with a length of 150cm (with shorter ones being lighter).

Of course this may vary a bit depending on who made it and who wants to wield it, but usually history shows weapons to be much lighter than their equivalents in game systems (It’s gotten a bit better. D&D, 1st ed. shows a one-handed sword at 6lbs and a halberd at 15lbs, D&D, 5th ed. shows them at 3lbs and 6lbs).
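To put those D&D figures next to the historical ones (about 1kg for a one-handed weapon, about 2.5kg for a polearm), here is a quick conversion sketch. It assumes 1 lb = 0.45359237 kg; the function name is made up.

```shell
# lbs_to_kg POUNDS -> kilograms, rounded to one decimal place
lbs_to_kg() {
    awk -v lbs="$1" 'BEGIN { printf "%.1f\n", lbs * 0.45359237 }'
}

lbs_to_kg 6    # 1st ed. one-handed sword: 2.7
lbs_to_kg 15   # 1st ed. halberd: 6.8
lbs_to_kg 3    # 5th ed. one-handed sword: 1.4
lbs_to_kg 6    # 5th ed. halberd: 2.7
```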

Q: Why do you think there are flat wide arrowheads used for hunting and tetragonal ones for war?

A: It’s about damage to flesh versus armour-piercing. This can be very much generalised: A pointy bit used for stabbing that’s broad and flat will probably do more damage to flesh but its chance to penetrate armour will be lower than one that’s square.

This means, with thrusting, damage will depend on that, and mostly on one other factor: whether the weapon was used two- or one-handed or what device was used to launch the thrust.

Q: What’s the difference between a blade and a pointy extrusion?

A: Pick or axe? As far as the pointy versus cutty is concerned, this again is a question of damage to armour versus damage to flesh.

I already answered the case where the stabbing bit is at the end of something and is used to thrust. But this is a bit different, since we’re actually hitting, not thrusting. And the momentum will vary a lot depending on how long this thing is and whether it’s used with both hands or not; and also, the momentum will usually be much bigger than with a straight thrust.

A lot of polearms will allow you to choose whether you want to hit your opponent with a blade or a pick, depending on what kind of armour he wears.

Q: What’s the difference between a rounded blade and a straight blade?

A: The question is, do we have a cutting or hitting edge. This is also a question about the armour worn on the other side. The difference between round and straight edges will probably be small, with the straight edge transferring more energy to the target, whereas the round edge will convert some of that energy into a lateral motion (cutting). The cutting will be rather useless against things it can’t cut, so it’s probably less useful against things like chainmail, whereas the damage might be bigger against things it can cut (leather, skin, flesh).

The other thing of course is the question what happens if the whole thing is curved, and there the answer is that with curved blades you can stab around something, making stabs more difficult to parry.

All in all, if you don’t have mechanisms to take these two issues into consideration, just treat them as equal.

Q: Why would you want to ditch a shield for a two-handed weapon?

A: Because if you’ve got two hands to use on the weapon, you’ve got more control, can use longer weapons, have more momentum and inflict more damage. And since you need something to take care of incoming projectiles, you have armour to take care of that.

You’ll notice that shields vanish from the battlefields with the advent of late medieval plate armour, made of steel and getting more and more hardened with time. Because that’s the thing that stops most projectiles. You’ll also notice that the Romans had some kind of “plate” armour too, but still carried shields. That’s because it was made of iron (or even bronze), not steel, and could be rather easily penetrated by arrows.

Q: Why wouldn’t you want to ditch a shield for a second weapon?

A: If it’s about parrying, the bigger the thing you use to parry is, the bigger the chance of not getting hit. If it’s about projectiles, a second weapon won’t help you, but a shield will. And lastly: You can’t hit somebody with two weapons at the same time. So you’ll use one to bind the other’s weapon (parry) and attack with the other one. And where’s the advantage in that? You could do it with a shield as well.

Of course, if your opponent only has a one-handed weapon and no shield, you will have an advantage (or no disadvantage, if the other also uses a second weapon).

Dual wielding is inferior to anything but single-handedly wielding only one weapon.

Q: If I had a blade on the other end of the weapon could I hit the enemy with it?

A: Yes, as long as the blades are short (or just a pointy bit), and the stick is long, it makes perfect sense. If not, the blade on the other end makes it impossible to fight with others alongside you, and it makes you lose momentum and control because of the counterweight.

So these things that are basically two swords attached to a hilt in opposite direction are completely useless. Unwise.

Q: What about the difference between a longsword and a broadsword?

A: Actually, “longsword” does not exist. “Long sword” does, and it refers to a small late medieval two-handed sword, about as big as you can still carry on the hip. In the 19th century it was mis-named “bastard sword” or “one-and-a-half-hander”.

But the broadsword is a term used to distinguish it from the smallsword in the 17th century. Both broadsword and smallsword have about the same length (around 1 meter) and the same weight (around 1 kg). The blade of a broadsword is just so much broader and thinner. It probably has some implications regarding bigger wounds inflicted versus reduced capability of piercing armour compared to the smallsword.

Unless your setting incorporates smallswords, forget about broadswords.

Q: How do you carry a weapon?

A: On the hip. And if it’s too long, on the shoulder. Yes, that’s it. Apart from small weapons in your boot, throwing knives on your arms or chest and other things like that, you carry it on your hip. Even quivers, unless you’re an American Indian. And you don’t carry weapons you intend to use on your back, because you wouldn’t be able to draw them.

Yes, you could draw some short sword or machete from your back, but then, while you’re drawing it, you’re wide open to attack. There’s a reason nobody ever did that in history.

Q: How are quality differences in arms and armour expressed?

A: Basically, it varies with a) the materials and b) the techniques used to process these materials, and both tend to get better with time (unless suddenly constrained by financial or logistical questions).

Usable materials for weapons are wood (sharpened stick, hardened in fire), flint, copper (yes, there was actually a “copper age”), bronze, iron and steel, plus maybe some mythical metals such as mithril. For armour it’s leather, wood, copper, bronze, paper, cloth, iron, steel (with leather and copper so bad, you don’t want them).

The general mechanics is this: It must be workable; it should not break; it should be able to have an edge; it should keep an edge, it should not bend and stay bent, it must have the right weight. You can’t really have a sword of a material that has a totally different weight unless you make it smaller or more massive. Since weight and possible damage are mostly fixed by the form the product takes, you need to differentiate mostly on durability. Which of course is more interesting for armour, because there it also impacts the protection it offers.

Usually the one that matters is iron and steel.

And there’s a huge variance between different things made of steel. Depending on the techniques used (and whether the ore found already contained the right traces of other elements and carbon) you get from rather soft (roman lorica segmentata) to incredibly hard and resilient (gothic plate armour).

So rather than invent a plethora of new materials, just add techniques (look for “damascene steel”, “crucible steel” and “wootz”) or flowery names of where the steel (or even the product) should come from. It was even common to refer to the workshop. So a “Helmschmied breast plate” or an “Ulfberht sword” might be rather exquisite.

With armour, you could also conflate several layers of armour that were worn above each other at various times: tunic and unriveted chainmail, tunic and riveted chainmail, tunic and lorica segmentata, gambeson and chainmail, gambeson and chainmail and coat-of-plate, light gambeson and chainmail and brigantine, light gambeson and chainmail and (soft) full plate, arming doublet and (hardened) full plate. You get the picture.

Q: But a bronze weapon will cause less damage than a steel weapon?

A: No. Bronze is soft and can be ground to an extremely sharp edge in a very short time. Which it will also lose rather fast. It tends to bend and can’t be worked into very long shapes, which is why bronze axes are more interesting than swords. But against flesh, the effect of a bronze weapon is the same as if it was iron or steel.

The thing changes very much when it comes to armour. A bronze edged weapon goes through leather like butter, has troubles against bronze armour and can’t do anything against anything made of iron (except battering; which incidentally works also extremely well against bronze armour, nicely against iron and soft steel, and not at all against hardened steel).

Q: Leather armour is bad?

A: Well, it’s not armour in most cases, since even stone weapons cut through it nicely, let alone bronze weapons or even medieval eating knives (yes, I rammed an eating knife through 1cm of so-called cuir bouilli with ease). It’s one of these roleplaying-game myths.

Leather was used within armour, though, as a carrier for riveting on small metal plates for instance, and sometimes also to cover these up (leading to something which looks like leather with rivets on it).

Just don’t use it; if you need light armour, go with gambesons or other armour made of layers of cloth, or with only parts of armour. A gothic breast plate and an open helmet don’t inhibit any movement, do not make noise (even less than leather would) and they weigh about 4kg, respectively 2.5kg, but they protect your vitals.

Q: Can I swim with armour?

A: Basically, no. Wearing chainmail, you’re 8-20kg heavier than otherwise, and most people can swim only 2-4 meters with that. Plus there’s probably some gambeson beneath your chainmail. With gambesons alone, in the league from 4-8kg, your chances are better, you might get some 50 meters until any trapped air has gone out and the whole thing starts sucking up water.

The useful thing to do is to get rid of armour while you’re sinking, which actually might be possible with chainmail or gambeson (although you need to loosen your belt), or some parts of plate armour (probably neither shoulders nor arms and legs).

Pictures from the late middle ages show soldiers swimming for an attack in their underpants, shoes (they’re rather light: my reconstruction half-high boots are 480g each) and hats, with their pikes(!).

Swisscom Peering Policy Perversions

May 21st, 2014

What is peering?

If you want to send data from one internet provider to another, it first goes via an upstream, a larger internet provider to which other internet providers are connected. You pay for this upstream.

Peering means setting up a direct link to the other ISP and routing all traffic from and to that ISP (but only that traffic) directly over it. This is relatively cheap if you already have infrastructure in the same datacenter, and there are associations, here SwissIX, which run the shared infrastructure (the switches) in these datacenters.

With peering, both sides save upstream costs, and the customers benefit from shorter paths, that is, faster access. So in most cases it is a win-win situation.

There are isolated cases where one partner benefits more than the other; typically the one who benefits is the one who pulls in more data than it delivers.

Swisscom sucks

Swisscom is one of the largest end-customer providers in Switzerland, and therefore also one of the largest receivers of data. You would expect Swisscom to have a very strong interest in peering, especially with providers whose customers offer sites that are popular with Swisscom customers.

Instead, Swisscom demands a monthly rent. In other words, Swisscom saves upstream costs, Swisscom’s customers get better access to websites, and Swisscom gets paid for it on top of that.

The other ISP saves some upstream costs, and then hands money straight back to Swisscom instead. Financially this can only work with a very large saved data volume, when the savings are larger than the monthly rent paid to Swisscom.

The only reason this can work is that the customers of other ISPs who offer websites have an interest in their sites reaching Swisscom customers quickly. And they have this interest because Swisscom has a large share of all end customers connected to it. A very clear abuse of a dominant market position.

In fact, providers in Switzerland have already fought back; Init7, for example, won a partial victory against Swisscom. But the fact that Swisscom can still charge money for peering clearly shows that there is no trace of competition, and that Swisscom is still prepared to play off its customers, and the quality of their internet connection, against smaller internet providers.

The losers of Swisscom’s monopoly-rent policy are the other internet providers, those providers’ customers who offer services, and Swisscom’s own customers.

A Guide to Movie Encoding

April 26th, 2014

This is a guide to encoding and recoding movies, mostly on Linux, and also partly a rant against the most egregious practices.

I’m talking of encoding here, but actually, just about all the sources you can get movies from will already be encoded, be it DVDs, Blu-rays, modern cameras or files. Very rarely will you get an unencoded stream, e.g. from a VHS. So all this actually applies mostly to re-encoding.

Also, being on Linux, one of the main requirements is that all the formats are supported by open source software. I don’t care about any possible patent violations, because those would involve software patents, and these would have been granted illegally anyway.

The tools used and denoted by fixed font are Linux commands or Debian/Ubuntu packages; but most of the software is available on other platforms as well.

Use the source

The quality of the encoding relies most heavily on the quality of the source you have. The more artifacts — no matter where from, be it from the actual film, dust, scratches, the projector, the camera, the VHS-drive, or the more modern electronic encoding-artifacts — the bigger the encoding will get to retain the same quality as the source. Some of the worst things I’ve seen are early black and white movies with loads of dust, scratches and grain.

Basically, artifacts increase the entropy, and the more entropy the less compression is possible.

  • Use the best source available. Usually Blu-ray, unless the producer just interpolated from a DVD, in which case adjust the resolution back down to the DVD level, usually 720 wide (but 704 or 352 is possible).
  • Codecs matter. Some are so notoriously efficient at encoding artifacts that any re-encode will actually increase the file size. DIV3 is one such.
  • Otherwise you might gain from 20% to 50% by re-encoding DIVX, XVID, DX50 with a better codec, with no loss in visible quality. And of course, with MPEG2 from DVDs you can gain around 80-90% space, and with MPEG-4 AVC or VC-1 from Blu-rays, around 50-80%, depending on your quality needs.
  • Generally, a 500MB file encoded from a Blu-ray will look much better than a 500MB file encoded from a DVD, at the same resolution. Actually, you might even get a better resolution from the Blu-ray, at the same file size.

Acquiring target

For the target, there are basically three factors that matter in the overall quality: container, codec and encoder. Apart from resolution, of course, but there the maximum is dictated by the source.

  • Container is easy: It must support multiple audio streams, multiple subtitles, preferably in textual format (e.g. srt or ssa), and metadata, preferably also cover images. This Comparison of container formats makes clear this is Matroska, probably followed by MP4.
  • Codec is a bit more tricky. But basically, you want one of the newer ones; they consistently offer better quality at lower file size. Which about leaves H.264 and VP9. You probably want H.264: Blu-rays already mostly come in it, and so do YouTube videos nowadays.
  • Stop using DIV3, DIVX, XVID, DX50 right now. They’re vastly inferior compared to what modern codecs deliver in half the filesize.
  • Audio codecs don’t have a large influence on file size. But as AC3 can’t do VBR, you don’t want that, and MP3 can’t do more than 2 channels. That leaves AAC and Opus as viable options, which happen to be the defaults to go with H.264 and VP9 respectively. Don’t use AC3, and don’t use DTS, both are obsolete.
  • Fortunately, handbrake-gtk already comes with H.264 and AAC as defaults; you only need to set the container to Matroska, and you’re good. A quality factor RF of 20 is usually good; 25 is still acceptable; everything above that is visibly bad.
  • If you’ve already got a load of MP4-files encoded with H.264 and AAC, mmg (from mkvtoolnix-gui) can rewrite the container of the file to Matroska without re-encoding. And it also supports adding more audio tracks, subtitles and image attachments.
  • If you want to reduce the dimensions of the movie in order to reduce filesize, don’t go below a width of 720. Actually, rather reduce the quality somewhat before reducing dimensions; the visual impact is less noticeable.
  • Don’t ever go for a “filesize of 700MB”, that’s just stupid. Nobody wants to put a movie on a CD (and actually most people wouldn’t, even 15 years ago).
  • But be careful about filesize. Sadly, there are still VFAT filesystems out there which can’t contain files bigger than 4GB (or even 2GB, depending on the implementation), some of them used by today’s “Smart” TVs.
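For those who prefer the command line, the choices above can be sketched with ffmpeg and mkvmerge. Note that these are swapped in for handbrake-gtk and mmg, not what the GUIs run internally. The sketch assumes an ffmpeg build with libx264, and that its `-crf` value plays the role of the RF setting; the function names are made up.

```shell
# Target name: same basename, Matroska extension.
mkv_name() {
    printf '%s\n' "${1%.*}.mkv"
}

# Re-encode the video to H.264 and all audio tracks to AAC.
# $1 = source file (must not itself end in .mkv, or the output would
#      collide with the input), $2 = quality factor, defaults to 20.
reencode() {
    ffmpeg -i "$1" -map 0:v -map 0:a \
        -c:v libx264 -crf "${2:-20}" -c:a aac \
        "$(mkv_name "$1")"
}

# Rewrite an existing H.264/AAC MP4 into Matroska without re-encoding.
remux() {
    mkvmerge -o "$(mkv_name "$1")" "$1"
}
```

Usage would be something like `reencode movie.vob 20` or `remux movie.mp4`. Subtitles are left out here on purpose, since whether they can be copied depends on their format.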

Dub stands for Dumb

There is only one reason for dubbing a movie — making it available to children who haven’t learned to read yet, and to the illiterate.

  • Whoever had, or still has, the idea to voice-over a movie instead of just leaving the original language alone and subtitling it, is a total moron. And so is everyone encoding a movie with such an audio track. However, it is acceptable to voice-over parts with foreign speakers in documentaries (but not the whole documentary!).
  • If you still want to encode a dubbed audio track, make sure to also include the original language track. If it’s not possible with your container format, you’re using the wrong one.
  • Since not everyone is expected to read every language, include all available subtitles. Again, if your container doesn’t allow that, you’re using the wrong one.
  • Hardcoded subtitles (within the movie stream itself) probably mean you’re either a moron or using the wrong software. They’re only acceptable if the source had them too.
  • Those pesky vobsub-files, which are actually (mpeg-)streams, can be OCR’d to text files (srt, ssa) with vobsub2srt. Whatever vobsub2srt cannot recognize can be OCR’d with SubRip (works with wine), for instance, but it will require heavy work. So you would be better off either getting them from opensubtitles.org or just including the vobsub.
  • Subtitles that are out of sync can be fixed with subtitleeditor. If they just start late or early, you can also just set an offset within mmg (from mkvtoolnix-gui).
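For srt files, a constant offset can also be applied to the text file itself. This is a minimal awk sketch, not what subtitleeditor or mmg do internally; it assumes timestamps of the usual form HH:MM:SS,mmm, and the function name is made up.

```shell
# shift_srt OFFSET_MS < in.srt > out.srt
# Adds OFFSET_MS milliseconds (may be negative) to every timestamp.
shift_srt() {
    awk -v off="$1" '
    # "HH:MM:SS,mmm" -> milliseconds
    function to_ms(t,   p) {
        split(t, p, /[:,]/)
        return ((p[1] * 60 + p[2]) * 60 + p[3]) * 1000 + p[4]
    }
    # milliseconds -> "HH:MM:SS,mmm", clamped at zero
    function fmt(ms,   h, m, s) {
        if (ms < 0) ms = 0
        h = int(ms / 3600000); ms -= h * 3600000
        m = int(ms / 60000);   ms -= m * 60000
        s = int(ms / 1000);    ms -= s * 1000
        return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
    }
    # Timing lines look like: start --> end
    / --> / {
        print fmt(to_ms($1) + off) " --> " fmt(to_ms($3) + off)
        next
    }
    # Everything else (counters, text, blank lines) passes through.
    { print }
    '
}
```

For example, `shift_srt 2000 < movie.srt > movie.fixed.srt` delays every subtitle by two seconds.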

Finishing Touches

After having a decent file, you might want to add metadata and (if applicable) cover-images.

  • The minimum metadata you need to provide is title, year and director (yes, there are at least two movies with the same name, published the same year).
  • If the movie is a known published one, can fetch the metadata, and my nfo2xml can convert it into a Matroska meta-data xml which can be muxed in with mmg.

Scanning Books on Linux

March 24th, 2014

I’ve been scanning books for a long time, and it’s always kinda problematic with the toolchain, and with the workflow. Now I’ve pretty much found out what works, and what does not.

As a note: All the shell-code on this page assumes your file names do not contain spaces or other weird characters like “;”. If they do, my bicapitalize can fix that.
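To find out beforehand whether any file names would cause trouble, a small check like the following can help. It is only a sketch: the function name is made up, and the set of “safe” characters (letters, digits, dot, underscore, hyphen) is an assumption that is stricter than strictly necessary.

```shell
# is_safe_name NAME: succeeds if NAME contains only letters, digits,
# dot, underscore and hyphen.
is_safe_name() {
    case "$1" in
        *[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# List the files in the current directory that should be renamed first.
for f in *; do
    [ -e "$f" ] || continue
    is_safe_name "$f" || printf 'needs renaming: %s\n' "$f"
done
```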


The first thing you want to have is a decent scanner, preferably one with an automatic document feeder (ADF). According to the internet, what you need would be the Fujitsu ScanSnap iX500, since it appears to be the most reliable.

However, that’s not the one I have, mine is an EPSON Perfection V500, a combined flatbed/ADF scanner, which needs iscan with the epkowa interpreter. It works, but it’s rather slow.

Scanning Software

I mostly use xsane. With the epkowa-interpreter, it gives me some rather weird choices of dpi, that’s why I mostly scan at 200×200 dpi (I would recommend 300x300dpi, but epkowa does not give me that choice, for some weird reason). Also, I scan to png usually, since this gives me the best choices later on, and is better suited to text than jpeg.

Of course, never scan black-and-white; always colour or greyscale. Also, don’t scan to pdf directly. Your computer can produce better pdf files than your scanner does, and also, you would need to tear the files apart anyway for postprocessing.

Get Images from PDF

If you happen to have your images already inside a pdf, you can easily extract them with pdfimages (from poppler-utils):

pdfimages -j source.pdf prefix

Usually, they will come out as (the original) jpeg files, but sometimes you will get .ppm or .pbm. In that case, just convert them, something like so:

for i in *.ppm; do convert "$i" "${i%.ppm}.jpg"; done

(The convert command is of course from graphicsmagick or imagemagick)
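Since pdfimages may hand you both .ppm and .pbm files, the loop above can be wrapped into a small function that handles both and survives odd file names. This is only a sketch: the name pnm_to_jpg is made up for this example, and it assumes an ImageMagick-style convert is in your path (with plain graphicsmagick the command would be gm convert).

```shell
# Convert all .ppm/.pbm files in the current directory to .jpg.
# pnm_to_jpg is a hypothetical helper name, not an existing tool.
pnm_to_jpg() {
    for f in *.ppm *.pbm; do
        [ -e "$f" ] || continue        # glob matched nothing, skip the literal pattern
        convert "$f" "${f%.*}.jpg"     # page-001.ppm -> page-001.jpg
    done
}
```

Run it as pnm_to_jpg inside the directory holding the extracted images; the originals are left in place.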

Postprocessing Images

Adjust colour levels/unsharp mask

Depending on how your scan looks, you might want to change colour-levels or unsharp mask first. For that, I’ve written some scheme-scripts for gimp:


The scheme-files belong into your ~/.gimp-2.8/scripts/ directory, the shell-scripts into your path. Yes, they’re for batch-processing images from the commandline.


If the DPI is screwed, or not the same for every file, you might want to fix that too (without changing the resolution):

convert -density 300 -units PixelsPerInch source.jpg target.jpg
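For a whole stack of scans, the same call can be applied in a loop. A minimal sketch, assuming the convert invocation above does what you want for a single file; set_dpi and the dpi-fixed output directory are names invented for this example.

```shell
# Apply the convert call from above to many files, writing copies to ./dpi-fixed/.
# set_dpi and dpi-fixed/ are hypothetical names.
# usage: set_dpi 300 *.jpg
set_dpi() {
    dpi=$1
    shift
    mkdir -p dpi-fixed
    for f in "$@"; do
        convert -density "$dpi" -units PixelsPerInch "$f" "dpi-fixed/$(basename "$f")"
    done
}
```

Writing to a separate directory keeps the original scans untouched in case the result is not what you expected.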

Tailor Scans

If your scans basically look good, as far as brightness and gamma is concerned, the thing you need is scantailor. With it, you can correct skew, artifacts at the edges, spine shadows, and even somewhat alleviate errors in brightness.

Be sure to use the same dpi in the output as in the input, as scantailor will happily blow up your output at no gain of quality. Also, don’t set the output to black-and-white, because this will most probably produce very ugly tooth-patterns everywhere.

You will end up with a load of .tif images in the out-folder; which you either can shove off to OCR directly, or produce a pdf out of it.

Don’t even try to use unpaper directly. It requires all the files converted to pnm (2MB jpeg will give 90MB pnm), and unless your scans are extremely consistent and you give it the right parameters, it will just screw up.

Create PDF from Images

We first want to convert the tif-images to jpeg, as it will be possible to insert them into a pdf file directly, without converting them to some intermediate format. Most of all, this will allow us to do it via pdfjam (from texlive-extra-utils) which will do it in seconds instead of hours.

for i in *.tif; do convert "$i" "$(basename "$i" .tif).jpg"; done

And then:

pdfjam --rotateoversize false --outfile target.pdf *.jpg

NEVER, ever use convert to create pdf-files directly. It will run minutes to hours at 100% load and fill up all your memory, or your disk, and produce huge pdf-files.
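The two steps above (tif to jpeg, then pdfjam) can be chained so that exactly the freshly converted pages end up in the PDF. A sketch only: tif_to_pdf is a made-up name, and pages are added in the order the shell sorts the file names, so this assumes your page numbers are zero-padded.

```shell
# Convert the .tif pages in the current directory to .jpg and bundle them into one PDF.
# tif_to_pdf is a hypothetical helper; usage: tif_to_pdf target.pdf
tif_to_pdf() {
    out=$1
    set --                              # reuse "$@" as the list of jpeg pages
    for f in *.tif; do
        [ -e "$f" ] || continue         # no .tif files at all
        convert "$f" "${f%.tif}.jpg"
        set -- "$@" "${f%.tif}.jpg"
    done
    if [ "$#" -eq 0 ]; then
        echo "no .tif files found" >&2
        return 1
    fi
    pdfjam --rotateoversize false --outfile "$out" "$@"
}
```

Note that convert is only used for the per-page jpeg conversion here; the PDF itself is still assembled by pdfjam.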

Create PDF Index

Even if your PDF consists entirely of images, it might still be worthwhile to add an index. You create a file like this:
[ /Title (Title)
/Author (Author)
/Keywords (keywords, comma-separated)
/CreationDate (D:19850101092842)
/Publisher (Publisher)
/DOCINFO pdfmark
[/Page 1 /View [/XYZ null null null] /Title (First page) /OUT pdfmark

And then add it to the PDF with gs:
gs -sDEVICE=pdfwrite -q -dBATCH -dNOPAUSE \
-sOutputFile=target.pdf index.info \
-f source.pdf

The upper part, the one with the metadata is entirely optional, but you really might want to add something like this. There’s some other options for adding metadata (see below).
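Typing one /OUT line per chapter gets tedious, so the bookmark lines can be generated from a plain list of "page title" pairs. This is a sketch under assumptions: make_bookmarks is a made-up name, the input format (page number, whitespace, title) is my own choice, and it only handles plain ASCII titles.

```shell
# Read "page title" lines from stdin and print one pdfmark bookmark line each.
# make_bookmarks is a hypothetical helper; the input format is an assumption.
make_bookmarks() {
    while read -r page title; do
        [ -n "$page" ] || continue      # skip empty lines
        # Backslashes and parentheses are special inside PostScript strings.
        esc=$(printf '%s' "$title" | sed -e 's/\\/\\\\/g' -e 's/[()]/\\&/g')
        printf '[/Page %s /View [/XYZ null null null] /Title (%s) /OUT pdfmark\n' "$page" "$esc"
    done
}
```

Something like make_bookmarks < chapters.txt >> index.info would then append the bookmarks to the metadata block shown above.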

Another option is jpdfbookmarks, however it doesn’t seem to be very comfortable either.


The end product you want from OCR is either a) a PDF (or EPUB) in which text is really native text and not an image of text, rendered in a system font, or b) a PDF in which the whole image is underlaid with text, so that each image of a character is underlaid with the (hopefully correctly recognized) character itself.

Sadly, I don’t know any software on Linux which can do the latter. Unless you want to create an EPUB file, or a PDF which does not contain the original image on top of the text, you need to use some OCR software on some other platform. The trouble of course is, going all the way (no original image of the page) means your OCR needs to be perfect, as there is no way to re-OCR, or sometimes even no way to correct the text manually. And of course, the OCR software should retain the layout.

For books, doing a native text version is of course preferred, but for some things like invoices, you really do need the original image on top of the text.

Apparently, tesseract-ocr now incorporates some code to overlay images on text, but I haven’t tested that. Also, there seems to be some option with tesseract and hocr2pdf. But I’m not keen to try it, since ocropus, which can’t do that, has consistently had the better recognition rate, and even that one is lower than those of commercial solutions.
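For anyone who wants to try the tesseract route anyway, it might look roughly like this. This is untested here and rests on assumptions: a tesseract version recent enough to ship the pdf output config, pdfunite from poppler-utils for joining the pages, and the made-up function name ocr_to_pdf.

```shell
# OCR each .tif page into a searchable single-page PDF, then join the pages.
# ocr_to_pdf is a hypothetical helper; usage: ocr_to_pdf eng target.pdf
ocr_to_pdf() {
    lang=$1
    out=$2
    set --                                          # collect the per-page PDFs in "$@"
    for f in *.tif; do
        [ -e "$f" ] || continue
        tesseract "$f" "${f%.tif}" -l "$lang" pdf   # writes <name>.pdf
        set -- "$@" "${f%.tif}.pdf"
    done
    [ "$#" -gt 0 ] || return 1                      # nothing to join
    pdfunite "$@" "$out"
}
```

Whether the recognition quality is good enough is a separate question, as noted above.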

Adding metadata to PDF files

I’ve written about this, and I’ve also written some scripts to handle this. You can do it by hand, with exiftool, or you can use my exif-meta which will do it automatically, based on the file- and directory-name, for a lot of files.

For books, unless Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1” you want to at least add Author, Title, Publishing Year, Publisher. ISBN if you have one.
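Doing it by hand with exiftool could look like the sketch below. Title and Author are ordinary PDF info tags; putting the year and publisher into the XMP Dublin Core fields is my assumption for this example, and exif-meta may well handle them differently. The wrapper name set_book_meta is made up.

```shell
# Write the minimum book metadata into a PDF with exiftool.
# set_book_meta is a hypothetical wrapper, not one of the scripts mentioned above.
# usage: set_book_meta file.pdf "Title" "Author" year "Publisher"
set_book_meta() {
    exiftool -overwrite_original \
        -Title="$2" \
        -Author="$3" \
        -XMP-dc:Date="$4" \
        -XMP-dc:Publisher="$5" \
        "$1"
}
```

Leave out -overwrite_original if you would rather have exiftool keep a backup copy of the untouched file.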

Needed software

On a decent operating system (Linux) with a decent package-management (Debian or derivative), you can do:

apt-get install scantailor graphicsmagick exiftool xsane poppler-utils texlive-extra-utils

to get all the packages. The rest is linked in the text.

See also

I’ve found some other related pages you might find interesting:

Surveillance and the Scandal

January 30th, 2014

It has been known since the late 1980s at the latest that the NSA wants to monitor everything. That is when details emerged about the ECHELON project, which intercepts radio signals in Europe as well, for example via the listening stations at Menwith Hill (UK) and Bad Aibling (DE). Der Spiegel reported on it in 1989: NSA: Amerikas großes Ohr.

Radio was not the only thing being intercepted. Even back then it was an open secret that the NSA operated hundreds of telephone lines in Frankfurt “in the immediate vicinity of the central post office”. And in the 90s it was known that the NSA had an office in the same building as DE-CIX, and later moved across the street. (Apparently this is not quite correct: the NSA evidently had an office above the main post office, but that was before DE-CIX existed. That does not change the fact that the NSA later maintained offices in the immediate vicinity of DE-CIX.)

So it was already clear in the 90s that the NSA tries to monitor everything abroad, and anyone who took an interest was aware of it. It was also clear that this had to have at least the tacit approval of the relevant authorities in the UK, Germany and other countries.

What was clear neither to me nor to the public is the extent to which the NSA succeeded at this, at least in the 21st century.

The Scandal in the USA

The actual scandal, however, is a different one. On the one hand, the USA has one of its own: surveillance of its own population was neither the NSA’s mandate, nor legal. The attempted justification with “National Security Letters” and esoteric interpretations of the law is no more than a fig leaf to cover the unconstitutional (and in fact also illegal) machinations of the NSA and of the Bush Jr. and Obama administrations. But that is, for now, the problem of US citizens.

The Scandal Here…

On the other hand, we have a scandal too: namely, how the NSA’s deeds were covered by local intelligence services and governments. Yes, under their own laws the USA may spy on us. But not under ours. And while most European intelligence services, unlike the US ones, are allowed to spy on their own residents, there is one thing they may not do: pass data about their own citizens to foreign powers. And instead of doing counter-espionage, some European intelligence services, quite certainly including the BND, have apparently done the exact opposite and betrayed their own citizens.

..and the Scandal of How It Is Being Handled

And the third scandal in this whole affair is how the affected governments are handling it. Instead of immediately backing their own population, taking the culprits in their own country to court, shutting down the NSA’s outposts and choking off the flow of data, they practice lame denials (“there is no scandal”), relativizations and justifications.

Sure, some people in the governments presumably knew a bit too much, but that is neither a reason not to rein in or shut down the intelligence services immediately, nor a reason to play the whole thing down. And for the parliaments it is certainly no reason not to immediately pass laws that make such mass surveillance impossible in the future. Instead, there are still politicians demanding data retention (Vorratsdatenspeicherung), which in the end is nothing other than providing data collections for foreign intelligence services and criminals. It is equally inexplicable why public prosecutors have not immediately started investigating those involved.

And then we have the press, which, especially in the USA, distinguishes itself as an NSA apologist and, instead of attacking the wrongdoers, wants to silence the messenger.
But in Europe, too, the reaction is still moderate, and instead of demanding that heads roll, people are placating.

Yes, we have a scandal, but it is not that the NSA intercepts everything; it is that it is supported in doing so by collaborators in our countries, and that our own governments do nothing about it.

(The reason this article does not go further into Switzerland is that a lot is still open here, and a possible cooperation of Swiss intelligence services with the NSA has not really been investigated yet, and is also not as obvious as with the German BND or the British GCHQ.)

Life with Calibre

November 26th, 2013

Calibre is undisputedly the number one when it comes to e-book management. It’s HUGE. It’s got a plethora of functions.

And it’s got quirks, design decisions which may not suit your workflow. Certainly a lot of them don’t suit mine.

  • Calibre’s own space. Every document imported into the library ends up copied into some private directory of calibre, and named according to some /Author/Title/Title scheme. The way I cope with this is to import into calibre, and save-to-disk again.
  • Metadata on the filesystem. Metadata is stored not within the file, but in some database, and apparently in some opf-file with the book as well. Luckily, calibre tries to put metadata into the file when saving to disk. So the solution here is the same as above.
  • Name like Yoda, A. When writing files, it misnames them to some library sort order, with the article appended at the end. To fix this, there’s a parameter in “Preferences” -> “Tweaks” -> “Control Formatting of Title and Series when used in Templates”, called save_template_title_series_sorting, which needs to be set to strictly_alphabetic.
  • No such Character. There’s a set of characters Calibre does not want in file names. They are the same on all platforms, and while it’s not wise to use asterisks and such on unix filesystems, because they would wreak havoc on shell-processing, they would still work. The only character really not allowed is the “/”. But Calibre also replaces various characters as ballast from Windows, including desirable critters like “:” and “+”. The way to fix this is to edit
    /usr/lib/calibre/calibre/__init__.py and have them removed from _filename_sanitize_unicode.
  • Publishing by the Month. Before the advent of e-books, publishing dates were by definition expressed in years. Copyright law also uses the year only. To get rid of the ridiculous month in the publishing date, go to “Preferences” -> “Tweaks” -> “Control how dates are displayed” and set gui_pubdate_display_format to yyyy.
  • Not unique. As librarians know, in the absence of an ISBN, books are identified by author, title, publishing year and publisher. Now when saving pdf files, Calibre neither puts in an ISBN, nor the publishing year, nor the publisher. Apparently, this is a problem of podofo, which does not know these. Speaking of which:
  • podofail. Sometimes podofo also fails to write some tags. It’s not quite clear when this happens, as none of my pdf files are encrypted, and exiftool can write metadata to them without problems.
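Since it’s not obvious when a tag has been dropped, it can help to check the saved files afterwards. A minimal sketch using exiftool’s -s3 option (print the bare value only); check_pdf_meta is a made-up name, and checking only Title and Author is an arbitrary choice for this example.

```shell
# Report saved files that lack a Title or Author tag.
# check_pdf_meta is a hypothetical helper; usage: check_pdf_meta *.pdf
check_pdf_meta() {
    for f in "$@"; do
        for tag in Title Author; do
            value=$(exiftool -s3 "-$tag" "$f")   # empty if the tag is missing
            [ -n "$value" ] || printf '%s: missing %s\n' "$f" "$tag"
        done
    done
}
```

Files that show up in the output are candidates for fixing with exiftool directly.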

Over time, I’ve written a slew of scripts to read and set metadata, these are:

  • epub-meta (c) — A very fast EPUB metadata-viewer based on ebook-tools’ libepub from Ely Levy.
  • epub-rename (perl) — A script to rename epub-files according to the EPUB’s metadata. Needs epub-meta and ebook-meta (from calibre).
  • exif-rename (perl) — A script to rename files according to their EXIF-tags. Tested with PDF, DJVU, M4V, OGM, MKV and MP3
  • exif-meta (perl) — A script to set EXIF/XMP-metatags according to the filename.
  • exif-info (perl) — Displays metadata and textual content of PDF files. Intended as a filter for Midnight Commander.

For further technical information and rants, you might want to read How to Enter EPUB Metadata, Display EPUB metadata and rename EPUB files accordingly, and Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”, also on this blog.

Closed Data

November 19th, 2013

It seems there is no catchy title for this topic, unless you go for one of those half-page subtitle things that were popular in the 18th century:

“Closed Data, or how the public finances research and data collection for the state, and how the state makes us pay again for the results; or how it sells the data to third parties who then in turn sell it to us”

Well, at least that says right away what this is about.

We citizens of a state finance a whole lot of scientific research through taxes, and even where the background is not scientific but simply rooted in administrative necessity (census, surveying, taxes, exercise of rights, justice), data accrues that is scientifically interesting.

Street, zoning and cadastral plans, elevation models, hydrographic models, demographic data, meteorological data, crime statistics and so on. And because these processes have been running for quite a while, historical data as well. We have already paid once for all of this data, by entrusting a federal office or a cantonal or municipal administration with it and paying it with taxes.

Educational and medical institutions not only produce data, they also do scientific research in all fields. Such institutions, too, are actively maintained by the taxpayer. The researcher is paid to do research for his institute, and ultimately for the general public.

A small part of this data is relevant to data protection: data relating to living persons, such as tax returns, medical histories, etc. Here the holder is obliged to secrecy, so that in case of publication or disclosure no conclusions can be drawn about the person. But this data also has scientific value when it is available in suitably anonymized form, e.g. to be able to statistically evaluate health risks.

This makes it clear that high-resolution maps on any topic, anonymized personal data, environmental data of any kind, and scientific research results belong to the general public.

But if the citizen wants access to this data, hurdles suddenly appear in his way:

  • Historically grown bureaucracy. Data sometimes has to be processed (e.g. compiled, anonymized, catalogued) before it can be published, and historically, publication itself came with considerable costs for printing, copying or shipping. This led institutions to regard their data as “internal” on principle, and anyone wanting access as a supplicant who should kindly wait first and then bear the costs of publication.
  • Financial pressure. Institutions, be they government offices or research facilities, are under financial pressure from above. The operators want to spend as little money as possible, so the easiest thing seems to be not only to have the external costs of publication paid, but to generate revenue there as well. As soon as publication costs tend towards zero, which has been the case with the Internet since around 1995, you suddenly see an institution charging money for something that was paid for long ago. Why exactly do Swiss 1:25’000 maps in electronic form cost SFR 14 per “sheet” [1]?
  • Propaganda. The step from “for publication we need to be reimbursed for the costs of publishing itself” to “for publication we want to be reimbursed for the costs of the entire research” is a small but very relevant one. Suddenly the data no longer belongs to the public, but has become the legal property of those who created it (even though the public actually paid them for it). Copyright notices are slapped on, and one tries to control the entire exploitation chain. This thinking was encouraged by the growing propaganda of private rights exploiters since the 1980s, who have also managed to have copyright law tightened no fewer than X times since then, every time at the expense of the public. Here again, Swisstopo and its licenses serve as an example [2].
  • Rent-seeking. The business with publicly financed data bears even wilder fruit in academia. Here, scientific publishers have established themselves who sell the attention and reputation of their readers to potential authors, who then not only pay the publication costs in a journal, but also review other authors’ work for free, so that in the end the publishers can make the journal available again to educational institutions at horrendous subscription prices. A completely parasitic arrangement which really just draws a rent from the public via education budgets.
  • Envy (Futterneid). The data that one institution or one researcher has, others should either not have, or not be able to use without paying for it. And, of course, without considering that the data was actually already financed by the public. Copyright plays a part here again, or at least the world view shaped by the rights exploiters mentioned above. But much more interesting here is another system, one that allows you to forbid everyone else from using your own ideas (nota bene: it does not permit you to use your own ideas yourself; it is purely a veto right against others): the patent system. While academic research formerly treated this as a private matter, the interplay of the factors mentioned here has made it customary for institutions to encourage patents, so that in the end the public may not even use an invention it has paid for without paying license fees.

All these mechanisms make it difficult for citizens to get the data that was collected with their own money. When I was looking for data in 1996 for a sociological paper [3], in Switzerland I could only get it either on paper or, very expensively, as “individual query results” on floppy disk; in the end I used US data instead.

I am not the only one who ended up working around it somehow. The http://www.openstreetmap.org project consists of data collected by volunteers via GPS, although exactly the same data would already have been available in land registries and the topographic institutions.

What the world looks like, on the other hand, when citizens and interested parties have access to such data can be seen in the examples at http://opendata.ch/. In the end, the whole is greater than the sum of its parts; and we cannot really imagine in advance what might come out of any given data, so the only sensible reaction is to give as many people as possible access to this data.

[1] Swisstopo
[2] Swisstopo: Licenses
[3] Attitudes towards Victimless Crimes, Peter Keel, 1996

Minecraft: Semi-Automatic Farm

October 24th, 2013

Welcome, this is my “1890 Fruit Company”, an automatic farm for Minecraft, which isn’t even about fruit. It looks rather 1890s, though, and I couldn’t resist the name.

1890 Fruit Co.

It produces potatoes, carrots, wheat and seeds. You need to sow and plant yourself; fertilizing and harvesting are pretty much automated, and the products are automatically sorted.

The schematic

The license of these files and my screenshots is the OPL 1.0 (which is about the same as CC-by-sa).