Seegras Logbook

Archive for the 'Metadata' Category

Dateinamen sind Schall & Rauch

Tuesday, August 28th, 2018

Vortrag zur CoSin 2018 in dem erklärt wird weshalb Dateinamen Ephemeral sind und wichtige Informationen in der Datei selber gespeichert werden sollen.

Dateinamen Sind Schall Und Rauch (odp)
Dateinamen Sind Schall Und Rauch (pdf)
Vortragsvideo “Dateinamen sind Schall & Rauch” in der CCC mediathek

Posted in Computers, Metadata | Comments Off on Dateinamen sind Schall & Rauch

Life with Calibre

Tuesday, November 26th, 2013

Calibre is undisputedly the number one when it comes to e-book management. It’s HUGE. It’s got a plethora of functions.

And it’s got quirks, design decisions which may not suit to your workflow. Certainly a lot of them don’t suit to mine.

Calibres own space. Every document imported into the library ends up copied into some private directory of calibre, and named according to some /Author/Title/Title scheme. The way I cope with this, is import into calibre, and save-to-disk again.
Metadata on the filesystem Metadata is stored not within the file, but in some database, and apparently in some opf-file with the book as well. Luckily, calibre tries to put metadata into the file when saving to disk. So the solution here is the same as above.
Name like Yoda, A When writing files, it misnames them to some library sort order, with the article appended at the end. To fix this, there’s a parameter in “Preferences” -> “Tweaks” -> “Control Formatting of Title and Series when used in Templates”, called save_template_title_series_sorting which needs to be set to strictly_alphabetic
No such Character There’s a set of characters Calibre does not want in file names. They are the same on all platforms, and while it’s not wise to use asterisks and such on unix filesystems, because they would wreak havoc on shell-processing, they would still work. The only character really not allowed is the “/”. But Calibre also replaces various ballast from Windows, like desirable critters like “:” and “+”. The way to fix this is to edit
/usr/lib/calibre/calibre/__init__.py and have them removed from _filename_sanitize_unicode.
Publishing by the Month Before the advent of the e-books, publishing dates are by definition expressed in years. Copyright law also uses the year only. To get rid of the ridiculous month in the publishing date, go to “Preferences” -> “Tweaks” -> “Control how dates are displayed” and set gui_pubdate_display_format to yyyy
Not unique As librarians know, in the absence of ISBN, books are identified by author, title, publishing year and publisher. Now when saving pdf files, Calibre neither puts in an ISBN, nor the publishing year, nor the publisher. Apparently, this is a problem of podofo, which does not know these. Speaking of which:
podofail Sometimes podofo also fails to write some tags. It’s not quite clear when this happens, as all my pdf files do not have any encryption, and exiftool can write metadata to them without problems.

Over time, I’ve written a slew of scripts to read and set metadata, these are:

epub-meta (c) — A very fast EPUB metadata-viewer based on ebook-tools’ libepub from Ely Levy.
epub-rename (perl) — A script to rename epub-files according to the EPUB’s metadata. Needs epub-meta and ebook-meta (from calibre).
exif-rename (perl) — A script to rename files according to their EXIF-tags. Tested with PDF, DJVU, M4V, OGM, MKV and MP3
exif-meta (perl) — A script to set EXIF/XMP-metatags according to the filename.
exif-info (perl) — Displays metadata and textual content of PDF files. Thought as filter for Midnight Commander

For further technical information and rants, you might want to read How to Enter EPUB Metadata Display EPUB metadata and rename EPUB files accordingly and Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1″, also on this blog.

Posted in Computers, EBooks, Metadata | Comments Off on Life with Calibre

Matroshka and the State of Movie Metadata

Saturday, September 21st, 2013

I like my metadata for files within the file. The reason is simple, a filesystem is only a temporary storage for a file, and things like filenames or paths only make sense within the filesystem itself. If you move the file, the filesystem might not support your particular metadata.

Starting with the path. For instace, /movies/Pirate/ won’t exist on other peoples machines, and it actually can’t even exist on stupid windows filesystems. So the fact that the file residing within this path is probably a pirate movie would get lost. And of course, not every filesystem supports all characters or encodes them the same way, and thus the movie “Pippi Långstrump på de sju haven” might end up with a totally garbled title on a filesystem.

Since I work on the Unix shell and on the web a lot, spaces in filenames tend to get garbled (“%20”) or interfere with commandline processing. So my filenames do not have spaces or umlauts in them, they are instead BiCapitalized. In fact, I’ve written a program bicapitalize to do just that.

Enter Matroshka

When it comes to metadata, the one container format that can just about contain everything is Matroshka. MP4 would be a possibility, but it’s rather constricted in it’s use of subtitles, codecs and audio tracks or even cover images. Also, matroshka looks much less as if “designed by commitee” as MP4 does; and is generally better supported by open source software. Not quite enough, as we’ll see..

To get from, say, avi containers to mkv is easy (after apt-get install mkvtoolnix):

for i in *.avi; do mkvmerge -o `basename $i .avi`.mkv --language 1:eng --title "`respacefilter $i | cut -d . -f 1`" $i ; done

This only changes the container, it won’t recode anything. It usually works with avi, mp4, mpeg, flv, ogm and more, but not with wmv.

You’ll notice the program respacefilter, which I’ve written to display BiCapitalized filenames as strings containing spaces. And if you’ve got some experience with the unix shell, you’ll also notice the above commandline will fail for files containing spaces. That’s exactly the reason why spaces in filenames are bad.

The above command also sets the “Title” tag to something probably meaningful, and the language of the first audio track to english. You can change the latter later on with
mkvpropedit --edit track:a1 --set language=ger target.mkv

If the title is screwed, you could set it with
mkvpropedit --edit info --set title="This Movie" target.mkv

Of course, if you already do have Matroshka files, and their title tags are not set or wrong, you might not want to set all titles by hand. I’ve also written a script called titlemkv to fix this. It can also fix some drawn out all-caps acronyms. Apart from the mkvtools, this needs mediainfo (install on Debian/Ubuntu with apt-get install mediainfo).

All the above can also be done, one file at a time, with the graphical interface mmg (of course: apt-get install mkvtoolnix-gui).

By now, you should have all you movie files in Matroshka-containers, and if not, because things like wmv-files, or files containing ancient codecs can’t just be re-containered, there’s HandBrake (as usual, apt-get install handbrake-gtk)

Matroshka Metadata

Apart from title and the languages of audio-tracks and subtitles, Matroshka files do not contain any metadata directly. Instead, they are in an xml-file, which is muxed into the container. Which makes the whole process obviously rather tedious. You don’t want to do it by hand.

Also, it turns out, most application do not read any metadata from the containers AT ALL. mediainfo of course can do it. So can avinfo, surprisingly. vlc can display most of them in a special window. mpv will display the Title as the window title. But the ones really needing metadata, the media center applications CAN’T. Neither MythTV, nor xbmc. Instead, both of these rely on filenames, and put the metadata into their database, with the added option of using some accompanying file with the movie which gets interpreted as well.

To add insult to injury, given one of these accompanying files with correct data, xbmc will display it, but when trying to fill in the blanks, it will happily try to look it up — by interpreting the filename again, wrongly. At least MediaElch can do this right (and that’s why it gets linked).

So the questions are a) how do we get these “accompanying files” (assuming they’re really needed for getting metadata from the web) and b) how do we get better metadata into them, and c) how do we put this metadata into the files itself.

For this, titlemkv can produce a rudimentary .nfo file for xbmc, when given the -n switch. It will contain the title, and the year, if it is already set in the mkv. Going from this, MediaElch or any other not broken scraper, can now fill in the blanks and produce .nfo files which contain a lot of information, like directors, actors, summaries and so on.

The last piece is my nfo2xml script, which will walk over a directory and produce a mkv-compatible XML file out of every .nfo-file it finds. The XML can the be muxed into the mkv-container, thus:
for i in *.mkv; do mkvpropedit $i --tags all:`basename $i .mkv`.xml ; done

The Future

I’ll probably update titlemkv to generate complete .nfo files from mkv metadata (or split the functionality into another program), also, I want to look at the question of how to incorporate cover images and such. I want all my files to contain useful metadata, and second, as long as this sorry state persists, I want to be able to generate whatever external metadata an application wants out of the incorporated metadata (which has its own merits: I would also be able to rename and sort my whole collection solely according the metadata in the files themselves).

(Edit 1: I wrote a rather stupid shellscript mkvattachcover to convert and attach cover images. It expects them with the filenames provided by MediaElch.)

(Edit 2: For use with mediainfo --inform=file:///path/to/AVinfo.csv I put up a decent template, AVinfo.csv which will show Matroshka specific tags. No, I have no idea why they’re calling their templates .csv, they aren’t.)

But crucially, the media center applications and the file managers will need to support metadata incorporated into files; just as one expects with audio files where this is absolutely the case.

Metadata MUST reside within the same file. I do understand that certain programs do not want to incorporate code to change this metadata, but just about everything accessing these files must be able to READ it, including media players, scrapers and file managers.

(Edit 3: nautilus displays either cover.jpg or small_cover.jpg as icon. But that’s it, apparently it can’t read any other metadata.)

Posted in Computers, Metadata | Comments Off on Matroshka and the State of Movie Metadata

Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”

Saturday, April 20th, 2013

At least that is what I get from the metadata in your publication.

Google finds about 250’000 of these papers. It gets much worse if you only search for documents called “untitled1”. Not just the documents themselves have this meta-information, but all kinds of conversions, to html, and to pdf as well.

Sometimes, to make the whole thing even more ironic, the publisher has added his own information — but neither the title, nor the author.

Yes, metadata is a kind of a pet issue for me, and I’ve even written about How to Enter EPUB Metadata, apart from also having written Software to fix metadata in PDF- and epub-files (epub-meta/epub-rename and exif-meta/exif-rename. The latter works for PDF; the name comes from exiftool, altough technically the PDF metadata is XMP).

But still, if your paper should be worth anything, it should be worth to be found, and this also means worth being provided with accurate meta-information.

Librarians either work with an ISBN, and if no ISBN can be found (because it was published before 1969, or because no ISBN was ever registered), they need the following to correctly identify a work:

Author
Title
Publishing Year
Publisher

So you should take care that at least the first three of those are correctly filled in. If you’re doing a paper or book in the course of your work or study and publish it on the internet, consider entering the university or company as publisher.

Posted in Computers, EBooks, Metadata | Comments Off on Your name is “Windows User” and your scientific Paper is called “Microsoft Word – Untitled1”

Display EPUB metadata and rename EPUB files accordingly

Wednesday, January 25th, 2012

I’ve been programming. First a fast replacement for displaying metadata of an EPUB-file:
$ epub-meta -v 1632.epub File: 1632.epub Title: 1632 Author: aut: Eric Flint(Flint, Eric ) ID: ISBN (Unspecified:0-671-31972-8) ID: uuid (uuid_id:b1d2b16c-f68d-4a80-9d02-f6fcac36e1f3) Subject: Science Fiction Publisher: Baen Publishing Enterprises Date: Unspecified: 2000-02-07T05:00:00+00:00 Lang: en Contrib: bkp: calibre (0.6.51) [http://calibre.kovidgoyal.net](calibre (0.6.51) [http://calibre.kovidgoyal.net]) Meta: calibre:series_index: 1.0 Meta: calibre:timestamp: 2010-05-07T16:59:51.299000+00:00 Meta: cover: 0671578499_Cover Meta: calibre:series: Ring of Fire Meta: calibre:user_categories: {} Meta: calibre:author_link_map: {}
It has commandline-switches to selectively choose which metadata should be displayed. And it does it very fast, 3ms on my system, as opposed to 670ms ebook-meta from Calibre needs. However, it can only display the metadata, not change it. For changing metadata, you still need ebook-meta.

Having the ability to display metadata fast, made it possible to rename EPUB-files. Initially, I had the idea to do that in C too, but working with strings is actually quite tedious in C, so I decided on perl. So there’s now also a program called epub-rename, which renames EPUB-files according to it’s metadata in the format Author - Series SeriesIndex - Title. Moreover, it also has, trough ebook-meta, the ability to fix certain issues in metadata-tags. Namely change inverted Title/Author tags, fix Author-Tags which are in the wrong(!) Last, First-Format, and some more.

Well, here’s the “–help”
$ epub-rename --help Usage: [options] [directory ...]


     Options:

       -c|--compat

       -f|--fix

       -h|--help

       -t|--title

       -r|--rename

       -x|--exchange

       -v|--verbose
Options:

    -c|--compat

            Use ebook-meta from calibre instead of epub-meta. Much slower.
    -t|--title

            Fix title. This means the tag gets sanitized, as it would if

            destined for a filename, and then written back to the metadata.

            Uses ebook-meta.
    -f|--fix

            Fix all tags: author, title, and in some cases date. Uses

            ebook-meta and touches every file, even those that don't need

            fixing. Slow.
    -h|--help

            Print a brief help message and exit.
    -r|--rename

            rename files to the pattern "Author - Series SeriesIndex -

            Title"
    -x|--exchange

            changes title for author-tag and vice versa. For all those files

            that have the author in the title-field and the title in the

            author- field. Uses ebook-meta, thus is slow.

-v|--verbose Show how all files would be renamed, not just those really renamed.

And here’s the program itself: epub-meta-0.2.tar.gz. MIT-Licensed. Enjoy.

If you don’t like the spaces, punctuation and UTF-8-characters in the output filenames, I’d recommend another program of mine: bicapitalize.

Posted in Computers, EBooks, Metadata | Comments Off on Display EPUB metadata and rename EPUB files accordingly

How to Enter EPUB Metadata

Friday, January 20th, 2012

If you have a certain library of E-Books from different sources (e.g. Baen, Gutenberg, Archive.org, Google Books) you will notice a disparaging plethora of different styles of annotating EPUB-files, sometimes blatantly wrong and in violation of the EPUB Standard itself.

So this is a Howto on how to enter these metadata correctly. I’ll mostly cover the program “ebook-meta” (part of Calibre) which is available on about every platform.

Encoding

EPUB uses UTF-8, and UTF-8 only. Still, if you don’t use things like left-and-right quotes and backquotes, you’ll make sure your tags don’t get messed up. Ideally, only use the single quote “‘”.

Vocabulary

Try to be consistent in the vocabulary for tags (genres, categories). Sadly, no vocabularies are specified by the standards right now.