Matroska and the State of Movie Metadata
Saturday, September 21st, 2013
I like my metadata for files within the file. The reason is simple: a filesystem is only temporary storage for a file, and things like filenames or paths only make sense within the filesystem itself. If you move the file, the new filesystem might not support your particular metadata.
Start with the path. For instance, /movies/Pirate/ won’t exist on other people’s machines, and it actually can’t even exist on stupid Windows filesystems. So the fact that the file residing within this path is probably a pirate movie would get lost. And of course, not every filesystem supports all characters or encodes them the same way, and thus the movie “Pippi Långstrump på de sju haven” might end up with a totally garbled title on another filesystem.
Since I work on the Unix shell and on the web a lot, spaces in filenames tend to get garbled (“%20”) or interfere with commandline processing. So my filenames do not have spaces or umlauts in them, they are instead BiCapitalized. In fact, I’ve written a program bicapitalize to do just that.
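The bicapitalize program itself is not shown here, so the following is only a rough sketch of the idea, assuming plain ASCII names: capitalize the first letter of every space-separated word and join the words. The function name bicap is made up for this example, and umlauts are not handled at all.

```shell
# Hypothetical stand-in for bicapitalize (ASCII only, umlauts untouched).
bicap() {
    printf '%s\n' "$1" | awk '{
        out = ""
        # Upper-case the first character of each word, then glue the words together.
        for (i = 1; i <= NF; i++)
            out = out toupper(substr($i, 1, 1)) substr($i, 2)
        print out
    }'
}

bicap "my holiday movie.avi"   # prints MyHolidayMovie.avi
```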
Enter Matroska
When it comes to metadata, the one container format that can contain just about everything is Matroska. MP4 would be a possibility, but it’s rather restricted in its use of subtitles, codecs and audio tracks, or even cover images. Also, Matroska looks much less “designed by committee” than MP4 does, and is generally better supported by open source software. Not quite well enough, as we’ll see.
To get from, say, avi containers to mkv is easy (after apt-get install mkvtoolnix):
for i in *.avi; do mkvmerge -o `basename $i .avi`.mkv --language 1:eng --title "`respacefilter $i | cut -d . -f 1`" $i ; done 
This only changes the container; it won’t recode anything. It usually works with avi, mp4, mpeg, flv, ogm and more, but not with wmv.
You’ll notice the program respacefilter, which I’ve written to display BiCapitalized filenames as strings containing spaces. And if you’ve got some experience with the unix shell, you’ll also notice the above commandline will fail for files containing spaces. That’s exactly the reason why spaces in filenames are bad.
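For the odd file that does contain spaces, quoting every expansion keeps the loop intact. This is a sketch rather than the command I actually use: mkv_target is a helper name invented here, and how respacefilter treats names that already contain spaces is an open question.

```shell
# Derive the output name: strip the last extension, append .mkv.
# Hypothetical helper; safe for names containing spaces.
mkv_target() {
    printf '%s.mkv\n' "${1%.*}"
}

for i in *.avi; do
    [ -e "$i" ] || continue   # the glob matched nothing
    mkvmerge -o "$(mkv_target "$i")" --language 1:eng \
        --title "$(respacefilter "$i" | cut -d . -f 1)" "$i"
done
```

The only real changes are the double quotes around every use of $i and the replacement of the backticks by $(...), which nests without extra escaping.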
The above command also sets the “Title” tag to something probably meaningful, and the language of the first audio track to English. You can change the latter later on with
mkvpropedit --edit track:a1 --set language=ger target.mkv
If the title is screwed, you could set it with
mkvpropedit --edit info --set title="This Movie" target.mkv
Of course, if you already have Matroska files and their title tags are unset or wrong, you might not want to set all titles by hand. I’ve also written a script called titlemkv to fix this. It can also fix some drawn-out all-caps acronyms. Apart from the mkvtoolnix tools, this needs mediainfo (install on Debian/Ubuntu with apt-get install mediainfo).
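To find out which files need attention in the first place, something like the following might do. It is a sketch that assumes mkvinfo (part of mkvtoolnix) prints the segment title as a line of the form "| + Title: ..." when run with English output; segment_title is a name made up for this example and has nothing to do with titlemkv.

```shell
# Print the segment title found in mkvinfo output on stdin (empty if unset).
# Assumes the English "| + Title: ..." line format.
segment_title() {
    sed -n 's/^|[ ]*+ Title: //p' | head -n 1
}

# List all Matroska files in the current directory that have no title.
for f in *.mkv; do
    [ -f "$f" ] || continue   # the glob matched nothing
    title=$(LC_ALL=C mkvinfo "$f" | segment_title)
    [ -n "$title" ] || printf 'no title: %s\n' "$f"
done
```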
All the above can also be done, one file at a time, with the graphical interface mmg (of course: apt-get install mkvtoolnix-gui). 
By now, you should have all your movie files in Matroska containers. If not, because things like wmv files or files containing ancient codecs can’t just be re-containered, there’s HandBrake (as usual, apt-get install handbrake-gtk).
Matroska Metadata
Apart from the title and the languages of audio tracks and subtitles, Matroska files do not contain any metadata directly. Instead, the rest is written as tags in an XML file, which is then muxed into the container. This obviously makes the whole process rather tedious. You don’t want to do it by hand.
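To give an idea of what such an XML file looks like, here is a minimal hand-written one, wrapped in a here-document so it can be pasted into a shell. The file name example-tags.xml and all values are placeholders; TITLE, DATE_RELEASED and DIRECTOR are tag names from the Matroska tagging specification, and a TargetTypeValue of 50 addresses the movie as a whole.

```shell
# Write a minimal Matroska tags file (placeholder values throughout).
cat > example-tags.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<Tags>
  <Tag>
    <Targets>
      <TargetTypeValue>50</TargetTypeValue>
    </Targets>
    <Simple>
      <Name>TITLE</Name>
      <String>This Movie</String>
    </Simple>
    <Simple>
      <Name>DATE_RELEASED</Name>
      <String>2013</String>
    </Simple>
    <Simple>
      <Name>DIRECTOR</Name>
      <String>Jane Doe</String>
    </Simple>
  </Tag>
</Tags>
EOF
```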
Also, it turns out, most applications do not read any metadata from the containers AT ALL. mediainfo of course can do it. So can avinfo, surprisingly. vlc can display most of it in a special window. mpv will display the Title as the window title. But the ones really needing metadata, the media center applications, CAN’T: neither MythTV nor xbmc. Instead, both of these rely on filenames and put the metadata into their database, with the added option of using an accompanying file next to the movie, which gets interpreted as well.
To add insult to injury, given one of these accompanying files with correct data, xbmc will display it, but when trying to fill in the blanks, it will happily try to look it up — by interpreting the filename again, wrongly. At least MediaElch can do this right (and that’s why it gets linked).
So the questions are: a) how do we get these “accompanying files” (assuming they’re really needed for getting metadata from the web), b) how do we get better metadata into them, and c) how do we put this metadata into the files themselves?
For this, titlemkv can produce a rudimentary .nfo file for xbmc when given the -n switch. It will contain the title, and the year if it is already set in the mkv. Going from this, MediaElch or any other scraper that isn’t broken can fill in the blanks and produce .nfo files which contain a lot of information, like directors, actors, summaries and so on.
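A rudimentary .nfo of this kind is just a small XML file named after the movie. The following is a hand-written illustration of the xbmc movie format with placeholder values, not the literal output of titlemkv -n; the file name ThisMovie.nfo is made up to go with a hypothetical ThisMovie.mkv.

```shell
# Write a bare-bones xbmc movie .nfo next to ThisMovie.mkv (placeholder values).
cat > ThisMovie.nfo <<'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
    <title>This Movie</title>
    <year>2013</year>
</movie>
EOF
```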
The last piece is my nfo2xml script, which will walk over a directory and produce a mkv-compatible XML file out of every .nfo file it finds. The XML can then be muxed into the mkv container, thus:
 for i in *.mkv; do mkvpropedit $i --tags all:`basename $i .mkv`.xml ; done
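A slightly more defensive variant of the same loop could look like this. It is a sketch under a few assumptions: mux_tags and the MKVPROPEDIT variable are names invented here, the XML file sits next to the movie with the same base name, and movies without an XML file are simply skipped.

```shell
# Mux NAME.xml into NAME.mkv for every .mkv in the given directory.
# Set MKVPROPEDIT=echo for a dry run that only prints the arguments.
mux_tags() {
    dir=${1:-.}
    for f in "$dir"/*.mkv; do
        xml="${f%.mkv}.xml"
        # Skip an unmatched glob and movies without a tag file.
        [ -f "$f" ] && [ -f "$xml" ] || continue
        "${MKVPROPEDIT:-mkvpropedit}" "$f" --tags "all:$xml"
    done
}
```

Calling mux_tags without an argument works on the current directory; MKVPROPEDIT=echo mux_tags /some/dir shows what would be done without touching any file.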
The Future
I’ll probably update titlemkv to generate complete .nfo files from mkv metadata (or split the functionality into another program). I also want to look at the question of how to incorporate cover images and such. First, I want all my files to contain useful metadata, and second, as long as this sorry state persists, I want to be able to generate whatever external metadata an application wants out of the incorporated metadata (which has its own merits: I would also be able to rename and sort my whole collection solely according to the metadata in the files themselves).
(Edit 1: I wrote a rather stupid shellscript mkvattachcover to convert and attach cover images. It expects them with the filenames provided by MediaElch.)
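The attaching part boils down to one mkvpropedit call. The following is not mkvattachcover itself, only a sketch of the core step: attach_cover and MKVPROPEDIT are names invented here, the image is assumed to be a JPEG already, and it is stored under the conventional attachment name cover.jpg.

```shell
# Attach a JPEG to a Matroska file under the name cover.jpg.
# Usage: attach_cover movie.mkv image.jpg
# Set MKVPROPEDIT=echo for a dry run that only prints the arguments.
attach_cover() {
    "${MKVPROPEDIT:-mkvpropedit}" "$1" \
        --attachment-name cover.jpg \
        --attachment-mime-type image/jpeg \
        --add-attachment "$2"
}
```

The name and MIME type options have to come before --add-attachment, because they apply to the attachment option that follows them.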
(Edit 2: For use with mediainfo --inform=file:///path/to/AVinfo.csv I put up a decent template, AVinfo.csv, which will show Matroska-specific tags. No, I have no idea why they’re calling their templates .csv; they aren’t.)
But crucially, the media center applications and the file managers will need to support metadata incorporated into files; just as one expects with audio files where this is absolutely the case.
Metadata MUST reside within the same file. I do understand that certain programs do not want to incorporate code to change this metadata, but just about everything accessing these files must be able to READ it, including media players, scrapers and file managers.
(Edit 3: nautilus displays either cover.jpg or small_cover.jpg as icon. But that’s it, apparently it can’t read any other metadata.)