A Little Off Code, Computers, Photography and Guns

17 Feb 2012

OCR Segmentation for Movie Subtitles

A little while ago I wrote a post on decoding DVD subtitles. Now that the subtitle decoder works well enough for me to move on to other aspects of the project, I have started writing a program which scans subtitle images and segments them into a matrix of unique characters.

One of the first problems I considered while writing the matrix generator was how to merge non-contiguous characters such as colons, semicolons, i's, j's and anything with an accent, as found in foreign language sections. The other problem was how to determine whether a gap between two characters separates characters of the same word or separates two words.

To measure some of these things I first needed to have an idea of the sorts of features I'd find considering only the vertical and horizontal distances between characters (and the bounds of the image). I wrote a quick Golang program which finds the boundaries of each character and, for each row or column, scans right or up to find the next character or the bounds of the image. What I found is quite interesting.

The first of the two measurements was the frequency of horizontal distances between characters. If we consider only distances of 50px or less, we will most likely find information about how far apart characters belonging to the same word are typically spaced and how far apart words are typically spaced. Sure enough, a log plot of the frequencies vs. pixel distances showed this quite clearly.
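
A minimal sketch of how such a horizontal gap histogram could be gathered is shown below. It is not the program described in this post: the `[][]bool` bitmap (true meaning a character pixel), the name `gapHistogram` and the `maxGap` parameter are assumptions made for illustration, and this simplified version also counts gaps inside a single glyph.

```go
package main

import "fmt"

// gapHistogram counts, for every row, the lengths of background runs that lie
// between two ink pixels. Runs longer than maxGap are ignored.
// The bitmap layout (rows of booleans, true = ink) is an assumption.
func gapHistogram(bitmap [][]bool, maxGap int) map[int]int {
	hist := make(map[int]int)
	for _, row := range bitmap {
		last := -1 // column of the most recent ink pixel, -1 if none yet
		for x, ink := range row {
			if !ink {
				continue
			}
			if last >= 0 {
				if gap := x - last - 1; gap > 0 && gap <= maxGap {
					hist[gap]++
				}
			}
			last = x
		}
	}
	return hist
}

func main() {
	// Two identical rows with ink at columns 0, 3 and 9,
	// giving one 2px gap and one 5px gap per row.
	row := []bool{true, false, false, true, false, false, false, false, false, true}
	hist := gapHistogram([][]bool{row, row}, 50)
	fmt.Println(hist[2], hist[5])
}
```

Plotting the counts on a log scale against the gap length would give the kind of plot described above.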



There were two obvious peaks found when scanning horizontally. The first, at 5px, is likely the distance between characters belonging to the same word. The second, at 14px, is likely the distance between words.

When looking at the vertical scanning distance I expected to find several peaks in the occurrences. The peaks I expected to find were the distances between each character class (upper-case, lower-case and characters extending beyond the baseline) and the next closest character above it. These features were not well-defined in the same style of plot used for the horizontal scanning.



The horizontal spacing can tell us whether a space is a character separator or a word separator; for this particular set of subtitles the character separation is ~5px and the word separation is ~14px. The vertical spacing is not as well defined and will need more analysis to find a metric for separating lines and merging non-contiguous characters, since most non-contiguous characters have components separated vertically rather than horizontally.
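
One way to turn the two peaks into a decision is sketched below. The midpoint rule and the names `charGapPx`, `wordGapPx` and `isWordBreak` are assumptions for illustration; the peak values apply only to this particular set of subtitles.

```go
package main

import "fmt"

// Peak positions measured for this particular set of subtitles.
const (
	charGapPx = 5  // typical gap between characters of one word
	wordGapPx = 14 // typical gap between words
)

// isWordBreak reports whether a horizontal gap lies closer to the word peak
// than to the character peak. The midpoint rule is an assumption.
func isWordBreak(gapPx int) bool {
	return 2*gapPx > charGapPx+wordGapPx
}

func main() {
	for _, gap := range []int{4, 6, 9, 10, 15} {
		fmt.Printf("%2dpx word break: %v\n", gap, isWordBreak(gap))
	}
}
```

A different subtitle font would need its own peaks, so in practice these constants would probably be derived from the histogram rather than hard-coded.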

20 Nov 2011

Decoding DVD Subtitles with Golang

I've always been very fond of subtitles but I'm not sure of the reason why. When transcoding my DVDs to play them on my network media player I realized I needed a good way to keep the subtitles without burning them into the video. The MPEG-4 container will happily include VOBSUB and SRT subtitle streams and my network media player handles this nicely.

The problem though is that including the VOBSUBs exactly as they appeared on the DVD is somewhat problematic: they're almost never the same style between movies, and sometimes they're just plain difficult to read. Converting them to SRT involves a fairly lengthy process of going through and indicating to an OCR program what each character is as it reads all of the subtitles and writes an SRT file. This is also difficult to correct if you mess up one character in the process of encoding.

So now that I've got the itch I decided to scratch it. I decided to write my own subtitle decoder that would write subtitles to images and a pseudo-OCR program to convert those images into individual character files. From there it would be fairly easy to write a quick interface that presents you with a list of letters at which point you can just fill in the character for each one. Once you've done this you can export it as your favorite text subtitle format in one shot instead of doing it as you go along.

The language I decided to write it in is Golang as I've been learning it for a few weeks now and it's currently my favorite language for a large number of reasons I won't get into here.

The first major challenge I ran into is that there's not really any standardized information about decoding DVD subtitles. I did find maybe 3-4 sites that have varying levels of detail into decoding DVD subtitles but there were still a lot of gaps in the information.

To start with, we need to decode MPEG Program Stream packets (PS), which contain MPEG Packetized Elementary Stream packets (PES). The PS header doesn't contain any information we need to decode subtitles. The PES header contains the size of the packet's payload, the offset to the payload and the length of the additional headers. SubStream refers to the stream id of the subtitle we're decoding. DataSize is the size of the subtitle payload. ControlPtr is the offset to the control sequences describing the subtitle's payload.

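type Packet struct {
    PSHeader   [14]uint8 // MPEG Program Stream pack header, unused
    PESHeader  [4]uint8  // PES start code and stream id
    PacketSize uint16    // bytes following this field; Read reduces it to the payload size
    Extension  uint16    // PES extension flags, unused
    HeaderSize uint8     // length of the additional PES header data
    SubStream  uint8     // stream id of the subtitle
    DataSize   uint16    // total size of the subtitle payload
    ControlPtr uint16    // offset to the control sequences
}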

To read data into this structure I've written the following method:

// Read parses the PS and PES headers of one packet. On return r is positioned
// at the start of the packet's payload and PacketSize holds the payload size.
// Read errors are ignored here for brevity.
func (p *Packet) Read(r io.ReadSeeker) {
    binary.Read(r, binary.BigEndian, &p.PSHeader)
    binary.Read(r, binary.BigEndian, &p.PESHeader)
    binary.Read(r, binary.BigEndian, &p.PacketSize)
    binary.Read(r, binary.BigEndian, &p.Extension)
    binary.Read(r, binary.BigEndian, &p.HeaderSize)
    // Skip the additional PES header data
    r.Seek(int64(p.HeaderSize), io.SeekCurrent)
    binary.Read(r, binary.BigEndian, &p.SubStream)
    binary.Read(r, binary.BigEndian, &p.DataSize)
    binary.Read(r, binary.BigEndian, &p.ControlPtr)

    // Remove the header bytes counted by PacketSize: Extension (2),
    // HeaderSize (1), the additional header data and SubStream (1)
    p.PacketSize -= uint16(p.HeaderSize) + 4

    // Back up; DataSize and ControlPtr are part of the payload
    r.Seek(-4, io.SeekCurrent)
}

We read each of the structure's fields in order, skipping the additional headers of the PES packet since we don't care about the data in them. We also adjust the given packet size so that it covers only the payload: as read, it also counts the extension, the header size byte, the additional headers and the SubStream byte, all of which we have already consumed. Before leaving this function we back up four bytes, since DataSize and ControlPtr are part of the payload, so that the file cursor is at the right position to start reading data from the offsets.
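
The size adjustment can be checked on a synthetic packet. The sketch below is only an illustration: the `pesFields` type, the `payloadSize` function and the sample bytes are made up for the example, and the 14-byte PS header is left out.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pesFields mirrors the fixed PES fields read above, without the PS header.
// The type and function names are hypothetical.
type pesFields struct {
	StartCode  [4]uint8
	PacketSize uint16
	Extension  uint16
	HeaderSize uint8
}

// payloadSize parses the fixed PES fields and returns the number of payload
// bytes that follow the sub-stream id.
func payloadSize(pes []byte) (int, error) {
	var f pesFields
	if err := binary.Read(bytes.NewReader(pes), binary.BigEndian, &f); err != nil {
		return 0, err
	}
	// PacketSize counts everything after itself: 2 extension bytes, 1 header
	// length byte, HeaderSize extra header bytes and 1 sub-stream id byte.
	return int(f.PacketSize) - int(f.HeaderSize) - 4, nil
}

func main() {
	// Synthetic packet start: start code 00 00 01 BD, size 0x000C,
	// two example flag bytes and 5 bytes of additional header announced.
	pes := []byte{0x00, 0x00, 0x01, 0xBD, 0x00, 0x0C, 0x81, 0x80, 0x05}
	n, err := payloadSize(pes)
	fmt.Println(n, err)
}
```

Here the declared size of 12 bytes leaves 12 - 5 - 4 = 3 bytes of payload.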

Subtitles may span more than one packet so we need to be sure to read packets until we've read the entire length of the subtitle given by Packet.DataSize.

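// ReadSubtitle reads packets until the whole subtitle has been collected.
// The first packet supplies the header. ReadFrom is a helper that is not
// shown in this post.
func ReadSubtitle(s *os.File) (head Packet, data bytes.Buffer) {
    for i := 0; ; i++ {
        var pack Packet
        pack.Read(s)
        if i == 0 {
            head = pack
        }
        ReadFrom(s, &data, int64(pack.PacketSize))
        // Stop once at least DataSize bytes are buffered; testing with ==
        // would never stop if a packet overshoots the expected size
        if data.Len() >= int(head.DataSize) {
            break
        }
    }
    return
}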

Now that the headers and information like payload size and offsets have been read we can start to decode the subtitle. The first things we need to decode are the control sequences. These sequences tell us how long to display the current subtitle, its color and other information like the offsets to the even and odd fields, since the image data is interlaced.

type ControlHeader struct {
    Date uint16 // time of this control sequence, in units of 1024/90 ms
    Next uint16 // offset to the next control sequence
}

ControlHeader represents the start time and offset to the next control sequence. Once we've got this information we can read the controls.

// ReadControlSequences walks every control sequence in the payload, prints
// the commands it finds and returns the image size and the field offsets.
func ReadControlSequences(head Packet, data *bytes.Buffer) (rect Rect, payload Payload, even, odd uint16) {
    payload.Read(data, head)

    for {
        var header ControlHeader
        err := ReadInto(&payload.Control, &header)
        if err != nil {
            break
        }
        fmt.Printf("%+v\n", header)
        // Convert to int first: 1024 * Date overflows a uint16
        ms := 1024 * int(header.Date) / 90
        end := false
        for !end {
            cmd, err := payload.Control.ReadByte()
            if err != nil {
                break
            }
            switch cmd {
            case 0x00:
                fmt.Println("\tForced")
            case 0x01:
                fmt.Printf("\tStart:\t\t%dms\n", ms)
            case 0x02:
                fmt.Printf("\tStop:\t\t%dms\n", ms)
            case 0x03:
                fmt.Printf("\tPalette:\t%04X\n", payload.Control.Next(2))
            case 0x04:
                fmt.Printf("\tAlpha:\t\t%X\n", payload.Control.Next(2))
            case 0x05:
                buf := payload.Control.Next(6)
                // Each bound is 12 bits. In Go | has the same precedence as
                // + and -, so unpack the bounds before subtracting them
                x0 := uint16(buf[0])<<4 | uint16(buf[1])>>4
                x1 := (uint16(buf[1])&0xF)<<8 | uint16(buf[2])
                y0 := uint16(buf[3])<<4 | uint16(buf[4])>>4
                y1 := (uint16(buf[4])&0xF)<<8 | uint16(buf[5])
                rect = Rect{x1 - x0 + 1, y1 - y0 + 1}
                fmt.Printf("\tDimensions:\t%+v\n", rect)
            case 0x06:
                buf := payload.Control.Next(4)
                even = uint16(buf[0])<<8 | uint16(buf[1])
                odd = uint16(buf[2])<<8 | uint16(buf[3])
                fmt.Printf("\tOffsets:\t%d, %d\n", even, odd)
                fmt.Printf("\tField Len:\t%d, %d\n", odd-even, uint16(payload.Data.Len())-odd)
            case 0xFF:
                end = true
            }
        }
    }
    return
}

The control command is 1 byte and is followed by any parameters necessary for that particular control. The different controls are described below:

  • 0x00 - Forced: The subtitle is displayed whether or not subtitles are selected/enabled. This is typically used for foreign language segments. Takes no arguments.
  • 0x01 - Start: The time at which to start displaying the subtitle. This uses the Date field of ControlHeader; the time in milliseconds is given by 1024 * ControlHeader.Date / 90. Takes no arguments.
  • 0x02 - Stop: The time at which to stop displaying the subtitle, computed the same way from the Date field of its own control sequence. Takes no arguments.
  • 0x03 - Palette: Defines the four colors used for the subtitle. I've decided to ignore implementing this as I will be converting the subtitles to text. Takes 2 bytes of arguments, each color is one nibble.
  • 0x04 - Alpha: Alpha channel information, determines which colors are opaque and which are transparent. Useful for determining the main color as the background will likely have complete transparency. Takes 2 bytes of arguments, each alpha is one nibble.
  • 0x05 - Dimensions: Gives the dimensions of the subtitle image. Takes 6 bytes of arguments, each bound is 3 nibbles. The dimensions in pixels are given by: (X1 - X0 + 1) x (Y1 - Y0 + 1)
    • 0x*** - X0: Left-most x-axis bound.
    • 0x*** - X1: Right-most x-axis bound.
    • 0x*** - Y0: Top-most y-axis bound.
    • 0x*** - Y1: Bottom-most y-axis bound.
  • 0x06 - Field Offsets: Gives the offsets to the even and odd fields of the image. Takes 4 bytes of arguments: the first two bytes are the even field offset, the last two bytes the odd field offset. This will be useful for rendering each field line in the proper order.
  • 0xFF - End Control: Signals the end of a control sequence.
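
The two pieces of arithmetic in this list, the time conversion and the 12-bit bounds, can be sketched on their own. The function names `delay` and `area` and the sample bytes are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// delay converts a ControlHeader.Date value to a duration.
// One unit is 1024/90000 of a second, the same as 1024 * Date / 90 ms.
func delay(date uint16) time.Duration {
	return time.Duration(date) * 1024 * time.Second / 90000
}

// area unpacks the six argument bytes of command 0x05 into the four
// 12-bit bounds X0, X1, Y0 and Y1.
func area(buf [6]byte) (x0, x1, y0, y1 int) {
	x0 = int(buf[0])<<4 | int(buf[1])>>4
	x1 = int(buf[1]&0x0F)<<8 | int(buf[2])
	y0 = int(buf[3])<<4 | int(buf[4])>>4
	y1 = int(buf[4]&0x0F)<<8 | int(buf[5])
	return
}

func main() {
	fmt.Println(delay(90)) // 90 units of 1024/90 ms

	// Sample arguments: X0 = 0x000, X1 = 0x2CF, Y0 = 0x19A, Y1 = 0x1DF.
	x0, x1, y0, y1 := area([6]byte{0x00, 0x02, 0xCF, 0x19, 0xA1, 0xDF})
	fmt.Printf("%dx%d\n", x1-x0+1, y1-y0+1)
}
```

Doing the arithmetic in int or time.Duration avoids the overflow that 1024 * Date would cause in a uint16.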

Now that we've got some information about the dimensions and locations of the subtitle image we can look at decoding and drawing it. Subtitle images are run-length-encoded (RLE). The basic idea behind RLE is to compress the image data into a pixel color and a number of pixels to draw in that color. In the format used for subtitles each run of pixels is encoded by one character of the following alphabet, where * represents a wildcard nibble:

  • 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
  • 0x1*, 0x2*, 0x3*
  • 0x04*, 0x05*, 0x06*, 0x07*, 0x08*, 0x09*, 0x0A*, 0x0B*, 0x0C*, 0x0D*, 0x0E*, 0x0F*
  • 0x01**, 0x02**, 0x03**
  • 0x000*

To determine the color and number of pixels to draw we need to do a little bitwise arithmetic. If X is the value of the character, the number of pixels to draw is given by X >> 2 and the color is given by X & 0x03.

There is one character in the alphabet which has a special meaning and neither of the above operations apply to it. That is 0x000* which is a sort of carriage return character. It means simply fill the rest of the line with the given color. After every carriage return we need to read a line from the opposite field and reset the x position in the image to 0 and increment the y position.
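
The alphabet and the two operations above can be tried on a single line of nibbles. This is a sketch under assumptions: the `run` type, the `decodeLine` function and the sample input are made up, and the input is taken as an already split slice of nibbles rather than a byte stream.

```go
package main

import "fmt"

// run is one decoded span: n pixels of palette index color.
type run struct {
	n     int
	color uint8
}

// decodeLine decodes one image line from a slice of nibbles. It returns the
// runs and the number of nibbles consumed. width is the line length in pixels.
func decodeLine(nibbles []uint8, width int) (runs []run, used int) {
	x := 0
	for x < width {
		var code uint16
		complete := false
		threshold := uint16(0x4) // smallest complete 1-nibble character
		for i := 0; i < 4; i++ {
			if used == len(nibbles) {
				return runs, used // input ended in the middle of a character
			}
			code = code<<4 | uint16(nibbles[used])
			used++
			if code >= threshold {
				complete = true
				break
			}
			threshold <<= 2 // 0x10, 0x40, 0x100 for longer characters
		}
		n := int(code >> 2)
		if !complete {
			n = width - x // carriage return: fill the rest of the line
		}
		runs = append(runs, run{n, uint8(code & 0x3)})
		x += n
	}
	return runs, used
}

func main() {
	// 0x7, then 0x12, then the carriage return 0x0001, on a 16 pixel line.
	runs, used := decodeLine([]uint8{0x7, 0x1, 0x2, 0x0, 0x0, 0x0, 0x1}, 16)
	fmt.Println(runs, used)
}
```

The sample decodes to one pixel of color 3, four pixels of color 2, and the remaining eleven pixels in color 1.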

Before we get into the code about drawing images I should mention one of the problems I ran into while writing this. The problem is that Golang doesn't provide any mechanism for reading nibble-aligned data. So I went ahead and wrote a small structure and a few methods for accomplishing this.

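// Nibbler reads a buffer four bits at a time, high nibble first.
type Nibbler struct {
    r       *bytes.Buffer
    Current uint8 // byte currently being split
    Aligned uint8 // 0 when the next nibble starts a new byte, 4 otherwise
}

func NewNibbler(r *bytes.Buffer) Nibbler {
    return Nibbler{r, 0, 0}
}

// GetNibble returns the next nibble, reading a new byte when needed.
// ReadInto is a helper that is not shown in this post.
func (n *Nibbler) GetNibble() (b uint8, err error) {
    if n.Aligned == 0 {
        err = ReadInto(n.r, &n.Current)
        if err != nil {
            return 0, err
        }
    }
    n.Aligned ^= 4
    b = (n.Current >> n.Aligned) & 0x0F
    return b, err
}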

The basic functionality is achieved by using some bitwise operations to switch which nibble we return each time the GetNibble method is called and reading a new byte every time we've read the 2nd nibble of the current byte. Access is provided to the Aligned field to determine if we're byte-aligned or not since we need to use this in the function that draws the subtitle images.
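
To make the nibble order concrete, the short sketch below splits a whole byte slice up front. The `nibbles` function is hypothetical and not part of the program in this post; it drops the alignment state that the drawing code relies on, so it only illustrates the order in which nibbles come out.

```go
package main

import "fmt"

// nibbles splits each byte into its high nibble followed by its low nibble,
// the same order in which GetNibble returns them.
func nibbles(data []byte) []uint8 {
	out := make([]uint8, 0, 2*len(data))
	for _, b := range data {
		out = append(out, b>>4, b&0x0F)
	}
	return out
}

func main() {
	// 0xAB 0xCD comes out as A, B, C, D.
	fmt.Println(nibbles([]byte{0xAB, 0xCD}))
}
```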

The following code decodes the RLE image and draws all of the pixels to an image of dimensions specified in the control sequence.

// DrawPixels draws n pixels of color index c starting at (x, y).
// color is the image/color package.
func DrawPixels(s *image.Gray, x uint16, y uint16, n uint16, c uint8) {
    for i := 0; i < int(n); i++ {
        // uint8 arithmetic: indexes 0 to 3 map to 64, 128, 192 and 0
        s.SetGray(int(x)+i, int(y), color.Gray{(c + 1) << 6})
    }
}

func ReadRLEImage(rect Rect, payload *Payload, even, odd uint16) *image.Gray {
    subImg := image.NewGray(image.Rect(0, 0, int(rect.w), int(rect.h)))
    bData := payload.Data.Bytes()
    evenNibbler := NewNibbler(bytes.NewBuffer(bData[even:odd]))
    oddNibbler := NewNibbler(bytes.NewBuffer(bData[odd:]))

    var x, y uint16
    field := true // true: even field, false: odd field

    for y < rect.h {
        currentNibbler := &evenNibbler
        if !field {
            currentNibbler = &oddNibbler
        }

        // Read up to four nibbles; a character is complete once it reaches
        // the smallest value of its length (0x4, 0x10, 0x40, 0x100)
        var b uint16
        complete := false
        for limit := uint16(0x4); limit <= 0x100; limit <<= 2 {
            t, _ := currentNibbler.GetNibble()
            b = (b << 4) | uint16(t)
            if b >= limit {
                complete = true
                break
            }
        }

        if complete {
            run := b >> 2
            DrawPixels(subImg, x, y, run, uint8(b&0x3))
            x += run
            continue
        }

        // Carriage return: fill the rest of the line, then move to the
        // next line, which comes from the other field
        DrawPixels(subImg, x, y, rect.w-x, uint8(b&0x3))
        x = 0
        y++
        field = !field
        // Lines start on a byte boundary; skip the padding nibble
        if currentNibbler.Aligned != 0 {
            currentNibbler.GetNibble()
        }
    }
    return subImg
}

You'll notice I used the even and odd field offsets to create buffers for both the even and odd fields of the image. Then to switch between them, the pointer currentNibbler is switched between each field whenever we encounter a carriage return. I've also done some basic math in the DrawPixels function to space the four colors evenly through the greyscale range: (c + 1) << 6 gives 64, 128 and 192 for colors 0 to 2, and wraps around to 0 for color 3 because the result is a uint8.
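
The grey levels can be checked in isolation. The `grey` function below is a hypothetical wrapper around the same expression that DrawPixels uses.

```go
package main

import "fmt"

// grey mirrors the expression used in DrawPixels. The shift is done in
// uint8, so the result for color 3 (256) wraps around to 0.
func grey(c uint8) uint8 {
	return (c + 1) << 6
}

func main() {
	for c := uint8(0); c < 4; c++ {
		fmt.Println(c, grey(c))
	}
}
```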

The next step for this project is to write a program which can detect and separate images of each character from a subtitle image. After this I'll write a user interface for the user to give character meanings to each character image. From that an SRT file can be written using this character matrix. This is the same basic operation of most VOBSUB to SRT converters except that I aim to make it easier to use.

The complete source for this program can be found at: Gist: 1381809. Note that this program will read and decode only the first subtitle in the subtitle file. More work will be done on this when I've got time to make a more automated version that will read and decode all subtitles from a file. At some point in the future when I find that GeSHi supports Golang syntax highlighting, I'll update this post to make it more readable.

9 Apr 2010

DVD Ripping Made Easy

Reading through my normal list of RSS feeds I stumbled upon a post claiming to have found some software that greatly simplifies the process of decrypting and ripping DVDs. And surprisingly, for the most part they were right.

The software in question is called MakeMKV[1]. It seems to do a decent job of both decrypting and ripping DVDs. Mind you, this software is not meant for transcoding DVD video into a different format.

The software functions much like most DVD decrypting software does. DVDFab[2] and DVD Decrypter[3] provide the same basic functions as MakeMKV with one major exception. Where both DVDFab and DVD Decrypter let you decrypt a DVD and dump its contents to a directory, MakeMKV muxes all of the video, audio and subtitle streams into a single container instead of leaving you with several VOB files from the entire DVD. Each title on the disc is muxed into its own container, which really simplifies backing up TV seasons from DVD to your computer since each episode is typically its own title.

All that is left to do once you've ripped to a Matroska[4] container is either leave it as is, since it's a perfectly fine container and format (albeit nearly the same size as the original content), or transcode it into your favorite format/container. Typically when ripping DVDs I use HandBrake and encode the DVD using the High Profile preset, which performs decombing and detelecine. The High Profile preset also uses constant quality encoding, which seems to be the preferred method for encoding these days since it provides the best perceptible quality vs. compression ratio.

Now the first thing that bothered me about MakeMKV is the fact that the site specifically states that it is free for the beta. But when you read closer it does get better:

Functionality to open DVD discs is free and will always stay free.

So that's promising. At least if you're only interested in ripping DVDs and not Blu-ray or HD DVD then you'll be golden with this software.

Overall I've decided to just stick with MakeMKV for all my decrypting/backup needs from now on since it seems to do as good a job as, or better than, any of the other DVD rippers on the market at the moment.

  1. http://makemkv.com/
  2. http://www.dvdfab.com/
  3. http://www.dvddecrypter.org.uk/
  4. Matroska