A Little Off Code, Computers, Photography and Guns

20Nov/110

Decoding DVD Subtitles with Golang

I've always been very fond of subtitles, though I'm not sure why. When transcoding my DVDs to play them on my network media player I realized I needed a good way to keep the subtitles without burning them into the video. The MPEG-4 container will happily include VOBSUB and SRT subtitle streams and my network media player handles this nicely.

The problem, though, is that including the VOBSUBs exactly as they appeared on the DVD is somewhat problematic: they're almost never the same style between movies, and sometimes they're just plain difficult to read. Converting them to SRT involves a fairly lengthy process of telling an OCR program what each character is as it reads all of the subtitles and writes an SRT file. It's also difficult to correct a mistake if you mess up one character along the way.

So now that I've got the itch I decided to scratch it. I decided to write my own subtitle decoder that would write subtitles to images and a pseudo-OCR program to convert those images into individual character files. From there it would be fairly easy to write a quick interface that presents you with a list of letters at which point you can just fill in the character for each one. Once you've done this you can export it as your favorite text subtitle format in one shot instead of doing it as you go along.

The language I decided to write it in is Golang, as I've been learning it for a few weeks now and it's currently my favorite language for a large number of reasons I won't get into here.

The first major challenge I ran into is that there's not really any standardized information about decoding DVD subtitles. I did find maybe 3-4 sites with varying levels of detail on decoding DVD subtitles, but there were still a lot of gaps in the information.

To start with, we need to decode MPEG Program Stream (PS) packets, which contain MPEG Packetized Elementary Stream (PES) packets. The PS header doesn't contain any information we need to decode subtitles. The PES header contains the size of the packet and the length of the additional headers, which together give us the offset to the payload and its size. SubStream refers to the stream id of the subtitle we're decoding. DataSize is the size of the subtitle payload. ControlPtr is the offset to the control sequences that describe the subtitle's payload.

type Packet struct {
    PSHeader [14]uint8
    PESHeader [4]uint8
    PacketSize uint16
    Extension uint16
    HeaderSize uint8
    SubStream uint8
    DataSize uint16
    ControlPtr uint16
}

To read data into this structure I've written the following method:

func (p *Packet) Read(r io.Reader) {
    binary.Read(r, binary.BigEndian, &p.PSHeader)
    binary.Read(r, binary.BigEndian, &p.PESHeader)
    binary.Read(r, binary.BigEndian, &p.PacketSize)
    binary.Read(r, binary.BigEndian, &p.Extension)
    binary.Read(r, binary.BigEndian, &p.HeaderSize)
    r.(io.ReadSeeker).Seek(int64(p.HeaderSize), os.SEEK_CUR)
    binary.Read(r, binary.BigEndian, &p.SubStream)
    binary.Read(r, binary.BigEndian, &p.DataSize)
    binary.Read(r, binary.BigEndian, &p.ControlPtr)

    p.PacketSize -= uint16(p.HeaderSize) + 4

    // Back up; DataSize and ControlPtr are part of the payload
    r.(io.ReadSeeker).Seek(-4, os.SEEK_CUR)
}

We read each of the structure's fields in order, skipping the additional headers of the PES packet since we don't care about the data in them. We then adjust PacketSize so that it counts only the payload: the two Extension bytes, the HeaderSize byte, the additional headers and the SubStream byte have already been consumed. Before leaving this function we back up four bytes, because DataSize and ControlPtr are part of the payload and the file cursor needs to sit at its start before we read data from the offsets.
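
The size arithmetic can be checked against an in-memory packet. This is only a sketch in current Go (where errors are plain `error` values); `payloadSize` is a hypothetical helper that isn't part of the original program, and it assumes the layout described above: a 14-byte PS header, a 4-byte PES start code and stream id, then the 16-bit packet size.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// payloadSize returns the number of subtitle payload bytes in one packet.
// pkt must start at the PS header. Hypothetical helper for illustration.
func payloadSize(pkt []byte) (int, error) {
	const psHeader, pesStart = 14, 4
	off := psHeader + pesStart
	if len(pkt) < off+5 {
		return 0, errors.New("packet too short")
	}
	packetSize := int(binary.BigEndian.Uint16(pkt[off:])) // bytes following the size field
	headerSize := int(pkt[off+4])                          // length of the additional headers
	// Subtract the 2 extension bytes, the header-size byte,
	// the additional headers and the 1-byte sub-stream id.
	n := packetSize - headerSize - 4
	if n < 0 {
		return 0, errors.New("inconsistent sizes")
	}
	return n, nil
}

func main() {
	pkt := make([]byte, 14+4+5)
	binary.BigEndian.PutUint16(pkt[18:], 100) // PacketSize
	pkt[22] = 5                               // HeaderSize
	n, err := payloadSize(pkt)
	fmt.Println(n, err)
}
```

With a PacketSize of 100 and 5 bytes of additional headers, 91 bytes of payload remain.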

Subtitles may span more than one packet so we need to be sure to read packets until we've read the entire length of the subtitle given by Packet.DataSize.

func ReadSubtitle(s *os.File) (head Packet, data bytes.Buffer) {
    for i := 0; ; i++ {
        var pack Packet
        pack.Read(s)
        if i == 0 {
            head = pack
        }
        ReadFrom(s, &data, int64(pack.PacketSize))
        if data.Len() == int(head.DataSize) {
            break
        }
    }
    return
}
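
ReadFrom (and ReadInto, which shows up further down) are helpers from the full source that aren't reproduced in this post. As a rough idea of what helpers with those shapes could look like in current Go, here is a sketch. The bodies are assumptions, not the code from the gist: it assumes ReadFrom copies exactly n bytes into the buffer and ReadInto decodes one big-endian value.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// ReadFrom copies exactly n bytes from r into data (assumed behaviour).
func ReadFrom(r io.Reader, data *bytes.Buffer, n int64) error {
	_, err := io.CopyN(data, r, n)
	return err
}

// ReadInto decodes one big-endian fixed-size value from r (assumed behaviour).
func ReadInto(r io.Reader, v interface{}) error {
	return binary.Read(r, binary.BigEndian, v)
}

// example reads a 16-bit size and then 3 payload bytes from a small buffer.
func example() (uint16, int, error) {
	src := bytes.NewReader([]byte{0x12, 0x34, 0xAA, 0xBB, 0xCC})
	var size uint16
	if err := ReadInto(src, &size); err != nil {
		return 0, 0, err
	}
	var data bytes.Buffer
	err := ReadFrom(src, &data, 3)
	return size, data.Len(), err
}

func main() {
	size, n, err := example()
	fmt.Printf("%#x %d %v\n", size, n, err)
}
```

Returning the error lets a caller such as ReadSubtitle stop at end of file instead of looping on a short read.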

Now that the headers and information like payload size and offsets have been read we can start to decode the subtitle. The first things we need to decode are the control sequences. These sequences tell us how long to display the current subtitle, its color, and other information like the offsets to the even and odd fields, since the image data is interlaced.

type ControlHeader struct {
    Date uint16
    Next uint16
}

ControlHeader represents the start time and offset to the next control sequence. Once we've got this information we can read the controls.

func ReadControlSequences(head Packet, data *bytes.Buffer) (rect Rect, payload Payload,  even, odd uint16) {
    payload.Read(data, head)

    for {
        var header ControlHeader
        err := ReadInto(&payload.Control, &header)
        if err != nil {
            break
        }
        fmt.Printf("%+v\n", header)
        end := false
        for !end {
            cmd, err := payload.Control.ReadByte()
            if err != nil {
                break
            }
            switch cmd {
                case 0x00: fmt.Println("\tForced")
                // Widen Date first; 1024 * Date would overflow uint16.
                case 0x01: fmt.Printf("\tStart:\t\t%dms\n", 1024 * uint32(header.Date) / 90)
                case 0x02: fmt.Printf("\tStop:\t\t%dms\n", 1024 * uint32(header.Date) / 90)
                case 0x03:
                    fmt.Printf("\tPalette:\t%04X\n", payload.Control.Next(2))
                case 0x04:
                    fmt.Printf("\tAlpha:\t\t%X\n", payload.Control.Next(2))
                case 0x05:
                    buf := payload.Control.Next(6)
                    // Each bound is 12 bits; | and - share precedence in Go, so compute the bounds separately.
                    x0 := uint16(buf[0])<<4 | uint16(buf[1])>>4
                    x1 := (uint16(buf[1])&0x0F)<<8 | uint16(buf[2])
                    y0 := uint16(buf[3])<<4 | uint16(buf[4])>>4
                    y1 := (uint16(buf[4])&0x0F)<<8 | uint16(buf[5])
                    rect = Rect{x1 - x0 + 1, y1 - y0 + 1}
                    fmt.Printf("\tDimensions:\t%+v\n", rect)
                case 0x06:
                    buf := payload.Control.Next(4)
                    even = uint16(buf[0]) << 8 | uint16(buf[1])
                    odd = uint16(buf[2]) << 8 | uint16(buf[3])
                    fmt.Printf("\tOffsets:\t%d, %d\n", even, odd)
                    fmt.Printf("\tField Len:\t%d, %d\n", odd - even, uint16(payload.Data.Len()) - odd)
                case 0xFF:
                    end = true
            }
        }
    }
    return
}
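
The Start and Stop cases convert the Date field to milliseconds with 1024 * Date / 90. Since Date is only 16 bits wide, the multiplication needs a wider type to avoid wrapping around. A minimal sketch of that conversion on its own (`delayMillis` is a name made up for this example):

```go
package main

import "fmt"

// delayMillis converts a ControlHeader.Date value to milliseconds.
// Widening to uint32 first keeps 1024*date from wrapping in 16-bit arithmetic.
func delayMillis(date uint16) uint32 {
	return 1024 * uint32(date) / 90
}

func main() {
	fmt.Println(delayMillis(0), delayMillis(90), delayMillis(300))
}
```

A Date of 90 works out to 1024ms and a Date of 300 to 3413ms.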

The control command is 1 byte and is followed by any parameters necessary for that particular control. The different controls are described below:

  • 0x00 - Forced: subtitle displayed whether or not subtitles are selected/enabled. This is typically used for foreign language segments. Takes no arguments.
  • 0x01 - Start: The time at which to start displaying the subtitle. This uses the Date field of ControlHeader; the start time in milliseconds is given by 1024 * ControlHeader.Date / 90. Takes no arguments.
  • 0x02 - Stop: The time at which to stop displaying the subtitle, computed from Date in the same way. Takes no arguments.
  • 0x03 - Palette: Defines the four colors used for the subtitle. I've decided not to implement this as I will be converting the subtitles to text. Takes 2 bytes of arguments, each color is one nibble.
  • 0x04 - Alpha: Alpha channel information, determines which colors are opaque and which are transparent. Useful for determining the main color as the background will likely have complete transparency. Takes 2 bytes of arguments, each alpha is one nibble.
  • 0x05 - Dimensions: Gives the dimensions of the subtitle image. Takes 6 bytes of arguments, each dimension value is 3 nibbles. The size in pixels is given by the equation: (X1 - X0 + 1) x (Y1 - Y0 + 1)
    • 0x*** - X0: Left-most x-axis bound.
    • 0x*** - X1: Right-most x-axis bound.
    • 0x*** - Y0: Top-most y-axis bound.
    • 0x*** - Y1: Bottom-most y-axis bound.
  • 0x06 - Field Offsets: Gives the offsets to the even and odd fields of the image. Takes 4 bytes of arguments: the first two bytes are the even field offset, the last two bytes the odd field offset. This will be useful for rendering each field line in the proper order.
  • 0xFF - End Control: Signals the end of a control sequence.
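
The 3-nibble packing of the Dimensions command is the fiddliest of these, so here is a small standalone sketch of unpacking it. `decodeDimensions` is a made-up name for this example, and the sample bytes are invented rather than taken from a real disc.

```go
package main

import "fmt"

// decodeDimensions unpacks the 6 argument bytes of command 0x05.
// Each bound is 12 bits (3 nibbles): X0 and X1 share bytes 0-2, Y0 and Y1 bytes 3-5.
func decodeDimensions(buf [6]byte) (w, h uint16) {
	x0 := uint16(buf[0])<<4 | uint16(buf[1])>>4
	x1 := (uint16(buf[1])&0x0F)<<8 | uint16(buf[2])
	y0 := uint16(buf[3])<<4 | uint16(buf[4])>>4
	y1 := (uint16(buf[4])&0x0F)<<8 | uint16(buf[5])
	return x1 - x0 + 1, y1 - y0 + 1
}

func main() {
	// X0=0x000, X1=0x2CF (719), Y0=0x1A0 (416), Y1=0x1BF (447)
	w, h := decodeDimensions([6]byte{0x00, 0x02, 0xCF, 0x1A, 0x01, 0xBF})
	fmt.Println(w, h)
}
```

For these bytes the image comes out 720 pixels wide and 32 pixels high.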

Now that we've got some information about the dimensions and locations of the subtitle image we can look at decoding and drawing it. Subtitle images are run-length-encoded (RLE). The basic idea behind RLE is to compress the image data into a pixel color and a number of pixels to draw in that color. In the subtitle format each run is written as one of the following codes, where * represents a wildcard nibble:

  • 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
  • 0x1*, 0x2*, 0x3*
  • 0x04*, 0x05*, 0x06*, 0x07*, 0x08*, 0x09*, 0x0A*, 0x0B*, 0x0C*, 0x0D*, 0x0E*, 0x0F*
  • 0x01**, 0x02**, 0x03**
  • 0x000*

To determine the color and number of pixels to draw we need to do a little bitwise arithmetic. For a code X, the number of pixels to draw is given by X >> 2 and the color is given by X & 0x03.

There is one character in the alphabet which has a special meaning and neither of the above operations apply to it. That is 0x000* which is a sort of carriage return character. It means simply fill the rest of the line with the given color. After every carriage return we need to read a line from the opposite field and reset the x position in the image to 0 and increment the y position.
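
To make the alphabet concrete, here is a sketch that decodes one code at a time from a slice holding one nibble per element. `decodeRun` is a name invented for this example; like the rest of this post it treats every 4-nibble value below 0x100 as the 0x000* carriage return.

```go
package main

import "fmt"

// decodeRun reads one RLE code from a slice of nibbles (one nibble per element).
// It returns the run length, the 2-bit color, the number of nibbles consumed,
// and eol=true for the 0x000* "fill to end of line" code.
func decodeRun(nib []uint8) (run uint16, color uint8, used int, eol bool) {
	var v uint16
	// Codes of 1..4 nibbles have minimum values 0x4, 0x10, 0x40, 0x100.
	limits := [4]uint16{0x4, 0x10, 0x40, 0x100}
	for i := 0; i < 4 && i < len(nib); i++ {
		v = v<<4 | uint16(nib[i]&0x0F)
		used = i + 1
		if v >= limits[i] {
			return v >> 2, uint8(v & 0x3), used, false
		}
	}
	// Four nibbles read and still below 0x100: treated as the 0x000* code.
	return 0, uint8(v & 0x3), used, used == 4
}

func main() {
	// Nibble stream: 0x7 | 0x1E | 0x0002
	stream := []uint8{0x7, 0x1, 0xE, 0x0, 0x0, 0x0, 0x2}
	for len(stream) > 0 {
		run, color, used, eol := decodeRun(stream)
		fmt.Println(run, color, used, eol)
		stream = stream[used:]
	}
}
```

The sample stream decodes to one pixel of color 3, seven pixels of color 2, and then a carriage return that fills the rest of the line with color 2.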

Before we get into the code about drawing images I should mention one of the problems I ran into while writing this. The problem is that Golang's standard library doesn't provide a ready-made way to read nibble-aligned data. So I went ahead and wrote a small structure and a few methods for accomplishing this.

type Nibbler struct {
    r *bytes.Buffer
    Current uint8
    Aligned uint8
}

func NewNibbler(r *bytes.Buffer) Nibbler {
    return Nibbler{r, 0, 0}
}

func (n *Nibbler) GetNibble() (b uint8, err os.Error) {
    if n.Aligned == 0 {
        err = ReadInto(n.r, &n.Current)
        if err != nil {
            return 0, err
        }
    }
    n.Aligned ^= 4
    b = (n.Current >> n.Aligned) & 0x0F
    return b, err
}

The basic functionality is achieved by using some bitwise operations to switch which nibble we return each time the GetNibble method is called and reading a new byte every time we've read the 2nd nibble of the current byte. Access is provided to the Aligned field to determine if we're byte-aligned or not since we need to use this in the function that draws the subtitle images.
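
To see the nibble order this produces, here is a small self-contained variant written for current Go. The names `nibbleReader` and `nibbles` are made up for this sketch, and it reads from a `bytes.Reader` instead of going through ReadInto.

```go
package main

import (
	"bytes"
	"fmt"
)

// nibbleReader is a current-Go variant of the Nibbler above (hypothetical name).
type nibbleReader struct {
	r       *bytes.Reader
	current uint8
	shift   uint8 // 0 when byte-aligned, 4 after the high nibble was returned
}

// next returns the next nibble, reading a new byte whenever we are byte-aligned.
func (n *nibbleReader) next() (uint8, error) {
	if n.shift == 0 {
		b, err := n.r.ReadByte()
		if err != nil {
			return 0, err
		}
		n.current = b
	}
	n.shift ^= 4
	return (n.current >> n.shift) & 0x0F, nil
}

// nibbles returns every nibble in data, high nibble first.
func nibbles(data []byte) []uint8 {
	n := nibbleReader{r: bytes.NewReader(data)}
	var out []uint8
	for {
		v, err := n.next()
		if err != nil {
			return out
		}
		out = append(out, v)
	}
}

func main() {
	fmt.Println(nibbles([]byte{0xAB, 0xCD}))
}
```

The bytes 0xAB 0xCD come back as the nibbles A, B, C, D in that order, with the high nibble of each byte first.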

The following code decodes the RLE image and draws all of the pixels to an image of dimensions specified in the control sequence.

func DrawPixels(s *image.Gray, x uint16, y uint16, n uint16, c uint8) {
    for i := 0; i < int(n); i++ {
        s.SetGray(int(x) + i, int(y), image.GrayColor{(c + 1) << 6})
    }
}

func ReadRLEImage(rect Rect, payload *Payload, even, odd uint16) (*image.Gray) {
    subImg := image.NewGray(int(rect.w), int(rect.h))
    bData := payload.Data.Bytes()
    evenNibbler := NewNibbler(bytes.NewBuffer(bData[even:odd]))
    oddNibbler := NewNibbler(bytes.NewBuffer(bData[odd:]))

    var x, y uint16
    done := false
    field := true

    for !done {
        var b uint16
        var t uint8

        var currentNibbler *Nibbler

        if field {
            currentNibbler = &evenNibbler
        } else {
            currentNibbler = &oddNibbler
        }

        t, _ = currentNibbler.GetNibble()
        b = (b << 4) | uint16(t)
        if b >= 0x4 {
            run := b >> 2
            DrawPixels(subImg, x, y, run, uint8(b & 0x3))
            x += run
        } else {
            t, _ := currentNibbler.GetNibble()
            b = (b << 4) | uint16(t)
            if b >= 0x10 {
                run := b >> 2
                DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                x += run
            } else {
                t, _ := currentNibbler.GetNibble()
                b = (b << 4) | uint16(t)
                if b >= 0x40 {
                    run := b >> 2
                    DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                    x += run
                } else {
                    t, _ := currentNibbler.GetNibble()
                    b = (b << 4) | uint16(t)
                    if b >= 0x100 {
                        run := b >> 2
                        DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                        x += run
                    } else {
                        DrawPixels(subImg, x, y, rect.w - x, uint8(b & 0x3))
                        x = 0
                        y += 1
                        field = !field
                        if y >= rect.h {
                            done = true
                        }
                        if currentNibbler.Aligned != 0 {
                            currentNibbler.GetNibble()
                        }
                    }
                }
            }
        }
    }
    return subImg
}

You'll notice I used the even and odd field offsets to create buffers for both the even and odd fields of the image. Then to switch between them, the pointer currentNibbler is switched between each field whenever we encounter a carriage return. I've also done some basic math in the DrawPixels function to spread the four subtitle colors across the greyscale range: colors 0, 1 and 2 map to 64, 128 and 192, while color 3 wraps around to 0 because the shift is done in uint8 arithmetic.
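
The grey levels are easy to check in isolation. This sketch just repeats the expression from DrawPixels in a standalone function (`grey` is a name made up for the example):

```go
package main

import "fmt"

// grey mirrors the expression in DrawPixels: (c + 1) << 6 in uint8 arithmetic.
func grey(c uint8) uint8 {
	return (c + 1) << 6
}

func main() {
	for c := uint8(0); c < 4; c++ {
		fmt.Println(c, grey(c))
	}
}
```

Color 3 ends up black because 4 << 6 is 256, which doesn't fit in a uint8.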

The next step for this project is to write a program which can detect and separate images of each character from a subtitle image. After this I'll write a user interface for the user to give character meanings to each character image. From that an SRT file can be written using this character matrix. This is the same basic operation as most VOBSUB to SRT converters, except that I aim to make it easier to use.

The complete source for this program can be found at: Gist: 1381809. Note that this program will read and decode only the first subtitle in the subtitle file. More work will be done on this when I've got time to make a more automated version that will read and decode all subtitles from a file. At some point in the future when I find that GeSHi supports Golang syntax highlighting, I'll update this post to make it more readable.