A Little Off: Code, Computers, Photography and Guns

27 Nov 2012

Reverse Engineering the BodyBugg

I recently bought a BodyBuggSP after a lot of research into fitness measurement/recording devices; I wanted one for recording some information about my sleeping habits.

The BodyBugg V3 and BodyBuggSP are both essentially rebranded flavors of devices designed and manufactured by BodyMedia. As far as I can tell, all of the devices have more or less the same sensors, although some come with Bluetooth or RF communication built in for use with a smartphone or display.

The biggest drawback for me is that, because of the way the devices are designed and the software for retrieving data from them was written, you are required to pay for the online service used to store and interpret the data. If you have a BodyBuggSP you can view the data over Bluetooth, but only view it. You must sync the device with the web service in order to erase the device's memory, which means that if you stop using the web service, your device will simply run out of memory and stop logging.

This sort of commercial douche-baggery really doesn't sit well with me, and I immediately started looking around for any projects reverse engineering the protocol used to communicate with the device and download/clear its memory. As far as I can tell, three people have started projects like this.

One of these projects is a Python script which took the approach of reverse engineering the protocol for communicating with the device. Since the device communicates with a computer via serial over USB, you can pretty easily sniff the traffic on a serial port. This turned out to be a very messy endeavor, as there are a lot of unknown fields and discovering the structure of requests and responses isn't very straightforward.

Another involved sniffing the HTTP traffic between the Java applet used for downloading and clearing device memory and the web service. Unfortunately this method is only capable of recording data, not clearing device memory after the service period has elapsed.

The last project was a Java program which has basically no documentation or source available, since it was distributed as a JAR file and lived on Google Docs. Once BodyMedia caught wind that class files from their Java applet were being distributed verbatim in this program, DMCA notices were filed with Google to remove the files.

The first thing I did was attempt to rewrite some of the Python script in a more general fashion, but I quickly gave up since the protocol seems to have slight differences depending on the device used.

I eventually decided to look at the Java data upload applet used with the web service. Once the JARs for the applet were downloaded I got to work disassembling the classes within. I soon found a class used strictly for serial communication, which provides methods for writing commands to the device and retrieving the response. This is exactly what I wanted. The class essentially wraps a DLL in Java.

The bare minimum classes needed from the applet jars are shown below:

import com.bodymedia.device.usb.Usb;
import com.bodymedia.device.serial.SerialPort3;

import com.bodymedia.common.applets.logger.Logger;
import com.bodymedia.common.applets.CommandException;
import com.bodymedia.common.applets.device.util.LibraryException;

For ease of use, the Usb class can discover the serial ports that any BodyMedia devices are connected to. The code below simply finds the ports the devices exist on and fails if there are no devices or more than one.

Usb usb = new Usb("bmusbapex5", Usb.ANY);
String[] ports = usb.getArmbandPorts();

if(ports.length < 1) {
    System.out.println("No BodyBuggs detected.");
    System.exit(1);
} else if(ports.length > 1) {
    System.out.println("Multiple BodyBuggs detected, re-run with only one connected.");
    System.exit(1);
}

String serialPort = ports[0];

Once the port the device is connected on is discovered it's fairly simple to connect to it and communicate with it.

SerialPort3 ser = new SerialPort3("bmcommapex5", serialPort);
ser.setAddr(0x0000000E, 0xFFFFFFFF);
ser.open();
...
ser.close();

The call to setAddr was the last key to getting communication to work. I'm not sure if the device address is constant on all devices or not. The values used were discovered by sniffing serial communication.

Disassembling the class ArmbandCommand revealed several string constants which represent commands that can be issued to the device.

Executing a command on the device is as simple as:

    ser.writeCommand("command here");
    System.out.println(ser.readResponse());

Some testing revealed that the device responds much like a Unix program, printing usage information if the command is invalid (or a multi-part command). Using this I was able to compile what is, as far as I can tell, a complete list of commands the device understands. Two important commands are get and set. The values each accepts are listed below; each list is alphabetical and reads down its first column, then its second.

get [value [value value]]               set value [value [value]]
age                 modrunning          age                 modthreshold
banner              modtarget           birth               modtoday
battery             modthreshold        bluetoothaddr       modyesterday
birth               modtoday            bluetoothname       offbodydelay
bluetoothaddr       modyesterday        boardnum            operationmode
bluetoothname       offbodydelay        boardseries         orientation
boardnum            operationmode       bootdiagnostic      password
boardseries         orientation         bt_link_data        personalized
bootdiagnostic      password            bt_local_port       productcode
bt_link_data        personalized        bt_mysterybyte      radiotimeout
bt_local_port       productcode         bt_remote_address   rrcollect
bt_mysterybyte      radiotimeout        deepsleeptimeout    sampletimeout
bt_remote_address   rrcollect           defaultconfig       secondactivity
capabilities        sampletimeout       dialstring          serialnum
charge              secondactivity      disable             sex
checksum            serialnum           display1224         smoker
deepsleeptimeout    sex                 eedebug             spewmode
defaultconfig       smoker              eegoalmsg           stepgoalmsg
dialstring          spewmode            eerunning           stepsrunning
disable             stepgoalmsg         eetarget            stepstoday
display1224         stepsrunning        eetoday             stepsyesterday
eedebug             stepstoday          eeyesterday         steptarget
eegoalmsg           stepsyesterday      epoch               subject
eerunning           steptarget          gsrthresh           timezonechange
eetarget            subject             handed              usbspeed
eetoday             timezonechange      height              username
eeyesterday         usbspeed            hostname            viggoalmsg
epoch               username            hrtarget            vigrunning
gsrthresh           value               iapbundleseedid     vigtarget
handed              version             iapprotocol         vigthreshold
height              viggoalmsg          ipaddress           vigtoday
hostname            vigrunning          lastdataupdate      vigyesterday
hrtarget            vigtarget           led                 volume
iapbundleseedid     vigthreshold        modgoalmsg          weight
iapprotocol         vigtoday            modrunning          welcomemsg
ipaddress           vigyesterday        modtarget
lastdataupdate      volume
led                 weight
memory              welcomemsg
modgoalmsg

Commands available to deal with recorded data are as follows:

file init | clear | size | msgerase

I'm not sure what the exact difference between init and clear is. In my source I've been using file init, which appears to clear only data memory, not configuration.

The possible types of data recorded by the device are shown using channel show:

Chan#   Name
-----   ----
    0   RAWACCFW
    1   RAWTSKIN
    2   RAWGSR
    3   RAWACCTR
    4   RAWACCLO
    5   RAWVBAT
    6   RAWTCOV
    7   RAWON
    8   EE
    9   MOVTSKIN
   10   MOVGSR
   11   MOVACCTR
   12   MOVACCLO
   13   MOVVBAT
   14   MOVTCOV
   15   MOVON
   16   MADACCTR
   17   MADACCLO
   18   F0CROSS
   19   HRATE
   20   PEDO3
   21   PLATEAU
   22   MADECG
   23   TRPEAKS
   24   MOVTHETA
   25   FWPEAKS
   26   FCOUNT
   27   MOVACCFW
   28   MADACCFW
   29   TCOUNT
   30   LCOUNT
   31   PEDO3TOE
   32   TIMESTMP
   33   HEARTBT
   34   T0CROSS
   35   L0CROSS
   36   RAWECG
   37   LOGSWEEP
   38   MADTHETA
   39   LOPEAKS
   40   COMPGSR
   41   RAWCGSR

One channel listed here which is conditionally recorded is TIMESTMP. This channel records a timestamp when the button on the device is pressed while being worn. This is very useful for delimiting important events during use.

The information recorded per unit time can be modified using the record command.

record show
record define <index> <type> <name> <divisor> <channels[1..8]> <bytes>
record commit

The output of record show is shown below:

#  Type  Name     Div  Channels                         Bytes
-  ----  ------  ----  -------------------------------  -----
0    16  V6RES1  1920    9  11  12  27  14  16  17  40     13
1    17  V6RES2  1920   20  21  23  24  38  29  37 254     12
2    18  V6RES3  1920   30  34  35  31  39  13  28 254     12
3    19  V6RES4  1920   26  18  25  10   8 254 254 254      9
4    20  UNUSED     0  254 254 254 254 254 254 254 254      0

Theoretically every available channel may be recorded. However, to define a new record, the bytes parameter must be calculated for each record index. It is calculated as follows, where n is the number of non-empty channels in the record (at most 8; empty slots appear as 254 in the output above):

$$bytes = \left\lceil\frac{n\cdot 12}{8}\right\rceil + 1$$

The divisor sets how often data is recorded for a particular record. The default value of 1920 produces one record per minute; halving it to 960 produces one record every 30 seconds. A divisor of 32 corresponds to 1 second, so the divisor for any interval is 32 times the interval in seconds.
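The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, treating 254 as the "empty channel" marker is inferred from the record show output, and returning 0 for a record with no channels simply mirrors the UNUSED row rather than anything documented.

```python
import math

EMPTY_CHANNEL = 254      # inferred from `record show`: unused channel slots read 254
TICKS_PER_SECOND = 32    # from the text: a divisor of 32 corresponds to 1 second


def record_bytes(channels):
    """Bytes value for a record: 12 bits per used channel, rounded up, plus 1."""
    if len(channels) > 8:
        raise ValueError("a record holds at most 8 channels")
    n = sum(1 for c in channels if c != EMPTY_CHANNEL)
    if n == 0:
        return 0  # the UNUSED record in the listing reports 0 bytes
    return math.ceil(n * 12 / 8) + 1


def divisor_for_interval(seconds):
    """Divisor that produces one record every `seconds` seconds."""
    return round(seconds * TICKS_PER_SECOND)


def interval_for_divisor(divisor):
    """Seconds between records for a given divisor."""
    return divisor / TICKS_PER_SECOND
```

Checking these against the default records is a quick sanity test: V6RES1 has 8 channels and 13 bytes, V6RES4 has 5 channels and 9 bytes, and a 60 second interval gives the default divisor of 1920.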

The transient command appears to be used for getting simple overview information about recent data for use with the mobile phone application. Two parameters are available: get and reset.

The last important command I've found is retrieve PDP, which will return information about each channel's recorded data including a starting timestamp. Example output is as follows:

SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000080c0000000300200780EE  
fff0bf232
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000090c00000005002003c0MOVTSKIN
fff8a58b58c28cc
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000000a0c0000000300200780MOVGSR  
fff56857b
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000000b0c00000005002003c0MOVACCTR
8ad8a986d8ad885
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000000c0c00000005002003c0MOVACCLO
9139088f38ef913
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000000d0c0000000300200780MOVVBAT
bf4bedbee
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000000e0c00000005002003c0MOVTCOV
fff7967a87b87c6
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000100c00000005002003c0MADACCTR
00003903d0420ec
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000110c00000005002003c0MADACCLO
00002b02f0290d5
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000120c0000000300200780F0CROSS
00001a06f
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000140c0000000300200780PEDO3  
000000000
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000150c0000000300200780PLATEAU
000000000
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000170c0000000300200780TRPEAKS
0000d3130
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000180c0000000300200780MOVTHETA
8007f8823
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000190c0000000300200780FWPEAKS
00009e0e2
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001a0c0000000300200780FCOUNT  
00004814a
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001b0c00000005002003c0MOVACCFW
6cc6c86a56b7723
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001c0c0000000300200780MADACCFW
00002a08f
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001d0c0000000300200780TCOUNT  
000083193
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001e0c0000000300200780LCOUNT  
000070148
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf000000001f0c0000000300200780PEDO3TOE
000000000
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000220c0000000300200780T0CROSS
00003f092
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000230c0000000300200780L0CROSS
000026091
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000250c0000000300200780LOGSWEEP
0004e76c2
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000260c0000000300200780MADTHETA
00136e3a3
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000270c0000000300200780LOPEAKS
0000a0107
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf00000000280c00000005002003c0COMPGSR
fff00a01301b026
SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf0000000068100000000100000002DIAGNSTC
1354009655 5 1

Each SESSION-BEGIN statement contains a hexadecimal string which includes, at the least, the timestamp at which recording started. It is shown below delimited by pipes. I'm not sure what other information is recorded in the hexadecimal string:

SESSION-BEGIN0502Firefly2_00000000e00e3ec|50b48bbf|00000000080c0000000300200780EE

Each line following the SESSION-BEGIN statement contains a hexadecimal string which represents each data point recorded for the channel. Data points are represented as unaligned 12-bit integers.
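As a rough sketch of how this output could be parsed, the Python below pulls the start timestamp out of a SESSION-BEGIN line and splits a data line into 12-bit values (three hex digits each). The function names are hypothetical, and the timestamp offset (15 hex digits after the underscore) is taken only from the single example above, so it may not hold for other devices or firmware.

```python
from datetime import datetime, timezone

# Assumption based on the one example above: the 8-digit timestamp starts
# 15 hex digits after the underscore in the SESSION-BEGIN line.
TIMESTAMP_OFFSET = 15


def session_start(header):
    """Return the recording start time as Unix seconds."""
    hex_part = header.split("_", 1)[1]
    return int(hex_part[TIMESTAMP_OFFSET:TIMESTAMP_OFFSET + 8], 16)


def decode_points(hex_line):
    """Split a data line into 12-bit integers, three hex digits per value."""
    hex_line = hex_line.strip()
    if len(hex_line) % 3 != 0:
        raise ValueError("expected a multiple of three hex digits")
    return [int(hex_line[i:i + 3], 16) for i in range(0, len(hex_line), 3)]


if __name__ == "__main__":
    header = ("SESSION-BEGIN0502Firefly2_00000000e00e3ec50b48bbf"
              "00000000080c0000000300200780EE")
    start = session_start(header)
    print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat())
    print(decode_points("fff0bf232"))
```

For the EE channel in the example, the data line fff0bf232 splits into the three values 0xfff, 0x0bf and 0x232.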

There are several other commands which I've not tested yet:

bin show
bin set <index> <value>
bin reset

log text <listen_timeout> <text>
log data <listen_timeout> <mfg> <units> <data ...>

a <audible-number> <count>
wav <audible-number> <count>
visual <visual-number> <count>

word address [value]
long address [value]

remind N at hhmm
remind N cancel
remind list
remind N gettext
remind N settext <text>

noop
system
reboot

The remaining logic for the tool I've written for downloading data and clearing memory is shown below:

// Requires: java.io.FileWriter, java.util.regex.Matcher, java.util.regex.Pattern
ser.writeCommand("get lastdataupdate");

Pattern lastUpdateRegex = Pattern.compile("Last Data Update: ([0-9]+)");
Matcher matcher = lastUpdateRegex.matcher(ser.readResponse().toString());

// Bail out instead of throwing if the response is not in the expected form.
if (!matcher.find()) {
    System.out.println("Unexpected response to get lastdataupdate.");
    System.exit(1);
}
String lastUpdate = matcher.group(1);

String logPath = String.format("%s.log", lastUpdate);
System.out.printf("Writing data to: %s\n", logPath);
FileWriter out = new FileWriter(logPath);

ser.writeCommand("retrieve PDP");
out.write(ser.readResponse().toString());

out.close();

System.out.println("Clearing device memory and updating timestamps.");
ser.writeCommand("file init");
// Use a single timestamp (Unix seconds) for both values.
long now = System.currentTimeMillis() / 1000L;
ser.writeCommand(String.format("set lastdataupdate %d", now));
ser.writeCommand(String.format("set epoch %d", now));

This reads the timestamp at which data was last cleared and uses it to name the output file. The output of retrieve PDP is then written to this file. Once this is complete, file init is executed to clear recorded data, and finally the device's clock and last data update time are set to the current time.

To avoid the same fate as FreeTheBugg, I am not packaging any of the JARs this program needs to run. Instead you'll need to download them and place them in the directory bodybuggbypass_lib, which must be in the same directory as the executable JAR bodybuggbypass.jar.

Full source presented here as well as sample data and a few other tools can be found at: https://github.com/bemasher/BodyBuggBypass

17 Feb 2012

OCR Segmentation for Movie Subtitles

A little while ago I wrote a post on decoding DVD subtitles. Now that the subtitle decoder works well enough for me to move on to other aspects of the project, I have started writing a program which will scan subtitle images and segment them into a matrix of unique characters.

One of the first problems I considered while writing the matrix generator was how to merge non-contiguous characters such as colons, semicolons, i's, j's and anything with an accent, as found in foreign-language sections. The other was how to determine whether a gap between two characters separates characters of the same word or separates two words.

To measure some of these things I first needed an idea of the sorts of features I'd find by considering only the vertical and horizontal distances between characters (and the bounds of the image). I wrote a quick Golang program which finds the boundaries of each character and then, per row or column, scans up or right to find the next character or the bounds of the image. What I found is quite interesting.

The first of the two was the frequency of horizontal distances between characters. If we consider only distances of 50px or less, we will most likely find information about how far apart characters belonging to the same word are typically spaced and how far apart words are typically spaced. Sure enough, a log plot of the frequencies vs. pixel distances showed this quite clearly.



There were two obvious peaks found when scanning horizontally. The first, at 5px, is likely the distance between characters belonging to the same word. The second, at 14px, is likely the distance between words.

When looking at vertical scanning distance I expected to find several peaks in the occurrences: the distances between upper-case characters, lower-case characters and characters extending beyond the baseline, and the next closest character above each of these character classes. These features were not as well defined in the same style of plot as they were in the horizontal scan.



The horizontal spacing can tell us whether a gap is a character separator or a word separator; for this particular set of subtitles the character separation is ~5px and the word separation is ~14px. The vertical spacing is not as well defined and will need more analysis to find a metric for separating lines and merging non-contiguous characters, since most non-contiguous characters have components separated vertically rather than horizontally.
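One way the two horizontal peaks might be put to use is sketched below in Python. It is not the program described here: the midpoint threshold between the peaks is an assumption, and the function names and the (left, right) box representation are invented for the example.

```python
CHAR_GAP_PX = 5    # first peak: spacing between characters of one word
WORD_GAP_PX = 14   # second peak: spacing between words


def classify_gap(gap_px, char_gap=CHAR_GAP_PX, word_gap=WORD_GAP_PX):
    """Label a horizontal gap as 'char' or 'word' using the midpoint of the peaks."""
    threshold = (char_gap + word_gap) / 2
    return "word" if gap_px > threshold else "char"


def group_words(boxes):
    """Group character boxes into words.

    `boxes` is a list of (left, right) pixel extents sorted left to right.
    """
    words = []
    current = []
    for box in boxes:
        # Start a new word when the gap to the previous character is word-sized.
        if current and classify_gap(box[0] - current[-1][1]) == "word":
            words.append(current)
            current = []
        current.append(box)
    if current:
        words.append(current)
    return words
```

With these thresholds, three boxes separated by gaps of 5px and 14px come out as one two-character word followed by a one-character word.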

20 Nov 2011

Decoding DVD Subtitles with Golang

I've always been very fond of subtitles but I'm not sure of the reason why. When transcoding my DVDs to play them on my network media player I realized I needed a good way to keep the subtitles without burning them into the video. The MPEG-4 container will happily include VOBSUB and SRT subtitle streams and my network media player handles this nicely.

The problem, though, is that including the VOBSUBs exactly as they appeared on the DVD is somewhat problematic: they're almost never the same style between movies, and sometimes they're just plain difficult to read. Converting them to SRT involves a fairly lengthy process of telling an OCR program what each character is as it reads through all of the subtitles and writes an SRT file. It is also difficult to correct a mistake if you mess up one character along the way.

So now that I've got the itch I decided to scratch it. I decided to write my own subtitle decoder that would write subtitles to images and a pseudo-OCR program to convert those images into individual character files. From there it would be fairly easy to write a quick interface that presents you with a list of letters at which point you can just fill in the character for each one. Once you've done this you can export it as your favorite text subtitle format in one shot instead of doing it as you go along.

The language I decided to write it in is Golang, as I've been learning it for a few weeks now and it's currently my favorite language for a large number of reasons I won't get into here.

The first major challenge I ran into is that there's not really any standardized information about decoding DVD subtitles. I did find maybe 3-4 sites with varying levels of detail on decoding DVD subtitles, but there were still a lot of gaps in the information.

To start with, we need to decode MPEG Program Stream (PS) packets, which contain MPEG Packetized Elementary Stream (PES) packets. The PS header doesn't contain any information we need to decode subtitles. The PES header contains the size of the packet's payload, the offset to the payload and the length of the additional headers. SubStream is the stream id of the subtitle we're decoding, DataSize is the size of the subtitle payload, and ControlPtr is the offset to the control sequences that describe the subtitle's payload.

type Packet struct {
    PSHeader   [14]uint8
    PESHeader  [4]uint8
    PacketSize uint16
    Extension  uint16
    HeaderSize uint8
    SubStream  uint8
    DataSize   uint16
    ControlPtr uint16
}

To read data into this structure I've written the following method:

func (p *Packet) Read(r io.Reader) {
    binary.Read(r, binary.BigEndian, &p.PSHeader)
    binary.Read(r, binary.BigEndian, &p.PESHeader)
    binary.Read(r, binary.BigEndian, &p.PacketSize)
    binary.Read(r, binary.BigEndian, &p.Extension)
    binary.Read(r, binary.BigEndian, &p.HeaderSize)
    // Skip the additional PES headers; we don't need them.
    r.(io.ReadSeeker).Seek(int64(p.HeaderSize), os.SEEK_CUR)
    binary.Read(r, binary.BigEndian, &p.SubStream)
    binary.Read(r, binary.BigEndian, &p.DataSize)
    binary.Read(r, binary.BigEndian, &p.ControlPtr)

    // PacketSize also counts Extension (2), HeaderSize (1), the additional
    // headers and SubStream (1); subtract them to get the payload size.
    p.PacketSize -= uint16(p.HeaderSize) + 4

    // Back up; DataSize and ControlPtr are part of the payload
    r.(io.ReadSeeker).Seek(-4, os.SEEK_CUR)
}

We read each of the structure's fields in order. We skip the additional headers of the PES packet since we don't care about the data in them. We also adjust the given packet size, since it counts the extension, header length, additional headers and SubStream bytes that we have already consumed. Before leaving this function we back up 4 bytes, because DataSize and ControlPtr are part of the payload and the file cursor needs to sit at the start of it for the offsets to work.
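For comparison, the same header walk can be sketched in Python with the struct module. This is a hedged illustration rather than a port of the program: the function name and the returned dictionary are invented, and SAMPLE is a synthetic packet made up for the example, not bytes from a real DVD.

```python
import io
import struct


def read_packet_header(f):
    """Read one PS/PES header from a seekable binary stream.

    Follows the field order of the Packet struct and leaves the stream
    positioned at the start of the subtitle payload.
    """
    f.seek(14 + 4, io.SEEK_CUR)  # skip PSHeader (14 bytes) and PESHeader (4 bytes)
    packet_size, _extension, header_size = struct.unpack(">HHB", f.read(5))
    f.seek(header_size, io.SEEK_CUR)  # skip the additional PES headers
    sub_stream, data_size, control_ptr = struct.unpack(">BHH", f.read(5))
    f.seek(-4, io.SEEK_CUR)  # DataSize and ControlPtr belong to the payload
    return {
        "sub_stream": sub_stream,
        "data_size": data_size,      # only meaningful in a subtitle's first packet
        "control_ptr": control_ptr,  # only meaningful in a subtitle's first packet
        "payload_size": packet_size - header_size - 4,
    }


# Synthetic packet, invented for illustration only.
SAMPLE = (
    bytes(14)                             # PS header (contents ignored here)
    + b"\x00\x00\x01\xbd"                 # PES header (contents ignored here)
    + struct.pack(">HHB", 19, 0x8180, 5)  # PacketSize, Extension, HeaderSize
    + bytes(5)                            # additional PES headers
    + b"\x20"                             # SubStream
    + struct.pack(">HH", 10, 6)           # DataSize, ControlPtr
    + bytes(6)                            # remainder of the payload
)

if __name__ == "__main__":
    print(read_packet_header(io.BytesIO(SAMPLE)))
```

In the sample, PacketSize is 19 and HeaderSize is 5, which leaves 19 - 5 - 4 = 10 bytes of payload, matching the 10 bytes that remain in the stream after the header has been read.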

Subtitles may span more than one packet so we need to be sure to read packets until we've read the entire length of the subtitle given by Packet.DataSize.

func ReadSubtitle(s *os.File) (head Packet, data bytes.Buffer) {
    for i := 0; ; i++ {
        var pack Packet
        pack.Read(s)
        if i == 0 {
            // Only the first packet carries a valid DataSize and ControlPtr.
            head = pack
        }
        ReadFrom(s, &data, int64(pack.PacketSize))
        // Use >= so an oversized read cannot loop forever.
        if data.Len() >= int(head.DataSize) {
            break
        }
    }
    return
}

Now that the headers and information like payload size and offsets have been read we can start to decode the subtitle. The first things we need to decode are the control sequences. These sequences give us information about how long to display the current subtitle, its color and other information like offsets to even and odd fields since the image data is interlaced.

type ControlHeader struct {
    Date uint16
    Next uint16
}

ControlHeader represents the start time and offset to the next control sequence. Once we've got this information we can read the controls.

func ReadControlSequences(head Packet, data *bytes.Buffer) (rect Rect, payload Payload, even, odd uint16) {
    payload.Read(data, head)

    for {
        var header ControlHeader
        err := ReadInto(&payload.Control, &header)
        if err != nil {
            break
        }
        fmt.Printf("%+v\n", header)
        // Convert to int first: 1024 * header.Date would overflow uint16.
        ms := 1024 * int(header.Date) / 90
        end := false
        for !end {
            cmd, err := payload.Control.ReadByte()
            if err != nil {
                break
            }
            switch cmd {
            case 0x00:
                fmt.Println("\tForced")
            case 0x01:
                fmt.Printf("\tStart:\t\t%dms\n", ms)
            case 0x02:
                fmt.Printf("\tStop:\t\t%dms\n", ms)
            case 0x03:
                fmt.Printf("\tPalette:\t%04X\n", payload.Control.Next(2))
            case 0x04:
                fmt.Printf("\tAlpha:\t\t%X\n", payload.Control.Next(2))
            case 0x05:
                buf := payload.Control.Next(6)
                // Four 12-bit bounds packed into 6 bytes: X0, X1, Y0, Y1.
                // Computed separately because | binds no tighter than - and +.
                x0 := uint16(buf[0])<<4 | uint16(buf[1])>>4
                x1 := (uint16(buf[1])&0xF)<<8 | uint16(buf[2])
                y0 := uint16(buf[3])<<4 | uint16(buf[4])>>4
                y1 := (uint16(buf[4])&0xF)<<8 | uint16(buf[5])
                rect = Rect{x1 - x0 + 1, y1 - y0 + 1}
                fmt.Printf("\tDimensions:\t%+v\n", rect)
            case 0x06:
                buf := payload.Control.Next(4)
                even = uint16(buf[0])<<8 | uint16(buf[1])
                odd = uint16(buf[2])<<8 | uint16(buf[3])
                fmt.Printf("\tOffsets:\t%d, %d\n", even, odd)
                fmt.Printf("\tField Len:\t%d, %d\n", odd-even, uint16(payload.Data.Len())-odd)
            case 0xFF:
                end = true
            }
        }
    }
    return
}

The control command is 1 byte and is followed by any parameters necessary for that particular control. The different controls are described below:

  • 0x00 - Forced: The subtitle is displayed whether or not subtitles are selected/enabled. This is typically used for foreign language segments. Takes no arguments.
  • 0x01 - Start: The time at which to start displaying the subtitle; this uses the Date field of ControlHeader. The time in milliseconds is given by: 1024 * ControlHeader.Date / 90. Takes no arguments.
  • 0x02 - Stop: The time at which to stop displaying the subtitle, computed the same way as Start. Takes no arguments.
  • 0x03 - Palette: Defines the four colors used for the subtitle. I've decided to ignore implementing this as I will be converting the subtitles to text. Takes 2 bytes of arguments; each color is one nibble.
  • 0x04 - Alpha: Alpha channel information, which determines which colors are opaque and which are transparent. Useful for determining the main color, as the background will likely have complete transparency. Takes 2 bytes of arguments; each alpha is one nibble.
  • 0x05 - Dimensions: Gives the dimensions of the subtitle image. Takes 6 bytes of arguments; each bound is 3 nibbles. The size in pixels is given by: (X1 - X0 + 1) x (Y1 - Y0 + 1)
    • 0x*** - X0: Left-most x-axis bound.
    • 0x*** - X1: Right-most x-axis bound.
    • 0x*** - Y0: Top-most y-axis bound.
    • 0x*** - Y1: Bottom-most y-axis bound.
  • 0x06 - Field Offsets: Gives the offsets to the even and odd fields of the image. Takes 4 bytes of arguments: the first two bytes are the even field offset and the last two bytes are the odd field offset. This will be useful for rendering each field line in the proper order.
  • 0xFF - End Control: Signals the end of a control sequence.
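The argument layouts for the Start/Stop time, Dimensions and Field Offsets controls can be sketched in Python as follows. The function names are invented for this example, the byte values used to exercise them are made up, and the field offsets are read as two 16-bit big-endian values, following what the Go code above does.

```python
def date_to_ms(date):
    """Convert a ControlHeader.Date value to milliseconds."""
    return 1024 * date // 90


def decode_area(buf):
    """Unpack the 6 argument bytes of control 0x05 into (x0, x1, y0, y1).

    Each bound is 12 bits (3 nibbles), packed back to back.
    """
    x0 = (buf[0] << 4) | (buf[1] >> 4)
    x1 = ((buf[1] & 0x0F) << 8) | buf[2]
    y0 = (buf[3] << 4) | (buf[4] >> 4)
    y1 = ((buf[4] & 0x0F) << 8) | buf[5]
    return x0, x1, y0, y1


def area_size(x0, x1, y0, y1):
    """Width and height in pixels; the bounds are inclusive, hence the + 1."""
    return x1 - x0 + 1, y1 - y0 + 1


def decode_field_offsets(buf):
    """Unpack the 4 argument bytes of control 0x06 into (even, odd) offsets."""
    even = (buf[0] << 8) | buf[1]
    odd = (buf[2] << 8) | buf[3]
    return even, odd
```

For instance, the made-up argument bytes 00 02 CF 19 01 D4 unpack to X0 = 0x000, X1 = 0x2CF, Y0 = 0x190 and Y1 = 0x1D4, which is a 720 x 69 pixel image.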

Now that we've got some information about the dimensions and locations of the subtitle image we can look at decoding and drawing it. Subtitle images are run-length-encoded (RLE). The basic idea behind RLE is to compress the image data into a pixel color and a number of pixels to draw in that color. In the format used for subtitles, each run of pixels is encoded using the following alphabet, where * represents a wildcard nibble:

  • 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
  • 0x1*, 0x2*, 0x3*
  • 0x04*, 0x05*, 0x06*, 0x07*, 0x08*, 0x09*, 0x0A*, 0x0B*, 0x0C*, 0x0D*, 0x0E*, 0x0F*
  • 0x01**, 0x02**, 0x03**
  • 0x000*

To determine the color and number of pixels to draw we need to do a little bitwise arithmetic. The number of pixels to draw is given by the operation: X >> 2. The color is given by X & 0x03.

There is one character in the alphabet which has a special meaning, and neither of the above operations applies to it. That is 0x000*, which is a sort of carriage return character. It simply means fill the rest of the line with the given color. After every carriage return we need to switch to reading from the opposite field, reset the x position in the image to 0 and increment the y position.
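The alphabet above amounts to reading nibbles until the accumulated value reaches the minimum for its length. Here is a minimal Python sketch of that idea for a single line of pixels. The names are invented, a run length of 0 is used as an internal marker for the carriage return code, and the line is assumed to end only on that code, as the description above has it.

```python
# Smallest valid code for lengths of 1, 2, 3 and 4 nibbles.
THRESHOLDS = (0x4, 0x10, 0x40, 0x100)


def nibbles(data):
    """Yield the high nibble, then the low nibble, of each byte."""
    for byte in data:
        yield byte >> 4
        yield byte & 0x0F


def read_code(nib):
    """Read one RLE code and return (run_length, color).

    A run length of 0 stands for the 0x000* carriage return code.
    """
    code = 0
    for threshold in THRESHOLDS:
        code = (code << 4) | next(nib)
        if code >= threshold:
            return code >> 2, code & 0x3
    return 0, code & 0x3


def decode_line(data, width):
    """Decode one line of `width` pixels into a list of color indices."""
    nib = nibbles(data)
    row = []
    while True:
        run, color = read_code(nib)
        if run == 0:
            # Carriage return: fill the rest of the line with this color.
            row.extend([color] * (width - len(row)))
            return row
        row.extend([color] * run)
```

As a usage example, the bytes 79 00 02 decode as code 0x7 (one pixel of color 3), code 0x9 (two pixels of color 1) and code 0x0002 (fill the rest of the line with color 2).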

Before we get into the code about drawing images I should mention one of the problems I ran into while writing this. The problem is that Golang doesn't provide any mechanism for reading nibble-aligned data. So I went ahead and wrote a small structure and a few methods for accomplishing this.

type Nibbler struct {
    r       *bytes.Buffer
    Current uint8
    Aligned uint8
}

func NewNibbler(r *bytes.Buffer) Nibbler {
    return Nibbler{r, 0, 0}
}

// GetNibble returns the high nibble of each byte first, then its low nibble.
func (n *Nibbler) GetNibble() (b uint8, err error) {
    if n.Aligned == 0 {
        // Byte-aligned: fetch the next byte.
        err = ReadInto(n.r, &n.Current)
        if err != nil {
            return 0, err
        }
    }
    // Aligned toggles between 4 and 0 and doubles as the shift amount.
    n.Aligned ^= 4
    b = (n.Current >> n.Aligned) & 0x0F
    return b, err
}

The basic functionality is achieved by using some bitwise operations to switch which nibble we return each time the GetNibble method is called and reading a new byte every time we've read the 2nd nibble of the current byte. Access is provided to the Aligned field to determine if we're byte-aligned or not since we need to use this in the function that draws the subtitle images.

The following code decodes the RLE image and draws all of the pixels to an image of dimensions specified in the control sequence.

// Uses the Go 1 image API: import "image" and "image/color".
func DrawPixels(s *image.Gray, x uint16, y uint16, n uint16, c uint8) {
    for i := 0; i < int(n); i++ {
        // Colors 0-2 map to 64, 128 and 192; color 3 wraps to 0 in uint8.
        s.SetGray(int(x)+i, int(y), color.Gray{Y: (c + 1) << 6})
    }
}

func ReadRLEImage(rect Rect, payload *Payload, even, odd uint16) *image.Gray {
    subImg := image.NewGray(image.Rect(0, 0, int(rect.w), int(rect.h)))
    bData := payload.Data.Bytes()
    evenNibbler := NewNibbler(bytes.NewBuffer(bData[even:odd]))
    oddNibbler := NewNibbler(bytes.NewBuffer(bData[odd:]))

    var x, y uint16
    done := false
    field := true

    for !done {
        var b uint16
        var t uint8

        var currentNibbler *Nibbler

        if field {
            currentNibbler = &evenNibbler
        } else {
            currentNibbler = &oddNibbler
        }

        // Read one nibble at a time until the code reaches the minimum
        // value for its length (0x4, 0x10, 0x40, 0x100).
        t, _ = currentNibbler.GetNibble()
        b = (b << 4) | uint16(t)
        if b >= 0x4 {
            run := b >> 2
            DrawPixels(subImg, x, y, run, uint8(b&0x3))
            x += run
        } else {
            t, _ = currentNibbler.GetNibble()
            b = (b << 4) | uint16(t)
            if b >= 0x10 {
                run := b >> 2
                DrawPixels(subImg, x, y, run, uint8(b&0x3))
                x += run
            } else {
                t, _ = currentNibbler.GetNibble()
                b = (b << 4) | uint16(t)
                if b >= 0x40 {
                    run := b >> 2
                    DrawPixels(subImg, x, y, run, uint8(b&0x3))
                    x += run
                } else {
                    t, _ = currentNibbler.GetNibble()
                    b = (b << 4) | uint16(t)
                    if b >= 0x100 {
                        run := b >> 2
                        DrawPixels(subImg, x, y, run, uint8(b&0x3))
                        x += run
                    } else {
                        // 0x000*: carriage return, fill the rest of the line.
                        DrawPixels(subImg, x, y, rect.w-x, uint8(b&0x3))
                        x = 0
                        y += 1
                        field = !field
                        if y >= rect.h {
                            done = true
                        }
                        // Each line starts byte-aligned; skip a padding nibble.
                        if currentNibbler.Aligned != 0 {
                            currentNibbler.GetNibble()
                        }
                    }
                }
            }
        }
    }
    return subImg
}

You'll notice I used the even and odd field offsets to create buffers for both the even and odd fields of the image. To switch between them, the pointer currentNibbler is pointed at the other field's Nibbler whenever we encounter a carriage return. I've also done some basic math in the DrawPixels function to space the four subtitle colors evenly through the greyscale range: colors 0, 1 and 2 map to 64, 128 and 192, while color 3 wraps around to 0 because the result is stored in a uint8.

The next step for this project is to write a program which can detect and separate images of each character from a subtitle image. After this I'll write a user interface for the user to give character meanings to each character image. From that an SRT file can be written using this character matrix. This is the same basic operation of most VOBSUB to SRT converters except that I aim to make it easier to use.

The complete source for this program can be found at: Gist: 1381809. Note that this program will read and decode only the first subtitle in the subtitle file. More work will be done on this when I've got time to make a more automated version that will read and decode all subtitles from a file. At some point in the future when I find that GeSHi supports Golang syntax highlighting, I'll update this post to make it more readable.

31May/110

System Memory Analysis

Choosing the best RAM for your system can be difficult, as there are a lot of things to consider. Comparing products by hand can net you some pretty decent results, but picking the best price per capacity per ... gets complicated quickly.

You may remember my SSD analysis script[1] from a while ago, which scraped HTML from Newegg and calculated a score for each product to pick the best one. I've since discovered that Newegg does indeed have an API[2] that greatly simplifies this whole process[3].

Once I had explored Newegg's API enough to get the data I needed, I set to work updating the SSD script and writing a few others for HDDs and system memory. Of the scripts I wrote, the one for system memory turned out to be particularly useful, as it made finding great deals very easy. It also illustrated that popular brands may not always be the best deal.

The first major improvement over the previous scripts was the use of threading to make multiple API requests in parallel, which sped things up quite a bit. While Python's threading library doesn't allow for parallelism on the CPU[4], it does allow I/O such as network requests to overlap. Below is the class used for grabbing URLs throughout the script.

import threading
import urllib, urllib2
import json, re
from Queue import Queue

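class GetURL(threading.Thread):
    # Worker thread: takes (itemNumber, url) pairs from urlQueue and puts
    # (itemNumber, decoded JSON) results on jsonQueue.
    def __init__(self, urlQueue, jsonQueue):
        threading.Thread.__init__(self)
        self.urlQueue = urlQueue
        self.jsonQueue = jsonQueue

    def run(self):
        while True:
            itemNumber, url = self.urlQueue.get()
            raw = urllib2.urlopen(url).read()
            # Use the queue passed to the constructor, not a global of the same name.
            self.jsonQueue.put((itemNumber, json.loads(raw)))
            self.urlQueue.task_done()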

Newegg's API paginates its results because the Android app displays the data directly to the user, which means there's no easy way to retrieve all results in one request. Instead you must make successive calls, incrementing the page number until all results for the query have been retrieved.

itemSpecURL = "http://www.ows.newegg.com/Products.egg/{}/Specification"
searchURL = "http://www.ows.newegg.com/Search.egg/Advanced"

itemList = getItems()

urlQueue = Queue()
jsonQueue = Queue()
items = {}
for item in itemList:
    specURL = itemSpecURL.format(item["ItemNumber"])
    urlQueue.put((item["ItemNumber"], specURL))
    items[item["ItemNumber"]] = item
   
for worker in xrange(2):
    t = GetURL(urlQueue, jsonQueue)
    t.setDaemon(True)
    t.start()

urlQueue.join()
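
The getItems call is not shown in the listing. A minimal sketch of the paging loop it would need, assuming a hypothetical fetch_page(page_number) helper that posts the query and returns the decoded response, could look like the following. It is written for Python 3 (the listings in this post are Python 2), and fake_fetch_page is only a stand-in so the sketch runs offline. The PaginationInfo, TotalCount, PageSize and ProductListItems names come from the API's search response.

```python
import math


def get_all_items(fetch_page):
    """Collect ProductListItems from every page of a search.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the decoded JSON response for that page.
    """
    first = fetch_page(1)
    items = list(first["ProductListItems"])

    info = first["PaginationInfo"]
    # Guard against a zero page size so the division below cannot fail.
    page_size = max(info["PageSize"], 1)
    total_pages = math.ceil(info["TotalCount"] / page_size)

    for page in range(2, total_pages + 1):
        items.extend(fetch_page(page)["ProductListItems"])
    return items


def fake_fetch_page(page, total=45, size=20):
    """Stand-in for the real API: 45 fake items served 20 per page."""
    start = (page - 1) * size
    numbers = range(start, min(start + size, total))
    return {
        "PaginationInfo": {"TotalCount": total, "PageNumber": page, "PageSize": size},
        "ProductListItems": [{"ItemNumber": str(n)} for n in numbers],
    }


print(len(get_all_items(fake_fetch_page)))
```

Passing the fetch function in keeps the loop independent of how the request is actually made, so the same loop works with a threaded or a plain blocking fetcher.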

This setup is fairly generic and can be used to analyze just about any product from Newegg: it grabs each item's basic data, including price, as well as its detailed specifications. Anything beyond this point, however, is specific to the type of product you're analyzing. I should also note that the parameters passed to the API in getItems are generated using the query builder available in the post about Newegg's API.

speed_re = re.compile(r'DDR\d\s(\d+).*')
capacity_re = re.compile(r"(\d+)GB\s\((\d+)\sx\s(\d+)GB\)")
timing_re = re.compile(r'(\d+-\d+-\d+-\d+)')
features = ['Brand', 'Model', 'ItemNumber', 'Price', 'Speed', 'Capacity', 'Dimms', 'Timing', 'Voltage']

while not jsonQueue.empty():
    itemNumber, specs = jsonQueue.get()
   
    item = {}
    for group in specs['SpecificationGroupList']:
        for pair in group['SpecificationPairList']:
            if pair['Key'] in features:
                item[pair['Key']] = pair['Value'].encode('ascii', errors='ignore')
   
    if 'Capacity' in item:
        capacity = capacity_re.match(item['Capacity'])
        if capacity:
            item['Capacity'] = capacity.group(1)
            item['Dimms'] = capacity.group(2)
    if 'Speed' in item:
        speed = speed_re.match(item['Speed'])
        if speed:
            item['Speed'] = speed.group(1)
    if 'Timing' in item:
        timing = timing_re.match(item['Timing'])
        if timing:
            item['Timing'] = timing.group(1).replace('-','\t')
        else:
            continue
    item['Price'] = items[itemNumber]['FinalPrice']
    item['ItemNumber'] = specs['NeweggItemNumber']
   
    try:
        print '\t'.join(map(lambda x: item[x], features))
    except KeyError:
        pass
    jsonQueue.task_done()

The basic purpose of the above code is to go through each item and format each feature into usable data[5]. Once the data has been formatted and printed, I do the rest of the filtering and analysis in Microsoft Excel.

The equation used to calculate a score for each set of system memory is as follows:

\frac{(\text{Capacity}\times1024^3)\times\text{Speed}}{\text{Price}\times CL\times T_{RCD} \times T_{RP} \times T_{RAS}}
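
As a rough illustration, the equation can be written as a small Python 3 function. The function name and argument formats are assumptions for this sketch: the price is taken as the display string the API returns, and the timing string is assumed to be in the conventional CL-tRCD-tRP-tRAS order.

```python
def memory_score(capacity_gb, speed, price, timing):
    """Score a memory kit with the equation above; higher is better.

    capacity_gb: total kit capacity in GB, e.g. 4
    speed:       the DDR3 speed number, e.g. 2133
    price:       price string as returned by the API, e.g. "$104.99"
    timing:      "CL-tRCD-tRP-tRAS" string, e.g. "9-10-9-24"
    """
    dollars = float(price.replace("$", "").replace(",", ""))
    cl, t_rcd, t_rp, t_ras = (int(part) for part in timing.split("-"))
    return (capacity_gb * 1024 ** 3 * speed) / (dollars * cl * t_rcd * t_rp * t_ras)


# Example: a 4GB DDR3 2133 kit at $104.99 with 9-10-9-24 timings.
print(memory_score(4, 2133, "$104.99", "9-10-9-24"))
```

The absolute number means little on its own; it is only useful for ranking kits against each other.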

Currently it looks like G.Skill has the best to offer in the DDR3 memory market if you're looking for a quad-channel set for Sandy Bridge's enthusiast hardware due out in the next quarter[6].

  1. Choosing an SSD (A more different S)
  2. Newegg's JSON API
  3. Even though it was never intended for what I'm using it for.
  4. Python is crippled in this way due to a global interpreter lock.
  5. Or at least the data that we're interested in using for analysis.
  6. G.SKILL Ripjaws X Series 16GB (4 x 4GB) 1333MHz
16Mar/1132

Newegg’s JSON API

For the longest time I've wanted access to Newegg's product list. For me they've been one of the better and more structured websites for buying computer hardware. So naturally they're usually my first choice when it comes to finding a good deal on a particular piece of hardware. They're also rather useful for seeing what's out there since their product catalog is fairly complete.

A while back I started wanting to sort through items to heuristically pick the best deal based on a number of features Newegg generally provides for each item. This method works pretty well on SSDs and system memory. But until a recent discovery I was limited to scraping Newegg's website to get any kind of information from them. If you've ever tried this sort of thing you know that it is messy and generally a bad idea: every time Newegg changes the structure of their website, or even some minute detail, it will almost always break your scraping script.

The discovery came in the form of a mobile application for Android[1]. The mobile app lets you browse their website in a clean and fast manner. But what got me thinking is that unlike some other mobile applications out there that are just application wrappers for the mobile version of their websites this one operates directly through the native GUI. Now this is where it got interesting. I knew that if Newegg had written the app to use the native GUI then they had to be providing the data to it somehow and I knew it had to be more structured than HTML scraping like what I've been doing[2]. You have no idea how happy I was to discover that I was right.

First thing I did was connect my Droid 2 Global to my home network via WiFi in order to sniff some of the traffic going to and from the mobile app. This was accomplished by mounting a CIFS drive from my Windows 7 desktop on my router running Tomato-based firmware. The share had a binary for tcpdump, which I then used to sniff for traffic originating from or going to my phone's IP address. After setting this up and performing all of the basic operations I would need in order to "reverse engineer" the data source, I got to work on filtering the important bits.

In Wireshark I immediately discovered that they had a sub-domain they were using for these operations. All of the web requests that weren't images or for customer metrics and tracking went to this host:

http://www.ows.newegg.com/

Because this API is structured more or less the same as navigating their site and the identifiers are different I decided to start with writing a query builder. Basically the purpose was to allow me to browse to the particular category I was interested in analyzing and filter it down to just a few simple requirements to simplify the analysis.

The first major entry point in the process of browsing to what you're interested in pulling is:

http://www.ows.newegg.com/Stores.egg/Menus

This takes no parameters and provides the main menu:

[
    {
        "StoreDepa": "ComputerHardware",
        "StoreID": 1,
        "ShowSeeAllDeals": true,
        "Title": "Computer Hardware"
    },
    {
        "StoreDepa": "PCNotebook",
        "StoreID": 3,
        "ShowSeeAllDeals": true,
        "Title": "PCs & Laptops"
    },
    {
        "StoreDepa": "Electronics",
        "StoreID": 10,
        "ShowSeeAllDeals": true,
        "Title": "Electronics"
    },
    ...

Once you've selected a store to browse, the next URI is:

http://www.ows.newegg.com/Stores.egg/Categories/{StoreID}

The only parameter it takes is StoreID, which you'll find in the response to the first query. This will return all of the categories within a store. I haven't really explored this very much as I'm only really interested in browsing system memory and SSDs. Using the Computer Hardware store, the output is as follows:

[
    {
        "Description": "Backup Devices & Media",
        "StoreID": 1,
        "NodeId": 6642,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 2
    },
    {
        "Description": "Barebone / Mini Computers",
        "StoreID": 1,
        "NodeId": 6668,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 3
    },
    {
        "Description": "CD / DVD Burners & Media",
        "StoreID": 1,
        "NodeId": 6646,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 10
    },
    ...

StoreID is echoed back from the parameters of the request. I'm not exactly sure how to describe the purpose of NodeId, but it appears to be a distinguishing feature of a category or subcategory. CategoryID is used for filtering results down to a specific category and can be either a root category or a subcategory. CategoryType tells you which kind you have: a value of 1 marks what I'm calling a root category, one with no subcategories of its own that can be searched directly, while any other value (0 in the listing above) marks a category that contains subcategories.

Now depending on CategoryType you either move straight to the search query or onto a navigation query. The navigation query is used if there are subcategories:

http://www.ows.newegg.com/Stores.egg/Navigation/{StoreID}/{CategoryID}/{NodeID}

This query takes StoreID, CategoryID and NodeID, which you can get from the category listing of a particular store. It will return a subcategory list. Below is the subcategory listing for the memory category.

[
    {
        "Description": "Desktop Memory",
        "StoreID": 1,
        "NodeId": 7611,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 147
    },
    {
        "Description": "Flash Memory",
        "StoreID": 1,
        "NodeId": 8038,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 68
    },
    {
        "Description": "Laptop Memory",
        "StoreID": 1,
        "NodeId": 7609,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 381
    },
    ...
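
The choice between the two kinds of query can be sketched in a few lines of Python 3. The helper name navigation_url is made up for this example; the URL template and field names are the ones shown in the listings, and the two sample entries are trimmed copies of entries from them.

```python
NAVIGATION_URL = "http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}"


def navigation_url(category):
    """Return the navigation URL for a category, or None if it can be searched directly."""
    if category["CategoryType"] == 1:
        # No subcategories to list; go straight to the search query.
        return None
    return NAVIGATION_URL.format(
        category["StoreID"], category["CategoryID"], category["NodeId"])


backup = {"StoreID": 1, "NodeId": 6642, "CategoryType": 0, "CategoryID": 2}
desktop_memory = {"StoreID": 1, "NodeId": 7611, "CategoryType": 1, "CategoryID": 147}
print(navigation_url(backup))
print(navigation_url(desktop_memory))
```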

From here you will go to the search query[3]. This is where it gets a little tricky: the parameters for the query are no longer sent via GET but via POST[4], so you'll need to make the search query programmatically. Given a category, store and node, the search query returns quite a lot. The first thing in the response is the set of search filtering parameters, which allow you to limit the products shown in the listing.

You must POST some data to receive a non-404 response from the server. If you really wanted to you could send an empty dictionary, which would query Newegg's entire product list. Any of the query options can be omitted; integer values are omitted by setting them to -1.

The parameters you should concern yourself with are listed below, after the URL they should be posted to in JSON format:

http://www.ows.newegg.com/Search.egg/Advanced

data = {
    "SubCategoryId": 147,
    "NValue": "",
    "StoreDepaId": 1,
    "NodeId": 7611,
    "BrandId": -1,
    "PageNumber": 1,
    "CategoryId": 17
}

NValue is a space-separated list of NValues from the search parameters. Mind you, you cannot filter against more than one item in any category of search filters. For example, in system memory you can't select DDR3 1333 (PC3 10600), DDR3 1333 (PC3 10660) and DDR3 1333 (PC3 10666) at the same time; the query will return an unsuccessful search result. The rest of the parameters are fairly self-explanatory.
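
A minimal Python 3 sketch of building that POST request, without sending it, might look like this. The function name and its keyword arguments are assumptions made for the example; the field names, the -1 convention and the URL are the ones described above. No Content-Type header is set here, and whether the server cares about one is not something this sketch has verified.

```python
import json
import urllib.request

SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def build_search_request(store_id=None, category_id=None, subcategory_id=None,
                         node_id=None, nvalues=(), page=1):
    """Build (but do not send) the POST request for a search query."""
    def or_default(value):
        # Integer options are "omitted" by sending -1.
        return -1 if value is None else int(value)

    payload = {
        "StoreDepaId": or_default(store_id),
        "CategoryId": or_default(category_id),
        "SubCategoryId": or_default(subcategory_id),
        "NodeId": or_default(node_id),
        "BrandId": -1,
        # NValue is a single space-separated string of filter values.
        "NValue": " ".join(sorted(set(nvalues))),
        "PageNumber": page,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(SEARCH_URL, data=body, method="POST")


request = build_search_request(store_id=1, category_id=17, subcategory_id=147, node_id=7611)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would actually send it; keeping construction separate makes the payload easy to inspect first.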

The result returned will contain the following elements: RelatedLinkList, CoremetricsInfo, NavigationContentList, PaginationInfo, ProductListItems. CoremetricsInfo and RelatedLinkList can usually be ignored. The elements we're interested in are NavigationContentList, which is a list of search parameters/filters you can apply to the search; PaginationInfo, which describes how many elements were returned, what page we're on and how many elements there are per page; and last but not least ProductListItems, which provides a list of the products returned by the query along with some basic listing info for each one.

Below is a portion of the NavigationContentList:

{
    "NavigationContentList": [
        {
            "NavigationItemList": [
                {
                    "SubCategoryId": -1,
                    "Description": "Free Shipping",
                    "StoreDepaId": 94,
                    "NValue": "100007611 600006050 600052012 4808",
                    "BrandId": -1,
                    "StoreType": 4,
                    "ItemCount": 194,
                    "CategoryId": -1,
                    "ElementValue": "4808"
                },
                {
                    "SubCategoryId": -1,
                    "Description": "Top Sellers",
                    "StoreDepaId": -1,
                    "NValue": "100007611 600006050 600052012 4802",
                    "BrandId": -1,
                    "StoreType": -1,
                    "ItemCount": 39,
                    "CategoryId": -1,
                    "ElementValue": "4802"
                },
                ...

This section will also contain a group name:

            ...
            "TitleItem": {
                "SubCategoryId": -1,
                "Description": "Useful Links",
                "StoreDepaId": -1,
                "NValue": "4800",
                "BrandId": -1,
                "StoreType": -2,
                "ItemCount": 0,
                "CategoryId": -1,
                "ElementValue": "4800"
            }
            ...

The PaginationInfo and ProductListItems elements will look like the following:

    ...
    "PaginationInfo": {
        "TotalCount": 233,
        "PageNumber": 1,
        "PageSize": 20
    },
    "ProductListItems": [
        {
            "SellerId": null,
            "ItemOwnerType": 0,
            "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139",
            "ItemGroupID": 0,
            "ReviewSummary": {
                "Rating": 5,
                "TotalReviews": "[1]"
            },
            "IsCellPhoneItem": false,
            "Discount": null,
            "FinalPrice": "$104.99",
            "ItemNumber": "20-148-372",
            "MappingFinalPrice": "$104.99",
            "FreeShippingFlag": true,
            "OriginalPrice": "$104.99",
            "IsComboBundle": false,
            "MailInRebateText": null,
            "ProductStockType": 0,
            "Model": "BL2KIT25664FN2139",
            "ShowOriginalPrice": false,
            "Image": {
                "FullPath": "http://images17.newegg.com/is/image/newegg/20-148-372-TS?$S125W$",
                "SmallImagePath": null,
                "ThumbnailImagePath": null,
                "Title": null
            },
            "SellerName": null,
            "ParentItem": null
        },
        ...
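
Most of those fields can be ignored for analysis. A small Python 3 sketch of pulling out just the item number, model and price is shown here; the function name is made up, and stripping "$" and thousands separators from FinalPrice is an assumption about how larger prices are formatted. The sample is a trimmed copy of the response above.

```python
from decimal import Decimal


def summarize_products(response):
    """Return (ItemNumber, Model, price) tuples from a search response."""
    rows = []
    for product in response["ProductListItems"]:
        # FinalPrice arrives as a display string such as "$104.99".
        price = Decimal(product["FinalPrice"].replace("$", "").replace(",", ""))
        rows.append((product["ItemNumber"], product["Model"], price))
    return rows


sample = {
    "PaginationInfo": {"TotalCount": 233, "PageNumber": 1, "PageSize": 20},
    "ProductListItems": [
        {"ItemNumber": "20-148-372", "Model": "BL2KIT25664FN2139", "FinalPrice": "$104.99"},
    ],
}
for row in summarize_products(sample):
    print(row)
```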

At this point you might be wondering: what good will all this do me if I can't get specifications on an item? Well, you can, and here's how. In each ProductListItems element you'll find an ItemNumber, which is essentially the primary key that each product is related to within this interface to Newegg's product list. Using the following URL you can obtain the full details page for any given item using its ItemNumber:

http://www.ows.newegg.com/Products.egg/{ItemNumber}/Specification

{
    "SpecificationGroupList": [
        {
            "GroupName": "Model",
            "SpecificationPairList": [
                {
                    "Value": "Crucial",
                    "Key": "Brand"
                },
                {
                    "Value": "Ballistix",
                    "Key": "Series"
                },
                {
                    "Value": "BL2KIT25664FN2139",
                    "Key": "Model"
                },
                {
                    "Value": "240-Pin DDR3 SDRAM",
                    "Key": "Type"
                }
            ]
        },
        {
            "GroupName": "Tech Spec",
            "SpecificationPairList": [
                {
                    "Value": "4GB (2 x 2GB)",
                    "Key": "Capacity"
                },
                {
                    "Value": "DDR3 2133 (PC3 17000)",
                    "Key": "Speed"
                },
                {
                    "Value": "9",
                    "Key": "Cas Latency"
                },
                {
                    "Value": "9-10-9-24",
                    "Key": "Timing"
                },
                {
                    "Value": "1.65V",
                    "Key": "Voltage"
                },
                {
                    "Value": "No",
                    "Key": "ECC"
                },
                {
                    "Value": "Unbuffered",
                    "Key": "Buffered/Registered"
                },
                {
                    "Value": "Dual Channel Kit",
                    "Key": "Multi-channel Kit"
                }
            ]
        },
        {
            "GroupName": "Manufacturer Warranty",
            "SpecificationPairList": [
                {
                    "Value": "Lifetime limited",
                    "Key": "Parts"
                },
                {
                    "Value": "Lifetime limited",
                    "Key": "Labor"
                }
            ]
        }
    ],
    "NeweggItemNumber": "N82E16820148372",
    "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139"
}
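
Since the specifications arrive grouped, it is usually handier to flatten them into one dictionary before analysis. The following Python 3 sketch does that; flatten_specs is a hypothetical helper, and keeping the first value when two groups share a key is an arbitrary choice made for the example. The sample is a trimmed copy of the response above.

```python
def flatten_specs(spec_response):
    """Collapse SpecificationGroupList into a single {Key: Value} dict."""
    flat = {}
    for group in spec_response.get("SpecificationGroupList") or []:
        for pair in group.get("SpecificationPairList") or []:
            # If two groups reuse a key, the first one seen is kept.
            flat.setdefault(pair["Key"], pair["Value"])
    return flat


sample = {
    "SpecificationGroupList": [
        {"GroupName": "Model", "SpecificationPairList": [
            {"Value": "Crucial", "Key": "Brand"},
            {"Value": "BL2KIT25664FN2139", "Key": "Model"},
        ]},
        {"GroupName": "Tech Spec", "SpecificationPairList": [
            {"Value": "4GB (2 x 2GB)", "Key": "Capacity"},
            {"Value": "9-10-9-24", "Key": "Timing"},
        ]},
    ],
    "NeweggItemNumber": "N82E16820148372",
}
specs = flatten_specs(sample)
print(specs["Brand"], specs["Capacity"], specs["Timing"])
```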

From this point on you can grab all of the features and specifications of any particular item you're interested in. In the near future I'll be writing a new post for both my memory and SSD analysis scripts using this interface.

The full code for my query builder is as follows, though you should note this was a quick script and is in no way complete or fully functional. As soon as it reached a usable point I moved on to the main point of this whole ordeal. You should also note that this requires CherryPy[5] and lxml[6]. The end result of this program is a query which you can use to retrieve a list of products matching the options you've selected. This is mainly to simplify product list selection and to minimize the need to hardcode certain values, as Newegg has a tendency to change things around on a regular basis.

import cherrypy, json, urllib, urllib2
from lxml import etree
from lxml.builder import E

class QueryBuilder(object):
    def index(self):
        request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Menus")
        response = request.read()
        data = json.loads(response)
       
        body = E.body()
       
        ul = E.ul()
        for store in data:
            ul.append(E.li(E.a(
                store['Title'],
                href= '/Store?StoreID={}'.format(store['StoreID'])
            )))
       
        page = E.html(E.body(ul))
       
        return etree.tostring(page, pretty_print=True)
    index.exposed = True
   
    def Store(self, StoreID=None):
        if StoreID is not None:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Categories/{}".format(StoreID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for category in data:
                if category['CategoryType'] == 1:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Search?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
                else:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Category?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Store.exposed = True
   
    def Category(self, StoreID, CategoryID, NodeID):
        if None not in [StoreID, CategoryID, NodeID]:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}".format(StoreID, CategoryID, NodeID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for subcategory in data:
                ul.append(E.li(E.a(
                    subcategory['Description'],
                    href= '/Search?StoreID={}&CategoryID={}&SubCategoryID={}&NodeID={}'.format(StoreID, CategoryID, subcategory['CategoryID'], subcategory['NodeId'])
                )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Category.exposed = True
   
    def Search(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None):
        url = "http://www.ows.newegg.com/Search.egg/Advanced"
        data = {
            "IsUPCCodeSearch":      False,
            "IsSubCategorySearch":  True,
            "isGuideAdvanceSearch": False,
            "StoreDepaId":          StoreID,
            "CategoryId":           CategoryID,
            "SubCategoryId":        SubCategoryID,
            "NodeId":               NodeID,
            "BrandId":              -1,
            "NValue":               "",
            "Keyword":              "",
            "Sort":                 "FEATURED",
            "PageNumber":           1
        }
       
        params = json.dumps(data).replace("null", "-1")
        request = urllib2.Request(url, params)
        response = urllib2.urlopen(request)
        data = json.loads(response.read())
       
        if data['NavigationContentList'] is None:
            return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
       
        body = E.body()
   
        form = E.form(name='PowerSearch', action='GenerateURL', method='GET')
       
        table = E.table()
        form.append(table)
        for section in data['NavigationContentList']:
            index = 0
            tr = E.tr(E.td(section['TitleItem']['Description'], colspan='3'))
            table.append(tr)
            for option in section['NavigationItemList']:
                if index % 3 == 0:
                    tr = E.tr()
                    table.append(tr)
                index += 1
                checkbox = E.td(E.input(option["Description"], type="checkbox", name=section['TitleItem']['Description'].replace(" ", ""), value=option['NValue']))
                tr.append(checkbox)
       
        for param, value in [('StoreID', StoreID), ('CategoryID', CategoryID), ('SubCategoryID', SubCategoryID), ('NodeID',NodeID)]:
            try:
                form.append(E.input(type='hidden', name=param, value=value))
            except KeyError:
                pass
        form.append(E.input(type='submit', value='Submit'))
        page = E.html(E.body(form))
       
        return etree.tostring(page, pretty_print=True)
    Search.exposed = True
   
    def GenerateURL(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None, **kwargs):
        NValue = set([])
        for arg in kwargs:
            if type(kwargs[arg]) == list:
                for value in kwargs[arg]:
                    NValue.add(value)
            else:
                NValue.add(kwargs[arg])
       
        NValue = list(NValue)
        NValue.sort()
        if StoreID is None:
            StoreID = -1
        if CategoryID is None:
            CategoryID = -1
        if SubCategoryID is None:
            SubCategoryID = -1
        if NodeID is None:
            NodeID = -1
        data = {
            "StoreDepaId":          int(StoreID),
            "CategoryId":           int(CategoryID),
            "SubCategoryId":        int(SubCategoryID),
            "NodeId":               int(NodeID),
            "BrandId":              -1,
            "NValue":               ' '.join(NValue),
            "PageNumber":           1
        }
        return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
    GenerateURL.exposed = True
   
cherrypy.quickstart(QueryBuilder())
  1. And iOS devices I assume as well.
  2. Because let's face it, that would be stupid.
  3. ... or get to the search query from selecting a root category in the main category listing for a store.
  4. At least this is the method used by the mobile app.
  5. CherryPy: CherryPy is a pythonic, object-oriented HTTP framework.
  6. lxml: A Pythonic binding for the C libraries libxml2 and libxslt.
25Feb/111

Android Remote Start Desktop

After getting my server set up again I've been messing with ssh and all that, when a clever idea struck me.

My router runs Tomato[1], which has a built-in SSH server[2], and my phone just happens to be an Android-based phone which has ConnectBot[3]. My router also supports Wake-on-LAN[4], and this got me thinking a little bit. What if I automated that somewhat into a sort of "remote start" button on my phone for my desktop?

So the first thing I did was look up what flavor of WOL client my router used. By default I believe Tomato comes with ether-wake[5]. So I just set up a session in ConnectBot that ssh's into my router at home and added ether-wake ##:##:##:##:##:## && exit in post-login automation for the session.
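
For the curious, the packet that wakes the desktop is simple enough to build by hand. A Wake-on-LAN "magic packet" is six 0xFF bytes followed by the target MAC address repeated sixteen times. The Python 3 sketch below builds that payload and, as an assumption, broadcasts it over UDP port 9. That is a common way to send it but not necessarily what ether-wake does under the hood (as far as I know it sends a raw Ethernet frame instead), and the function names are made up for illustration.

```python
import socket


def magic_packet(mac):
    """Return the 102-byte Wake-on-LAN payload for a MAC like 'AA:BB:CC:DD:EE:FF'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("expected a 6-byte MAC address, got %r" % mac)
    # Six 0xFF bytes followed by the target MAC repeated 16 times.
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the payload over UDP (assumed address and port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

Calling send_magic_packet with the desktop's MAC address from a machine on the same LAN would be the equivalent of the ether-wake command above, assuming the network card accepts UDP-carried magic packets.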

Now, this was great and all but it still required me to enter my password every single time and hit enter as soon as it connected, as it only fills in the command it doesn't automatically execute it[6]. The next thing that came to mind was just setting up a key pair for key authentication with ssh which would bypass having to enter the password every single time. All I would have to do at this point was enter the password once to unlock the key and then I could just log in whenever I needed to.

After generating the key pair and adding the public key to the list of authorized keys for ssh on my router, everything worked exactly as I intended. As a finishing touch I added a shortcut to my phone's home screen labelled: Wake Audbox[7]. Now whenever I want to remote start my system I just hit that button[8], wait for it to connect and fill in the command, and hit enter. That wakes my system and exits the session.

  1. As you may have read in a previous post or two.
  2. Much like DD-WRT and almost all of the other custom firmwares you can get for most routers these days.
  3. ConnectBot: an SSH client for the Android platform.
  4. Wake-on-LAN
  5. I'm using TomatoUSB which has been modified somewhat and I'm not sure if this is part of the modifications or not, though I doubt that it is.
  6. Which doesn't really seem all that intuitive to me.
  7. Audbox is the name of my desktop.
  8. Assuming that I've got the key unlocked otherwise it'll just ask me for my password.
23Feb/112

Installing Ubuntu via Network

At some point in the last 6 months or so I may or may not have accidentally left my 1GB Sandisk Cruzer in a pair of jeans when they went through the washer AND the dryer. As such it's not exactly in peak physical condition[1] and for whatever reason I've had issues with using it for installing certain things[2] lately[3].

Anyway, the time has come again to get my file server back up and running, and I needed to reinstall Ubuntu on it. Given my extreme laziness when it comes to doing this sort of stuff I was in no mood to move everything to the top of my desktop[4], so I opted to try PXE booting[5] again.

I've messed with PXE booting in the past, particularly with GeeXboX[6] for my media center. That was a nightmare at the time and essentially required you to have a Linux system in order to do it. Since then a wonderful application has made its way onto the internet: tftpd32[7]. Tftpd32 greatly simplifies the whole process by not requiring you to install anything or make any major system changes.

Before you continue take note, these instructions assume a few things:

  • You're serving the netboot images from a Windows system.
  • You have a Tomato-based router, although these instructions can easily be modified to work with any router firmware that uses Dnsmasq or allows you to change advanced settings for the DHCP server.

Things you'll need:

  • Ubuntu Alternate ISO: This will be used for setting up the local http repository.
  • Ubuntu NetBoot Image: Grab netboot.tar.gz
  • tftpd32: This will be used for serving files during PXE booting.
  • HFS ~ HTTP File Server: This will be used for setting up a local http repository for installing from our local network instead of having ubuntu download everything from a mirror.

Router Settings:

  • Advanced -> DHCP / DNS -> Dnsmasq Custom Configuration
  • dhcp-boot=pxelinux.0,,[tftpd32 server ip address]
  • Save.

For ease of readability from this point forward files will be bolded and directories will be italicized.

  • Untar netboot.tar.gz into a folder, which I'll refer to as netboot from now on.
  • Delete pxelinux.0 and pxelinux.cfg from netboot/ as these are symlinks which will not work in Windows.
  • Create the directory netboot/pxelinux.cfg/
  • Copy pxelinux.0 from netboot/ubuntu-installer/i386/ to netboot/
  • Copy syslinux.cfg from netboot/ubuntu-installer/i386/boot-screens/ to netboot/pxelinux.cfg/
  • Rename netboot/pxelinux.cfg/syslinux.cfg to netboot/pxelinux.cfg/default
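
If you'd rather not do the folder shuffling by hand, the steps above can be sketched as a small Python 3 helper. Treat this as a rough sketch and not a tested installer: the function name prepare_netboot is made up, and the menu file name is passed in as menu_file because the default of syslinux.cfg is my assumption about the tarball layout, so check what is actually sitting in boot-screens/ first.

```python
import os
import shutil


def prepare_netboot(netboot, arch="i386", menu_file="syslinux.cfg"):
    """Replace the two top-level symlinks in an unpacked netboot tree with real copies.

    netboot   -- path to the folder netboot.tar.gz was unpacked into
    menu_file -- boot menu config inside boot-screens/ (assumed name, verify it)
    """
    installer = os.path.join(netboot, "ubuntu-installer", arch)

    # Remove whatever the archive tool left behind for the two symlinks.
    for name in ("pxelinux.0", "pxelinux.cfg"):
        path = os.path.join(netboot, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        elif os.path.lexists(path):
            os.remove(path)

    # Recreate them as a real file and a real directory.
    os.makedirs(os.path.join(netboot, "pxelinux.cfg"))
    shutil.copyfile(os.path.join(installer, "pxelinux.0"),
                    os.path.join(netboot, "pxelinux.0"))
    shutil.copyfile(os.path.join(installer, "boot-screens", menu_file),
                    os.path.join(netboot, "pxelinux.cfg", "default"))
```

Calling prepare_netboot(r"C:\netboot") (or wherever you unpacked the tarball) should leave the folder ready to point tftpd32 at.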

Preparing tftpd32:

  • Run tftpd32
  • Browse to the netboot folder we just finished setting up.
  • Tftpd32 should be serving the files in that directory at this point.

Preparing the local HTTP Ubuntu Repository:

  • Run HFS.exe
  • Extract all of the files from ubuntu-10.10-alternate-i386.iso to a folder which I'll refer to as ubuntu-alt from this point on.
  • In the Virtual File System pane right click -> Add Folder from disk...
    • Browse to and select ubuntu-alt
    • When HFS prompts you to ask what kind of folder it should be added as, select Real Folder
  • Note the link in the address bar next to Open in browser, you'll use this link when installing ubuntu.

Installing Ubuntu:

  • Boot the system you're attempting to install Ubuntu on from your network device.
  • If you have tftpd32 up on another monitor at this point you should see a deluge of requests in the tftp server tab.
  • Ubuntu should show a boot menu; select Install.
  • I'm not going to go into full detail on how to install Ubuntu, but when you get to mirror selection there should be an option at the very top of the list to enter a mirror manually. This is where you should enter the address from the address bar in HFS. Be sure to also include the port value.
  • If all goes well it should start installing and you should see another deluge of requests in HFS.
  1. In fact it's pretty far from peak physical condition. []
  2. Like ubuntu for example. []
  3. I'm not entirely sure if this is due to washing it or just from it being nearly 5 years old. []
  4. So the cable for the USB adapter I've got my DVD drive connected to in my desktop can reach my mini-itx board. []
  5. Preboot Execution Environment []
  6. GeeXboX []
  7. tftpd32: An open-source TFTP/DHCP/syslog server for Windows. []
14Dec/100

Proximity of Creativity

Maybe this is just a veiled form of addiction but it seems like my creativity is inversely related to my proximity to a computer[1]. I suppose it's a good reason for me to regularly visit a coffee shop[2].

To be continued...

  1. Particularly my own computer. []
  2. It seems like a coffee shop is more useful for thinking than it is for drinking coffee except that in order to be welcome at one you have to at least purchase the latter. []
12Dec/101

Choosing an SSD (A more different S)

I've been periodically going back and revisiting the results of my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like Newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page, and I can't figure out why they decided it didn't need to be there. I see it this way: SSDs are first and foremost storage devices, so you'd think that capacity, at the very least, should be listed with every SSD.

Anyway, this change broke my script, which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML a lot more often lately despite my original prejudice against it for being a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2], which is an incredibly useful "language" for selecting data from an XML document.
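
To give a feel for what an XPath expression buys you, here is a minimal sketch using the limited XPath subset built into Python's xml.etree.ElementTree. It is written for Python 3, and the catalogue document is invented purely for illustration.

```python
from xml.etree import ElementTree

# Hypothetical catalogue, made up purely for illustration.
CATALOG = """
<catalog>
  <drive type="ssd"><name>Alpha</name><capacity>120</capacity></drive>
  <drive type="hdd"><name>Beta</name><capacity>1000</capacity></drive>
  <drive type="ssd"><name>Gamma</name><capacity>60</capacity></drive>
</catalog>
"""

root = ElementTree.fromstring(CATALOG)

# ".//drive[@type='ssd']" selects every <drive> whose type attribute is "ssd";
# appending "/name" then steps down to the <name> child of each match.
ssd_names = [node.text for node in root.findall(".//drive[@type='ssd']/name")]
print(ssd_names)
```

One path expression replaces what would otherwise be a couple of nested loops and an if statement, which is most of the appeal.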

Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.

First off let's consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor, or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be sub $300[8].

The above requirements give us the following power search[9] which we will be using as the source for the script:

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

Now the first thing that made me cringe as I was rewriting this was the fact that I would basically have no choice but to load each individual product page from the results page as capacity is no longer included in either the description or the features list of each product in the results page. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSD's. The following is the full source code of the parser:

import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# For offline testing, parse a saved copy of the results page instead:
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, parser)
root = tree.getroot()

items = []

for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]
   
    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)
    for feature, value in features:
        if value is not None and feature in featureMap:
            # If it's a speed feature parse out the speed
            if featureMap[feature] in ("read", "write"):
                item[featureMap[feature]] = min(map(lambda x: int(x), speed_re.findall(value)))
            # If it's a capacity feature, parse out the capacity
            elif featureMap[feature] == "capacity":
                item[featureMap[feature]] = min(map(lambda x: int(x), capacity_re.findall(value)))
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[featureMap[feature]] = value.strip()
               
    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))
   
    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
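        # Penalize by the order of magnitude of the read/write gap and scale by
        # price. The gap is clamped to 1 so equal speeds don't hit log10(0).
        gap = max(abs(item["read"] - item["write"]), 1)
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log10(gap) + 1) * item["price"])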
        item["score"] = score
        items.append(item)
       
   
# Sort by score, highest first (a stable sort, so ties are harmless
# and we avoid shadowing the built-in sorted())
items.sort(key=lambda item: item["score"], reverse=True)

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for item in items:
    # Not every product page lists every feature, so fall back to an empty cell
    print '\t'.join(map(lambda x: str(item.get(x, "")), headers))
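
As for the multi-threading I keep putting off: the per-item page loads are the slow part of the script above, and a thread pool is one way to overlap them. The following is only a sketch, written for Python 3 (the script above is Python 2), and the fetch_all helper is a hypothetical name of mine and not something the script currently contains.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch, workers=8):
    """Call fetch(link) for every link on a pool of worker threads.

    Returns a dict mapping each link to whatever fetch returned for it.
    """
    links = list(links)  # materialize so the links can be walked twice
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them up correctly.
        return dict(zip(links, pool.map(fetch, links)))
```

In the parser, fetch would be a function that loads and parses one product page (for example a wrapper around etree.parse with an HTMLParser), and the feature extraction would then run over the returned dict instead of loading pages one at a time.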

At this point, if you've read through the entire script, you'll probably notice that I've made a slight change to the scoring equation. It has been changed from the following:

\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}

To the following:
\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{(\log_{10}(|\text{Read} - \text{Write}|) + 1) \times \text{Price}}

I discovered that using the raw difference in read/write speed heavily penalized drives with anything greater than a 10MB/s difference. So I figured that it would be a little more subtle to penalize drives based on the order of magnitude of the difference instead.
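
To make the two equations concrete, here is a small Python 3 sketch of both. The function names are mine, and clamping the gap to 1MB/s (so a drive with identical read and write speeds doesn't blow up on log10(0)) is my own assumption; the equations above don't say what should happen in that case.

```python
import math


def old_score(read, write, capacity, price):
    """Original equation: read * write * capacity / price."""
    return (read * write * capacity) / price


def new_score(read, write, capacity, price):
    """Revised equation: also divide by log10(|read - write|) + 1."""
    # Assumption: clamp the gap to 1 so equal speeds give log10(1) = 0,
    # i.e. no penalty, instead of a math domain error.
    gap = max(abs(read - write), 1)
    return (read * write * capacity) / ((math.log10(gap) + 1) * price)


# A 285MB/s read, 275MB/s write, 120GB drive at $214.99: the gap is 10MB/s,
# so log10(10) + 1 = 2 and the revised score is exactly half the original.
print(old_score(285, 275, 120, 214.99), new_score(285, 275, 120, 214.99))
```

A 100MB/s gap would only divide the score by 3, which is the "order of magnitude" behaviour I was after.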

Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?". And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.


Manufacturer:  OCZ                  OCZ                  G.Skill              OCZ
Series:        RevoDrive            Vertex 2             Phoenix Pro Series   Agility 2
Capacity:      120GB                180GB                120GB                120GB
Read:          540MB/s              285MB/s              285MB/s              285MB/s
Write:         490MB/s              275MB/s              275MB/s              275MB/s
Item:          N82E16820227578[10]  N82E16820227602[11]  N82E16820231378[12]  N82E16820227593[13]
Price:         $299.99              $294.99              $214.99              $214.99

As you can see the RevoDrive far out-scores the rest of the SSDs considered in this analysis. The main reason is that they've essentially included two 60GB SSDs on the same card and you're expected to set up software RAID on them in your own system[14]. Despite the incredible speeds they boast I don't think I would purchase one of these to use as my OS/program disk because compatibility is a major limitation: you must be sure that your motherboard's BIOS supports booting from PCI-Express cards. Last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM, which is a major downside for me.

The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra $80.

Which leaves me with the last two disks, which are identical as far as my analysis is concerned. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k random write test, which seems a bit... optimistic. The OCZ makes no such claim, and as far as I'm concerned both disks are more or less the same thing. So it's pretty much up to brand preference at this point.

  1. I've already sent feedback to them suggesting that they fix this. []
  2. Only if the XML parser you're using supports it, which it seems not a whole lot of them do. At least not all of them support the full specification, which is annoying since nobody really seems to document which bits and pieces they support and which they don't. []
  3. Although not necessarily XML depending on the particular doctype you've chosen, Newegg's is transitional HTML. []
  4. lxml: http://codespeak.net/lxml/ []
  5. xml.etree.ElementTree: http://docs.python.org/library/xml.etree.elementtree.html []
  6. Some of the PCI-Express SSD's are stupidly fast and more expensive except that it doesn't look like any of them support TRIM yet which is a major problem for me. []
  7. It is rare that I have a matured (read: haven't reformatted in a while) install of windows along with all of my most commonly used programs and games that exceeds 60GB so I estimate that doubling this should accommodate for any sudden urges to install really big things. []
  8. I can't really justify spending much more than $300 on a single storage device. It had better be one hell of a storage device if I ever find myself spending more than $300 on it. []
  9. This will likely need to be updated at least once a month as Newegg is constantly adding new criteria and changing things. []
  10. OCZ RevoDrive []
  11. OCZ Vertex 2 []
  12. G.SKILL Phoenix Pro Series []
  13. OCZ Agility 2 []
  14. They show up as two separate physical devices despite being located on the same card. []
4Dec/100

Automagic TV Show Calendar

A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally enter the site on the main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.

To me, this is like opening presents on Christmas Day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to create events on my Google Calendar automatically for unaired episodes of my favorite TV shows.

I've previously written python scripts that interface with gdata but I find their implementation for python to be kind of cumbersome to deal with so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. Turns out you can use the built-in JSON module in python for creating the necessary structures.
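
For the curious, building the request body really is just a dictionary and a call to json.dumps. This is a minimal Python 3 sketch using the same structure as the AirDate.py script further down; the build_quick_add name and the sample event text are made up for illustration.

```python
import json


def build_quick_add(details):
    """Return the JSON-C request body for a quick-add calendar event."""
    return json.dumps({"data": {"details": details, "quickAdd": True}})


# Sample text only; quick add lets Google work out the date from the string.
body = build_quick_add("2010-12-06 Castle - S03E10")
print(body)
```

Compared to hand-assembling the equivalent XML entry, this is about as painless as it gets.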

For parsing the results I got from tvrage I ended up using Python's xml.etree.ElementTree, which was simple enough to set up to retrieve only the information for each episode I was interested in.[5]
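
Here is a trimmed-down Python 3 sketch of that parsing step, separate from the threading in AirDate.py below. The sample XML is made up in the shape the script expects (it is not real tvrage data), and the unaired function is a hypothetical name.

```python
from datetime import date
from xml.etree import ElementTree

# Made-up sample in the shape AirDate.py expects; not real tvrage data.
SAMPLE = """
<Show>
  <Episodelist>
    <Season no="1">
      <episode><seasonnum>01</seasonnum><airdate>2001-01-15</airdate><title>Pilot</title></episode>
      <episode><seasonnum>02</seasonnum><airdate>2999-00-00</airdate><title>Unknown Date</title></episode>
      <episode><seasonnum>03</seasonnum><airdate>2999-03-01</airdate><title>Far Future</title></episode>
    </Season>
  </Episodelist>
</Show>
"""


def unaired(xml_text, today):
    """Return (airdate, episode code, title) for episodes airing on or after today."""
    result = []
    root = ElementTree.fromstring(xml_text)
    for season in root.findall("Episodelist/Season"):
        season_num = int(season.get("no"))
        for episode in season.findall("episode"):
            episode_num = int(episode.find("seasonnum").text)
            try:
                year, month, day = (int(p) for p in episode.find("airdate").text.split("-"))
                airdate = date(year, month, day)
            except ValueError:
                continue  # e.g. "2999-00-00": part of the date is unknown
            if airdate >= today:
                code = "S%02dE%02d" % (season_num, episode_num)
                result.append((airdate, code, episode.find("title").text))
    return result


print(unaired(SAMPLE, date(2010, 12, 4)))
```

Passing today in as an argument instead of calling date.today() inside the function makes it easy to try against a saved feed.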

I had a bit of trouble initially with adding events to Google Calendar. Google will often respond with an HTTP redirect to a URL with a gsession parameter appended, and you're expected to resubmit the exact data from the first request to that URL. Once I figured this out it was turtles all the way down. I even managed to get the whole script multi-threaded to speed things up since it's impossible to perform batch requests with JSON-C.
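
The resubmit-on-redirect dance can be sketched on its own, without any of the Google specifics. This is a Python 3 sketch under a few assumptions: the send callable is a stand-in for whatever HTTP client you use, the redirect is assumed to arrive as a 302 with a Location header, and the function name is hypothetical.

```python
def post_following_redirects(url, body, send, max_hops=5):
    """POST body to url, re-sending the same body wherever a 302 points.

    send(url, body) is a caller-supplied function that performs one POST
    and returns (status_code, headers_dict, response_text).
    """
    for _ in range(max_hops):
        status, headers, text = send(url, body)
        if status != 302:
            return status, text
        # Location carries the original url with the session parameter appended.
        url = headers["Location"]
    raise RuntimeError("too many redirects")
```

The important part is that the body goes out unchanged on every hop; most HTTP libraries quietly turn a redirected POST into a GET, which is exactly what tripped me up.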

I should note that the calendar value in the configuration file should be the "Calendar ID", which can be found on the settings page for the individual calendar; it is grouped with the XML and iCal feeds.

ShowList.txt (show name and show_id separated by a tab, one show per line):[6]

Castle  19267
House   3908
Bones   2870
Big Bang Theory, The    8511
Mentalist, The  18967
Rizzoli & Isles 24996
Venture Bros., The  6270
Top Gear    6753
Mythbusters 4605
Archer  23354
NCIS    4628
Community   22589

Config.cfg:

[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com
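
To produce the value for the password line, something like the following Python 3 sketch works; the helper names are made up. Keep in mind that base64 is an encoding and not encryption, so this only keeps the password from being readable at a glance.

```python
import base64


def encode_password(plain):
    """Return the base64 text to paste into Config.cfg."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_password(stored):
    """Reverse of encode_password, mirroring what the script does at startup."""
    return base64.b64decode(stored).decode("utf-8")


print(encode_password("hunter2"))
```

Anyone who can read the file can decode it just as easily, so the config file still deserves sensible permissions.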

AirDate.py:

import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id
   
    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()
        # Parse XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)
        self.result = []
       
        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text
               
                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)
               
                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))
                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date)
                    pass

class addEvent(Thread):
    # Thread for adding events to google calendar
   
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode
   
    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
       
        # Build request including necessary headers and data
        calReq = urllib2.Request("https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        try:
            # Execute the request
            calRes = urllib2.urlopen(calReq)
            # Resubmit the same data to the redirect url (gsession appended)
            redirectReq = urllib2.Request(calRes.geturl(), entry, header)
            redirectRes = urllib2.urlopen(redirectReq)
        except urllib2.HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass
   
# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)
   
    # Parse JSON-C
    data = json.loads(calRes.read())
    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
           
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")
   
    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)
    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()
   
    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }
   
    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)
   
    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)
   
    threadQueue = []
    # For each show in the list
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)
       
    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)
   
    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)
   
    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()

This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows, provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events for the exact air time of each episode, which I'm not really all that concerned about. So I still use this script to add all-day events for each episode, which makes it easier to see when there's a new episode.

I did write another script using their XML API but that will have to wait for another post.

  1. http://tvrage.com/ []
  2. http://services.tvrage.com/ []
  3. Data API Developer's Guide: The Protocol []
  4. Google's own flavor of JSON which is almost identical to plain old JSON. []
  5. I only really needed the original air date, title, season number and episode number. []
  6. You can find the show_id via the show search found on their XML API page. []