A Little Off Code, Computers, Photography and Guns

20 Nov 2011

Decoding DVD Subtitles with Golang

I've always been very fond of subtitles, though I'm not sure why. When transcoding my DVDs to play them on my network media player, I realized I needed a good way to keep the subtitles without burning them into the video. The MPEG-4 container will happily include VOBSUB and SRT subtitle streams, and my network media player handles this nicely.

The problem, though, is that including the VOBSUBs exactly as they appeared on the DVD is somewhat problematic: they're almost never the same style between movies, and sometimes they're just plain difficult to read. Converting them to SRT involves a fairly lengthy process of telling an OCR program what each character is as it reads all of the subtitles and writes an SRT file. Mistakes are also difficult to correct if you get one character wrong along the way.

So now that I've got the itch I decided to scratch it. I decided to write my own subtitle decoder that would write subtitles to images and a pseudo-OCR program to convert those images into individual character files. From there it would be fairly easy to write a quick interface that presents you with a list of letters at which point you can just fill in the character for each one. Once you've done this you can export it as your favorite text subtitle format in one shot instead of doing it as you go along.

The language I decided to write it in is Golang, as I've been learning it for a few weeks now and it's currently my favorite language for a large number of reasons I won't get into here.

The first major challenge I ran into is that there's not really any standardized information about decoding DVD subtitles. I did find maybe 3-4 sites with varying levels of detail on the subject, but there were still a lot of gaps in the information.

To start with, we need to decode MPEG Program Stream (PS) packets, which contain MPEG Packetized Elementary Stream (PES) packets. The PS header doesn't contain any information we need to decode subtitles. The PES header contains the size of the packet's payload, the offset to the payload and the length of the additional headers. SubStream refers to the stream id of the subtitle we're decoding. DataSize is the size of the subtitle payload. ControlPtr is the offset to the control sequences that describe the subtitle's payload.

type Packet struct {
    PSHeader [14]uint8
    PESHeader [4]uint8
    PacketSize uint16
    Extension uint16
    HeaderSize uint8
    SubStream uint8
    DataSize uint16
    ControlPtr uint16
}

To read data into this structure I've written the following method:

func (p *Packet) Read(r io.Reader) {
    binary.Read(r, binary.BigEndian, &p.PSHeader)
    binary.Read(r, binary.BigEndian, &p.PESHeader)
    binary.Read(r, binary.BigEndian, &p.PacketSize)
    binary.Read(r, binary.BigEndian, &p.Extension)
    binary.Read(r, binary.BigEndian, &p.HeaderSize)
    r.(io.ReadSeeker).Seek(int64(p.HeaderSize), os.SEEK_CUR)
    binary.Read(r, binary.BigEndian, &p.SubStream)
    binary.Read(r, binary.BigEndian, &p.DataSize)
    binary.Read(r, binary.BigEndian, &p.ControlPtr)

    p.PacketSize -= uint16(p.HeaderSize) + 4

    // Back up; DataSize and ControlPtr are part of the payload
    r.(io.ReadSeeker).Seek(-4, os.SEEK_CUR)
}

We read each of the structure's fields in order, skipping the additional headers of the PES packet since we don't care about the data in them. We also adjust the given packet size, because it counts the extension, the header-size byte, the additional headers and the SubStream byte, none of which are part of the payload. Before leaving this function we back up so that the file cursor is at the right position to start reading data from the offsets.
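As a cross-check of the byte layout, here is a minimal sketch of the same header walk in Python (picked only because it is easy to poke at interactively; the decoder itself stays in Go). The name parse_packet and the two length constants are made up for this sketch, and it works on an in-memory bytes object instead of a file.

```python
import struct

PS_HEADER_LEN = 14   # Program Stream pack header, skipped
PES_START_LEN = 4    # PES start code and stream id, skipped


def parse_packet(buf, pos=0):
    """Parse one subtitle packet header starting at buf[pos].

    Returns (fields, payload_start). payload_start is the index of the
    first payload byte, which is where DataSize lives.
    """
    pos += PS_HEADER_LEN + PES_START_LEN
    packet_size, _extension, header_size = struct.unpack_from(">HHB", buf, pos)
    pos += 5 + header_size              # skip the additional PES headers
    sub_stream = buf[pos]
    pos += 1
    # Peek at DataSize and ControlPtr without consuming them: they are
    # already part of the payload.
    data_size, control_ptr = struct.unpack_from(">HH", buf, pos)
    fields = {
        "SubStream": sub_stream,
        "DataSize": data_size,
        "ControlPtr": control_ptr,
        # The size field also counts the extension (2 bytes), the
        # header-size byte, the additional headers and the sub-stream id.
        "PacketSize": packet_size - header_size - 4,
    }
    return fields, pos
```

Returning the payload start instead of seeking backwards plays the same role as the Seek(-4) at the end of the Go method.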

Subtitles may span more than one packet so we need to be sure to read packets until we've read the entire length of the subtitle given by Packet.DataSize.

func ReadSubtitle(s *os.File) (head Packet, data bytes.Buffer) {
    for i := 0; ; i++ {
        var pack Packet
        pack.Read(s)
        if i == 0 {
            head = pack
        }
        ReadFrom(s, &data, int64(pack.PacketSize))
        // >= rather than ==, so a malformed size can't make the loop run forever
        if data.Len() >= int(head.DataSize) {
            break
        }
    }
    return
}

Now that the headers and information like payload size and offsets have been read, we can start to decode the subtitle. The first things we need to decode are the control sequences. These sequences tell us how long to display the current subtitle, its color, and other information such as the offsets to the even and odd fields, since the image data is interlaced.

type ControlHeader struct {
    Date uint16
    Next uint16
}

ControlHeader represents the start time and offset to the next control sequence. Once we've got this information we can read the controls.

func ReadControlSequences(head Packet, data *bytes.Buffer) (rect Rect, payload Payload,  even, odd uint16) {
    payload.Read(data, head)

    for {
        var header ControlHeader
        err := ReadInto(&payload.Control, &header)
        if err != nil {
            break
        }
        fmt.Printf("%+v\n", header)
        end := false
        for !end {
            cmd, err := payload.Control.ReadByte()
            if err != nil {
                break
            }
            switch cmd {
                case 0x00: fmt.Println("\tForced")
                // Widen to uint32 first: 1024 * Date overflows a uint16 once Date reaches 64.
                case 0x01: fmt.Printf("\tStart:\t\t%dms\n", 1024 * uint32(header.Date) / 90)
                case 0x02: fmt.Printf("\tStop:\t\t%dms\n", 1024 * uint32(header.Date) / 90)
                case 0x03:
                    fmt.Printf("\tPalette:\t%04X\n", payload.Control.Next(2))
                case 0x04:
                    fmt.Printf("\tAlpha:\t\t%X\n", payload.Control.Next(2))
                case 0x05:
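                    buf := payload.Control.Next(6)
                    // Each bound is 12 bits: X0 and X1 are packed into bytes 0-2, Y0 and Y1 into bytes 3-5.
                    // Unpacking them first avoids mixing | with - and + at the same precedence level.
                    x0 := (uint16(buf[0]) << 4) | (uint16(buf[1]) >> 4)
                    x1 := ((uint16(buf[1]) & 0xF) << 8) | uint16(buf[2])
                    y0 := (uint16(buf[3]) << 4) | (uint16(buf[4]) >> 4)
                    y1 := ((uint16(buf[4]) & 0xF) << 8) | uint16(buf[5])
                    rect = Rect{x1 - x0 + 1, y1 - y0 + 1}
                    fmt.Printf("\tDimensions:\t%+v\n", rect)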
                case 0x06:
                    buf := payload.Control.Next(4)
                    even = uint16(buf[0]) << 8 | uint16(buf[1])
                    odd = uint16(buf[2]) << 8 | uint16(buf[3])
                    fmt.Printf("\tOffsets:\t%d, %d\n", even, odd)
                    fmt.Printf("\tField Len:\t%d, %d\n", odd - even, uint16(payload.Data.Len()) - odd)
                case 0xFF:
                    end = true
            }
        }
    }
    return
}

The control command is 1 byte and is followed by any parameters necessary for that particular control. The different controls are described below:

  • 0x00 - Forced: subtitle displayed whether or not subtitles are selected/enabled. This is typically used for foreign language segments. Takes no arguments.
  • 0x01 - Start: The time at which to start displaying the subtitle, this uses the Date field of ControlHeader. The time in milliseconds to start displaying the subtitle is given by the function: 1024 * ControlHeader.Date / 90. Takes no arguments.
  • 0x02 - Stop: The time at which to stop displaying the subtitle, takes no arguments.
  • 0x03 - Palette: Defines the four colors used for the subtitle. I've decided to ignore implementing this as I will be converting the subtitles to text. Takes 2 bytes of arguments, each color is one nibble.
  • 0x04 - Alpha: Alpha channel information, determines which colors are opaque and which are transparent. Useful for determining the main color as the background will likely have complete transparency. Takes 2 bytes of arguments, each alpha is one nibble.
  • 0x05 - Dimensions: Gives the dimensions of the subtitle image. Takes 6 bytes of arguments, each dimension value is 3 nibbles. Dimensions in pixels is given by the equation: (X1 - X0 + 1) x (Y1 - Y0 + 1)
    • 0x*** - X0: Left-most x-axis bound.
    • 0x*** - X1: Right-most x-axis bound.
    • 0x*** - Y0: Top-most y-axis bound.
    • 0x*** - Y1: Bottom-most y-axis bound.
  • 0x06 - Field Offsets: Gives the offsets to the even and odd fields of the image. Takes 4 bytes of arguments: the first two bytes are the even field offset, the last two bytes the odd field offset. This will be useful for rendering each field line in the proper order.
  • 0xFF - End Control: Signals the end of a control sequence.
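The argument packing for the Start, Dimensions and Field Offsets commands is easier to follow with the arithmetic written out. Below is a small sketch in Python (used here only for quick experimentation); the function names are invented for this example, and the sample bytes in the comments describe a made-up 720 pixel wide subtitle, not data from a real disc.

```python
import struct


def date_to_ms(date):
    """Convert a ControlHeader.Date value to milliseconds."""
    return 1024 * date // 90


def unpack_dimensions(args):
    """Split the 6 argument bytes of command 0x05 into (x0, x1, y0, y1).

    Each bound is 3 nibbles, so two bounds share 3 bytes.
    """
    x0 = (args[0] << 4) | (args[1] >> 4)
    x1 = ((args[1] & 0x0F) << 8) | args[2]
    y0 = (args[3] << 4) | (args[4] >> 4)
    y1 = ((args[4] & 0x0F) << 8) | args[5]
    return x0, x1, y0, y1


def image_size(args):
    """Width and height in pixels: (X1 - X0 + 1) x (Y1 - Y0 + 1)."""
    x0, x1, y0, y1 = unpack_dimensions(args)
    return x1 - x0 + 1, y1 - y0 + 1


def unpack_field_offsets(args):
    """Split the 4 argument bytes of command 0x06 into (even, odd)."""
    return struct.unpack(">HH", bytes(args))
```

For instance, the argument bytes 00 02 CF 00 21 DF unpack to X0 = 0x000, X1 = 0x2CF, Y0 = 0x002 and Y1 = 0x1DF.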

Now that we've got some information about the dimensions and locations of the subtitle image we can look at decoding and drawing it. Subtitle images are run-length encoded (RLE). The basic idea behind RLE is to compress the image data into a pixel color and a number of pixels to draw in that color. In the subtitle format each run is written as one of the following codes, where * represents a wildcard nibble:

  • 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
  • 0x1*, 0x2*, 0x3*
  • 0x04*, 0x05*, 0x06*, 0x07*, 0x08*, 0x09*, 0x0A*, 0x0B*, 0x0C*, 0x0D*, 0x0E*, 0x0F*
  • 0x01**, 0x02**, 0x03**
  • 0x000*

To determine the color and number of pixels to draw we need to do a little bitwise arithmetic. The number of pixels to draw is given by the operation: X >> 2. The color is given by X & 0x03.

There is one character in the alphabet which has a special meaning and neither of the above operations apply to it. That is 0x000* which is a sort of carriage return character. It means simply fill the rest of the line with the given color. After every carriage return we need to read a line from the opposite field and reset the x position in the image to 0 and increment the y position.
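Here is a rough sketch of that alphabet in Python, decoding a single line from a list of nibbles. The names read_run and decode_line are invented for this example, and a count of 0 is my own convention for signalling the carriage return; the thresholds are the smallest value each code length can take in the list above.

```python
def read_run(next_nibble):
    """Read one run code and return (count, color).

    next_nibble is any zero-argument callable that returns the next
    4-bit value. A count of 0 stands for the 0x000* code, meaning
    "fill the rest of the line with this color".
    """
    code = next_nibble()
    # Codes are 1, 2, 3 or 4 nibbles long; keep appending nibbles
    # until the value is large enough to be a complete code.
    for minimum in (0x4, 0x10, 0x40):
        if code >= minimum:
            return code >> 2, code & 0x3
        code = (code << 4) | next_nibble()
    if code >= 0x100:
        return code >> 2, code & 0x3
    return 0, code & 0x3


def decode_line(nibbles, width):
    """Decode nibbles up to and including the end-of-line code."""
    it = iter(nibbles)
    pixels = []
    while True:
        count, color = read_run(lambda: next(it))
        if count == 0:
            # Carriage return: pad to the full width and stop.
            pixels.extend([color] * (width - len(pixels)))
            return pixels
        pixels.extend([color] * count)
```

Feeding it the nibbles 5, 1, 2, 0, 0, 0, 3 with a width of 8 gives one pixel of color 1, four of color 2, and the remaining three filled with color 3.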

Before we get into the code about drawing images I should mention one of the problems I ran into while writing this. The problem is that Golang doesn't provide any mechanism for reading nibble-aligned data. So I went ahead and wrote a small structure and a few methods for accomplishing this.

type Nibbler struct {
    r *bytes.Buffer
    Current uint8
    Aligned uint8
}

func NewNibbler(r *bytes.Buffer) Nibbler {
    return Nibbler{r, 0, 0}
}

func (n *Nibbler) GetNibble() (b uint8, err os.Error) {
    if n.Aligned == 0 {
        err = ReadInto(n.r, &n.Current)
        if err != nil {
            return 0, err
        }
    }
    n.Aligned ^= 4
    b = (n.Current >> n.Aligned) & 0x0F
    return b, err
}

The basic functionality is achieved by using some bitwise operations to switch which nibble we return each time the GetNibble method is called and reading a new byte every time we've read the 2nd nibble of the current byte. Access is provided to the Aligned field to determine if we're byte-aligned or not since we need to use this in the function that draws the subtitle images.
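The toggle is easiest to see in isolation. The sketch below mirrors the Go structure in Python so the XOR trick can be tried at a prompt; the align method is an addition of mine that discards a pending low nibble, which is what the drawing code needs at the end of each line.

```python
class Nibbler:
    """Hand out a byte string four bits at a time, high nibble first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0        # index of the next unread byte
        self.current = 0    # byte currently being split
        self.aligned = 0    # 0 when byte-aligned, 4 after the high nibble

    def get_nibble(self):
        if self.aligned == 0:
            # Raises IndexError once the data runs out.
            self.current = self.data[self.pos]
            self.pos += 1
        self.aligned ^= 4   # flips 0 -> 4 -> 0 -> 4 ...
        return (self.current >> self.aligned) & 0x0F

    def align(self):
        """Skip the unread low nibble, if there is one."""
        self.aligned = 0
```

Shifting by aligned after the flip means the first call shifts by 4 (high nibble) and the second by 0 (low nibble), so no branch is needed to pick the half.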

The following code decodes the RLE image and draws all of the pixels to an image of dimensions specified in the control sequence.

func DrawPixels(s *image.Gray, x uint16, y uint16, n uint16, c uint8) {
    for i := 0; i < int(n); i++ {
        s.SetGray(int(x) + i, int(y), image.GrayColor{(c + 1) << 6})
    }
}

func ReadRLEImage(rect Rect, payload *Payload, even, odd uint16) (*image.Gray) {
    subImg := image.NewGray(int(rect.w), int(rect.h))
    bData := payload.Data.Bytes()
    evenNibbler := NewNibbler(bytes.NewBuffer(bData[even:odd]))
    oddNibbler := NewNibbler(bytes.NewBuffer(bData[odd:]))

    var x, y uint16
    done := false
    field := true

    for !done {
        var b uint16
        var t uint8

        var currentNibbler *Nibbler

        if field {
            currentNibbler = &evenNibbler
        } else {
            currentNibbler = &oddNibbler
        }

        t, _ = currentNibbler.GetNibble()
        b = (b << 4) | uint16(t)
        if b >= 0x4 {
            run := b >> 2
            DrawPixels(subImg, x, y, run, uint8(b & 0x3))
            x += run
        } else {
            t, _ := currentNibbler.GetNibble()
            b = (b << 4) | uint16(t)
            if b >= 0x10 {
                run := b >> 2
                DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                x += run
            } else {
                t, _ := currentNibbler.GetNibble()
                b = (b << 4) | uint16(t)
                if b >= 0x40 {
                    run := b >> 2
                    DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                    x += run
                } else {
                    t, _ := currentNibbler.GetNibble()
                    b = (b << 4) | uint16(t)
                    if b >= 0x100 {
                        run := b >> 2
                        DrawPixels(subImg, x, y, run, uint8(b & 0x3))
                        x += run
                    } else {
                        DrawPixels(subImg, x, y, rect.w - x, uint8(b & 0x3))
                        x = 0
                        y += 1
                        field = !field
                        if y >= rect.h {
                            done = true
                        }
                        if currentNibbler.Aligned != 0 {
                            currentNibbler.GetNibble()
                        }
                    }
                }
            }
        }
    }
    return subImg
}

You'll notice I used the even and odd field offsets to create buffers for both the even and odd fields of the image. To switch between them, the pointer currentNibbler is flipped to the other field whenever we encounter a carriage return. I've also done some basic math in the DrawPixels function to map the four subtitle colors onto distinct grey levels: (c + 1) << 6 gives 64, 128 and 192 for colors 0 to 2, and wraps around to 0 for color 3 because the value is a uint8.
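Working the grey mapping through by hand shows what each color index ends up as. The Python sketch below reproduces the 8-bit arithmetic of DrawPixels; grey_even is a hypothetical alternative, not something the program uses, that spreads the four indices over the full 0 to 255 range instead.

```python
def grey_shift(c):
    """(c + 1) << 6 truncated to 8 bits, as a Go uint8 would be."""
    return ((c + 1) << 6) & 0xFF


def grey_even(c):
    """Hypothetical alternative: 0..3 spaced evenly over 0..255."""
    return c * 85


if __name__ == "__main__":
    for c in range(4):
        print(c, grey_shift(c), grey_even(c))
```

Either mapping keeps the four colors distinguishable, which is all the later character-separation step should need.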

The next step for this project is to write a program which can detect and separate images of each character from a subtitle image. After this I'll write a user interface for the user to give character meanings to each character image. From that an SRT file can be written using this character matrix. This is the same basic operation of most VOBSUB to SRT converters except that I aim to make it easier to use.

The complete source for this program can be found at: Gist: 1381809. Note that this program will read and decode only the first subtitle in the subtitle file. More work will be done on this when I've got time to make a more automated version that will read and decode all subtitles from a file. At some point in the future when I find that GeSHi supports Golang syntax highlighting, I'll update this post to make it more readable.

16 Mar 2011

Newegg’s JSON API

For the longest time I've wanted access to Newegg's product list. For me they've been one of the better and more structured websites for buying computer hardware. So naturally they're usually my first choice when it comes to finding a good deal on a particular piece of hardware. They're also rather useful for seeing what's out there since their product catalog is fairly complete.

A while back I started wanting to sort through items to heuristically pick the best deal based on a number of features Newegg generally provides for each item. This method works pretty well on SSDs and system memory. But until a recent discovery I was limited to scraping Newegg's website to get any kind of information from them. If you've ever tried this sort of thing you know that it is messy and generally a bad idea: every time Newegg changes the structure of their website, or any minute detail of it, your scraping script will almost always break.

The discovery came in the form of a mobile application for Android[1]. The mobile app lets you browse their website in a clean and fast manner. But what got me thinking is that unlike some other mobile applications out there that are just application wrappers for the mobile version of their websites this one operates directly through the native GUI. Now this is where it got interesting. I knew that if Newegg had written the app to use the native GUI then they had to be providing the data to it somehow and I knew it had to be more structured than HTML scraping like what I've been doing[2]. You have no idea how happy I was to discover that I was right.

First thing I did was connect my Droid 2 Global to my home network via WiFi in order to sniff some of the traffic going to and from the mobile app. This was accomplished by mounting a CIFS drive from my Windows 7 desktop to my router running Tomato based firmware. The share had a binary for TCPDump which I then used to sniff for traffic originating or going to my phone's IP address. After setting this up and performing all of the basic operations I would need in order to "reverse engineer" the data source I got to work on filtering the important bits.

In Wireshark I immediately discovered that they had a sub-domain they were using for these operations. All of the web requests that weren't images or for customer metrics and tracking went to this host:

http://www.ows.newegg.com/

Because this API is structured more or less the same as navigating their site and the identifiers are different I decided to start with writing a query builder. Basically the purpose was to allow me to browse to the particular category I was interested in analyzing and filter it down to just a few simple requirements to simplify the analysis.

The first major entry point in the process of browsing to what you're interested in pulling is:

http://www.ows.newegg.com/Stores.egg/Menus

This takes no parameters and provides the main menu:

[
    {
        "StoreDepa": "ComputerHardware",
        "StoreID": 1,
        "ShowSeeAllDeals": true,
        "Title": "Computer Hardware"
    },
    {
        "StoreDepa": "PCNotebook",
        "StoreID": 3,
        "ShowSeeAllDeals": true,
        "Title": "PCs & Laptops"
    },
    {
        "StoreDepa": "Electronics",
        "StoreID": 10,
        "ShowSeeAllDeals": true,
        "Title": "Electronics"
    },
    ...
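Since the menu is a plain JSON array, pulling the StoreID for a given store takes only a few lines. This is a sketch in Python 3 that works on a response body you have already fetched; store_ids_by_title is a name invented for this example, and it assumes every entry carries the Title and StoreID keys seen above.

```python
import json

MENUS_URL = "http://www.ows.newegg.com/Stores.egg/Menus"


def store_ids_by_title(menu_json):
    """Map each store's Title to its StoreID.

    menu_json is the raw response text from the Menus endpoint.
    """
    return {store["Title"]: store["StoreID"] for store in json.loads(menu_json)}


if __name__ == "__main__":
    sample = """[
        {"StoreDepa": "ComputerHardware", "StoreID": 1,
         "ShowSeeAllDeals": true, "Title": "Computer Hardware"},
        {"StoreDepa": "Electronics", "StoreID": 10,
         "ShowSeeAllDeals": true, "Title": "Electronics"}
    ]"""
    print(store_ids_by_title(sample))
```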

Once you've selected a store to browse, the next URI is:

http://www.ows.newegg.com/Stores.egg/Categories/{StoreID}

The only parameter it takes is StoreID, which you'll find in the first query. This will return all of the categories within a store. I haven't really explored this very much as I'm only really interested in browsing system memory and SSDs. Using the Computer Hardware store the output is as follows:

[
    {
        "Description": "Backup Devices & Media",
        "StoreID": 1,
        "NodeId": 6642,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 2
    },
    {
        "Description": "Barebone / Mini Computers",
        "StoreID": 1,
        "NodeId": 6668,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 3
    },
    {
        "Description": "CD / DVD Burners & Media",
        "StoreID": 1,
        "NodeId": 6646,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 10
    },
    ...

StoreID is echoed back from the parameters of the request. I'm not exactly sure how to describe the purpose of NodeID (NodeId in the JSON), but it appears to identify a particular category or subcategory. CategoryID is used for filtering results down to a specific category and can refer to either a root category or a subcategory. CategoryType tells you whether a category can be searched directly or whether it contains subcategories: in the listings shown here, a value of 1 marks a category you can take straight to the search query, while 0 marks one whose subcategories have to be fetched first.

Now depending on CategoryType you either move straight to the search query or onto a navigation query. The navigation query is used if there are subcategories:

http://www.ows.newegg.com/Stores.egg/Navigation/{StoreID}/{CategoryID}/{NodeID}

This query takes StoreID, CategoryID and NodeID, which you can get from the category listing of a particular store. It will return a subcategory list. Below is the subcategory listing for the memory category.

[
    {
        "Description": "Desktop Memory",
        "StoreID": 1,
        "NodeId": 7611,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 147
    },
    {
        "Description": "Flash Memory",
        "StoreID": 1,
        "NodeId": 8038,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 68
    },
    {
        "Description": "Laptop Memory",
        "StoreID": 1,
        "NodeId": 7609,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 381
    },
    ...
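The choice between navigating further and searching can be sketched as a small helper. This is Python 3 and only builds URLs, it does not fetch anything. The function names are invented for this example, and treating a CategoryType of 1 as "ready to search" is an assumption taken from the sample listings and from the query builder script at the end of this post.

```python
BASE_URL = "http://www.ows.newegg.com"


def navigation_url(store_id, category_id, node_id):
    """URL of the navigation query for a category with subcategories."""
    return "{}/Stores.egg/Navigation/{}/{}/{}".format(
        BASE_URL, store_id, category_id, node_id)


def next_step(category):
    """Decide what to do with one entry of a category listing.

    Returns ("search", None) when the entry can go straight to the
    search query, otherwise ("navigate", url_of_subcategory_listing).
    """
    if category["CategoryType"] == 1:   # assumption: 1 = directly searchable
        return "search", None
    return "navigate", navigation_url(
        category["StoreID"], category["CategoryID"], category["NodeId"])
```

Applied to the "Backup Devices & Media" entry above this yields a navigation URL ending in /1/2/6642, while "Desktop Memory" goes straight to search.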

From here you go to the search query[3]. This is where it gets a little tricky: the parameters are no longer sent via GET but via POST[4], which means you need a programmatic way of making the query. Given a category, store and node, the search query returns quite a lot. First in the response are the search filtering parameters, which let you limit the products shown in the listing.

The server only returns a non-404 response if data is posted. If you really wanted to, you could send an empty dictionary, which would query Newegg's entire product list. Any of the query options can be omitted; integer values are omitted by setting them to -1.

The parameters you should concern yourself with are shown below. They are posted as JSON to the following URL:

http://www.ows.newegg.com/Search.egg/Advanced

data = {
    "SubCategoryId": 147,
    "NValue": "",
    "StoreDepaId": 1,
    "NodeId": 7611,
    "BrandId": -1,
    "PageNumber": 1,
    "CategoryId": 17
}

NValue is a space-separated list of NValues from the search parameters. Mind you, you cannot filter against more than one item in any one group of search filters. For example, in system memory you can't select DDR3 1333 (PC3 10600), DDR3 1333 (PC3 10660) and DDR3 1333 (PC3 10666) together; the query will return an unsuccessful search result. The rest of the parameters are fairly self-explanatory.
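Putting the POST body together might look like the following Python 3 sketch. It only prepares the request and never sends it, and it makes no assumption about whether the host still answers. The helper names are invented here; the key names and the -1 convention for omitted integers come from the description above.

```python
import json
import urllib.request

SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def build_search_body(store_id=None, category_id=None, sub_category_id=None,
                      node_id=None, nvalues=(), page=1):
    """Return the JSON body for a search as UTF-8 bytes."""
    def or_minus_one(value):
        # Integer options are "omitted" by sending -1.
        return -1 if value is None else value

    body = {
        "StoreDepaId": or_minus_one(store_id),
        "CategoryId": or_minus_one(category_id),
        "SubCategoryId": or_minus_one(sub_category_id),
        "NodeId": or_minus_one(node_id),
        "BrandId": -1,
        # NValue is a single space-separated string of filter values.
        "NValue": " ".join(str(v) for v in nvalues),
        "PageNumber": page,
    }
    return json.dumps(body).encode("utf-8")


def build_search_request(**options):
    """Prepare, but do not send, the POST request for a search."""
    return urllib.request.Request(
        SEARCH_URL, data=build_search_body(**options), method="POST")
```

Sending it would be a matter of passing the result to urllib.request.urlopen and running json.loads over the response body.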

The result will contain the following elements: RelatedLinkList, CoremetricsInfo, NavigationContentList, PaginationInfo and ProductListItems. CoremetricsInfo and RelatedLinkList can usually be ignored. NavigationContentList is a list of search parameters/filters you can apply to the search. PaginationInfo describes how many elements were returned, what page we're on and how many elements there are per page. Last but not least, ProductListItems provides a list of the products returned by the query along with some basic listing info for each one.

Below is a portion of the NavigationContentList:

{
    "NavigationContentList": [
        {
            "NavigationItemList": [
                {
                    "SubCategoryId": -1,
                    "Description": "Free Shipping",
                    "StoreDepaId": 94,
                    "NValue": "100007611 600006050 600052012 4808",
                    "BrandId": -1,
                    "StoreType": 4,
                    "ItemCount": 194,
                    "CategoryId": -1,
                    "ElementValue": "4808"
                },
                {
                    "SubCategoryId": -1,
                    "Description": "Top Sellers",
                    "StoreDepaId": -1,
                    "NValue": "100007611 600006050 600052012 4802",
                    "BrandId": -1,
                    "StoreType": -1,
                    "ItemCount": 39,
                    "CategoryId": -1,
                    "ElementValue": "4802"
                },
                ...

This section will also contain a group name:

            ...
            "TitleItem": {
                "SubCategoryId": -1,
                "Description": "Useful Links",
                "StoreDepaId": -1,
                "NValue": "4800",
                "BrandId": -1,
                "StoreType": -2,
                "ItemCount": 0,
                "CategoryId": -1,
                "ElementValue": "4800"
            }
            ...

The PaginationInfo and ProductListItem elements will look like the following:

    ...
    "PaginationInfo": {
        "TotalCount": 233,
        "PageNumber": 1,
        "PageSize": 20
    },
    "ProductListItems": [
        {
            "SellerId": null,
            "ItemOwnerType": 0,
            "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139",
            "ItemGroupID": 0,
            "ReviewSummary": {
                "Rating": 5,
                "TotalReviews": "[1]"
            },
            "IsCellPhoneItem": false,
            "Discount": null,
            "FinalPrice": "$104.99",
            "ItemNumber": "20-148-372",
            "MappingFinalPrice": "$104.99",
            "FreeShippingFlag": true,
            "OriginalPrice": "$104.99",
            "IsComboBundle": false,
            "MailInRebateText": null,
            "ProductStockType": 0,
            "Model": "BL2KIT25664FN2139",
            "ShowOriginalPrice": false,
            "Image": {
                "FullPath": "http://images17.newegg.com/is/image/newegg/20-148-372-TS?$S125W$",
                "SmallImagePath": null,
                "ThumbnailImagePath": null,
                "Title": null
            },
            "SellerName": null,
            "ParentItem": null
        },
        ...
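Two things fall straight out of this part of the response: how many pages there are to walk, and a numeric price for comparing items. A minimal Python 3 sketch follows; the function names are invented for this example, and stripping commas from prices is an assumption about how larger amounts would be formatted.

```python
from decimal import Decimal


def page_count(pagination):
    """Number of result pages implied by a PaginationInfo element."""
    total = pagination["TotalCount"]
    size = pagination["PageSize"]
    return (total + size - 1) // size   # integer ceiling division


def parse_price(text):
    """Turn a price string such as "$104.99" into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def summarize(item):
    """Pick the fields most useful for comparing deals."""
    return (item["ItemNumber"],
            parse_price(item["FinalPrice"]),
            item["FreeShippingFlag"])
```

With the PaginationInfo shown above (233 items at 20 per page), PageNumber would have to run from 1 to 12 to see every product.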

At this point you might be wondering: what good will all this do me if I can't get specifications on an item? Well, you can, and here's how. In each ProductListItems element you'll find an ItemNumber, which is essentially the primary key for each product within this interface to Newegg's product list. Using the following URL you can obtain the full details page for any given item using its ItemNumber:

http://www.ows.newegg.com/Products.egg/{ItemNumber}/Specification

{
    "SpecificationGroupList": [
        {
            "GroupName": "Model",
            "SpecificationPairList": [
                {
                    "Value": "Crucial",
                    "Key": "Brand"
                },
                {
                    "Value": "Ballistix",
                    "Key": "Series"
                },
                {
                    "Value": "BL2KIT25664FN2139",
                    "Key": "Model"
                },
                {
                    "Value": "240-Pin DDR3 SDRAM",
                    "Key": "Type"
                }
            ]
        },
        {
            "GroupName": "Tech Spec",
            "SpecificationPairList": [
                {
                    "Value": "4GB (2 x 2GB)",
                    "Key": "Capacity"
                },
                {
                    "Value": "DDR3 2133 (PC3 17000)",
                    "Key": "Speed"
                },
                {
                    "Value": "9",
                    "Key": "Cas Latency"
                },
                {
                    "Value": "9-10-9-24",
                    "Key": "Timing"
                },
                {
                    "Value": "1.65V",
                    "Key": "Voltage"
                },
                {
                    "Value": "No",
                    "Key": "ECC"
                },
                {
                    "Value": "Unbuffered",
                    "Key": "Buffered/Registered"
                },
                {
                    "Value": "Dual Channel Kit",
                    "Key": "Multi-channel Kit"
                }
            ]
        },
        {
            "GroupName": "Manufacturer Warranty",
            "SpecificationPairList": [
                {
                    "Value": "Lifetime limited",
                    "Key": "Parts"
                },
                {
                    "Value": "Lifetime limited",
                    "Key": "Labor"
                }
            ]
        }
    ],
    "NeweggItemNumber": "N82E16820148372",
    "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139"
}
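For analysis it is handier to have the specifications as a flat lookup than as nested groups. The Python 3 sketch below does that on an already-decoded response; spec_url and flatten_specs are names invented for this example, and keying on (GroupName, Key) is my own choice to avoid collisions such as "Model" appearing both as a group and as a key.

```python
def spec_url(item_number):
    """URL of the specification page for one ItemNumber."""
    return "http://www.ows.newegg.com/Products.egg/{}/Specification".format(
        item_number)


def flatten_specs(spec):
    """Flatten SpecificationGroupList into {(group, key): value}."""
    flat = {}
    for group in spec["SpecificationGroupList"]:
        for pair in group["SpecificationPairList"]:
            flat[(group["GroupName"], pair["Key"])] = pair["Value"]
    return flat
```

With the response above, flatten_specs(data)[("Tech Spec", "Cas Latency")] is the string "9"; every value stays a string, so any numeric comparison needs its own parsing step.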

From this point on you can grab all of the features and specifications of any particular item you're interested in. In the near future I'll be writing a new post for both my memory and SSD analysis scripts using this interface.

The full code for my query builder is as follows, though you should note this was a quick script and is in no way complete or fully functional. As soon as it reached a usable point I moved on to the main point of this whole ordeal. You should also note that this requires CherryPy[5] and lxml[6]. The end result of this program is a query which you can use to retrieve a list of products matching the options you've selected. This is mainly to simplify product list selection and to minimize the need to hardcode certain values, as Newegg has a tendency to change things around on a regular basis.

import cherrypy, json, urllib, urllib2
from lxml import etree
from lxml.builder import E

class QueryBuilder(object):
    def index(self):
        request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Menus")
        response = request.read()
        data = json.loads(response)
       
        body = E.body()
       
        ul = E.ul()
        for store in data:
            ul.append(E.li(E.a(
                store['Title'],
                href= '/Store?StoreID={}'.format(store['StoreID'])
            )))
       
        page = E.html(E.body(ul))
       
        return etree.tostring(page, pretty_print=True)
    index.exposed = True
   
    def Store(self, StoreID=None):
        if StoreID is not None:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Categories/{}".format(StoreID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for category in data:
                if category['CategoryType'] == 1:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Search?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
                else:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Category?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Store.exposed = True
   
    def Category(self, StoreID, CategoryID, NodeID):
        if None not in [StoreID, CategoryID, NodeID]:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}".format(StoreID, CategoryID, NodeID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for subcategory in data:
                ul.append(E.li(E.a(
                    subcategory['Description'],
                    href= '/Search?StoreID={}&CategoryID={}&SubCategoryID={}&NodeID={}'.format(StoreID, CategoryID, subcategory['CategoryID'], subcategory['NodeId'])
                )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Category.exposed = True
   
    def Search(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None):
        url = "http://www.ows.newegg.com/Search.egg/Advanced"
        data = {
            "IsUPCCodeSearch":      False,
            "IsSubCategorySearch":  True,
            "isGuideAdvanceSearch": False,
            "StoreDepaId":          StoreID,
            "CategoryId":           CategoryID,
            "SubCategoryId":        SubCategoryID,
            "NodeId":               NodeID,
            "BrandId":              -1,
            "NValue":               "",
            "Keyword":              "",
            "Sort":                 "FEATURED",
            "PageNumber":           1
        }
       
        params = json.dumps(data).replace("null", "-1")
        request = urllib2.Request(url, params)
        response = urllib2.urlopen(request)
        data = json.loads(response.read())
       
        if data['NavigationContentList'] is None:
            return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
       
        body = E.body()
   
        form = E.form(name='PowerSearch', action='GenerateURL', method='GET')
       
        table = E.table()
        form.append(table)
        for section in data['NavigationContentList']:
            index = 0
            tr = E.tr(E.td(section['TitleItem']['Description'], colspan='3'))
            table.append(tr)
            for option in section['NavigationItemList']:
                if index % 3 == 0:
                    tr = E.tr()
                    table.append(tr)
                index += 1
                checkbox = E.td(E.input(option["Description"], type="checkbox", name=section['TitleItem']['Description'].replace(" ", ""), value=option['NValue']))
                tr.append(checkbox)
       
        for param, value in [('StoreID', StoreID), ('CategoryID', CategoryID), ('SubCategoryID', SubCategoryID), ('NodeID',NodeID)]:
            try:
                form.append(E.input(type='hidden', name=param, value=value))
            except KeyError:
                pass
        form.append(E.input(type='submit', value='Submit'))
        page = E.html(E.body(form))
       
        return etree.tostring(page, pretty_print=True)
    Search.exposed = True
   
    def GenerateURL(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None, **kwargs):
        NValue = set([])
        for arg in kwargs:
            if type(kwargs[arg]) == list:
                for value in kwargs[arg]:
                    NValue.add(value)
            else:
                NValue.add(kwargs[arg])
       
        NValue = list(NValue)
        NValue.sort()
        if StoreID is None:
            StoreID = -1
        if CategoryID is None:
            CategoryID = -1
        if SubCategoryID is None:
            SubCategoryID = -1
        if NodeID is None:
            NodeID = -1
        data = {
            "StoreDepaId":          int(StoreID),
            "CategoryId":           int(CategoryID),
            "SubCategoryId":        int(SubCategoryID),
            "NodeId":               int(NodeID),
            "BrandId":              -1,
            "NValue":               ' '.join(NValue),
            "PageNumber":           1
        }
        return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
    GenerateURL.exposed = True
   
cherrypy.quickstart(QueryBuilder())
  1. And iOS devices I assume as well. []
  2. Because let's face it, that would be stupid. []
  3. ... or get to the search query from selecting a root category in the main category listing for a store []
  4. At least this is the method used by the mobile app. []
  5. CherryPy: CherryPy is a pythonic, object-oriented HTTP framework. []
  6. lxml: A Pythonic binding for the C libraries libxml2 and libxslt. []
23Feb/112

Installing Ubuntu via Network

At some point in the last 6 months or so I may or may not have accidentally left my 1GB Sandisk Cruzer in a pair of jeans when they went through the washer AND the dryer. As such it's not exactly in peak physical condition[1] and for whatever reason I've had issues with using it for installing certain things[2] lately[3].

Anyway it has become time again to get my file server back up and running and I needed to reinstall Ubuntu on it. Given my extreme laziness when it comes to doing this sort of stuff I was in no mood to move everything to the top of my desktop[4] so I opted to try pxe booting[5] again.

I've messed with pxe booting in the past, particularly with GeeXboX[6] for my media center and that was a nightmare at the time and essentially required you to have a linux system in order to do it. Since then a wonderful application has made its way into the internet: tftpd32[7]. Tftpd32 greatly simplifies the whole process by not requiring you to install anything or make any major system changes.

Before you continue take note, these instructions assume a few things:

  • You're serving the netboot images from a windows system.
  • You have a tomato based router, although these instructions can be easily modified to work with any router firmware that uses DNSMasq or allows you to change advanced settings for the DHCP server.

Things you'll need:

  • Ubuntu Alternative ISO: This will be used for setting up the local http repository.
  • Ubuntu NetBoot Image: Grab netboot.tar.gz
  • tftpd32: This will be used for serving files during PXE booting.
  • HFS ~ HTTP File Server: This will be used for setting up a local http repository for installing from our local network instead of having ubuntu download everything from a mirror.

Router Settings:

  • Advanced -> DHCP / DNS -> Dnsmasq Custom Configuration
  • dhcp-boot=pxelinux.0,,[tftpd32 server ip address]
  • Save.

For ease of readability from this point forward files will be bolded and directories will be italicized.

  • Untar netboot.tar.gz into a folder, which I'll refer to as netboot from now on.
  • Delete pxelinux.0 and pxelinux.cfg from netboot/ as these are symlinks which will not work in windows.
  • Create the directory netboot/pxelinux.cfg/
  • Copy pxelinux.0 from netboot/ubuntu-installer/i386/ to netboot/
  • Copy sysconfig.cfg from netboot/ubuntu-installer/i386/boot-screens/ to netboot/pxelinux.cfg/
  • Rename netboot/pxelinux.cfg/sysconfig.cfg to netboot/pxelinux.cfg/default
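
If you'd rather script the folder shuffling above, here is a minimal sketch in Python. It assumes netboot.tar.gz has already been extracted, it takes the config file name from the steps above as a default argument, and the function name prepare_netboot is made up for illustration. It has not been tried against a real extraction on Windows, so treat it as a starting point rather than a finished tool.

```python
import os
import shutil


def prepare_netboot(netboot_dir, arch="i386", config_name="sysconfig.cfg"):
    """Rearrange an extracted netboot folder so it can be served as-is."""
    installer_dir = os.path.join(netboot_dir, "ubuntu-installer", arch)
    boot_file = os.path.join(netboot_dir, "pxelinux.0")
    config_dir = os.path.join(netboot_dir, "pxelinux.cfg")

    # Delete whatever the archive's two symlinks turned into on extraction.
    for path in (boot_file, config_dir):
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        elif os.path.lexists(path):
            os.remove(path)

    # Recreate them as a real file and a real directory.
    os.makedirs(config_dir)
    shutil.copy(os.path.join(installer_dir, "pxelinux.0"), boot_file)
    # Copy and rename the boot menu config in one step.
    shutil.copy(
        os.path.join(installer_dir, "boot-screens", config_name),
        os.path.join(config_dir, "default"),
    )
    return config_dir
```

Calling prepare_netboot with the path to your netboot folder should leave pxelinux.0 and pxelinux.cfg/default in place, ready for tftpd32.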

Preparing tftpd32:

  • Run tftpd32
  • Browse to the netboot folder we just finished setting up.
  • Tftpd32 should be serving the files in that directory at this point.

Preparing the local HTTP Ubuntu Repository:

  • Run HFS.exe
  • Extract all of the files from ubuntu-10.10-alternate-i386.iso to a folder which I'll refer to as ubuntu-alt from this point on.
  • In the Virtual File System pane right click -> Add Folder from disk...
    • Browse to and select ubuntu-alt
    • When HFS prompts you to ask what kind of folder it should be added as, select Real Folder
  • Note the link in the address bar next to Open in browser, you'll use this link when installing ubuntu.

Installing Ubuntu:

  • Boot the system you're attempting to install Ubuntu on from your network device.
  • If you have tftpd32 up on another monitor at this point you should see a deluge of requests in the tftp server tab.
  • Ubuntu should show a boot menu; select Install.
  • Now I'm not going to go into full detail on how to install Ubuntu, but when you get to mirror selection there should be an option at the very top of the list to enter a mirror manually. This is where you should enter the address from the address bar in HFS; be sure to include the port value as well.
  • If all goes well it should start installing and you should see another deluge of requests in HFS.
  1. In fact it's pretty far from peak physical condition. []
  2. Like ubuntu for example. []
  3. I'm not entirely sure if this is due to washing it or just from it being nearly 5 years old. []
  4. So the cable for the USB adapter I've got my DVD drive connected to in my desktop can reach my mini-itx board. []
  5. Preboot Execution Environment []
  6. GeeXboX []
  7. tftpd32: An open-source tftp/dhcp/syslog server for Windows. []
12Dec/100

Choosing an SSD (A more different S)

I've been periodically going back and revisiting the results for my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page although I can't figure out why they decided it didn't need to be there in the first place. I see it this way: SSD's are first and foremost a storage device, you'd think that one of the most important features that should be listed with every SSD is the capacity at least.

Anyway, this change broke my script, which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML more often lately despite my original prejudice against it for being a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2], which is an incredibly useful "language" for selecting data from an XML document.

Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.
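
To give a feel for what XPath-style selection looks like, here is a small sketch. To keep it dependency-free it swaps in the standard library's xml.etree.ElementTree for lxml, which only supports a subset of XPath and only accepts well-formed markup, so it would not cope with a real Newegg page. The sample markup, the price layout and the function name extract_items are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page.
SAMPLE = """
<html><body>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/1">Drive A</a>
    <ul><li class="priceFinal"><span>$</span><strong>214</strong><sup>.99</sup></li></ul>
  </div>
  <div class="sidebar"><a href="http://example.com/ad">Ad</a></div>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/2">Drive B</a>
    <ul><li class="priceFinal"><span>$</span><strong>299</strong><sup>.99</sup></li></ul>
  </div>
</body></html>
"""


def extract_items(markup):
    """Return the link and price of every itemCell div in the markup."""
    root = ET.fromstring(markup)
    items = []
    # Select only the divs whose class attribute is exactly "itemCell".
    for cell in root.findall(".//div[@class='itemCell']"):
        link = cell.find(".//a[@title='View Details']")
        # The price is split across child elements; skip the currency sign.
        parts = [n.text for n in cell.findall(".//li[@class='priceFinal']/*")]
        items.append({
            "link": link.get("href"),
            "price": float("".join(parts[1:])),
        })
    return items


for item in extract_items(SAMPLE):
    print("%s %.2f" % (item["link"], item["price"]))
```

The sidebar div is skipped without any extra code because the attribute predicate does the filtering, which is the main appeal over regular expressions.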

First off let's consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor, or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be sub $300[8].

The above requirements give us the following power search[9] which we will be using as the source for the script:

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

Now the first thing that made me cringe as I was rewriting this was the fact that I would basically have no choice but to load each individual product page from the results page as capacity is no longer included in either the description or the features list of each product in the results page. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSD's. The following is the full source code of the parser:

import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, etree.HTMLParser())
root = tree.getroot()

items = []

for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]
   
    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)
    for feature, value in features:
        if value is not None and feature in featureMap:
            # If it's a speed feature parse out the speed
            if featureMap[feature] in ("read", "write"):
                item[featureMap[feature]] = min(map(lambda x: int(x), speed_re.findall(value)))
            # If it's a capacity feature, parse out the capacity
            elif featureMap[feature] == "capacity":
                item[featureMap[feature]] = min(map(lambda x: int(x), capacity_re.findall(value)))
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[featureMap[feature]] = value.strip()
               
    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))
   
    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
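        # Clamp the read/write gap to 1 so identical speeds don't hit log10(0)
        gap = max(abs(item["read"] - item["write"]), 1)
        # Denominator: (log10 of the gap + 1) multiplied by the price
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log10(gap) + 1) * item["price"])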
        item["score"] = score
        items.append(item)
       
   
sorted = {}
for item in items:
    # Open addressing like in a hash table, so we don't wind
    # up with any collisions, unlikely but good practice anyway
    score = item["score"]
    while score in sorted:
        score += 1
    sorted[score] = item

sortOrder = sorted.keys()
sortOrder.sort()
sortOrder.reverse()

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for key in sortOrder:
    item = sorted[key]
    print '\t'.join(map(lambda x: str(item[x]), headers))

At this point if you've gone through and read the entire script you'll probably notice that I've made a slight change to the scoring equation, it has been changed from the following:

\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}

To the following:
\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{(\log_{10}(|\text{Read} - \text{Write}|) + 1) \times \text{Price}}

I discovered that using the raw difference in read/write speed heavily penalized drives with anything greater than a 10MB/s difference. So I figured that it would be a little more subtle to penalize drives based only on the order of magnitude of the difference.
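
As a worked example, here is the second equation as a small standalone function, fed with the figures from the results in this post. The function name score is made up for illustration, and clamping the read/write gap to at least 1 is an added assumption so that a drive with identical read and write speeds doesn't end up taking the logarithm of zero.

```python
import math


def score(read, write, capacity, price):
    """Read x Write x Capacity over (log10(|Read - Write|) + 1) x Price."""
    # Assumption: clamp the gap to 1 so equal speeds give a penalty of 1.
    gap = max(abs(read - write), 1)
    penalty = math.log10(gap) + 1
    return (read * write * capacity) / (penalty * price)


# (series, read MB/s, write MB/s, capacity GB, price USD)
drives = [
    ("RevoDrive", 540, 490, 120, 299.99),
    ("Vertex 2", 285, 275, 180, 294.99),
    ("Phoenix Pro Series", 285, 275, 120, 214.99),
]

for series, read, write, capacity, price in drives:
    print("%-20s %.1f" % (series, score(read, write, capacity, price)))
```

A 10MB/s gap only doubles the denominator and a 50MB/s gap multiplies it by roughly 2.7, so a large gap costs a drive far less than it would if the raw difference were used.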

Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?". And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.


Manufacturer:  OCZ                  OCZ                  G.Skill              OCZ
Series:        RevoDrive            Vertex 2             Phoenix Pro Series   Agility 2
Capacity:      120GB                180GB                120GB                120GB
Read:          540MB/s              285MB/s              285MB/s              285MB/s
Write:         490MB/s              275MB/s              275MB/s              275MB/s
Item:          N82E16820227578[10]  N82E16820227602[11]  N82E16820231378[12]  N82E16820227593[13]
Price:         $299.99              $294.99              $214.99              $214.99

As you can see the RevoDrive far out-scores all the rest of the SSD's considered in this analysis. The main reason is that they've essentially included two 60GB SSD's on the same card and you're expected to perform software RAID on them in your own system[14]. Despite the incredible speeds they boast I don't think I would purchase one of these to use as my OS/Program disk because compatibility is a major limitation. You must be sure that your motherboard's BIOS supports booting via PCI-Express cards. And last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM, which is a major downside as far as I'm concerned.

The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra $80.

Which leaves me with the last two disks, which are, as far as my analysis is concerned, identical. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k Random write test which seems a bit... optimistic. The OCZ makes no such claim and as far as I'm concerned both disks are more or less the same thing. So it's pretty much up to brand preference at this point.

  1. I've already sent feedback to them suggesting that they fix this. []
  2. Only if the XML parser you're using supports it, which it seems is not a whole lot of them. At least not all of them support the full specification, which is annoying since nobody really seems to document which bits and pieces they support and which they don't. []
  3. Although not necessarily XML depending on the particular doctype you've chosen, Newegg's is transitional HTML. []
  4. lxml: http://codespeak.net/lxml/ []
  5. xml.etree.ElementTree: http://docs.python.org/library/xml.etree.elementtree.html []
  6. Some of the PCI-Express SSD's are stupidly fast and more expensive except that it doesn't look like any of them support TRIM yet which is a major problem for me. []
  7. It is rare that I have a matured (read: haven't reformatted in a while) install of windows along with all of my most commonly used programs and games that exceeds 60GB so I estimate that doubling this should accommodate for any sudden urges to install really big things. []
  8. I can't really justify spending much more than $300 on a single storage device. It had better be one hell of a storage device if I ever find myself spending more than $300 on it. []
  9. This will likely need to be updated at least once a month as Newegg is constantly adding new criteria and changing things. []
  10. OCZ RevoDrive []
  11. OCZ Vertex 2 []
  12. G.SKILL Phoenix Pro Series []
  13. OCZ Agility 2 []
  14. They show up as two separate physical devices despite being located on the same card. []
4Dec/100

Automagic TV Show Calendar

A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally enter the site on the main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.

To me, this is like opening presents on christmas day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to create events on my google calendar automatically for unaired episodes of my favorite TV shows.

I've previously written python scripts that interface with gdata but I find their implementation for python to be kind of cumbersome to deal with so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. Turns out you can use the built-in JSON module in python for creating the necessary structures.
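
To give an idea of how little is involved, here is a minimal sketch of building a JSON-C entry with the json module. The details and quickAdd fields are the ones the AirDate.py listing in this post sends; the helper name build_quick_add_entry and the sample text are made up for illustration.

```python
import json


def build_quick_add_entry(text):
    """Serialize a quick-add calendar entry as a JSON-C request body."""
    entry = {"data": {"details": text, "quickAdd": True}}
    return json.dumps(entry)


body = build_quick_add_entry("2010-12-06 Castle - S03E10")
print(body)
```

Compared with assembling the equivalent XML by hand, the whole request body is a plain dictionary and a single call.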

For parsing the results I got from tvrage I ended up using python's xml.etree.ElementTree, which was simple enough to set up to retrieve only the information I was interested in for each episode.[5]
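
Here is a cut-down sketch of that parsing step which runs without touching the network. The element names follow the ones used in the AirDate.py listing in this post, but the sample feed itself is invented, and the function name unaired_episodes and its today parameter are assumptions added so the example is repeatable.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Invented sample in the shape the episode list feed is assumed to have.
SAMPLE = """
<Show>
  <name>Example Show</name>
  <Episodelist>
    <Season no="1">
      <episode><seasonnum>01</seasonnum><airdate>2010-09-20</airdate><title>Pilot</title></episode>
      <episode><seasonnum>02</seasonnum><airdate>2010-00-00</airdate><title>Unscheduled</title></episode>
    </Season>
    <Season no="2">
      <episode><seasonnum>01</seasonnum><airdate>2011-01-10</airdate><title>Return</title></episode>
    </Season>
  </Episodelist>
</Show>
"""


def unaired_episodes(xml_data, show_name, today):
    """Return 'YYYY-MM-DD show - S##E##' strings for episodes not yet aired."""
    root = ET.fromstring(xml_data)
    results = []
    for season in root.findall("Episodelist/Season"):
        season_num = int(season.get("no"))
        for episode in season.findall("episode"):
            episode_num = int(episode.find("seasonnum").text)
            parts = episode.find("airdate").text.split("-")
            year, month, day = (int(p) for p in parts)
            try:
                airdate = date(year, month, day)
            except ValueError:
                # Unknown parts of a date show up as 00; skip those.
                continue
            if airdate >= today:
                results.append("%s %s - S%02dE%02d"
                               % (airdate, show_name, season_num, episode_num))
    return results


print(unaired_episodes(SAMPLE, "Example Show", date(2010, 12, 4)))
```

Passing the current date in as an argument, rather than calling date.today() inside the loop, makes the function easy to check against a fixed sample.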

I had a bit of trouble initially with adding events to google calendar. This stemmed from the fact that google often will return an HTTP Redirect which includes a url with an appended gsession attribute which you're supposed to resubmit the exact data from the first request to. Once I figured this out it was turtles all the way down. I even managed to get the whole script multi-threaded to speed things up since it's impossible to perform batch-requests with JSON-C.

I should note that for the configuration file the calendar should be the "Calendar ID" for the calendar that can be found by looking at the settings page for the individual calendar, it is grouped with the XML and iCal feeds.

ShowList.txt:[6]

Castle  19267
House   3908
Bones   2870
Big Bang Theory, The    8511
Mentalist, The  18967
Rizzoli & Isles 24996
Venture Bros., The  6270
Top Gear    6753
Mythbusters 4605
Archer  23354
NCIS    4628
Community   22589

Config.cfg:

[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com

AirDate.py:

import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id
   
    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()
        # Parse XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)
        self.result = []
       
        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text
               
                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)
               
                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))
                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date)
                    pass

class addEvent(Thread):
    # Thread for adding events to google calendar
   
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode
   
    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
       
        # Build request including necessary headers and data
        calReq = urllib2.Request("http://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        # Execute the request
        calRes = urllib2.urlopen(calReq)
        # Get the redirect url (gsession appended)
        redirectReq = urllib2.Request(calRes.geturl(), entry, header)
        try:
            redirectRes = urllib2.urlopen(redirectReq)
        except urllib2.HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass
   
# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)
   
    # Parse JSON-C
    data = json.loads(calRes.read())
    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
           
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")
   
    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)
    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()
   
    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }
   
    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)
   
    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)
   
    threadQueue = []
    # For each episode we've retrieved that is unaired
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)
       
    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)
   
    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)
   
    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()

This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows, provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events for the exact air times of each episode, which I'm not really all that concerned about. So I still use this script to add all-day events for each episode, which makes it easier to see when there's a new episode.

I did write another script using their XML API but that will have to wait for another post.

  1. http://tvrage.com/ []
  2. http://services.tvrage.com/ []
  3. Data API Developer's Guide: The Protocol []
  4. Google's own flavor of JSON which is almost identical to plain old JSON. []
  5. I only really needed the original air date, title, season number and episode number. []
  6. You can find the show_id via the show search found on their XML API page. []
9Oct/100

Choosing an SSD (Update)

My brother is in the planning stages of building a new desktop. One of the things he's planning on doing differently from his last build[1] is using an SSD for OS + Programs.

I had mentioned to him previously that I wrote a program for helping to choose an SSD based on what SSD's are meant for and are good at doing. So he asked if I could recommend him one. Below are the results of the latest run of my script based on the most current listings[2] of SSD's newegg offers.


Manufacturer:  OCZ                 G.Skill             OCZ
Series:        Agility 2           Phoenix Pro         Vertex 2
Capacity:      120GB               120GB               120GB
Read:          285MB/s             285MB/s             285MB/s
Write:         275MB/s             275MB/s             275MB/s
Item:          N82E16820227543[3]  N82E16820231378[4]  N82E16820227551[5]
Price:         $235.99             $239.99             $240.00


It looks like OCZ has two of the top three places this run and G.Skill is still maintaining one of the top three from before. Between the three of them I think I would likely still go for the G.Skill out of personal preference, since there aren't really any significant differences between them, except price of course.

  1. Which incidentally was right when SSD's were just becoming available to the average consumer.
  2. As of this date 10/09/2010.
  3. OCZ Agility 2
  4. G.Skill Phoenix Pro
  5. OCZ Vertex 2
11Aug/100

Choosing an SSD

Before I started my new job I had an inordinate amount of free time and for a majority of that time, nothing to spend it doing[1]. I was still thinking about my desktop wishlist[2] and about choosing a better SSD than the one I had previously selected[3].

A long time ago, when I was following the HDD market because I was looking to buy some bulk storage, I wrote a PHP script which loaded newegg's productlist.xml[4] based on some search parameters you provided. The script would then parse the list and produce a list sorted by price per gigabyte, which is useful when you're in the market for capacity[5].
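The original PHP script isn't reproduced here, but the sorting idea is simple enough to sketch in Python. The listings below are made up purely for illustration, and the field names are my own assumptions rather than anything from newegg's XML.

```python
def price_per_gb(drive):
    """Price divided by capacity; lower means more storage per dollar."""
    return drive["price"] / float(drive["capacity_gb"])


# Hypothetical listings, invented for illustration only
drives = [
    {"name": "Drive A", "capacity_gb": 1000, "price": 79.99},
    {"name": "Drive B", "capacity_gb": 500, "price": 49.99},
    {"name": "Drive C", "capacity_gb": 2000, "price": 139.99},
]

# Cheapest storage first
by_value = sorted(drives, key=price_per_gb)
for drive in by_value:
    print("%s\t$%.4f per GB" % (drive["name"], price_per_gb(drive)))
```

The biggest drive isn't automatically the best value here, which is the whole reason for sorting on the ratio rather than on price or capacity alone.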

I decided to do more or less the same thing with SSD's except this time I did it in python since I'm rusty on PHP and I didn't want to mess with setting up a web server to test on. So I got started by doing a power search on newegg for the specific flavor of SSD I was looking for.

The search parameters are as follows:

  • 2.5" Form Factor
  • SATA II/III
  • 120GB or Greater
  • Less than $300
  • Retail or OEM
  • Support TRIM Command

As of this writing those particular search parameters narrow the results to 17 SSD's. Now comes the code. Before I started coding I needed some way to sort them according to what I thought was important. The metric is as follows:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{|\text{Read} - \text{Write}| \times \text{Price}}$$

After looking closer at the scores this produces I noticed that it heavily penalizes drives with huge differences between read and write speeds, which effectively weeds out drives that still have acceptable read/write speeds. So I removed that term from the metric, producing:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}$$

The basic idea behind this scoring measure is that sequential read and write speeds are important, as well as capacity, while price counts against a drive. The difference between sequential read and write speeds was originally considered bad as well[6]. In the equation Read and Write refer to sequential read and write speeds. The ratio produces a single score for an SSD's overall balance of capacity, read/write speeds and price.
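To make the arithmetic concrete, here is a minimal sketch of both versions of the metric in Python. The function names are my own, and the three drives use the read, write, capacity and price figures from the top-three table further down this post.

```python
def score(read, write, capacity, price):
    """Final metric: read * write * capacity / price."""
    return (read * write * capacity) / float(price)


def score_with_penalty(read, write, capacity, price):
    """First version of the metric, which also divides by |read - write|.

    Note that this raises ZeroDivisionError for a drive whose read and
    write speeds are identical.
    """
    return (read * write * capacity) / (abs(read - write) * float(price))


# (read MB/s, write MB/s, capacity GB, price $)
drives = {
    "A-DATA S599": (280, 270, 128, 295.99),
    "Patriot Inferno": (285, 275, 120, 289.99),
    "G.Skill Phoenix": (285, 275, 120, 299.00),
}

# Highest score first
ranked = sorted(drives, key=lambda name: score(*drives[name]), reverse=True)
for name in ranked:
    print("%-16s %8.1f" % (name, score(*drives[name])))
```

All three of these drives have the same 10MB/s gap between read and write, so the penalty term would not change their order relative to each other; it only matters for drives with a much larger gap.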

The code is relatively simple in purpose. Load the data and parse it into a dictionary then sort based on the metric above.

import urllib2, re

# url = "
# http://www.newegg.com/Product/ProductList.aspx?Submit=Property&Subcatego
# ry=636&Description=&Type=&N=100008120&IsNodeId=1&srchInDesc=&MinPrice=&M
# axPrice=&OEMMark=1&OEMMark=0&PropertyCodeValue=4213:30854&PropertyCodeVa
# lue=4214:30848&PropertyCodeValue=4214:39416&PropertyCodeValue=4214:30849
# &PropertyCodeValue=4214:39415&PropertyCodeValue=4215:55552&PropertyCodeV
# alue=4215:41071&PropertyCodeValue=4215:46319"

# data = open("temp.html", "w")
# data.write(urllib2.urlopen(url).read())
# data.close()
raw = open("temp.html").read()

item_re = re.compile(r'<div class="itemCell".*?>(.*?)<br class="clear".*?</div>')
feature_re = re.compile(r"<li>&nbsp;(.*?)</li>")
feature_list_re = re.compile(r'<b>(.*?)\s?\#?\s?:\s?</b>\s?(.*?)</li>')
speed_re = re.compile(r"(up to )?(\d+).*?MB/s")
capacity_re = re.compile(r"(\d+)GB")
price_re = re.compile(r"</span>\$<strong>(\d+)</strong><sup>.(\d+)</sup>")

item_list = []
valid = ['Read', 'Item', 'Interface', 'Capacity', 'Model', 'Write', 'Size']

for item in item_re.findall(raw):
    current = {}
    no_label = []
    features = feature_re.findall(item)
    current["Size"] = features[0]
    current["Capacity"] = features[1]
    current["Interface"] = features[2]
   
    for feature in feature_list_re.findall(item):
        if feature[1].find("\r") != -1:
            current[feature[0]] = feature[1].split("\r")[0]
        else:
            current[feature[0]] = feature[1]
    current["Read"] = int(speed_re.findall(current["Sequential Access - Read"])[0][1])
    current["Write"] = int(speed_re.findall(current["Sequential Access - Write"])[0][1])
    current["Capacity"] = int(capacity_re.findall(current["Capacity"])[0])
    # Iterate over a copy of the keys since we delete entries as we go
    for feature in list(current.keys()):
        if feature not in valid:
            del current[feature]
    current["Price"] = float('.'.join(price_re.findall(item)[0]))
    current["Item"] = "http://www.newegg.com/Product/Product.aspx?Item=%s" % (current["Item"])
    item_list.append(current)
   
def score(item):
    # Metric from above: read * write * capacity / price
    return (item["Read"] * item["Write"] * item["Capacity"]) / item["Price"]

# Sort the items themselves (highest score first) so that two drives
# with the same score don't overwrite each other
item_list.sort(key=score, reverse=True)
for item in item_list:
    #print('\t'.join(map(str, item.keys())))
    print('\t'.join(map(str, item.values())))

Now given that there is quite a lot of data to present and analyze all at once I've decided it would be easiest to just provide you with a pretty graph[7]:


If you look closely at the scores of all the disks in the query, you'll notice that there is a noticeable gap between the top 3 and the rest. They are as follows:

Manufacturer:  A-DATA              Patriot             G.Skill
Series:        S599                Inferno             Phoenix Series
Capacity:      128GB               120GB               120GB
Read:          280MB/s             285MB/s             285MB/s
Write:         270MB/s             275MB/s             275MB/s
Item:          N82E16820211471[8]  N82E16820220510[9]  N82E16820231372[10]
Price:         $295.99             $289.99             $299.00


I noticed that if you ignore capacity in the metric then the Patriot Inferno is the clear winner here. So as it turns out the Western Digital SiliconEdge I had selected when I first wrote the wishlist wasn't the best drive for my needs, though I've always had a soft spot for Western Digital. Now I'm convinced that the Patriot Inferno is the SSD I'll be getting, unless by the time I get around to buying one there are better options[11].

  1. Nothing worthwhile anyway.
  2. See previous post: Wishlist.
  3. Western Digital SiliconEdge 128GB SSD
  4. Which no longer exists in its original form.
  5. Which I was.
  6. Although we're excluding read/write speed difference.
  7. Scores have been normalized to 100%.
  8. A-Data S599
  9. Patriot Inferno
  10. G.Skill Phoenix Series
  11. Which there probably will be.
28Jul/100

Matplotlib and Live Data: A Tale of Two Technologies

Being unemployed over the summer is never usually a good thing for me. I get bored very easily if I don't have something to occupy myself with. This last bout of boredom led me to unpack some of my electronics. Dusted off my multimeter, Arduino and a digital thermometer I bought a little while ago. Figured I could use these to solve one of my current problems.

Living in Laramie usually subjects people to harsh winters, which leaves most housing developments without central air conditioning installed since, well, it's never really needed except maybe one or two days over the summer when it gets above 85 °F. This summer has apparently been hotter than previous summers and it's left my condo in an "uncomfortable state". Mind you, I'm used to living in hot weather so this isn't such a terrible thing to me.

What I'm not used to is not having AC, and it cooling off enough at night that it's worthwhile to open a few windows and stick a fan in one of them. Which leaves me with this problem: when is the optimal time to open the windows and turn on the fan to get my condo cooled off earliest/fastest?

In comes my Arduino + digital thermometer[1]. Once I rigged up the proper power/data connections on a breadboard for my Arduino I set out to find code for the thermometer. I've set up the thermometer with a sketch on my Arduino before; I just didn't feel like wasting a few hours trying to do it from scratch again. Soon enough I found some code[2] that worked perfectly. So I trimmed out some code I didn't need for the project and set it up to just write the temperature as fast as possible[3] to the serial port it's connected to.

After that I wrote a logging program on my desktop in Python to record temperatures sent via serial to my desktop. The program is incredibly simple and uses the pySerial library[4] to read temperatures from the serial port of my desktop and append them to a temperature log. I used a simple windows command to do this since it wouldn't lock the file so I could read data from it simultaneously. There are still occasionally collisions with the processing program locking the file and the logger not being able to write the data to the file but these are rare enough that it's negligible in my situation.

import serial, os

ser = serial.Serial(2)
while True:
    os.system("echo %s>>out.txt" % (ser.readline().strip()))

The next step in this project was visualizing the data. I've used matplotlib[5] before and I was thinking this time I would like to see if I could write the program to update data live as it receives it. My first foray into this goal was a miserable disaster. Most of the solutions I could find involved just setting up an infinite loop with a short time delay in it. That works great except that it sleeps the thread running the plot, which makes it impossible to resize the plot or do anything at all with the GUI for that matter. So obviously this wouldn't work at all.

After poking around for different solutions to this and crashing my computer once from spawning an infinite number of instances of the plot I gave up for a bit, only to discover that there was an example in the documentation which wasn't obviously named. I quickly discovered the best way to do this. I even added some pretty annotations and such.

import gobject
import matplotlib
matplotlib.use('GTKAgg')

import matplotlib.pyplot as plt

current_pos = 0
temps = []
pad = 5.0

f = plt.figure()

def update(vars):
    # Unpack variables that need to be persistent between
    # executions of this method.
    temps = vars[0]
    current_pos = vars[1]
    pad = vars[2]
   
    # Open the data file and get any new data points since
    # the last time we read from this file
    data = open("out.txt", "r")
    data.seek(current_pos)
    new_temps = map(lambda x:
        float(x) * (1 + 4.0/5.0) + 32.0,
        data.read().split("\n")[:-1])
    current_pos = data.tell()
    data.close()
   
    # If we got new data then append it to the list of
    # temperatures and trim to 750 points
    if len(new_temps) > 0:
        temps.extend(new_temps)
        temps = temps[-750:]
   
    f.clear()
    f.suptitle("Live Temperature")
    a = f.add_subplot(111)
    a.grid(True)
    l, = a.plot(temps)
    plt.xlabel("Time (Seconds)")
    plt.ylabel(r'Temperature $^{\circ}$F')
   
    # Get the minimum and maximum temperatures these are
    # used for annotations and scaling the plot of data
    min_t = min(temps)
    max_t = max(temps)
   
    # Add annotations for minimum and maximum temperatures
    a.annotate(r'Min: %0.2f$^{\circ}$F' % (min_t),
        xy=(temps.index(min_t), min_t),
        xycoords='data', xytext=(20, -20),
        textcoords='offset points',
        bbox=dict(boxstyle="round", fc="0.8"),
        arrowprops=dict(arrowstyle="->",
        shrinkA=0, shrinkB=1,
        connectionstyle="angle,angleA=0,angleB=90,rad=10"))

    a.annotate(r'Max: %0.2f$^{\circ}$F' % (max_t),
        xy=(temps.index(max_t), max_t),
        xycoords='data', xytext=(20, 20),
        textcoords='offset points',
        bbox=dict(boxstyle="round", fc="0.8"),
        arrowprops=dict(arrowstyle="->",
        shrinkA=0, shrinkB=1,
        connectionstyle="angle,angleA=0,angleB=90,rad=10"))
   
    # Set the axis limits to make the data more readable
    a.axis([0,len(temps), min_t - pad,max_t + pad])
   
    f.canvas.draw_idle()
   
    # Repack variables that need to be persistent between
    # executions of this method. Update the shared dictionary in
    # place; rebinding the local name would discard the changes.
    vars[0] = temps
    vars[1] = current_pos
   
    return True

vars = {0: temps, 1: current_pos, 2: pad}

# Execute update method every 500ms
gobject.timeout_add(500, update, vars)

# Display the plot
plt.show()

This code generates a plot which updates every 500ms. This is based on an example in the matplotlib examples[6]. An example of the program's output can be seen below.
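The part of update that matters most for live data is reading only what has been appended to the log since the last call. Here is a minimal, standalone sketch of that step. The function name read_new_lines and the temporary file are my own for illustration; the idea of remembering the file position between calls is the same one used with seek and tell above.

```python
import os
import tempfile


def read_new_lines(path, position):
    """Return (lines, new_position) for data appended since position."""
    # Binary mode keeps the byte offsets exact on every platform
    with open(path, "rb") as data:
        data.seek(position)
        chunk = data.read()
    # Only consume up to the last newline so that a half-written
    # reading is left in the file for the next call
    end = chunk.rfind(b"\n") + 1
    if end == 0:
        return [], position
    lines = [line.strip() for line in chunk[:end].decode("ascii").split("\n")[:-1]]
    return lines, position + end


# Small demonstration using a temporary log file
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "ab") as log:
    log.write(b"21.5\r\n22.0\r\n")
readings, pos = read_new_lines(path, 0)
print(readings)

with open(path, "ab") as log:
    log.write(b"22.5\r\n")
readings, pos = read_new_lines(path, pos)
print(readings)
```

The caller has to hold on to the returned position and pass it back in on the next call, otherwise every call starts from the beginning of the file again.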

I imagine that I could have made this simpler by not using the GTK libraries which are a pain to install since there are 3 or 4 modules you have to install in order to make all this work including the GTK+ runtime. I may come back later and post a version written using TK since it can be used without installing extra modules and stuff.

  1. DS18S20 Digital Thermometer Datasheet
  2. Temperature Measurement using the Dallas DS18B20 by Peter H. Anderson
  3. Somewhere in the range of 750ms between readings since it is in parasite mode, may change this later to run in non-parasite mode.
  4. pySerial Python Library
  5. matplotlib Python Library
  6. Animation example code: simple_anim_gtk.py
22Jul/100

Wishlist

I've noticed recently that I tend to spend a lot of time shopping for things I can't afford when I don't have any excess income. I can't really tell if it's just because I'm bored a lot more often over the summer, especially this one since I've been unemployed for the majority of it so far[1].

As it stands there is a rather long list of things I intend on buying/upgrading/replacing in the future. First and foremost on this list is a new laptop since my current ASUS Eee-PC 1000H is driving me nuts. It's useful for... writing, and not even that sometimes. For the last year I've used it almost strictly for taking notes in class, which it does well enough. But using it for anything else is essentially impossible. I've found this to be even more true in the last few weeks since I've been spending every other weekend with my parents on their ranch or in Cheyenne. I've just been using it when I went since it's pretty impractical to take my desktop with me every time. A lot of the stuff I work on also needs a decent amount of bandwidth, and my parents' internet connection is satellite based (on their ranch at least), so it would be pointless to try and get any real work done.

I've essentially decided that my next laptop will be a 13" Macbook Pro. The main reason is that for the amount of money I intend on spending on a new laptop the Macbook Pro is far superior in both build quality and components to the equivalent Dell, which is the manufacturer I've used for all my mobile computing needs until my netbook. Easy decision, don't you think?

The next item on my list was building a new desktop. I only really need to replace the core components of my desktop since everything else is more or less in good working order. But that's boring, so I've made an entire list of components including core and secondary components to build a new desktop, excluding optical drive and hard drives[2].

The first part I always start with when building a wishlist[3] is the CPU and for this particular one it was a pretty simple choice. Intel's Core i7 series is pretty much the way to go when building a workstation. In this case there were only really two requirements I had for selecting the particular Core i7 I need for this build.

  • $0 < Price < $500
  • Supports Triple Channel DDR3

These requirements narrow down the selection to two processors: the Core i7-920 and the Core i7-930. There are only two differences. The 930 runs at 2.8GHz versus 2.66GHz for the 920, and the 930 is $10 cheaper than the 920, so it's pretty obvious which one to go with.

Intel Core i7-930 Bloomfield 2.8Ghz LGA 1366 130W Quad-Core
Model #: BX80601930
Item #: N82E16819115225
Price: $289.99

The second component I select after the CPU is the motherboard. Now this is where it gets tricky because the restrictions I use for selecting a motherboard have a lot less to do with technical capabilities than they do with reliability and proper functionality. This is where newegg becomes the right place to shop. Their product review system is by far the best in the online tech shopping world. I tend to score motherboards based on the number of reviews they receive and the score of the review. This is of course after I've removed motherboards incompatible with the other components I intend on using in the system.

  • LGA 1366
  • Intel X58
  • Intel ICH10R
  • ATX form factor

The motherboard that comes out on top after these restrictions is an EVGA board.

EVGA E758-TR Intel X58

http://www.newegg.com/Product/Product.aspx?Item=N82E16813188046

Model #: 132-BL-E758-TR
Item #: N82E16813188046
Price: $269.99 IMIR -$40.00: $229.99

Next up is system memory. This almost always follows from motherboard since some motherboards support odd RAM speeds and I tend to stick with standard speeds since they are a lot less prone to compatibility issues and just work. In this case the motherboard calls for DDR3 1333[4]. I usually filter out for the specific kind of RAM I want which leaves me with a few dozen sets. Then I score based on CAS latency and price. I've used G.SKILL before and was pleased with it and in this case a G.SKILL set won on both price and CAS latency.

G.SKILL 3GB (3 x 1GB) DDR3 1333 (PC10666) Triple Channel
Model #: F3-10666CL7T-3GBPK
Item #: N82E16820231229
Price: $84.99

Now you're probably thinking "Why does he only want 3GB? That's puny!" There are some very good reasons for it. First of all I'm not really all that into 64-bit yet; I still have a few devices without 64-bit compatible drivers[5]. For the most part 2GB of RAM has suited me just fine for nearly anything I've ever needed/wanted to do on my desktop until this point, so why should I pile in twice or even three times that amount? Besides, if I so desire I could just purchase a second set in the future. The only reason I might consider doing that is if I suddenly became obsessed with running a dozen virtual machines simultaneously[6].

Next in line isn't exactly a component I need to buy, but I've been wanting to upgrade for a long time now and I figure a wishlist is the best place to do it. Ever since I saw an article on Gizmodo[7] about the new NVIDIA GeForce GTX 460 I've pretty much been set on that specific chipset. It was pretty easy to select a brand since they're all exactly the same price at this point and I only want the 768MB model. I've been wanting to do some CUDA development so here's my chance.

EVGA NVIDIA GeForce GTX 460 (Fermi) 768MB
Model #: 768-P3-1360-TR
Item #: N82E16814130562
Price: $199.99

Another component I consider to be core but isn't necessarily a core component is the storage device used for the OS. In the past I've strictly used HDD's for my desktop. But since I installed a Patriot 32GB SSD in my laptop I've fallen in love with SSD's for the OS/Programs drive. You might hear people moan and complain about SSD's being disproportionately priced based on their capacity. Well I've got news for you, you don't buy SSD's for capacity, you buy them for speed. Anyone reasonably knowledgeable about computer components and their functionality would know that. I'm not interested in price per GB as quite a lot of people might be, at least not for SSD's[8]. I'm a lot more interested in price per MB/s sequential read-write. The particular disk that won in this case is one of the new Western Digital SSD's.

Western Digital SiliconEdge Blue 128GB SSD MLC 220/170 SR/SW
Model #: SSC-D0128SC-2100
Item #: N82E16820250002
Price: $199.99

In case the title of that product doesn't make sense the drive is 220MB/s Sequential Read and 170MB/s Sequential Write.

Everything from this point on I consider to be secondary components as they don't directly do any computation or data transfer/storage.

For this build, even though I don't need to get a new case, I've added one to the list anyway since my current case is due for an upgrade, especially in regards to aesthetics. I've been ogling this particular case for quite a while now, since it replaced its predecessor at least. This case won by a long shot in aesthetics and functionality.

Antec P183 Black Aluminum/Steel ATX-Mid
Model #: P183
Item #: N82E16811129061
Price: $179.99 IMIR -$25.00: $154.99

The next item due for an upgrade was actually necessary considering the major increase in power needs for the core components. I've always hated shopping for power supplies because there are far more factors to consider when it comes to selecting one that matches your needs and is of reasonable build quality. If you don't have a decent power supply you may as well just give up. In this case I stuck as close as possible to the power supply I have now. I was only really interested in making sure that there were enough PCI-e power connectors since my current PSU has none. I let the reviews do the majority of selecting for me in this case.

Corsair 650W (ATX|EPS)12V
Model #: CMPSU-650TX
Item #: N82E16817139005
Price: $119.99 IMIR -$30.00: $89.99

This power supply matched most closely to the one I have now: it's simple, doesn't have too many "certifications" and marketing nonsense tacked onto the name, and the cables are sheathed in black mesh[9].

Now I don't normally bother with purchasing a 3rd party heatsink/cooling system for my CPU but in this case I had heard mention of a self-contained water cooling system with radiator, pump and CPU waterblock from Corsair. So I checked it out and I am impressed. Since it is self-contained it removes a lot of the frustration with reservoirs and replacing the coolant on a regular basis.

Corsair H50 CPU Cooler
Model #: CWCH50-1
Item #: N82E16835181010
Price: $74.99

The last item in the list is more for interior neatness and organization. I've always hated just leaving components without a proper fastening inside the case. In this situation the SSD I selected[10] is 2.5" form factor, suitable for notebooks and less suitable for desktops. So I looked around for a set of 2.5" to 3.5" brackets to secure the drive in one of the HDD bays.

iStarUSA 2.5" to 3.5" HDD Bracket
Model #: DIY-RP-HDD2.5
Item #: N82E16816215157
Price: $5.99

The subtotal for the build, excluding shipping and including all instant mail-in rebates, comes out to $1330.91. Pretty good, wouldn't you say? That's for a decently beefy workstation that would likely last me another 5-6 years before upgrading again. I'm currently on the 5th year since a major overhaul of my current system, an Intel Core 2 Duo based rig.
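For anyone who wants to check the math, the subtotal can be reproduced with a few lines of Python. The prices are the after-rebate figures listed for each part above; the short item labels are just my own shorthand.

```python
# (item, price after rebates) as listed in this post
parts = [
    ("Intel Core i7-930", 289.99),
    ("EVGA E758-TR motherboard", 229.99),
    ("G.SKILL 3GB DDR3 1333", 84.99),
    ("EVGA GeForce GTX 460 768MB", 199.99),
    ("WD SiliconEdge Blue 128GB", 199.99),
    ("Antec P183 case", 154.99),
    ("Corsair 650W PSU", 89.99),
    ("Corsair H50 cooler", 74.99),
    ("iStarUSA bracket", 5.99),
]

# Sum in whole cents to avoid floating point rounding surprises
subtotal_cents = sum(int(round(price * 100)) for _, price in parts)
print("Subtotal: $%d.%02d" % divmod(subtotal_cents, 100))
```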

  1. I do have an interview coming up so wish me luck.
  2. Again, excluding SSD from this list of parts I don't intend to buy.
  3. Always on Newegg.com, they're the de facto standard in online computer components.
  4. DDR3 SDRAM PC10666
  5. And will probably never be compatible for that matter.
  6. Which I won't, so I won't.
  7. I think.
  8. The only component I consider price per GB on is standard HDD's.
  9. Which lends nicely to aesthetics should I ever decide to show someone my desktop's innards.
  10. Like all SSD's.
27Jun/100

WriteMonkey and Markdown

Recently Download Squad had a post[1] about a practical way to get features and support for open-source programs, specifically through donations. The post was about a program called WriteMonkey which is a minimalistic writing program that the author had originally written about previously[2]. Think of the best code editing program you know of, mine is Notepad++[3]. Now take that program and refactor it specifically for writing articles or blog posts, you've just created WriteMonkey[4].

Something that interested me about WriteMonkey was the Download Squad author's post specifically mentioned writing posts using Markdown[5] syntax. Markdown is a simple plain-text syntax which is parsed into html removing the need to tediously enter html[6] as you write. At first glance it didn't really seem like it would really help all that much when it came to writing blog posts. But I was completely wrong and am better off for it. Now the especially useful part is that WriteMonkey supports this completely as well as having a very useful shortcut for parsing and copying html straight from Markdown source. This is incredibly useful since I can then just go to my website and paste the resulting html into a blog post and hit save and be done with it.

As I looked through the program I realized, this is much much more than just a Markdown IDE. It includes all sorts of useful features like a "progress bar" which tells you how far along you are in a certain quota you specify in the preferences. This led me to write a little bit of SQL[7] to calculate the average word-count of posts in my blog. Excluding the outliers it came out to ~350 words per post. So I just set the quota to 350 words and it displays a bar at the top or bottom of the screen depending on what you choose showing your current progress on the quota.
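I didn't keep the SQL I used, but the calculation itself is easy to sketch. The example below uses Python's sqlite3 module with an in-memory database; the posts table, its content column and the sample rows are all hypothetical stand-ins rather than my blog's actual schema, and it makes no attempt to exclude outliers.

```python
import sqlite3

# Hypothetical schema and sample posts, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO posts (content) VALUES (?)",
    [("one two three",), ("four five",), ("six seven eight nine",)],
)


def average_word_count(conn):
    """Average number of whitespace-separated words per post."""
    rows = conn.execute("SELECT content FROM posts")
    counts = [len(content.split()) for (content,) in rows]
    return sum(counts) / float(len(counts))


print(average_word_count(conn))
```

The word counting is done in Python rather than in the query itself, since counting words in pure SQL tends to depend on which database you're using.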

It also does several other useful things like displaying current battery life as a percentage in the progress bar and showing the file you're writing in. There's also a feature called repository/main. This allows you to store text clippings in repository and then write the blog post in main. When exported as html the repository is ignored and only main is copied. That makes it useful for writing notes in the middle of authoring a post and keeping them with everything you write, and it's easy enough to switch between the two. For this post I just made a list of points I wanted to cover.

After using WriteMonkey for an hour or so I think I've found the new environment I'll be writing all my posts in for the foreseeable future.

  1. Download Squad: Amazing software tip: Pay free software developers to get stuff fixed!
  2. Download Squad: WriteMonkey is an unbelievable full-screen text editor
  3. Notepad++
  4. WriteMonkey
  5. Markdown
  6. HTML: HyperText Markup Language, is the predominant markup language for web pages.
  7. SQL: Structured Query Language