A Little Off Code, Computers, Photography and Guns

31May/110

System Memory Analysis

Choosing the best RAM for your system can be difficult, as there are a lot of things to consider. Comparing by hand can net you some pretty decent results, but picking the best price per capacity per ... gets fairly complicated fast.

You may remember my SSD analysis script[1] from a while ago, which scraped HTML from Newegg and calculated a score for each product in order to choose the best one. I've also recently discovered that Newegg does indeed have an API[2] that greatly simplifies this whole process[3].

Once I had explored Newegg's API enough to get the data I needed, I set to work updating the SSD script and writing a few others for HDDs and system memory. Of the scripts I wrote, the one for system memory turned out to be particularly useful, as it made finding great deals very easy. It also illustrated that popular brands may not always be the best deal.

The first major improvement over the previous scripts was the use of threading to make multiple API requests in parallel, which sped things up quite a bit. While Python's threading library doesn't allow for parallelism on the CPU[4], it does for I/O such as network requests, since threads waiting on a response don't block each other. Below is the class used for grabbing URLs throughout the script.

import threading
import urllib, urllib2
import json, re
from Queue import Queue

class GetURL(threading.Thread):
    def __init__(self, urlQueue, jsonQueue):
        threading.Thread.__init__(self)
        self.urlQueue = urlQueue
        self.jsonQueue = jsonQueue
   
    def run(self):
        while True:
            itemNumber, url = self.urlQueue.get()
            raw = urllib2.urlopen(url).read()
            # Use the queue handed to this thread rather than relying on a global name.
            self.jsonQueue.put((itemNumber, json.loads(raw)))
            self.urlQueue.task_done()

Newegg's API paginates its results, since the Android app displays them directly to the user, which means there's no easy way to retrieve all results in one request. Instead you must make successive calls, incrementing the page number until all results for the query have been retrieved.

itemSpecURL = "http://www.ows.newegg.com/Products.egg/{}/Specification"
searchURL = "http://www.ows.newegg.com/Search.egg/Advanced"

itemList = getItems()

urlQueue = Queue()
jsonQueue = Queue()
items = {}
for item in itemList:
    specURL = itemSpecURL.format(item["ItemNumber"])
    urlQueue.put((item["ItemNumber"], specURL))
    items[item["ItemNumber"]] = item
   
for worker in xrange(2):
    t = GetURL(urlQueue, jsonQueue)
    t.setDaemon(True)
    t.start()

urlQueue.join()

This basic setup is fairly generic and can be used to analyze just about any product from Newegg. Anything beyond this point, however, is specific to the type of product you're analyzing. The code above grabs each item's basic data, including price, as well as its detailed specifications. I should also note that the parameters passed to the API in getItems are generated using the query builder available in the post about Newegg's API.
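The getItems function itself isn't shown in this post. As a rough idea of what the paging loop could look like, here is a minimal sketch written for Python 3 rather than the Python 2 used above. The names get_items and post_json and the injectable post argument are my own assumptions for illustration, not the original code; only the URL and the PaginationInfo and ProductListItems fields come from the API post.

```python
import json
import urllib.request

SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def post_json(url, payload):
    """POST payload as JSON and decode the JSON reply (performs a network call)."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=body)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def get_items(query, post=post_json):
    """Collect ProductListItems from every page of a search query.

    `post` is injectable (a hypothetical convenience) so the loop can be
    exercised without touching the network.
    """
    items = []
    page = 1
    while True:
        result = post(SEARCH_URL, dict(query, PageNumber=page))
        batch = result.get("ProductListItems") or []
        items.extend(batch)
        total = result["PaginationInfo"]["TotalCount"]
        # Stop on an empty page too, so a wrong TotalCount cannot loop forever.
        if not batch or len(items) >= total:
            return items
        page += 1
```

Passing a stub in place of post_json makes it easy to check the loop against canned pages before pointing it at the real service.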

speed_re = re.compile(r'DDR\d\s(\d+).*')
capacity_re = re.compile(r"(\d+)GB\s\((\d+)\sx\s(\d+)GB\)")
timing_re = re.compile(r'(\d+-\d+-\d+-\d+)')
features = ['Brand', 'Model', 'ItemNumber', 'Price', 'Speed', 'Capacity', 'Dimms', 'Timing', 'Voltage']

while not jsonQueue.empty():
    itemNumber, specs = jsonQueue.get()
   
    item = {}
    for group in specs['SpecificationGroupList']:
        for pair in group['SpecificationPairList']:
            if pair['Key'] in features:
                item[pair['Key']] = pair['Value'].encode('ascii', errors='ignore')
   
    if 'Capacity' in item:
        capacity = capacity_re.match(item['Capacity'])
        if capacity:
            item['Capacity'] = capacity.group(1)
            item['Dimms'] = capacity.group(2)
    if 'Speed' in item:
        speed = speed_re.match(item['Speed'])
        if speed:
            item['Speed'] = speed.group(1)
    if 'Timing' in item:
        timing = timing_re.match(item['Timing'])
        if timing:
            item['Timing'] = timing.group(1).replace('-','\t')
        else:
            continue
    item['Price'] = items[itemNumber]['FinalPrice']
    item['ItemNumber'] = specs['NeweggItemNumber']
   
    try:
        print '\t'.join(map(lambda x: item[x], features))
    except KeyError:
        pass
    jsonQueue.task_done()

The basic purpose of the above code is to go through each item and format each feature into usable data[5]. Once the data has been formatted and printed, I continue the rest of the filtering and analysis in Microsoft Excel.

The equation used to calculate a score for each set of system memory is as follows:

\[ \text{Score} = \frac{(\text{Capacity}\times1024^3)\times\text{Speed}}{\text{Price}\times CL\times T_{RCD} \times T_{RP} \times T_{RAS}} \]
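For anyone who would rather skip the spreadsheet, the same score can be computed directly. This is a minimal Python 3 sketch; the function names memory_score and parse_timing are made up for illustration, and I'm assuming the timing string is ordered CL, tRCD, tRP, tRAS, as is conventional. The sample values are taken from the Crucial Ballistix listing shown in the Newegg's JSON API post.

```python
def parse_timing(text):
    """Split a timing string such as "9-10-9-24" into four integers."""
    return tuple(int(part) for part in text.split("-"))


def memory_score(capacity_gb, speed, price, timings):
    """Score one memory kit using the formula above; higher is better."""
    cl, t_rcd, t_rp, t_ras = timings
    return (capacity_gb * 1024 ** 3 * speed) / (price * cl * t_rcd * t_rp * t_ras)


# 4GB kit, DDR3 2133, $104.99, timings 9-10-9-24
kit_score = memory_score(4, 2133, 104.99, parse_timing("9-10-9-24"))
print(kit_score)
```

The absolute number means little on its own; it is only useful for ranking kits against each other.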

Currently it looks like G.Skill has the best to offer in the DDR3 memory market if you're looking for a quad-channel set for Sandy Bridge's enthusiast hardware due out in the next quarter[6].

  1. Choosing an SSD (A more different S)
  2. Newegg's JSON API
  3. Even though it was never intended for what I'm using it for.
  4. Python is crippled in this way due to a global interpreter lock.
  5. Or at least the data that we're interested in using for analysis.
  6. G.SKILL Ripjaws X Series 16GB (4 x 4GB) 1333MHz
16Mar/1116

Newegg’s JSON API

For the longest time I've wanted access to Newegg's product list. For me they've been one of the better and more structured websites for buying computer hardware. So naturally they're usually my first choice when it comes to finding a good deal on a particular piece of hardware. They're also rather useful for seeing what's out there since their product catalog is fairly complete.

A while back I started wanting to sort through items to heuristically pick the best deal based on a number of features Newegg generally provides for each item. This method works pretty well on SSDs and system memory. But until a recent discovery I was limited to scraping Newegg's website to get any kind of information from them. If you've ever tried this sort of thing you know that it is messy and generally a bad idea: every time Newegg changes the structure of their website, or any minute detail of it, your scraping script will almost always break.

The discovery came in the form of a mobile application for Android[1]. The mobile app lets you browse their website in a clean and fast manner. But what got me thinking is that unlike some other mobile applications out there that are just application wrappers for the mobile version of their websites this one operates directly through the native GUI. Now this is where it got interesting. I knew that if Newegg had written the app to use the native GUI then they had to be providing the data to it somehow and I knew it had to be more structured than HTML scraping like what I've been doing[2]. You have no idea how happy I was to discover that I was right.

The first thing I did was connect my Droid 2 Global to my home network via WiFi in order to sniff some of the traffic going to and from the mobile app. This was accomplished by mounting a CIFS share from my Windows 7 desktop on my router, which runs Tomato-based firmware. The share had a binary for tcpdump, which I then used to sniff for traffic originating from or going to my phone's IP address. After setting this up and performing all of the basic operations I would need in order to "reverse engineer" the data source, I got to work on filtering the important bits.

In Wireshark I immediately discovered that they had a sub-domain they were using for these operations. All of the web requests that weren't for images or for customer metrics and tracking went to this host:

http://www.ows.newegg.com/

Because this API is structured more or less the same as navigating their site, but with different identifiers, I decided to start by writing a query builder. Its purpose was to let me browse to the particular category I was interested in analyzing and filter it down to just a few simple requirements to simplify the analysis.

The first major entry point in the process of browsing to what you're interested in pulling is:

http://www.ows.newegg.com/Stores.egg/Menus

This takes no parameters and provides the main menu:

[
    {
        "StoreDepa": "ComputerHardware",
        "StoreID": 1,
        "ShowSeeAllDeals": true,
        "Title": "Computer Hardware"
    },
    {
        "StoreDepa": "PCNotebook",
        "StoreID": 3,
        "ShowSeeAllDeals": true,
        "Title": "PCs & Laptops"
    },
    {
        "StoreDepa": "Electronics",
        "StoreID": 10,
        "ShowSeeAllDeals": true,
        "Title": "Electronics"
    },
    ...

Once you've selected a store to browse, the next URI is:

http://www.ows.newegg.com/Stores.egg/Categories/{StoreID}

The only parameter it takes is StoreID, which you'll find in the first query. This will return all of the categories within a store. I haven't really explored this very much as I'm only really interested in browsing system memory and SSDs. Using the Computer Hardware store, the output is as follows:

[
    {
        "Description": "Backup Devices & Media",
        "StoreID": 1,
        "NodeId": 6642,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 2
    },
    {
        "Description": "Barebone / Mini Computers",
        "StoreID": 1,
        "NodeId": 6668,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 3
    },
    {
        "Description": "CD / DVD Burners & Media",
        "StoreID": 1,
        "NodeId": 6646,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 10
    },
    ...

StoreID is carried over from the parameters of the request. I'm not exactly sure how to describe the purpose of NodeId, but it appears to be a distinguishing feature of a category or subcategory. CategoryID is used for filtering results down to a specific category and can be either a root category or a subcategory. CategoryType determines whether CategoryID is a root category or whether it contains subcategories. A value of 1 for CategoryType indicates that it is a root category, one with no subcategories beneath it.

Now depending on CategoryType you either move straight to the search query or onto a navigation query. The navigation query is used if there are subcategories:

http://www.ows.newegg.com/Stores.egg/Navigation/{StoreID}/{CategoryID}/{NodeID}

This query takes StoreID, CategoryID and NodeID, which you can get from the category listing of a particular store. It will return a subcategory list. Below is the subcategory listing for the memory category.

[
    {
        "Description": "Desktop Memory",
        "StoreID": 1,
        "NodeId": 7611,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 147
    },
    {
        "Description": "Flash Memory",
        "StoreID": 1,
        "NodeId": 8038,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 68
    },
    {
        "Description": "Laptop Memory",
        "StoreID": 1,
        "NodeId": 7609,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 381
    },
    ...
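The choice between the navigation query and the search query described above can be sketched in a few lines. This is a minimal Python 3 illustration; next_step is a hypothetical helper name, and it assumes each category entry carries the StoreID, CategoryID, NodeId and CategoryType fields shown in the listings.

```python
BASE = "http://www.ows.newegg.com"


def next_step(category):
    """Return ("search", url) or ("navigation", url) for one category entry."""
    if category["CategoryType"] == 1:
        # No subcategories: go straight to the search query.
        return ("search", BASE + "/Search.egg/Advanced")
    url = "{}/Stores.egg/Navigation/{}/{}/{}".format(
        BASE, category["StoreID"], category["CategoryID"], category["NodeId"]
    )
    return ("navigation", url)


backup = {"Description": "Backup Devices & Media", "StoreID": 1,
          "NodeId": 6642, "CategoryType": 0, "CategoryID": 2}
desktop = {"Description": "Desktop Memory", "StoreID": 1,
           "NodeId": 7611, "CategoryType": 1, "CategoryID": 147}

print(next_step(backup))
print(next_step(desktop))
```

The query builder at the end of this post makes the same decision when it chooses between its Category and Search links.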

From here you will go to the search query[3]. At this point it does get a little tricky, as the parameters for the query are no longer sent via GET; they are instead sent using POST[4], which means you need a programmatic method for making a search query. Given a category, store and node, the search query will list quite a lot of things. First in the list are the search filtering parameters, which allow you to limit the products shown in the listing.

Some data must be posted to receive a non-404 response from the server. If you really wanted to, you could just send an empty dictionary, which would query Newegg's entire product list. Any of the query options can be omitted; integer values may also be omitted by setting them to -1.

The parameters you should concern yourself with are shown below, preceded by the URL the data should be posted to in JSON format:

http://www.ows.newegg.com/Search.egg/Advanced

data = {
    "SubCategoryId": 147,
    "NValue": "",
    "StoreDepaId": 1,
    "NodeId": 7611,
    "BrandId": -1,
    "PageNumber": 1,
    "CategoryId": 17
}

NValue is a space-separated list of NValues from the search parameters. Mind you, you cannot filter against more than one item in any category of search filters. For example, in system memory you can't select DDR3 1333 (PC3 10600), DDR3 1333 (PC3 10660) and DDR3 1333 (PC3 10666) at the same time; the query will return an unsuccessful search result. The rest of the parameters are fairly self-explanatory.
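To make the POST step concrete, here is a minimal Python 3 sketch that builds the request without sending it. The helper name build_search_request and its keyword arguments are assumptions for illustration. The field names and the -1 defaults follow the description above, and the NValue handling (deduplicate, sort, join with spaces) mirrors what the query builder further down does.

```python
import json
import urllib.request

SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def build_search_request(store_id=-1, category_id=-1, subcategory_id=-1,
                         node_id=-1, nvalues=(), page=1):
    """Build (but do not send) a POST request carrying the JSON search query."""
    payload = {
        "StoreDepaId": store_id,
        "CategoryId": category_id,
        "SubCategoryId": subcategory_id,
        "NodeId": node_id,
        "BrandId": -1,
        # One NValue string per selected filter, joined with spaces.
        "NValue": " ".join(sorted(set(nvalues))),
        "PageNumber": page,
    }
    body = json.dumps(payload).encode("utf-8")
    # Supplying a data argument makes urllib issue a POST instead of a GET.
    return urllib.request.Request(SEARCH_URL, data=body)


request = build_search_request(store_id=1, category_id=17,
                               subcategory_id=147, node_id=7611)
print(request.get_method())
# urllib.request.urlopen(request) would send it; left out so the sketch runs offline.
```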

The result returned will contain the following elements: RelatedLinkList, CoremetricsInfo, NavigationContentList, PaginationInfo and ProductListItems. CoremetricsInfo and RelatedLinkList can usually be ignored. The elements we're interested in are NavigationContentList, which is a list of search parameters/filters you can apply to the search; PaginationInfo, which describes how many elements were returned, what page we're on and how many elements there are per page; and last but not least ProductListItems, which provides a list of the products returned by the query along with some basic listing info for each one.

Below is a portion of the NavigationContentList:

{
    "NavigationContentList": [
        {
            "NavigationItemList": [
                {
                    "SubCategoryId": -1,
                    "Description": "Free Shipping",
                    "StoreDepaId": 94,
                    "NValue": "100007611 600006050 600052012 4808",
                    "BrandId": -1,
                    "StoreType": 4,
                    "ItemCount": 194,
                    "CategoryId": -1,
                    "ElementValue": "4808"
                },
                {
                    "SubCategoryId": -1,
                    "Description": "Top Sellers",
                    "StoreDepaId": -1,
                    "NValue": "100007611 600006050 600052012 4802",
                    "BrandId": -1,
                    "StoreType": -1,
                    "ItemCount": 39,
                    "CategoryId": -1,
                    "ElementValue": "4802"
                },
                ...

This section will also contain a group name:

            ...
            "TitleItem": {
                "SubCategoryId": -1,
                "Description": "Useful Links",
                "StoreDepaId": -1,
                "NValue": "4800",
                "BrandId": -1,
                "StoreType": -2,
                "ItemCount": 0,
                "CategoryId": -1,
                "ElementValue": "4800"
            }
            ...
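Putting the two excerpts together, each entry in NavigationContentList pairs a TitleItem (the group name) with a NavigationItemList (the options). A minimal Python 3 sketch of turning that into something browsable follows; filter_options is a hypothetical helper, and the sample data is trimmed from the excerpts above.

```python
def filter_options(navigation_content_list):
    """Map each filter group's title to its (description, NValue) options."""
    groups = {}
    for section in navigation_content_list:
        title = section["TitleItem"]["Description"]
        groups[title] = [
            (option["Description"], option["NValue"])
            for option in section["NavigationItemList"]
        ]
    return groups


sample = [
    {
        "NavigationItemList": [
            {"Description": "Free Shipping",
             "NValue": "100007611 600006050 600052012 4808"},
            {"Description": "Top Sellers",
             "NValue": "100007611 600006050 600052012 4802"},
        ],
        "TitleItem": {"Description": "Useful Links", "NValue": "4800"},
    }
]

groups = filter_options(sample)
for title, options in groups.items():
    print(title, options)
```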

The PaginationInfo and ProductListItem elements will look like the following:

    ...
    "PaginationInfo": {
        "TotalCount": 233,
        "PageNumber": 1,
        "PageSize": 20
    },
    "ProductListItems": [
        {
            "SellerId": null,
            "ItemOwnerType": 0,
            "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139",
            "ItemGroupID": 0,
            "ReviewSummary": {
                "Rating": 5,
                "TotalReviews": "[1]"
            },
            "IsCellPhoneItem": false,
            "Discount": null,
            "FinalPrice": "$104.99",
            "ItemNumber": "20-148-372",
            "MappingFinalPrice": "$104.99",
            "FreeShippingFlag": true,
            "OriginalPrice": "$104.99",
            "IsComboBundle": false,
            "MailInRebateText": null,
            "ProductStockType": 0,
            "Model": "BL2KIT25664FN2139",
            "ShowOriginalPrice": false,
            "Image": {
                "FullPath": "http://images17.newegg.com/is/image/newegg/20-148-372-TS?$S125W$",
                "SmallImagePath": null,
                "ThumbnailImagePath": null,
                "Title": null
            },
            "SellerName": null,
            "ParentItem": null
        },
        ...
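Two small things fall out of this response: how many pages a query spans, and a numeric price. The sketch below is Python 3 with made-up helper names (page_count, price_to_float). Stripping a thousands separator is my own assumption, since the sample only shows a price under $1,000.

```python
import math


def page_count(pagination):
    """Number of pages needed to cover TotalCount at PageSize items per page."""
    return math.ceil(pagination["TotalCount"] / pagination["PageSize"])


def price_to_float(text):
    """Turn a price string such as "$104.99" into a float."""
    return float(text.replace("$", "").replace(",", ""))


pages = page_count({"TotalCount": 233, "PageNumber": 1, "PageSize": 20})
price = price_to_float("$104.99")
print(pages)   # 12
print(price)   # 104.99
```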

At this point you might be wondering: what good will all this do me if I can't get specifications on an item? Well, you can, and here's how. In each ProductListItems element you'll find an ItemNumber, which is essentially the primary key that each product is related to within this interface to Newegg's product list. Using the following URL you can obtain the full details page on any given item using its ItemNumber:

http://www.ows.newegg.com/Products.egg/{ItemNumber}/Specification

{
    "SpecificationGroupList": [
        {
            "GroupName": "Model",
            "SpecificationPairList": [
                {
                    "Value": "Crucial",
                    "Key": "Brand"
                },
                {
                    "Value": "Ballistix",
                    "Key": "Series"
                },
                {
                    "Value": "BL2KIT25664FN2139",
                    "Key": "Model"
                },
                {
                    "Value": "240-Pin DDR3 SDRAM",
                    "Key": "Type"
                }
            ]
        },
        {
            "GroupName": "Tech Spec",
            "SpecificationPairList": [
                {
                    "Value": "4GB (2 x 2GB)",
                    "Key": "Capacity"
                },
                {
                    "Value": "DDR3 2133 (PC3 17000)",
                    "Key": "Speed"
                },
                {
                    "Value": "9",
                    "Key": "Cas Latency"
                },
                {
                    "Value": "9-10-9-24",
                    "Key": "Timing"
                },
                {
                    "Value": "1.65V",
                    "Key": "Voltage"
                },
                {
                    "Value": "No",
                    "Key": "ECC"
                },
                {
                    "Value": "Unbuffered",
                    "Key": "Buffered/Registered"
                },
                {
                    "Value": "Dual Channel Kit",
                    "Key": "Multi-channel Kit"
                }
            ]
        },
        {
            "GroupName": "Manufacturer Warranty",
            "SpecificationPairList": [
                {
                    "Value": "Lifetime limited",
                    "Key": "Parts"
                },
                {
                    "Value": "Lifetime limited",
                    "Key": "Labor"
                }
            ]
        }
    ],
    "NeweggItemNumber": "N82E16820148372",
    "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139"
}
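Since the specifications come back grouped, it can be handy to flatten them into a single dictionary before doing anything else. A minimal Python 3 sketch, with flatten_specs as a hypothetical helper name and the sample trimmed from the response above:

```python
def flatten_specs(specs):
    """Collapse SpecificationGroupList into one {Key: Value} dictionary.

    If the same Key appears in more than one group, the later one wins.
    """
    flat = {}
    for group in specs["SpecificationGroupList"]:
        for pair in group["SpecificationPairList"]:
            flat[pair["Key"]] = pair["Value"]
    return flat


sample = {
    "SpecificationGroupList": [
        {"GroupName": "Model", "SpecificationPairList": [
            {"Value": "Crucial", "Key": "Brand"},
            {"Value": "BL2KIT25664FN2139", "Key": "Model"},
        ]},
        {"GroupName": "Tech Spec", "SpecificationPairList": [
            {"Value": "4GB (2 x 2GB)", "Key": "Capacity"},
            {"Value": "DDR3 2133 (PC3 17000)", "Key": "Speed"},
            {"Value": "9-10-9-24", "Key": "Timing"},
        ]},
    ],
    "NeweggItemNumber": "N82E16820148372",
}

flat = flatten_specs(sample)
print(flat["Brand"], flat["Capacity"], flat["Timing"])
```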

From this point on you can grab all of the features and specifications of any particular item you're interested in. In the near future I'll be writing a new post for both my memory and SSD analysis scripts using this interface.

The full code for my query builder is as follows, though you should note this was a quick script and is in no way complete or fully functional. As soon as it reached a usable point I moved on to the main point of this whole ordeal. You should also note that this requires CherryPy[5] and lxml[6]. The end result of this program is a query which you can use to retrieve a list of products matching the options you've selected. This is mainly to simplify product list selection and to minimize the need to hardcode certain values, as Newegg has a tendency to change things around on a regular basis.

import cherrypy, json, urllib, urllib2
from lxml import etree
from lxml.builder import E

class QueryBuilder(object):
    def index(self):
        request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Menus")
        response = request.read()
        data = json.loads(response)
       
        body = E.body()
       
        ul = E.ul()
        for store in data:
            ul.append(E.li(E.a(
                store['Title'],
                href= '/Store?StoreID={}'.format(store['StoreID'])
            )))
       
        page = E.html(E.body(ul))
       
        return etree.tostring(page, pretty_print=True)
    index.exposed = True
   
    def Store(self, StoreID=None):
        if StoreID is not None:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Categories/{}".format(StoreID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for category in data:
                if category['CategoryType'] == 1:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Search?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
                else:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Category?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Store.exposed = True
   
    def Category(self, StoreID, CategoryID, NodeID):
        if None not in [StoreID, CategoryID, NodeID]:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}".format(StoreID, CategoryID, NodeID))
            response = request.read()
            data = json.loads(response)
           
            body = E.body()
       
            ul = E.ul()
            for subcategory in data:
                ul.append(E.li(E.a(
                    subcategory['Description'],
                    href= '/Search?StoreID={}&CategoryID={}&SubCategoryID={}&NodeID={}'.format(StoreID, CategoryID, subcategory['CategoryID'], subcategory['NodeId'])
                )))
           
            page = E.html(E.body(ul))
           
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Category.exposed = True
   
    def Search(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None):
        url = "http://www.ows.newegg.com/Search.egg/Advanced"
        data = {
            "IsUPCCodeSearch":      False,
            "IsSubCategorySearch":  True,
            "isGuideAdvanceSearch": False,
            "StoreDepaId":          StoreID,
            "CategoryId":           CategoryID,
            "SubCategoryId":        SubCategoryID,
            "NodeId":               NodeID,
            "BrandId":              -1,
            "NValue":               "",
            "Keyword":              "",
            "Sort":                 "FEATURED",
            "PageNumber":           1
        }
       
        params = json.dumps(data).replace("null", "-1")
        request = urllib2.Request(url, params)
        response = urllib2.urlopen(request)
        data = json.loads(response.read())
       
        if data['NavigationContentList'] is None:
            return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
       
        body = E.body()
   
        form = E.form(name='PowerSearch', action='GenerateURL', method='GET')
       
        table = E.table()
        form.append(table)
        for section in data['NavigationContentList']:
            index = 0
            tr = E.tr(E.td(section['TitleItem']['Description'], colspan='3'))
            table.append(tr)
            for option in section['NavigationItemList']:
                if index % 3 == 0:
                    tr = E.tr()
                    table.append(tr)
                index += 1
                checkbox = E.td(E.input(option["Description"], type="checkbox", name=section['TitleItem']['Description'].replace(" ", ""), value=option['NValue']))
                tr.append(checkbox)
       
        for param, value in [('StoreID', StoreID), ('CategoryID', CategoryID), ('SubCategoryID', SubCategoryID), ('NodeID',NodeID)]:
            # Skip parameters that were not supplied; lxml rejects None as an attribute value.
            if value is not None:
                form.append(E.input(type='hidden', name=param, value=value))
        form.append(E.input(type='submit', value='Submit'))
        page = E.html(E.body(form))
       
        return etree.tostring(page, pretty_print=True)
    Search.exposed = True
   
    def GenerateURL(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None, **kwargs):
        NValue = set([])
        for arg in kwargs:
            if type(kwargs[arg]) == list:
                for value in kwargs[arg]:
                    NValue.add(value)
            else:
                NValue.add(kwargs[arg])
       
        NValue = list(NValue)
        NValue.sort()
        if StoreID is None:
            StoreID = -1
        if CategoryID is None:
            CategoryID = -1
        if SubCategoryID is None:
            SubCategoryID = -1
        if NodeID is None:
            NodeID = -1
        data = {
            "StoreDepaId":          int(StoreID),
            "CategoryId":           int(CategoryID),
            "SubCategoryId":        int(SubCategoryID),
            "NodeId":               int(NodeID),
            "BrandId":              -1,
            "NValue":               ' '.join(NValue),
            "PageNumber":           1
        }
        return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
    GenerateURL.exposed = True
   
cherrypy.quickstart(QueryBuilder())
  1. And iOS devices I assume as well.
  2. Because let's face it, that would be stupid.
  3. ... or get to the search query by selecting a root category in the main category listing for a store.
  4. At least this is the method used by the mobile app.
  5. CherryPy: CherryPy is a pythonic, object-oriented HTTP framework.
  6. lxml: A Pythonic binding for the C libraries libxml2 and libxslt.
12Dec/100

Choosing an SSD (A more different S)

I've been periodically going back and revisiting the results for my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like Newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page, although I can't figure out why they decided it didn't need to be there in the first place. I see it this way: SSDs are first and foremost storage devices, so you'd think that capacity, at the very least, would be listed with every SSD.

Anyway, this change broke my script, which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML more often lately despite my original prejudice against it for being a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2], which is an incredibly useful "language" for selecting data from an XML document.

Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.
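For a sense of what this kind of selection looks like, here is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath (lxml accepts the same expressions and more). The markup is a made-up, well-formed fragment for illustration, not Newegg's actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real result pages are far messier and need an HTML parser.
SAMPLE = """
<html><body>
  <div class="itemCell"><a title="View Details" href="http://example.com/item/1">Drive A</a></div>
  <div class="other"><a title="View Details" href="http://example.com/ad">Ad</a></div>
  <div class="itemCell"><a title="View Details" href="http://example.com/item/2">Drive B</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Select every item cell anywhere in the tree, then the details link inside each one.
links = [
    cell.find(".//a[@title='View Details']").get("href")
    for cell in root.findall(".//div[@class='itemCell']")
]
print(links)
```

The predicate in square brackets is what keeps the unrelated div out of the results, which is exactly the sort of thing that gets ugly fast with regular expressions.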

First off, let's consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor, or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be sub $300[8].

The above requirements give us the following power search[9] which we will be using as the source for the script:

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

Now the first thing that made me cringe as I was rewriting this was the fact that I would have no choice but to load each individual product page from the results page, as capacity is no longer included in either the description or the features list of each product in the results page. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSDs. The following is the full source code of the parser:

import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, etree.HTMLParser())
root = tree.getroot()

items = []

for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]
   
    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)
    for feature, value in features:
        if value is not None and feature in featureMap:
            # If it's a speed feature parse out the speed
            if featureMap[feature] in ("read", "write"):
                item[featureMap[feature]] = min(map(lambda x: int(x), speed_re.findall(value)))
            # If it's a capacity feature, parse out the capacity
            elif featureMap[feature] == "capacity":
                item[featureMap[feature]] = min(map(lambda x: int(x), capacity_re.findall(value)))
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[featureMap[feature]] = value.strip()
               
    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))
   
    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
        # Clamp the difference to at least 1 so log10 is defined when read == write
        diff = max(abs(item["read"] - item["write"]), 1)
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log10(diff) + 1) * item["price"])
        item["score"] = score
        items.append(item)
       
   
sorted = {}
for item in items:
    # Open addressing like in a hash table, so we don't wind
    # up with any collisions, unlikely but good practice anyway
    score = item["score"]
    while score in sorted:
        score += 1
    sorted[score] = item

sortOrder = sorted.keys()
sortOrder.sort()
sortOrder.reverse()

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for key in sortOrder:
    item = sorted[key]
    print '\t'.join(map(lambda x: str(item[x]), headers))

At this point, if you've read through the entire script, you'll probably have noticed that I've made a slight change to the scoring equation. It has been changed from the following:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}$$

To the following:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{(\log_{10}(|\text{Read} - \text{Write}|) + 1) \times \text{Price}}$$

I discovered that dividing by the raw difference in read/write speed heavily penalized drives with anything greater than a 10MB/s difference. So I figured it would be a little more subtle to penalize drives based only on the order of magnitude of the difference.
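
To see how the three versions of the metric behave, here is a small sketch that scores two drives under each one. The first drive uses the 285/275 MB/s, 120GB, $214.99 figures from the results below; the second is a made-up drive with a 100MB/s gap, included only to illustrate the penalty. The function names are illustrative, and the max(..., 1) clamp is an assumption added so the logarithm is defined when read and write speeds match.

```python
import math

def score_difference(read, write, capacity, price):
    # First metric: divide by the raw read/write difference.
    return (read * write * capacity) / (abs(read - write) * price)

def score_plain(read, write, capacity, price):
    # Metric with the difference term dropped entirely.
    return (read * write * capacity) / price

def score_log(read, write, capacity, price):
    # Current metric: penalize by the order of magnitude of the difference.
    # Assumption: clamp the difference to at least 1 so log10 is defined.
    diff = max(abs(read - write), 1)
    return (read * write * capacity) / ((math.log10(diff) + 1) * price)

drives = {
    "285/275 MB/s, 120GB, $214.99": (285, 275, 120, 214.99),
    "hypothetical 250/150 MB/s, 120GB, $214.99": (250, 150, 120, 214.99),
}

for name, specs in sorted(drives.items()):
    print("%s: difference=%.0f plain=%.0f log=%.0f" % (
        name, score_difference(*specs), score_plain(*specs), score_log(*specs)))
```

With a 10MB/s gap the log term is 2 and with a 100MB/s gap it is 3, so the larger gap costs a factor of 1.5 instead of the factor of 10 it would cost under the raw difference.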

Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?". And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.


Manufacturer:  OCZ                  | OCZ                  | G.Skill              | OCZ
Series:        RevoDrive            | Vertex 2             | Phoenix Pro Series   | Agility 2
Capacity:      120GB                | 180GB                | 120GB                | 120GB
Read:          540MB/s              | 285MB/s              | 285MB/s              | 285MB/s
Write:         490MB/s              | 275MB/s              | 275MB/s              | 275MB/s
Item:          N82E16820227578[10]  | N82E16820227602[11]  | N82E16820231378[12]  | N82E16820227593[13]
Price:         $299.99              | $294.99              | $214.99              | $214.99

As you can see, the RevoDrive far out-scores the rest of the SSDs considered in this analysis. The main reason is that they've essentially put two 60GB SSDs on the same card and you're expected to run software RAID on them in your own system[14]. Despite the incredible speeds they boast, I don't think I would purchase one of these to use as my OS/program disk because compatibility is a major limitation: you must be sure that your motherboard's BIOS supports booting from PCI-Express cards. And last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM, which is a major downside as far as I'm concerned.

The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra $80.

Which leaves me with the last two disks, which are, as far as my analysis is concerned, identical. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k random write test, which seems a bit... optimistic. The OCZ makes no such claim, and as far as I'm concerned both disks are more or less the same thing. So it's pretty much up to brand preference at this point.

  1. I've already sent feedback to them suggesting that they fix this.
  2. Only if the XML parser you're using supports it, and it seems not a whole lot of them do. At least not all of them support the full specification, which is annoying since nobody really seems to document which bits and pieces they support and which they don't.
  3. Although not necessarily XML depending on the particular doctype you've chosen, Newegg's is transitional HTML.
  4. lxml: http://codespeak.net/lxml/
  5. xml.etree.ElementTree: http://docs.python.org/library/xml.etree.elementtree.html
  6. Some of the PCI-Express SSDs are stupidly fast and more expensive, except that it doesn't look like any of them support TRIM yet, which is a major problem for me.
  7. It is rare that I have a matured (read: haven't reformatted in a while) install of Windows along with all of my most commonly used programs and games that exceeds 60GB, so I estimate that doubling this should accommodate any sudden urges to install really big things.
  8. I can't really justify spending much more than $300 on a single storage device. It had better be one hell of a storage device if I ever find myself spending more than $300 on it.
  9. This will likely need to be updated at least once a month as Newegg is constantly adding new criteria and changing things.
  10. OCZ RevoDrive
  11. OCZ Vertex 2
  12. G.SKILL Phoenix Pro Series
  13. OCZ Agility 2
  14. They show up as two separate physical devices despite being located on the same card.
4Dec/100

Automagic TV Show Calendar

A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally enter the site on the main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.

To me, this is like opening presents on Christmas Day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to automatically create events on my Google Calendar for unaired episodes of my favorite TV shows.

I've previously written Python scripts that interface with gdata, but I find their Python implementation kind of cumbersome to deal with, so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. It turns out you can use the built-in json module in Python to create the necessary structures.
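
As a rough sketch of what that looks like, the snippet below builds a quick-add entry with the json module. The "data", "details" and "quickAdd" keys are the ones AirDate.py sends further down; the helper name and the sample episode values are made up for illustration.

```python
import json

def build_quick_add_entry(air_date, show_name, episode_code):
    # Quick-add text in the form "YYYY-MM-DD Show - S##E##".
    details = "%s %s - %s" % (air_date, show_name, episode_code)
    # Wrap it in the JSON-C envelope and serialize it for the request body.
    return json.dumps({"data": {"details": details, "quickAdd": True}})

# Sample values only; they are not taken from a real feed.
body = build_quick_add_entry("2010-12-06", "Castle", "S03E10")
print(body)
```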

For parsing the results I got from tvrage I ended up using Python's xml.etree.ElementTree, which was simple enough to set up to retrieve only the information I was interested in for each episode.[5]
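
A minimal sketch of that parsing step, run against a tiny hand-written document instead of a live feed. The element names (Episodelist, Season, episode, seasonnum, title, airdate) are the ones AirDate.py looks for; the Show root element, the function name and the sample episode data are invented for the example.

```python
from xml.etree import ElementTree

# Hand-written sample in the shape the script expects; not real feed output.
SAMPLE = """
<Show>
  <Episodelist>
    <Season no="1">
      <episode>
        <seasonnum>01</seasonnum>
        <airdate>2010-01-05</airdate>
        <title>Pilot</title>
      </episode>
      <episode>
        <seasonnum>02</seasonnum>
        <airdate>2010-01-12</airdate>
        <title>Second Episode</title>
      </episode>
    </Season>
  </Episodelist>
</Show>
"""

def list_episodes(xml_data):
    """Return (episode code, title, airdate) tuples for every episode."""
    tree = ElementTree.fromstring(xml_data.strip())
    episodes = []
    for season in tree.findall("Episodelist/Season"):
        season_num = int(season.get("no"))
        for episode in season.findall("episode"):
            episode_num = int(episode.find("seasonnum").text)
            code = "S%02dE%02d" % (season_num, episode_num)
            episodes.append((code,
                             episode.find("title").text,
                             episode.find("airdate").text))
    return episodes

for entry in list_episodes(SAMPLE):
    print(entry)
```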

I had a bit of trouble initially with adding events to Google Calendar. This stemmed from the fact that Google will often return an HTTP redirect whose URL has an appended gsession attribute, and you're supposed to resubmit the exact data from the first request to that URL. Once I figured this out it was smooth sailing. I even managed to get the whole script multi-threaded to speed things up, since it's impossible to perform batch requests with JSON-C.

I should note that the calendar value in the configuration file should be the "Calendar ID", which can be found on the settings page for the individual calendar, grouped with the XML and iCal feeds.

ShowList.txt:[6]

Castle  19267
House   3908
Bones   2870
Big Bang Theory, The    8511
Mentalist, The  18967
Rizzoli & Isles 24996
Venture Bros., The  6270
Top Gear    6753
Mythbusters 4605
Archer  23354
NCIS    4628
Community   22589

Config.cfg:

[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com
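
To produce the value for the password field, something along these lines should work. It base64-encodes the text so the script can turn it back with base64.b64decode. The helper name and the sample password are placeholders, and keep in mind that base64 only hides the password from a casual glance; it is not encryption.

```python
import base64

def encode_password(plain_text):
    # b64encode works on bytes, so encode the text first, then decode the
    # result back to a plain string suitable for pasting into Config.cfg.
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")

# Placeholder password for demonstration only.
encoded = encode_password("hunter2")
print("password = %s" % encoded)
```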

AirDate.py:

import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id
   
    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()
        # Parse XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)
        self.result = []
       
        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text
               
                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)
               
                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))
                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date)
                    pass

class addEvent(Thread):
    # Thread for adding events to google calendar
   
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode
   
    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
       
        # Build request including necessary headers and data
        calReq = urllib2.Request("http://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        # Execute the request
        calRes = urllib2.urlopen(calReq)
        # Get the redirect url (gsession appended)
        redirectReq = urllib2.Request(calRes.geturl(), entry, header)
        try:
            redirectRes = urllib2.urlopen(redirectReq)
        except urllib2.HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass
   
# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)
   
    # Parse JSON-C
    data = json.loads(calRes.read())
    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
           
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")
   
    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)
    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()
   
    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }
   
    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)
   
    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)
   
    threadQueue = []
    # For each show in the list
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)
       
    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)
   
    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)
   
    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()

This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows, provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events at the exact air time of each episode, which I'm not really all that concerned about. So I still use this script to add all-day events for each episode, which makes it easier to see when there's a new episode.

I did write another script using their XML API but that will have to wait for another post.

  1. http://tvrage.com/
  2. http://services.tvrage.com/
  3. Data API Developer's Guide: The Protocol
  4. Google's own flavor of JSON which is almost identical to plain old JSON.
  5. I only really needed the original air date, title, season number and episode number.
  6. You can find the show_id via the show search found on their XML API page.
9Oct/100

Choosing an SSD (Update)

My brother is in the planning stages of building a new desktop. One of the things he's planning on doing differently from his last build[1] is using an SSD for OS + Programs.

I had mentioned to him previously that I wrote a program to help choose an SSD based on what SSDs are meant for and are good at doing. So he asked if I could recommend one. Below are the results of the latest run of my script based on the most current listings[2] of SSDs Newegg offers.


Manufacturer:  OCZ                 | G.Skill             | OCZ
Series:        Agility 2           | Phoenix Pro         | Vertex 2
Capacity:      120GB               | 120GB               | 120GB
Read:          285MB/s             | 285MB/s             | 285MB/s
Write:         275MB/s             | 275MB/s             | 275MB/s
Item:          N82E16820227543[3]  | N82E16820231378[4]  | N82E16820227551[5]
Price:         $235.99             | $239.99             | $240.00


It looks like OCZ has two of the top three places this run and G.Skill is still holding one of the top three from before. Between the three of them I think I would still go for the G.Skill out of personal preference, since there aren't really any significant differences between the three, except price of course.

  1. Which incidentally was right when SSDs were just becoming available to the average consumer.
  2. As of this date 10/09/2010.
  3. OCZ Agility 2
  4. G.Skill Phoenix Pro
  5. OCZ Vertex 2
11Aug/100

Choosing an SSD

Before I started my new job I had an inordinate amount of free time and for a majority of that time, nothing to spend it doing[1]. I was still thinking about my desktop wishlist[2] and about choosing a better SSD than the one I had previously selected[3].

A long time ago, when I was following the HDD market because I was looking to buy some bulk storage, I wrote a PHP script that loaded Newegg's product list by passing the search parameters you provided to Newegg's productlist.xml[4]. The script would then parse the list and produce a list sorted by price per gigabyte, which is useful when you're in the market for capacity[5].

I decided to do more or less the same thing with SSDs, except this time I did it in Python since I'm rusty on PHP and I didn't want to mess with setting up a web server to test on. So I got started by doing a power search on Newegg for the specific flavor of SSD I was looking for.

The search parameters are as follows:

  • 2.5" Form Factor
  • SATA II/III
  • 120GB or Greater
  • Less than $300
  • Retail or OEM
  • Support TRIM Command

As of this writing those particular search parameters narrow the results to 17 SSDs. Now comes the code. Before I started coding I needed some way to sort them according to what I thought was important. The metric is as follows:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{|\text{Read} - \text{Write}| \times \text{Price}}$$

After looking closer at the scores this produces I noticed that it heavily penalizes drives with huge differences between read and write speeds, which effectively weeds out drives that still have acceptable read/write speeds. So I removed that term from the metric, producing:

$$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}$$

The basic idea behind this scoring measure is that sequential read and write speeds are important, as well as capacity. Price and difference between sequential read/write are considered bad[6]. In the equation, read and write refer to sequential read and write speeds. The ratio of these will produce a score of the SSD's overall performance for capacity, read/write speeds and price.

The code is relatively simple in purpose. Load the data and parse it into a dictionary then sort based on the metric above.

import urllib2, re

# url = "
# http://www.newegg.com/Product/ProductList.aspx?Submit=Property&Subcatego
# ry=636&Description=&Type=&N=100008120&IsNodeId=1&srchInDesc=&MinPrice=&M
# axPrice=&OEMMark=1&OEMMark=0&PropertyCodeValue=4213:30854&PropertyCodeVa
# lue=4214:30848&PropertyCodeValue=4214:39416&PropertyCodeValue=4214:30849
# &PropertyCodeValue=4214:39415&PropertyCodeValue=4215:55552&PropertyCodeV
# alue=4215:41071&PropertyCodeValue=4215:46319"

# data = open("temp.html", "w")
# data.write(urllib2.urlopen(url).read())
# data.close()
raw = open("temp.html").read()

item_re = re.compile(r'<div class="itemCell".*?>(.*?)<br class="clear".*?</div>')
feature_re = re.compile(r"<li>&nbsp;(.*?)</li>")
feature_list_re = re.compile(r'<b>(.*?)\s?\#?\s?:\s?</b>\s?(.*?)</li>')
speed_re = re.compile(r"(up to )?(\d+).*?MB/s")
capacity_re = re.compile(r"(\d+)GB")
price_re = re.compile(r"</span>\$<strong>(\d+)</strong><sup>.(\d+)</sup>")

item_list = []
valid = ['Read', 'Item', 'Interface', 'Capacity', 'Model', 'Write', 'Size']

for item in item_re.findall(raw):
    current = {}
    no_label = []
    features = feature_re.findall(item)
    current["Size"] = features[0]
    current["Capacity"] = features[1]
    current["Interface"] = features[2]
   
    for feature in feature_list_re.findall(item):
        if feature[1].find("\r") != -1:
            current[feature[0]] = feature[1].split("\r")[0]
        else:
            current[feature[0]] = feature[1]
    current["Read"] = int(speed_re.findall(current["Sequential Access - Read"])[0][1])
    current["Write"] = int(speed_re.findall(current["Sequential Access - Write"])[0][1])
    current["Capacity"] = int(capacity_re.findall(current["Capacity"])[0])
    for feature in current.keys():
        if feature not in valid:
            del current[feature]
    current["Price"] = float('.'.join(price_re.findall(item)[0]))
    current["Item"] = "http://www.newegg.com/Product/Product.aspx?Item=%s" % (current["Item"])
    item_list.append(current)
   
sorted = {}
for item in item_list:
    ratio = (item["Read"] * item["Write"] * item["Capacity"]) / (item["Price"])
    sorted[ratio] = item
   
sort_order = sorted.keys()
sort_order.sort()
sort_order.reverse()
for key in sort_order:
    #print '\t'.join(map(lambda x: str(x), sorted[key].keys()))
    print '\t'.join(map(lambda x: str(x), sorted[key].values()))
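
Keying a dictionary by score works here, but two drives with exactly the same score would overwrite each other. A sketch of an alternative, assuming the same kind of item dictionaries, is to let the builtin sorted() order the list directly with a key function. The sample items and the score helper are illustrative, and since the script above binds the name sorted to its own dictionary, that variable would need a different name for this to work.

```python
def score(item):
    # Same ratio as the script: read * write * capacity / price.
    return (item["Read"] * item["Write"] * item["Capacity"]) / item["Price"]

# Illustrative items; A and B deliberately tie on score.
item_list = [
    {"Model": "A", "Read": 285, "Write": 275, "Capacity": 120, "Price": 289.99},
    {"Model": "B", "Read": 285, "Write": 275, "Capacity": 120, "Price": 289.99},
    {"Model": "C", "Read": 280, "Write": 270, "Capacity": 128, "Price": 295.99},
]

# Highest score first; tied items are both kept instead of overwritten.
ranked = sorted(item_list, key=score, reverse=True)

for item in ranked:
    print("%s\t%.1f" % (item["Model"], score(item)))
```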

Now given that there is quite a lot of data to present and analyze all at once I've decided it would be easiest to just provide you with a pretty graph[7]:


If you look closely at the scores of all the disks in the query, you'll notice that there is a noticeable gap between the top 3 and the rest. They are as follows:

Manufacturer:  A-DATA               | Patriot              | G.Skill
Series:        S599                 | Inferno              | Phoenix Series
Capacity:      128GB                | 120GB                | 120GB
Read:          280MB/s              | 285MB/s              | 285MB/s
Write:         270MB/s              | 275MB/s              | 275MB/s
Item:          N82E16820211471[8]   | N82E16820220510[9]   | N82E16820231372[10]
Price:         $295.99              | $289.99              | $299.00


I noticed that if you ignore capacity in the metric then the Patriot Inferno is the clear winner here. So as it turns out, the Western Digital SiliconEdge I had selected when I first wrote the wishlist wasn't the best drive for my needs, though I've always had a soft spot for Western Digital. Now I'm convinced that the Patriot Inferno is the SSD I'll be getting, unless there are better options by the time I get around to buying one[11].

  1. Nothing worthwhile anyway.
  2. See previous post: Wishlist.
  3. Western Digital SiliconEdge 128GB SSD
  4. Which no longer exists in its original form.
  5. Which I was.
  6. Although we're excluding read/write speed difference.
  7. Scores have been normalized to 100%.
  8. A-Data S599
  9. Patriot Inferno
  10. G.Skill Phoenix Series
  11. Which there probably will be.
28Jul/100

Matplotlib and Live Data: A Tale of Two Technologies

Being unemployed over the summer is never usually a good thing for me. I get bored very easily if I don't have something to occupy myself with. This last bout of boredom led me to unpack some of my electronics. I dusted off my multimeter, Arduino and a digital thermometer I bought a little while ago, and figured I could use these to solve one of my current problems.

Living in Laramie usually subjects people to harsh winters, which leaves most housing developments without central air conditioning since, well, it's never really needed except maybe one or two days over the summer when it gets above 85 °F. This summer has apparently been hotter than previous summers and it's left my condo in an "uncomfortable state". Mind you, I'm used to living in hot weather, so this isn't such a terrible thing to me.

What I'm not used to is not having AC and it cooling off enough at night that it's worthwhile to open a few windows and stick a fan in one of them. Which leaves me with this problem: When is the optimal time to open the windows and turn on the fan to get my condo cooled off earliest/fastest?

In comes my Arduino + digital thermometer[1]. Once I rigged up the proper power/data connections on a breadboard for my Arduino, I set out to find code for the thermometer. I've set up the thermometer with a sketch on my Arduino before; I just didn't feel like wasting a few hours trying to do it from scratch again. Soon enough I found some code[2] that worked perfectly. So I trimmed out the code I didn't need for the project and set it up to just write the temperature as fast as possible[3] to the serial port it's connected to.

After that I wrote a logging program on my desktop in Python to record temperatures sent via serial to my desktop. The program is incredibly simple and uses the pySerial library[4] to read temperatures from the serial port of my desktop and append them to a temperature log. I used a simple windows command to do this since it wouldn't lock the file so I could read data from it simultaneously. There are still occasionally collisions with the processing program locking the file and the logger not being able to write the data to the file but these are rare enough that it's negligible in my situation.

import serial, os

ser = serial.Serial(2)
while True:
    os.system("echo %s>>out.txt" % (ser.readline().strip()))
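
If shelling out to echo feels fragile (whatever text arrives on the serial port ends up on a command line), one hedged alternative is to append from Python and close the file after every reading so the handle is only held briefly. This is a sketch and not the logger used above: log_readings and the fake list of readings are stand-ins, and with real hardware the lines would come from ser.readline() instead. Whether it avoids the locking collisions as well as the echo approach is untested.

```python
import os
import tempfile

def log_readings(lines, path="out.txt"):
    """Append each non-empty reading to path, one per line."""
    count = 0
    for raw in lines:
        # Serial data may arrive as bytes; accept plain text too.
        text = raw.decode("ascii", "replace") if isinstance(raw, bytes) else raw
        text = text.strip()
        if not text:
            continue
        # Open, append and close each time so the file is only held briefly.
        with open(path, "a") as log:
            log.write(text + "\n")
        count += 1
    return count

# Stand-in for lines read from the serial port (values are made up).
fake_readings = [b"23.50\r\n", b"\r\n", b"23.56\r\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
print(log_readings(fake_readings, demo_path))
```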

The next step in this project was visualizing the data. I've used matplotlib[5] before and I was thinking this time I would like to see if I could write the program to update data live as it receives it. My first foray into this goal was a miserable disaster. Most of the solutions I could find involved just setting up an infinite loop with a short time delay in it. That works great except that it sleeps the thread running the plot, which makes it impossible to resize the plot or do anything at all with the GUI for that matter. So obviously this wouldn't work at all.

After poking around for different solutions to this and crashing my computer once from spawning an infinite number of instances of the plot I gave up for a bit, only to discover that there was an example in the documentation which wasn't obviously named. I quickly discovered the best way to do this. I even added some pretty annotations and such.

import gobject
import matplotlib
matplotlib.use('GTKAgg')

import matplotlib.pyplot as plt

current_pos = 0
temps = []
pad = 5.0

f = plt.figure()

def update(vars):
    # Unpack variables that need to be persistent between
    # executions of this method.
    temps = vars[0]
    current_pos = vars[1]
    pad = vars[2]
   
    # Open the data file and get any new data points since
    # the last time we read from this file
    data = open("out.txt", "r")
    data.seek(current_pos)
    new_temps = map(lambda x:
        float(x) * (1 + 4.0/5.0) + 32.0,
        data.read().split("\n")[:-1])
    current_pos = data.tell()
    data.close()
   
    # If we got new data then append it to the list of
    # temperatures and trim to 750 points
    if len(new_temps) > 0:
        temps.extend(new_temps)
        temps = temps[-750:]
   
    f.clear()
    f.suptitle("Live Temperature")
    a = f.add_subplot(111)
    a.grid(True)
    l, = a.plot(temps)
    plt.xlabel("Time (Seconds)")
    plt.ylabel(r'Temperature $^{\circ}$F')
   
    # Get the minimum and maximum temperatures these are
    # used for annotations and scaling the plot of data
    min_t = min(temps)
    max_t = max(temps)
   
    # Add annotations for minimum and maximum temperatures
    a.annotate(r'Min: %0.2f$^{\circ}$F' % (min_t),
        xy=(temps.index(min_t), min_t),
        xycoords='data', xytext=(20, -20),
        textcoords='offset points',
        bbox=dict(boxstyle="round", fc="0.8"),
        arrowprops=dict(arrowstyle="->",
        shrinkA=0, shrinkB=1,
        connectionstyle="angle,angleA=0,angleB=90,rad=10"))

    a.annotate(r'Max: %0.2f$^{\circ}$F' % (max_t),
        xy=(temps.index(max_t), max_t),
        xycoords='data', xytext=(20, 20),
        textcoords='offset points',
        bbox=dict(boxstyle="round", fc="0.8"),
        arrowprops=dict(arrowstyle="->",
        shrinkA=0, shrinkB=1,
        connectionstyle="angle,angleA=0,angleB=90,rad=10"))
   
    # Set the axis limits to make the data more readable
    a.axis([0,len(temps), min_t - pad,max_t + pad])
   
    f.canvas.draw_idle()
   
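    # Store the updated state back into the shared dictionary so it
    # persists between executions of this method; rebinding the local
    # name would be lost when the function returns
    vars[0], vars[1], vars[2] = temps, current_pos, pad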
   
    return True

vars = {0: temps, 1: current_pos, 2: pad}

# Execute update method every 500ms
gobject.timeout_add(500, update, vars)

# Display the plot
plt.show()

This code generates a plot which updates every 500ms. This is based on an example in the matplotlib examples[6]. An example of the program's output can be seen below.
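
The part of update() that picks up only new readings is the seek() call: remember the offset where the last read stopped and start from there next time. A stripped-down sketch of just that step, with the Celsius to Fahrenheit conversion pulled into its own function, might look like this. The function names and the demo file are illustrative. Unlike update(), the sketch opens the file in binary mode and only advances past complete lines, so a reading that is still being written is picked up on the next pass.

```python
import os
import tempfile

def c_to_f(celsius):
    # Same conversion as in update(): 1 + 4/5 is 9/5.
    return celsius * 9.0 / 5.0 + 32.0

def read_new_temps(path, position):
    """Return (new Fahrenheit readings, new offset) for data after position."""
    with open(path, "rb") as data:
        data.seek(position)
        chunk = data.read()
    # Only consume complete lines; a partial line is left for next time.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].splitlines()
    temps = [c_to_f(float(line.decode("ascii"))) for line in lines if line.strip()]
    return temps, position + end

# Demo with a throwaway file standing in for out.txt (values are made up).
demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(demo_path, "wb") as out:
    out.write(b"20.0\r\n25.0\r\n")
first, pos = read_new_temps(demo_path, 0)

with open(demo_path, "ab") as out:
    out.write(b"30.0\r\n")
second, pos = read_new_temps(demo_path, pos)

print("%s %s" % (first, second))
```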

I imagine that I could have made this simpler by not using the GTK libraries which are a pain to install since there are 3 or 4 modules you have to install in order to make all this work including the GTK+ runtime. I may come back later and post a version written using TK since it can be used without installing extra modules and stuff.

  1. DS18S20 Digital Thermometer Datasheet
  2. Temperature Measurement using the Dallas DS18B20 by Peter H. Anderson
  3. Somewhere in the range of 750ms between readings since it is in parasite mode, may change this later to run in non-parasite mode.
  4. pySerial Python Library
  5. matplotlib Python Library
  6. Animation example code: simple_anim_gtk.py
16Apr/090

Compiling Python (w/Jython)

If you're interested in the cross-platform-y-ness of Python, you'll probably find this interesting. I've written a few programs for work in Python. They needed to be more or less cross-platform and simple to use, since they're going to be distributed to users who have pretty much no idea how to do anything on the command line.

My first experience with packaging Python so that it could run on Windows desktops without installing the Python interpreter beforehand involved a program called py2exe, which does just what it says: it bundles all the dependencies and libraries of a particular Python program into a single zip file along with a few basic things and generates an executable file for Windows-based systems. The drawbacks are that it only seems to work with Python 2.5 right now, and that it has several extra files that must be distributed with the program and are really good at confusing users.

As for running Python on Linux and OS X based systems, it was a breeze. I had written a few very basic interfaces using Tkinter, which is installed by default with Python. The scripts by themselves run great on Linux and OS X systems since both come with Python and Tkinter by default (RedHat Desktop is the Linux distro in question).

I still wanted an easier, more fool-proof way to distribute the scripts between computers running different operating systems without having to maintain a different version for each OS. A co-worker suggested I look at Jython, which to my surprise happens to be awesome. Jython is a full implementation of Python on the Java VM. This is great for me because ALL of the systems I'm writing this stuff for have Java. You're probably wondering why I'm not just writing this stuff in Java to begin with. My answer is that these are really simple tasks that take way too much code and time to implement in Java, when they took a few hours and about a quarter of the code to do in Python.

I discovered, much to my surprise, that it ended up being as simple as switching from Tkinter to Swing for the GUIs and then fixing a few random things, like OS detection for configuring default paths of certain system files, before the code all ran flawlessly in Jython. Now onto the really fun part. I discovered that I could compile the code, including all dependencies and libraries for the Jython interpreter, into a single jar file, usually about 1MB total. All the user has to do is double-click on the jar and everything fires up and does its thing.
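To give an idea of what the OS detection fix can look like: under Jython `os.name` reports `'java'`, so the host OS has to come from the Java system properties instead. The sketch below is an illustration under assumptions, not the code from my scripts. The function names and the example paths are all hypothetical placeholders.

```python
import platform


def detect_os_name():
    """Return a host OS name such as 'Windows XP', 'Mac OS X' or 'Linux'."""
    try:
        # Under Jython the host OS is exposed through the Java system properties
        from java.lang import System
        return System.getProperty("os.name")
    except ImportError:
        # Plain CPython: fall back to the standard library
        return platform.system()


def default_config_dir(os_name):
    """Map an OS name to a default directory (placeholder paths only)."""
    name = os_name.lower()
    if name.startswith("windows"):
        return "C:\\example\\config"
    if name.startswith("mac") or name == "darwin":
        return "/Library/Application Support/example"
    # Anything else is treated as a Unix-like system
    return "/etc/example"
```

Keeping the detection separate from the path lookup means the lookup can be tested with plain strings on any machine.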

The first problem I ran into, though, was that OS X ships its own version of the Java VM, 1.5.0_16, instead of the latest 1.6.0_13. I eventually discovered that it was as simple as adding the -target option to the Java compiler and telling it to compile for the target version. Once all that was said and done I had a single jar file that contained all the necessary files to run the program by itself without any external dependencies.

If you're interested in knowing what command I use to compile to jar it is as follows:

C:\Program Files\Jython221\jythonc.bat -J "-target 1.5" --jar "$(NAME_PART).jar" --all --core "$(FULL_CURRENT_PATH)"

Mind you, I run this from Notepad++ on the currently open file. $(NAME_PART) is the name portion of the filename, excluding the extension. $(FULL_CURRENT_PATH) is, you guessed it, the full current path including the filename and extension.
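If you aren't using Notepad++, the same argument list can be put together by hand. This is only a sketch of how the two variables map onto a file path. The function name `build_jythonc_command` and its defaults are assumptions for illustration, and the options are simply the ones from the command above.

```python
import os


def build_jythonc_command(source_path, jythonc="jythonc.bat", target="1.5"):
    """Build the jythonc argument list for one source file (hypothetical helper)."""
    # Equivalent of $(NAME_PART): the file name without directory or extension
    name_part = os.path.splitext(os.path.basename(source_path))[0]
    return [
        jythonc,
        "-J", "-target " + target,   # passed through to the Java compiler
        "--jar", name_part + ".jar",
        "--all", "--core",
        source_path,                 # equivalent of $(FULL_CURRENT_PATH)
    ]
```

The resulting list could be handed to `subprocess.call`, which also sidesteps quoting problems with spaces in paths.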

13Mar/090

Power Set Generator

Recently I had a bout of programming withdrawal so I set out to write a power-set generator.

So a little background on the power set. The power set of a given set A is the set containing all subsets of A. Suppose that we have a set:

A = \{x,y,z\}

The power set of A would be:

\mathcal{P}(A) = \left\{\emptyset, \{x\}, \{y\}, \{z\}, \{x, y\}, \{x, z\}, \{y, z\}, \{x, y, z\}\right\}

Now, looking more carefully at the power set of A, you'll notice that the number of subsets it contains is 2 to the power of the cardinality of A, and that the empty set is always one of them.

2^{|A|} = 8
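As a quick sanity check of that count, here is a small sketch that builds the subsets with the standard library's `itertools.combinations`. This is not the approach taken in this post, just an independent way to confirm the number of subsets.

```python
from itertools import combinations


def power_set(base):
    """Return every subset of base as a list, smallest subsets first."""
    items = list(base)
    return [list(combo)
            for size in range(len(items) + 1)
            for combo in combinations(items, size)]


subsets = power_set(["x", "y", "z"])
print(len(subsets))  # 8, i.e. 2 ** 3
```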

The code I came up with for this is short and more or less simple[1]:

def PowerSet(base):
    power_set = []
    b = len(base)
    map(lambda g: power_set.append(map(lambda x: base[x],
        filter(lambda x: g[x], range(0, b)))),
        map(lambda value: map(lambda x: (value >> x) & 1, range(b - 1, -1, -1)),
        map(lambda value: value ^ (value / 2), range(0, 2**b))))
    return power_set

print PowerSet([1,2,3])

I figured out shortly after I wrote this that I had the right general idea, but in this particular case, since I'm not actually using set types, the Gray code I'm generating is useless for this sort of thing.

The point of generating Gray code for this is that Gray code is a binary counting scheme in which only one digit changes from one value to the next. It was originally designed so that electromechanical switches, which never change state in perfect synchrony, wouldn't produce spurious intermediate values while counting.
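The one-digit-at-a-time property is easy to see by printing a few values. A minimal sketch, using the same `value ^ (value >> 1)` conversion as the code above (there written as `value ^ (value / 2)`):

```python
def gray_codes(bits):
    """Return the Gray code sequence for the given number of bits."""
    return [value ^ (value >> 1) for value in range(2 ** bits)]


codes = gray_codes(3)
for code in codes:
    print(format(code, "03b"))

# Each neighbouring pair differs in exactly one bit position
one_bit_steps = all(bin(a ^ b).count("1") == 1
                    for a, b in zip(codes, codes[1:]))
```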

In this particular solution, Gray code means each subset differs from the previous one by a single element, so only one element has to be added or removed at a time. If I were actually using sets that would be faster, but since I'm not, it isn't. I'll probably rewrite it later to make it play nicer with Gray code.

The basic procedure here is that, given a set A, we count from 0 to 2^|A| - 1 and convert each value to binary Gray code. Each 1 marks an element of the base set that goes into the current subset, and each 0 marks an element that is left out of it.
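For what it's worth, here is a rough sketch of what a version that actually exploits Gray code might look like. It is not the rewrite mentioned above, just one possible take. It assumes the elements are hashable and distinct, and it maps bit k to the k-th element, which is the reverse of the bit order used in `PowerSet`.

```python
def gray_power_set(base):
    """Build the power set by toggling one element per step (sketch only)."""
    items = list(base)
    current = set()
    result = [frozenset(current)]      # Gray code 0 is the empty set
    for i in range(1, 2 ** len(items)):
        # Going from Gray code i-1 to Gray code i flips exactly one bit:
        # the position of the lowest set bit of i
        flipped = (i & -i).bit_length() - 1
        # Symmetric difference adds the element if absent, removes it if present
        current ^= {items[flipped]}
        result.append(frozenset(current))
    return result
```

Each step touches a single element of `current`, which is the saving the Gray code ordering is supposed to buy. Copying into a `frozenset` at every step still costs time proportional to the subset size, so the gain only really shows up if the subsets are consumed one at a time instead of stored.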

  1. I lied, I just got a little Pythonic functional programming happy.
2Mar/090

Python ETR and Google Code Hosting

Earlier tonight I was talking to Nick, a friend of mine who I've been working with recently on an ETR script. He's one of the webmasters for the University of Arizona Baja Racing Club.

I had already written my own ETR script for easy entry of time records from my Google Calendar into the ETR form website at my job, so I figured that I could just as easily modify it to work for him.

The main structure for his particular needs will be a main Google account which the script will authenticate with. All of the users that he wants to keep time records for will simply create a calendar and share it with that account. Then, when the script is run, it will authenticate with the gdata API and retrieve a list of events and their descriptions for each shared calendar on the account. All of this data is then written to a CSV file named after the calendar and the work week the work was done in.
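The last step, splitting a calendar's events into one CSV per work week, might look something like the sketch below. It is written for Python 3 and makes several assumptions: events are plain `(date, title, description)` tuples instead of gdata objects, a work week is an ISO week, and the `name_year-Wnn.csv` file naming is made up for illustration.

```python
import csv
import os
from collections import defaultdict


def rows_by_work_week(events):
    """Group (date, title, description) tuples by (ISO year, ISO week)."""
    weeks = defaultdict(list)
    for day, title, description in events:
        year, week, _weekday = day.isocalendar()
        weeks[(year, week)].append([day.isoformat(), title, description])
    return weeks


def write_calendar_csvs(calendar_name, events, out_dir):
    """Write one CSV per work week and return the paths written."""
    written = []
    for (year, week), rows in sorted(rows_by_work_week(events).items()):
        # Hypothetical naming scheme: <calendar>_<year>-W<week>.csv
        filename = "%s_%d-W%02d.csv" % (calendar_name, year, week)
        path = os.path.join(out_dir, filename)
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["date", "title", "description"])
            writer.writerows(rows)
        written.append(path)
    return written
```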

The script is more or less in the proof-of-concept stage and still needs a lot of polishing, but in the middle of doing this I found myself wanting a way to organize changes and revisions. I've briefly used Subversion before and never really made a habit of it, even though I should have. I suddenly remembered that Google Code Hosting provides Subversion access for open source projects, so I went ahead and made a project named pythonetr.

It took about 15 minutes for me to download and install TortoiseSVN, a Windows shell-integrated GUI for SVN. I then imported the first revision of the project into the project's Subversion repository on Google Code. After that I sent Nick a link to the page so that if there were any feature requests, bugs, or anything else he wanted me to know about, he could just submit them as issues there.

Shortly after that I made some modifications to the code, since the gdata API was returning events between the specified dates only if their corresponding UTC time was within the date range. This breaks things if you expect the date range to be interpreted in your own timezone: an event late in the evening local time can land on the next day in UTC and fall outside the range. So I made all the changes and committed the new revision.
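One way to handle this kind of mismatch is to fetch a slightly wider UTC range and then filter on the local date. The sketch below shows only the filtering half and is not the actual change committed to the project. It assumes events are `(title, start_utc)` pairs with naive UTC datetimes and a fixed UTC offset (for example -7 hours for Arizona), with no daylight saving handling.

```python
from datetime import timedelta


def events_in_local_range(events, start_date, end_date, utc_offset_hours):
    """Keep events whose local start date falls within [start_date, end_date]."""
    offset = timedelta(hours=utc_offset_hours)
    selected = []
    for title, start_utc in events:
        # Shift the naive UTC timestamp into local time before comparing dates
        local_start = start_utc + offset
        if start_date <= local_start.date() <= end_date:
            selected.append((title, local_start))
    return selected
```

With an offset of -7, an event stored as 05:30 UTC on March 2 starts at 22:30 on March 1 local time, so it belongs to March 1 and not March 2.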

It's really satisfying to have a personal record of all changes made to code, and I really should do this more often. I'd upload all my other projects, but I believe there's a lifetime limit of 10 projects total for Google Code Hosting.

If you're interested in checking out a copy of the project, you can do so with the following command:

svn checkout http://pythonetr.googlecode.com/svn/trunk/ pythonetr-read-only