A Little Off Code, Computers, Photography and Guns

12Dec/100

Choosing an SSD (A more different S)

I've been periodically going back and revisiting the results for my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page although I can't figure out why they decided it didn't need to be there in the first place. I see it this way: SSD's are first and foremost a storage device, you'd think that one of the most important features that should be listed with every SSD is the capacity at least.

Anyway, this change broke my script which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML a more often lately despite my original prejudice against it for being a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2] which is an incredibly useful "language" for selecting data from an XML document.

Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.

First off lets consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor, or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be sub $300[8].

The above requirements give us the following power search[9] which we will be using as the source for the script:

1
2
3
4
5
6
url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

Now the first thing that made me cringe as I was rewriting this was the fact that I would basically have no choice but to load each individual product page from the results page as capacity is no longer included in either the description or the features list of each product in the results page. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSD's. The following is the full source code of the parser:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
    "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
    "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
    "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
    ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
    ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, etree.HTMLParser())
root = tree.getroot()

items = []

for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]
   
    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)
    for feature, value in features:
        if value is not None and feature in featureMap:
            # If it's a speed feature parse out the speed
            if featureMap[feature] in ("read", "write"):
                item[featureMap[feature]] = min(map(lambda x: int(x), speed_re.findall(value)))
            # If it's a capacity feature, parse out the capacity
            elif featureMap[feature] == "capacity":
                item[featureMap[feature]] = min(map(lambda x: int(x), capacity_re.findall(value)))
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[featureMap[feature]] = value.strip()
               
    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))
   
    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log(abs(item["read"] - item["write"])) + 1) + item["price"])
        item["score"] = score
        items.append(item)
       
   
sorted = {}
for item in items:
    # Open addressing like in a hash table, so we don't wind
    # up with any collisions, unlikely but good practice anyway
    score = item["score"]
    while score in sorted:
        score += 1
    sorted[score] = item

sortOrder = sorted.keys()
sortOrder.sort()
sortOrder.reverse()

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for key in sortOrder:
    item = sorted[key]
    print '\t'.join(map(lambda x: str(item[x]), headers))

At this point if you've gone through and read the entire script you'll probably notice that I've made a slight change to the scoring equation, it has been changed from the following:

\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}

To the following:
\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{(\log_{10}(|\text{Read} - \text{Write}|) + 1) \times \text{Price}}

I discovered that using the difference in read//write speed heavily penalized drives with anything greater than 10MB/s difference. So I figured that it may be a little more subtle to simply penalize drives based on the magnitude of the difference.

Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?". And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.


Manufacturer: OCZ OCZ G.Skill OCZ
Series: RevoDrive Vertex 2 Phoenix Pro Series Agility 2
Capacity: 120GB 180GB 120GB 120GB
Read: 540MB/s 285MB/s 285MB/s 285MB/s
Write: 490MB/s 275MB/s 275MB/s 275MB/s
Item: N82E16820227578[10] N82E16820227602[11] N82E16820231378[12] N82E16820227593[13]
Price: $299.99 $294.99 $214.99 $214.99

As you can see the RevoDrive far out-scores all the rest of the SSD's considered in this analysis. The main reason is that they've essentially included two 60GB SSD's on the same card and you're expected to perform software raid on them in your own system[14]. Despite the incredible speeds they boast I don't think I would purchase one of these to use as my OS//Program disk because compatibility is a major limitation. You must be sure that your motherboard's BIOS supports booting via PCI-Express cards. And last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM which is a major downside as far as I'm concerned.

The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra $80.

Which leaves me with the last two disks which are as far as my analysis is concerned, identical. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k Random write test which seems a bit... optimistic. The OCZ makes no such claim and as far as I'm concerned both disks are more less the same thing. So it's pretty much up to brand preference at this point.

  1. I've already sent feedback to them suggesting that they fix this. []
  2. Only if the XML parser you're using supports it, which it seems is not a whole lot of them. At least not all of them support the full specification which is annoying since nobody really seems to document which bits and pieces they support and which whey don't. []
  3. Although not necessarily XML depending on the particular doctype you've chosen, Newegg's is transitional HTML. []
  4. lxml: http://codespeak.net/lxml/ []
  5. xml.etree.ElementTree: http://docs.python.org/library/xml.etree.elementtree.html []
  6. Some of the PCI-Express SSD's are stupidly fast and more expensive except that it doesn't look like any of them support TRIM yet which is a major problem for me. []
  7. It is rare that I have a matured (read: haven't reformatted in a while) install of windows along with all of my most commonly used programs and games that exceeds 60GB so I estimate that doubling this should accommodate for any sudden urges to install really big things. []
  8. I can't really justify spending much more than $300 on a single storage device. It had better be one hell of a storage device if I ever find myself spending more than $300 on it. []
  9. This will likely need to be updated at least once a month as Newegg is constantly adding new criteria and changing things. []
  10. OCZ RevoDrive []
  11. OCZ Vertex 2 []
  12. G.SKILL Phoenix Pro Series []
  13. OCZ Agility 2 []
  14. They show up as two separate physical devices despite being located on the same card. []
4Dec/100

Automagic TV Show Calendar

A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally enter the site on the main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.

To me, this is like opening presents on christmas day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to create events on my google calendar automatically for unaired episodes of my favorite TV shows.

I've previously written python scripts that interface with gdata but I find their implementation for python to be kind of cumbersome to deal with so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. Turns out you can use the built-in JSON module in python for creating the necessary structures.

For parsing the results I got from tvrage I ended up using python's xml.etree.ElementTree which was simple enough to setup to retrieve only the information for each episode I was interested in.[5]

I had a bit of trouble initially with adding events to google calendar. This stemmed from the fact that google often will return an HTTP Redirect which includes a url with an appended gsession attribute which you're supposed to resubmit the exact data from the first request to. Once I figured this out it was turtles all the way down. I even managed to get the whole script multi-threaded to speed things up since it's impossible to perform batch-requests with JSON-C.

I should note that for the configuration file the calendar should be the "Calendar ID" for the calendar that can be found by looking at the settings page for the individual calendar, it is grouped with the XML and iCal feeds.

ShowList.txt:[6]

1
2
3
4
5
6
7
8
9
10
11
12
Castle  19267
House   3908
Bones   2870
Big Bang Theory, The    8511
Mentalist, The  18967
Rizzoli & Isles 24996
Venture Bros., The  6270
Top Gear    6753
Mythbusters 4605
Archer  23354
NCIS    4628
Community   22589

Config.cfg:

1
2
3
4
[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com

AirDate.py:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id
   
    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()
        # Pares XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)
        self.result = []
       
        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text
               
                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)
               
                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))
                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date
                    pass

class addEvent(Thread):
    # Thread for adding events to google calendar
   
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode
   
    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
       
        # Build request including necessary headers and data
        calReq = urllib2.Request("http://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        # Execute the request
        calRes = urllib2.urlopen(calReq)
        # Get the redirect url (gsession appended)
        redirectReq = urllib2.Request(calRes.geturl(), entry, header)
        try:
            redirectRes = urllib2.urlopen(redirectReq)
        except HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass
   
# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)
   
    # Parse JSON-C
    data = json.loads(calRes.read())
    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
           
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")
   
    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)
    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()
   
    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }
   
    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)
   
    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)
   
    threadQueue = []
    # For each episode we've retrieved that is unaired
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)
       
    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)
   
    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)
   
    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()

This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events for exact air times of each episode which I'm not really all that concerned about. So I use this script still to add all-day events for each episode which is easier to view//see when there's a new episode.

I did write another script using their XML API but that will have to wait for another post.

  1. http://tvrage.com/ []
  2. http://services.tvrage.com/ []
  3. Data API Developer's Guide: The Protocol []
  4. Google's own flavor of JSON which is almost identical to plain old JSON. []
  5. I only really needed the original air date, title, season number and episode number. []
  6. You can find the show_id via the show search found on their XML API page. []
5Mar/093

Podcast Downloading on FreeNAS.

I was thinking earlier this week about how files move too and from my file-server. I discovered that it's more convenient to have the server "pull" files to itself. This actually made me think a little bit more about podcast shows I watch on a pretty regular basis. I thought to myself "Wouldn't it be great if my server would get new episodes for me?". Once all that's done I can easily pull up the Podcasts directory on my friend's PS3 (which this will broadcast to using UPnP DLNA) and watch any new episodes that happen to be there.

I then set out almost immediately to figure out the simplest way to do this. I stumbled upon a script called BashPodder written by Linc. So I figured I'd just give it a try in it's original state. That turned out badly since I discovered that this was meant specifically for linux-based systems. My file-server runs FreeNAS a FreeBSD based OS. FreeBSD has an equivalent but different set of basic system tools. Fetch instead of wget, xml instead of xsltproc and so on.

After about 8 hours of learning my way around the basic set of FreeBSD tools and just generally refreshing myself on shell scripting I had heavily modified the BashPodder script to work on FreeNAS.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#!/bin/bash

cd "`dirname "$0"`"
script_dir=`pwd`

podcast_dir=/mnt/Main/Content/TV\ Shows/Podcasts/

while read podcast
do
    cd "$script_dir"
    file=$(fetch -q -o - $podcast | xml tr parse_enclosure.xsl)
    cd "$podcast_dir"
    data_dir=$(fetch -q -o - $podcast | xml sel -T -t -v "//channel/title" | sed "s/ (.*)//")
    echo \"$data_dir\"
    mkdir -p "$data_dir"
    chown ftp:wheel "$data_dir"
    cd "$data_dir"
    for url in $file
    do
        filename=`echo $url | sed "s/^\(.*\)*\///g"`
        echo $filename
        if [ -z "`grep $url "episodes.txt"`" ]; then
            fetch -m -q $url
            echo $url >> "episodes.txt"
            if [ -n "`echo $filename | grep -v ".m4v"`" ]; then
                mv $filename `basename $filename m4v`mp4
                chown ftp:wheel `basename $filename m4v`mp4
            else
                chown ftp:wheel $filename
            fi
        fi
    done
done < feeds.list

/etc/rc.d/fuppes updatedb

I think the only thing I didn't end up changing at all was the xml stylesheet Linc had written since it works perfectly. The basic flow of this program is thus:

  1. Move into the directory that the script exists in.
  2. For each line in feeds.list do the following.
    1. Move into the scripts directory (there are files we need to parse the feed here).
    2. Get the rss feed.
    3. Parse out the title of the feed.
    4. Parse the urls for each episode in the feed.
    5. Move into the podcast directory.
    6. Make a directory of the same name as the title if necessary.
    7. Move into the directory we just made (or already exists).
    8. For each url we parsed do the following.
      1. Determine the filename from the url.
      2. If the episode is not in episodes.txt do the following.
        1. Download the episode.
        2. Add the episode to episodes.txt
        3. Change the owner to ftp. (We're running this as root).
  3. Update the Fuppes database.

After all that was done and over with I just added a new file extension to the Fuppes (UPnP DLNA server) configuration for telling the PS3 that m4v files are really just mp4 files.

Also I wrote a slightly modified version of the above script for just getting lists of episodes. This is useful because if I don't want to download every single episode in each feed (for when I add new feeds to the feeds list) all I have to do is remove the urls from the episodes.txt file that is generated and run the main script. The modification is done in the for loop that handles episodes of each feed:

1
2
3
filename=`echo $url | sed "s/^\(.*\)*\///g"`
echo $filename
grep -q $url "episodes.txt" || echo $url >> "episodes.txt"

So instead of downloading each file it just makes sure it's in the episodes.txt file. If you've got any suggestions for optimization of the script let me know!

If you'd like to check out a read-only copy of the script and necessary configurations:

1
svn checkout http://svn.xp-dev.com/svn/bemasher-FreeNASPodder/ FreeNASPodder

Keep in mind that the subversion is the most up-to-date copy I'll have available and it may often contain broken code or errors though I'll try to keep that down to a minimum.