Choosing an SSD (A more different S)
I've been periodically revisiting the results from my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like Newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page, although I can't figure out why they decided it didn't need to be there in the first place. I see it this way: SSDs are first and foremost storage devices, so you'd think that capacity, at the very least, would be listed with every one.
Anyway, this change broke my script, which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML more often lately despite my original prejudice against it for being a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2], which is an incredibly useful "language" for selecting data from an XML document.
Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.
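To make the XPath idea concrete, here is a minimal sketch (Python 3) using only the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. The markup is a made-up, well-formed stand-in for a results page: the class and title values mirror the ones the script below looks for, but the snippet and the example.com links are assumptions, not Newegg's real HTML.

```python
from xml.etree import ElementTree

# Hypothetical, well-formed stand-in for a search results page.
SAMPLE = """
<html><body>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/1">Drive A</a>
  </div>
  <div class="sidebar">
    <a title="View Details" href="http://example.com/ad">Not a product</a>
  </div>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/2">Drive B</a>
  </div>
</body></html>
"""

def product_links(markup):
    """Return the href of the details link in every itemCell div."""
    root = ElementTree.fromstring(markup)
    links = []
    # ".//div[@class='itemCell']" selects matching divs at any depth.
    for cell in root.findall(".//div[@class='itemCell']"):
        anchor = cell.find(".//a[@title='View Details']")
        if anchor is not None:
            links.append(anchor.get("href"))
    return links

print(product_links(SAMPLE))
```

The sidebar link is skipped because its parent div has a different class, which is exactly the kind of filtering that gets ugly with regular expressions.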
First off, let's consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be under $300[8].
The above requirements give us the following power search[9] which we will be using as the source for the script:
```python
url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
      "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
      "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
      "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
      ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
      ",4215:41071&bop=And&Pagesize=100"
```
Now, the first thing that made me cringe as I was rewriting this was that I have basically no choice but to load each individual product page linked from the results page, since capacity is no longer included in either the description or the features list of each product there. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSDs. The following is the full source code of the parser:
```python
# Written for Python 2 (print statements, list-returning map and dict.keys).
import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
      "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
      "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
      "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
      ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
      ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, parser)
root = tree.getroot()

items = []
for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]

    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)

    for feature, value in features:
        if value is not None and feature in featureMap:
            key = featureMap[feature]
            # If it's a speed feature parse out the speed
            if key in ("read", "write"):
                speeds = map(int, speed_re.findall(value))
                # Skip values with no recognizable speed instead of crashing in min()
                if speeds:
                    item[key] = min(speeds)
            # If it's a capacity feature, parse out the capacity
            elif key == "capacity":
                capacities = map(int, capacity_re.findall(value))
                if capacities:
                    item[key] = min(capacities)
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[key] = value.strip()

    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))

    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
        # max(..., 1) keeps math.log() from failing when read == write
        difference = max(abs(item["read"] - item["write"]), 1)
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log(difference) + 1) + item["price"])
        item["score"] = score
        items.append(item)

# Named "ranked" rather than "sorted" so the builtin isn't shadowed
ranked = {}
for item in items:
    # Open addressing like in a hash table, so we don't wind
    # up with any collisions, unlikely but good practice anyway
    score = item["score"]
    while score in ranked:
        score += 1
    ranked[score] = item

sortOrder = ranked.keys()
sortOrder.sort()
sortOrder.reverse()

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for key in sortOrder:
    item = ranked[key]
    # .get() so a listing without e.g. a brand doesn't raise KeyError
    print '\t'.join(map(lambda x: str(item.get(x, '')), headers))
```
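The per-item page loads are the slow part of the script above. One way the multi-threading could look is sketched here in Python 3 with the standard library's concurrent.futures (the script itself is Python 2). `fetch_features` and `fetch_all` are hypothetical names, and `fetch_features` only fabricates a value so the example runs offline; a real version would parse the product page instead.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_features(link):
    """Hypothetical stand-in for loading and parsing one product page."""
    # A real version would call etree.parse(link, etree.HTMLParser()) here.
    return {"link": link, "capacity": 120}

def fetch_all(links, workers=8):
    """Fetch every link concurrently; results keep the order of `links`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_features, links))

links = ["http://example.com/item/%d" % n for n in range(5)]
for item in fetch_all(links):
    print(item["link"], item["capacity"])
```

Because the work is almost entirely waiting on the network, threads are a reasonable fit here even with Python's global interpreter lock.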
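At this point, if you've read the entire script, you'll probably notice that I've made a slight change to the scoring equation. It has been changed from the previous version, which used the raw read/write difference, to the following, where R and W are the sequential read and write speeds in MB/s, C is the capacity in GB and P is the price in dollars:

$$\text{score} = \frac{R \cdot W \cdot C}{\left(\ln\lvert R - W\rvert + 1\right) + P}$$

I discovered that using the difference in read/write speed heavily penalized drives with anything greater than a 10MB/s difference. So I figured that it may be a little more subtle to simply penalize drives based on the (logarithmic) magnitude of the difference.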
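As a worked example, the scoring calculation from the script can be pulled out into a small function and fed the advertised numbers from the results table below. This is a sketch in Python 3; the function name `score` and the `max(..., 1)` guard against taking the logarithm of zero are assumptions added for the example, not part of the original equation.

```python
import math

def score(read, write, capacity, price):
    """Score a drive: speed and capacity on top, imbalance and price below."""
    # max(..., 1) avoids log(0) when read and write speeds are equal
    # (an assumption added here, not part of the original equation).
    imbalance = math.log(max(abs(read - write), 1))
    return (read * write * capacity) / ((imbalance + 1) + price)

# (name, read MB/s, write MB/s, capacity GB, price $) from the results table
drives = [
    ("OCZ RevoDrive", 540, 490, 120, 299.99),
    ("OCZ Vertex 2", 285, 275, 180, 294.99),
    ("G.Skill Phoenix Pro", 285, 275, 120, 214.99),
]
for name, read, write, capacity, price in drives:
    print("%-20s %10.0f" % (name, score(read, write, capacity, price)))
```

Since the logarithm of the read/write gap is tiny next to the price, the denominator is dominated by price and the ranking is driven mostly by speed and capacity per dollar.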
Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?". And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.
| Manufacturer: | OCZ | OCZ | G.Skill | OCZ |
| Series: | RevoDrive | Vertex 2 | Phoenix Pro Series | Agility 2 |
| Capacity: | 120GB | 180GB | 120GB | 120GB |
| Read: | 540MB/s | 285MB/s | 285MB/s | 285MB/s |
| Write: | 490MB/s | 275MB/s | 275MB/s | 275MB/s |
| Item: | N82E16820227578[10] | N82E16820227602[11] | N82E16820231378[12] | N82E16820227593[13] |
| Price: | $299.99 | $294.99 | $214.99 | $214.99 |
As you can see, the RevoDrive far out-scores the rest of the SSDs considered in this analysis. The main reason is that they've essentially put two 60GB SSDs on the same card, and you're expected to set up software RAID across them on your own system[14]. Despite the incredible speeds they boast, I don't think I would purchase one of these to use as my OS/program disk because compatibility is a major limitation: you must be sure that your motherboard's BIOS supports booting from PCI-Express cards. Last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM, which is a major downside as far as I'm concerned.
The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra $80.
Which leaves me with the last two disks which are, as far as my analysis is concerned, identical. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k random write test, which seems a bit... optimistic. The OCZ makes no such claim, and as far as I'm concerned both disks are more or less the same thing. So it's pretty much up to brand preference at this point.
- I've already sent feedback to them suggesting that they fix this. [↩]
- Only if the XML parser you're using supports it, which it seems not a whole lot of them do. At least not all of them support the full specification, which is annoying since nobody really seems to document which bits and pieces they support and which they don't. [↩]
- Although not necessarily XML depending on the particular doctype you've chosen, Newegg's is transitional HTML. [↩]
- lxml: http://codespeak.net/lxml/ [↩]
- xml.etree.ElementTree: http://docs.python.org/library/xml.etree.elementtree.html [↩]
- Some of the PCI-Express SSDs are stupidly fast, and more expensive, but it doesn't look like any of them support TRIM yet, which is a major problem for me. [↩]
- It is rare that I have a matured (read: haven't reformatted in a while) install of Windows along with all of my most commonly used programs and games that exceeds 60GB, so I estimate that doubling this should accommodate any sudden urges to install really big things. [↩]
- I can't really justify spending much more than $300 on a single storage device. It had better be one hell of a storage device if I ever find myself spending more than $300 on it. [↩]
- This will likely need to be updated at least once a month as Newegg is constantly adding new criteria and changing things. [↩]
- OCZ RevoDrive [↩]
- OCZ Vertex 2 [↩]
- G.SKILL Phoenix Pro Series [↩]
- OCZ Agility 2 [↩]
- They show up as two separate physical devices despite being located on the same card. [↩]
Automagic TV Show Calendar
A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally land on the site's main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.
To me, this is like opening presents on Christmas day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to automatically create events on my Google Calendar for unaired episodes of my favorite TV shows.
I've previously written Python scripts that interface with gdata, but I find their Python implementation kind of cumbersome to deal with, so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. Turns out you can use the built-in JSON module in Python for creating the necessary structures.
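For a sense of how little is involved compared to hand-building XML, here is a minimal sketch (Python 3) of the entry structure that AirDate.py below posts. The helper name `build_quick_add_entry` and the sample event text are made up for illustration; the dictionary shape is the one used in the script.

```python
import json

def build_quick_add_entry(text):
    """Serialize a quick-add calendar entry to a JSON string."""
    # Same shape AirDate.py posts: the event text goes in "details"
    # and "quickAdd" is set to True.
    entry = {"data": {"details": text, "quickAdd": True}}
    return json.dumps(entry)

# Hypothetical event text in the "date show - episode code" form the script builds.
body = build_quick_add_entry("2011-01-24 Example Show - S03E13")
print(body)
```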
For parsing the results I got from tvrage I ended up using Python's xml.etree.ElementTree, which was simple enough to set up to retrieve only the information I was interested in for each episode.[5]
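The parsing step can be sketched on its own, without any network access, as follows (Python 3). The sample XML is made up: its element names follow the paths AirDate.py below walks (`Episodelist/Season`, `episode`, `seasonnum`, `airdate`, `title`), but the root element and the values are assumptions, and `unaired` is a hypothetical helper name.

```python
from datetime import date
from xml.etree import ElementTree

# Made-up sample in the shape AirDate.py expects from episode_list.php.
SAMPLE = """
<Show>
  <name>Example Show</name>
  <Episodelist>
    <Season no="1">
      <episode><seasonnum>01</seasonnum><airdate>2010-09-20</airdate><title>Pilot</title></episode>
      <episode><seasonnum>02</seasonnum><airdate>2011-00-00</airdate><title>TBA</title></episode>
      <episode><seasonnum>03</seasonnum><airdate>2011-03-14</airdate><title>Later</title></episode>
    </Season>
  </Episodelist>
</Show>
"""

def unaired(xml_text, today):
    """Return (airdate, episode code, title) for episodes airing on or after today."""
    results = []
    root = ElementTree.fromstring(xml_text)
    for season in root.findall("Episodelist/Season"):
        season_num = int(season.get("no"))
        for episode in season.findall("episode"):
            parts = episode.find("airdate").text.split("-")
            year, month, day = (int(p) for p in parts)
            try:
                airdate = date(year, month, day)
            except ValueError:
                continue  # unknown parts of the date come through as 00
            if airdate >= today:
                episode_num = int(episode.find("seasonnum").text)
                code = "S%02dE%02d" % (season_num, episode_num)
                results.append((airdate, code, episode.find("title").text))
    return results

for airdate, code, title in unaired(SAMPLE, date(2011, 1, 1)):
    print(airdate, code, title)
```

Passing `today` in as a parameter, rather than calling `date.today()` inside, just makes the example repeatable.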
I had a bit of trouble initially with adding events to Google Calendar. This stemmed from the fact that Google will often return an HTTP redirect to a URL with an appended gsession attribute, and you're supposed to resubmit the exact data from the first request to that URL. Once I figured this out it was smooth sailing. I even managed to get the whole script multi-threaded to speed things up, since it's impossible to perform batch requests with JSON-C.
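The redirect dance boils down to "post, and if told to go elsewhere, post the very same body there". A minimal sketch of that logic in Python 3, with the HTTP layer swapped out for a fake so it runs offline: `post_with_gsession`, `send` and `fake_send` are hypothetical names, and the `gsession=abc123` query string is invented for the example rather than taken from Google's actual responses.

```python
def post_with_gsession(send, url, body, headers, max_redirects=3):
    """POST body to url; on a redirect, resend the same body to the new URL.

    `send` is a hypothetical callable (url, body, headers) ->
    (status, location) standing in for the real HTTP layer.
    """
    for _ in range(max_redirects + 1):
        status, location = send(url, body, headers)
        if status in (301, 302, 303, 307) and location:
            url = location  # e.g. the original URL plus a session attribute
            continue
        return status, url
    raise RuntimeError("too many redirects")

# Fake transport: redirect once, then accept the resubmitted entry.
def fake_send(url, body, headers):
    if "gsession=" not in url:
        return 302, url + "&gsession=abc123"
    return 201, None

status, final_url = post_with_gsession(
    fake_send, "http://example.com/feeds/full?alt=jsonc", '{"data": {}}', {})
print(status, final_url)
```

The important detail is that `body` and `headers` are passed through unchanged on the second attempt, which is the part that is easy to miss.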
I should note that the calendar entry in the configuration file should be the "Calendar ID" found on the settings page for the individual calendar; it is grouped with the XML and iCal feeds.
ShowList.txt:[6]
```
Castle	19267
House	3908
Bones	2870
Big Bang Theory, The	8511
Mentalist, The	18967
Rizzoli & Isles	24996
Venture Bros., The	6270
Top Gear	6753
Mythbusters	4605
Archer	23354
NCIS	4628
Community	22589
```
Config.cfg:
```
[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com
```
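To produce the encoded password for this file, and to read it back, something along these lines should do (Python 3, standard library only). `encode_password` and `load_credentials` are hypothetical helper names, and the password is obviously a placeholder. Keep in mind that base64 only obscures the value; it is not encryption.

```python
import base64
import configparser

def encode_password(plain):
    """Base64-encode a password for storage in Config.cfg (obscured, not encrypted)."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")

def load_credentials(cfg_text):
    """Parse the [Credentials] section and decode the stored password."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    section = config["Credentials"]
    password = base64.b64decode(section["password"]).decode("utf-8")
    return section["username"], password, section["calendar"]

cfg = """[Credentials]
username = someuser@gmail.com
password = %s
calendar = somecalendarid@group.calendar.google.com
""" % encode_password("not-a-real-password")

print(load_credentials(cfg))
```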
AirDate.py:
```python
# Written for Python 2 (urllib2, ConfigParser, print statement).
import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id
        # Start empty so thread.result exists even if the fetch fails
        self.result = []

    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()

        # Parse XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)

        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text
                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)
                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))
                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date)
                    pass

# Thread for adding events to google calendar
class addEvent(Thread):
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode

    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
        # Build request including necessary headers and data
        calReq = urllib2.Request("http://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        # Execute the request
        calRes = urllib2.urlopen(calReq)
        # Get the redirect url (gsession appended)
        redirectReq = urllib2.Request(calRes.geturl(), entry, header)
        try:
            redirectRes = urllib2.urlopen(redirectReq)
        except urllib2.HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass

# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)

    # Parse JSON-C
    data = json.loads(calRes.read())

    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")

    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)

    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()

    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }

    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)
    show_list.close()

    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)

    threadQueue = []
    # For each show we're interested in
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)

    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            # (episode[11:] drops the leading "YYYY-MM-DD " date)
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)

    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)

    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
```
This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows, provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events for the exact air time of each episode, which I'm not really all that concerned about. So I still use this script to add all-day events for each episode, which makes it easier to see when there's a new episode.
I did write another script using their XML API but that will have to wait for another post.
- http://tvrage.com/ [↩]
- http://services.tvrage.com/ [↩]
- Data API Developer's Guide: The Protocol [↩]
- Google's own flavor of JSON which is almost identical to plain old JSON. [↩]
- I only really needed the original air date, title, season number and episode number. [↩]
- You can find the show_id via the show search found on their XML API page. [↩]
Podcast Downloading on FreeNAS.
I was thinking earlier this week about how files move to and from my file-server. I discovered that it's more convenient to have the server "pull" files to itself. This actually made me think a little bit more about podcast shows I watch on a pretty regular basis. I thought to myself "Wouldn't it be great if my server would get new episodes for me?". Once all that's done I can easily pull up the Podcasts directory on my friend's PS3 (which this will broadcast to using UPnP DLNA) and watch any new episodes that happen to be there.
I then set out almost immediately to figure out the simplest way to do this. I stumbled upon a script called BashPodder written by Linc, so I figured I'd just give it a try in its original state. That turned out badly, since I discovered that it was meant specifically for Linux-based systems. My file-server runs FreeNAS, a FreeBSD-based OS. FreeBSD has an equivalent but different set of basic system tools: fetch instead of wget, xml instead of xsltproc and so on.
After about 8 hours of learning my way around the basic set of FreeBSD tools and just generally refreshing myself on shell scripting I had heavily modified the BashPodder script to work on FreeNAS.
```bash
#!/bin/bash
cd "`dirname "$0"`"
script_dir=`pwd`
podcast_dir=/mnt/Main/Content/TV\ Shows/Podcasts/

while read podcast
do
    cd "$script_dir"
    file=$(fetch -q -o - $podcast | xml tr parse_enclosure.xsl)
    cd "$podcast_dir"
    data_dir=$(fetch -q -o - $podcast | xml sel -T -t -v "//channel/title" | sed "s/ (.*)//")
    echo \"$data_dir\"
    mkdir -p "$data_dir"
    chown ftp:wheel "$data_dir"
    cd "$data_dir"
    for url in $file
    do
        filename=`echo $url | sed "s/^\(.*\)*\///g"`
        echo $filename
        if [ -z "`grep $url "episodes.txt"`" ]; then
            fetch -m -q $url
            echo $url >> "episodes.txt"
            if [ -n "`echo $filename | grep -v ".m4v"`" ]; then
                mv $filename `basename $filename m4v`mp4
                chown ftp:wheel `basename $filename m4v`mp4
            else
                chown ftp:wheel $filename
            fi
        fi
    done
done < feeds.list

/etc/rc.d/fuppes updatedb
```
I think the only thing I didn't end up changing at all was the xml stylesheet Linc had written since it works perfectly. The basic flow of this program is thus:
- Move into the directory that the script exists in.
- For each line in feeds.list do the following.
  - Move into the script's directory (there are files we need to parse the feed here).
  - Get the RSS feed.
  - Parse out the title of the feed.
  - Parse the URLs for each episode in the feed.
  - Move into the podcast directory.
  - Make a directory of the same name as the title if necessary.
  - Move into the directory we just made (or that already exists).
  - For each URL we parsed do the following.
    - Determine the filename from the URL.
    - If the episode is not in episodes.txt do the following.
      - Download the episode.
      - Add the episode to episodes.txt.
      - Change the owner to ftp. (We're running this as root.)
- Update the Fuppes database.
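For readers less at home in shell, the inner part of that flow (work out the filename, skip anything already in episodes.txt, download and record the rest) can be sketched in Python 3 roughly as follows. `filename_from_url`, `sync_feed` and the injected `download` callable are hypothetical names; the shell script uses fetch for the download step and grep for the lookup, so this version compares whole lines where grep matches substrings.

```python
import os
import tempfile

def filename_from_url(url):
    """Everything after the last slash, like the sed expression in the script."""
    return url.rsplit("/", 1)[-1]

def sync_feed(urls, episodes_path, download):
    """Download URLs not yet listed in episodes_path and record them.

    `download` is a hypothetical callable taking a URL; it stands in
    for the fetch call in the shell script.
    """
    seen = set()
    if os.path.exists(episodes_path):
        with open(episodes_path) as handle:
            seen = set(line.strip() for line in handle)
    fetched = []
    with open(episodes_path, "a") as handle:
        for url in urls:
            if url in seen:
                continue
            download(url)
            handle.write(url + "\n")
            seen.add(url)
            fetched.append(filename_from_url(url))
    return fetched

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "episodes.txt")
    urls = ["http://example.com/show/ep1.m4v", "http://example.com/show/ep2.m4v"]
    print(sync_feed(urls, log, download=lambda url: None))  # first run
    print(sync_feed(urls, log, download=lambda url: None))  # second run
```

Running it twice against the same episodes.txt shows the point of the file: the second pass finds nothing new to download.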
After all that was done and over with, I just added a new file extension to the Fuppes (UPnP DLNA server) configuration to tell the PS3 that m4v files are really just mp4 files.
I also wrote a slightly modified version of the above script that just gets lists of episodes. This is useful when I add new feeds to the feeds list and don't want to download every single episode in each feed: all I have to do is remove the URLs I do want from the generated episodes.txt file and run the main script. The modification is in the for loop that handles the episodes of each feed:
```bash
filename=`echo $url | sed "s/^\(.*\)*\///g"`
echo $filename
grep -q $url "episodes.txt" || echo $url >> "episodes.txt"
```
So instead of downloading each file it just makes sure it's in the episodes.txt file. If you've got any suggestions for optimization of the script let me know!
If you'd like to check out a read-only copy of the script and necessary configurations:
```bash
svn checkout http://svn.xp-dev.com/svn/bemasher-FreeNASPodder/ FreeNASPodder
```
Keep in mind that the Subversion repository holds the most up-to-date copy I have available, so it may often contain broken code or errors, though I'll try to keep that to a minimum.

