## Choosing an SSD (A more different S)

I've been periodically revisiting the results of my SSD analysis script for newegg.com. The last few times I ran it I noticed that it was broken. It looks like Newegg has modified a few things in their power search results page. One thing which is a little obnoxious[1] is that they no longer include the capacity in the description of the item or as a feature in the feature list when viewing the results page. This only seems to be an issue on the SSD page, and I can't figure out why they decided it didn't need to be there. I see it this way: SSDs are first and foremost storage devices, so you'd think capacity, at the very least, would be listed with every one of them.

Anyway, this change broke my script, which I had been meaning to rewrite since regular expressions are definitely not the most efficient or cleanest way to parse HTML. I've been working with XML more often lately despite my original prejudice against it as a really bloated way to transfer data. One thing I discovered that makes XML a lot less painful is XPath[2], an incredibly useful "language" for selecting data from an XML document.

Once I had gone through and read several tutorials and references about XPath I set out to use it in writing a show calendar script which parses data from tvrage.com's XML API. After that useful exercise I realized I could very easily and cleanly apply it to my SSD analysis script. Since HTML is similar in nature to XML[3] I set out to parse Newegg's results page using XPath. This presented the first problem: Newegg's page isn't strictly XML or even XHTML for that matter. After a great deal of googling and research I landed on the lxml[4] website which as it turns out has an HTML parser for navigating and extracting data from HTML in the same way you would from an xml.etree.ElementTree[5]. With this in mind I immediately began rewriting the script.
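To give a feel for the XPath-style selection described above, here is a minimal Python 3 sketch. It uses the standard library's `xml.etree.ElementTree` on a small, well-formed fragment rather than lxml on the live page. The markup in `PAGE` is made up for illustration (only the class names and feature labels mirror the ones the script looks for), and `parse_items` is a hypothetical helper, not part of the original script.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page. The class names and
# feature labels mirror the script's selectors; the markup itself is invented.
PAGE = """
<html><body>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/1">Drive A</a>
    <ul><li class="priceFinal"><span>$</span><strong>214</strong><sup>.99</sup></li></ul>
    <dl>
      <dt>Capacity</dt><dd>120GB</dd>
      <dt>Sequential Access - Read</dt><dd>up to 285 MB/s</dd>
    </dl>
  </div>
  <div class="itemCell">
    <a title="View Details" href="http://example.com/item/2">Drive B</a>
    <ul><li class="priceFinal"><span>$</span><strong>294</strong><sup>.99</sup></li></ul>
    <dl>
      <dt>Capacity</dt><dd>180GB</dd>
      <dt>Sequential Access - Read</dt><dd>285MB/s</dd>
    </dl>
  </div>
</body></html>
"""

SPEED_RE = re.compile(r"(\d+)\s?MB/s")
CAPACITY_RE = re.compile(r"(\d+)GB")


def parse_items(page):
    """Select each item cell with path expressions and pull out a few fields."""
    root = ET.fromstring(page)
    items = []
    for cell in root.findall(".//div[@class='itemCell']"):
        link = cell.find(".//a[@title='View Details']")
        # Children of the price element: currency sign, dollars, cents.
        price_parts = [n.text for n in cell.findall(".//li[@class='priceFinal']/*")]
        # Pair each feature name (<dt>) with its value (<dd>).
        specs = dict(zip(
            (dt.text for dt in cell.findall(".//dl/dt")),
            (dd.text for dd in cell.findall(".//dl/dd")),
        ))
        items.append({
            "link": link.attrib["href"],
            "price": float("".join(price_parts[1:])),  # skip the "$"
            "capacity": int(CAPACITY_RE.search(specs["Capacity"]).group(1)),
            "read": int(SPEED_RE.search(specs["Sequential Access - Read"]).group(1)),
        })
    return items


for item in parse_items(PAGE):
    print(item)
```

`ElementTree` only understands a subset of XPath and needs well-formed input, which is exactly why a forgiving HTML parser such as lxml's is the better fit for a real Newegg page.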

First off, let's consider my criteria for a "good" SSD on Newegg. The SSD can be either the typical 2.5" form factor, or a PCI-Express card[6]. The interface can be SATAII, SATAIII or PCI-Express. Capacity must be greater than or equal to 120GB[7]. Last but not least, the disk should be under $300[8]. The above requirements give us the following power search[9], which we will be using as the source for the script:

```python
url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
      "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
      "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
      "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
      ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
      ",4215:41071&bop=And&Pagesize=100"
```

Now the first thing that made me cringe as I was rewriting this was the fact that I would basically have no choice but to load each individual product page from the results page, as capacity is no longer included in either the description or the features list of each product in the results page. Eventually I will get around to multi-threading this to make it a little less painful, or I'll get lucky and Newegg will add the capacity feature back to the item listing in power searches for SSDs.
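The multi-threading idea mentioned above could look roughly like the following Python 3 sketch, which hands the per-product page loads to a thread pool from the standard library's `concurrent.futures`. The names `fetch_all` and `fake_fetch` are hypothetical, and the fetch function here is a stand-in that does no network access; this is one possible approach, not what the script currently does.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real page load; it just echoes the URL back.
    return "<html>%s</html>" % url


pages = fetch_all(
    ["http://example.com/item/1", "http://example.com/item/2"],
    fake_fetch,
)
print(pages)
```

In the real script the `fetch` argument would be whatever loads and parses one product page. Since that work is mostly waiting on the network, threads should help even with Python's global interpreter lock.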
The following is the full source code of the parser:

```python
import re, math
from lxml import etree

url = "http://www.newegg.com/Product/ProductList.aspx?Submit=Property&N=100008" + \
      "120&IsNodeId=1&maxPrice=300&OEMMark=1,0&PropertyCodeValue=4213:30854,421" + \
      "3:41472,4213:47725,4214:46019,4214:72313,4214:57574,4214:58118,4214:3941" + \
      "6,4214:47732,4214:30849,4214:47171,4214:46300,4214:77918,4214:72311,4214" + \
      ":77919,4214:55178,4214:47733,4214:57755,4214:44038,4215:55552,4215:47726" + \
      ",4215:41071&bop=And&Pagesize=100"

featureMap = {
    'Capacity': 'capacity',
    'Sequential Access - Write:': 'write',
    'Sequential Access - Write': 'write',
    'Sequential Access - Read:': 'read',
    'Sequential Access - Read': 'read',
    'Interface Type': 'interface',
    'Brand': 'brand',
    'Model': 'model',
    'Series': 'series'
}

speed_re = re.compile(r'(\d+)\s?MB/s')
capacity_re = re.compile(r'(\d+)GB')

parser = etree.HTMLParser()
# tree = etree.parse("temp.html", parser)
tree = etree.parse(url, etree.HTMLParser())
root = tree.getroot()

items = []
for node in root.findall(".//div[@class='itemCell']"):
    item = {}

    # Get link
    link = node.find(".//a[@title='View Details']")
    item["link"] = link.attrib["href"]

    # Get feature list (loads each item's url, should multi-thread this in the future)
    itemPage = etree.parse(item["link"], etree.HTMLParser()).getroot()
    featureList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dt"))
    valueList = map(lambda n: n.text, itemPage.findall(".//fieldset/dl/dd"))
    features = zip(featureList, valueList)

    for feature, value in features:
        if value is not None and feature in featureMap:
            # If it's a speed feature parse out the speed
            if featureMap[feature] in ("read", "write"):
                item[featureMap[feature]] = min(map(lambda x: int(x), speed_re.findall(value)))
            # If it's a capacity feature, parse out the capacity
            elif featureMap[feature] == "capacity":
                item[featureMap[feature]] = min(map(lambda x: int(x), capacity_re.findall(value)))
            # If the value doesn't need to be parsed, just store the value in item
            else:
                item[featureMap[feature]] = value.strip()

    # Get price
    price = map(lambda n: n.text, node.findall(".//li[@class='priceFinal']/*"))
    item["price"] = float(''.join(price[1:]))

    # Only add the item if it has the features we need in it
    if "read" in item and "write" in item and "capacity" in item and "series" in item:
        score = (item["read"] * item["write"] * item["capacity"]) / ((math.log(abs(item["read"] - item["write"])) + 1) + item["price"])
        item["score"] = score
        items.append(item)

sorted = {}
for item in items:
    # Open addressing like in a hash table, so we don't wind
    # up with any collisions, unlikely but good practice anyway
    score = item["score"]
    while score in sorted:
        score += 1
    sorted[score] = item

sortOrder = sorted.keys()
sortOrder.sort()
sortOrder.reverse()

headers = ['brand', 'series', 'model', 'link', 'interface', 'price', 'capacity', 'read', 'write', 'score']
print '\t'.join(headers)
for key in sortOrder:
    item = sorted[key]
    print '\t'.join(map(lambda x: str(item[x]), headers))
```

At this point, if you've read the entire script, you'll probably notice that I've made a slight change to the scoring equation. It has been changed from the following:

$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{\text{Price}}$

To the following:

$\frac{\text{Read} \times \text{Write} \times \text{Capacity}}{(\log_{10}(|\text{Read} - \text{Write}|) + 1) \times \text{Price}}$

I discovered that using the difference in read/write speed heavily penalized drives with anything greater than 10MB/s difference. So I figured that it may be a little more subtle to simply penalize drives based on the order of magnitude of the difference.

Now you're probably wondering: "When is this blathering idiot going to get to the damned results already?".
And you'd be pleasantly surprised to know that I'm getting to them as you waste your time reading this.

| Manufacturer | OCZ | OCZ | G.Skill | OCZ |
| --- | --- | --- | --- | --- |
| Series | RevoDrive | Vertex 2 | Phoenix Pro Series | Agility 2 |
| Capacity | 120GB | 180GB | 120GB | 120GB |
| Read | 540MB/s | 285MB/s | 285MB/s | 285MB/s |
| Write | 490MB/s | 275MB/s | 275MB/s | 275MB/s |
| Item | N82E16820227578[10] | N82E16820227602[11] | N82E16820231378[12] | N82E16820227593[13] |
| Price | \$299.99 | \$294.99 | \$214.99 | \$214.99 |

As you can see, the RevoDrive far out-scores all the rest of the SSDs considered in this analysis. The main reason is that they've essentially included two 60GB SSDs on the same card and you're expected to perform software RAID on them in your own system[14]. Despite the incredible speeds they boast, I don't think I would purchase one of these to use as my OS/program disk because compatibility is a major limitation. You must be sure that your motherboard's BIOS supports booting via PCI-Express cards. And last but not least, the main reason I would pass up this card is the lack of TRIM support. As far as I can tell these cards do not support TRIM, which is a major downside as far as I'm concerned.

The second disk in the list is the OCZ Vertex 2 180GB version. I'd probably skip this one just because I don't really consider the extra 60GB worth the extra \$80.

Which leaves me with the last two disks, which are, as far as my analysis is concerned, identical. If you take into account the detailed features you'll notice that the G.Skill claims 50k IOPS on the 4k random write test, which seems a bit... optimistic. The OCZ makes no such claim, and as far as I'm concerned both disks are more or less the same thing. So it's pretty much up to brand preference at this point.
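For anyone who wants to check the ranking by hand, here is a small Python 3 sketch of the revised scoring formula, applied to the four drives from the table above. It follows the formula as written in the text ($\log_{10}$, penalty multiplied by price), whereas the listing calls `math.log` (natural log) and adds the price to the penalty term, so absolute scores will differ from the script's output even though the ordering of these four drives comes out the same. Treating a read/write difference below 1 as 1 (to avoid taking the log of zero) is my assumption, and using the built-in `sorted()` with a key is a swapped-in replacement for the dictionary-based ordering in the listing.

```python
import math


def score(read, write, capacity, price):
    """Revised score: read * write * capacity / ((log10|read - write| + 1) * price)."""
    # Assumption: clamp the difference to at least 1 so log10 is defined
    # (and the penalty is exactly 1) when read and write speeds match.
    diff = max(abs(read - write), 1)
    penalty = math.log10(diff) + 1
    return (read * write * capacity) / (penalty * price)


# (name, read MB/s, write MB/s, capacity GB, price USD), taken from the table above.
drives = [
    ("G.Skill Phoenix Pro 120GB", 285, 275, 120, 214.99),
    ("OCZ Vertex 2 180GB", 285, 275, 180, 294.99),
    ("OCZ RevoDrive 120GB", 540, 490, 120, 299.99),
    ("OCZ Agility 2 120GB", 285, 275, 120, 214.99),
]

# Highest score first; sorted() is stable, so tied drives keep their input order.
ranked = sorted(drives, key=lambda d: score(*d[1:]), reverse=True)

for name, *specs in ranked:
    print("%s\t%.0f" % (name, score(*specs)))
```

Because the G.Skill and the Agility 2 have identical specs and price, they tie exactly, which is the "identical as far as my analysis is concerned" situation described above.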

## Choosing an SSD (Update)

My brother is in the planning stages of building a new desktop. One of the things he's planning on doing differently from his last build[1] is using an SSD for OS + Programs.

I had mentioned to him previously that I wrote a program for helping to choose an SSD based on what SSDs are meant for and are good at doing. So he asked if I could recommend one. Below are the results of the latest run of my script based on the most current listings[2] of SSDs Newegg offers.

| Manufacturer | OCZ | G.Skill | OCZ |
| --- | --- | --- | --- |
| Series | Agility 2 | Phoenix Pro | Vertex 2 |
| Capacity | 120GB | 120GB | 120GB |
| Read | 285MB/s | 285MB/s | 285MB/s |
| Write | 275MB/s | 275MB/s | 275MB/s |
| Item | N82E16820227543[3] | N82E16820231378[4] | N82E16820227551[5] |
| Price | \$235.99 | \$239.99 | \$240.00 |

It looks like OCZ has two of the top three places this run and G.Skill is still maintaining one of the top three from before. Between the three of them I think I would likely still go for the G.Skill just because of personal preference, despite there not really being any significant differences between the three, price excepted of course.

