System Memory Analysis
Choosing the best RAM for your system can be difficult, as there are a lot of things to consider. Comparing by hand can net you some pretty decent results, but picking the best price per capacity per ... gets fairly complicated that way.
You may remember my SSD analysis script[1] from a while ago, which scraped HTML from Newegg and calculated a score for each product in order to choose the best one. I've since discovered that Newegg does indeed have an API[2] that greatly simplifies this whole process[3].
Once I had explored Newegg's API enough to get the data I needed, I set to work updating the SSD script and writing a few others for HDDs and system memory. Of the scripts I wrote, the one for system memory turned out to be particularly useful, as it made finding great deals very easy. It also illustrated that popular brands may not always be the best deal.
The first major improvement over the previous scripts was the use of threading to make multiple API requests in parallel, which sped things up quite a bit. While Python's threading library doesn't allow for parallelism on the CPU[4], it does let threads overlap while they wait on I/O such as network requests. Below is the class used for grabbing URLs throughout the script.
```python
import threading
import urllib, urllib2
import json, re
from Queue import Queue

class GetURL(threading.Thread):
    def __init__(self, urlQueue, jsonQueue):
        threading.Thread.__init__(self)
        self.urlQueue = urlQueue
        self.jsonQueue = jsonQueue

    def run(self):
        while True:
            # Block until a (itemNumber, url) job is available
            itemNumber, url = self.urlQueue.get()
            raw = urllib2.urlopen(url).read()
            # Use the queue handed to this thread rather than a global
            self.jsonQueue.put((itemNumber, json.loads(raw)))
            self.urlQueue.task_done()
```
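For anyone on Python 3, roughly the same fan-out can be sketched with `concurrent.futures` from the standard library instead of a hand-written Thread subclass. This is a swapped-in technique, not the code from my script: `fetch_all`, `fake_fetch` and the second item number are made-up names and values, and the stand-in fetcher keeps the example off the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(jobs, fetch, workers=2):
    """Run fetch(url) for every (item_number, url) pair on a small thread pool.

    `fetch` is any callable that takes a URL and returns parsed data; in a
    real script it would wrap an HTTP request plus json.loads.
    Returns a dict mapping item_number -> fetched result.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() queues the work; each future stays tied to its item number
        futures = {number: pool.submit(fetch, url) for number, url in jobs}
        # result() blocks until that particular request has finished
        return {number: future.result() for number, future in futures.items()}


def fake_fetch(url):
    # Hypothetical stand-in for the network call so the sketch runs offline
    return {"url": url}


spec_url = "http://www.ows.newegg.com/Products.egg/{}/Specification"
# "20-000-000" is an invented item number used only for illustration
jobs = [(n, spec_url.format(n)) for n in ("20-148-372", "20-000-000")]
results = fetch_all(jobs, fake_fetch)
```

The pool size of two mirrors the two worker threads the script starts; results come back keyed by item number, so no second queue is needed.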
Newegg's API paginates its results, since the Android app displays the data directly to the user, which means there's no easy way to retrieve all results in one request. Instead you must make successive calls, incrementing the page number until all results for the query have been retrieved.
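The paging loop itself isn't shown in this post, so here is a minimal Python 3 sketch of the idea, not the actual getItems. It assumes each response carries the PaginationInfo and ProductListItems elements described in the post about Newegg's API; `get_all_items`, `fetch_page` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real POST.

```python
def get_all_items(fetch_page):
    """Collect every product by requesting pages until TotalCount is reached.

    `fetch_page(page_number)` is a hypothetical callable that returns one
    decoded search response containing "PaginationInfo" and "ProductListItems".
    """
    items = []
    page = 1
    while True:
        response = fetch_page(page)
        batch = response.get("ProductListItems") or []
        items.extend(batch)
        total = response["PaginationInfo"]["TotalCount"]
        # Stop when everything is collected, or when a page comes back empty
        if not batch or len(items) >= total:
            return items
        page += 1


def fake_fetch_page(page, total=45, size=20):
    # Stand-in for the real request: 45 invented items, 20 per page
    start = (page - 1) * size
    numbers = range(start, min(start + size, total))
    return {
        "PaginationInfo": {"TotalCount": total, "PageNumber": page, "PageSize": size},
        "ProductListItems": [{"ItemNumber": "item-%d" % n} for n in numbers],
    }


all_items = get_all_items(fake_fetch_page)
```

The empty-page check is only a safety net so the loop can't spin forever if TotalCount and the returned items ever disagree.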
```python
itemSpecURL = "http://www.ows.newegg.com/Products.egg/{}/Specification"
searchURL = "http://www.ows.newegg.com/Search.egg/Advanced"

itemList = getItems()

urlQueue = Queue()
jsonQueue = Queue()

items = {}
for item in itemList:
    specURL = itemSpecURL.format(item["ItemNumber"])
    urlQueue.put((item["ItemNumber"], specURL))
    items[item["ItemNumber"]] = item

for worker in xrange(2):
    t = GetURL(urlQueue, jsonQueue)
    t.setDaemon(True)
    t.start()

urlQueue.join()
```
These basic setups are fairly generic and can be used to analyze just about any product from Newegg: they grab each item's basic data, including price, as well as its detailed specifications. Anything beyond this point, however, is specific to the type of product you're analyzing. I should also note that the parameters passed to the API in getItems are generated using the query builder available in the post about Newegg's API.
```python
speed_re = re.compile(r'DDR\d\s(\d+).*')
capacity_re = re.compile(r"(\d+)GB\s\((\d+)\sx\s(\d+)GB\)")
timing_re = re.compile(r'(\d+-\d+-\d+-\d+)')

features = ['Brand', 'Model', 'ItemNumber', 'Price', 'Speed', 'Capacity', 'Dimms', 'Timing', 'Voltage']

while not jsonQueue.empty():
    itemNumber, specs = jsonQueue.get()
    item = {}
    # Keep only the specification pairs we care about
    for group in specs['SpecificationGroupList']:
        for pair in group['SpecificationPairList']:
            if pair['Key'] in features:
                item[pair['Key']] = pair['Value'].encode('ascii', errors='ignore')

    if 'Capacity' in item:
        capacity = capacity_re.match(item['Capacity'])
        if capacity:
            item['Capacity'] = capacity.group(1)
            item['Dimms'] = capacity.group(2)

    if 'Speed' in item:
        speed = speed_re.match(item['Speed'])
        if speed:
            item['Speed'] = speed.group(1)

    if 'Timing' in item:
        timing = timing_re.match(item['Timing'])
        if timing:
            item['Timing'] = timing.group(1).replace('-','\t')
        else:
            continue

    item['Price'] = items[itemNumber]['FinalPrice']
    item['ItemNumber'] = specs['NeweggItemNumber']
    try:
        # One tab-separated row per item, ready to paste into a spreadsheet
        print '\t'.join(map(lambda x: item[x], features))
    except KeyError:
        pass
    jsonQueue.task_done()
```
The basic purpose of the above code is to go through each item and format each feature into usable data[5]. Once the data has been formatted and printed, I do the rest of the filtering and analysis in Microsoft Excel.
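To make the formatting step concrete, here is a small Python 3 sketch that applies the same three regular expressions to the spec strings from the Crucial listing shown in the post about Newegg's API. `parse_specs` is a hypothetical helper, and unlike the script above it converts the matches to integers rather than leaving them as strings.

```python
import re

speed_re = re.compile(r"DDR\d\s(\d+).*")
capacity_re = re.compile(r"(\d+)GB\s\((\d+)\sx\s(\d+)GB\)")
timing_re = re.compile(r"(\d+-\d+-\d+-\d+)")


def parse_specs(raw):
    """Turn raw spec strings into numbers; fields that don't match are skipped."""
    parsed = {}
    capacity = capacity_re.match(raw.get("Capacity", ""))
    if capacity:
        parsed["Capacity"] = int(capacity.group(1))  # total GB in the kit
        parsed["Dimms"] = int(capacity.group(2))     # number of sticks
    speed = speed_re.match(raw.get("Speed", ""))
    if speed:
        parsed["Speed"] = int(speed.group(1))
    timing = timing_re.match(raw.get("Timing", ""))
    if timing:
        parsed["Timing"] = [int(t) for t in timing.group(1).split("-")]
    return parsed


sample = {
    "Capacity": "4GB (2 x 2GB)",
    "Speed": "DDR3 2133 (PC3 17000)",
    "Timing": "9-10-9-24",
}
parsed = parse_specs(sample)
```

Having numbers instead of strings makes it easier to compute things like price per gigabyte later on, whether in a spreadsheet or in code.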
The equation used to calculate a score for each set of system memory is as follows:
Currently it looks like G.Skill has the best to offer in the DDR3 memory market if you're looking for a quad-channel set for Sandy Bridge's enthusiast hardware due out in the next quarter[6].
[1] Choosing an SSD (A more different S)
[2] Newegg's JSON API
[3] Even though it was never intended for what I'm using it for.
[4] Python is crippled in this way due to a global interpreter lock.
[5] Or at least the data that we're interested in using for analysis.
[6] G.SKILL Ripjaws X Series 16GB (4 x 4GB) 1333MHz
Newegg’s JSON API
For the longest time I've wanted access to Newegg's product list. For me they've been one of the better and more structured websites for buying computer hardware. So naturally they're usually my first choice when it comes to finding a good deal on a particular piece of hardware. They're also rather useful for seeing what's out there since their product catalog is fairly complete.
A while back I started wanting to sort through items to heuristically pick the best deal based on a number of features Newegg generally provides for each item. This method works pretty well on SSDs and system memory. But until a recent discovery I was limited to scraping Newegg's website in order to get any kind of information from them. If you've ever tried this sort of thing you know that it is messy and generally a bad idea, because any change Newegg makes to the structure of their website, down to the most minute detail, will almost always break your scraping script.
The discovery came in the form of a mobile application for Android[1]. The mobile app lets you browse their website in a clean and fast manner. But what got me thinking is that unlike some other mobile applications out there that are just application wrappers for the mobile version of their websites this one operates directly through the native GUI. Now this is where it got interesting. I knew that if Newegg had written the app to use the native GUI then they had to be providing the data to it somehow and I knew it had to be more structured than HTML scraping like what I've been doing[2]. You have no idea how happy I was to discover that I was right.
First thing I did was connect my Droid 2 Global to my home network via WiFi in order to sniff some of the traffic going to and from the mobile app. This was accomplished by mounting a CIFS drive from my Windows 7 desktop on my router, which runs Tomato-based firmware. The share had a binary for tcpdump, which I then used to sniff for traffic originating from or going to my phone's IP address. After setting this up and performing all of the basic operations I would need in order to "reverse engineer" the data source, I got to work on filtering the important bits.
In Wireshark I immediately discovered that they had a sub-domain they were using for these operations. All of the web requests that weren't for images or for customer metrics and tracking went to this host:
www.ows.newegg.com
Because this API is structured more or less the same as navigating their site, but with different identifiers, I decided to start by writing a query builder. Its purpose was to let me browse to the particular category I was interested in analyzing and filter it down with a few simple requirements to simplify the analysis.
The first major entry point in the process of browsing to what you're interested in pulling is:
http://www.ows.newegg.com/Stores.egg/Menus
This takes no parameters and provides the main menu:
```json
[
    {
        "StoreDepa": "ComputerHardware",
        "StoreID": 1,
        "ShowSeeAllDeals": true,
        "Title": "Computer Hardware"
    },
    {
        "StoreDepa": "PCNotebook",
        "StoreID": 3,
        "ShowSeeAllDeals": true,
        "Title": "PCs & Laptops"
    },
    {
        "StoreDepa": "Electronics",
        "StoreID": 10,
        "ShowSeeAllDeals": true,
        "Title": "Electronics"
    },
    ...
```
Once you've selected a store to browse, the next URI is:
http://www.ows.newegg.com/Stores.egg/Categories/{StoreID}
The only parameter it takes is StoreID, which you'll find in the first query. This will return all of the categories within a store. I haven't really explored this very much, as I'm only really interested in browsing system memory and SSDs. Using the Computer Hardware store, the output is as follows:
```json
[
    {
        "Description": "Backup Devices & Media",
        "StoreID": 1,
        "NodeId": 6642,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 2
    },
    {
        "Description": "Barebone / Mini Computers",
        "StoreID": 1,
        "NodeId": 6668,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 3
    },
    {
        "Description": "CD / DVD Burners & Media",
        "StoreID": 1,
        "NodeId": 6646,
        "ShowSeeAllDeals": true,
        "CategoryType": 0,
        "CategoryID": 10
    },
    ...
```
StoreID is echoed back from the parameters of the request. I'm not exactly sure how to describe the purpose of NodeId, but it appears to be a distinguishing feature of a category or subcategory. CategoryID is used for filtering results down to a specific category and can be either a root category or a subcategory. CategoryType determines whether CategoryID is a root category or contains subcategories. A value of 1 for CategoryType indicates a root category, which can be searched directly; any other value means there are subcategories to navigate first.
Now depending on CategoryType you either move straight to the search query or onto a navigation query. The navigation query is used if there are subcategories:
http://www.ows.newegg.com/Stores.egg/Navigation/{StoreID}/{CategoryID}/{NodeID}
This query takes StoreID, CategoryID and NodeID, which you can get from the category listing of a particular store. It will return a subcategory list. Below is the subcategory listing for the memory category.
```json
[
    {
        "Description": "Desktop Memory",
        "StoreID": 1,
        "NodeId": 7611,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 147
    },
    {
        "Description": "Flash Memory",
        "StoreID": 1,
        "NodeId": 8038,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 68
    },
    {
        "Description": "Laptop Memory",
        "StoreID": 1,
        "NodeId": 7609,
        "ShowSeeAllDeals": false,
        "CategoryType": 1,
        "CategoryID": 381
    },
    ...
```
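The branching on CategoryType can be sketched in a few lines of Python 3. This mirrors what my query builder does when it decides which link to render, but `next_step` is a hypothetical helper written for illustration only.

```python
NAVIGATION_URL = "http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}"
SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def next_step(category):
    """Return ("search", url) for a directly searchable category,
    otherwise ("navigation", url) for one that still has subcategories."""
    if category["CategoryType"] == 1:
        return ("search", SEARCH_URL)
    url = NAVIGATION_URL.format(
        category["StoreID"], category["CategoryID"], category["NodeId"]
    )
    return ("navigation", url)


# Entries copied from the listings above
backup_devices = {"Description": "Backup Devices & Media", "StoreID": 1,
                  "NodeId": 6642, "CategoryType": 0, "CategoryID": 2}
desktop_memory = {"Description": "Desktop Memory", "StoreID": 1,
                  "NodeId": 7611, "CategoryType": 1, "CategoryID": 147}

step_for_backup = next_step(backup_devices)
step_for_memory = next_step(desktop_memory)
```

Note that the search step only returns the URL; the actual parameters still have to be posted, as described next.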
From here you will go to the search query[3]. At this point it does get a little tricky, as the parameters for the query are no longer sent via GET; they are instead sent using POST[4], which means you'll need a programmatic method for making a search query. Given a category, store and node, the search query returns quite a lot of things. The first thing in the list is the search filtering parameters, which allow you to limit the products shown in the listing.
Some posted data is necessary to receive a non-404 response from the server. If you really wanted to, you could just send an empty dictionary, which would query Newegg's entire product list. Any of the query options can be omitted; integer values are omitted by setting them to -1.
The parameters you should concern yourself with are listed below, after the URL the data should be posted to in JSON format:
http://www.ows.newegg.com/Search.egg/Advanced
```python
data = {
    "SubCategoryId": 147,
    "NValue": "",
    "StoreDepaId": 1,
    "NodeId": 7611,
    "BrandId": -1,
    "PageNumber": 1,
    "CategoryId": 17
}
```
NValue is a space-separated list of NValues from the search parameters. Mind you, you cannot filter against more than one item in any one group of search filters. For example, in system memory you can't select DDR3 1333 (PC3 10600), DDR3 1333 (PC3 10660) and DDR3 1333 (PC3 10666) together; the query will return an unsuccessful search result. The rest of the parameters are fairly self-explanatory.
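As a rough Python 3 illustration of the POST described above, the sketch below builds the request without sending it. `build_search_request` and `or_minus_one` are hypothetical names; the field names and the -1 convention come from this post, and no Content-Type header is set because my own scripts never set one either.

```python
import json
import urllib.request

SEARCH_URL = "http://www.ows.newegg.com/Search.egg/Advanced"


def build_search_request(store_id=None, category_id=None, subcategory_id=None,
                         node_id=None, nvalues=(), page=1):
    """Build (but do not send) the JSON POST for the search endpoint."""
    def or_minus_one(value):
        # Omitted integer options are sent as -1
        return -1 if value is None else int(value)

    payload = {
        "StoreDepaId": or_minus_one(store_id),
        "CategoryId": or_minus_one(category_id),
        "SubCategoryId": or_minus_one(subcategory_id),
        "NodeId": or_minus_one(node_id),
        "BrandId": -1,
        # NValue is a space-separated list; sorting keeps the output stable
        "NValue": " ".join(sorted(set(nvalues))),
        "PageNumber": page,
    }
    body = json.dumps(payload).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET
    return urllib.request.Request(SEARCH_URL, data=body)


request = build_search_request(store_id=1, category_id=17,
                               subcategory_id=147, node_id=7611)
```

Passing the result to `urllib.request.urlopen` would actually perform the query; that step is left out here so the example stays offline.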
The result returned will contain the following elements: RelatedLinkList, CoremetricsInfo, NavigationContentList, PaginationInfo and ProductListItems. CoremetricsInfo and RelatedLinkList can usually be ignored. The elements we're interested in are NavigationContentList, a list of search parameters/filters you can apply to the search; PaginationInfo, which describes how many elements were returned, what page we're on and how many elements there are per page; and, last but not least, ProductListItems, which provides a list of the products returned by the query along with some basic listing info for each one.
Below is a portion of the NavigationContentList:
```json
{
    "NavigationContentList": [
        {
            "NavigationItemList": [
                {
                    "SubCategoryId": -1,
                    "Description": "Free Shipping",
                    "StoreDepaId": 94,
                    "NValue": "100007611 600006050 600052012 4808",
                    "BrandId": -1,
                    "StoreType": 4,
                    "ItemCount": 194,
                    "CategoryId": -1,
                    "ElementValue": "4808"
                },
                {
                    "SubCategoryId": -1,
                    "Description": "Top Sellers",
                    "StoreDepaId": -1,
                    "NValue": "100007611 600006050 600052012 4802",
                    "BrandId": -1,
                    "StoreType": -1,
                    "ItemCount": 39,
                    "CategoryId": -1,
                    "ElementValue": "4802"
                },
                ...
```
This section will also contain a group name:
```json
...
"TitleItem": {
    "SubCategoryId": -1,
    "Description": "Useful Links",
    "StoreDepaId": -1,
    "NValue": "4800",
    "BrandId": -1,
    "StoreType": -2,
    "ItemCount": 0,
    "CategoryId": -1,
    "ElementValue": "4800"
}
...
```
The PaginationInfo and ProductListItems elements will look like the following:
```json
...
"PaginationInfo": {
    "TotalCount": 233,
    "PageNumber": 1,
    "PageSize": 20
},
"ProductListItems": [
    {
        "SellerId": null,
        "ItemOwnerType": 0,
        "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139",
        "ItemGroupID": 0,
        "ReviewSummary": {
            "Rating": 5,
            "TotalReviews": "[1]"
        },
        "IsCellPhoneItem": false,
        "Discount": null,
        "FinalPrice": "$104.99",
        "ItemNumber": "20-148-372",
        "MappingFinalPrice": "$104.99",
        "FreeShippingFlag": true,
        "OriginalPrice": "$104.99",
        "IsComboBundle": false,
        "MailInRebateText": null,
        "ProductStockType": 0,
        "Model": "BL2KIT25664FN2139",
        "ShowOriginalPrice": false,
        "Image": {
            "FullPath": "http://images17.newegg.com/is/image/newegg/20-148-372-TS?$S125W$",
            "SmallImagePath": null,
            "ThumbnailImagePath": null,
            "Title": null
        },
        "SellerName": null,
        "ParentItem": null
    },
    ...
```
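Two small things fall out of this listing that are handy for analysis: how many pages a query spans, and the price as a number. The Python 3 helpers below are hypothetical (`page_count`, `parse_price`), and the handling of thousands separators is an assumption on my part, since the sample only shows a price under $1,000.

```python
import math


def page_count(pagination):
    """Number of pages needed to cover TotalCount at PageSize items per page."""
    return math.ceil(pagination["TotalCount"] / pagination["PageSize"])


def parse_price(text):
    """Convert a price string such as "$104.99" to a float; None if missing."""
    if not text:
        return None
    # Assumption: larger prices may contain commas, e.g. "$1,049.99"
    return float(text.replace("$", "").replace(",", ""))


pages = page_count({"TotalCount": 233, "PageNumber": 1, "PageSize": 20})
price = parse_price("$104.99")
```

With 233 results at 20 per page, the last page is only partly filled, which is why the count is rounded up.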
At this point you might be wondering: what good will all this do me if I can't get specifications on an item? Well, you can, and here's how. In each ProductListItems element you'll find an ItemNumber, which is essentially the primary key that each product is related to within this interface to Newegg's product list. Using the following URL you can obtain the full details page for any given item using its ItemNumber:
http://www.ows.newegg.com/Products.egg/{ItemNumber}/Specification
```json
{
    "SpecificationGroupList": [
        {
            "GroupName": "Model",
            "SpecificationPairList": [
                {
                    "Value": "Crucial",
                    "Key": "Brand"
                },
                {
                    "Value": "Ballistix",
                    "Key": "Series"
                },
                {
                    "Value": "BL2KIT25664FN2139",
                    "Key": "Model"
                },
                {
                    "Value": "240-Pin DDR3 SDRAM",
                    "Key": "Type"
                }
            ]
        },
        {
            "GroupName": "Tech Spec",
            "SpecificationPairList": [
                {
                    "Value": "4GB (2 x 2GB)",
                    "Key": "Capacity"
                },
                {
                    "Value": "DDR3 2133 (PC3 17000)",
                    "Key": "Speed"
                },
                {
                    "Value": "9",
                    "Key": "Cas Latency"
                },
                {
                    "Value": "9-10-9-24",
                    "Key": "Timing"
                },
                {
                    "Value": "1.65V",
                    "Key": "Voltage"
                },
                {
                    "Value": "No",
                    "Key": "ECC"
                },
                {
                    "Value": "Unbuffered",
                    "Key": "Buffered/Registered"
                },
                {
                    "Value": "Dual Channel Kit",
                    "Key": "Multi-channel Kit"
                }
            ]
        },
        {
            "GroupName": "Manufacturer Warranty",
            "SpecificationPairList": [
                {
                    "Value": "Lifetime limited",
                    "Key": "Parts"
                },
                {
                    "Value": "Lifetime limited",
                    "Key": "Labor"
                }
            ]
        }
    ],
    "NeweggItemNumber": "N82E16820148372",
    "Title": "Crucial Ballistix 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 2133 (PC3 17000) Desktop Memory with Thermal Sensor Model BL2KIT25664FN2139"
}
```
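The nested groups are easier to work with once they are collapsed into a single lookup. A minimal Python 3 sketch, assuming the response has the shape shown above; `flatten_specs` is a hypothetical helper and the sample is a trimmed copy of that response.

```python
def flatten_specs(spec_response):
    """Collapse SpecificationGroupList into one {Key: Value} dict.

    If two groups ever used the same key, the later one would win.
    """
    flat = {}
    for group in spec_response.get("SpecificationGroupList", []):
        for pair in group.get("SpecificationPairList", []):
            flat[pair["Key"]] = pair["Value"]
    return flat


sample_response = {
    "SpecificationGroupList": [
        {"GroupName": "Model", "SpecificationPairList": [
            {"Value": "Crucial", "Key": "Brand"},
            {"Value": "BL2KIT25664FN2139", "Key": "Model"},
        ]},
        {"GroupName": "Tech Spec", "SpecificationPairList": [
            {"Value": "4GB (2 x 2GB)", "Key": "Capacity"},
            {"Value": "1.65V", "Key": "Voltage"},
        ]},
    ],
    "NeweggItemNumber": "N82E16820148372",
}
specs = flatten_specs(sample_response)
```

From here a lookup like `specs["Capacity"]` replaces the double loop over groups and pairs.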
From this point on you can grab all of the features and specifications of any particular item you're interested in. In the near future I'll be writing a new post for both my memory and SSD analysis scripts using this interface.
The full code for my query builder is as follows, though you should note this was a quick script and is in no way complete or fully functional. As soon as it reached a usable point I moved on to the main point of this whole ordeal. You should also note that this requires CherryPy[5] and lxml[6]. The end result of this program is a query which you can use to retrieve a list of products matching the options you've selected. This is mainly to simplify product list selection and to minimize the need to hardcode certain values, as Newegg has a tendency to change things around on a regular basis.
```python
import cherrypy, json, urllib, urllib2
from lxml import etree
from lxml.builder import E

class QueryBuilder(object):
    def index(self):
        request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Menus")
        response = request.read()
        data = json.loads(response)

        body = E.body()
        ul = E.ul()

        for store in data:
            ul.append(E.li(E.a(
                store['Title'],
                href='/Store?StoreID={}'.format(store['StoreID'])
            )))

        page = E.html(E.body(ul))
        return etree.tostring(page, pretty_print=True)
    index.exposed = True

    def Store(self, StoreID=None):
        if StoreID is not None:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Categories/{}".format(StoreID))
            response = request.read()
            data = json.loads(response)

            body = E.body()
            ul = E.ul()

            for category in data:
                if category['CategoryType'] == 1:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Search?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))
                else:
                    ul.append(E.li(E.a(
                        category['Description'],
                        href='/Category?StoreID={}&CategoryID={}&NodeID={}'.format(StoreID, category['CategoryID'], category['NodeId'])
                    )))

            page = E.html(E.body(ul))
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Store.exposed = True

    def Category(self, StoreID, CategoryID, NodeID):
        if None not in [StoreID, CategoryID, NodeID]:
            request = urllib2.urlopen("http://www.ows.newegg.com/Stores.egg/Navigation/{}/{}/{}".format(StoreID, CategoryID, NodeID))
            response = request.read()
            data = json.loads(response)

            body = E.body()
            ul = E.ul()

            for subcategory in data:
                ul.append(E.li(E.a(
                    subcategory['Description'],
                    href='/Search?StoreID={}&CategoryID={}&SubCategoryID={}&NodeID={}'.format(StoreID, CategoryID, subcategory['CategoryID'], subcategory['NodeId'])
                )))

            page = E.html(E.body(ul))
            return etree.tostring(page, pretty_print=True)
        else:
            return "Invalid parameters."
    Category.exposed = True

    def Search(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None):
        url = "http://www.ows.newegg.com/Search.egg/Advanced"

        data = {
            "IsUPCCodeSearch": False,
            "IsSubCategorySearch": True,
            "isGuideAdvanceSearch": False,
            "StoreDepaId": StoreID,
            "CategoryId": CategoryID,
            "SubCategoryId": SubCategoryID,
            "NodeId": NodeID,
            "BrandId": -1,
            "NValue": "",
            "Keyword": "",
            "Sort": "FEATURED",
            "PageNumber": 1
        }

        params = json.dumps(data).replace("null", "-1")
        request = urllib2.Request(url, params)
        response = urllib2.urlopen(request)
        data = json.loads(response.read())

        if data['NavigationContentList'] is None:
            return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)

        body = E.body()
        form = E.form(name='PowerSearch', action='GenerateURL', method='GET')
        table = E.table()
        form.append(table)

        for section in data['NavigationContentList']:
            index = 0
            tr = E.tr(E.td(section['TitleItem']['Description'], colspan='3'))
            table.append(tr)
            for option in section['NavigationItemList']:
                if index % 3 == 0:
                    tr = E.tr()
                    table.append(tr)
                index += 1
                checkbox = E.td(E.input(option["Description"], type="checkbox", name=section['TitleItem']['Description'].replace(" ", ""), value=option['NValue']))
                tr.append(checkbox)

        for param, value in [('StoreID', StoreID), ('CategoryID', CategoryID), ('SubCategoryID', SubCategoryID), ('NodeID',NodeID)]:
            try:
                form.append(E.input(type='hidden', name=param, value=value))
            except KeyError:
                pass

        form.append(E.input(type='submit', value='Submit'))

        page = E.html(E.body(form))
        return etree.tostring(page, pretty_print=True)
    Search.exposed = True

    def GenerateURL(self, StoreID=None, CategoryID=None, SubCategoryID=None, NodeID=None, **kwargs):
        NValue = set([])
        for arg in kwargs:
            if type(kwargs[arg]) == list:
                for value in kwargs[arg]:
                    NValue.add(value)
            else:
                NValue.add(kwargs[arg])
        NValue = list(NValue)
        NValue.sort()

        if StoreID is None: StoreID = -1
        if CategoryID is None: CategoryID = -1
        if SubCategoryID is None: SubCategoryID = -1
        if NodeID is None: NodeID = -1

        data = {
            "StoreDepaId": int(StoreID),
            "CategoryId": int(CategoryID),
            "SubCategoryId": int(SubCategoryID),
            "NodeId": int(NodeID),
            "BrandId": -1,
            "NValue": ' '.join(NValue),
            "PageNumber": 1
        }

        return etree.tostring(E.pre(json.dumps(data, indent=4)), pretty_print=True)
    GenerateURL.exposed = True

cherrypy.quickstart(QueryBuilder())
```
[1] And iOS devices I assume as well.
[2] Because let's face it, that would be stupid.
[3] ... or get to the search query from selecting a root category in the main category listing for a store.
[4] At least this is the method used by the mobile app.
[5] CherryPy: CherryPy is a pythonic, object-oriented HTTP framework.
[6] lxml: A Pythonic binding for the C libraries libxml2 and libxslt.
Automagic TV Show Calendar
A little while ago I was browsing the web and discovered a website called tvrage.com[1] which seems to be the definitive online TV guide. I didn't originally enter the site on the main index but on a page describing the functionality of an XML API[2] they host for accessing their database of TV shows.
To me, this is like opening presents on Christmas day. Just imagine the possibilities! I immediately began exploring the kind of data they provide. The very first idea I had was to use this to automatically create events on my Google Calendar for unaired episodes of my favorite TV shows.
I've previously written Python scripts that interface with gdata, but I find their implementation for Python kind of cumbersome to deal with, so I began researching their Protocol API[3]. At first I wasted a lot of time attempting to build the necessary XML structures to add events and the like. This got old very fast and I decided to just give JSON-C[4] a try. It turns out you can use the built-in JSON module in Python for creating the necessary structures.
For parsing the results I got from tvrage I ended up using Python's xml.etree.ElementTree, which was simple enough to set up to retrieve only the information I was interested in for each episode.[5]
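Here is a small Python 3 sketch of that parsing step, run against an inline sample instead of the live feed. The element paths (Episodelist/Season, episode, seasonnum, airdate, title) are the ones AirDate.py reads further down; the `Show` root element, the sample episodes and the `unaired_episodes` helper are assumptions made up for illustration.

```python
from datetime import date
from xml.etree import ElementTree

# Invented sample shaped after the paths the script reads
SAMPLE_XML = """
<Show>
  <Episodelist>
    <Season no="4">
      <episode><seasonnum>01</seasonnum><airdate>2011-09-19</airdate><title>First</title></episode>
      <episode><seasonnum>02</seasonnum><airdate>2011-00-00</airdate><title>Unknown date</title></episode>
    </Season>
  </Episodelist>
</Show>
"""


def unaired_episodes(xml_text, show_name, today):
    """Return "YYYY-MM-DD Show - S##E##" strings for episodes on or after `today`."""
    root = ElementTree.fromstring(xml_text)
    upcoming = []
    for season in root.findall("Episodelist/Season"):
        season_num = int(season.get("no"))
        for episode in season.findall("episode"):
            episode_num = int(episode.find("seasonnum").text)
            try:
                year, month, day = (int(p) for p in episode.find("airdate").text.split("-"))
                airdate = date(year, month, day)
            except ValueError:
                continue  # unknown dates show up with 00 parts; skip them
            if airdate >= today:
                upcoming.append("%s %s - S%02dE%02d" % (airdate, show_name, season_num, episode_num))
    return upcoming


episodes = unaired_episodes(SAMPLE_XML, "Castle", today=date(2011, 9, 1))
```

Passing `today` in as a parameter, rather than calling `date.today()` inside, just makes the function easy to try with fixed dates.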
I had a bit of trouble initially with adding events to Google Calendar. This stemmed from the fact that Google will often return an HTTP redirect to a URL with an appended gsession attribute, and you're supposed to resubmit the exact data from the first request to that URL. Once I figured this out it was turtles all the way down. I even managed to get the whole script multi-threaded to speed things up, since it's impossible to perform batch requests with JSON-C.
I should note that the calendar entry in the configuration file should be the "Calendar ID", which can be found on the settings page for the individual calendar, grouped with the XML and iCal feeds.
ShowList.txt:[6]
```
Castle	19267
House	3908
Bones	2870
Big Bang Theory, The	8511
Mentalist, The	18967
Rizzoli & Isles	24996
Venture Bros., The	6270
Top Gear	6753
Mythbusters	4605
Archer	23354
NCIS	4628
Community	22589
```
Config.cfg:
```
[Credentials]
username = someuser@gmail.com
password = somebase64encodedpassword
calendar = somecalendarid@group.calendar.google.com
```
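Reading this file back is straightforward. The Python 3 sketch below parses the same section from an inline string and decodes the password; `load_credentials` is a hypothetical helper, and the encoded value is a made-up example ("hunter2"), not a real credential.

```python
import base64
import configparser

# Same layout as Config.cfg, with an invented base64 value for the password
SAMPLE_CONFIG = """
[Credentials]
username = someuser@gmail.com
password = aHVudGVyMg==
calendar = somecalendarid@group.calendar.google.com
"""


def load_credentials(text):
    """Parse the [Credentials] section and decode the base64 password."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["Credentials"]
    return {
        "username": section["username"],
        # base64 only hides the password from a casual glance; it is not encryption
        "password": base64.b64decode(section["password"]).decode("utf-8"),
        "calendar": section["calendar"],
    }


credentials = load_credentials(SAMPLE_CONFIG)
```

To produce the encoded value in the first place, `base64.b64encode(b"yourpassword")` does the reverse.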
AirDate.py:
```python
import urllib2, urllib, json, ConfigParser, base64
from datetime import date
from xml.etree import ElementTree
from threading import Thread

calendar = ""
header = {}

# Thread for retrieving a list of episodes for a given show_id
class airDate(Thread):
    # Initialize thread and set some local attributes
    def __init__(self, show_name, show_id):
        Thread.__init__(self)
        self.show_name = show_name
        self.show_id = show_id

    # Get episode list from tvrage.com based on the show_id
    def run(self):
        # Retrieve XML episode_list from tvrage.com
        xml_data = urllib2.urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=%s" % self.show_id).read()

        # Parse XML into ElementTree.Element()
        xml_tree = ElementTree.fromstring(xml_data)

        self.result = []

        # For each season
        for season in xml_tree.findall("Episodelist/Season"):
            # Get the season number
            season_num = int(season.get("no"))
            # For each episode in the episode list
            for episode in season.findall("episode"):
                # Get episode number and title
                episode_num = int(episode.find("seasonnum").text)
                episode_title = episode.find("title").text

                # Build the episode code S##E##
                episode_code = "S%02dE%02d" % (season_num, episode_num)

                # Parse the airdate into year, month and day
                year, month, day = map(lambda x: int(x), episode.find("airdate").text.split("-"))

                try:
                    episode_airdate = date(year, month, day)
                    today = date.today()
                    # If episode hasn't aired yet
                    if episode_airdate >= today:
                        # Add episode to results list
                        self.result.append("%s %s - %s" % (str(episode_airdate), self.show_name, episode_code))
                except ValueError:
                    # If the airdate is invalid (tvrage.com sometimes
                    # includes 00's for unknown sections of the date)
                    pass

class addEvent(Thread):
    # Thread for adding events to google calendar
    # Initialize thread and set local episode variable
    def __init__(self, episode):
        Thread.__init__(self)
        self.episode = episode

    # Add new entry to google calendar
    def run(self):
        # Build entry structure
        entry = {"data": {"details": self.episode, "quickAdd": True}}
        # Convert to JSON
        entry = json.dumps(entry)
        # Build request including necessary headers and data
        calReq = urllib2.Request("http://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), entry, header)
        # Execute the request
        calRes = urllib2.urlopen(calReq)
        # Get the redirect url (gsession appended)
        redirectReq = urllib2.Request(calRes.geturl(), entry, header)
        try:
            redirectRes = urllib2.urlopen(redirectReq)
        except urllib2.HTTPError:
            # If we get some sort of HTTP error code
            # skip entry, can always run again
            pass

# Get list of events already added to
# the calendar from previous executions
def getExistingEpisodes(header):
    # Get JSON-C representation of calendar
    calReq = urllib2.Request(url="https://www.google.com/calendar/feeds/%s/private/full?alt=jsonc" % (calendar), headers=header)
    calRes = urllib2.urlopen(calReq)
    # Parse JSON-C
    data = json.loads(calRes.read())
    # If the calendar has events on it
    if "items" in data["data"]:
        # Get the list of events
        events = data["data"]["items"]
        existing_episodes = []
        # For each event
        for event in events:
            # Append just the title of the event to the results
            existing_episodes.append(event["title"])
        return existing_episodes
    else:
        # We don't have any events on this calendar
        # so just return an empty list
        return []

if __name__ == '__main__':
    # Open the configuration file and get the necessary
    # credentials and settings
    config = ConfigParser.ConfigParser()
    config.readfp(open("Config.cfg"))
    username = config.get("Credentials", "username")
    password = config.get("Credentials", "password")
    # Password is stored as base64 encoded string just so
    # we don't have our password sitting out in plain sight
    password = base64.b64decode(password)
    calendar = config.get("Credentials", "calendar")

    # Build loginData structure, this is used to get
    # authentication data from google
    loginData = {
        "Email": username,
        "Passwd": password,
        "source": "BeMasher-ETR-2",
        "service": "cl"
    }

    # Encode the loginData for usage in a url
    loginData = urllib.urlencode(loginData)

    # Get authentication data
    gdataLogin = urllib2.urlopen("https://www.google.com/accounts/ClientLogin", data=loginData)
    SID, LSID, Auth = gdataLogin.read().splitlines()

    # Build header structure, this will be used for
    # all requests to google calendar from now on
    header = {
        "Authorization": "GoogleLogin %s" % (Auth),
        "GData-Version": 2,
        "Content-Type": "application/json"
    }

    # Open a list of the shows we're interested in
    # Stored as "show_name\tshow_id", one per line
    show_list = open("ShowList.txt")
    jobs = []
    for line in show_list:
        show = line.strip().split("\t")
        jobs.append(show)

    # Get a list of existing events from previous
    # executions so we don't wind up with duplicates
    existingEpisodes = getExistingEpisodes(header)

    threadQueue = []

    # For each show in the list
    for job in jobs:
        show_name, show_id = job
        # Create an instance of the airDate thread
        thread = airDate(show_name, show_id)
        # Start it
        thread.start()
        # Add it to the threadQueue
        threadQueue.append(thread)

    episodes = []
    # While we've still got running threads
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
        # For each episode in the results
        for episode in thread.result:
            # If it hasn't already been added to google calendar
            # (episode[11:] drops the leading "YYYY-MM-DD " date)
            if episode[11:] not in existingEpisodes:
                print episode
                # Add to list of episodes that need events created
                episodes.append(episode)

    # For each episode that doesn't have an
    # event on google calendar already
    for episode in episodes:
        # Create an addEvent thread, start it
        # and add it to the threadQueue
        thread = addEvent(episode)
        thread.start()
        threadQueue.append(thread)

    # While we still have threads running
    while len(threadQueue) > 0:
        # Get a thread from the queue
        thread = threadQueue.pop()
        # Block until it completes
        thread.join()
```
This was all done shortly before I discovered that tvrage.com also provides iCal feeds for your favorite shows, provided that you register and add some to your list. Unfortunately the iCal feed they generate creates events for the exact air time of each episode, which I'm not really all that concerned about. So I still use this script to add all-day events for each episode, which makes it easier to see when there's a new episode.
I did write another script using their XML API but that will have to wait for another post.
[1] http://tvrage.com/
[2] http://services.tvrage.com/
[3] Data API Developer's Guide: The Protocol
[4] Google's own flavor of JSON which is almost identical to plain old JSON.
[5] I only really needed the original air date, title, season number and episode number.
[6] You can find the show_id via the show search found on their XML API page.
GitHub Repositories Feed
I noticed at the bottom of the page on GitHub that there was an API link. I took a look at it and found it pretty interesting; it's actually really simple to use. You can export in XML, JSON and YAML. I thought to myself: "Hey, it'd be great if I could put a repositories feed in the sidebar of my blog here!".
So I took a look at the JSON output, since it's small and really easy to deserialize in PHP, and wrote up a quick little PHP script on the server hosting my blog that spits out an RSS feed of the repositories I've created on GitHub. The code is as follows:
```php
<?="<?xml version=\"1.0\"?>"?>
<rss version="2.0">
<channel>
    <title>BeMasher's GitHub Repositories</title>
    <link>http://github.com/bemasher</link>
    <description>BeMasher's GitHub Repositories</description>
    <language>en-us</language>
    <pubDate><?=date("D, d M Y G:i:s e")?></pubDate>
    <lastBuildDate><?=date("D, d M Y G:i:s e")?></lastBuildDate>
    <webMaster>bemasher@bemasher.net</webMaster>
    <ttl>5</ttl>

<?php
$data = file_get_contents("http://github.com/api/v2/json/repos/show/bemasher");
$data = json_decode($data, true);

$today = date("D, d M Y G:i:s e");

foreach($data["repositories"] as $repository) {
    echo <<<ITEM
    <item>
        <title>{$repository["name"]}</title>
        <link>{$repository["url"]}</link>
        <description>{$repository["description"]}</description>
        <pubDate>$today</pubDate>
        <guid>{$repository["url"]}</guid>
    </item>

ITEM;
}
?>
</channel>
</rss>
```
And if you look to the right you can see the RSS Widget in action displaying the output of the script. Cool huh?

