20110117

not fast enough

Tonight I was working on processing topic-model data about some 360 thousand documents from ASCII files into a database.  Bweh, what a task.  I started with a pretty naïve approach, and after getting it to work piecewise, I set it to run on the whole shebang.  After watching output fly by for a few minutes, I crunched some numbers and figured that it would take about 10 days to finish--not okay.  This was a piece of code that was pretty tailored to my task, would likely only ever be seen by me, and even then would only run once; I didn't really want to sink a lot of time into it, but I did want it to finish in, say, under 24 hours.
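For perspective: 10 days over 360 thousand documents works out to about 2.4 seconds per document, and hitting the 24-hour target meant getting under about 0.24 seconds per document--a 10x speedup.  The estimate itself came from a quick extrapolation, something like this (the function and list names here are stand-ins, not my actual code):

    import time

    def estimate_days(process_one, doc_ids, sample_size=500):
        """Time a small sample and extrapolate to the whole job.
        process_one and doc_ids are hypothetical stand-ins for the
        per-document processing function and the full list of ids."""
        start = time.time()
        for doc_id in doc_ids[:sample_size]:
            process_one(doc_id)
        per_doc = (time.time() - start) / sample_size
        return per_doc * len(doc_ids) / 86400.0  # 86400 seconds in a day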

First was the problem of finding links to the documents.  Nature has this entire set online, but finding a link given a document id (doi) wasn't a find-and-replace task.  Take a look at this document on The Rockefeller Foundation (doi: 10.1038/147811a0), for instance.  Part of the doi is in the link, but there's also a volume number and one other number, prefixed by n, that makes the link unique.  And I didn't have those numbers, so I was querying Nature for the doi and scraping the pdf link from the search results page (such as this).  Good for a handful of links, but not 360k times.  Turns out, I was able to find those numbers buried in my data, though they were pretty obscure.  Plugging them straight into the link took me down to 5 days.
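Once I had the volume and the n-number, assembling the link directly was trivial--roughly this, where the URL pattern is my best reconstruction of what Nature used, not copied from my code:

    def pdf_link(doi, volume, n_number):
        """Assemble a pdf link from the doi suffix plus the two extra
        numbers; the URL pattern is my reconstruction, not verified."""
        article_id = doi.split('/')[-1]  # '10.1038/147811a0' -> '147811a0'
        return "http://www.nature.com/nature/journal/v%s/n%s/pdf/%s.pdf" % (
            volume, n_number, article_id)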

Next, there were two types of data files about the documents: ones with a doi and an abstract per line, and ones with all sorts of other info, including the topic model data and, again, the doi, one line per document.  The files themselves weren't one-to-one (there being 8 of one kind and 25 of the other), but they were one-to-one when it came to the relevant document lines: each doi had one line in each of the two file sets.  However, the organization wasn't intuitive, and I wasn't about to pry through them by hand.  Instead, I noted that the matches occurred in runs, though again, not in any way logical enough to hard-code.  So instead of looking through all the other files for each match, I just had the code stash the last-matched file and look in that one first each time.  If the doi wasn't there, it looked in the others and updated the last-matched file accordingly.  That took the expected run time down to under 24 hours, and I declared myself done for the night.
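The trick boils down to a few lines; in sketch form (the data structures here are stand-ins for however the files were actually held in memory):

    last_used = 0  # index of the file that matched most recently

    def find_line(doi, files):
        """files is a stand-in: a list of file contents, each a list of
        lines.  Matches come in runs, so scan the last-matched file first."""
        global last_used
        order = [last_used] + [i for i in range(len(files)) if i != last_used]
        for i in order:
            for line in files[i]:
                if doi in line:
                    last_used = i
                    return line
        return None

Since the runs of matches are long, the lookup almost always hits on the first file tried, which is where the speedup comes from.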

I guess the lesson learned was that even for simple pieces of code that are private and will only ever run once, it's worth taking the time to do things right instead of just the easiest/stupidest way possible.
