My thesis takes all my time: I have to review papers, write out my thoughts, build a platform, attend classes as well as poke into Big Data; and my blog has been at the receiving end. This story about big data came to my mind while I was thinking about my planned work on the stackoverflow (SO) dataset. My adviser suggested using the dataset as a proxy for my experiments; I am assuming you know stackoverflow already.
Dumps of stackoverflow content are released every third month on a creative commons license (don’t worry; it’s anonymized). I got the latest dump and it is ~7GB of compressed XML files. You can get the dumps at this link. The XML files posed the challenge; some are pretty ok – the smallest is around 309Mb but there are two humongous beasts: the posthistory and the posts files are ~17.9GB and ~12Gb respectively! How do you open text files that huge? I tried vim, less and others before giving up.
I eventually found a python script to convert the XML dump to SQLite; I had to update it as the schema has changed. Although it’s not optimized and somewhat slow, it does what I want; (maybe I can improve it and put it back up on the stackoverflow (SO) meta site – bah… I am too lazy).
Pronto, I ran the script and went away with happy thoughts of the good times I would have with the data. Well, I came back a couple of hours later to find out that the conversion had failed. I initially thought the script was broken and went on a wild goose chase looking for other scripts and converters – I got one that required me to install postgresql (I have never used this RDBMS before and it has some of its own quirks too ;) ). Finally I got that converter to play nice with postgresql only for it to break too – Aaaargh! Try picturing an exasperated me :P.
I backtracked and found out that the problem was actually due to poorly-formed XML – wonder why don’t they make ’em parsers lenient? The dump contained some Unicode characters which are invalid in XML. I went back to SO and came across someone who someone who ran into the same problem with an old dump of SO data! From that question; I got a python script that could detect the occurrence of invalid characters and ran it on my dump – well it detected quite a couple. Next step? The elimination of the characters of course!
Subsequently I found another python script to replace the unwanted Unicode (again? Pulling scripts off the internet? Well, I could write them myself but why spend lots of hours on something when I could get it working in minutes – and yes I understand how the code works; at least I think I do :) ). The script didn’t do exactly what I wanted so I fixed the regular expression in it; tested it and then ran it on my ginormous 17Giga… a couple of minutes later and I was done. Phew! I loaded it up in sqlite3 (another hair-pulling experience – I found out the hard way that SQLite wouldn’t load sqlite3 DBs…. yikes!).
Finally I can run queries on my humongous pet – the largest table contains more than 23 million records and I have to pull out metrics and possibly generate images and graphs to really get an idea of what lies beneath. I pray I get this done, I really want to conquer this beast of data and I should be posting here about my weekly progress insha Allaah.
So in case you need scripts to use with ginormous data; I have uploaded all three to my github: here.
DISCLAIMER: I am not claiming to be the author of these scripts; I only made minor modifications to get them to work.
Do you know editors that can handle HUGE GINORMOUS databeasts (yes databeast, not dataset)? Please drop a comment.
10 thoughts on “Thesis Stories : Wrangling with HUGE data”
prof, nice work so far!
Thanks a lot Prof Osaro! :)
waooo, i am learning phython recently from edx and coursera, i thought i am getting better in phython, i could barely understand 10 lines in each of the scripts…Still more to learn, programming is not easy ooo. Nice Job, May Allah assist you with the thesis.
Ameen; jazaakumullaahu khayran akhee. :)
Insha Allaah you’ll get the hang of it soon.