Thesis Stories Episode 2 : Adventures in Ginormous Data

What could be worse than trying to understand ginormous data? Not finding what you’re looking for in it! The original plan was to use stackoverflow data as a proxy for my data mining experiments and after battling the databeast into submission (with the aid of ‘weapons of mass data analysis‘ like Python, SQLServer and sqlite3); it pulled another fast one on me.

I started exploring the murky depths of the subdued dataset and plotted the data distributions (they were mostly heavy-tailed as expected although one was unexpectedly normally distributed). Plotting the distributions was a task in itself – I had to learn about the awesome matplotlib, numpy and scipy (installing them was ‘wahala‘) – and  then the plots were so skewed that I had to spend hours on end finetuning and tweaking before they finally agreed to appear properly on my plots.

Plotting data distributions was child’s play compared with the next set of challenges : getting the candidate posts. Having defined the five features I was interested in; I set out with gusto to find the ‘offending’ entries. I got a surprising outcome – the first three characteristics (deleted answers, converted answers and flagged answers) didn’t exist in the dump; I got the same outcome when I ran my queries on the online data explorer. Finally, I asked on stackoverflow meta and it was confirmed.

You bet I was disappointed right? Sure! I’d spent hours on end writing convoluted SQL queries (subselects, joins, aggregations and what-have-you) and wrapping my head around the data. Heck! Some queries took me about an hour to write, run, verify and tune. Do you know what programming in a declarative non-Turing complete language with lots of constraints (geek-speak for SQL) feels like? It feels like fighting Mike Tyson with one hand tied behind your back. :P (Alhamdulilah,  I took the MOOC DB course).

When man fall down, no be the end of hin life… (Nigerian proverb; language: pidgin English)

So I listed out my alternatives : getting a new dataset, using the same dataset or starting on a new topic (most disliked obviously ;) ). My search for a new dataset was not fruitful – I find other datasets ill-suited to my research and going through the potentially painful process of transforming them does not appeal to me. I went back to my dataset and extracted three other features but the nagging feeling in my mind is that I might have to fall back to the third option.

So do I concede defeat? Nah, nat at all – am a Nigerian remember? We never say die; we’re way too optimistic for our good even :).

Lessons Learnt

  • Never write a lot until you’re really really sure that you’re gonna get something.
  • How to extract information from papers and critique them, know what they are all about.
  • How to read and write continuously for a long period – how do I do it? Pomodoro of course!

Next Steps

I might go back to the SO data; or start all over again but I just pray it turns out all fine – I now have about 4 months left.

Ohh; I forgot to talk about the platform – that’s just been about as good as the experiments.

I am using EmberJS, a MVC framework and it’s being really challenging as I am new to it. I’ve had to fix issues with performance and page load times, integration on Amazon EC2; and all sorts. It’s been so difficult that I’ve started entertaining un-Nigerian thoughts of giving up on EmberJS – plain old vanilla JavaScript is much simpler.

Ok. Magana Yakare (“The discussion is over”, language: Hausa).

Have a great weekend – I just wanted to go at it the Naija style today and not write the same old normal way I do. I hope you enjoyed it; if you did – drop a nice comment or share some of your grad life experience.

N.B: If you’re a grad student having issues with your thesis; don’t worry be happy :D



Thesis Stories : Wrangling with HUGE data

My thesis takes all my time: I have to review papers, write out my thoughts, build a platform, attend classes as well as poke into Big Data; and my blog has been at the receiving end. This story about big data came to my mind while I was thinking about my planned work on the stackoverflow (SO) dataset. My adviser suggested using the dataset as a proxy for my experiments; I am assuming you know stackoverflow already.

Dumps of stackoverflow content are released every third month on a creative commons license (don’t worry; it’s anonymized). I got the latest dump and it is  ~7GB of compressed XML files. You can get the dumps at this link.  The XML files posed the challenge; some are pretty ok – the smallest is around 309Mb but there are two humongous beasts: the posthistory and the posts files are ~17.9GB and ~12Gb respectively! How do you open text files that huge? I tried vim, less and others before giving up.

I eventually found a python script to convert the XML dump to SQLite; I had to update it as the schema has changed. Although it’s not optimized and somewhat slow, it does what I want; (maybe I can improve it and put it back up on the stackoverflow (SO) meta site – bah… I am too lazy).

Pronto, I ran the script and went away with happy thoughts of the good times I would have with the data. Well, I came back a couple of hours later to find out that the conversion had failed. I initially thought the script was broken and went on a wild goose chase looking for other scripts and converters – I got one that required me to install postgresql (I have never used this RDBMS before and it has some of its own quirks too ;) ). Finally I got that converter to play nice with postgresql only for it to break too – Aaaargh! Try picturing an exasperated me :P.

I backtracked and found out that the problem was actually due to poorly-formed XML – wonder why don’t they make ’em parsers lenient? The dump contained some Unicode characters which are invalid in XML. I went back to SO and came across someone who someone who ran into the same problem with an old dump of SO data! From that question; I got a python script that could detect the occurrence of invalid characters and ran it on my dump – well it detected quite a couple. Next step? The elimination of the characters of course!

Subsequently I found another python script to replace the unwanted Unicode (again? Pulling scripts off the internet? Well, I could write them myself but why spend lots of hours on something when I could get it working in minutes – and yes I understand how the code works; at least I think I do :) ).  The script didn’t do exactly what I wanted so I fixed the regular expression in it; tested it and then ran it on my ginormous 17Giga… a couple of minutes later and I was done. Phew! I loaded it up in sqlite3 (another hair-pulling experience – I found out the hard way that SQLite wouldn’t load sqlite3 DBs…. yikes!).

Finally I can run queries on my humongous pet – the largest table contains more than 23 million records and I have to pull out metrics and possibly generate images and graphs to really get an idea of what lies beneath. I pray I get this done, I really want to conquer this beast of data and I should be posting here about my weekly progress insha Allaah.

So in case you need scripts to use with ginormous data; I have uploaded all three to my github: here.

DISCLAIMER: I am not claiming to be the author of these scripts; I only made minor modifications to get them to work.

Do you know editors that can handle HUGE GINORMOUS databeasts (yes databeast, not dataset)? Please drop a comment.

Have fun!