Thesis Stories Episode 2: Adventures in Ginormous Data


What could be worse than trying to understand ginormous data? Not finding what you’re looking for in it! The original plan was to use Stack Overflow data as a proxy for my data mining experiments, and after I had battled the databeast into submission (with the aid of ‘weapons of mass data analysis’ like Python, SQL Server and sqlite3), it pulled another fast one on me.

I started exploring the murky depths of the subdued dataset and plotted the data distributions (they were mostly heavy-tailed as expected, although one was unexpectedly normally distributed). Plotting the distributions was a task in itself – I had to learn about the awesome matplotlib, numpy and scipy (installing them was ‘wahala’, Nigerian Pidgin for trouble) – and then the data was so skewed that I had to spend hours on end fine-tuning and tweaking before the distributions finally agreed to appear properly on my plots.
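For fellow sufferers: one common way to make heavy-tailed data behave on a plot is to bin it on a logarithmic scale first. Below is a minimal sketch in plain Python. The function name `log_binned_counts` and the `bins_per_decade` parameter are made up for illustration and are not taken from the actual thesis code; the sample data is randomly generated.

```python
import math
import random


def log_binned_counts(values, bins_per_decade=4):
    """Count positive values in logarithmically spaced bins.

    Returns (edges, counts), where counts[i] covers [edges[i], edges[i + 1]).
    Zero and negative values are skipped because they have no logarithm.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return [], []
    lo = math.floor(math.log10(min(positive)))      # lowest decade
    hi = math.floor(math.log10(max(positive))) + 1  # one past the highest decade
    n_bins = (hi - lo) * bins_per_decade
    edges = [10 ** (lo + i / bins_per_decade) for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in positive:
        idx = int((math.log10(v) - lo) * bins_per_decade)
        idx = min(max(idx, 0), n_bins - 1)  # guard against rounding at the edges
        counts[idx] += 1
    return edges, counts


if __name__ == "__main__":
    random.seed(0)
    # Invented heavy-tailed sample (Pareto-distributed), purely for demonstration.
    sample = [random.paretovariate(1.5) for _ in range(10_000)]
    edges, counts = log_binned_counts(sample)
    for left, right, n in zip(edges, edges[1:], counts):
        print(f"[{left:10.2f}, {right:10.2f}) {n}")
```

The edges and counts can then be handed to matplotlib and drawn with both axes set to a log scale, which usually spreads the long tail out instead of squashing everything against the left edge.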

Plotting data distributions was child’s play compared with the next challenge: getting the candidate posts. Having defined the five features I was interested in, I set out with gusto to find the ‘offending’ entries. I got a surprising outcome – the first three characteristics (deleted answers, converted answers and flagged answers) didn’t exist in the dump. I got the same outcome when I ran my queries on the online data explorer. Finally, I asked on Meta Stack Overflow and it was confirmed.

You bet I was disappointed, right? Sure! I’d spent hours on end writing convoluted SQL queries (subselects, joins, aggregations and what-have-you) and wrapping my head around the data. Heck! Some queries took me about an hour to write, run, verify and tune. Do you know what programming in a declarative, non-Turing-complete language with lots of constraints (geek-speak for SQL) feels like? It feels like fighting Mike Tyson with one hand tied behind your back. :P (Alhamdulilah, I took the MOOC DB course).
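To give a flavour of the kind of query involved, here is a small sketch using Python’s sqlite3 module. The `posts` table and its columns (`id`, `post_type_id`, `parent_id`, `score`) are a simplified, assumed schema only loosely modelled on the dump, the function names are hypothetical, and the rows are invented demo data. It shows a self-join plus an aggregation that counts the answers attached to each question.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, invented posts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        " id INTEGER PRIMARY KEY,"
        " post_type_id INTEGER,"  # assumed: 1 = question, 2 = answer
        " parent_id INTEGER,"     # for answers: the id of the question
        " score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?)",
        [
            (1, 1, None, 5),  # question 1
            (2, 2, 1, 3),     # answer to question 1
            (3, 2, 1, 1),     # answer to question 1
            (4, 1, None, 0),  # question 4
            (5, 2, 4, 2),     # answer to question 4
        ],
    )
    return conn


def answers_per_question(conn, min_answers=1):
    """Return (question_id, answer_count, avg_answer_score) rows."""
    query = """
        SELECT q.id, COUNT(a.id) AS answer_count, AVG(a.score) AS avg_score
        FROM posts AS q
        JOIN posts AS a ON a.parent_id = q.id
        WHERE q.post_type_id = 1 AND a.post_type_id = 2
        GROUP BY q.id
        HAVING COUNT(a.id) >= ?
        ORDER BY answer_count DESC, q.id
    """
    return conn.execute(query, (min_answers,)).fetchall()


if __name__ == "__main__":
    for row in answers_per_question(build_demo_db()):
        print(row)
```

Even a toy query like this needs a join, a filter, a grouping and a HAVING clause, so it is easy to see how the real ones ballooned.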

When man fall down, no be the end of hin life… (Nigerian proverb: “When a man falls down, it is not the end of his life”; language: pidgin English)

So I listed out my alternatives: getting a new dataset, using the same dataset or starting on a new topic (most disliked, obviously ;) ). My search for a new dataset was not fruitful – I found other datasets ill-suited to my research, and going through the potentially painful process of transforming them did not appeal to me. I went back to my dataset and extracted three other features, but the nagging feeling in my mind is that I might have to fall back on the third option.

So do I concede defeat? Nah, not at all – I’m a Nigerian, remember? We never say die; we’re way too optimistic for our own good even :).

Lessons Learnt

  • Never write a lot until you’re really really sure that you’re gonna get something.

  • How to extract information from papers, critique them and know what they are all about.

  • How to read and write continuously for a long period – how do I do it? Pomodoro of course!

Next Steps

I might go back to the SO data or start all over again, but I just pray it all turns out fine – I now have about 4 months left.

Oh, I forgot to talk about the platform – that’s been just about as good as the experiments.

I am using EmberJS, an MVC framework, and it’s been really challenging as I am new to it. I’ve had to fix issues with performance and page load times, integration on Amazon EC2, and all sorts. It’s been so difficult that I’ve started entertaining un-Nigerian thoughts of giving up on EmberJS – plain old vanilla JavaScript is much simpler.

Ok. Magana Yakare (“The discussion is over”, language: Hausa).

Have a great weekend – I just wanted to go at it the Naija style today and not write the same old normal way I do. I hope you enjoyed it; if you did – drop a nice comment or share some of your grad life experience.

N.B: If you’re a grad student having issues with your thesis, don’t worry, be happy :D

5 thoughts on “Thesis Stories Episode 2: Adventures in Ginormous Data”

  1. Barakallah feeh bro. May Allah ease your task. Me myself I’m having problem(s) with my seminar on data mining. Big data is sure a bi*tch. I’m taking a course on big data and web intelligence. Too much history and logic embedded in data. I haven’t dealt with data at that level so I don’t have anything to offer you except prayers and well, I’ll be rooting for you all the way. You think you can help with my seminar?
