Thesis Stories Episode 2: Adventures in Ginormous Data

What could be worse than trying to understand ginormous data? Not finding what you’re looking for in it! The original plan was to use Stack Overflow data as a proxy for my data mining experiments, and after battling the databeast into submission (with the aid of ‘weapons of mass data analysis’ like Python, SQL Server and sqlite3), it pulled another fast one on me.

I started exploring the murky depths of the subdued dataset and plotted the data distributions (they were mostly heavy-tailed as expected, although one was unexpectedly normally distributed). Plotting the distributions was a task in itself – I had to learn about the awesome matplotlib, numpy and scipy (installing them was ‘wahala’) – and then the distributions were so skewed that I had to spend hours on end fine-tuning and tweaking before they finally agreed to appear properly on my plots.
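To give a flavour of the skew problem, here’s a minimal sketch (with invented numbers, not the real Stack Overflow data): on linear bins a heavy-tailed distribution collapses into the first bar, so log-spaced bins are one of the tweaks that make it appear properly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake "answers per user" counts -- a Pareto draw stands in for the
# heavy-tailed distributions in the dump (this data is made up).
answers_per_user = rng.pareto(a=1.5, size=100_000) + 1

# Log-spaced bins: on linear bins the heavy tail squashes everything
# into the first bar, which is exactly the skew I had to tweak away.
bins = np.logspace(0, np.log10(answers_per_user.max()), 30)
counts, edges = np.histogram(answers_per_user, bins=bins)

# With matplotlib you would then plot counts vs. edges on a
# log-log scale, e.g.:
#   import matplotlib.pyplot as plt
#   plt.loglog(edges[:-1], counts, marker="o")
#   plt.xlabel("answers per user"); plt.ylabel("frequency")
#   plt.savefig("distribution.png")

print(counts[0], counts[-1])  # the mass concentrates in the first bins
```

The same log-log trick makes a power law show up as a roughly straight line, which is how you eyeball “heavy-tailed as expected”.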

Plotting data distributions was child’s play compared with the next set of challenges: getting the candidate posts. Having defined the five features I was interested in, I set out with gusto to find the ‘offending’ entries. I got a surprising outcome – the first three characteristics (deleted answers, converted answers and flagged answers) didn’t exist in the dump; I got the same outcome when I ran my queries on the online data explorer. Finally, I asked on Stack Overflow Meta and it was confirmed.

You bet I was disappointed, right? Sure! I’d spent hours on end writing convoluted SQL queries (subselects, joins, aggregations and what-have-you) and wrapping my head around the data. Heck! Some queries took me about an hour to write, run, verify and tune. Do you know what programming in a declarative, non-Turing-complete language with lots of constraints (geek-speak for SQL) feels like? It feels like fighting Mike Tyson with one hand tied behind your back. :P (Alhamdulilah, I took the MOOC DB course.)
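For the curious, here’s the flavour of query I mean, sketched with Python’s built-in sqlite3 on a toy table – the table and column names are invented, not the real Stack Overflow schema:

```python
import sqlite3

# Toy schema standing in for the dump; the real Stack Overflow
# tables and columns are different -- these names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, owner_id INTEGER, score INTEGER);
    INSERT INTO posts VALUES (1, 10, 5), (2, 10, -2), (3, 20, 7), (4, 20, 1);
""")

# An aggregation with a subselect: users whose average post score
# beats the overall average -- the kind of query that takes a while
# to write, run, verify and tune.
rows = con.execute("""
    SELECT owner_id, AVG(score) AS avg_score
    FROM posts
    GROUP BY owner_id
    HAVING AVG(score) > (SELECT AVG(score) FROM posts)
""").fetchall()

print(rows)  # -> [(20, 4.0)]
```

The overall average here is 2.75, so only user 20 (average 4.0) survives the `HAVING` filter – in SQL the “loop” is implicit, which is exactly what makes the declarative style feel like fighting with one hand tied.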

When man fall down, no be the end of hin life… (“When a man falls down, it is not the end of his life”; Nigerian proverb; language: Pidgin English)

So I listed out my alternatives: getting a new dataset, using the same dataset or starting on a new topic (the most disliked, obviously ;) ). My search for a new dataset was not fruitful – I found the other datasets ill-suited to my research, and going through the potentially painful process of transforming them did not appeal to me. I went back to my dataset and extracted three other features, but the nagging feeling in my mind was that I might have to fall back to the third option.

So do I concede defeat? Nah, not at all – I’m a Nigerian, remember? We never say die; we’re way too optimistic for our own good even :).

Lessons Learnt

  • Never write a lot until you’re really, really sure that you’re gonna get something.
  • How to extract information from papers, critique them and know what they’re all about.
  • How to read and write continuously for a long period – how do I do it? Pomodoro, of course!

Next Steps

I might go back to the SO data, or start all over again, but I just pray it turns out all fine – I now have about 4 months left.

Ohh, I forgot to talk about the platform – that’s been just about as good as the experiments.

I am using EmberJS, an MVC framework, and it’s been really challenging as I am new to it. I’ve had to fix issues with performance and page load times, integration with Amazon EC2, and all sorts. It’s been so difficult that I’ve started entertaining un-Nigerian thoughts of giving up on EmberJS – plain old vanilla JavaScript is much simpler.

Ok. Magana Yakare (“The discussion is over”, language: Hausa).

Have a great weekend – I just wanted to go at it Naija style today and not write in the same old normal way I do. I hope you enjoyed it; if you did, drop a nice comment or share some of your grad life experiences.

N.B: If you’re a grad student having issues with your thesis; don’t worry be happy :D



MOOC Talk: And I thought I knew SQL

Massive Open Online Courses (MOOCs) became popular early this year with offerings from Coursera, Udacity and edX. These platforms were inspired by the phenomenal success of the three online courses (db-class, ai-class and ml-class) that ran in late 2011. It has never been so easy to get high-quality knowledge – for example, Coursera has renowned experts teaching over 200 courses; moreover, the incentive of getting certificates appeals to many of us.

I have taken a couple of courses since the year started and I periodically enroll in new ones. Presently, I am registered in about 50 courses (am I greedy? :) ) – I am not taking all of them simultaneously; I just want to have access to the course content when I have time.

It’s impossible to learn everything and I don’t really care much about getting certificates (they are nice to have though); I am more interested in understanding the courses and getting the concepts right. The same principle applies to everything I do. Now, I am not saying excellent grades are bad – I appreciate great grades – however, I place more importance on thorough understanding and being able to apply concepts.

After completing Professor Kearns’s Networked Life, which I enjoyed, I decided to take Professor Jennifer Widom’s Database course. I had earlier enrolled in the course in late 2011 but had to drop out because of my coursework at MASDAR. The course started off well enough but soon became challenging. I thought I was good at databases since I had been writing SQL queries (mostly CRUD, referential integrity etc.) for at least 3 years; I was wrong.

Professor Jennifer Widom does a pretty great job of guiding students through the concepts. The videos are easy to follow and have been meticulously prepared; the assignments, however, are quite challenging and I recommend you do them if you really want to understand the course. I initially wanted to post my answers on my GitHub, but I don’t think I’ll be doing that – it doesn’t help anyone: the main aim is to get the knowledge and not just perfect scores.

I have learnt a lot: relational algebra, XML, JSON, SQL, constraints, triggers and a whole slew of other topics. It was also good to learn about database indexes, their types and applications; the succinct exposition provided great insight into the merits and demerits of this great concept.
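To make the index idea concrete, here’s a tiny sqlite3 sketch (the table is invented, not from the course) showing the trade-off in action: after `CREATE INDEX`, the query planner can jump straight to matching rows instead of scanning the whole table, at the cost of extra storage and slower writes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, owner_id INTEGER, score INTEGER)"
)
con.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(i, i % 100, i % 7) for i in range(1000)],  # toy data
)

# Without this index, a lookup on owner_id scans all 1000 rows;
# with it, the planner can seek directly to owner_id = 42.
con.execute("CREATE INDEX idx_posts_owner ON posts (owner_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE owner_id = 42"
).fetchall()
print(plan)  # the plan should mention idx_posts_owner
```

`EXPLAIN QUERY PLAN` is the quickest way to check whether an index is actually being used – a demerit of indexes is that they are easy to create and then silently ignored by the planner.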

I don’t want to be a database administrator, but I do want to learn. Moreover, I am always interested in programming languages and paradigms, and SQL is a language in its own right. It’s a declarative language and forces you to think in weird ways, but it’s always fun to stretch, ain’t it? Some people write and solve whole problems in SQL; talk about gurus…

Alhamdulilah, the skills I picked up come in handy for my work on the Stack Overflow dataset, so you never can tell when you’ll need some knowledge. It’s even possible to query XML data using XPath, XQuery or XSLT; however, the results will be in XML. If you’re interested in doing mental acrobatics, write some SQL code or challenge yourself.
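If you want a taste of the XPath side without installing anything, Python’s built-in ElementTree supports a limited subset of XPath – enough for a quick sketch (the XML fragment below is made up):

```python
import xml.etree.ElementTree as ET

# A toy XML document; ElementTree's findall() accepts a limited
# subset of XPath, which covers simple paths and attribute tests.
doc = ET.fromstring("""
<courses>
  <course provider="Coursera"><title>Networked Life</title></course>
  <course provider="Coursera"><title>Databases</title></course>
  <course provider="Udacity"><title>CS101</title></course>
</courses>
""")

# XPath-style query: titles of all Coursera courses.
titles = [t.text for t in doc.findall(".//course[@provider='Coursera']/title")]
print(titles)  # -> ['Networked Life', 'Databases']
```

Full XPath, XQuery and XSLT need a dedicated library, but the mental model – navigating the tree declaratively instead of looping over it – is the same flavour of acrobatics as SQL.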

Well, I haven’t finished the course – I am midway through – but I’ll gladly recommend it to anyone who is interested in DBs or to anyone who just wants to learn. That’s it for this week; insha Allaah I’ll be writing about productivity or Ember or my thesis next week.

Learn, learn even more and never stop learning…