What could be worse than trying to understand ginormous data? Not finding what you’re looking for in it! The original plan was to use stackoverflow data as a proxy for my data mining experiments. After I battled the databeast into submission (with the aid of ‘weapons of mass data analysis’ like Python, SQL Server and sqlite3), it pulled another fast one on me.
I started exploring the murky depths of the subdued dataset and plotted the data distributions (they were mostly heavy-tailed, as expected, although one was unexpectedly normally distributed). Plotting the distributions was a task in itself – I had to learn about the awesome matplotlib, numpy and scipy (installing them was ‘wahala’, Pidgin for trouble) – and then the data were so skewed that I spent hours on end fine-tuning and tweaking before the distributions finally agreed to appear properly on my plots.
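For anyone fighting the same skew: a common trick for heavy-tailed data is to bin on a logarithmic scale and plot on log-log axes. Below is a minimal sketch of that idea, not the exact code I used. The function names (`log_binned_histogram`, `plot_loglog`), the `bins_per_decade` parameter and the output file name are all made up for illustration, and the plotting helper assumes matplotlib is installed.

```python
import math


def log_binned_histogram(values, bins_per_decade=5):
    """Count positive values in logarithmically spaced bins.

    Returns (edges, counts) with len(edges) == len(counts) + 1.
    Hypothetical helper for illustration only.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")

    # Cover whole decades, from just below the minimum to just above the maximum.
    lo = math.floor(math.log10(min(positives)))
    hi = math.ceil(math.log10(max(positives)))
    if hi == lo:
        hi = lo + 1

    n_bins = (hi - lo) * bins_per_decade
    edges = [10 ** (lo + i / bins_per_decade) for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in positives:
        idx = int((math.log10(v) - lo) * bins_per_decade)
        counts[min(idx, n_bins - 1)] += 1  # the largest value goes in the last bin
    return edges, counts


def plot_loglog(edges, counts, path="distribution.png"):
    """Plot count per unit bin width on log-log axes (assumes matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs, ys = [], []
    for left, right, count in zip(edges, edges[1:], counts):
        if count > 0:  # empty bins cannot be drawn on a log axis
            xs.append(math.sqrt(left * right))  # geometric bin centre
            ys.append(count / (right - left))   # normalise by bin width
    fig, ax = plt.subplots()
    ax.loglog(xs, ys, "o")
    ax.set_xlabel("value")
    ax.set_ylabel("count per unit width")
    fig.savefig(path)
```

Dividing each count by its bin width matters here: log-spaced bins get wider as the values grow, so raw counts would make the tail look heavier than it is.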
Plotting data distributions was child’s play compared with the next challenge: getting the candidate posts. Having defined the five features I was interested in, I set out with gusto to find the ‘offending’ entries. I got a surprising outcome – the first three characteristics (deleted answers, converted answers and flagged answers) didn’t exist in the dump, and I got the same outcome when I ran my queries on the online data explorer. Finally, I asked on stackoverflow meta and it was confirmed.
You bet I was disappointed, right? Sure! I’d spent hours on end writing convoluted SQL queries (subselects, joins, aggregations and what-have-you) and wrapping my head around the data. Heck! Some queries took me about an hour to write, run, verify and tune. Do you know what programming in a declarative, non-Turing-complete language with lots of constraints (geek-speak for SQL) feels like? It feels like fighting Mike Tyson with one hand tied behind your back. 😛 (Alhamdulilah, I took the MOOC DB course).
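To give a flavour of what I mean by subselects, joins and aggregations, here is a small sketch using Python’s built-in sqlite3 module. It is not one of my actual queries: the `posts` table, its columns and the toy rows are assumptions invented for this example, and the real dump’s schema differs.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical posts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        " id INTEGER PRIMARY KEY,"
        " post_type TEXT NOT NULL,"   # 'question' or 'answer' (assumed encoding)
        " parent_id INTEGER,"         # question id for answers, NULL otherwise
        " score INTEGER NOT NULL)"
    )
    rows = [
        (1, "question", None, 10),
        (2, "question", None, 3),
        (3, "question", None, 0),   # a question with no answers
        (4, "answer", 1, 5),
        (5, "answer", 1, -1),
        (6, "answer", 1, 2),
        (7, "answer", 2, 4),
    ]
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", rows)
    return conn


def busy_questions(conn):
    """Questions with more answers than the average answered question.

    Uses a self-join (question to its answers), an aggregation (COUNT, SUM)
    and a nested subselect (the average number of answers per question).
    """
    query = """
        SELECT q.id,
               COUNT(a.id)               AS n_answers,
               COALESCE(SUM(a.score), 0) AS total_score
        FROM posts AS q
        LEFT JOIN posts AS a
               ON a.parent_id = q.id AND a.post_type = 'answer'
        WHERE q.post_type = 'question'
        GROUP BY q.id
        HAVING COUNT(a.id) > (
            SELECT AVG(cnt)
            FROM (SELECT COUNT(*) AS cnt
                  FROM posts
                  WHERE post_type = 'answer'
                  GROUP BY parent_id)
        )
        ORDER BY q.id
    """
    return conn.execute(query).fetchall()
```

With the toy rows above, `busy_questions(make_demo_db())` returns only question 1, which has three answers against an average of two per answered question. Even a query this small needs a join, a grouping and a nested subselect, which is why the real ones ate up so much time.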
When man fall down, no be the end of hin life… (Nigerian proverb in Pidgin English: “When a man falls down, it is not the end of his life.”)
So I listed out my alternatives: getting a new dataset, using the same dataset or starting on a new topic (the most disliked, obviously 😉). My search for a new dataset was not fruitful – the other datasets I found were ill-suited to my research, and going through the potentially painful process of transforming them did not appeal to me. I went back to my dataset and extracted three other features, but the nagging feeling in my mind is that I might have to fall back on the third option.
So do I concede defeat? Nah, not at all – I’m a Nigerian, remember? We never say die; we’re even way too optimistic for our own good :).
Some lessons I’ve learnt along the way:
- Never write a lot until you’re really, really sure that you’re gonna get something.
- How to extract information from papers, critique them and know what they are all about.
- How to read and write continuously for a long period – how do I do it? Pomodoro, of course!
I might go back to the SO data or start all over again, but I just pray it all turns out fine – I now have about 4 months left.
Oh, I forgot to talk about the platform – that’s been going about as well as the experiments.
Ok. Magana Yakare (“The discussion is over”, language: Hausa).
Have a great weekend – I just wanted to go at it Naija style today instead of writing the same old way I normally do. I hope you enjoyed it; if you did, drop a nice comment or share some of your grad life experience.
N.B.: If you’re a grad student having issues with your thesis, don’t worry, be happy 😀