MOOC Review : Machine Learning

So I completed the Machine Learning course on Coursera recently; it’s my umpteenth MOOC and I actually do not know how many I have taken or want to take. Despite this, I have only gotten three certificates so far; all my attempts at integrating Coursera courses into my schedule have been unsuccessful. I have had to settle for auditing courses at my own pace: at least I get to fulfill my goal, insha Allaah: learning and understanding.

I took a couple of data mining courses during my master’s degree and have some experience with a variety of tools and programming languages (Octave, Weka). As grad students, we had to write our own classifiers and performance evaluators (kNN, Naive Bayes, ROC curves). Using Weka or similar tools/libraries would have saved us much trouble; however, our lecturer did not agree. I believe his approach was great, as we were ‘forced’ to learn and got to really understand what the mathematics was all about.

Back to the Machine Learning course: I think Professor Andrew Ng (Stanford) is simply awesome at what he does – everything is explained in really simple terms. He is also the guy behind the awesome autonomous helicopter video. Haven’t seen it? Here is the link; go watch it!

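It was refreshing to use Octave again; its elegant approach to numerical computation is mind-blowing – it makes it possible to get a lot done in a few lines. At first glance it also seems to handle the floating-point issue well (0.1 + 0.2 displays as 0.3 at the default output precision), although that is only the display rounding the result: Octave uses the same IEEE 754 doubles as most languages, so 0.1 + 0.2 == 0.3 still evaluates to false. The conciseness is impressive when compared to other languages; however, every language has its strengths, weaknesses and application domains.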
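Here is a quick check of the 0.1 + 0.2 example, sketched in Python rather than Octave (both use IEEE 754 double precision, so the arithmetic is the same). The sum only looks like 0.3 when it is rounded for display; an exact comparison fails, which is why comparing with a tolerance is the usual approach.

```python
import math

total = 0.1 + 0.2

# Rounded to four decimals for display, the sum looks exactly like 0.3 ...
shown = f"{total:.4f}"

# ... but the stored double is slightly larger than the double nearest to 0.3.
exactly_equal = (total == 0.3)

# Comparing with a tolerance is the usual way to test floats for equality.
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)

print(shown, exactly_equal, close_enough)  # prints: 0.3000 False True
```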

Alhamdulilah, I learned a great deal and refreshed my background in a variety of machine learning techniques and applications. It was nice to relearn the concepts of gradient descent and stochastic gradient descent, clustering, overfitting and underfitting (bias and variance). There were a couple of new topics such as principal component analysis (beautifully explained), logistic regression, anomaly detection, recommender systems and support vector machines.

I struggled with the neural networks section and at times found the programming exercises quite challenging. It was a surprise (albeit a humbling one) to realize that I didn’t know as much as I thought I knew. The advice on data mining is invaluable: he gave suggestions on setting up a processing pipeline, ceiling analysis, learning curves, error analysis, regularization, creating artificial data and dealing with massive datasets.

Overall I think it is a great course and Alhamdulilah I am glad to have taken it. The next challenge is to find some way to use these techniques – there is no better way to fully understand than to practice, not so?

Interested? Head over to Coursera and sign up. It is currently not running; however, you can either view the archives at your own pace or wait for a new run. I have also written about some other MOOCs: Databases and Social Networks. Do check out this cool infographic too.



Thesis Stories Episode 2 : Adventures in Ginormous Data

What could be worse than trying to understand ginormous data? Not finding what you’re looking for in it! The original plan was to use Stack Overflow data as a proxy for my data mining experiments, and after I had battled the databeast into submission (with the aid of ‘weapons of mass data analysis’ like Python, SQL Server and sqlite3), it pulled another fast one on me.

I started exploring the murky depths of the subdued dataset and plotted the data distributions (they were mostly heavy-tailed as expected, although one was unexpectedly normally distributed). Plotting the distributions was a task in itself – I had to learn about the awesome matplotlib, NumPy and SciPy (installing them was ‘wahala’, i.e. trouble) – and then the distributions were so skewed that I had to spend hours on end fine-tuning and tweaking before they finally agreed to appear properly on my plots.
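One common way to tame that kind of skew is logarithmic binning, where each bin is wider than the previous one by a constant factor. The sketch below is only an illustration of that idea, not necessarily the tweak used for these plots: the helper `log_binned_counts`, its `bins_per_decade` parameter and the synthetic Pareto sample are all assumptions, and it uses only the standard library so it runs without matplotlib.

```python
import math
import random

def log_binned_counts(values, bins_per_decade=2):
    """Count positive values in bins whose edges grow geometrically."""
    lo = math.floor(math.log10(min(values)))   # lowest power of ten covered
    hi = math.ceil(math.log10(max(values)))    # highest power of ten covered
    n_bins = max(1, (hi - lo) * bins_per_decade)
    counts = [0] * n_bins
    for v in values:
        idx = int((math.log10(v) - lo) * bins_per_decade)
        counts[min(idx, n_bins - 1)] += 1      # clamp the maximum into the last bin
    edges = [10 ** (lo + i / bins_per_decade) for i in range(n_bins + 1)]
    return edges, counts

# Synthetic heavy-tailed sample (made-up data, not the Stack Overflow dump).
random.seed(0)
sample = [random.paretovariate(1.2) for _ in range(10_000)]

edges, counts = log_binned_counts(sample)
for left, right, count in zip(edges, edges[1:], counts):
    bar = "#" * round(10 * math.log10(count + 1))  # log-scaled bar length
    print(f"{left:10.1f} to {right:10.1f} {count:6d} {bar}")
```

With matplotlib, putting both axes on a log scale (for example with `plt.xscale('log')` and `plt.yscale('log')`) gives a similar effect on the plot itself.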

Plotting data distributions was child’s play compared with the next set of challenges: getting the candidate posts. Having defined the five features I was interested in, I set out with gusto to find the ‘offending’ entries. I got a surprising outcome – the first three characteristics (deleted answers, converted answers and flagged answers) didn’t exist in the dump; I got the same outcome when I ran my queries on the online data explorer. Finally, I asked on Meta Stack Overflow, where it was confirmed that those entries are not included.

You bet I was disappointed, right? Sure! I’d spent hours on end writing convoluted SQL queries (subselects, joins, aggregations and what-have-you) and wrapping my head around the data. Heck! Some queries took me about an hour to write, run, verify and tune. Do you know what programming in a declarative, (mostly) non-Turing-complete language with lots of constraints (geek-speak for SQL) feels like? It feels like fighting Mike Tyson with one hand tied behind your back. :P (Alhamdulilah, I took the MOOC DB course).
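To give a flavour of the kind of query involved, here is a minimal sketch that combines a self-join, an aggregation and a subselect using Python’s built-in sqlite3 module. The `posts` table, its columns and the sample rows are hypothetical and much simpler than the real Stack Overflow dump; the query is an illustration, not one of the actual thesis queries. It finds questions that have more answers than the average answered question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy schema: questions and answers share one table,
# and each answer points at its question through parent_id.
conn.executescript("""
CREATE TABLE posts (
    id        INTEGER PRIMARY KEY,
    post_type TEXT NOT NULL,      -- 'question' or 'answer'
    parent_id INTEGER,            -- question id for answers, NULL otherwise
    score     INTEGER NOT NULL
);
INSERT INTO posts VALUES
    (1, 'question', NULL, 5),
    (2, 'question', NULL, 1),
    (3, 'answer', 1, 4),
    (4, 'answer', 1, -2),
    (5, 'answer', 1, 7),
    (6, 'answer', 2, 0);
""")

query = """
SELECT q.id, COUNT(a.id) AS n_answers, MAX(a.score) AS best_score
FROM posts AS q
JOIN posts AS a ON a.parent_id = q.id              -- self-join: question to answers
WHERE q.post_type = 'question' AND a.post_type = 'answer'
GROUP BY q.id                                      -- aggregate per question
HAVING COUNT(a.id) > (                             -- subselect: average answers per question
    SELECT COUNT(*) * 1.0 / COUNT(DISTINCT parent_id)
    FROM posts
    WHERE post_type = 'answer'
)
ORDER BY q.id
"""

rows = conn.execute(query).fetchall()
print(rows)  # prints: [(1, 3, 7)]
```

Even on six rows, the query needs three different constructs to answer one simple question, which is roughly why the real ones took so long to write and verify.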

When man fall down, no be the end of hin life… (Nigerian proverb in Pidgin English: “When a man falls down, it is not the end of his life”)

So I listed out my alternatives: getting a new dataset, sticking with the same dataset or starting on a new topic (most disliked, obviously ;) ). My search for a new dataset was not fruitful – I found the other datasets ill-suited to my research, and the potentially painful process of transforming them did not appeal to me. I went back to my dataset and extracted three other features, but the nagging feeling in my mind is that I might have to fall back on the third option.

So do I concede defeat? Nah, not at all – am a Nigerian, remember? We never say die; we’re way too optimistic for our own good even :).

Lessons Learnt

  • Never write a lot until you’re really, really sure that you’re gonna get something.
  • How to extract information from papers, critique them and know what they are all about.
  • How to read and write continuously for a long period – how do I do it? Pomodoro of course!

Next Steps

I might go back to the SO data or start all over again, but I just pray it all turns out fine – I now have about 4 months left.

Ohh, I forgot to talk about the platform – that’s been going just about as well as the experiments.

I am using EmberJS, an MVC framework, and it’s been really challenging as I am new to it. I’ve had to fix issues with performance and page load times, integration on Amazon EC2 and all sorts. It’s been so difficult that I’ve started entertaining un-Nigerian thoughts of giving up on EmberJS – plain old vanilla JavaScript is much simpler.

Ok. Magana Yakare (“The discussion is over”, language: Hausa).

Have a great weekend – I just wanted to go at it the Naija style today and not write the same old normal way I do. I hope you enjoyed it; if you did – drop a nice comment or share some of your grad life experience.

N.B.: If you’re a grad student having issues with your thesis, don’t worry, be happy :D