The Myth of Perfect Software


Programs do not acquire bugs as people acquire germs, by hanging around other buggy programs. Programmers must insert them. – Harlan Mills

Software breaks all the time: boot failures, corrupt files, crashes and so on; nearly everyone has had a close shave or two with fragile software. Can programmers write ‘perfect’, fault-free software? I presume a trip to Uranus would be easier – here I come, NASA! :D

A program’s complexity is way too large to fit in a programmer’s mind. It is difficult (nigh impossible) to prove that a program is valid; in fact, it is extremely difficult to even know how many errors a piece of software contains. Here is a simple example: exhaustive validation of an 8-character field would require checking 26^8 – over 200 billion – combinations (assuming only the letters a-z are allowed). Real-life programs typically have far larger problem domains.

Since achieving the utopia of bug-free software is not feasible, we might as well set our sights on something much more achievable – producing robust software. Robust code doesn’t translate into perfect software; however, it tries to behave ‘nicely’ when things go awry – for example, it does not wipe your hard drive.

Here are a couple of suggestions for writing robust code:

1. Don’t trust anything that comes in from external sources – sanitize, validate and then confirm. You can also log suspicious activity and fall back to sensible defaults (see the sketch after this list).

2. Remember Murphy’s law (“Anything that can go wrong, will go wrong”); that 1 in 10000000000000000000000000000 chance event still needs to be handled, even if you ‘think’ it’ll never happen. This is software plus humans – remember the huge complexity space?

3. It is essential to write simple, flexible, extensible code; if changing one parameter breaks everything else, you need to refactor. Also keep the YAGNI principle in mind and do not write unnecessary code just because you feel like it.

4. Write the least amount of (clear) code needed – the less code you have, the lower the complexity (and the fewer the likely bugs) and the better for everyone.

5. Keep cyclomatic complexity low; the fewer the independent execution paths through your program, the simpler it is.

6. Test your code and, when ‘bad’ things happen, remember to exit gracefully.
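Here is a minimal sketch of points 1 and 6 in Python – the field name, limits and fallback below are made up purely for illustration:

import logging
import sys

log = logging.getLogger("robust")

def parse_age(raw, default=None):
    """Validate an 'age' field coming in from the outside world."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        log.warning("suspicious age value: %r", raw)  # point 1: log suspicious input
        return default                                # and fall back instead of crashing
    if not 0 <= age <= 150:
        log.warning("age out of range: %d", age)
        return default
    return age

def main():
    try:
        print("parsed age:", parse_age(input("age: ")))
    except KeyboardInterrupt:
        sys.exit("interrupted - exiting gracefully")  # point 6: exit gracefully

if __name__ == "__main__":
    main()

The point is not the specific checks but the shape: validate at the boundary, log what looks odd and never let bad input decide how your program dies.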

Do try to write robust code; you’ll save yourself future worries, impress your customers (which likely translates into lots more $$$ from future referrals) and improve your craft.

This post was motivated by my fascination with the ext file system… I keep having boot issues with my Ubuntu installation whenever I force it to shut down – this leaves the ext filesystem journal in an unclean state. While in this state, the operating system does not load and I keep ending up at the initramfs prompt. A live disk, some tweaking at the terminal (e2fsck) and the operating system is as good as new again.


Thesis Stories Ep 3: Research is Hard!


Alhamdulilah, I completed my thesis about three weeks ago; if you’re interested, you can check out my thesis and presentation. Looking back at the two years I spent at MASDAR, I have a few thoughts: Alhamdulilah, I learnt a lot, met some wonderful people and matured significantly. There were some not-so-pleasant experiences too, but I believe I ultimately emerged stronger.

So I switched to the complexity analysis of road networks after my Stack Overflow adventure ended unsuccessfully. It was a fresh start, but I had no alternative since I wanted to graduate. In the end, I defended all my hard work in about 75 minutes – imagine! Nearly six months of work translating into just 75 minutes!!

Research is difficult! As difficult as any other endeavour; I think most researchers don’t know how their efforts will turn out (much like most start-ups at the beginning). There is usually a hunch about a model, some experiments, and then eventually they have to figure out what the ‘right’ result is. Also, ‘big data’ appears to be fun and cool, but it requires prodigious amounts of grunt work.

I built JIZNA, a custom Python framework for complexity analysis. JIZNA can parse OpenStreetMap XML dumps of cities (the parser was an open-source utility I found and modified), create dual graphs of these networks, merge discrete roads, exclude outliers and calculate the desired metrics. These metrics were then used to predict how difficult it would be to search the city. The JIZNA platform is available here.
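To give a feel for that pipeline, here is a rough sketch of the same ideas; it uses networkx and a couple of made-up metrics purely for illustration, and is not how JIZNA itself is implemented:

import networkx as nx

def build_dual_graph(road_intersections):
    """road_intersections: iterable of (road_a, road_b) pairs that intersect.
    In the dual graph, roads become nodes and intersections become edges."""
    g = nx.Graph()
    g.add_edges_from(road_intersections)
    return g

def prune_outliers(g, min_degree=1):
    """Drop isolated roads that would skew the metrics."""
    isolated = [n for n, d in g.degree() if d < min_degree]
    g.remove_nodes_from(isolated)
    return g

def complexity_metrics(g):
    """A few illustrative metrics of the kind used to estimate search difficulty."""
    n = max(g.number_of_nodes(), 1)
    return {
        "roads": g.number_of_nodes(),
        "intersections": g.number_of_edges(),
        "avg_degree": sum(d for _, d in g.degree()) / n,
        "clustering": nx.average_clustering(g),
    }

g = prune_outliers(build_dual_graph([("Main St", "1st Ave"),
                                     ("Main St", "2nd Ave"),
                                     ("1st Ave", "2nd Ave")]))
print(complexity_metrics(g))

The dual-graph view is what makes the metrics meaningful: once roads are nodes, standard graph measures say something about how tangled the city is.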

The Cool Stuff

I think I wrote much better code: the framework was modular, nicely designed and flexible; I was able to write some really cool algorithms for the complex computations and I learnt how to use Sphinx, the Python documentation tool. Sphinx, in my opinion, is a lovely tool once you grasp its basics.

The Not-so-Cool

I got a couple of interesting results; however, I think they were not so spectacular. I guess further work would reveal some new insights.

I had to throw away some of my code (a complete simulation framework had to be discarded when the approach changed) and my writing (again! This is the umpteenth time I’ve had to chop my writing).

So what did I learn? Lots more Python, algorithms, software design, documentation, writing, LaTeX, Vim and some maths (mostly matrix algebra). More importantly, however, I came to appreciate the value of grit, determination and perseverance while working towards goals. Don’t ever give up, even if all appears to be lost.

Next plans? I don’t quite know fully yet; one thing for sure: research is hard! :)

Did you like this post? Check out my other posts on Grad School.

Taking the PAIN out of coding


Over the years, I have learnt some tricks and picked up some lessons while writing code. Most were learnt the hard way, so I decided to share a couple of tips on how to avoid development pitfalls.

Meticulous Planning and Design

One of the most difficult lessons I learnt in software development was not to rush into code; I used to jump impulsively into software projects and start hacking away without planning fully. And as you can guess, the thrill of coding soon evaporated when I got bogged down by messy code. Sadly, many of my projects met their end this way.

Now, I am just too lazy, or maybe too battle-scarred, to do that; I mostly write out a high-level system design document (usually a page or two) describing all the major features. Next, I run through everything to check that the various components and interfaces are logically valid, and I try the edge cases. Only when I am satisfied with this do I start writing code.

Alhamdulilah, I think I write cleaner, more modular and better designed code this way. For example, I recently had to extend an experimental framework I wrote a couple of months back; surprisingly I was able to make all major changes in less than two hours. Better still, nothing broke when I ran the framework again!

A dev lead once told me coding is the easiest part of software development… I think I agree with him…

Do it fast and dirty, then clean up

I started using EmberJS last year for a web project. EmberJS is a really cool framework and reduces the amount of boilerplate code you have to write: it’s so powerful that some things seem magical. However, EmberJS has a really steep learning curve.

As usual, I was trying to write perfect code at my first attempt. Did I succeed? Your guess is as good as mine. I got so frustrated that I started hating EmberJS, the project and everything remotely related to it. :)

Before giving up, I decided to have one more go at it; my new approach involved ignoring all standards and good practices until I got something to work. And that was it: I soon had ‘something’ that looked like a working web application. One day, while working on the ‘bad’ code, I had an epiphany. In a flash, I suddenly knew what I was doing wrong. Following standards and practices was relatively easy afterwards.

Looking back, I realize that if I was bent on doing it perfectly at the first go I most probably wouldn’t have gotten to this point. Oh by the way, EmberJS got a new release so my code is obsolete again. :P

Clean up the code from step 2 above X more times

This is a part of development I don’t think I really like but it is essential for maintenance. You have to go back through the code (yes, your code; you ain’t gonna make life miserable for the developer inheriting your codebase). Refactor all duplicated, extraneous and obscure pieces of code ruthlessly. Most importantly, improve the readability of the code (yes, readability is REALLY important – make it read like a good novel if possible à la Shakespeare or Dickens).

I also keep a running list of all the hacks I make as I go about writing code in step 2; this list comes in handy at this stage and enables me to go straight to the substandard code pieces and fix them up.

Use a consistent coding style

I recently noticed that my coding style was inconsistent across my projects: variable names were either under_score or camelCase, while method declarations used both brace-on-new-line and brace-on-same-line styles.

The problem with this inconsistency is that it breaks my flow of thought and makes speed-reading code difficult. Now, I choose a single style and stick to it throughout a project – any style is fine provided I use it consistently.

Scientific Debugging

I came across the term ‘scientific debugging’ while blog-hopping and it has stuck in my subconscious ever since. Identifying bugs can be a challenge: for small projects, I just try to figure out where the bug might be and then check there. However, this approach does not scale; I wouldn’t randomly guess on a 5,000-line project.

Scientific debugging is a systematic process: you form hypotheses about the likely causes of the bug, list the places to check and then work through the list methodically, eliminating entries as you go. You’ll most probably find the bug with less effort and without running through the entire list.

Project Management

I rarely used to track how much time and effort I put into my projects; I would just code and code and code. Now I know better: I estimate how many hours I can put in before, during and after each project. I try to use Agile (although I also use a simple list plus Pomodoro) for project planning, task management and effort estimation. It is now trivial to look up a project’s status: implemented features, open issues and proposed features.

Testing

I tried my hand at TDD last year and felt it was just too much extra work for the amount of coding involved. While I might be wrong about TDD, I think it’s essential to have a solid testing process in whatever project you’re doing.

Test, test and test – run the gamut if possible: unit, integration, functional, stress, regression etc.
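As a tiny taste of the unit-test end of that gamut, here is a sketch using Python’s built-in unittest module; the function under test is invented for illustration:

import unittest

def safe_divide(a, b, default=0):
    """Return a / b, falling back to `default` instead of crashing."""
    return a / b if b != 0 else default

class SafeDivideTest(unittest.TestCase):
    def test_normal_division(self):
        self.assertEqual(safe_divide(10, 2), 5)

    def test_division_by_zero_falls_back(self):
        self.assertEqual(safe_divide(10, 0), 0)

if __name__ == "__main__":
    unittest.main()

Integration, functional and stress tests build on the same habit: make the expected behaviour executable.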

Enough said… I have dirty code to polish. If you did find some of the points useful, please share your thoughts and ideas.

Related Posts

  1. Symptoms of Software Rot
  2. So you want to become a better Programmer
  3. Clean code, Dirty code

The language series: PHP


I think PHP is disproportionately targeted for rants; it’s the language everyone loves to hate. Despite the proliferation of posts that list the 1 trillion things PHP does wrong, more than 70% of all websites run PHP on their servers. Even more interestingly, ‘better’ languages like Ruby sit below PHP in the TIOBE index (a language popularity ranking). Surely PHP must be doing something right, no?

Personally, I am pretty much indifferent to the language (or any other language for that matter); it does what it needs to do. Now, don’t start another debate around this post.

How I came to learn PHP

I had to learn PHP during my 6-month undergrad internship; I would have preferred a Java-related job, but after six weeks of extremely difficult job-hunting, I was ready to take up anything. I wrote lots of PHP code – lots of baaaaaad code, the type that makes you cringe, rant about PHP and scream at the developer. The only rule I didn’t break was using proper variable names; all the other rules (DRY, YAGNI, OOP concepts) were routinely chucked out of the window.

It wasn’t PHP that made me write bad code; rather, it was my amateur software development skills – that was my first real-life software development experience. However, I got to write some fun code too, Alhamdulilah: a cool timetable-generating algorithm, some Linux work and a bit of networking (I dropped that afterwards).

The Bad Parts of PHP

All languages have flaws and PHP is no exception; here are some of the reasons why people say PHP is baaaad for your programming health :D:

  • The language wasn’t really designed but grew by accreting features. So it feels kind of clunky…
  • Built-in libraries and functions are inconsistent.
  • Naming conventions are a mix of different styles:
    • camelCase e.g. getName()
    • under_score e.g. array_combine()
    • joinedwords e.g. localtime()
    • modulename_function e.g. libxml_clear_errors()
  • No Unicode support.
  • Slow; of course, what do you expect?

The Good

  • Huge libraries; PHP has functions for nearly everything you’ll ever need: regular expressions, URL parsing etc.
  • Awesome database support.
  • Familiar syntax; similar to the C/Java family.
  • Great community – the PHP community is sooooooo active.
  • Supported by nearly every hosting company and extremely widespread.
  • Easy to use:
    • Install WAMP/LAMP/XAMPP and get coding.
    • Low deployment and maintenance demands – you don’t need 6 months to launch :)
    • Easily integrated with most applications and frameworks.
  • Extensible platforms exist e.g. WordPress, Joomla, Drupal.

Developing in PHP

Many developers start out with Macromedia Dreamweaver and write extremely long, monolithic files that are poorly structured and organized – ugly spaghetti code. Amateur developers don’t care, for two reasons: they have little development experience and they still get to build stuff. Ultimately, without improvement, these developers are responsible for the bulk of the awful PHP code out there. There is little that can be done about this; however, if you have bad PHP code, please hide it until it’s improved. There are already enough rants on the internet…

Hopefully, you’ll realize the flaws in your development skills and leave Dreamweaver for an IDE (I personally think NetBeans is awesome). You might go on to learn about NoSQL databases, ORMs and a couple of other good things.

Finally, maybe you’ll move on to other languages and come to see your old PHP code as ugly. If this happens, just don’t forget that you’ve probably improved your programming skills by learning a new language, so don’t blame PHP entirely – you were a worse programmer before. Remember this before you start ranting.

Rating

6.5/10

Bjarne Stroustrup said: “There are only two kinds of languages: the ones people complain about and the ones nobody uses”.

Agreed, it may not be the best language, but come on – it gets things done. It is hugely popular, has excellent support and lets you ship easily.

Do you like this post? Check out my earlier posts on C, Python, JavaScript and Java.

The language series: Java


Java! The language I once loved so passionately that I saw other languages as being inferior. Now, I rarely use it – the last time was while writing an Android app early last year.

My Java Story

I had been learning C++ and finding OOP quite difficult. When the new academic session began, I had to take a programming language course in Java, and I had no compunction whatsoever about dropping my half-baked C++ for Java. I soon grew to like Java; in fact, to me it was the best language ever – not surprising, given my level of developer expertise back then.

Not surprisingly, I later had to set Java aside for PHP – well, circumstances dictated it and I had to make Java my second choice – but that’s the PHP story.

Java – the good

  • Java is everywhere: here, there, beyond and yonder – Windows, Linux, mobile devices, servers etc.
  • Lots of libraries – GUIs, Threading, Databases, Numerical computation, engines etc.
  • The Java Virtual Machine protects you from serious damage; moreover, it can execute bytecode produced by many other languages, e.g. Scala (a ‘better Java’), Groovy and Clojure (a Lisp dialect).
  • Grammar is similar to the C/C++ family.
  • JIT compilers are quite efficient now and Java code compares favourably with C/C++.
  • Automatic garbage collection, memory management and interfaces are cool; some believe these features spoil programmers. :)
  • Huge industry support.

Java – the bad

  • Verbose; it lacks expressiveness – a gazillion lines are needed to do simple stuff.
  • The == operator behaves differently on primitives and objects: it compares values for primitives but references for objects.
  • Some serious issues with the Date classes.
  • The UI toolkits (well, the ones I know and have used) don’t look too good.
  • Issues with numbers – floating-point quirks and awkward handling of huge integers; libraries exist though.
  • No way to return multiple values from a method; maybe an object or array? This affects C too.
  • The JVM doesn’t support tail-call optimization for recursive operations.

Java – the in-between :)

  • Java is not purely OOP; it’s more of a mix.
  • Does not support first-class functions; yeah, I know Java is OOP but what about C# which has this? OOP is no silver bullet; other approaches work better at times.
  • Reflection – can be a double-edged sword.

Writing Code in Java

Cool – you’ll most probably end up writing bad code :) Well, initially; I did too.

Over time, you’ll learn the culture of the language, how to use stuff and how to produce real awesome software. (Hopefully)

Quirks

Check out this article for some weirdness :)

Good for beginners?

I think lots of people start out with Java or make the switch at some point. It is probably the first encounter with OOP for many. It is weak on interactivity (it’s a compiled language) and this can hinder the rapid prototyping that is so useful while learning.

However, Java has some strong points – increased programmer productivity and code portability – it’s worth a try.

Rating

7.5/10

The language has somewhat come to represent OOP even though it is not a pure OOP language. It has lots of libraries, is fairly easy to use and has huge industry support.

Do you like this post? Check out my posts on C, Python, JavaScript and PHP.

The language series: C


I finally took the compulsory software engineering course, notorious for its very difficult course project – writing a bitcoin client in C. Alhamdulilah, we successfully completed the project: about 18k lines of code, automated builds/documentation/tests and lots of other stuff. I figure we rank around 7 or 8 on the Joel 12-point scale, even though some of the points don’t apply to our project. :D Big UPs to the team!

I decided to do a review of all the languages I have used or been forced to use while taking the course; the story behind learning these languages, their strengths and weaknesses; quirks, advice for beginners and some wisecracks too :).

C is first on the list. Here goes!

How I learnt C

I was somewhat forced to relearn this language this year, but my first attempt at C was self-study in 2007 or 2008 as an undergrad. Despite my dreams of building the most AWESOME program ever, my C adventure ended abruptly after I had read about three to five chapters of a C book. I was discouraged by apocryphal reports insinuating that C was no longer relevant, so I left C for C++ and then Java. That story is here.

Well, this year I had no choice but to learn it. Well, there was another choice: getting a poor grade in the software engineering course.

Likes

  • C packs a powerful punch, who doesn’t like power and speed?
  • It has a concise grammar and you can learn the language fast.
  • Purity: its simplicity forces you to think.
  • I think function pointers are kind of cool too.
  • Forces you to learn how low-level computer stuff like stacks, heaps and memory allocation works.

Dislikes

  • It doesn’t support as much abstraction as I want.
  • Bah… why do I have to call free() all the time? Can’t the language help me with this? I already know and agree I am spoilt, but why make programming harder?
  • No hash tables? No real string type? Beats me… every other language seems to have these.
  • There is some redundancy in the functions available in the C standard library, e.g. strtol and atol; it seems PHP had a predecessor in C.
  • Pointer tangles; what does this point to or mean? ***a.
  • Uninitialized variables can hold all sorts of values; woe betide you if you make the mistake of using them straight away – C won’t raise any errors.

Writing code in C

It’s one of two things: you’ll either learn code purity and write pretty nifty code or massacre lots of innocent computer bits à la segmentation faults, memory overwriting and stack overflows…

I think everyone starts out in the latter group and moves to the former :).

Recommended For Beginners?

C is pure and has a small grammar (which makes it easy to learn), but it is a bit challenging for a beginner to start with. I think Python or Scheme would be easier.
You’ll probably find OOP difficult to grasp if C is your first language; however, you’ll find other languages really easy.

C Quirks

7[a] == a[7] if a is an array (both are just *(a + 7)); it was even on my exam! :P

while (*s++ = *t++); copies the string t into the string s.

Rating

6/10

Pretty powerful, compact and small, although it lacks a lot of expected features and development is sometimes painful. There are a couple of libraries you can use, though.

I hear C++ is more challenging… Do the ++ signs signify difficulty? :)

Read my reviews of Python, Java, PHP and JavaScript too.

Thesis Stories: Wrangling with HUGE data


My thesis takes all my time: I have to review papers, write out my thoughts, build a platform, attend classes and poke into Big Data; my blog has been at the receiving end. This story about big data came to mind while I was thinking about my planned work on the Stack Overflow (SO) dataset. My adviser suggested using the dataset as a proxy for my experiments; I am assuming you know Stack Overflow already.

Dumps of Stack Overflow content are released every three months under a Creative Commons license (don’t worry; they’re anonymized). I got the latest dump and it is ~7GB of compressed XML files. You can get the dumps at this link. The XML files posed the challenge; some are pretty OK – the smallest is around 309MB – but there are two humongous beasts: the posthistory and posts files are ~17.9GB and ~12GB respectively! How do you open text files that huge? I tried vim, less and others before giving up.

I eventually found a Python script to convert the XML dump to SQLite; I had to update it as the schema had changed. Although it’s not optimized and somewhat slow, it does what I want (maybe I can improve it and put it back up on the Stack Overflow (SO) meta site – bah… I am too lazy).
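For a feel of what such a conversion looks like, here is a minimal sketch (not the actual script I used) that streams one of the XML files into SQLite; the file name and column list are illustrative, and it assumes the XML is well formed – which, as you’ll see below, it wasn’t:

import sqlite3
import xml.etree.ElementTree as ET

COLUMNS = ["Id", "PostTypeId", "CreationDate", "Score", "Title", "Body"]

def xml_to_sqlite(xml_path="posts.xml", db_path="so.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS posts (%s)" % ", ".join(COLUMNS))
    insert = "INSERT INTO posts VALUES (%s)" % ", ".join("?" * len(COLUMNS))
    # iterparse streams the file instead of loading ~12GB into memory
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            conn.execute(insert, [elem.get(col) for col in COLUMNS])
            elem.clear()  # free each element once it has been processed
    conn.commit()
    conn.close()

xml_to_sqlite()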

Pronto, I ran the script and went away with happy thoughts of the good times I would have with the data. Well, I came back a couple of hours later to find that the conversion had failed. I initially thought the script was broken and went on a wild goose chase looking for other scripts and converters – I got one that required me to install PostgreSQL (I had never used this RDBMS before and it has some quirks of its own ;) ). Finally, I got that converter to play nice with PostgreSQL, only for it to break too – Aaaargh! Try picturing an exasperated me :P.

I backtracked and found out that the problem was actually due to poorly-formed XML – I wonder why they don’t make ’em parsers lenient? The dump contained some Unicode characters that are invalid in XML. I went back to SO and came across someone who had run into the same problem with an old dump of SO data! From that question, I got a Python script that could detect the occurrence of invalid characters and ran it on my dump – well, it detected quite a few. Next step? The elimination of those characters, of course!

Subsequently, I found another Python script to replace the unwanted Unicode (again? Pulling scripts off the internet? Well, I could write them myself, but why spend lots of hours on something when I could get it working in minutes – and yes, I understand how the code works; at least I think I do :) ). The script didn’t do exactly what I wanted, so I fixed the regular expression in it, tested it and then ran it on my ginormous 17Giga… a couple of minutes later and I was done. Phew! I loaded it up in sqlite3 (another hair-pulling experience – I found out the hard way that the old sqlite shell won’t load SQLite 3 databases…. yikes!).
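The cleanup boils down to removing every character that the XML 1.0 spec does not allow. Here is a small sketch of that idea (not the exact script or regex I used; the file names are made up):

import re

# Everything outside the XML 1.0 character ranges (tab, newline, carriage
# return, the printable BMP ranges and the supplementary planes) is removed.
INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_dump(src="posthistory.xml", dst="posthistory.clean.xml"):
    with open(src, "r", encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:  # line by line keeps memory usage flat
            fout.write(INVALID_XML.sub("", line))

clean_dump()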

Finally, I can run queries on my humongous pet – the largest table contains more than 23 million records and I have to pull out metrics and possibly generate images and graphs to really get an idea of what lies beneath. I pray I get this done; I really want to conquer this beast of data and I should be posting here about my weekly progress insha Allaah.

So, in case you need scripts to use with ginormous data, I have uploaded all three to my GitHub: here.

DISCLAIMER: I am not claiming to be the author of these scripts; I only made minor modifications to get them to work.

Do you know editors that can handle HUGE GINORMOUS databeasts (yes databeast, not dataset)? Please drop a comment.

Have fun!

What Endian are you? Big or little?


Have you read Gulliver’s Travels? If you have, then you must have read about endianness. The Lilliputians were divided into two factions over which end of an egg to break – the big end or the small end. Isn’t it interesting that such a petty and seemingly trivial issue caused both parties to become embroiled in clashes? Similarly, some flame wars in programming are over issues as ‘important’ as endianness.

(Image: an egg – what end should you break?)

OK, history lesson over; you now know the origin of endianness. Let’s get to the issue of endianness in computer science. Computers can read streams of bits and bytes just as we can read written text. However, interpretations vary; for example, I can’t read Greek or Spanish even though I can see the writing marks.

Let’s take the popular quote:

“There are 10 kinds of people, those that understand this and those that don’t.”

Someone who doesn’t know binary wouldn’t realize that the 10 in the quote means 2 in decimal. As such, 10 can be the decimal ten or the decimal two; in fact, it might also mean input/output, or the moon Io in yet another context. It’s difficult to interpret 10 unless you know what the author meant.

The same interpretation problem applies to computers; in the early days of computing, the size of a byte depended on the particular hardware architecture and there was no standard definition. The problem of endianness arises when lots of bytes are exchanged between computers. Local data is fine, as a computer understands its own locally-stored data; however, when you have to talk to another computer, how do you interpret the data you get?

Big-endian systems store the most significant byte of a multi-byte value in the very first byte location, while little-endian systems store the most significant byte in the very last byte location. This is somewhat similar to the writing/reading styles of languages: some are left-to-right while others are right-to-left; some are even top-to-bottom.

Here is how the 16-bit number 00000001 00000010 will be interpreted on both systems (a quick check in code follows the list):

  • To a little-endian system, the very first byte is the least significant, so the number is equivalent to (1 + 2*256) = 513.
  • A big-endian system, however, sees the first byte as the most significant, so the number is equivalent to (1*256 + 2) = 258.
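You can verify the arithmetic with Python’s struct module; ‘<’ and ‘>’ select little- and big-endian interpretations of the same two bytes:

import struct
import sys

data = bytes([0b00000001, 0b00000010])  # the two bytes 0x01, 0x02

print(struct.unpack("<H", data)[0])  # little-endian: 513 (1 + 2*256)
print(struct.unpack(">H", data)[0])  # big-endian:    258 (1*256 + 2)
print(sys.byteorder)                 # what your own machine uses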

Amazing, right? Some people feel that one endianness is better (just like the Lilliputians :) ) and have their reasons too. Both systems have their strengths and flaws.

Big-endian systems represent numbers the way we write them, and this makes debugging easier (in English, we all recognize thirteen as 13 and not 31). Also, you can check whether a number is positive or negative by looking at the leading byte.

Little-endian systems let you read the lowest-order byte directly (e.g. if you want to know whether a number is odd) and make it easier to write math routines in assembly language.

Another issue is the NUXI problem: say you want to store the four bytes labelled UNIX (U being the most significant byte) on machines that store numbers as 2-byte integers; the word UNIX is thus split into two chunks, UN and IX. A big-endian system stores them as UN and IX (the most significant byte comes first). A little-endian system stores UN as NU (the most significant byte comes last, remember?); similarly, IX is stored as XI, so its internal representation ends up as NUXI. Each computer understands its own internal representation perfectly; however, imagine a big-endian system storing UNIX on a little-endian computer – when it tries to retrieve its data, it’ll get NUXI. I wonder if computers get perplexed… :P

Fixes for the endianness problem include using a standard format across computers and using headers to describe the information format (yes, this wastes space, but you’ve got no choice ;( ).

So what endian are you?

Good to know:

  • Intel processors for PCs are little-endian, while the Motorola processors used in older Macs are big-endian.
  • Adobe Photoshop files and JPEG files are big-endian while bitmaps are little-endian.
  • The network order (order of transmission over networks) is big-endian.