Programming Principles #

Writing readable and correct scientific code is possible with a little bit of practice and a healthy mindset. While it does get easier as one learns more and more tricks to solve specific programming difficulties, the basics are very simple. There are four things we want to constantly pay attention to:

good names,
meticulous tests,
simple design,
and helpful documentation.

We usually achieve these things through iteration. We write a first version and make sure it has somewhat reasonable names but, most importantly, that it’s correct, tested, well-designed and fits nicely into the rest of the application. Once the big-picture issues are sorted, we read the new code and check if it satisfies a number of criteria, e.g. are the names on point, is there needless repetition, is there a mix of layers of abstraction, could it be simpler, is it documented, did we leave something for later, etc. While doing these checks we may notice that we’ve not covered an aspect by tests, so we improve the tests and then make sure the tests also satisfy the criteria, before going back to the new code to start the cycle over. It’s painstaking work, but unfortunately code that doesn’t receive the required amount of attention and care quickly becomes unnecessarily hard to read and maybe even incomprehensible. Once satisfied, we create and push a commit. Then, in the browser, we read the code again before merging the changes into the main branch.

Good Names #

Picking accurate, descriptive and appropriately long names is absolutely essential to writing good code. Picking good names is undeniably worth the effort. The dynamic here is that one bad choice inspires other consistently bad choices and the result is completely incomprehensible. Conversely, consistently good names make code read as if it were plain English with non-standard formatting.

It’s useful to think of names as having a certain reach or scope. Functions that are part of the public API of a library reach across projects, while local variables only affect a couple of lines of code.

Therefore, constructs with a large scope need descriptive, intuitive names. They’re often used far away from their definition and have little obvious context. They tend to not contain abbreviations and are one to three words long.

Public methods of a class are similar to the above but have immediate context. Therefore, it’s possible to use generic names, such as update, add, etc. if it’s clear from the context what’s being updated or added.

Finally, local variables have very short scope. Usually their definition is visible on screen; and often they’re used repeatedly. At this level scientific computing often implements mathematical formulas. Therefore, it can be correct to pick single letter variable names, e.g. x is a perfectly suitable name for the x-coordinate, while x_coord might be needlessly long. Additionally, x and y differ by 100% of their letters, while {x,y}_coord have most of their letters in common.
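As a small illustration of matching name length to scope, here's a sketch using a hypothetical `Point` struct and `euclidean_distance` function (neither comes from a real library). The function name is spelled out because it may be called far from its definition, while the local variables mirror the formula and stay short.

```cpp
#include <cmath>

// Hypothetical type, used for illustration only.
struct Point {
  double x;
  double y;
};

// Large scope: may be called far from here, so the name is spelled out.
double euclidean_distance(const Point& a, const Point& b) {
  // Tiny scope: the definitions are visible on screen and mirror the
  // formula sqrt(dx^2 + dy^2), so short names read best.
  double dx = b.x - a.x;
  double dy = b.y - a.y;
  return std::sqrt(dx * dx + dy * dy);
}
```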

What I find interesting about picking good names is how it draws attention to convoluted design. For example, a function that does two things (instead of just one) can’t have an accurate name without the use of and. The difficulty of naming something correctly might indicate either insufficient high-level understanding or a design flaw.
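A minimal sketch of this effect, with hypothetical function names chosen for illustration: the first function can only be named accurately with an "and", which hints that it should be split into the two functions that follow.

```cpp
#include <numeric>
#include <vector>

// Needs "and" in its name: it modifies `values` and also returns their sum.
double scale_and_sum(std::vector<double>& values, double factor) {
  for (double& v : values) {
    v *= factor;
  }
  return std::accumulate(values.begin(), values.end(), 0.0);
}

// After splitting, each function does one thing and the names need no "and".
void scale(std::vector<double>& values, double factor) {
  for (double& v : values) {
    v *= factor;
  }
}

double sum(const std::vector<double>& values) {
  return std::accumulate(values.begin(), values.end(), 0.0);
}
```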

Meticulous Testing #

It is essential to establish a very strong habit of testing every change made to the code. Personally, I know I’m incapable of writing more than two or three lines of code without a mistake, and there are plenty of single-line changes that are sufficiently error prone that I get them wrong frequently.

From personal experience I can say that adding lots of code quickly without thorough testing will lead to a “complete” implementation after very little time. What comes next is a very protracted phase of debugging. If unlucky, the early plots may look very promising and one continues to add complexity. Only much later does one notice that it’s complete nonsense. The ensuing phase of debugging is very frustrating, because one can’t know which parts are working. Therefore, there’s always a possibility that something obscure is happening inside some logically unrelated function, or that a particular function doesn’t return the expected values. This completely negates our ability to analyse the written code and systematically track down the bug (or at this stage a veritable swarm of bugs).

Fortunately, the above is largely avoidable. One component of it is to test meticulously. By writing tests I know a particular piece of code does exactly what I think it does. Then when implementing the next layer of abstraction I have an accurate mental picture of each involved function. I then test and debug the new functionality. This way we move from one working state to the next. While we may move slightly slower from phase to phase, we arrive at a reasonably bug-free application faster and more reliably.

This almost test-driven approach to developing is highly effective at almost entirely eliminating the ceaseless stream of senseless, petty bugs. Naturally, if there’s an error in my reasoning, testing might not be effective, because the buggy code written might pass the tests. In the rare cases where this happens, at least one can “debug” at the high-level without the persistent fear that it’s just an off-by-one that’s the true cause of the issue.

Personally, I envision bugs as creatures that are permitted to subvert the meaning of my code subject to one restriction: not make any changes that are incompatible with any tested behaviour. They take an obscene delight in pointing out all the obscure, pedantic ways in which I didn’t test that they can’t exist.

Generally the motto is: buggy until tested.
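To make this concrete, here's a minimal sketch of testing one small function before building on top of it. The function `forward_euler_step` and the test names are hypothetical, and plain `assert` stands in for whatever testing framework the project actually uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical function under test: one explicit Euler step,
// u1 = u0 + dt * dudt. Assumes `u0` and `dudt` have the same size.
std::vector<double> forward_euler_step(const std::vector<double>& u0,
                                       const std::vector<double>& dudt,
                                       double dt) {
  std::vector<double> u1(u0.size());
  for (std::size_t i = 0; i < u0.size(); ++i) {
    u1[i] = u0[i] + dt * dudt[i];
  }
  return u1;
}

// With dt = 0 the state must not change.
void test_zero_time_step_is_identity() {
  std::vector<double> u0{1.0, -2.0, 3.5};
  std::vector<double> dudt{10.0, 20.0, 30.0};
  assert(forward_euler_step(u0, dudt, 0.0) == u0);
}

// A hand-computed case. It checks the first and the last element too,
// which is where off-by-one errors tend to hide.
void test_known_values() {
  std::vector<double> u1 =
      forward_euler_step({1.0, 2.0, 3.0}, {2.0, 4.0, 6.0}, 0.5);
  assert(u1.size() == 3u);
  assert(std::abs(u1[0] - 2.0) < 1e-14);
  assert(std::abs(u1[1] - 4.0) < 1e-14);
  assert(std::abs(u1[2] - 6.0) < 1e-14);
}
```

Once both tests pass, the next layer can rely on `forward_euler_step` doing exactly what we think it does.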

Good Design #

This is probably the vaguest category and will improve most over time.

Simplicity is usually a good metric to start with. If it can be explained with what feels like a sensible number of sentences without ifs and buts, then chances are it’s simple.

There’s a set of anti-patterns that we want to avoid at all costs. Keep checking for those repeatedly.

Next, check if the rules for writing good functions and classes are followed. Personally, I found Clean Code helpful, maybe especially for scientific computing since the ideals described are completely unattainable.

For C++ the organization of code into files and directories matters. We have pairs of files: a .cpp file and an associated .hpp file with the same name. These files tend to be small, e.g. one class and its associated free functions. The reason we include this piece of pedantry here is that there are two ways this immediately works in your favour:

It massively reduces the chance of cyclic dependencies among headers.
By making #include <foo.hpp> the very first line in foo.cpp we automatically detect all defects related to include ordering, i.e. any missing #include in foo.hpp.

The physical organization of code is described at great length in Lakos, Large-Scale C++: Volume 1.

We’ve said nothing about software architecture and it certainly matters. Fortunately, it’s not as important in scientific computing: functions can carry us a long way and external dependencies are often without alternative. Therefore, we’ll probably benefit more by focussing on other things until those become second nature.

Documentation #

Lastly, we must write helpful documentation. We start by documenting how to build and run the code. Keep this up-to-date and check it regularly. Code that can’t be compiled is worthless and you’ll forget really quickly how to compile the code.

Next, we need some amount of developer documentation. This should include a precise write up of the algorithm. Likely, this is part of, or can be reused for, a method paper. The difficulty here is to have it sufficiently detailed to be of use to a developer joining the project or when discussing detailed algorithmic questions, while being vague enough to not get outdated quickly due to irrelevant changes in the codebase.

Then there’s API documentation. Fortunately, we can get this almost for free. Essentially, we document every function or class using comments formatted in a specific way that allows tools to extract these comments and render them as an HTML page. For C++ we use Doxygen. With customized CSS it looks reasonably modern. Personally, I quite like doxygen-awesome. For Python we’d use Sphinx.

When writing API documentation and comments in general, try to avoid describing a name by merely repeating the words in it. Instead, try to imagine what’s not obvious about the function. While it’s preferable to fix quirky behaviour, it can easily happen that a sensible assumption isn’t obvious from the function’s signature. These choices need to be documented. For example, the pixels in an image can be stored row-by-row or column-by-column, and start at either the lower or upper left corner. Even though there’s a “natural” choice, i.e. row-by-row starting at the lower left corner, either of these choices is fine; it’s only a problem if the choice is unclear or inconsistent throughout the codebase.
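As an illustration, here's a sketch of a Doxygen-style comment that states such a convention explicitly. The function `pixel_index` and its parameters are hypothetical; the layout it documents is the row-by-row, lower-left-origin choice mentioned above.

```cpp
#include <cassert>
#include <cstddef>

/// Returns the position of pixel (`row`, `col`) in the flat pixel array.
///
/// Pixels are stored row-by-row. Row 0 is the bottom row of the image
/// and column 0 is its left edge, i.e. index 0 is the lower left corner.
///
/// @param row    Row of the pixel, counted upwards from the bottom.
/// @param col    Column of the pixel, counted from the left.
/// @param n_cols Number of pixels per row.
/// @pre `col < n_cols`
std::size_t pixel_index(std::size_t row, std::size_t col, std::size_t n_cols) {
  assert(col < n_cols);
  return row * n_cols + col;
}
```

Note that the comment doesn't restate the name of the function; it records the one decision a caller can't infer from the signature.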

For comments specifically, we tend to avoid explaining the language itself (with the occasional exception) and we try to not literally translate the code to English. If the code needs a verbatim translation, then there’s likely an issue with poorly chosen names or mixed levels of abstraction. In a similar spirit don’t add comments that aren’t helpful.

What comments can be used for is providing relevant context or alerting the reader to important subtleties. As always with coding, be as clear as possible; cryptically short comments aren’t helpful. If there’s a difficulty that needs explaining, it’s very unlikely that a half-sentence will unambiguously explain the issue.

Examples #

Avoid Covering for Bad Names #

Probably, the time spent writing the documentation would have been better spent improving the variable names:

// Here `next` is an array of size `n` and stores the value at
// time `t+dt`, `a` is an array of size `n` storing the values
// at time `t`, `t` is the loop index over space, and `f8`
// stores `f(u)`.
next[t] = a[t] + b*f8[t];

While some of the documentation is helpful, much of it is not needed if better names were chosen, or should be part of the API documentation instead of a comment.
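For comparison, here's a sketch of the same update with names that carry the meaning themselves. The names `u_next`, `u`, `dudt`, `dt` and the wrapping function `advance` are assumptions made for illustration; in particular, we guess that `b` in the snippet above was the time step and that `f8` held the time derivative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper, for illustration only. We assume that `b` in
// the original snippet was the time step `dt`.
void advance(std::vector<double>& u_next, const std::vector<double>& u,
             const std::vector<double>& dudt, double dt) {
  assert(u_next.size() == u.size() && dudt.size() == u.size());
  for (std::size_t i = 0; i < u.size(); ++i) {
    u_next[i] = u[i] + dt * dudt[i];
  }
}
```

With these names, the only part of the original comment worth keeping is the array sizes, and those fit better in the API documentation of the function.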

Avoid Translating Code to English #

It’s usually possible to write code such that it’s obvious what sequence of commands it executes. Therefore, a literal translation to English isn’t helpful.

// Read the `i`th element of `u0` and `dudt`, then compute the
// sum `u0 + dt * dudt` and store it in `u1[i]`.
double a = u0[i];
double b = dudt[i];
b *= dt;
b += a;

u1[i] = b;

Redundant Comments #

The following is not helpful at all:

// Set `u1` correctly:
u1[i] = u0[i] + dt*dudt[i];

because first of all I can see u1 is being set and secondly, the author of the comment would hardly admit to setting u1 incorrectly.

The following comment likely doesn’t answer any question a reader might have about the code.

// Compute the sum: `u0[i] + dt*dudt[i]`.
u1[i] = u0[i] + dt*dudt[i];

Cryptically Short Comments #

After reading the following:

u1[i] = u0[i] + dt*dudt[i]; // danger

I only know that something might be going wrong, which, since it’s C++, isn’t news at all. Maybe they’re referring to i being out of bounds? Or do they mean that dt must be chosen small enough? Maybe it’s an outdated comment?

Adding Relevant Context #

What might be helpful is to provide some context:

// Advance in time using forward Euler:
u1[i] = u0[i] + dt*dudt[i];

because it may not be obvious to every reader that the specific formula written in the code is the explicit Euler method for integrating ODEs. Equipped with that knowledge they can easily read up on any missing background knowledge.
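For reference, for an ODE of the form du/dt = f(u), the forward Euler update reads as follows. The superscript n for the time level is our notation, and we assume that `dudt[i]` in the code holds f(u^n) at grid point i.

```latex
u^{n+1} = u^{n} + \Delta t \, f(u^{n})
```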

Summarizing Blocks of Code #

While literally translating the code isn’t helpful, it might be useful to write down the combined effect of several statements. Often numerous steps are needed to achieve a simple high-level goal. Unfortunately, sometimes the sequence of steps doesn’t obviously express the high-level goal. First make sure that there’s no opportunity for restructuring the code, but if there isn’t, you could choose to document the high-level goal.
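A sketch of what such a summary could look like. The function `advance_periodic` and its ghost-cell layout are hypothetical; the point is that the last comment states the combined effect of the two assignments below it instead of narrating them one by one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: `u` holds one ghost cell at each end, i.e. the
// interior cells are u[1], ..., u[n-2].
void advance_periodic(std::vector<double>& u, const std::vector<double>& dudt,
                      double dt) {
  assert(u.size() == dudt.size() && u.size() >= 3);
  const std::size_t n = u.size();

  for (std::size_t i = 1; i < n - 1; ++i) {
    u[i] += dt * dudt[i];
  }

  // Make the domain periodic: each ghost cell is a copy of the interior
  // cell at the opposite end.
  u[0] = u[n - 2];
  u[n - 1] = u[1];
}
```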

Commenting Ad Hoc Conventions #

In scientific computing we eventually end up with something non-trivial. After checking that it can’t be simplified, we might settle on making the unfortunate code involving 4 or 5 indices as readable as possible. This may involve picking some ad hoc naming conventions, say time indices are k, indices over the x-axis are i, etc. As a result everything is nice and consistent and easy to understand. Unfortunately, the next person to read the code might not pick up on the conventions used and as a result find it’s not consistent and not easy to read at all. Therefore, it might be useful to document these ad hoc naming conventions.
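A minimal sketch of such a comment, using only two indices to keep it short. The function `heat_step`, the storage layout `u[k][i]` and the choice of an explicit update for the 1D heat equation are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index conventions used in this file:
//   k : time level; `u[k]` is the solution at time `k * dt`
//   i : grid point along the x-axis
//
// Hypothetical example: explicit update for the 1D heat equation with
// r = alpha * dt / dx^2. The two boundary points are held fixed.
void heat_step(std::vector<std::vector<double>>& u, std::size_t k, double r) {
  assert(k + 1 < u.size());
  const std::size_t n_x = u[k].size();
  assert(n_x >= 2 && u[k + 1].size() == n_x);

  u[k + 1][0] = u[k][0];
  u[k + 1][n_x - 1] = u[k][n_x - 1];
  for (std::size_t i = 1; i + 1 < n_x; ++i) {
    u[k + 1][i] = u[k][i] + r * (u[k][i - 1] - 2.0 * u[k][i] + u[k][i + 1]);
  }
}
```

Stating the convention once at the top of the file is usually enough; repeating it at every loop would add noise.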