nomen est omen

The Romans used to say

“nomen est omen”

By this they meant that a person's behaviour or characteristics were influenced by the person's name.

Shakespeare, however, had Juliet say to Romeo:

“What's in a name? that which we call a rose
By any other name would smell as sweet”

This, of course, has the opposite meaning — no matter what you call a rose, it will still be a rose.

So who is right, and what does it have to do with programming?

The need for identifiers

In just about all programming languages, the data we deal with is stored in variables or represented by classes or constants, and all of these are typically given a label or name, more commonly referred to as an identifier. Methods and functions too use identifiers, as do their arguments. Also, sometimes we create aliases for certain identifiers.

On the one hand, humans follow the Romans' example — the meaning of a variable is derived from its identifier.

On the other hand, computers follow Shakespeare — no matter what identifier you use to name an object, the object is still the same object without changing its behaviour or characteristics.
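A small C sketch of the computer's point of view: the two functions below differ only in the identifiers chosen and compute exactly the same thing. Both names are invented purely for illustration.

```c
/* Illustrative only: both functions are hypothetical and differ
   solely in the identifiers chosen. */

/* Cryptic names: the compiler does not mind. */
int f(int a, int b)
{
    return a * b;
}

/* Descriptive names: same computation, easier for a human to read. */
int rectangleArea(int width, int height)
{
    return width * height;
}
```

To the compiler the two are interchangeable; only the human reader gains anything from the second form.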

So, if we want to have any chance for others to understand a bit of code we have written, we have to ignore Shakespeare and go about naming these identifiers in a meaningful way.

The bad old days

Some of my early programming involved writing machine code. I would sit down with a notepad, look up the hexadecimal values for the instructions, write them down on the pad, and add the numeric arguments to instructions.

If an instruction referred to a variable, I would keep track of the memory location that I had allocated to that variable on a separate sheet of paper. When I needed to reference that variable I would use the address value.

As you can imagine, this was very tedious. If you removed a variable, you would either waste the space it had previously used or have to assign a new address to every variable that had a higher address. This meant scanning your code for references to these variables and adjusting the instructions' arguments as appropriate.

Then along came 'languages' in which you could represent a variable with a name. This was revolutionary: you could now use a label to refer to the variable, and adding or deleting variables no longer affected the rest of your code, because the compiler or interpreter took care of the addresses for you.

Still, early languages had quite heavy restrictions on the length of variable names and the types of characters that could be used in them.

For example, in the BASIC that was prevalent when I started programming, variable names could have up to 8 characters. Actually, some interpreters allowed you to type more but would ignore any characters after the 8th. Whether you typed the identifier in upper or lower case did not matter; the interpreter would always change it to upper case for you.

These restrictions made it necessary to use truncations of the words that represented the meaning of a variable: while LASTNAME was just short enough, CLIENTADDRESS and CLIENTPOSTCODE were not, so you would truncate them to CLNTADDR and CLNTPOST or something similar. You would put in some comments to indicate what the identifiers represented and hope that anyone trying to decipher your code would read them.
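The behaviour of those interpreters can be sketched in C as follows. This is only an approximation of the rule described above (upper-casing, then keeping the first 8 characters); the function name normalizeIdentifier is made up for this example, and real interpreters differed in their details.

```c
#include <ctype.h>
#include <stddef.h>

#define SIGNIFICANT_CHARS 8  /* limit described in the text */

/* Writes the name as such an interpreter might have seen it:
   upper-cased and cut off after SIGNIFICANT_CHARS characters.
   `out` must have room for SIGNIFICANT_CHARS + 1 bytes. */
void normalizeIdentifier(const char *name, char *out)
{
    size_t i = 0;

    while (name[i] != '\0' && i < SIGNIFICANT_CHARS) {
        out[i] = (char)toupper((unsigned char)name[i]);
        i++;
    }
    out[i] = '\0';
}
```

Under this rule, clientAddress and ClientAdmin would both be reduced to CLIENTAD, so two names that look different to the programmer would refer to the same variable.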

Often there were groups of related variables that represented very similar objects, which resulted in the identifiers quite often being jumbled further.

Modern syntax

Most modern languages are a lot better: they allow us to have longer identifiers and to distinguish between upper and lower case. This allows us to make the identifiers much more meaningful.

The long and the short of it

To quote Shakespeare again, “brevity is the soul of wit”.

Suppose you have a simple loop that iterates over an array or collection of some type:

for (int contactIndex = 0; contactIndex < contacts.Length; contactIndex++)
{
    Contact contact = contacts[contactIndex];
    // do something with contact
}

Using a simple identifier i would have been perfectly reasonable:

for (int i = 0; i < contacts.Length; i++)
{
    Contact contact = contacts[i];
    // do something with contact
}

Any programmer worth their salt reading the above should know that i is traditionally used for iterating, and using the longer identifier does not really make the code more readable.

This also applies to nested loops. Because we cannot use the same i for both loops, most programmers use j for the second-level loop and k for the third. But using l for the fourth level is really asking for trouble, as l (letter L) looks too much like 1 (number 1).
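As a small illustration in C, here is a three-level loop using the conventional counters. The function sumGrid and the grid dimensions are invented for this sketch.

```c
#define LAYERS  2
#define ROWS    3
#define COLUMNS 4

/* Sums every cell of a three-dimensional grid using the
   conventional i, j, k loop counters. */
int sumGrid(int grid[LAYERS][ROWS][COLUMNS])
{
    int total = 0;

    for (int i = 0; i < LAYERS; i++)
        for (int j = 0; j < ROWS; j++)
            for (int k = 0; k < COLUMNS; k++)
                total += grid[i][j][k];

    return total;
}
```

The loops are short enough that the meaning of each counter stays in view, which is what makes the single letters acceptable here.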

Similarly, if you are dealing with x and y co-ordinates, it is perfectly acceptable to use the variables x and y to represent these, either as members of a structure or class or as variables in their own right.

But one can take this too far. Suppose that the loop contains a few dozen lines of code and that we need to refer to the i-th item in various places further down, so that the for statement may no longer be visible at the top of the screen when we examine the pertinent code. Here the meaning of i is lost, and using the longer contactIndex as the identifier would probably make more sense.

Too often one sees code where every identifier is reduced to a single letter and it becomes almost impossible to discern what is happening:

int d = 0;
float f = 0;
float q = 10;
//...

if (i < 10)
    if (!p)
        if (c != '.')
        {
            d = d * 10;
            d += c - '0';
        }
        else
            p = true;
    else
    {
        f += (c - '0') / q;
        q *= 10;
    }

//...

return d + f;

Reading this code and applying a little thinking, it is possible to figure out that some numerical parsing is taking place. But consider how much more readable it becomes if we use longer identifiers:

int whole = 0;
float fraction = 0;
float divisor = 10;
//...

if (i < 10)
    if (!hadDecimalPoint)
        if (character != '.')
        {
            whole = whole * 10;
            whole += character - '0';
        }
        else
            hadDecimalPoint = true;
    else
    {
        fraction += (character - '0') / divisor;
        divisor *= 10;
    }

//...

return whole + fraction;

Here a mere glance at the code tells us what is happening.
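For completeness, here is a self-contained C sketch of the same idea. The function name parseDecimal, the loop around the fragment, the name divisor and the assumption that the input contains only digits and at most one decimal point are all mine; the fragments above elide those details.

```c
#include <stdbool.h>

/* Parses an unsigned decimal such as "12.75". Assumes the input holds
   only digits and at most one '.'; there is no sign, exponent or
   error handling. */
double parseDecimal(const char *text)
{
    int whole = 0;
    double fraction = 0.0;
    double divisor = 10.0;
    bool hadDecimalPoint = false;

    for (const char *cursor = text; *cursor != '\0'; cursor++) {
        char character = *cursor;

        if (character == '.') {
            hadDecimalPoint = true;
        } else if (!hadDecimalPoint) {
            /* Still before the point: shift left and add the digit. */
            whole = whole * 10 + (character - '0');
        } else {
            /* After the point: each digit counts a tenth as much. */
            fraction += (character - '0') / divisor;
            divisor *= 10.0;
        }
    }

    return whole + fraction;
}
```

With names like these, the comments are almost redundant; the identifiers carry most of the explanation.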

The argument that is often put forward is that the first version saves us a lot of typing and is quicker to compile as the compiler has to read less code from disk and do less parsing.

But this ignores the fact that most modern Integrated Development Environments (IDEs) offer auto-completion (such as 'IntelliSense'), which takes care of the typing for us.

More importantly, it fails to appreciate that writing the original code is the smallest part of the effort in maintaining long-lived code. One also has to consider the time that it takes other programmers, or indeed the original programmer, to come to grips with the code when it is revisited at a later stage. Depending on the lifetime of the project, this can be a significant cost.

As for helping the compiler along, this is really putting the cart before the horse. Remember that the compiler is there to help you and not the other way round. Besides, the tiny amount of extra time it takes to process the additional characters costs far less, and is far less noticeable, than the time programmers spend trying to decipher the code.

Name decoration

Identifiers, as previously described, are used for a multitude of purposes.

When trying to read and understand some unfamiliar code, it is a great help if you can tell just by looking at an identifier whether it represents a class, structure, property, field or variable.

This can be achieved by decorating the identifiers in certain ways. For example, you might decide that:

All classes, structs and enums start with a capital letter (ex: MyClass),
All properties of a class start with a capital letter (ex: IsReadOnly),
All fields start with a lower-case letter (ex: numItems),
All private fields are prefixed with an underscore '_' character (ex: _quoteResults).

Consider this code:

public object EvaluateExpression(Expression expression, ReturnAs returnAs)
{
    var result = new List<object>();

    object localResult = null;

    if (expression.IsBoolean)
    {
        // ... do something
    }
    else
    {
        int expressionResult = 0;

        foreach(string word in expression.Words)
        {
            // ... evaluate each word in turn;
        }
    }

    if (localResult != null)
    {
        switch(returnAs)
        {
            case ReturnAs.String:
                localResult = localResult.ToString();

                if (_quoteResults)
                    localResult = "\"" + ((string)localResult).Replace("\"", "'") + "\"";
                break;

            case ReturnAs.Object:
                break;
        }

        result.Add(localResult);
    }

    foreach(var subExpr in expression.SubExpressions)
        result.Add(EvaluateExpression(subExpr, returnAs));

    return result;
}

Using the rules outlined above it becomes apparent that:

IsBoolean, Words and SubExpressions are properties of the Expression class.
_quoteResults is a private field of the current object that implements the EvaluateExpression method.
Any identifier that starts with a lower-case character and has no prefix, such as result, localResult, expression and returnAs, represents a local variable or a parameter.

So we know immediately at what level the objects represented by the identifiers are defined, and we can therefore get a much better feel for the structure of the code without actually having to immerse ourselves in it.

This will only work if we are consistent in our application of these conventions and follow them to a T every time, lest we confuse the reader of the code.

Hungarian notation

Some languages are weakly typed. That is, they fundamentally have only one or two data types, and other concepts have to be represented in some other way.

For example, early versions of C knew only a handful of data types: short, int, long, char, float, double and pointers. There was no boolean type. BCPL recognized only the machine word.

Thus a boolean became an integer, with 'True' defined as 1 and 'False' as 0 (in BASIC, 'False' was 0 and 'True' was -1). Since booleans are really integers at heart, the compiler cannot check for inconsistent usage. In other words, if we have a function such as int comparestrings(string1, string2, maxLength, caseSensitive), where maxLength is an integer and caseSensitive is a boolean, the following code, which passes the last two arguments in the wrong order, would compile but would most probably produce unexpected results:

int cased = TRUE;
int length = 10;

int result = comparestrings(str1, str2, cased, length);

To help us overcome this, each variable is prefixed with a letter, or series of letters. This combination of letters tells us what type of entity is represented by the variable. This type of name decoration is known as Hungarian Notation.

Frequently used notations are:

'n' - number,
'b' - boolean,
'sz' - zero-terminated string,
'p' - pointer (pUser - pointer to a user object, ppUser - pointer to a pointer to a user object),
'w' - word (16 bit),
'dw' - double word,
'i' - integer,
'fp' - floating point (single precision), and
'dp' - double precision.

Combinations of these might be used: pdwFileMode might stand for pointer to a double word.
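A brief C sketch of such a combined prefix in use. The helper setReadOnly, the flag value and the choice of a 32-bit type for a double word are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper: the prefixes say that pdwFileMode is a
   pointer (p) to a double word (dw), here taken to be 32 bits. */
void setReadOnly(uint32_t *pdwFileMode)
{
    const uint32_t dwReadOnlyFlag = 0x1u;  /* assumed flag value */

    *pdwFileMode |= dwReadOnlyFlag;
}
```

A caller holding a variable dwFileMode would pass &dwFileMode, and the matching prefixes make it easy to see that the pointer and the value belong together.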

Thus the above code can be re-written as:

int bCase = TRUE;
int nLength = 10;

int nResult = comparestrings(szStr1, szStr2, nLength, bCase);

By consistently using this kind of convention programmers can convey not just the purpose of a variable but also its type.
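To make this concrete, here is one possible C implementation of comparestrings written with these prefixes. The text only names the function and its parameters; the body, the return convention (0 for a match) and the 'ch' prefix for a character are my own assumptions.

```c
#include <ctype.h>

/* Hypothetical implementation of the comparestrings function named in
   the text. Returns 0 when the first nMaxLength characters match,
   otherwise -1 or 1. */
int comparestrings(const char *szLeft, const char *szRight,
                   int nMaxLength, int bCaseSensitive)
{
    for (int i = 0; i < nMaxLength; i++) {
        unsigned char chLeft  = (unsigned char)szLeft[i];
        unsigned char chRight = (unsigned char)szRight[i];

        if (!bCaseSensitive) {
            chLeft  = (unsigned char)tolower(chLeft);
            chRight = (unsigned char)tolower(chRight);
        }
        if (chLeft != chRight)
            return chLeft < chRight ? -1 : 1;
        if (chLeft == '\0')
            break;  /* both strings ended together */
    }
    return 0;
}
```

With this sketch, comparestrings("Hello", "HELLO", 10, 1) reports a difference, whereas the swapped call comparestrings("Hello", "HELLO", 1, 10) compiles just as happily and reports a match, because only the first character is compared. The prefixes do not stop the compiler from accepting the mistake, but a call that passes bCase where nLength is expected looks wrong to the human reader.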

The use of Hungarian Notation is of diminishing value in strongly typed languages and with clever IDEs, where popup tooltips tell us the type of a variable when we hover the mouse over it. But some programmers schooled in the old ways persist in using this notation to this day.

Modifying code that was written with this notation requires us to follow it.

CamelHump or under_score_naming

Since the primary purpose of using identifiers is to make the code human readable, we have to find a way of naming identifiers that sometimes encapsulate more than one idea.

For example, we might have a variable that represents 'the index of the contact'. Since most computer languages do not allow the use of white space characters as part of identifiers, we must come up with another way of conveying the same information. So we might call the variable theindexofthecontact.

But humans are not very good at parsing a long string like that. So we might shorten it to contactindex. This looks better, but because we are used to seeing white space between words we still find it difficult to discern the two words 'contact' and 'index'.

Two common schemes to overcome this human weakness are either to use the '_' underscore character in place of white space (contact_index) or to use lower-case characters but switch to upper case at the start of each new word (contactIndex). This latter convention is known as camelHump, or more commonly camelCase.
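The two schemes carry the same information, which a short C sketch can show by converting one into the other mechanically. The function name underscoreToCamel is invented for this example, and it assumes a plain name with single underscores between words.

```c
#include <ctype.h>

/* Rewrites an underscore-separated name in the capitalised style:
   "contact_index" becomes "contactIndex". `out` must be at least as
   large as the input, including the terminating '\0'. */
void underscoreToCamel(const char *name, char *out)
{
    int capitaliseNext = 0;

    for (; *name != '\0'; name++) {
        if (*name == '_') {
            capitaliseNext = 1;      /* drop the underscore itself */
        } else if (capitaliseNext) {
            *out++ = (char)toupper((unsigned char)*name);
            capitaliseNext = 0;
        } else {
            *out++ = *name;
        }
    }
    *out = '\0';
}
```

Since nothing is lost in the conversion, the choice between the two is purely a matter of what reads best to the people working on the code.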

Again, consistency is the key word here. Whether we choose the underscore or the camelHump is irrelevant, as long as we use the same convention throughout a project.

And when modifying existing code, we must follow the conventions used in that project.