C/C++ Preprocessor Macros

Paul J. Lucas - Dec 31 '23 - Dev Community

Introduction

The C Preprocessor originally was a stand-alone program that the C compiler called to “preprocess” source files before compiling them — hence the name. The reason C has a preprocessor unlike most other languages is due to the use of preprocessors in general at Bell Labs at the time, such as M4 and the troff suite of preprocessors.

Modern C and C++ compilers have the preprocessor integrated, though there are often options to control it specifically. For gcc and clang at least, the -E option causes a file only to be preprocessed, which can often be illuminating as to what the preprocessor is doing.

Preprocessing includes:

The first four are fairly straightforward; macro expansion, however, is the most complicated — and weird. It’s its own mini-language inside C and C++.

Preliminaries

Unlike either C or C++, the preprocessor’s language is line-based, that is a preprocessor directive begins with # (that must be the first token on a line) and ends with end-of-line (on Unix systems, the newline character, ASCII decimal 10, written as \n) — unless escaped via \ in which case the directive ends with the first unescaped newline.

Following the # may be zero or more whitespace characters followed by the directive name (define, if, include, etc.), hence all the following are equivalent:

#ifndef NDEBUG
#   ifndef NDEBUG
    #ifndef NDEBUG

Some people, myself included, prefer the style where the # is always in the first column to make preprocessor lines stand out since an indented # is harder to notice.

Macros

There are two kinds of macros:

  1. Object-like.
  2. Function-like.

By convention, macro names are generally written in all UPPER CASE to draw attention to the fact that a name is a macro and the normal C/C++ rules do not apply.

Object-Like Macros

Object-like macros are the simplest: they replace a macro name with zero or more tokens comprising the replacement list. For example:

#define WS  " \n\t\r\f\v"

Why would you want an object-like macro defined with zero tokens? Just to indicate that it is defined at all for use with #ifdef, #ifndef, or defined(). The actual definition doesn’t matter.

Another use is when a macro expands to different tokens depending on the platform via a sequence of #ifdefs. It’s sometimes the case that a particular platform either doesn’t support (or doesn’t need) whatever you’re trying to do; hence, the macro just expands to nothing.
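
As a minimal sketch of such a platform-dependent macro, the following assumes a hypothetical DEMO_NODISCARD macro (both that name and the demo_parse_digit() helper are made up for illustration). It expands to a gcc/clang attribute where that is available and to nothing elsewhere:

```c
#include <assert.h>

/* DEMO_NODISCARD is a hypothetical name used only for illustration. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_NODISCARD  __attribute__((warn_unused_result))
#else
#  define DEMO_NODISCARD  /* nothing: assume no equivalent attribute */
#endif

/* Converts a decimal digit character to its value, or -1 if not a digit. */
DEMO_NODISCARD
static int demo_parse_digit( char c ) {
  return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Either way the macro expands, the function itself behaves the same. */
static void demo_nodiscard( void ) {
  assert( demo_parse_digit( '7' ) == 7 );
  assert( demo_parse_digit( 'x' ) == -1 );
}
```

The code that uses DEMO_NODISCARD never needs to know which branch of the #if was taken.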

Object-like macros are used for:

  • Program-wide definitions to control conditional compilation via #if, #ifdef, etc.
  • Program-wide constants (such as the above example).
  • Include guards.

These days, constexpr in C23 can largely replace using object-like macros for program-wide constants; the same is true for constexpr in C++11.

Include guards are still necessary to prevent the declarations within a file from being seen by the compiler proper more than once:

// c_ast.h
#ifndef CDECL_C_AST_H
#define CDECL_C_AST_H
// ...
#endif /* CDECL_C_AST_H */

One convention for naming macros for include guards is the file name where:

  1. It is prefixed by the program name (to help ensure a unique name); and:
  2. All letters are converted to upper case; and:
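  3. All non-identifier characters are converted to an underscore, except that a name should never start with _ nor contain __ (double underscore): a leading underscore followed by an upper-case letter or another underscore is reserved in both C and C++, and C++ additionally reserves any name containing a double underscore.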

You might sometimes see #pragma once (a non-standard, but widely supported directive) in header files as a replacement for include guards. However, some compilers already recognize the include-guard pattern and optimize for it, so there’s really no reason to use #pragma once.

Function-Like Macros

Function-like macros can take zero or more parameters. For example, a common macro to get the number of elements of a statically allocated array might be:

#define ARRAY_SIZE(ARRAY) \
  (sizeof(ARRAY) / sizeof(0[ARRAY]))

Yes, the syntax of 0[ARRAY] is legal. It’s a consequence of the quirky interplay between arrays and pointers in C. Briefly, the a[i] syntax to access the ith element of an array a is just syntactic sugar for *(a+i). Since addition is commutative, *(a+i) can be alternatively written as *(i+a); that in turn can be written as i[a]. In C, this has no practical use.

So why use it here? In C++, using 0[ARRAY] means that trying to use ARRAY_SIZE on an object of a class that overloads operator[] causes a compilation error, which is what you’d want.

Unlike either a C or C++ function, when defining a function-like macro, the ( that follows the macro’s name must be adjacent, i.e., not have any whitespace between them. (If there is whitespace between them, then it’s an object-like macro where the first character of the replacement list is (.)
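
A short usage sketch of ARRAY_SIZE follows. The demo_primes array and the demo_ functions are hypothetical names added only for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(ARRAY) \
  (sizeof(ARRAY) / sizeof(0[ARRAY]))

static int const demo_primes[] = { 2, 3, 5, 7, 11 };

/* Sums the array without hard-coding its length. */
static int demo_sum_primes( void ) {
  int sum = 0;
  for ( size_t i = 0; i < ARRAY_SIZE( demo_primes ); ++i )
    sum += demo_primes[i];
  return sum;
}

static void demo_array_size( void ) {
  assert( ARRAY_SIZE( demo_primes ) == 5 );
  assert( demo_sum_primes() == 28 );
}
```

Note that ARRAY_SIZE only works on a true array. Given a pointer (including an array that was passed as a function parameter), it silently computes the wrong value.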

Properly Using Parameters

Macro parameters, when used within the replacement list, invariably should be enclosed within parentheses. Additionally, if the replacement list is an expression, all of it should be enclosed within parentheses as well. For example:

#define MAX(X,Y)      ( (X) > (Y) ? (X) : (Y) )

Why? Because substitution can result in incorrect operator precedence. Suppose MAX were instead defined like:

#define BAD_MAX(X,Y)  X > Y ? X : Y

then called like:

int m = BAD_MAX( n & 0xFF, 8 );

The problem is that the precedence of the operators, from highest to lowest, is >, &, ?:, so it would be as if:

int m = (n & (0xFF > 8)) ? (n & 0xFF) : 8;

that very likely isn’t what you want.
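
To see the difference concretely, here is a small sketch that calls both macros with the same argument. The demo_ functions are hypothetical wrappers added for illustration, and a compiler may warn about the missing parentheses in the BAD_MAX expansion:

```c
#include <assert.h>

#define MAX(X,Y)      ( (X) > (Y) ? (X) : (Y) )
#define BAD_MAX(X,Y)  X > Y ? X : Y

static int demo_good_max( int n ) { return MAX( n & 0xFF, 8 ); }
static int demo_bad_max( int n )  { return BAD_MAX( n & 0xFF, 8 ); }

static void demo_precedence( void ) {
  /* (0x10 & 0xFF) is 16, which is greater than 8. */
  assert( demo_good_max( 0x10 ) == 16 );

  /* The condition is parsed as 0x10 & (0xFF > 8), i.e., 0x10 & 1, */
  /* which is 0, so the "else" value of 8 is chosen instead.       */
  assert( demo_bad_max( 0x10 ) == 8 );
}
```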

Another problem with parameters is that arguments having side effects can have those side effects occur multiple times. For example:

int m = MAX( n, ++i );

expands into:

int m = ( (n) > (++i) ? (n) : (++i) );

So if n is not greater than the incremented value of i, then ++i is evaluated a second time and i is incremented twice! Historically, there’s been no way to fix this other than simply never passing expressions having side effects as macro arguments.

This is a great example of why the convention is that macro names are generally written in all UPPER CASE to draw attention to the fact that a name is a macro and the normal C/C++ rules do not apply.

The solution really is to use an inline function instead of a macro.
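
A minimal sketch comparing the two approaches, where max_int() is a hypothetical inline function that handles only int arguments:

```c
#include <assert.h>

#define MAX(X,Y)  ( (X) > (Y) ? (X) : (Y) )

/* Hypothetical inline alternative for int arguments only. */
static inline int max_int( int x, int y ) {
  return x > y ? x : y;
}

static void demo_side_effects( void ) {
  int n = 0, i = 0;

  int m = MAX( n, ++i );      /* ++i is evaluated twice */
  assert( i == 2 && m == 2 );

  i = 0;
  m = max_int( n, ++i );      /* ++i is evaluated exactly once */
  assert( i == 1 && m == 1 );
}
```

Function arguments are evaluated once before the call, so the side effect happens once no matter how many times the parameter is used in the body.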

Arguments Can Be Anything

A macro argument can be any sequence of tokens, not just identifiers (like n) or valid expressions (like ++i), for example:

#define A_OR_B(A,OP,B)  ( (A) OP (B) ? (A) : (B) )
#define MAX(X,Y)        A_OR_B( (X), >, (Y) )

Above, the > token is a perfectly valid preprocessor argument even though it would be a syntax error as an ordinary C function argument. Remember: the normal C/C++ rules do not apply to the preprocessor.

Macros employing such “anything goes” arguments should generally be avoided, but they are used for “stringification” (see below).
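
For illustration only, here is how the A_OR_B macro above behaves when an operator token is passed as an argument. The MIN macro is a hypothetical companion added for this sketch:

```c
#include <assert.h>

#define A_OR_B(A,OP,B)  ( (A) OP (B) ? (A) : (B) )
#define MAX(X,Y)        A_OR_B( (X), >, (Y) )
#define MIN(X,Y)        A_OR_B( (X), <, (Y) )  /* hypothetical companion */

static void demo_a_or_b( void ) {
  assert( MAX( 3, 7 ) == 7 );   /* the > token was passed as OP */
  assert( MIN( 3, 7 ) == 3 );   /* the < token was passed as OP */
}
```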

Variable Numbers of Arguments

Starting with C99, function-like macros can take a variable number of arguments (including zero). Such a macro is known as variadic. To declare such a macro, the last (or only) parameter is ... (an ellipsis).

For example, given a fatal_error() variadic function that prints an error message formatted in a printf()-like way and exit()s with a status code, you might want to wrap that with an INTERNAL_ERROR() macro that includes the file and line whence the error came:

_Noreturn void fatal_error( int status,
                            char const *format, ... );

#define INTERNAL_ERROR(FORMAT,...)    \
  fatal_error( EX_SOFTWARE,           \
    "%s:%d: internal error: " FORMAT, \
    __FILE__, __LINE__, __VA_ARGS__   \
  )

It’s common practice both to break up long replacement lists into multiple lines and to align the \s for readability.

Also, when a macro expands into a C or C++ statement, the replacement list must not end with a ; — it will be provided by the caller of the macro.

The __VA_ARGS__ token expands into everything that was passed for the second (in this case) and subsequent arguments (if any) and the commas that separate them.

In C and C++, adjacent string literals are concatenated into a single string which is how the “internal error” part gets prepended to FORMAT. (This is an example of a rare instance where a macro parameter is intentionally not enclosed within parentheses.)

A keen observer might notice a possible problem with the INTERNAL_ERROR macro, specifically this line:

    __FILE__, __LINE__, __VA_ARGS__   \

What if the macro were called like:

INTERNAL_ERROR( "oops" );

passing zero additional arguments? Then __VA_ARGS__ would expand into nothing and the , after __LINE__ would cause a syntax error since C functions don’t accept blank arguments. What’s needed is a way to include the , in the expansion only if __VA_ARGS__ is not empty.

Starting in C23 and C++20, that’s precisely what __VA_OPT__ does. You can rewrite the macro using it:

#define INTERNAL_ERROR(FORMAT,...)    \
  fatal_error( EX_SOFTWARE,           \
    "%s:%d: internal error: " FORMAT, \
    __FILE__, __LINE__                \
    __VA_OPT__(,) __VA_ARGS__         \
  )

Tokens to include in the expansion are enclosed with parentheses immediately following __VA_OPT__.
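
To try __VA_OPT__ without exiting the program, here is a sketch using a hypothetical DEMO_FORMAT macro that formats into a buffer instead of calling fatal_error(). It assumes a compiler that supports __VA_OPT__ (C23, C++20, or a compiler offering it as an extension), and BUF must be a true array so that sizeof works:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* DEMO_FORMAT is hypothetical; BUF must be an array, not a pointer. */
#define DEMO_FORMAT(BUF,FORMAT,...)   \
  snprintf( (BUF), sizeof(BUF),       \
    "demo: " FORMAT                   \
    __VA_OPT__(,) __VA_ARGS__         \
  )

static void demo_va_opt( void ) {
  char buf[64];

  DEMO_FORMAT( buf, "oops" );         /* no extra arguments: no stray comma */
  assert( strcmp( buf, "demo: oops" ) == 0 );

  DEMO_FORMAT( buf, "n = %d", 42 );   /* one extra argument: comma included */
  assert( strcmp( buf, "demo: n = 42" ) == 0 );
}
```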

Properly Using Multiple Statements

Suppose there weren’t a fatal_error() function, so we’d need to call fprintf() and exit() directly instead. You might define INTERNAL_ERROR() like:

#define INTERNAL_ERROR(FORMAT, ...)    \
  fprintf( stderr,                     \
    "%s:%d: internal error: " FORMAT,  \
    __FILE__, __LINE__                 \
    __VA_OPT__(,) __VA_ARGS__          \
  );                                   \
  exit( EX_SOFTWARE )

While such a definition will work in many cases, it will fail in others, specifically when following either an if or else. The problem is that a use like this:

if ( n < 0 )
  INTERNAL_ERROR( "n = %d", n );

will expand into:

if ( n < 0 )
  fprintf( stderr,
    "%s:%d: internal error: " "n = %d",
    "bad.c", 42
    , n
  );
  exit( EX_SOFTWARE );

Since the user didn’t use {} for the if, only the fprintf() is executed conditionally and the exit() is always executed because it’s a separate statement.

The common way to fix this is by always enclosing multiple statements in a replacement list within a do ... while loop:

#define INTERNAL_ERROR(FORMAT, ...)      \
  do {                                   \
    fprintf( stderr,                     \
      "%s:%d: internal error: " FORMAT,  \
      __FILE__, __LINE__                 \
      __VA_OPT__(,) __VA_ARGS__          \
    );                                   \
    exit( EX_SOFTWARE );                 \
  } while (0)

Such a loop will keep multiple statements grouped together as a compound statement and execute exactly once. (The compiler will optimize away the check for 0.)
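
The same problem and fix can be sketched with a smaller pair of hypothetical macros, DEMO_BAD_RESET and DEMO_RESET, that don't exit the program and so are easier to experiment with:

```c
#include <assert.h>

/* Both macros are hypothetical and exist only for this illustration. */
#define DEMO_BAD_RESET(X,Y)  (X) = 0; (Y) = 0
#define DEMO_RESET(X,Y)      do { (X) = 0; (Y) = 0; } while (0)

static int demo_bad_reset( int cond ) {
  int a = 1, b = 1;
  if ( cond )
    DEMO_BAD_RESET( a, b );   /* only "a = 0" is conditional */
  return a + b;
}

static int demo_reset( int cond ) {
  int a = 1, b = 1;
  if ( cond )
    DEMO_RESET( a, b );       /* both assignments are conditional */
  else
    a = 5;                    /* the else still pairs with the if */
  return a + b;
}

static void demo_do_while( void ) {
  assert( demo_bad_reset( 0 ) == 1 );  /* b was reset although cond is 0 */
  assert( demo_reset( 0 ) == 6 );
  assert( demo_reset( 1 ) == 0 );
}
```

Had DEMO_RESET used plain braces instead of do ... while (0), the ; after the macro call would end the if statement and the else would become a syntax error.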

Stringification

The C preprocessor actually has two of its own operators. The first is (confusingly) # that “stringifies” its single argument:

#define STRINGIFY(X)  # X

STRINGIFY(a)          // results in: "a"

Specifically, # followed by a parameter name stringifies the set of tokens comprising the corresponding argument for that parameter.

Some of the weird things about the C preprocessor include:

  • More than one token can comprise an argument.
  • An argument’s leading and trailing whitespace is eliminated.
  • Intervening whitespace between an argument’s tokens (if it has more than one) is collapsed to a single space.

For example:

STRINGIFY(a b)        // "a b"
STRINGIFY( a b )      // "a b"
STRINGIFY(a   b)      // "a b"
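
These whitespace rules can be checked at run time by comparing the resulting string literals. This is only a sketch, and demo_stringify() is a hypothetical name:

```c
#include <assert.h>
#include <string.h>

#define STRINGIFY(X)  # X

static void demo_stringify( void ) {
  assert( strcmp( STRINGIFY(a b),    "a b" ) == 0 );
  assert( strcmp( STRINGIFY( a b ),  "a b" ) == 0 );  /* trimmed */
  assert( strcmp( STRINGIFY(a   b),  "a b" ) == 0 );  /* collapsed */
  assert( strcmp( STRINGIFY(p != 0), "p != 0" ) == 0 );
}
```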

One place where stringification is typically used is with error-reporting macros to include the textual representation of a condition that was violated. For example, the assert() macro is implemented something like:

#define assert(EXPR) ((EXPR) ? (void)0 : \
  __assert( __func__, __FILE__, __LINE__, #EXPR ))

assert( p != NULL );

Hence the text of the expression p != NULL is included in the error message as a string.
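
Since a real assert() aborts the program, here is a gentler sketch of the same idea. The hypothetical DEMO_CHECK macro yields NULL when the condition holds and otherwise the text of the condition:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: NULL if EXPR holds, else EXPR's source text. */
#define DEMO_CHECK(EXPR)  ( (EXPR) ? NULL : #EXPR )

static char const* demo_check_positive( int n ) {
  return DEMO_CHECK( n > 0 );
}

static void demo_check( void ) {
  assert( demo_check_positive( 1 ) == NULL );
  assert( strcmp( demo_check_positive( -1 ), "n > 0" ) == 0 );
}
```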

Concatenation

The second preprocessor operator is (confusingly, again) ## that concatenates (or pastes) its two arguments together:

#define PASTE(A,B)    A ## B

PASTE(foo, bar)       // results in: foobar

Specifically, ## between two parameter names concatenates the set of tokens comprising the corresponding arguments for those parameters. Additionally, there can be multiple ## in a row:

#define PASTE3(A,B,C) A ## B ## C

Another weird thing about the preprocessor is:

  • Arguments can be omitted resulting in empty arguments.

For example:

PASTE(,)              // (nothing)
PASTE(foo,)           // foo
PASTE(,bar)           // bar

This is even true for one argument:

STRINGIFY()           // ""

# and ## Argument Pitfalls

One curious (and often annoying) thing about arguments for either # or ## is that neither expands its arguments even when they’re themselves macros. For example, the previous definition of PASTE() can be insufficient:

PASTE(var_, __LINE__) // var___LINE__

What you want is a result like var_42, that is the prefix var_ followed by the current line number. (Such names are typically used to help ensure unique names.) The problem is that, while PASTE expands its parameter B into the argument __LINE__, if the argument is itself a macro that ordinarily would expand (in this case, to the current line number), it won’t be expanded.

To fix it (as with many other problems in software) requires an extra level of indirection:

#define PASTE_HELPER(A,B)  A ## B
#define PASTE(A,B)         PASTE_HELPER(A,B)

PASTE(var_, __LINE__)      // var_42

This fixes the problem because __LINE__ will be expanded by PASTE (because it’s not an argument of ##) and then the result of that expansion (the current line number, here, 42) will be passed to PASTE_HELPER that will just concatenate it as-is. The same indirection fix can be used with # when necessary as well.
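
Here is a sketch of the indirection for both operators. Because __LINE__ changes from line to line, it uses a hypothetical DEMO_ID macro with a fixed value in its place, and the two variables exist only so that each pasted name refers to something:

```c
#include <assert.h>
#include <string.h>

#define DEMO_ID              42   /* hypothetical stand-in for __LINE__ */

#define PASTE_HELPER(A,B)    A ## B
#define PASTE(A,B)           PASTE_HELPER(A,B)

#define STRINGIFY_HELPER(X)  #X
#define STRINGIFY(X)         STRINGIFY_HELPER(X)

static int const var_42      = 1;
static int const var_DEMO_ID = 2;

static void demo_indirection( void ) {
  /* With indirection, DEMO_ID is expanded before pasting: var_42. */
  assert( PASTE( var_, DEMO_ID ) == 1 );

  /* Without it, the name is pasted as-is: var_DEMO_ID. */
  assert( PASTE_HELPER( var_, DEMO_ID ) == 2 );

  /* The same holds for stringification. */
  assert( strcmp( STRINGIFY( DEMO_ID ), "42" ) == 0 );
  assert( strcmp( STRINGIFY_HELPER( DEMO_ID ), "DEMO_ID" ) == 0 );
}
```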

# and ## Argument Rationale

So why does the preprocessor work this way, i.e., why do # and ## not expand arguments that are macros? Because if it did expand them, then it would be impossible either to quote or paste arguments as-is.

For example, consider a macro that defines the C language version where __LINE__ starts being supported:

#define LANG___LINE__        199409L

and a function lang_is() that checks whether the current language is equal to or later than its argument:

if ( lang_is( LANG___LINE__ ) )
  // ...

Rather than having to type both lang and LANG which is redundant, you define a convenience macro like:

#define LANG_IS(LANG_MACRO)  lang_is( LANG_ ## LANG_MACRO )

and can now instead write:

if ( LANG_IS( __LINE__ ) )
  // ...

In this case, you want LANG_IS() to paste LANG_ and __LINE__ together to get LANG___LINE__ which is exactly what happens. If ## expanded its arguments, you’d instead get something like LANG_42 which is not what you want in this case.
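
Putting those pieces together gives the following sketch. The body of lang_is() is an assumption made for illustration (it pretends the current language version is always 201112L), and LANG_FUTURE is a hypothetical extra macro:

```c
#include <assert.h>

#define LANG___LINE__        199409L
#define LANG_FUTURE          999999L  /* hypothetical, for contrast */

#define LANG_IS(LANG_MACRO)  lang_is( LANG_ ## LANG_MACRO )

/* Assumed implementation: the "current" version is fixed at 201112L. */
static int lang_is( long min_version ) {
  long const demo_current_version = 201112L;
  return demo_current_version >= min_version;
}

static void demo_lang_is( void ) {
  /* __LINE__ is pasted as-is, giving lang_is( LANG___LINE__ ). */
  assert( LANG_IS( __LINE__ ) );
  assert( !LANG_IS( FUTURE ) );
}
```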

Not Expanding a Macro

There are a couple more weird things about the C preprocessor, specifically that a macro will not expand if either:

  • It references itself (either directly or indirectly); or:
  • A use of a function-like macro is not followed by (.

An example of a self-referential macro is:

#define nullptr nullptr

What use is that? It makes a name available to the preprocessor that can be tested via #ifdef.
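
As a sketch of that technique, the following makes an enum constant visible to #ifdef. All the names (demo_feature, DEMO_HAS_FEATURE) are hypothetical:

```c
#include <assert.h>

enum { demo_feature = 7 };          /* ordinary name: invisible to #ifdef */
#define demo_feature demo_feature   /* self-referential: expands to itself */

#ifdef demo_feature
#  define DEMO_HAS_FEATURE 1
#else
#  define DEMO_HAS_FEATURE 0
#endif

static void demo_self_reference( void ) {
  assert( DEMO_HAS_FEATURE == 1 );  /* the preprocessor saw the macro */
  assert( demo_feature == 7 );      /* still means the enum constant */
}
```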

What about a use of a function-like macro that is not followed by (? There are no simple examples I’m aware of so a complicated example is a story for another time.

Paste Avoidance

As mentioned initially, the preprocessor tokenizes the input. New (or different) tokens can only ever be created via ##. In all other cases where a new (or different) token would be created, the preprocessor inserts a space to avoid it. For example:

#define EMPTY      /* nothing */
#define AVOID1     -EMPTY-

AVOID1             // - -, not --

Because EMPTY is defined to have zero tokens (comments don’t count), when EMPTY is expanded into AVOID1, you’d think that nothing should be there — except the left - and right - would then come together and form -- (a different token of the minus-minus operator) so the preprocessor inserts a space between them to preserve the original, separate - tokens.
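
The same effect can be observed from the compiler's side. In this sketch (demo_paste_avoidance() is a hypothetical name), the two - tokens around EMPTY remain two unary minus operators rather than becoming a decrement:

```c
#include <assert.h>

#define EMPTY      /* nothing */

static void demo_paste_avoidance( void ) {
  int x = 5;
  int y = -EMPTY-x;   /* two unary minuses, i.e., -(-x), not --x */
  assert( y == 5 );
  assert( x == 5 );   /* x was not decremented */
}
```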

The preprocessor largely doesn’t care whether it’s preprocessing C or C++ code — except when it comes to paste avoidance. For example:

#define AVOID2(X)  X*

AVOID2(->)         // in C  : ->*
AVOID2(->)         // in C++: -> *

The results differ because ->* isn’t an operator in C: it’s just the -> and * operators next to each other, which is fine. However, ->* is a distinct operator in C++, so the preprocessor must avoid pasting -> and * together to form the different token of ->*, so it inserts a space.

Preprocessor Problems

The preprocessor has a bad reputation, largely deservedly so. Specifically:

  1. All macros are in the global scope meaning you have to choose names that won’t clash with standard names nor names used in third-party packages.
  2. As shown, function-like macro parameters should always be enclosed in parentheses (except when arguments to either # or ##) to ensure the desired precedence.
  3. As shown, function-like macro arguments should not have side effects.
  4. As shown, multi-line macros should be enclosed in do ... while loops.
  5. Since multiline macros using escaped newlines are joined into a single, long line, errors in such lines are hard to read.
  6. Complicated macros are hard to debug because you only ever see the end result of the expansion. There’s no way to see how expansion progresses step-by-step.

Problems 2-5 go away by using either constexpr expressions or inline or constexpr (in C++) functions instead of function-like macros. However, it’s harder to use inline functions in C if the parameter types can vary since C doesn’t have templates.

C11 offers _Generic that provides a veneer of function overloading. Ironically, it requires a function-like macro to use it.
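
A minimal sketch of that combination follows, assuming only int and double need to be supported. DEMO_MAX and the two demo_max_ functions are hypothetical names, and some compilers may warn about the side effect appearing in the unevaluated _Generic controlling expression:

```c
#include <assert.h>

static inline int demo_max_int( int a, int b ) {
  return a > b ? a : b;
}

static inline double demo_max_double( double a, double b ) {
  return a > b ? a : b;
}

/* The controlling expression (X) + (Y) is used only for its type */
/* and is never evaluated; the arguments are evaluated once each  */
/* when the selected function is called.                          */
#define DEMO_MAX(X,Y)                 \
  _Generic( (X) + (Y),                \
    int   : demo_max_int,             \
    double: demo_max_double           \
  )( (X), (Y) )

static void demo_generic( void ) {
  assert( DEMO_MAX( 3, 7 ) == 7 );
  assert( DEMO_MAX( 2.5, 1.5 ) == 2.5 );

  int i = 0;
  int const m = DEMO_MAX( 0, ++i );
  assert( i == 1 && m == 1 );         /* ++i happened only once */
}
```

Any other argument type (long, for example) has no matching association here and would fail to compile, which is arguably better than silently converting.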

I decided to solve problem 6 myself by adding a feature to cdecl that allows you to #define macros as usual and then expand them where cdecl will print the expansion step-by-step as well as warn about things you might not expect. However, that’s a story for another time.

Conclusion

Preprocessor macros have their own weird language and are fraught with problems. Instead, it’s best to use inline or constexpr (in C++) functions whenever possible. Despite this, if used judiciously, preprocessor macros can be useful, but that too is a story for another time.

Further Reading

In the meantime, here are some other articles I’ve written that use preprocessor macros:
