Introduction
The C preprocessor was originally a stand-alone program that the C compiler called to “preprocess” source files before compiling them, hence the name. C has a preprocessor, unlike most other languages, because preprocessors were in common use at Bell Labs at the time, such as M4 and the troff suite of preprocessors.
Modern C and C++ compilers have the preprocessor integrated, though there are often options to control it specifically. For gcc and clang at least, the -E option causes a file only to be preprocessed, which can often be illuminating as to what the preprocessor is doing.
Preprocessing includes:
- Conditional elimination via #if, #ifdef, etc.
- File inclusion via #include.
- Comment replacement.
- Tokenization.
- Macro expansion.
The first four are fairly straightforward; macro expansion, however, is the most complicated — and weird. It’s its own mini-language inside C and C++.
Preliminaries
Unlike either C or C++, the preprocessor’s language is line-based: a preprocessor directive begins with # (that must be the first token on a line) and ends with end-of-line (on Unix systems, the newline character, ASCII decimal 10, written as \n), unless the newline is escaped via \, in which case the directive ends with the first unescaped newline.
Following the # may be zero or more whitespace characters followed by the directive name (define, if, include, etc.), hence all the following are equivalent:
#ifndef NDEBUG
# ifndef NDEBUG
#   ifndef NDEBUG
Some people, myself included, prefer the style where the # is always in the first column to make preprocessor lines stand out since an indented # is harder to notice.
Macros
There are two kinds of macros:
- Object-like.
- Function-like.
By convention, macro names are generally written in all UPPER CASE to draw attention to the fact that a name is a macro and the normal C/C++ rules do not apply.
Object-Like Macros
Object-like macros are the simplest: they replace a macro name with zero or more tokens comprising the replacement list. For example:
#define WS " \n\t\r\f\v"
Why would you want an object-like macro defined with zero tokens? Just to indicate that it is defined at all for use with #ifdef, #ifndef, or defined(). The actual definition doesn’t matter.
Another use is when a macro expands to different tokens depending on the platform via a sequence of #ifdefs. It’s sometimes the case that a particular platform either doesn’t support (or doesn’t need) whatever you’re trying to do; hence, the macro just expands to nothing.
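As a minimal sketch of the platform-dependent case, a macro can expand to a compiler-specific attribute where one exists and to nothing elsewhere. The names MY_UNUSED and compare_ignoring_context are made up for illustration, and the __GNUC__ test assumes a GCC-compatible compiler is the only one offering the attribute:

```c
#include <assert.h>
#include <stddef.h>

/* MY_UNUSED is a hypothetical name. On GCC-compatible compilers it
   expands to an attribute; elsewhere it expands to nothing at all. */
#if defined(__GNUC__)
# define MY_UNUSED __attribute__((unused))
#else
# define MY_UNUSED /* nothing */
#endif

/* A comparison callback whose signature requires a context parameter
   that this particular implementation doesn't need. */
int compare_ignoring_context( int a, int b, MY_UNUSED void *context ) {
  return (a > b) - (a < b);
}

void compare_example( void ) {
  assert( compare_ignoring_context( 1, 2, NULL ) < 0 );
  assert( compare_ignoring_context( 2, 2, NULL ) == 0 );
}
```

Either way, the function declaration reads the same; only the expansion of the macro differs per platform.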
Object-like macros are used for:
- Program-wide definitions to control conditional compilation via #if, #ifdef, etc.
- Program-wide constants (such as the above example).
- Include guards.
These days, constexpr in C23 can largely replace using object-like macros for program-wide constants; the same is true for constexpr in C++11.
Include guards are still necessary to prevent the declarations within a file from being seen by the compiler proper more than once:
// c_ast.h
#ifndef CDECL_C_AST_H
#define CDECL_C_AST_H
// ...
#endif /* CDECL_C_AST_H */
One convention for naming macros for include guards is the file name where:
- It is prefixed by the program name (to help ensure a unique name); and:
- All letters are converted to upper case; and:
- All non-identifier characters are converted to an underscore, except that a name must never start with _ nor contain __ (double underscore), since such names are reserved identifiers in C and C++.
You might sometimes see #pragma once (a non-standard, but widely supported directive) in header files as a replacement, but some compilers recognize conventional include guards and optimize them implicitly, so there’s really no reason to use it.
Function-Like Macros
Function-like macros can take zero or more parameters. For example, a common macro to get the number of elements of a statically allocated array might be:
#define ARRAY_SIZE(ARRAY) \
(sizeof(ARRAY) / sizeof(0[ARRAY]))
Yes, the syntax of 0[ARRAY] is legal. It’s a consequence of the quirky interplay between arrays and pointers in C. Briefly, the a[i] syntax to access the ith element of an array a is just syntactic sugar for *(a+i). Since addition is commutative, *(a+i) can be alternatively written as *(i+a); that in turn can be written as i[a]. In C, this has no practical use.
So why use it here? In C++, using 0[ARRAY] makes any attempt to use ARRAY_SIZE on an object of a class that overloads operator[] a compilation error, which is what you’d want.
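A short usage sketch in C follows. The primes array and the sum_primes function are invented for the example; only ARRAY_SIZE comes from the text above:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(ARRAY) \
  (sizeof(ARRAY) / sizeof(0[ARRAY]))

/* A statically allocated array: its size is known at compile time. */
static int const primes[] = { 2, 3, 5, 7, 11 };

/* Sums the elements without hard-coding the element count. */
int sum_primes( void ) {
  assert( ARRAY_SIZE( primes ) == 5 );
  int sum = 0;
  for ( size_t i = 0; i < ARRAY_SIZE( primes ); ++i )
    sum += primes[i];
  return sum;
}
```

Note that the macro only works on true arrays; given a pointer (such as an array that was passed as a function argument), it would silently compute the wrong value.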
Unlike either a C or C++ function, when defining a function-like macro, the ( that follows the macro’s name must be adjacent, i.e., not have any whitespace between them. (If there is whitespace between them, then it’s an object-like macro where the first character of the replacement list is (.)
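To make the difference concrete, here is a small sketch. Both macro names are hypothetical. The second definition has a space before the (, so it is an object-like macro whose replacement list merely starts with (:

```c
#include <assert.h>

#define SQUARE(X)     ( (X) * (X) )   /* function-like: ( is adjacent */
#define DOUBLE_X (x)  + (x)           /* object-like: expands to (x) + (x) */

int whitespace_example( void ) {
  int x = 3;
  assert( SQUARE( x ) == 9 );
  /* DOUBLE_X takes no arguments; the x in its replacement list is
     simply whatever x happens to be in scope where it's used. */
  return DOUBLE_X;
}
```

Here the accidental object-like macro happens to compile because a variable named x is in scope; more often, such a mistake shows up as a confusing syntax error at the point of use.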
Properly Using Parameters
Macro parameters, when used within the replacement list, invariably should be enclosed within parentheses. Additionally, if the replacement list is an expression, all of it should be enclosed within parentheses as well. For example:
#define MAX(X,Y) ( (X) > (Y) ? (X) : (Y) )
Why? Because substitution can result in incorrect operator precedence. Suppose MAX were instead defined like:
#define BAD_MAX(X,Y) X > Y ? X : Y
then called like:
int m = BAD_MAX( n & 0xFF, 8 );
The problem is that the precedence of the operators, from highest to lowest, is >, &, ?:, so it would be as if:
int m = (n & (0xFF > 8)) ? (n & 0xFF) : 8;
which very likely isn’t what you want.
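The difference can be observed directly. In this sketch, the two wrapper functions are invented for the example, and the pragma (an assumption that the compiler is GCC-compatible) only silences the warning that the deliberately unparenthesized expansion would otherwise trigger:

```c
#include <assert.h>

#define MAX(X,Y)      ( (X) > (Y) ? (X) : (Y) )
#define BAD_MAX(X,Y)  X > Y ? X : Y

#if defined(__GNUC__)
/* The missing parentheses are the point of the example. */
#pragma GCC diagnostic ignored "-Wparentheses"
#endif

int good_max_example( int n ) { return MAX( n & 0xFF, 8 ); }
int bad_max_example( int n )  { return BAD_MAX( n & 0xFF, 8 ); }

void max_example( void ) {
  /* 0x10 & 0xFF is 16, which is greater than 8. */
  assert( good_max_example( 0x10 ) == 16 );
  /* Without parentheses, > binds tighter than &, so the condition
     tests n & 1 instead, which is 0 here. */
  assert( bad_max_example( 0x10 ) == 8 );
}
```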
Another problem with parameters is that arguments having side effects can have those side effects occur multiple times. For example:
int m = MAX( n, ++i );
expands into:
int m = ( (n) > (++i) ? (n) : (++i) );
So if n is not greater than the incremented value of i, then i will be incremented twice! Historically, there’s been no way to fix this other than simply never passing expressions having side effects as macro arguments.
This is a great example of why the convention is that macro names are generally written in all UPPER CASE to draw attention to the fact that a name is a macro and the normal C/C++ rules do not apply.
The solution really is to use an inline function instead of a macro.
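A sketch of the contrast, assuming int arguments are all that’s needed. The function max_int and the two counting helpers are made-up names:

```c
#include <assert.h>

#define MAX(X,Y) ( (X) > (Y) ? (X) : (Y) )

/* A function evaluates each argument exactly once before the call. */
static inline int max_int( int a, int b ) {
  return a > b ? a : b;
}

/* Returns how many times i was incremented by the macro. */
int macro_increments( void ) {
  int n = 1, i = 5;
  int m = MAX( n, ++i );
  (void)m;
  return i - 5;
}

/* Returns how many times i was incremented by the function. */
int function_increments( void ) {
  int n = 1, i = 5;
  int m = max_int( n, ++i );
  (void)m;
  return i - 5;
}

void increments_example( void ) {
  assert( macro_increments() == 2 );
  assert( function_increments() == 1 );
}
```

The trade-off is that the function is tied to one parameter type whereas the macro works with any type that supports >.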
Arguments Can Be Anything
A macro argument can be any sequence of tokens, not just identifiers (like n) or valid expressions (like ++i), for example:
#define A_OR_B(A,OP,B) ( (A) OP (B) ? (A) : (B) )
#define MAX(X,Y) A_OR_B( (X), >, (Y) )
Above, the > token is a perfectly valid preprocessor argument even though it would be a syntax error as an ordinary C function argument. Remember: the normal C/C++ rules do not apply to the preprocessor.
Macros employing such “anything goes” arguments should generally be avoided, but they are used for “stringification” (see below).
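For illustration only, passing the operator as an argument lets one helper serve two macros. MIN and clamp are additions made up for this sketch; A_OR_B and MAX are as defined above:

```c
#include <assert.h>

#define A_OR_B(A,OP,B)  ( (A) OP (B) ? (A) : (B) )
#define MAX(X,Y)        A_OR_B( (X), >, (Y) )
#define MIN(X,Y)        A_OR_B( (X), <, (Y) )

/* Restricts value to the range [low, high]. */
int clamp( int value, int low, int high ) {
  assert( low <= high );
  return MIN( MAX( value, low ), high );
}
```

Called as clamp( 15, 0, 10 ), the inner MAX yields 15 and the outer MIN then yields 10.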
Variable Numbers of Arguments
Starting with C99, function-like macros can take a variable number of arguments (including zero). Such a macro is known as variadic. To declare such a macro, the last (or only) parameter is ... (an ellipsis).
For example, given a fatal_error() variadic function that prints an error message formatted in a printf()-like way and exit()s with a status code, you might want to wrap that with an INTERNAL_ERROR() macro that includes the file and line whence the error came:
_Noreturn void fatal_error( int status,
                            char const *format, ... );
#define INTERNAL_ERROR(FORMAT,...)            \
  fatal_error( EX_SOFTWARE,                   \
    "%s:%d: internal error: " FORMAT,         \
    __FILE__, __LINE__, __VA_ARGS__           \
  )
It’s common practice both to break up long replacement lists into multiple lines and to align the \s for readability.
Also, when a macro expands into a C or C++ statement, the replacement list should not end with a ; because the caller of the macro will provide it.
The __VA_ARGS__ token expands into everything that was passed for the second (in this case) and subsequent arguments (if any) and the commas that separate them.
In C and C++, adjacent string literals are concatenated into a single string, which is how the “internal error” part gets prepended to FORMAT. (This is an example of a rare instance where a macro parameter is intentionally not enclosed within parentheses.)
A keen observer might notice a possible problem with the INTERNAL_ERROR macro, specifically this line:
__FILE__, __LINE__, __VA_ARGS__ \
What if the macro were called like:
INTERNAL_ERROR( "oops" );
passing zero additional arguments? Then __VA_ARGS__ would expand into nothing and the , after __LINE__ would cause a syntax error since C functions don’t accept blank arguments. What’s needed is a way to include the , in the expansion only if __VA_ARGS__ is not empty.
Starting in C23 and C++20, that’s precisely what __VA_OPT__ does. You can rewrite the macro using it:
#define INTERNAL_ERROR(FORMAT,...)            \
  fatal_error( EX_SOFTWARE,                   \
    "%s:%d: internal error: " FORMAT,         \
    __FILE__, __LINE__                        \
    __VA_OPT__(,) __VA_ARGS__                 \
  )
Tokens to include in the expansion are enclosed within parentheses immediately following __VA_OPT__.
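Since fatal_error() exits, its behavior is awkward to observe, so here is a smaller sketch of the same __VA_OPT__ technique that formats into a buffer instead. FORMAT_ERROR, the "error" prefix, and the two buffers are invented for the example, and it assumes a compiler that supports __VA_OPT__:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* FORMAT_ERROR is a hypothetical macro; BUF must be a char array. */
#define FORMAT_ERROR(BUF,FORMAT,...)          \
  snprintf( (BUF), sizeof(BUF),               \
    "%s: " FORMAT,                            \
    "error"                                   \
    __VA_OPT__(,) __VA_ARGS__                 \
  )

char with_args[32];
char without_args[32];

void format_error_example( void ) {
  /* One additional argument: the comma is included. */
  FORMAT_ERROR( with_args, "n = %d", 42 );
  assert( strcmp( with_args, "error: n = 42" ) == 0 );

  /* Zero additional arguments: the comma is omitted. */
  FORMAT_ERROR( without_args, "oops" );
  assert( strcmp( without_args, "error: oops" ) == 0 );
}
```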
Properly Using Multiple Statements
Suppose there weren’t a fatal_error() function, so we’d need to call fprintf() and exit() directly instead. You might define INTERNAL_ERROR() like:
#define INTERNAL_ERROR(FORMAT, ...)           \
  fprintf( stderr,                            \
    "%s:%d: internal error: " FORMAT,         \
    __FILE__, __LINE__                        \
    __VA_OPT__(,) __VA_ARGS__                 \
  );                                          \
  exit( EX_SOFTWARE )
While such a definition will work in many cases, it will fail in others, specifically when following either an if or else. The problem is that a use like this:
if ( n < 0 )
  INTERNAL_ERROR( "n = %d", n );
will expand into:
if ( n < 0 )
  fprintf( stderr,
    "%s:%d: internal error: " "n = %d",
    "bad.c", 42
    , n
  );
exit( EX_SOFTWARE );
Since the user didn’t use {} for the if, only the fprintf() is executed conditionally and the exit() is always executed because it’s a separate statement.
The common way to fix this is by always enclosing multiple statements in a replacement list within a do ... while loop:
#define INTERNAL_ERROR(FORMAT, ...)           \
  do {                                        \
    fprintf( stderr,                          \
      "%s:%d: internal error: " FORMAT,       \
      __FILE__, __LINE__                      \
      __VA_OPT__(,) __VA_ARGS__               \
    );                                        \
    exit( EX_SOFTWARE );                      \
  } while (0)
Such a loop will keep multiple statements grouped together as a compound statement and execute exactly once. (The compiler will optimize away the check for 0.)
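The same idiom also makes a multi-statement macro safe to use before an else. In this sketch, SWAP_INT and sort_pair are hypothetical names; the point is only that the caller’s ; completes the do ... while statement so the else still pairs with the if:

```c
#include <assert.h>

#define SWAP_INT(A,B)   \
  do {                  \
    int tmp_ = (A);     \
    (A) = (B);          \
    (B) = tmp_;         \
  } while (0)

/* Puts *lo and *hi in ascending order; returns 1 if they were swapped. */
int sort_pair( int *lo, int *hi ) {
  assert( lo && hi );
  if ( *lo > *hi )
    SWAP_INT( *lo, *hi );
  else
    return 0;             /* already in order */
  return 1;
}
```

Had the macro been defined with plain { } instead, the ; after the macro call would have been an extra empty statement and the else would have been a syntax error.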
Stringification
The C preprocessor actually has two of its own operators. The first is (confusingly) # that “stringifies” its single argument:
#define STRINGIFY(X) # X
STRINGIFY(a) // results in: "a"
Specifically, # followed by a parameter name stringifies the set of tokens comprising the corresponding argument for that parameter.
Some of the weird things about the C preprocessor include:
- More than one token can comprise an argument.
- An argument’s leading and trailing whitespace is eliminated.
- Intervening whitespace between an argument’s tokens (if it has more than one) is collapsed to a single space.
For example:
STRINGIFY(a b)      // "a b"
STRINGIFY( a b )    // "a b"
STRINGIFY(a   b)    // "a b"
One place where stringification is typically used is with error-reporting macros to include the textual representation of a condition that was violated. For example, the assert() macro is implemented something like:
#define assert(EXPR) ((EXPR) ? (void)0 : \
__assert( __func__, __FILE__, __LINE__, #EXPR ))
assert( p != NULL );
Hence the text of the expression p != NULL is included in the error message as a string.
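Because a real assert() aborts on failure, here is a gentler sketch of the same idea. CHECK and check_non_negative are made-up names; the macro yields NULL when the condition holds and a message containing the expression’s text when it doesn’t:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CHECK is hypothetical: #EXPR becomes a string literal that is then
   concatenated with the adjacent "check failed: " literal. */
#define CHECK(EXPR) \
  ( (EXPR) ? NULL : "check failed: " #EXPR )

char const* check_non_negative( int n ) {
  return CHECK( n >= 0 );
}

void check_example( void ) {
  assert( check_non_negative( 1 ) == NULL );
  assert( strcmp( check_non_negative( -1 ), "check failed: n >= 0" ) == 0 );
}
```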
Concatenation
The second preprocessor operator is (confusingly, again) ## that concatenates (or pastes) its two arguments together:
#define PASTE(A,B) A ## B
PASTE(foo, bar) // results in: foobar
Specifically, ## between two parameter names concatenates the set of tokens comprising the corresponding arguments for those parameters. Additionally, there can be multiple ## in a row:
#define PASTE3(A,B,C) A ## B ## C
Another weird thing about the preprocessor is that:
- Arguments can be omitted, resulting in empty arguments.
For example:
PASTE(,) // (nothing)
PASTE(foo,) // foo
PASTE(,bar) // bar
This is even true for one argument:
STRINGIFY() // ""
# and ## Argument Pitfalls
One curious (and often annoying) thing about arguments for either # or ## is that neither expands its arguments even when they’re themselves macros. For example, the previous definition of PASTE() can be insufficient:
PASTE(var_, __LINE__) // var___LINE__
What you want is a result like var_42, that is, the prefix var_ followed by the current line number. (Such names are typically used to help ensure unique names.) The problem is that, while PASTE expands its parameter B into the argument __LINE__, if the argument is itself a macro that ordinarily would expand (in this case, to the current line number), it won’t be expanded.
To fix it (as with many other problems in software) requires an extra level of indirection:
#define PASTE_HELPER(A,B) A ## B
#define PASTE(A,B) PASTE_HELPER(A,B)
PASTE(var_, __LINE__) // var_42
This fixes the problem because __LINE__ will be expanded by PASTE (because it’s not an argument of ##) and then the result of that expansion (the current line number, here, 42) will be passed to PASTE_HELPER that will just concatenate it as-is. The same indirection fix can be used with # when necessary as well.
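For #, the fix looks the same. In this sketch, STRINGIFY_HELPER and BUFFER_SIZE are names made up for the example:

```c
#include <assert.h>
#include <string.h>

#define STRINGIFY_HELPER(X)  #X
#define STRINGIFY(X)         STRINGIFY_HELPER(X)

#define BUFFER_SIZE 64       /* hypothetical constant */

/* Used directly, # sees the macro's name, not its expansion. */
char const *const direct   = STRINGIFY_HELPER( BUFFER_SIZE );

/* With the extra level, BUFFER_SIZE is expanded to 64 first. */
char const *const indirect = STRINGIFY( BUFFER_SIZE );

void stringify_example( void ) {
  assert( strcmp( direct, "BUFFER_SIZE" ) == 0 );
  assert( strcmp( indirect, "64" ) == 0 );
}
```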
# and ## Argument Rationale
So why does the preprocessor work this way, i.e., why do # and ## not expand arguments that are macros? Because if they did expand them, then it would be impossible either to quote or paste arguments as-is.
For example, consider a macro that defines the C language version where __LINE__ starts being supported:
#define LANG___LINE__ 199409L
and a function lang_is() that checks whether the current language is equal to or later than its argument:
if ( lang_is( LANG___LINE__ ) )
// ...
Rather than having to type both lang and LANG, which is redundant, you define a convenience macro like:
#define LANG_IS(LANG_MACRO) lang_is( LANG_ ## LANG_MACRO )
and can now instead write:
if ( LANG_IS( __LINE__ ) )
// ...
In this case, you want LANG_IS() to paste LANG_ and __LINE__ together to get LANG___LINE__, which is exactly what happens. If ## expanded its arguments, you’d instead get something like LANG_42, which is not what you want in this case.
Not Expanding a Macro
There are a couple more weird things about the C preprocessor, specifically that a macro will not expand if either:
- It references itself (either directly or indirectly); or:
- A use of a function-like macro is not followed by (.
An example of a self-referential macro is:
#define nullptr nullptr
What use is that? It makes a name available to the preprocessor that can be tested via #ifdef.
What about a use of a function-like macro that is not followed by (? There are no simple examples I’m aware of, so a complicated example is a story for another time.
Paste Avoidance
As mentioned initially, the preprocessor tokenizes the input. New (or different) tokens can only ever be created via ##. In all other cases where a new (or different) token would be created, the preprocessor inserts a space to avoid it. For example:
#define EMPTY /* nothing */
#define AVOID1 -EMPTY-
AVOID1 // - -, not --
Because EMPTY is defined to have zero tokens (comments don’t count), when EMPTY is expanded within AVOID1, you’d think that nothing should be there, except the left - and right - would then come together and form -- (a different token of the minus-minus operator), so the preprocessor inserts a space between them to preserve the original, separate - tokens.
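The effect can also be observed in a compiled program rather than only in -E output. In this sketch (the function name is invented), the two - remain separate unary minus tokens, so the variable is negated twice rather than decremented:

```c
#include <assert.h>

#define EMPTY /* nothing */

int paste_avoidance_example( void ) {
  int x = 5;
  /* After EMPTY expands to nothing, this is -(-x), not --x. */
  int y = -EMPTY-x;
  assert( x == 5 );       /* x was not decremented */
  return y;
}
```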
The preprocessor largely doesn’t care whether it’s preprocessing C or C++ code — except when it comes to paste avoidance. For example:
#define AVOID2(X) X*
AVOID2(->) // in C : ->*
AVOID2(->) // in C++: -> *
The reason the results differ is that ->* isn’t an operator in C: it’s just the -> and * operators next to each other, which is fine. However, ->* is a distinct operator in C++, so the preprocessor must avoid pasting -> and * together to form the different token of ->*, so it inserts a space.
Preprocessor Problems
The preprocessor has a bad reputation, largely deservedly so. Specifically:
1. All macros are in the global scope, meaning you have to choose names that won’t clash with standard names or names used in third-party packages.
2. As shown, function-like macro parameters should always be enclosed in parentheses (except when arguments to either # or ##) to ensure the desired precedence.
3. As shown, function-like macro arguments should not have side effects.
4. As shown, multi-statement macros should be enclosed in do ... while loops.
5. Since multiline macros using escaped newlines are joined into a single, long line, errors in such lines are hard to read.
6. Complicated macros are hard to debug because you only ever see the end result of the expansion. There’s no way to see how expansion progresses step-by-step.
Problems 2-5 go away by using either constexpr expressions or inline or constexpr (in C++) functions instead of function-like macros. However, it’s harder to use inline functions in C if the parameter types can vary since C doesn’t have templates.
C11 offers _Generic that provides a veneer of function overloading. Ironically, it requires a function-like macro to use it.
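As a rough sketch of that combination, a macro can use _Generic to select among inline functions by type. The functions max_int and max_double are made-up names, and only int and double are handled here:

```c
#include <assert.h>

static inline int max_int( int a, int b ) {
  return a > b ? a : b;
}

static inline double max_double( double a, double b ) {
  return a > b ? a : b;
}

/* (X) + (Y) is never evaluated; only its type (after the usual
   arithmetic conversions) is used to pick the function. */
#define MAX(X,Y)            \
  _Generic( (X) + (Y),      \
    int    : max_int,       \
    double : max_double     \
  )( (X), (Y) )

int    max_int_example( void )    { return MAX( 3, 5 ); }
double max_double_example( void ) { return MAX( 2.5, 1 ); }
```

Since the arguments are passed to a real function, each is evaluated exactly once, so the side-effect problem of the plain MAX macro doesn’t arise.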
I decided to solve problem 6 myself by adding a feature to cdecl that allows you to #define macros as usual and then expand them where cdecl will print the expansion step-by-step as well as warn about things you might not expect. However, that’s a story for another time.
Conclusion
Preprocessor macros have their own weird language and are fraught with problems. Instead, it’s best to use inline or constexpr (in C++) functions whenever possible. Despite this, if used judiciously, preprocessor macros can be useful, but that too is a story for another time.
Further Reading
In the meantime, here are some other articles I’ve written that use preprocessor macros:
- Bit Constant Macros in C
- C++ New Style Casts in C (sort of)
- setjmp(), longjmp(), and Exception Handling in C
References
- 7 Scandalous Weird Old Things About The C Preprocessor, Robert Elder, September 20, 2015
- The GNU C Preprocessor Internals
- Programming languages — C, ISO/IEC 9899:2023 working draft — April 1, 2023