Introduction
A union is syntactically just like a struct and is used to store data for any one of its members at any one time. For example:
union value {
  long    i;
  double  f;
  char    c;
  char   *s;
};
union value v;
v.i = 42; // value is now 42
v.c = 'a'; // value is now 'a' (no more 42)
union value *pv = &v;
pv->s = malloc(6); // -> works too
strcpy( pv->s, "hello" );
The size of a union is at least the size of its largest member; the compiler may add trailing padding to satisfy alignment.
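As a quick compile-time check of that claim, here is a minimal sketch that re-declares union value and verifies it is big enough for each member and that every member starts at offset 0. It assumes a C11 compiler and uses static_assert from <assert.h>.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>   // offsetof

union value {
  long    i;
  double  f;
  char    c;
  char   *s;
};

// The union must be able to hold any one of its members.
static_assert( sizeof(union value) >= sizeof(long),   "too small for long" );
static_assert( sizeof(union value) >= sizeof(double), "too small for double" );
static_assert( sizeof(union value) >= sizeof(char*),  "too small for char*" );

// All members overlay each other, so each one starts at offset 0.
static_assert( offsetof(union value, f) == 0, "f is not at offset 0" );
static_assert( offsetof(union value, s) == 0, "s is not at offset 0" );
```

If any assertion were false, the file would fail to compile rather than fail at run time.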
A common use case for a union would be in a compiler or interpreter where a token is any one of a character literal, integer literal, floating-point literal, string literal, identifier, operator, etc. It would be wasteful to use a struct since only one member would ever have a value.
Initialization
Since all members have the same offset, their order mostly doesn’t matter, except that the first member is the one that is initialized when an initializer list is used, so the value given must be of that member’s type:
union value v = { 42 }; // as if: v.i = 42
That said, 0 can initialize any built-in type, so { 0 } works whichever built-in type the first member has.
Alternatively, you can use a designated initializer to specify a member:
union value v = { .c = 'a' };
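To put the initialization forms side by side, here is a small sketch. The variable names v_first, v_desig, and v_zero are made up for illustration.

```c
union value {
  long    i;
  double  f;
  char    c;
  char   *s;
};

union value v_first = { 42 };        // initializer list: sets the first member, i
union value v_desig = { .f = 3.5 };  // designated initializer: sets f
union value v_zero  = { 0 };         // 0 converts to long, the first member's type
```

Only the member that was initialized is meaningful to read back afterwards.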
Which Member?
One obvious problem with a union is, after you store a value in a particular member, how do you later remember which member that was? With a union by itself, you generally can’t. You need some other variable to “remember” the member you last stored a value in. Often, this is done using an enumeration and a struct:
enum token_kind {
  TOKEN_NONE,
  TOKEN_INT,
  TOKEN_FLOAT,
  TOKEN_CHAR,
  TOKEN_STR
};
struct token {
  enum token_kind kind;
  union {                 // "anonymous" union
    long    i;
    double  f;
    char    c;
    char   *s;
  };
};
struct token t = { .kind = TOKEN_CHAR, .c = 'a' };
When a union is used inside a struct, it’s often made an anonymous union, that is, a union without a name. In this case, the union members behave as if they’re direct members of their enclosing struct except they all have the same offset.
Anonymous unions (and structs) are only supported starting in C11.
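Once kind records which member is live, code that consumes a token can switch on it and touch only that member. Here is a minimal sketch; token_format is a hypothetical helper, not something from a library.

```c
#include <stdio.h>

enum token_kind { TOKEN_NONE, TOKEN_INT, TOKEN_FLOAT, TOKEN_CHAR, TOKEN_STR };

struct token {
  enum token_kind kind;
  union {                 // anonymous union (C11)
    long    i;
    double  f;
    char    c;
    char   *s;
  };
};

// Hypothetical helper: writes a printable form of the token into buf.
// Returns what snprintf returns: the length the full text needs.
int token_format( struct token const *t, char *buf, size_t buf_size ) {
  switch ( t->kind ) {    // kind says which union member is valid
    case TOKEN_INT:   return snprintf( buf, buf_size, "%ld", t->i );
    case TOKEN_FLOAT: return snprintf( buf, buf_size, "%g", t->f );
    case TOKEN_CHAR:  return snprintf( buf, buf_size, "'%c'", t->c );
    case TOKEN_STR:   return snprintf( buf, buf_size, "\"%s\"", t->s );
    case TOKEN_NONE:  break;
  }
  return snprintf( buf, buf_size, "(none)" );
}
```

Listing every enumerator in the switch, with no default, lets many compilers warn if a new token kind is added but not handled.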
Type Punning
Type punning is a technique to read or write an object as if it were of a type other than what it was declared as. Since this circumvents the type system, you really have to know what you’re doing. In C (but not C++), a union can be used for type punning. For example, here’s a way to get the value of a 32-bit integer with the high and low order 16-bit halves swapped:
uint32_t swap16of32( uint32_t n ) {
  union {
    uint32_t u32;
    uint16_t u16[2];
  } u = { n };
  uint16_t const t16 = u.u16[0];
  u.u16[0] = u.u16[1];
  u.u16[1] = t16;
  return u.u32;
}
The union members u32 and u16[2] “overlay” each other, allowing you to read and write a uint32_t as if it were a 2-element array of uint16_t. (You could alternatively write a version that used uint8_t[4] and reversed the entire byte order, depending on your particular need.)
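The uint8_t[4] variant mentioned above might look like the following sketch. The name swap_bytes32 is made up for illustration.

```c
#include <stdint.h>

// Reverses the byte order of a 32-bit integer by punning it as 4 bytes.
uint32_t swap_bytes32( uint32_t n ) {
  union {
    uint32_t u32;
    uint8_t  u8[4];
  } u = { n };

  uint8_t t8 = u.u8[0];           // swap the outer pair of bytes
  u.u8[0] = u.u8[3];
  u.u8[3] = t8;

  t8 = u.u8[1];                   // swap the inner pair of bytes
  u.u8[1] = u.u8[2];
  u.u8[2] = t8;

  return u.u32;
}
```

Because the whole array is reversed, the result is the same on little-endian and big-endian machines: swap_bytes32(0x11223344) yields 0x44332211 either way.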
You can also use unions to do type punning of unrelated types, for example int32_t and float, allowing you to access the sign, exponent, and mantissa individually. (However, the result depends on the CPU’s floating-point format and byte order.)
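As a sketch of punning a float, the following splits one into its three fields. It assumes float is a 32-bit IEEE 754 value and that floats and integers share the same byte order, which holds on most current CPUs but is not guaranteed by C. struct float_parts and float_split are hypothetical names.

```c
#include <assert.h>   // static_assert (C11)
#include <stdint.h>

// Assumption: float is 32 bits wide (IEEE 754 binary32).
static_assert( sizeof(float) == sizeof(uint32_t), "float must be 32 bits" );

struct float_parts {     // hypothetical result type
  uint32_t sign;         // 1 bit
  uint32_t exponent;     // 8 bits, biased by 127
  uint32_t mantissa;     // 23 bits
};

// Splits f into its sign, biased exponent, and mantissa bits.
struct float_parts float_split( float f ) {
  union {
    float    f;
    uint32_t u32;
  } u = { f };           // store as float, read back as uint32_t

  struct float_parts const parts = {
    .sign     =  u.u32 >> 31,
    .exponent = (u.u32 >> 23) & 0xFF,
    .mantissa =  u.u32 & 0x7FFFFF
  };
  return parts;
}
```

Under those assumptions, 1.0f splits into sign 0, exponent 127, and mantissa 0.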
Restricted Class Hierarchies in C
Another use for unions is to implement class hierarchies in C, but only “restricted” class hierarchies. A “restricted” class hierarchy is one used only to implement a solution to a problem where all the classes are known. Users are not permitted to extend the hierarchy via derivation.
This can be partially achieved via final in C++ or fully achieved via sealed in Java or Kotlin.
Of course C doesn’t have either classes or inheritance, but restricted class hierarchies can be implemented via structs and a union.
The token example shown previously is a simple example of this: all the kinds of tokens are known and there’s one member in the union to hold the data for each kind. But what if there’s more than one member per kind?
For a larger example, consider cdecl, a program that can parse a C or C++ declaration (aka “gibberish”) and explain it in English:
cdecl> explain int *const (*p)[4]
declare p as pointer to array 4 of constant pointer to integer
During parsing, cdecl creates an abstract syntax tree (AST) of nodes where each node contains information for a particular kind of declaration. For example, the previous declaration could be represented as an AST like (expressed in JSON):
{
  "name": "p",
  "kind": "pointer",
  "pointer": {
    "to": {
      "kind": "array",
      "array": {
        "size": 4,
        "of": {
          "kind": "pointer",
          "type": "const",
          "pointer": {
            "to": {
              "kind": "built-in type",
              "type": "int"
            }
          }
        }
      }
    }
  }
}
For this example, let’s consider a subset of the kinds of nodes in a C++ declaration (to keep the example shorter):
enum c_ast_kind {
  K_BUILTIN,              // e.g., int
  K_CLASS_STRUCT_UNION,
  K_TYPEDEF,
  K_ARRAY,
  K_ENUM,
  K_POINTER,
  K_REFERENCE,            // C++ reference
  K_CONSTRUCTOR,          // C++ constructor
  K_DESTRUCTOR,           // C++ destructor
  K_FUNCTION,
  K_OPERATOR,             // C++ overloaded operator
  // ...
};
typedef enum c_ast_kind c_ast_kind_t;
And declare some structs to contain the information needed for each kind:
typedef struct c_ast c_ast_t;       // forward declaration
struct c_array_ast {
  c_ast_t            *of_ast;          // array of ...
  unsigned            size;
};
struct c_enum_ast {
  c_ast_t            *of_ast;          // fixed type, if any
  unsigned            bit_width;       // width when > 0
  char const         *enum_name;       // enumeration name
};
struct c_function_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
};
struct c_operator_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
  c_operator_t const *operator;        // operator info
};
struct c_ptr_ref_ast {
  c_ast_t            *to_ast;          // pointer/ref to ...
};
struct c_typedef_ast {
  c_ast_t const      *for_ast;         // typedef for ...
  unsigned            bit_width;       // width when > 0
};
Notice that, of the AST information declared thus far, there are similarities, specifically:
- Every node points to one other node, and that pointer is declared first.
- Functions and operators both have return types and parameter lists, and the parameter lists are declared second.
- Nodes that have a bit-field width instead declare the width second.
The fact that the same members in different structs are at the same offset is convenient because it means that code that, say, iterates over the parameters of a function will also work for the parameters of an operator. Having noticed this, we can make an effort to keep the same members in any remaining structs at the same offsets. For example, the information for K_BUILTIN could be declared as:
struct c_builtin_ast {
  unsigned bit_width;     // width when > 0
};
because that’s all the information that’s needed for a built-in type. However, the bit_width member wouldn’t be at the same offset as the same member in either c_enum_ast or c_typedef_ast. To fix that so code that accesses bit_width can do so for any type that has it, we need to insert an unused pointer (a void pointer will do):
struct c_builtin_ast {
  void     *reserved;     // instead of for/to
  unsigned  bit_width;    // width when > 0
};
If you think inserting unused members might waste space, remember that, once all these structs are put into the same union, the union will be the size of the largest member anyway; hence inserting unused members doesn’t waste space.
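You can convince yourself of this with a small compile-time experiment. In this sketch, all the type names are simplified stand-ins invented for the comparison: one union holds the unpadded built-in struct, the other holds the padded one, and both also hold a larger enum-like struct.

```c
#include <assert.h>   // static_assert (C11)

// Stand-in for the largest member of the union.
struct enum_info {
  void       *of_ast;
  unsigned    bit_width;
  char const *enum_name;
};

struct builtin_v1 {       // without the unused pointer
  unsigned bit_width;
};

struct builtin_v2 {       // with the unused pointer
  void     *reserved;
  unsigned  bit_width;
};

union node_v1 { struct enum_info e; struct builtin_v1 b; };
union node_v2 { struct enum_info e; struct builtin_v2 b; };

// The larger member dictates the size in both cases.
static_assert( sizeof(union node_v1) == sizeof(union node_v2),
               "padding the smaller struct changed the union's size" );
```

The padded struct only costs space if it becomes the largest member of the union.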
While using a named member like reserved is fine, if you want to help guarantee that the member can never be accessed directly, you can employ a macro:
#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]
struct c_builtin_ast {
  DECL_UNUSED(c_ast_t*);  // instead of for/to
  unsigned bit_width;     // width when > 0
};
See here for details on UNIQUE_NAME.
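The definition of UNIQUE_NAME isn’t shown in this article. Purely as an assumption about what such a macro could look like, here is one possible sketch that appends the current line number to a prefix. NAME2 and NAME2_HELPER are hypothetical helpers, and the structs are simplified stand-ins.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>   // offsetof

// Two-step concatenation so __LINE__ is expanded before being pasted.
#define NAME2_HELPER(A,B)    A##B
#define NAME2(A,B)           NAME2_HELPER(A,B)

// ASSUMED definition, not necessarily the author's: PREFIX_<line number>.
#define UNIQUE_NAME(PREFIX)  NAME2(PREFIX##_,__LINE__)

#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]

struct c_typedef_ast {
  void const *for_ast;    // stand-in for c_ast_t const*
  unsigned    bit_width;
};

struct c_builtin_ast {
  DECL_UNUSED(void*);     // occupies the space of for_ast
  unsigned bit_width;
};

// The padding puts bit_width at the same offset in both structs.
static_assert(
  offsetof( struct c_builtin_ast, bit_width ) ==
  offsetof( struct c_typedef_ast, bit_width ),
  "bit_width offsets differ"
);
```

A line-number-based name breaks if two DECL_UNUSED uses share a line within one struct. Some compilers (GCC, Clang, and MSVC, for example) offer a __COUNTER__ extension that avoids that.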
We can apply the same fix for the information for K_CONSTRUCTOR so param_ast_list is at the same offset as in c_function_ast and c_operator_ast (constructors don’t have return types):
struct c_ctor_ast {
  DECL_UNUSED(c_ast_t*);        // instead of ret_ast
  c_ast_list_t param_ast_list;  // parameter(s)
};
And again apply the same fix for the information for K_CLASS_STRUCT_UNION so csu_name is at the same offset as enum_name in c_enum_ast:
struct c_csu_ast {
  DECL_UNUSED(c_ast_t*);  // instead of for/to
  DECL_UNUSED(unsigned);  // instead of bit_width
  char const *csu_name;
};
Given all those declarations (assume that for any struct c_X_ast, there’s a typedef struct c_X_ast c_X_ast_t), we can now put them all inside an anonymous union inside a struct for an AST node:
struct c_ast {
  c_ast_kind_t  kind;
  char const   *name;
  c_type_t      type;
  // ...
  union {
    c_array_ast_t     array;
    c_builtin_ast_t   builtin;
    c_csu_ast_t       csu;
    c_ctor_ast_t      ctor;
    c_enum_ast_t      enum_;
    c_function_ast_t  func;
    c_operator_ast_t  oper;
    c_ptr_ref_ast_t   ptr_ref;
    c_typedef_ast_t   tdef;
    // ...
  };
};
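To show the payoff of the shared offsets, here is a cut-down sketch with only four kinds. The function c_ast_bit_width is hypothetical: it reads bit_width through a single member path for every kind that has one. This relies on C’s union type punning and on bit_width really being at the same offset in each struct.

```c
typedef struct c_ast c_ast_t;

enum c_ast_kind { K_BUILTIN, K_TYPEDEF, K_ENUM, K_POINTER };

struct c_builtin_ast { void          *reserved; unsigned bit_width; };
struct c_typedef_ast { c_ast_t const *for_ast;  unsigned bit_width; };
struct c_enum_ast    { c_ast_t       *of_ast;   unsigned bit_width;
                       char const    *enum_name; };
struct c_ptr_ref_ast { c_ast_t       *to_ast; };

struct c_ast {
  enum c_ast_kind kind;
  union {
    struct c_builtin_ast  builtin;
    struct c_typedef_ast  tdef;
    struct c_enum_ast     enum_;
    struct c_ptr_ref_ast  ptr_ref;
  };
};

// Hypothetical helper: returns the bit-field width, or 0 if the kind has none.
unsigned c_ast_bit_width( c_ast_t const *ast ) {
  switch ( ast->kind ) {
    case K_BUILTIN:
    case K_TYPEDEF:
    case K_ENUM:
      // Same offset in all three structs, so one access path serves them all.
      return ast->builtin.bit_width;
    case K_POINTER:
      break;
  }
  return 0;
}
```

Without the shared layout, each case would need its own return statement naming its own member.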
Safeguards
One problem with this approach is that, if you modify any of the structs, you might inadvertently change the offset of some member so that it no longer is at the same offset as the same member in another struct. One way to guard against this is via offsetof and _Static_assert (spelled static_assert here, which before C23 requires including <assert.h>):
static_assert(
  offsetof( c_operator_ast_t, param_ast_list ) ==
  offsetof( c_function_ast_t, param_ast_list ),
  "offsetof param_ast_list in c_operator_ast_t & c_function_ast_t must equal"
);
static_assert(
  offsetof( c_csu_ast_t, csu_name ) ==
  offsetof( c_enum_ast_t, enum_name ),
  "offsetof csu_name != offsetof enum_name"
);
// More for other members ....
Now you’ll get a compile-time error if any of the offsets change inadvertently.
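If you end up writing many of these assertions, a small macro can cut the repetition. The following is only a sketch: ASSERT_SAME_OFFSET is a hypothetical macro, and the types are simplified stand-ins so the example compiles on its own.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>   // offsetof

// Hypothetical macro: asserts that two members share an offset and builds
// the error message from the type and member names.
#define ASSERT_SAME_OFFSET(T1,M1,T2,M2)                       \
  static_assert( offsetof(T1,M1) == offsetof(T2,M2),          \
    "offsetof(" #T1 "," #M1 ") != offsetof(" #T2 "," #M2 ")" )

// Simplified stand-ins for the real types.
typedef struct c_ast c_ast_t;
typedef struct { c_ast_t *head, *tail; } c_ast_list_t;

typedef struct {
  c_ast_t      *ret_ast;
  c_ast_list_t  param_ast_list;
} c_function_ast_t;

typedef struct {
  c_ast_t      *ret_ast;
  c_ast_list_t  param_ast_list;
  void const   *op_info;          // stand-in for the operator info pointer
} c_operator_ast_t;

ASSERT_SAME_OFFSET( c_operator_ast_t, ret_ast,
                    c_function_ast_t, ret_ast );
ASSERT_SAME_OFFSET( c_operator_ast_t, param_ast_list,
                    c_function_ast_t, param_ast_list );
```

If someone later inserts a member before param_ast_list in only one of the structs, the build fails with a message naming both members.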
Conclusion
Take-aways for unions in C:
- They can be used either for storing data for any one member at any one time or for type punning.
- For type punning very different types, the order of bytes is CPU-dependent.
- They can be used to implement restricted class hierarchies.
You can also use unions in C++, but that’s a story for another time.