A tokenizer, also known as a lexical analyzer or lexer, is the first phase of a compiler. At this stage the source code is scanned and broken down into a sequence of tokens. The tokens produced by the tokenizer are designed around the syntax of the programming language being processed, and they represent the basic building blocks of the language, such as keywords, identifiers, operators, and literals.
The primary function of the tokenizer in compiler design is to strip superfluous information from the source code and transform it into a format that is simpler for the compiler's later stages to process. Tokenization is typically accomplished with regular expressions, which specify the patterns for identifying the various kinds of tokens in the source code. The tokenizer may also handle white space and comments, detect and report errors, and create symbol tables for the parser.
In this article, we are going to build a tokenizer function using TypeScript! Our tokenizer will tokenize a string of simple arithmetic expressions involving addition, subtraction, division, and multiplication, which could be used as part of a calculator program. Sample strings would be "2+2" or "2 * (2 + 4)".
const tokenRef = {
  number: /\d+/,
  operand: /[\+\-\*\/]/,
  openBracket: /\(/,
  closeBracket: /\)/,
} as const;
We create an object that stores a group of regular expressions for tokenizing a mathematical expression. The expression has different token kinds, including numbers, operators, and parentheses; these regular expressions will be used to identify the specific tokens that match our definitions.
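As a quick sanity check (purely illustrative, using the tokenRef object above), each pattern matches the single characters we expect:

console.log(tokenRef.number.test('2')); // true
console.log(tokenRef.operand.test('+')); // true
console.log(tokenRef.openBracket.test('(')); // true
console.log(tokenRef.closeBracket.test(')')); // true
console.log(tokenRef.number.test('+')); // false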
const types = {
  number: 'Number',
  operand: 'Operand',
  openBracket: 'OpenBracket',
  closeBracket: 'CloseBracket',
} as const;
The types object provides more detail about the kinds of tokens the tokenizer can recognize. It will be used to map the regular expressions in the tokenRef object to the specific token types that they represent. For example, the number regular expression will be mapped to the Number token type, the operand regular expression will be mapped to the Operand token type, and so on. Observe closely that the two objects have the same keys. Since we are writing TypeScript, we'll create a utility function that gives us a typed array of the keys of any object we pass to it.
const objKeys = function <T extends Object>(obj: T): (keyof T)[] {
  return Object.keys(obj) as (keyof T)[];
};
We want the array to be strongly typed by inferring the keys of the object as the type of the items in the array.
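// Inferred as ("number" | "operand" | "openBracket" | "closeBracket")[],
// rather than the plain string[] that Object.keys alone would give us.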
const ALLOWED_TOKENS = objKeys(tokenRef);
const ALLOWED_TYPES = [
  types.number,
  types.operand,
  types.closeBracket,
  types.openBracket,
] as const;

type token = {
  token: string;
  type: typeof ALLOWED_TYPES[number];
};
The ALLOWED_TYPES constant stores the array of token types that the tokenizer will recognize, and the token type describes a single token object whose type field must be one of those values. This provides a clear set of rules for the tokenizer to follow, allowing it to only recognize tokens of the specified types. Now let's write the function that will produce the tokens.
const generateToken = function* (
  str: string
): Generator<token> {
  const charArr = Array.from(str);
  let cursor: number = 0;
  const tokenKeys = objKeys(tokenRef);
  while (cursor !== charArr.length) {
    for (let i = 0; i < tokenKeys.length; i++) {
      if (tokenRef[tokenKeys[i]].test(charArr[cursor])) {
        // The character matches this pattern, so emit it with its mapped type.
        yield { token: charArr[cursor], type: types[tokenKeys[i]] };
        break;
      }
    }
    // Always advance the cursor; without this, a character that matches no
    // pattern (such as whitespace) would loop forever.
    cursor++;
  }
};
Our generateToken function above takes a string as an argument and returns a generator that yields the tokens of the string. The tokenKeys array contains the keys of the tokenRef object, and the types object is used to map each regular expression to its specific token type. The generateToken function then iterates through the characters of the string and tests them against the regular expression for each token kind. If a character matches one of the regular expressions, it is yielded with its associated token type. This allows the tokenizer to accurately tokenize a mathematical expression.
const str = `2+2`;
let _tokens: token[] = [];
const tokenIterator = generateToken(str);
_tokens = [...tokenIterator];
console.log(_tokens);
/* [
{ token: '2', type: 'Number' },
{ token: '+', type: 'Operand' },
{ token: '2', type: 'Number' },
] */
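With the small cursor fix above, any character that matches no pattern (such as whitespace) is simply skipped, so the second sample string tokenizes as well:

const bracketed = generateToken(`2 * (2 + 4)`);
console.log([...bracketed]);
/* [
{ token: '2', type: 'Number' },
{ token: '*', type: 'Operand' },
{ token: '(', type: 'OpenBracket' },
{ token: '2', type: 'Number' },
{ token: '+', type: 'Operand' },
{ token: '4', type: 'Number' },
{ token: ')', type: 'CloseBracket' },
] */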
We have successfully created a tokenizer function; however, there are some drawbacks associated with our code. We can improve it by adding an optimization that checks for overlapping regular expressions and prefers the longest matching expression (so that, for example, a multi-digit number like 24 becomes a single token instead of two). We can also add error handling to the generateToken function to ensure that it fails gracefully if an invalid token is encountered. Finally, we can add an optimization that allows the generateToken function to skip over tokens that have already been processed. These optimizations will help the tokenizer become more accurate and efficient.
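As a rough sketch of what graceful failure could look like (illustrative only, using a hypothetical safeGenerateToken rather than a final design), the generator could throw a descriptive error when it meets a character that no pattern recognizes:

// Illustrative sketch only: throw on characters no pattern recognizes,
// instead of silently skipping them.
const safeGenerateToken = function* (str: string): Generator<token> {
  for (const [index, char] of Array.from(str).entries()) {
    if (/\s/.test(char)) continue; // ignore whitespace
    const key = objKeys(tokenRef).find((k) => tokenRef[k].test(char));
    if (key === undefined) {
      throw new SyntaxError(`Unexpected character "${char}" at position ${index}`);
    }
    yield { token: char, type: types[key] };
  }
};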
See you in the next article where we'll optimize the tokenizer and move into the next stage of compiling.