Creating Python's AST (Pogo Pt:7)

Chig Beef - Jan 23 - - Dev Community

Intro

In this series I am creating a transpiler from Python to Golang called Pogo. In the last post we created a bunch of functions that will allow us to easily construct the abstract syntax tree from our source, a slice of Tokens. Creating the AST is the last step before we can start emitting code!

The Main Function

func (p *Parser) parse(input []Token) Structure {
    if len(input) == 0 {
        log.Fatal("[Parse (parse)] Missing input")
    }

    p.source = input
    p.curPos = 0
    p.curToken = p.source[p.curPos]

    s, err := p.program()
    if err != nil {
        log.Fatal(err.Error())
    }

    s, err = p.checkImport(s)
    if err != nil {
        log.Fatal(err.Error())
    }

    return s
}
Enter fullscreen mode Exit fullscreen mode

We are finally starting to parse the code, and the first check we are going to make is whether we even have code to parse. To help make this process make more sense, you have to keep in mind that we are making a tree, so quite a few of the functions we will be looking at are recursive (usually indirectly). This functions checks whether we have a program (the entire code). And then does error checking, including whether the programmer has imported GoType, which we used to simulation Go's typing system in Python. In the functions to come, you may notice that most of them return a Structure and an error. In such a tree formation it is very useful to be able to return errors. When we get an error, say when we are checking for a newline token, then we return an error if we don't find it. That error bubbles all the way up to that log.Fatal(err.Error()) that you see in the code. In this way, we know where the code will stop, and is much simpler.

Program

The first actual Structure we have is the program. This will be the parent node to the entire tree.

func (p *Parser) program() (Structure, error) {
    program := Structure{[]Structure{}, structureCode["PROGRAM"], "PROGRAM"}

    for p.curPos < len(p.source) {
        statement, err := p.statement()
        if err != nil {
            return program, err
        }
        program.children = append(program.children, statement)

        program.children = append(program.children, p.nextTokenNoNotes()...)

    }

    return program, nil
}
Enter fullscreen mode Exit fullscreen mode

In this function we initialize a Structure with the code associated with being a program, which is taken from the structureCode map which you may remember. The program's children starts of empty, and the next few lines populate this slice. We check whether we have a valid statement and append it to the program's children, in this way a program is a bunch of statements. As always, if we find an error, return the error so that we can log it. Then we look for the next statement, skipping over any notes and newlines.

Statements

Now, I'm not going to send the entire function that deals with statements, because it's around 300 lines long at the moment, but I'll send important bits and if it helps to have it up you can find Pogo's GitHub to keep track.
What is a statement? A statement can be a declaration of a variable, the start of a loop or if statement, an import, or a function call. You may have noticed that there were some copies of names in the structureCode map, such as "K_FOR" and "ST_FOR". Although these look very similar they have different purposes. "K_FOR" is for the "for" keyword, and "ST_FOR" is for the entire loop structure.

We will start with one of the simpler statements

if p.curToken.code == tokenCode["K_IMPORT"] {
    s = Structure{[]Structure{}, structureCode["ST_IMPORT"], ""}
    s.children = append(s.children, Structure{[]Structure{}, structureCode["K_IMPORT"], p.curToken.text})
    p.nextToken()

    temp, err := p.checkToken("IDENTIFIER")
    if err != nil {
        return s, err
    }
    s.children = append(s.children, temp)
}
Enter fullscreen mode Exit fullscreen mode

Here we start off by checking what token we are currently looking at, we do the same thing for all statements. We then turn that token into a Structure and save it as s, which will be returned to be appended to the program. However, s can also have children, and in this case it has 2, the "import" keyword and an identifier. We complete error checking for both tokens, and then at the end of the function we can return s (and any errors we find).

If, Elif, Else

Another interesting statement is the if statement.

else if p.curToken.code == tokenCode["K_IF"] {
    s = Structure{[]Structure{}, structureCode["ST_IF_ELSE_BLOCK"], "ST_IF_ELSE_BLOCK"}

    temp, err := p.s_if()
    if err != nil {
        return s, err
    }
    s.children = append(s.children, temp)

    p.setMarker()
    p.nextTokenNoNotes()
    extra_found := false
    for p.curToken.code == tokenCode["K_ELIF"] {
        extra_found = true
        temp, err = p.s_elif()
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temp)
        p.nextTokenNoNotes()
    }

    if p.curToken.code == tokenCode["K_ELSE"] {
        extra_found = true
        temp, err = p.s_else()
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temp)
    }
    if !extra_found {
        p.gotoMarker()
    }
}
Enter fullscreen mode Exit fullscreen mode

Firstly, if, elif, and else aren't separate statements, we keep them in one Structure. We do this so they're easier to keep track of in respect to one another, and avoids confusion. Taking a quick look the code may make sense, you start of checking for 1 if, then you check how many elifs there are, and end with a single else if there is one. This statement makes good use of setMarker and gotoMarker. We use these before we start checking for elifs and elses. This is because to check for elifs we have to look at the next Token that isn't a comment or a newline. Because of this, we can't use peek like we usually would, and instead set the marker so we can rollback an arbitrary amount.

Identifiers

else if p.curToken.code == tokenCode["IDENTIFIER"] {
    if p.peek().code == tokenCode["COLON"] {
        s = Structure{[]Structure{}, structureCode["ST_DECLARATION"], "ST_DECLARATION"}
        s.children = append(s.children, Structure{[]Structure{}, structureCode["IDENTIFIER"], p.curToken.text})
        p.nextToken()

        temps, err := p.checkTokenRange([]string{
            "COLON",
            "IDENTIFIER",
            "ASSIGN",
        })
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temps...)

        temp, err := p.checkTokenChoices([]string{
            "L_BOOL",
            "L_INT",
            "L_STRING",
        })
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temp)
    } else if p.peek().code == tokenCode["ASSIGN"] {
        s = Structure{[]Structure{}, structureCode["ST_MANIPULATION"], "ST_MANIPULATION"}
        s.children = append(s.children, Structure{[]Structure{}, structureCode["IDENTIFIER"], p.curToken.text})
        p.nextToken()

        temp, err := p.checkToken("ASSIGN")
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temp)
        p.nextToken()

        temp, err = p.expression()
        if err != nil {
            return s, err
        }
        s.children = append(s.children, temp)
    }
}
Enter fullscreen mode Exit fullscreen mode

When we hit an identifier, we have 2 options, either w have found the declaration of a variable, or the manipulation of one. We can easily found out which one it is by checking the peek token. It is useful to look at the use of checkTokenChoices.

Blocks

A block is the code in an if or for statement. We need to separate this code, especially for semantic analysis and placing curly braces properly.

func (p *Parser) block() (Structure, error) {
    block := Structure{[]Structure{}, structureCode["BLOCK"], "BLOCK"}

    for p.curPos < len(p.source) {
        statement, err := p.statement()
        if err != nil {
            return block, err
        }
        block.children = append(block.children, statement)

        //p.nextToken()

        if p.peek().code == tokenCode["ANTI_COLON"] {
            p.nextToken()
            break
        }

        block.children = append(block.children, p.nextTokenNoNotes()...)

    }

    block.children = append(block.children, Structure{[]Structure{}, structureCode["ANTI_COLON"], ":"})

    return block, nil
}
Enter fullscreen mode Exit fullscreen mode

The code to understand blocks is very similar to the code for programs. The main difference is that blocks have to look for anti-colons for their end.

Expressions

A lot of code uses expressions, such as x: int = 10 + 5, which uses the expression 10 + 5. Expressions can also be single literal, or contain an identifier.

func (p *Parser) expression() (Structure, error) {
    s := Structure{[]Structure{}, structureCode["EXPRESSION"], "EXPRESSION"}

    temp, err := p.checkTokenChoices([]string{
        "L_BOOL",
        "L_INT",
        "L_STRING",
        "IDENTIFIER",
    })
    if err != nil {
        return s, err
    }
    s.children = append(s.children, temp)
    p.nextToken()

    temp, err = p.checkTokenChoices([]string{
        "MO_PLUS",
        "MO_SUB",
        "MO_MUL",
        "MO_DIV",
        "MO_MODULO",
    })
    if err != nil {
        p.rollBack()
        return s, nil // Could be a single literal, so we don't error
    }
    s.children = append(s.children, temp)
    p.nextToken()

    temp, err = p.checkTokenChoices([]string{
        "L_BOOL",
        "L_INT",
        "L_STRING",
        "IDENTIFIER",
    })
    if err != nil {
        return s, err
    }
    s.children = append(s.children, temp)

    return s, nil
}
Enter fullscreen mode Exit fullscreen mode

We check for a literal or identifier, then we also need to check whether it's just a single unit, or is an operation taking place? Then we find the last literal and return the whole thing. The code for a comparison is very similar, but uses operators such as < instead of mathematical operators.

There is a lot of code in the parser so I didn't show it all, just be sure to check the GitHub and ask any questions in the comments if I didn't explain anything well at all. Code like this is repetitive so it's hard to explain without saying the same thing, but hopefully you can follow along and understand how the compiler is working so far.

Next

The next post is probably the most exciting, because we are going to start making Go! Our first few programs will be simple, but we will keep building until we can start making more complex programs. Although this is the step that we create Go code, it isn't the final step, we can step back into the compiler and ask, "how could we make this better?" And the answer is make it make sense, because we don't want code that doesn't even work to make it through, which will be in a later post about making the semantic analyzer.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .