yacc
, a widely used LALR(1) parser generator, is efficient but burdened by archaic, unreadable code from the 1970s. Its Go adaptation, goyacc
, inherits these issues despite the modern language.
While exploring compilers, I found yacc
to be a gold mine of insights, though buried under outdated practices. Following Allan Holub’s Compiler Design in C
, I dissected its code—clarifying, simplifying, and refactoring convoluted sections. I then tested the revised version in a real-world scenario to ensure it retained its original functionality. The full code is published on github.
I aimed to explore real-world open-source code, bridge ideals with reality, and stay sane as a reader. By refining and clarifying it, I made learning smoother and reuse easier.
The first issue is poor naming—cryptic, vague, or misleading names. Here’s how naming was improved.
original | cleanup | intent
--------------------------------------------------------------------------------------------------
cpfir | computeFirstsets | compute the first sets for non-terminal symbols
cpemp | computeEmpty | compute compute table to check if a non-terminal symbol is nullable
cpres | computeYields | compute compute the production yields for non-terminal symbols
curres | prds | the productions having the same non-terminal symbol as LHS
wSet | wItem | type of a working item generated during closure
wSets | wSet | store working items generated during closure
statemem | kernls | store kernel items
mstate | statechain | chain a state's previous state to track where it cames from
pstate | kernlp | pointer to a kernel item
putitem | addKernItem | add a kernel item
writem | item.string() | get the string representation of a kernel item
apack | packGotosRow | compress newly goto row after closing state, not action as "a" implies
prectfn | handleShiftReduceConflict | handle shift-reduce conflict of a state during processing closed states
go2out | packGotosByCol | compress goto table by column
go2gen | computeGotoCol | compute the goto column for a non-terminal symbol
callopt | storeShiftsAndGotos | store shift-row for a state or goto column for a non-terminal symbol in action-store
gin | storeGotoCol | store a column of compressed goto table for a terminal symbol into action-store
stin | storeShifts | store shifts of a row in action table for a state into action-store
setbit | lkset.set | set a bit ON
bitset | lkset.check | check if a bit is ON (mix up with setbit easily)
The second issue is cramming everything into one place, burying core logic (closure and state generation) under the bulky input parsing. Refactoring separates it to files by functionality, improving clarity and focus.
The third issue is poor organization. A key component, the lookahead set, should have its own type with all related methods grouped together, not lying elsewhere. This improves maintainability and makes adding methods easier. Similarly, the kernel item should define its own string representation method. These structs also include a Clone method for copying instances.
The fourth issue is convolution. The nolook
variable determines whether closure is LALR(1)
or LALR(0)—LALR(1)
for state generation and LALR(0)
for processing closed states. However, it pops up in unexpected places like kernel item addition, state generation, and packing gotos. Furthermore, it creates twists and turns in the closure
logic, making it harder to follow.
To resolve this, I split closure
into closure1
for LALR(1)
and closure0 for LALR(0)
. While slightly redundant, this separation prevents their logic from getting entangled, making the flow easier to follow. With this change, the confusing nolook
is completely eliminated.
Similarly, MUSTLOOKAHEAD
determines whether we're handling an epsilon
but is tangled with nolook
in closed state processing. By explicitly storing and processing epsilons
, the confusion caused by MUSTLOOKAHEAD
is eliminated.
The fifth issue is redundant complications. An example is production yields
-production rules with the same LHS
. The code extracts only RHS
to construct the yield
while calling it prod
in kernel item. By consistently using full production rules, we remove this friction. It helps to eliminate further redundancies like keeping production number in kernel item for comparison and tricky code to print the dot for a kernel item. Here is the code with comments highlighting the problems.
// original
for j := 0; j < nprod; j++ {
if prdptr[j][0] == c {
// 📌 extract the RHS by skipping the first symbol
curres[n] = prdptr[j][1:]
n++
}
}
// cleanup
for _, prd := range prods[:nprod] {
if prd[0] == c {
// 💚 use production rule as-is
yld[n] = prd
n++
}
}
// 📌 no-need to compare the whole array because each production ends with -prodNo
if aryeq(wsets[v].pitem.prod, prd) != 0 {...}
// 💚 compare production by its id
func id(prd []int) int {
return -prd[len(prd)-1]
}
// 📌 keep the production and production id in a work item
wsets[cwp].pitem = Pitem{prd, 0, prd[0], -prd[len(prd)-1]}
// 💚 only need the production, noted that idx starts from 1 for full production
wSet[cwp].item = item{1, prd, prd[1], emptyLkset()}
// 📌 things get messy and tricky
p := pp.prod // 📌 RHS of production, not the full
npi := pp.off // 📌 masking pp.off with mysterious name!
// 📌 no need, pi, the idx of first symbol of RHS is always 0
pi := aryeq(p, prdptr[pp.prodno]) // 📌 retrieving full production
for {
c := ' ' // 📌 it is the separator, not any character
if pi == npi { // 📌 print the dot when idx matches with offset. why not `pi == p.off`?
c = '.'
}
q += string(c)
i = p[pi]
pi++
if i <= 0 {
break
}
q += chcopy(symnam(i))
}
// 💚 use the full production rule, no need to retrieve it
prd := i.prd
var n int
// 💚 idx for RHS starts from 1 and flows to end!
for j := 1; j < len(prd); j++ {
n = prd[j]
sep := ' ' // 💚 call separator separator!
if j == i.off { // 💚 don't hide offset in npi!
sep = '.'
}
res += string(sep)
if n <= 0 {
break
}
res += symbolName(n)
}
Another example is the redundant outer loop of stagen
.
// 📌 the first iteration of outer loop, not needed
first := 1
// 📌 this outer loop not needed
for more := 1; more != 0; first = 0 {
more = 0 // 📌 not needed
for i := 0; i < nstate; i++ {
if tystate[i] != MUSTDO { // 📌 not needed
continue
}
tystate[i] = DONE // 📌 not needed
// ...
// generate goto's
for p := 0; p < cwp; p++ {
// ...
}
if first != 0 { // 📌 no need to check
indgo[i] = apack(temp1[1:], nnonter-1) - 1
}
more++ // 📌 not needed
}
}
// 💚 outer loop is removed
for i := 0; i < nstate; i++ {
// ...
for j := 0; j < cwp; j++ {
// ...
}
gotoIdx[i] = packGotosRow() - 1 // store index to retrieve goto from actionStore
}
Another redundant outer loop is in closure's work item processing.
// 📌 outer loop is not need
for v := u; v < cwp; v++ {
// 📌 not needed
if wsets[v].flag != 1 || wsets[v].pitem.first != c {
continue
}
// ...
// 📌 not needed too
wsets[v].flag = 0
// ...
for ch > 0 {
// ...
}
// ...
}
// 💚 outer loop removed
for sym > 0 {
// ...
}
// ...
The sixth issue is that several lines can be made more readable by adopting a clearer style. Let's review some snippets.
// examples from stagen() function
// 📌 these belongs to packing, not here
amem = make([]int, ACTSIZE)
memp = 0
clset = mkset()
// 📌 all of these are to add the first kernel item. addKernelItem can handle all
pstate[0] = 0
pstate[1] = 0
aryfil(clset, tbitset, 0)
putitem(Pitem{prdptr[0], 0, 0, 0}, clset)
tystate[0] = MUSTDO
nstate = 1
pstate[2] = pstate[1]
// 💚 nothing out of place
clkset = newLkset()
// 💚 addKernItem handle it all
addKernItem(item{0, prods[0], 0, newLkset()})
nstate++
// 📌 register anyway, no need for if-else
if c < NTBASE {
state(c) // register new state
} else {
temp1[c-NTBASE] = state(c)
}
// 💚 just register first
s := newState(first)
if first > NTBASE { // non-terminal symbol
trans[first-NTBASE] = s // store gotos
}
// example from state() function
// 📌 all of this is to sort kernel items [p1:p2]
var k, l int
for k = p1 + 1; k < p2; k++ { // make k the biggest
for l = k; l > p1; l-- {
if statemem[l].pitem.prodno < statemem[l-1].pitem.prodno ||
statemem[l].pitem.prodno == statemem[l-1].pitem.prodno &&
statemem[l].pitem.off < statemem[l-1].pitem.off {
s := statemem[l]
statemem[l] = statemem[l-1]
statemem[l-1] = s
} else {
break
}
}
}
// 💚 slice and sort explicitly
partn := kernls[p1:p2]
// sort to help search later
sort.Slice(partn, func(i, j int) bool {
return id(partn[i].prd) < id(partn[j].prd) ||
(id(partn[i].prd) == id(partn[j].prd) && partn[i].off < partn[j].off)
})
Thanks for reading to the end. Hope this helps on your journey.