Refactor and cleaup yacc: Making sense of legacy code

mbver - Feb 15 - - Dev Community

yacc, a widely used LALR(1) parser generator, is efficient but burdened by archaic, unreadable code from the 1970s. Its Go adaptation, goyacc, inherits these issues despite the modern language.

While exploring compilers, I found yacc to be a gold mine of insights, though buried under outdated practices. Following Allan Holub’s Compiler Design in C, I dissected its code—clarifying, simplifying, and refactoring convoluted sections. I then tested the revised version in a real-world scenario to ensure it retained its original functionality. The full code is published on github.

I aimed to explore real-world open-source code, bridge ideals with reality, and stay sane as a reader. By refining and clarifying it, I made learning smoother and reuse easier.

The first issue is poor naming—cryptic, vague, or misleading names. Here’s how naming was improved.

original | cleanup          | intent
--------------------------------------------------------------------------------------------------
cpfir    | computeFirstsets | compute the first sets for non-terminal symbols
cpemp    | computeEmpty     | compute compute table to check if a non-terminal symbol is nullable
cpres    | computeYields    | compute compute the production yields for non-terminal symbols
curres   | prds             | the productions having the same non-terminal symbol as LHS
wSet     | wItem            | type of a working item generated during closure
wSets    | wSet             | store working items generated during closure
statemem | kernls           | store kernel items
mstate   | statechain       | chain a state's previous state to track where it cames from
pstate   | kernlp           | pointer to a kernel item
putitem  | addKernItem      | add a kernel item
writem   | item.string()    | get the string representation of a kernel item
apack    | packGotosRow     | compress newly goto row after closing state, not action as "a" implies
prectfn  | handleShiftReduceConflict | handle shift-reduce conflict of a state during processing closed states
go2out   | packGotosByCol   | compress goto table by column
go2gen   | computeGotoCol   | compute the goto column for a non-terminal symbol
callopt  | storeShiftsAndGotos | store shift-row for a state or goto column for a non-terminal symbol in action-store
gin      | storeGotoCol     | store a column of compressed goto table for a terminal symbol into action-store
stin     | storeShifts      | store shifts of a row in action table for a state into action-store
setbit   | lkset.set        | set a bit ON
bitset   | lkset.check      | check if a bit is ON (mix up with setbit easily)

Enter fullscreen mode Exit fullscreen mode

The second issue is cramming everything into one place, burying core logic (closure and state generation) under the bulky input parsing. Refactoring separates it to files by functionality, improving clarity and focus.

The third issue is poor organization. A key component, the lookahead set, should have its own type with all related methods grouped together, not lying elsewhere. This improves maintainability and makes adding methods easier. Similarly, the kernel item should define its own string representation method. These structs also include a Clone method for copying instances.

The fourth issue is convolution. The nolook variable determines whether closure is LALR(1) or LALR(0)—LALR(1) for state generation and LALR(0) for processing closed states. However, it pops up in unexpected places like kernel item addition, state generation, and packing gotos. Furthermore, it creates twists and turns in the closure logic, making it harder to follow.

To resolve this, I split closure into closure1 for LALR(1) and closure0 for LALR(0). While slightly redundant, this separation prevents their logic from getting entangled, making the flow easier to follow. With this change, the confusing nolook is completely eliminated.

Similarly, MUSTLOOKAHEAD determines whether we're handling an epsilon but is tangled with nolook in closed state processing. By explicitly storing and processing epsilons, the confusion caused by MUSTLOOKAHEAD is eliminated.

The fifth issue is redundant complications. An example is production yields -production rules with the same LHS. The code extracts only RHS to construct the yield while calling it prod in kernel item. By consistently using full production rules, we remove this friction. It helps to eliminate further redundancies like keeping production number in kernel item for comparison and tricky code to print the dot for a kernel item. Here is the code with comments highlighting the problems.

// original
for j := 0; j < nprod; j++ {
  if prdptr[j][0] == c {
    // 📌 extract the RHS by skipping the first symbol
    curres[n] = prdptr[j][1:]
    n++
  }
}
// cleanup
for _, prd := range prods[:nprod] {
  if prd[0] == c {
    // 💚 use production rule as-is
    yld[n] = prd
    n++
  }
}

// 📌 no-need to compare the whole array because each production ends with -prodNo
if aryeq(wsets[v].pitem.prod, prd) != 0 {...}

// 💚 compare production by its id
func id(prd []int) int {
    return -prd[len(prd)-1]
}

// 📌 keep the production and production id in a work item
wsets[cwp].pitem = Pitem{prd, 0, prd[0], -prd[len(prd)-1]}
// 💚 only need the production, noted that idx starts from 1 for full production
wSet[cwp].item = item{1, prd, prd[1], emptyLkset()}


// 📌 things get messy and tricky
p := pp.prod // 📌 RHS of production, not the full
npi := pp.off // 📌 masking pp.off with mysterious name!

// 📌 no need, pi, the idx of first symbol of RHS is always 0
pi := aryeq(p, prdptr[pp.prodno]) // 📌 retrieving full production

for {
  c := ' ' // 📌 it is the separator, not any character
  if pi == npi { // 📌 print the dot when idx matches with offset. why not `pi == p.off`?
    c = '.'
  }
  q += string(c)

  i = p[pi]
  pi++
  if i <= 0 {
    break
  }
  q += chcopy(symnam(i))
}

// 💚 use the full production rule, no need to retrieve it
prd := i.prd
var n int

// 💚 idx for RHS starts from 1 and flows to end!
for j := 1; j < len(prd); j++ {
  n = prd[j]
  sep := ' ' // 💚 call separator separator!
  if j == i.off { // 💚 don't hide offset in npi!
    sep = '.'
  }
  res += string(sep)
  if n <= 0 {
    break
  }
  res += symbolName(n)
}
Enter fullscreen mode Exit fullscreen mode

Another example is the redundant outer loop of stagen.

// 📌 the first iteration of outer loop, not needed
    first := 1
  // 📌 this outer loop not needed
    for more := 1; more != 0; first = 0 {
        more = 0 // 📌 not needed
        for i := 0; i < nstate; i++ {
            if tystate[i] != MUSTDO { // 📌 not needed
                continue
            }
            tystate[i] = DONE // 📌 not needed

      // ...

            // generate goto's
            for p := 0; p < cwp; p++ {
        // ...
            }

            if first != 0 { // 📌 no need to check
                indgo[i] = apack(temp1[1:], nnonter-1) - 1
            }
            more++ // 📌 not needed
        }
    }

// 💚 outer loop is removed
    for i := 0; i < nstate; i++ {
    // ...
        for j := 0; j < cwp; j++ {
      // ...
        }
        gotoIdx[i] = packGotosRow() - 1 // store index to retrieve goto from actionStore
    }
Enter fullscreen mode Exit fullscreen mode

Another redundant outer loop is in closure's work item processing.

// 📌 outer loop is not need
for v := u; v < cwp; v++ {
  // 📌 not needed
  if wsets[v].flag != 1 || wsets[v].pitem.first != c {
    continue
  }
  // ...
  // 📌 not needed too
  wsets[v].flag = 0
  // ...
  for ch > 0 {
    // ...    
  }
  // ...
}
// 💚 outer loop removed
for sym > 0 {
  // ...
}
// ...
Enter fullscreen mode Exit fullscreen mode

The sixth issue is that several lines can be made more readable by adopting a clearer style. Let's review some snippets.

// examples from stagen() function
// 📌 these belongs to packing, not here
    amem = make([]int, ACTSIZE)
    memp = 0

    clset = mkset()

// 📌 all of these are to add the first kernel item. addKernelItem can handle all
    pstate[0] = 0
    pstate[1] = 0
    aryfil(clset, tbitset, 0)
    putitem(Pitem{prdptr[0], 0, 0, 0}, clset)
    tystate[0] = MUSTDO
    nstate = 1
    pstate[2] = pstate[1]

// 💚 nothing out of place
    clkset = newLkset()
// 💚 addKernItem handle it all
    addKernItem(item{0, prods[0], 0, newLkset()})
    nstate++

// 📌 register anyway, no need for if-else
if c < NTBASE {
  state(c) // register new state
} else {
  temp1[c-NTBASE] = state(c)
}
// 💚 just register first
s := newState(first)
if first > NTBASE { // non-terminal symbol
  trans[first-NTBASE] = s // store gotos
}

// example from state() function
// 📌 all of this is to sort kernel items [p1:p2]
var k, l int
for k = p1 + 1; k < p2; k++ { // make k the biggest
  for l = k; l > p1; l-- {
    if statemem[l].pitem.prodno < statemem[l-1].pitem.prodno ||
      statemem[l].pitem.prodno == statemem[l-1].pitem.prodno &&
        statemem[l].pitem.off < statemem[l-1].pitem.off {
      s := statemem[l]
      statemem[l] = statemem[l-1]
      statemem[l-1] = s
    } else {
      break
    }
  }
}

// 💚 slice and sort explicitly
partn := kernls[p1:p2]
// sort to help search later
sort.Slice(partn, func(i, j int) bool {
  return id(partn[i].prd) < id(partn[j].prd) ||
    (id(partn[i].prd) == id(partn[j].prd) && partn[i].off < partn[j].off)
})
Enter fullscreen mode Exit fullscreen mode

Thanks for reading to the end. Hope this helps on your journey.

. .