⛔ Squash commits considered harmful ⛔

Manuel Odendahl - May 23 '22 - Dev Community

A recurring conversation in developer circles is whether you should use git merge --squash when merging or do explicit merge commits. The short answer: you shouldn't squash.

People have strong opinions about this. The thing is that my opinion is the correct one. Squashing commits has no purpose other than losing information. It doesn't make for a cleaner history. At most it helps subpar git clients show a cleaner commit graph, and saves a bit of space by not storing intermediate file states.

Let me show you why.

Git tracks contents, not diffs

In many ways you can just see git as a filesystem.
– Linus (in 'Re: more git updates..' - MARC)

Git is in many ways a very dumb graph database. When you check in code, it actually stores the content of all the tracked files in your repository.

The content of each file is stored as a "blob" node in the database. The filenames are stored separately in a "tree" node: If you rename a file, no new content node will be created. Only a new tree node will be created.
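The rename claim is easy to check yourself. Here is a minimal sketch, assuming a throwaway repository in a temporary directory and a placeholder author identity (neither comes from the session shown in this article):

```shell
#!/bin/sh
# Sketch: renaming a file creates a new tree, but reuses the existing blob.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # use the branch name from this article
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Add foo.txt"

git mv foo.txt bar.txt
git commit -q -m "Rename foo.txt to bar.txt"

# Same content, so the blob hash is unchanged; only the tree differs.
blob_before=$(git rev-parse HEAD~1:foo.txt)
blob_after=$(git rev-parse HEAD:bar.txt)
tree_before=$(git rev-parse 'HEAD~1^{tree}')
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "blob before rename: $blob_before"
echo "blob after rename:  $blob_after"
echo "tree before rename: $tree_before"
echo "tree after rename:  $tree_after"
```

The two blob lines should print the same hash, while the two tree lines should differ.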

Commits are stored as "commit" nodes. A commit object points to a tree, and adds metadata: author, committer, message and parent commits. A merge commit has multiple parents.

Here is a visualization from Scott Chacon's Git Internals:

[Image: diagram of git's object model]

Looking at a real git repository

Enough theory, we have work to get done. Let's create a simple git repository:



> mkdir squash-merges-considered-harmful
> cd squash-merges-considered-harmful 
> git init
> echo hello > foo.txt
> git add foo.txt
> git commit -m "Initial commit"
[main (root-commit) 02a154b] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt
> echo more >> foo.txt
> git add foo.txt
> git commit -m "Add more" 
[main 16660f8] Add more
 1 file changed, 1 insertion(+)



We can now look at the contents of the objects we created:



# initial commit
❯ git cat-file -p 02a154b
tree f269b7cd59094d5365ef6b5618098cbcbeee0c43
author Manuel Odendahl <wesen@ruinwesen.com> 1653303427 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653303427 -0400

Initial commit
# initial tree
❯ git cat-file -p f269b7cd59094d5365ef6b5618098cbcbeee0c43
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    foo.txt
# initial foo.txt
❯ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

# second commit
❯ git cat-file -p 16660f8
tree 5a0c4a660a13c0ada7611651399abb362756f83e
parent 02a154bc4f0fa9bca567676d45d136619c076a95
author Manuel Odendahl <wesen@ruinwesen.com> 1653303485 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653303485 -0400

Add more
# second tree
❯ git cat-file -p 5a0c4a660a13c0ada7611651399abb362756f83e
100644 blob 2227cddb7f6318ea735a1c4adb52f5cd36c5783c    foo.txt
❯ git cat-file -p 2227cddb7f6318ea735a1c4adb52f5cd36c5783c
hello
more




Branches and tags (including the ones on remote repositories) are just pointers to commit nodes.

❯ cat .git/refs/heads/main
16660f8b1d1538ed1b55d8533b3ee7feb68e474c
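To see that a branch really is nothing more than a named pointer, you can create one with the plumbing command git update-ref and check that no new objects appear. This is a rough sketch; the temporary repository, the placeholder identity and the branch name handmade are all made up for the illustration:

```shell
#!/bin/sh
# Sketch: creating a branch stores a commit hash under a new name, nothing else.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Initial commit"

head_commit=$(git rev-parse HEAD)

# Count loose objects before and after creating the branch.
objects_before=$(git count-objects | cut -d' ' -f1)
git update-ref refs/heads/handmade "$head_commit"   # "handmade" is a hypothetical name
objects_after=$(git count-objects | cut -d' ' -f1)

handmade_commit=$(git rev-parse handmade)

echo "main:     $(git rev-parse main)"
echo "handmade: $handmade_commit"
echo "objects before/after: $objects_before/$objects_after"
```

Both branch names should resolve to the same commit, and the object count should not change.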



But we still use diffs and merges

But Manuel, you ask, how do git diff and git merge and all that funky stuff work?

When you run git diff, git runs a diff algorithm (it has several to choose from) to compare the state of two trees, from scratch, every time.
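One way to convince yourself is to hand git diff two bare tree hashes, with no commits involved. The sketch below assumes a throwaway repository and a placeholder identity; the --diff-algorithm option is a standard git diff flag:

```shell
#!/bin/sh
# Sketch: git diff compares two trees; commits are just a convenient way to name them.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Initial commit"
echo more >> foo.txt
git add foo.txt
git commit -q -m "Add more"

tree_a=$(git rev-parse 'HEAD~1^{tree}')
tree_b=$(git rev-parse 'HEAD^{tree}')

# Diff computed from the two tree objects directly.
tree_diff=$(git diff "$tree_a" "$tree_b")
# Diff computed from the two commits that point at those trees.
commit_diff=$(git diff HEAD~1 HEAD)
# Same comparison with another algorithm selected explicitly.
patience_diff=$(git diff --diff-algorithm=patience "$tree_a" "$tree_b")

echo "$tree_diff"
```

The tree-based diff and the commit-based diff should be the same text, since the commits only serve to look up the trees.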

When you do a rebase, git computes the diff for each commit of the branch before rebase, and then applies those diffs to the destination, thus "moving" the branch over to the destination, with fresh tree and commit nodes.
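A small experiment shows the "same diff, fresh commit" behaviour. This is a sketch under assumptions: a throwaway repository, a placeholder identity, and made-up file names. It uses git patch-id to fingerprint the diff of a commit before and after the rebase:

```shell
#!/bin/sh
# Sketch: a rebased commit carries the same diff but is a brand new commit node.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Initial commit"

git checkout -q -b work-branch
echo feature > feature.txt                 # hypothetical file names
git add feature.txt
git commit -q -m "Add feature"
old_commit=$(git rev-parse HEAD)

git checkout -q main
echo other > other.txt
git add other.txt
git commit -q -m "Unrelated change on main"

git checkout -q work-branch
git rebase -q main
new_commit=$(git rev-parse HEAD)

# patch-id hashes the diff itself, ignoring commit hashes and line numbers.
old_patch=$(git show "$old_commit" | git patch-id | cut -d' ' -f1)
new_patch=$(git show "$new_commit" | git patch-id | cut -d' ' -f1)

echo "commit before rebase: $old_commit"
echo "commit after rebase:  $new_commit"
echo "patch id before:      $old_patch"
echo "patch id after:       $new_patch"
```

The commit hashes should differ while the patch ids match, and the new commit's parent should be the tip of main.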

When you do a merge, git first searches for the common ancestor of both branches to be merged, the merge base (this can be a bit more involved depending on your graph). It computes the diff of each branch against that common ancestor, and then combines both diffs in what is called a three-way merge.
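The steps of a three-way merge can be replayed by hand for a single file with git merge-base and git merge-file. The following is a sketch, not how git merge is implemented internally; the repository, the identity, the branch names left and right and the scratch directory are assumptions made for the example:

```shell
#!/bin/sh
# Sketch: find the merge base, then three-way merge one file by hand.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
scratch=$(mktemp -d)                       # scratch space for extracted file versions
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

printf 'one\ntwo\nthree\nfour\nfive\n' > foo.txt
git add foo.txt
git commit -q -m "Base"

git checkout -q -b left                    # hypothetical branch: edits the first line
printf 'ONE\ntwo\nthree\nfour\nfive\n' > foo.txt
git commit -q -am "Change first line"

git checkout -q -b right main              # hypothetical branch: edits the last line
printf 'one\ntwo\nthree\nfour\nFIVE\n' > foo.txt
git commit -q -am "Change last line"

# Step 1: the common ancestor of both branches.
base=$(git merge-base left right)

# Step 2: the three versions of the file.
git show "$base:foo.txt" > "$scratch/base.txt"
git show left:foo.txt    > "$scratch/left.txt"
git show right:foo.txt   > "$scratch/right.txt"

# Step 3: combine both sides relative to the base (-p prints the result).
merged=$(git merge-file -p "$scratch/left.txt" "$scratch/base.txt" "$scratch/right.txt")

# For comparison: let git merge do the same thing.
git checkout -q left
git merge -q --no-edit right
real_merge=$(cat foo.txt)

echo "$merged"
```

The hand-made result and the file produced by git merge should be identical, with both the first and the last line changed.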

The resulting commit has multiple parent fields. The parent fields don't affect the content of the commit; they only record where it came from. The tree the merge commit points to is what actually counts. Once a three-way merge has been computed and applied, git doesn't really care how the resulting tree was computed.

This is literally all there is to git, and the mental model that I use every day, even as I'm doing the most advanced git surgery.

What is a squash merge?

So what is a squash merge? A squash merge is the same as a normal merge, except that it records only one parent commit: the parent pointing at the merged branch is left out. It basically slices off a whole part of the git graph, which will later be garbage collected if nothing else references it. You're basically losing information for no reason.

Let's look at this in practice. Let's create a few commits on top of the ones we have, and then do both a squash merge and a non-squash merge, and look at the results.



❯ git checkout -B work-branch
Switched to a new branch 'work-branch'
❯ echo "Add more" >> foo.txt
❯ git add foo.txt && git commit -m "Add more"
[work-branch 4b84cfe] Add more
 1 file changed, 1 insertion(+)
❯ echo "Add more" >> foo.txt
❯ git add foo.txt && git commit -m "And more"
[work-branch 1836f1c] And more
 1 file changed, 1 insertion(+)
❯ git checkout -B no-squash-merge main
Switched to a new branch 'no-squash-merge'
❯ git merge --no-squash --no-ff work-branch
Merge made by the 'ort' strategy.
 foo.txt | 2 ++
 1 file changed, 2 insertions(+)
❯ git checkout -B squash-merge main
Switched to a new branch 'squash-merge'
❯ git merge --squash --ff work-branch
Updating 16660f8..1836f1c
Fast-forward
Squash commit -- not updating HEAD
 foo.txt | 2 ++
 1 file changed, 2 insertions(+)
❯ git commit
[squash-merge 150c57d] Squashed commit of the following:
 1 file changed, 2 insertions(+) 



Let's look at the resulting graph and commits.



❯ git log --graph --pretty=oneline --abbrev-commit --all
* 150c57d (HEAD -> squash-merge) Squashed commit of the following:
| * 535b740 (no-squash-merge) Merge branch 'work-branch' into no-squash-merge
|/| 
| * 1836f1c (work-branch) And more
| * 4b84cfe Add more
|/  
* 16660f8 (main) Add more
* 02a154b Initial commit
❯ git cat-file -p no-squash-merge
tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
parent 1836f1c53221ae701a038bf5ae380770ea911665
author Manuel Odendahl <wesen@ruinwesen.com> 1653304391 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653304391 -0400

Merge branch 'work-branch' into no-squash-merge

* work-branch:
  And more
  Add more

❯ git cat-file -p squash-merge   
tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
author Manuel Odendahl <wesen@ruinwesen.com> 1653304543 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653304543 -0400

Squashed commit of the following:

commit 1836f1c53221ae701a038bf5ae380770ea911665
Author: Manuel Odendahl <wesen@ruinwesen.com>
Date:   Mon May 23 07:11:08 2022 -0400

    And more

commit 4b84cfe11aa51da994448e602e1bc4cc6083d691
Author: Manuel Odendahl <wesen@ruinwesen.com>
Date:   Mon May 23 07:11:03 2022 -0400

    Add more




You can see that both squash-merge and no-squash-merge point to the exact same tree. The only differences are the commit message and the missing second parent in the squash merge.
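If you would rather check this with a script than by reading hashes, something like the following sketch should do. It rebuilds a similar setup in a throwaway repository (an assumption, as is the placeholder identity), then compares the trees and counts the parent lines of both commits:

```shell
#!/bin/sh
# Sketch: a squash merge and a real merge yield the same tree, but a different parent count.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Initial commit"

git checkout -q -b work-branch
echo more >> foo.txt
git commit -q -am "Add more"
echo "even more" >> foo.txt
git commit -q -am "And more"

git checkout -q -b no-squash-merge main
git merge -q --no-ff --no-edit work-branch

git checkout -q -b squash-merge main
git merge -q --squash work-branch
git commit -q -m "Squashed work-branch"

merge_tree=$(git rev-parse 'no-squash-merge^{tree}')
squash_tree=$(git rev-parse 'squash-merge^{tree}')

# Count the "parent" header lines of each commit object.
merge_parents=$(git cat-file -p no-squash-merge | grep -c '^parent ')
squash_parents=$(git cat-file -p squash-merge | grep -c '^parent ')

echo "merge tree:  $merge_tree ($merge_parents parents)"
echo "squash tree: $squash_tree ($squash_parents parent)"
```

Both lines should show the same tree hash, with two parents for the merge and one for the squash.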

To read more about the underpinnings of git, I can recommend just experimenting with the git command line, and the following resources:

But the history!

But Manuel, you say, the history is so much cleaner!

To which I counter that it is actually not. If you want to hide the link to the second parent of the non-squash merge (the "right" parent, as it is called, the left parent being main), all you need to do is not display it. If you use the command line or a proper tool, use the option to show only first parents. If you look only at the first parent, and configure your git tooling to fill a full log of the branch into the merge commit message (I personally use the GitHub CLI gh or some git commit hooks to do it), the squash merge commit looks identical to the non-squash merge commit.
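To get a feel for what the first-parent view hides, you can count commits with and without it. This is a small sketch in a throwaway repository with a placeholder identity (both assumptions), using git rev-list --count:

```shell
#!/bin/sh
# Sketch: --first-parent walks only the main line and skips the merged side branch.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo hello > foo.txt
git add foo.txt
git commit -q -m "Initial commit"

git checkout -q -b work-branch
echo more >> foo.txt
git commit -q -am "Add more"
echo "even more" >> foo.txt
git commit -q -am "And more"

git checkout -q main
git merge -q --no-ff --no-edit work-branch

# Every commit reachable from main: initial, two branch commits, merge.
all_commits=$(git rev-list --count main)
# Only the first-parent chain: initial commit and merge commit.
first_parent_commits=$(git rev-list --count --first-parent main)

echo "all commits:          $all_commits"
echo "first-parent commits: $first_parent_commits"
git log --first-parent --oneline main
```

The branch commits are still in the repository; the first-parent walk simply never visits them.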

A favorite git log command of mine to quickly look at the history of the main branch, and create a changelog:



> git log --pretty=format:'# %ad %H %s' --date=short --first-parent --reverse
# 2022-05-23 02a154bc4f0fa9bca567676d45d136619c076a95 Initial commit
# 2022-05-23 16660f8b1d1538ed1b55d8533b3ee7feb68e474c Add more
# 2022-05-23 535b740f42e331175f3766c1374116e329a78f7e Merge branch 'work-branch' into no-squash-merge



When using GitHub and pull requests, this shows the author, the branch name (which in my case contains the ticket name and a short description) and the date on a single line. Here's a slightly more complex real-world example (anonymized):


2021-12-15 123 Merge pull request #5937 from garbo/TK-234/feature-1
2021-12-16 234 Merge pull request #5938 from bongo/TK-235/feature-2
2021-12-16 456 Merge pull request #5939 from gingo/TK-236/feature-3





But why?

But Manuel, why keep all those commits lying around when we have all we need in the commit message?

One reason comes down to just preference. I like to see the actual log of what a person did on their branch. Did they do many small commits? On which days (this might make looking up documents or Slack conversations related to the work easier)? Did they merge other branches into their work (useful when resolving merge conflicts and other boo boos)?

I have done a lot of git cleanup work, and while they are not supposed to exist, big merges with thousands of lines happen, and having a single monolithic commit that contains 80 different changes is a nightmare.

The other reason is what makes the side history extremely useful. When hunting down a bug, I often use git bisect. I first use git bisect start --first-parent to jump from main commit to main commit. But once I have found which pull request led to the bug, I bisect on the original branch. Instead of having to figure out which line in the pull-request merge might cause the bug, I have a much more granular path. Often, it surfaces a single-line commit, and leads to a painless and immediate bugfix.

Because you can drive your bisect with your unit tests (git bisect run), you often have no work to do at all, given sufficiently atomic and small commits on side branches. Losing that capability would seriously impact my sanity when I have to fix bugs.
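The two-pass bisect described above can be sketched end to end. Everything here is an illustration under assumptions: a throwaway repository, a placeholder identity, a made-up status.txt file whose content stands in for a passing or failing test, and a git version recent enough to support git bisect start --first-parent (2.29 or later, as far as I know):

```shell
#!/bin/sh
# Sketch: bisect the main line first, then bisect inside the offending branch.
set -eu

repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"      # placeholder identity (assumption)
git config user.email "author@example.com"

echo ok > status.txt                       # "ok" stands in for a passing test suite
git add status.txt
git commit -q -m "Initial commit"
good=$(git rev-parse HEAD)

git checkout -q -b work-branch
echo one > a.txt
git add a.txt
git commit -q -m "Add a.txt"
echo broken > status.txt                   # this commit introduces the "bug"
git commit -q -am "Change status"
culprit=$(git rev-parse HEAD)
echo two > b.txt
git add b.txt
git commit -q -m "Add b.txt"

git checkout -q main
git merge -q --no-ff --no-edit work-branch
merge_commit=$(git rev-parse HEAD)
echo later > c.txt
git add c.txt
git commit -q -m "Later work on main"

# Pass 1: only walk main's first-parent chain; this lands on the merge commit.
git bisect start --first-parent HEAD "$good" >/dev/null 2>&1
git bisect run grep -qx ok status.txt >/dev/null 2>&1
first_pass=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

# Pass 2: bisect between the merge's two parents to find the exact commit.
git bisect start "$merge_commit^2" "$merge_commit^1" >/dev/null 2>&1
git bisect run grep -qx ok status.txt >/dev/null 2>&1
second_pass=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first pass (merge commit): $first_pass"
echo "second pass (culprit):     $second_pass"
```

With a squash merge, the second pass has nothing to walk: the search stops at one big commit.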

Conclusion

And that is why squashing history is harmful. It's literally just deleting information from the git graph by dropping a single parent entry from the merge commit.
