2023.01 Vol.1

// Git Merge Strategies #Devprod

I am Pro Merge

I am starting up a new project and wanted to get a handle on the different Pull Request merge strategies that GitHub offers (under General => Pull Requests in the repository settings) before getting too entrenched in one. There are three options: merge, rebase, and squash.

A --> B -----------> E
      |              | 
      +--> C --> D --+

merge

A --> B -> C' -> D'
      |
      +--> C --> D

rebase

A --> B -> CD'
      |
      +--> C --> D

squash

I am coming from the monorepo world where the squash strategy is very popular. Why is that? A monorepo usually involves a lot of code with a lot of authors. If those authors are not very disciplined, it’s easy for the main branch (“mainline”) history to quickly become a mess of tiny, relatively meaningless commits. If something breaks in production, searching through the history feels very needle-in-a-haystack-esque. The squash Pull Request strategy automatically takes all those tiny commits and squashes them together into just one on the main branch. The main branch history is now much easier to parse. GitHub has a branch protection setting, “Require linear history”, which blocks merge commits from getting on the branch, effectively forcing a rebase or squash Pull Request strategy.

But there are downsides to squash, which is why I am here wondering if a different strategy is for me.

I really hate that squash breaks local git tooling. It is the remote server which auto-squashes commits and creates the CD' commit in the example. The local git repos on developer laptops don’t know that CD' came from C and D, and there isn’t anything built into git to relay this info. So from the local repo’s perspective, CD' is just some brand new code. Commands like git branch -d, which only remove a local branch once its commits are detected on the mainline, don’t work. In a monorepo environment this leads to a zombie wasteland of local branches which just need to be force purged from time to time.
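A minimal sketch of the problem, assuming a local git merge --squash is a fair stand-in for the squash the remote server performs. The throwaway repository, file names, and identity are all made up for the demo.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "B"

# Feature branch with two small commits, C and D.
git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"
echo d > d.txt; git add d.txt; git commit -q -m "D"

# Stand-in for the server-side squash: one brand new commit CD' on main.
git checkout -q main
git merge --squash feature >/dev/null
git commit -q -m "CD'"

# main now holds the feature's changes, but CD' has no ancestry link
# to C and D, so the safe delete refuses to remove the branch.
if git branch -d feature 2>/dev/null; then
  squash_delete="deleted"
else
  squash_delete="refused"
fi
echo "git branch -d feature: $squash_delete"

# The usual workaround is the forced delete.
git branch -q -D feature
```

The forced delete at the end is the "force purge" step: git branch -D skips the merged check entirely, so it is on the developer to know the branch really landed.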

Do the other strategies solve this issue? What are their downsides?

The rebase strategy is pretty similar to squash, except there is a commit on the mainline branch for every commit on the feature branch. They are new commits, however, so I believe the same issue would exist for local tooling.
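That belief can be checked with a small sketch. It assumes that replaying the branch with git cherry-pick approximates the server-side rebase, and it lets the mainline move on first so the replayed commits cannot accidentally end up with the same hashes. All names are hypothetical. As an aside, git cherry compares patches rather than hashes, so it can still spot the replayed commits even though git branch -d cannot.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "B"

git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"
echo d > d.txt; git add d.txt; git commit -q -m "D"

# The mainline moves on while the feature branch is in review.
git checkout -q main
echo e > e.txt; git add e.txt; git commit -q -m "E"

# Stand-in for the server-side rebase: replay C and D as new commits C' and D'.
git cherry-pick main..feature >/dev/null

# Same changes, different commits: the safe delete still refuses.
if git branch -d feature 2>/dev/null; then
  rebase_delete="deleted"
else
  rebase_delete="refused"
fi
echo "git branch -d feature: $rebase_delete"

# git cherry marks commits whose patch already exists on main with "-".
equivalent=$(git cherry main feature | grep -c '^-')
echo "feature commits with an equivalent patch on main: $equivalent"
```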

The merge strategy is different from squash and rebase in that the commit which the remote server makes can be understood by local repositories. It is a special merge commit which keeps the tip of the feature branch as one of its parents, so the original commits become part of the mainline’s ancestry. Now local tooling is able to understand branch merges and work appropriately. Hooray! There is a classic issue with merge though, which can make it a tough sell in a modern environment: the original problem that squash solves, a messy mainline history, is back. But now it even has merge commits to clutter it up! Ah, trade-offs.
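For contrast, a minimal sketch of the merge case, assuming a local git merge --no-ff behaves like the merge commit the server creates. The repository and names are again invented for the demo.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "B"

git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"
echo d > d.txt; git add d.txt; git commit -q -m "D"

# Stand-in for the server-side merge: a merge commit E with two parents.
git checkout -q main
git merge -q --no-ff -m "Merge branch 'feature'" feature

# C and D are now ancestors of main, so git can see the branch as merged.
merged_branches=$(git branch --merged main)
echo "$merged_branches"

if git branch -d feature >/dev/null 2>&1; then
  merge_delete="deleted"
else
  merge_delete="refused"
fi
echo "git branch -d feature: $merge_delete"
```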

I made a list of things I wanted to compare the strategies on:

1. Does local repository tooling always work?
2. Is mainline history parse-able?
3. Is it easy to revert a Pull Request?
4. Are we confident that any ol’ commit on mainline is functioning code?
5. Can Pull Requests be associated with the commits on mainline?

squash

1. No
2. Yes
3. Yes – every commit can be reverted to back out a full Pull Request
4. Yes – assuming some basic CI jobs test code on Pull Requests, that means every commit on the mainline should have run the gauntlet: high confidence.
5. Yes, but… – GitHub’s API exposes a mapping between the commits it creates on the remote server and the Pull Requests

rebase

1. No
2. No – depends on developer discipline
3. No – hard to say which commits belong to which Pull Requests without asking GitHub’s API
4. No – A more complex CI job would be required to run on every commit of a Pull Request
5. Yes, but… – GitHub’s API exposes a mapping between the commits it creates on the remote server and the Pull Requests

merge

1. Yes
2. Double No – depends on developer discipline and tooling
3. Yes – just revert the merge commit
4. No – A more complex CI job would be required to run on every commit of a Pull Request
5. Yes – Native git mapping of commits to feature branches
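"Just revert the merge commit" has one wrinkle worth a sketch: git revert needs to be told which parent is the mainline side, via -m 1. The example below assumes the feature branch was merged into main, so parent 1 is the mainline. The repository and file names are hypothetical.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "B"

git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"
echo d > d.txt; git add d.txt; git commit -q -m "D"

git checkout -q main
git merge -q --no-ff -m "Merge branch 'feature'" feature

# Back out the whole Pull Request in one step.
# -m 1 means "treat parent 1 (the mainline side) as the state to return to".
git revert --no-edit -m 1 HEAD >/dev/null

# Everything the branch brought in is gone; the mainline files remain.
ls
```

With the squash strategy the equivalent is a plain git revert of the single squashed commit, with no -m needed.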

Looking at the lists, it’s understandable why the practicality of squash often wins out. But I can’t get over it failing requirement #1. I also find the #5 caveat of using GitHub’s API a little gross, but more on that in a bit.

The rebase strategy looks like a worse squash for my requirements, so it’s out. Which leaves old-school merge vs. practical squash. merge fails on #2 and #4, but wins on #1 and #5.

I for one am ok with this trade-off. #4 is not too big of a deal since Pull Requests are still easy to revert (#3). #2 is the real sticking point with most developers. I think there are a few strategies to mitigate the worst-case scenario of millions of commits on a ton of branches weaving in and out of the mainline branch.

To start, I think a CI job could step in to encourage best behavior. Something as simple as a check which blocks Pull Requests with more than 3 commits would help keep the noise down. Developers would get in the habit of rebasing their branches locally and cleaning things up before merging. This also relates a bit to #5. Depending on GitHub’s API to join the history of a commit with its Pull Request feels gross. If feature branches are instead cleaned up before merge, this gives developers an opportunity to also bake Pull Request feedback right into the git history.
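A rough sketch of what such a check could look like. The function name check_commit_count and its arguments are hypothetical, not part of any CI product, and it assumes the CI job has enough history fetched to see both the base and the head of the Pull Request.

```shell
# Hypothetical CI helper: fail when a branch carries too many commits.
# Usage: check_commit_count <base-ref> <head-ref> [max-commits]
check_commit_count() {
  base=$1
  head=$2
  max=${3:-3}
  # Count the commits reachable from head but not from base.
  count=$(git rev-list --count "$base..$head")
  if [ "$count" -gt "$max" ]; then
    echo "FAIL: $count commits on $head (limit $max); rebase and tidy up first"
    return 1
  fi
  echo "OK: $count commits on $head (limit $max)"
}

# Demo in a throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"
echo base > base.txt; git add base.txt; git commit -q -m "B"

# A noisy feature branch with four work-in-progress commits.
git checkout -q -b feature
for i in 1 2 3 4; do
  echo "$i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "wip $i"
done
if check_commit_count main feature; then before=0; else before=1; fi

# Tidy up locally: collapse the noise into one meaningful commit.
git reset -q --soft main
git commit -q -m "Add feature"
if check_commit_count main feature; then after=0; else after=1; fi
```

The limit of 3 comes from the paragraph above; the soft reset is just one way to tidy up, and an interactive rebase would do the same job with more control.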

So in general, I think the merge strategy requires more up-front effort, but it is not wasteful, and results in a better history and developer environment in the long run. I am going to continue down the rabbit hole of merge commits though for a better understanding.

3-way merge

A --> B ---> D ----> E
      |              | 
      +----> C ------+

B     C     D     E
=     =     =     =
A1    A1    A1    A1
B1    B2    B1    B2
C1    C2    C3    C?
D1    D2    D2    D2

three way merge

By default, when git merges in a branch it creates a new commit called a “3-way merge” commit. The commit has 2 parent commits. To create the merge commit, git compares the state of both branches it merges plus their most recent shared ancestor commit (the “merge base”). This way it can tell if a code chunk changed in both branches and detect merge conflicts (in the above example, the C chunk).
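The example above can be replayed as a sketch. One assumption to flag: each "chunk" is modelled as its own one-line file (a, b, c, d), because git would lump edits on adjacent lines of a single file into one conflict. Repository and branch names are made up.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Commit B: the shared starting point, one chunk per file.
printf 'A1\n' > a; printf 'B1\n' > b; printf 'C1\n' > c; printf 'D1\n' > d
git add a b c d
git commit -q -m "B"

# Commit C on the feature branch changes chunks B, C and D.
git checkout -q -b feature
printf 'B2\n' > b; printf 'C2\n' > c; printf 'D2\n' > d
git commit -q -a -m "C"

# Commit D on the mainline changes chunks C and D.
git checkout -q main
printf 'C3\n' > c; printf 'D2\n' > d
git commit -q -a -m "D"

# The third input of the 3-way merge: the most recent shared ancestor.
base=$(git merge-base main feature)
echo "merge base: $(git log -1 --format=%s "$base")"

# Attempt merge commit E. It stops because chunk C changed on both sides.
git merge feature >/dev/null 2>&1
merge_status=$?
conflicted=$(git diff --name-only --diff-filter=U)
echo "merge exit status: $merge_status"
echo "conflicted files: $conflicted"

# The other chunks resolved on their own: B changed on one side only,
# D changed identically on both sides.
echo "b=$(cat b) d=$(cat d)"
```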

The real tricky part with merge commits is quickly understanding the commit history. By default, the git log command orders commits by time. This mixes together feature branch commits with mainline commits, and I don’t find it a particularly good default. The --topo-order flag makes more sense to me, grouping commits by their feature branch. The --graph flag also uses topo order, but additionally prints pretty ASCII lines that actually help a lot in smaller or well-organized repositories.

$ git log --graph --oneline 
*   03d91a5 (HEAD -> master) Merge branch 'feature'
|\  
| * b775904 (feature) Feature 1
* | 6bf99d7 2nd commit
|/  
* aae13f6 First Commit

graph ascii art

If there are no changes on the mainline branch, git will just “fast forward” it with the commits from the feature branch unless told otherwise. This keeps the mainline history linear and easy to parse. A --no-ff flag can be used to force a 3-way merge commit instead. This keeps the history of the feature branch intact. Fast forwards can make it difficult to quickly revert a Pull Request since it is not immediately obvious which commits were associated with the PR.

A --> B
      |               
      +--> C --> D

fast forward merge
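A small sketch of the difference, counting the parents of the resulting mainline tip. The two branches ff-demo and noff-demo are hypothetical copies of main, used so both merges can be tried from the same starting point.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "B"

git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"
echo d > d.txt; git add d.txt; git commit -q -m "D"

# Two copies of the untouched mainline to merge into.
git branch ff-demo main
git branch noff-demo main

# Default behavior: the branch pointer just slides forward to D.
git checkout -q ff-demo
git merge -q feature
ff_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')

# With --no-ff: a real merge commit is created on top.
git checkout -q noff-demo
git merge -q --no-ff -m "Merge branch 'feature'" feature
noff_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')

echo "fast forward tip has $ff_parents parent(s)"
echo "--no-ff tip has $noff_parents parent(s)"
git log --graph --oneline noff-demo
```

The two-parent tip is the handle that makes a later single-commit revert possible; after a fast forward there is no such commit to point at.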

Maybe some principles to follow:

Merge to mainline so local tooling works
Disable fast forwards on merges to mainline so history is maintained
Rebase on feature branches where you control history (not shared with other devs) to simplify graph (fewer merge commits)
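Put together locally, the principles might look like the sketch below. It leans on git's branch.<name>.mergeOptions setting to make --no-ff the default for merges into main. That setting only affects local merges, so it is an assumption that the hosting side is configured to create merge commits as well. Names are illustrative.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Principle 2: never fast forward when merging into main.
git config branch.main.mergeOptions "--no-ff"

echo base > base.txt; git add base.txt; git commit -q -m "B"

git checkout -q -b feature
echo c > c.txt; git add c.txt; git commit -q -m "C"

# The mainline moves on in the meantime.
git checkout -q main
echo e > e.txt; git add e.txt; git commit -q -m "E"

# Principle 3: rebase the private feature branch instead of merging main into it.
git checkout -q feature
git rebase -q main

# Principle 1: merge to mainline. A fast forward would be possible here,
# but the configured --no-ff still produces a merge commit.
git checkout -q main
git merge -q -m "Merge branch 'feature'" feature
workflow_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')

# Local tooling works: the safe delete succeeds.
if git branch -d feature >/dev/null 2>&1; then
  workflow_delete="deleted"
else
  workflow_delete="refused"
fi

echo "merge commit parents: $workflow_parents, branch -d: $workflow_delete"
git log --graph --oneline
```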