A Case Against Git Rebase


Okay so, this is one of those topics that’s almost religious for some people. But at the risk of a bunch of rebase-everything fanatics coming after me, I feel like this topic deserves another opinion piece. The occasion for me writing this post is yet another case of unnecessary merge conflicts caused by someone rebasing code they didn’t write/own. 🤬

Let’s start with a disclaimer: I’m going to talk about rebasing from the perspective of a software engineer. I know that my points might not (fully) apply to other disciplines, like infrastructure engineering, where, for example, the branches are usually quite small and often don’t consist of multiple commits.

Git Primer

The rest of the post assumes that you, the reader, are already familiar with how Git works under the hood. This next chapter is a technical introduction to those internals. If you already know how Git works internally, feel free to skip it.

Git is a decentralized version control system by Linus Torvalds. It is used to track changes on a project. It’s designed for software projects where the majority of the content is plain text. However, it also has basic support for binary files – without diffs, though, for obvious reasons.

The “atoms” in Git are called “objects.” They are stored in a so-called “content-addressable filesystem.” That means that the content, or rather its SHA-1 hash, is the address/key of the object. (Strictly speaking, it’s the SHA-1 hash of a short header plus the uncompressed content. The object is zlib-compressed only when it is written to disk.) A consequence of this is that all content is immutable. Changing an object would also change its address → it’s a different object.
Typically, there are four different kinds of objects:

  • Blobs are file content without any metadata.
  • Trees are essentially directories. They store references to their files (= blobs) and subdirectories (= trees) as well as their names and permissions. (Technically also commits in the case of Git submodules).
  • Commits are snapshots of the repository. They contain a reference to the tree object of the project root directory, some metadata like author, commit message, timestamp, etc., as well as 0 or more references to parent commits. (Although, typically only 1 or 2 parents are referenced.) Commits without a parent are called “root commits.” (Git uses the word “orphan” for a branch that starts from such a commit.) Commits that don’t share an ancestor are called “unrelated.”
  • Annotated tag objects are essentially pointers to commits with some more meta information, like description, timestamp, and who created the tag.
    Note that there are two different kinds of tags in Git: annotated tags and lightweight tags. Only the first one has an object. We’ll talk more about this later on.
Object types and how they relate to each other.
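
To make the content addressing a bit more concrete, here is a minimal Python sketch of how a blob's ID can be computed. It assumes the classic SHA-1 object format (a header of the form "<type> <size>", a NUL byte, then the raw content). The function name git_object_id is made up for this example and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 ID for an object of the given type and raw body."""
    # The hash covers a small header ("<type> <size>\0") plus the
    # uncompressed body. Compression only happens when storing the object.
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


blob_a = git_object_id("blob", b"hello world\n")
blob_b = git_object_id("blob", b"hello world!\n")

# A one-character change yields a completely different address.
print(blob_a)
print(blob_b)
```

If you have Git at hand, the result for a given input should match what git hash-object --stdin reports for the same bytes.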

(Please note that the graph of commits is sometimes also referred to as a “tree”. However, a tree in Git usually means a directory structure. I think this misunderstanding might be because the word “branch” suggests that the structure it is branching off has to be a tree.)

Because objects are always immutable, changing a file will result in a different blob, which means all its containing trees and commits are also different. The same thing applies to history: Because of the recursive nature of tree and commit IDs, changing any object in the “past” will either cascade all the way to the HEAD or create inconsistent hashes at some point. In other words, as long as the HEAD commit ID is known, any change in the history can be detected. This is actually a consistency feature. It makes sure that you can always detect disk corruption or even malicious changes to the history. This is an important detail to remember for later, when we are looking at rebasing in detail.
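
The cascading effect described above can be sketched in a few lines of Python. This is a simplified model under some assumptions: a flat directory with regular files only, a hypothetical fixed author identity and timestamp (WHO), and helper names (object_id, tree_id, commit_id, history) that are invented for the example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_id(entries: dict[str, str]) -> str:
    """Hash a flat directory given as file name -> blob ID."""
    body = b""
    for name in sorted(entries):
        # Each entry: "<mode> <name>\0" followed by the 20-byte binary blob ID.
        body += f"100644 {name}\0".encode() + bytes.fromhex(entries[name])
    return object_id("tree", body)


# Hypothetical fixed identity and timestamp, so the IDs are reproducible.
WHO = "Jane Doe <jane@example.com> 1700000000 +0000"


def commit_id(tree: str, parents: list[str], message: str) -> str:
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {WHO}", f"committer {WHO}", "", message]
    return object_id("commit", ("\n".join(lines) + "\n").encode())


def history(first_content: bytes) -> list[str]:
    """Build two commits. Only the content in the first commit varies."""
    blob1 = object_id("blob", first_content)
    c1 = commit_id(tree_id({"README": blob1}), [], "initial commit")
    blob2 = object_id("blob", b"second version\n")
    c2 = commit_id(tree_id({"README": blob2}), [c1], "update README")
    return [c1, c2]


original = history(b"first version\n")
tampered = history(b"first version (edited)\n")
print(original)
print(tampered)
```

Even though the second commit's tree is identical in both histories, its ID differs, because the ID of its parent is part of what gets hashed. That is the property that makes changes in the "past" detectable from the head commit ID.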

Another feature in Git for detecting bad actors is signing. The metadata in commits is essentially arbitrary. So, similar to email, you can actually specify anything as the author. But, like with email, you can use PGP to sign commits with your private key. Rewriting history would change the content of the commit and thus invalidate the signature. And forging a new commit using a fake identity but without a signature will, depending on the process used in the team, be noticed quickly.

Branches

Theoretically, we don’t need any structure for creating branches. Creating a “branch,” strictly speaking, just means having multiple commits with the same parent after all. However, that’s not really practical because we’d need to manually keep track of which commit is currently the tip of which branch.

Luckily, Git already provides a solution for that: you can give names to important commits. The technical term for this is “references,” or “refs” for short. Refs are essentially just pointers to commit objects. These are the most important types of refs:

  • Tags are meant as immutable names. They are often used to mark released or other special versions. As mentioned earlier, there are actually two kinds of tags: lightweight tags are really just pointers to commits. Annotated tags, by contrast, are pointers to tag objects that contain some metadata as well as a pointer to the corresponding commit.
  • Heads are names that are meant to change. Typically they represent the most recent state of a given development line. When a new commit is added, the head ref moves accordingly. This is what’s generally referred to as a “branch” in Git.
Relationship between tags, heads, HEAD and commits.

Note that while HEAD (in caps) is technically a ref, it is not actually a head. Instead, it is a pointer to the currently checked-out ref or commit. In case HEAD points to a branch, it is called “attached.” This means that the underlying head ref will move with HEAD whenever new commits are added. When HEAD points directly to a commit (for example, after checking out a tag), it is “detached.” That means that when HEAD moves to a new commit, it will not automatically update any underlying refs.
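
As a small illustration of attached vs. detached: in a regular repository the HEAD pointer is a plain file (.git/HEAD) that either contains a symbolic reference like "ref: refs/heads/main" or a raw commit ID. The following Python sketch classifies such content. The function name describe_head is made up, and the sketch assumes 40-character SHA-1 IDs.

```python
import re


def describe_head(head_file_content: str) -> str:
    """Classify the content of a HEAD file as attached or detached."""
    text = head_file_content.strip()
    if text.startswith("ref: "):
        # Symbolic ref: new commits move the named branch along with HEAD.
        return "attached to " + text.removeprefix("ref: ")
    if re.fullmatch(r"[0-9a-f]{40}", text):
        # Raw commit ID: new commits move HEAD, but no branch follows.
        return "detached at " + text[:7]
    raise ValueError(f"unrecognized HEAD content: {text!r}")


print(describe_head("ref: refs/heads/main\n"))
print(describe_head("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\n"))
```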

There are actually multiple such HEAD (in caps) pointers for various situations where you need to keep track of multiple states at once. For example, during merging or rebasing, there are MERGE_HEAD or REBASE_HEAD, respectively.

(Fun fact: Git actually keeps a log of all changes to refs. It is creatively named “reflog” and can be used to do all kinds of trickery, like git checkout HEAD@{one.week.ago}.)

Merge vs. Rebase

Now that we have multiple branches with different changes, at some point we might want to combine them back together.

The straightforward way of doing this is to utilize the fact that a commit can have an arbitrary number of parents. We could combine the changes from both branches into a new commit that references both heads as its parents. This new commit is usually referred to as a “merge commit”.

Before and after a merge: head 2 is merged into head 1. MC is the merge commit.

Now, I wrote “combine the changes from both branches,” but that’s not actually super trivial. Each commit represents a snapshot of the repo, not a set of changes. So first Git needs to figure out what changed in each branch compared to the shared ancestor – basically a diff. Then it needs to compare the changes to each other. (Git actually supports multiple different merge strategies. The default one, “ort,” is essentially a 3-way merge.) Most of the time Git will be able to do this automatically. However, when both branches change the same part of a file, Git might not be able to resolve the merge conflict, and it has to be fixed manually. Lastly, the combined changes are “replayed” on the target branch and added as a merge commit.
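
The idea behind a 3-way merge can be sketched in Python. Note that this is a heavily simplified model: it works on whole files (path → content) instead of on lines or hunks like Git does, and the function name three_way_merge is invented for the example.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge three snapshots (path -> content). Return (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree (or neither touched the file)
        elif o == b:
            result = t  # only "theirs" changed the file
        elif t == b:
            result = o  # only "ours" changed the file
        else:
            conflicts.append(path)  # both changed it in different ways
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "3"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

The important part is the shared ancestor: without base there would be no way to tell which side actually changed a.txt or b.txt, and every difference would look like a conflict.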

If the target head commit is actually an ancestor of the other branch, Git can also do what’s called a “fast-forward” merge. In this case no new commit is created. Instead, the target head is just updated to point to the head of the other branch. This is sometimes wrongly referred to as “rebasing.”

Before and after a fast-forward merge. No merge commit is created.
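
Whether a fast-forward is possible boils down to an ancestry check. Here is a minimal Python sketch, assuming the commit graph is given as a dictionary of commit → list of parents. The commit names and both function names are placeholders for this example.

```python
def is_ancestor(parents: dict, ancestor: str, commit: str) -> bool:
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def can_fast_forward(parents: dict, target_head: str, other_head: str) -> bool:
    # A fast-forward only moves the target head forward along existing commits.
    return is_ancestor(parents, target_head, other_head)


# A <- B <- C (feature), and A <- B <- D (another branch)
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

print(can_fast_forward(graph, "B", "C"))  # target B is behind C
print(can_fast_forward(graph, "C", "D"))  # C and D have diverged
```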

An actual rebase is when a series of commits on one branch are directly replayed on another branch. The name comes from the effect rebasing has: it essentially changes the base commit of a branch.

Before and after a rebase: head 1 is rebased on top of head 2.

Because when rebasing we usually deal with multiple commits, we need to consider each one separately. So, for each commit, we generate a diff and apply it to the target branch, then we commit the change with the same commit message and continue with the next commit. Since each commit is replayed separately, conflicts might need to be resolved multiple times during a single rebase. In the end, the head of the branch is updated to point to the new set of commits.

Note that these new commits on the target branch are distinct from the commits on the original branch. That’s because both the tree and parent commit references are different. (Strictly speaking, the original commits are not deleted by this operation. They are just not referenced by that branch anymore.)
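
Here is a toy model in Python of why the replayed commits are new objects. It is a sketch under simplifying assumptions: a commit is reduced to its parent ID, a string standing in for its change, and a message, and conflicts are ignored entirely. The Commit class and the rebase function are invented for this illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    parent: str   # ID of the parent commit ("" for a root commit)
    patch: str    # stand-in for the change this commit introduces
    message: str

    @property
    def id(self) -> str:
        # Toy ID: like in Git, it depends on the parent as well as the content.
        data = f"{self.parent}\n{self.patch}\n{self.message}"
        return hashlib.sha1(data.encode()).hexdigest()[:8]


def rebase(branch: list[Commit], new_base: str) -> list[Commit]:
    """Replay each commit of `branch` (oldest first) on top of `new_base`."""
    replayed, parent = [], new_base
    for old in branch:
        new = Commit(parent, old.patch, old.message)  # same change, new parent
        replayed.append(new)
        parent = new.id
    return replayed


base = Commit("", "add README", "initial commit")
first = Commit(base.id, "add greeter", "add greeter")
second = Commit(first.id, "use greeter", "use greeter")
main_tip = Commit(base.id, "fix typo", "fix typo")

rebased = rebase([first, second], main_tip.id)
for old, new in zip([first, second], rebased):
    print(old.id, "->", new.id, new.message)
```

The messages and changes are carried over, but every ID changes, including that of the second commit, whose own change was not touched at all.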

This replaying of commits into a different context actually also has a different name: cherry-picking. In fact, earlier implementations of the rebase feature in Git used to use the git cherry-pick command under the hood. (While rebasing and cherry-picking are the same operation on a technical level, they actually have different effects. Rebasing usually replays all commits since the common ancestor, while cherry-picking, by contrast, allows you to take any commit – or rather, its change – out of context. Cherry-picking in particular can have some pretty nasty side effects if it is not used correctly.)

The main advantage of doing a rebase instead of a merge is to clean up the Git history. After rebasing a branch onto some other branch, the head of the target branch is a direct ancestor of the rebased branch. That means you can now do a fast-forward merge, skipping the merge commit and thus making the history linear.

While it’s not really the focus of this article, it might still be worth briefly mentioning what “squashing” is. When doing an interactive rebase, you have the option of marking certain commits as “squash” or “fixup.” That means that when replaying these changes, the rebase command will not create a separate commit for them. (Well, technically, it’s the previous commit’s changes that are not committed on their own. In the case of a real squash – not a fixup – the new commit message then contains all squashed commit messages in chronological order.) Put another way: Changes from multiple different commits are combined into one.
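
The difference between “squash” and “fixup” can be sketched with a toy version of the todo list in Python. The data layout (tuples of action, message, and a string standing in for the change) and the function name apply_todo are assumptions made for this example.

```python
def apply_todo(todo):
    """Fold a list of (action, message, patch) tuples into resulting commits."""
    result = []
    for action, message, patch in todo:
        if action == "pick":
            result.append({"messages": [message], "patches": [patch]})
        elif action in ("squash", "fixup"):
            if not result:
                raise ValueError(f"cannot {action} without a previous commit")
            result[-1]["patches"].append(patch)
            if action == "squash":
                # squash keeps the message, fixup throws it away
                result[-1]["messages"].append(message)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


todo = [
    ("pick", "add greeter", "patch 1"),
    ("squash", "document greeter", "patch 2"),
    ("fixup", "oops, typo", "patch 3"),
    ("pick", "use greeter in main", "patch 4"),
]

for commit in apply_todo(todo):
    print(commit)
```

Four original commits end up as two: the first one carries three changes and two of the three messages.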

Local Rebasing

Okay, let’s get into the weeds now: Why do I think rebasing is not a great idea in many cases?

I think it’s useful to consider two different scenarios: The first is purely local rebasing, where the branch (or rather, the commits) in question are only present locally and cannot be seen by anyone who’s not the author.

In this case there are basically only three meh reasons for avoiding rebasing (aside from preference):

The first is mainly ornamental: the commits might no longer be ordered chronologically. The rebased commits could be (and probably are) older than their new parent/base commit. (I should probably mention that Git puts two different timestamps on commits. The “author date” is the timestamp of when the change was originally committed. The “committer date,” by contrast, is the timestamp of when this particular commit object was created. So after rebasing, the “committer date” is usually the date when the rebase operation took place, while the “author date” stays the same.)

The next one might be a bit weird at first: rebasing creates transient states between the new base and the new head that never actually existed during the development process.

It might not be obvious why this could be a problem. And admittedly it’s probably not something many people encounter anyway. Essentially, it has to do with a (I think) little-known debugging feature in Git: git bisect

Git bisect allows you to semi-automatically search for a commit that introduced a regression. You specify a known-good and a known-bad commit, and bisect will do a binary search: it checks out a commit in the middle, asks you whether the regression is present, and continues on the half that must contain the culprit commit.
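
The search itself is a plain binary search. Below is a minimal Python sketch for a linear history. It assumes the first commit is good, the last one is bad, and that every commit after the culprit is bad as well. The function name bisect and the commit names are placeholders, and is_bad stands in for the manual check (or test script) you would run.

```python
def bisect(commits, is_bad):
    """Return (first bad commit, number of checks) for a linear history.

    `commits` is ordered oldest to newest. commits[0] must be good and
    commits[-1] must be bad.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        steps += 1
        if is_bad(commits[middle]):
            bad = middle   # culprit is at or before the middle
        else:
            good = middle  # culprit is after the middle
    return commits[bad], steps


commits = [f"c{i}" for i in range(16)]
culprit, steps = bisect(commits, lambda c: int(c[1:]) >= 11)
print(culprit, steps)
```

With 16 commits the culprit is found after 4 checks instead of up to 14. The catch is that is_bad has to give a meaningful answer for every commit in between.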

This, of course, assumes that each commit represents a state of the code that actually works, or at the very least builds (cf. atomic commits). This clashes with rebasing in that there is no guarantee that the newly created transient commits actually work. (Ironically, even though squashing has arguably more disadvantages than rebasing, it’s actually better in this case, because the resulting commit can be verified directly.)

Below you can see an example of invalid transient states created by a rebase. It’s a small Hello World example. The initial commit contains a component called “greeter” that’s never used. The change in the main branch just calls this component, while the feature branch removes the component and replaces it with a direct printf call. When using a merge commit, every single commit works fine, and the tests pass. However, in the rebase case, the “remove greeter interface” commit doesn’t even build, because the rebase introduces a call to the greeter that didn’t exist during development.

Terminal screenshot of a Git graph and some automated tests. The graph shows two branches being merged. All the tests pass.
When doing a merge all states in the history existed at some point.
Terminal screenshot of a Git graph and some automated tests. The graph is linear. One test failed.
Rebase might create states that never existed during development.

The last disadvantage is that rebases are much harder to undo than merges. In case you want to undo a merge, you can just look up the old head commit ID with git log, and then git reset --hard the branch to that old state. After a rebase, however, the old head commit ID is no longer part of the history. (Unless there is another ref pointing to the old state or one of its children. But that’s bad in and of itself. See below.) What we can do in this case is to utilize the fact that rebase doesn’t delete the original commits. We can use git reflog to look up the old head commit, and reset the branch to there. This, of course, assumes that the old commits are still present, i.e. the Git garbage collector has not pruned them yet. The last option is to rebase or cherry-pick the commits back onto the old base. However, then we have no guarantee that the commits are actually the same as before, and there’s no easy way to verify it. (Git rebase actually also changes the metadata slightly. So even if we know the old commit IDs, we cannot compare them with the new ones.)

Terminal screenshot showing how to undo a rebase.

The four commands being executed are:
git adog, to show the rebased history
git reflog, to search for the commit ID
git reset --hard, followed by the commit ID
git adog, to verify the result.
How to undo a rebase.

Remote Rebasing

In the second scenario, the changes being rebased exist on a remote. Specifically, this case occurs when other people have access to the commits being rebased. (I’d even argue it also applies when there is any other ref pointing to any commit that’s being rebased.)

The core issue is that after a rebase, the original commits are, in a sense, incompatible with the rebased commits: they contain the same changes but don’t have the same history. So if work continues on the original branch, you are going to have a bad time if you are not careful.

One of the more harmless but still annoying things that could happen is duplicated commits. Imagine someone branches off a head that will be rebased. Later on that new branch is merged back in. Now both the rebased and the original commits exist in the Git history. One thing you can do to avoid this is to rebase the branch instead of merging it – Git is actually clever enough to not apply patches that were already cherry-picked. That being said, it’s still possible to end up with duplicated commits even if everything was rebased. Below you can find a small demo of this happening. In this case I’m utilizing the fact that Git only considers patches with the same diffs as identical. (I’m not 100% sure on this. The Git source code is rather complex, and I’m not willing to spend hours trying to understand it. ^^)
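
The "same diff means same patch" idea can be sketched in Python. To be clear, this is a toy model and not Git's actual implementation: it assumes that two commits count as equal when a hash over their whitespace-normalized diff matches (similar in spirit to what git patch-id computes), and the names patch_id and commits_to_replay are invented.

```python
import hashlib


def patch_id(diff: str) -> str:
    """Toy patch ID: hash only the diff, ignoring parent and metadata."""
    normalized = "".join(diff.split())  # ignore whitespace differences
    return hashlib.sha1(normalized.encode()).hexdigest()[:8]


def commits_to_replay(branch, upstream):
    """Skip (message, diff) commits whose change already exists upstream."""
    upstream_ids = {patch_id(diff) for _, diff in upstream}
    return [(msg, diff) for msg, diff in branch
            if patch_id(diff) not in upstream_ids]


branch = [("add greeter", "+ greet()"), ("add tests", "+ test_greet()")]

# Case 1: the rebased copy upstream has exactly the same diff.
same = [("add greeter", "+ greet()")]
print(commits_to_replay(branch, same))

# Case 2: the diff was altered during the rebase (e.g. a resolved conflict).
altered = [("add greeter", "+ greet(name)")]
print(commits_to_replay(branch, altered))
```

In the first case only "add tests" is replayed. In the second case the comparison no longer matches, so "add greeter" is replayed again and ends up in the history twice.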

Demo for duplicated commits caused by rebasing

Now, when you are only working locally, this really isn’t a problem. It’s not like you can’t run into this (see video), but it doesn’t happen naturally. However, as soon as other people have access to the original branch, it’s really out of your hands what happens with that state.

Do not rebase commits that exist outside your repository and that people may have based work on.

If you follow that guideline, you’ll be fine. If you don’t, people will hate you, and you’ll be scorned by friends and family.

“Pro Git” – Scott Chacon and Ben Straub (That’s the book on Git.)

This is also, in essence, what the Linux kernel rules for rebasing boil down to.

One special case of this is when the person rebasing isn’t the person who initially committed the code. This, in my book, is an absolute no-go. Not only do you have all the disadvantages already mentioned, but since you are rewriting history, you are also invalidating all signatures on the commits. Below you can find an example of this: One user creates a branch with signed commits, a second user rebases the commits (with a different PGP key), which causes the signature checks for the first user to fail, because the signing key is not trusted.

Terminal screenshot demonstrating how rebasing other people's commits invalidates their signatures. 

The first command shows that all commits are signed. Then the user does a rebase, signing the rebased commit with their key. The first user shows that the new commit signature is untrusted and tries (and fails) to verify the signatures for a fast-forward merge.
Rebasing other people’s commits invalidates their signatures.

I’m especially annoyed by GitHub, GitLab, Codeberg/Forgejo, and whatnot because they natively support remote rebasing for merge requests. I know it’s not that bad when you use Git on a centralized platform, but it still feels like abusing the tool.

Conclusion

I hope this article made it a bit clearer why I feel like you shouldn’t use rebasing all the time. The feature does have its time and place. But using it by default without thinking causes more problems than it solves.

In case you are frustrated with my rather verbose writing style, take a look at “Git for Computer Scientists” by Tv. It essentially covers my whole Git Primer chapter plus a bit more, but it’s way more concise than my post. 😅
There’s also an excellent video by Alex from Philomatics over on YouTube on rebasing in particular. It’s much more positive towards the topic than I am, but it still takes care to mention the disadvantages.

I actually wanted to talk a bit more about how rebasing can be used in the different established Git workflows. But the article was just getting much too long… again… So I figured I might write a follow-up that focuses on that. Stay tuned.

If you want to read about another of my tool-related deep dives, check out my blog post on Writing Sudo Plugins.

In any case, I hope you enjoyed the article, or at the very least now know something about Git you didn’t know before.

Have a wonderful day.
Sigma!

Illustration of the trolley problem on the top. Below is a similar image with all the people on one train track instead of two. In between the two is a screenshot of a terminal: 
git rebase master
Successfully rebased and updated ref
The trolley problem meme with rebasing.
