Search IconIcon to open search


Last updated Dec 28, 2022 Edit Source

git internals. Heavily inspired by The Git Parable, a talk by Tom Preston-Werner

# Git Internals

# Snapshots

What if you could take snapshots of your codebase at any time and resurrect that code on demand?

The simplest version of this is something you may have done with Photoshop files.

# Branches

Well now we run into a new problem. What if we roll back to a previous version of the code and then make some more changes?

That is, we create a new snapshot that is not a direct descendent of the preceding snapshot.

Our previous snapshot system only worked for a linear system of changes. How might we handle having multiple points where there are active changes being made?

By looking at your code history as a tree, solving the problem of ancestry becomes trivial. All you need to do is include the name of the parent snapshot in the message file you write for each snapshot.

This is also why there is a lot of arboreal terminology in git: branches, trunk, etc.

Now… we run into another problem. How should we name our snapshots now that we can’t use a linear system of numbers?

Well, we can actually just name each branch and then list branch-name: snapshot-name pairs that represent the tips of branches. Let’s store this in a file called branches:

main: project
tmp-fix: project-version-2-fix

To switch to a named branch you need only look up the snapshot for the corresponding name from this file.

To ensure this file is always up to date, creating an additional snapshot on a branch means we should update the entry in branches that corresponds to our current branch to the latest snapshot.

# Branches as Pointers

After using branches for a while you notice that they can serve two purposes. First, they can act as movable pointers to snapshots so that you can keep track of the branch tips. Second, they can be pointed at a single snapshot and never move.

Mixing both of these uses into a single file feels messy. Both types are pointers to snapshots, but one moves and one doesn’t.

We can create another file called tags to contain static pointers to snapshots.

# Collaboration

Now, imagine you are working with a friend on this project. You give them a copy of all of your snapshots, branches, and tags. Your friend happens to go offline for a while and when you meet up again, you both realize you’ve been using the same naming system for your snaphsots! Now, you both have snapshots called project-version-23 and project-version-24 that have different contents. Also, we have no idea who authored which snapshot!

There are two things we can do to solve this:

  1. Snapshot messages will henceforth contain author name and email.
  2. Snapshots will no longer be named with simple numbers. Instead, you’ll use the contents of the message file to produce a hash. This hash will be guaranteed to be unique to the snapshot since no two messages will ever have the same date, message, parent, and author. Let’s use the SHA-1 hash algorithm

Nice! Now, we can merge our snapshots (and thus our working trees) without conflicts. Because of how we hash our snapshots to get their names, we know that any two snapshots with the same name actually have the same content too.

# Merges

This is what we normally call a merge commit!

However, while constructing the snapshot message for the merge, you realize that this snapshot is special. Instead of just a single parent, this merge snapshot has two parents.

# Eliminating Duplication

We can use content addressed storage and Merkle-DAGs!

  1. Create a directory named objects
  2. Go to the most deeply nested directory in the snapshot
  3. Create a temp working file
  4. For each file in the directory
    1. Calculate the hash of the contents
    2. Add an entry to the temp working file: blob {hash} {filename}
    3. Copy the file into the objects directory and rename it to the hash
  5. Afterwards, find hash of the temp working file and place it in the objects directory (also using the hash as the name). This represents the folder we just traversed
  6. Move up on directory and repeat starting from step 3
    1. When we come across the folder we just traversed, we add the following entry to the temp working file: tree {hash} {dir name}
  7. Once this has been accomplished for every directory and file in the snapshot, you have a single root directory object file and its corresponding SHA1. We record this in the commit message

Thus, we avoid storing duplicate files!

# Compression

Text can be very efficiently compressed using something like the LZW or DEFLATE compression algorithms. If you compress every blob before computing its SHA1 and saving it to disk you can reduce the total storage size of the project history significantly.

# Handy Commands

Think about git like a file time machine – it allows you to traverse and manage an entire multiverse of files

Here are some verbs and commands you’ll find yourself using a lot locally!

For collaboration and online repositories: