Git Internals: Objects & Refs
Most engineers use Git daily without ever looking beneath the surface. But the engineers who truly master Git — the ones who rescue corrupted repositories, design branching strategies at scale, and build CI/CD pipelines that never lose work — understand the object model underneath. This lesson strips the magic away and shows you exactly what Git is: a content-addressed filesystem with a thin version-control layer on top.
The Object Store: Four Types That Explain Everything
Every piece of data Git tracks lives in .git/objects/. Git stores four types of objects, each identified by the SHA-1 (or SHA-256 in newer repos) hash of its contents, prefixed with a short header giving the object's type and size. The hash is the identity: change one byte and you get a completely different object. This is the content-addressed model.
- Blob: raw file contents, nothing else. No filename, no permissions; those live in the tree that references the blob. Two files with identical contents share one blob.
- Tree: a directory snapshot, stored as a list of (mode, name, SHA) entries pointing to blobs and other trees. It represents one directory at one moment in time.
- Commit: a pointer to a root tree, zero or more parent commit SHAs, author/committer metadata, and a message. The commit is what gives Git its history graph.
- Tag: an annotated tag object. It points to any object (usually a commit) and adds tagger identity, a date, a message, and optionally a PGP signature for release integrity.
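The blob and tree formats described above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, assuming a SHA-1 repository; the helper names (git_object_id, blob_id, tree_id) are invented for illustration and are not part of Git, and the tree sorting is simplified.

```python
import hashlib


def git_object_id(obj_type, payload):
    """Hash '<type> <size>' + NUL + payload, the way Git names an object (SHA-1)."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content):
    """A blob is just the file's bytes: no filename, no mode."""
    return git_object_id("blob", content)


def tree_id(entries):
    """entries: iterable of (mode, name, hex_sha) tuples for one directory.

    Simplification: entries are sorted by name bytes only. Real Git
    sorts subdirectory names as if they ended with '/'.
    """
    payload = b""
    for mode, name, sha in sorted(entries, key=lambda e: e[1].encode()):
        payload += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
    return git_object_id("tree", payload)


hello = blob_id(b"hello world\n")
print(hello)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# The same blob under two different names yields two different trees:
# the filename lives in the tree, not in the blob.
readme_tree = tree_id([("100644", "README", hello)])
notes_tree = tree_id([("100644", "NOTES", hello)])
print(readme_tree != notes_tree)
```

The blob ID should match what git hash-object reports for the same bytes, which is a quick way to check the sketch against a real Git installation.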
git commit --amend does not edit the old commit; it writes a brand-new commit object with a new SHA and moves the branch pointer. The old commit still exists until garbage collected.
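Why an amend must produce a new SHA follows directly from how a commit is hashed. The sketch below serializes a commit the way a SHA-1 repository does and shows that changing only the message changes the ID. The function name commit_id, the author string, and the messages are hypothetical; the tree ID used is the well-known empty tree.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries
WHO = "Ada Example <ada@example.com> 1700000000 +0000"   # hypothetical identity and timestamp


def commit_id(tree, parents, author, message):
    """Serialize a commit object and return its SHA-1 object ID."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message]
    payload = "\n".join(lines).encode() + b"\n"
    header = f"commit {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


original = commit_id(EMPTY_TREE, [], WHO, "Fix typo")
amended = commit_id(EMPTY_TREE, [], WHO, "Fix typo in README")

# Same tree, same parents, same author: only the message differs,
# yet the object ID is different, so the "amended" commit is a new object.
print(original != amended)
```

Because the parent SHAs are part of the hashed payload, rewriting one commit also changes the ID of every descendant built on top of it.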
Anatomy of the DAG
The commit history forms a Directed Acyclic Graph (DAG). Each commit points backward to its parent(s). A merge commit has two or more parents. This structure makes branching and merging cheap: no file data is copied, only new pointers are written.
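The graph structure can be modeled as a plain mapping from each commit to its parents. The following sketch uses made-up single-letter commit IDs and hypothetical helper names (ancestors, merge_bases) to show how history traversal and merge-base discovery fall out of following parent pointers; it illustrates the idea and is not how Git implements these operations internally.

```python
from collections import deque

# child -> list of parents (hypothetical commit IDs)
PARENTS = {
    "A": [],          # root commit
    "B": ["A"],
    "C": ["B"],       # tip of main
    "D": ["B"],       # tip of feature
    "M": ["C", "D"],  # merge commit with two parents
}


def ancestors(commit, parents=PARENTS):
    """Return the commit plus everything reachable through parent pointers."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents[current])
    return seen


def merge_bases(left, right, parents=PARENTS):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(left, parents) & ancestors(right, parents)
    return {
        c for c in common
        if not any(c in ancestors(o, parents) - {o} for o in common)
    }


print(sorted(ancestors("M")))
print(merge_bases("C", "D"))
```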
Exploring the Object Store Directly
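The object store can be explored directly in any Git repository. Use git cat-file, the Swiss-army knife of plumbing commands, to inspect raw objects: git cat-file -t <sha> prints an object's type, and git cat-file -p <sha> pretty-prints its contents.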
Every loose object under .git/objects/ uses the first two hex characters of its hash as a directory name and the remaining characters (38 for SHA-1) as the filename. Objects are zlib-compressed. Packfiles (in .git/objects/pack/) bundle many objects together for efficiency; you will see them in any cloned repo.
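The loose-object layout can be sketched in Python without touching a real repository. The example below writes an object into a temporary directory using the two-character directory and 38-character filename split, then reads it back. The function names are hypothetical, and the sketch assumes SHA-1 and ignores packfiles entirely.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir, obj_type, payload):
    """Store header + payload, zlib-compressed, at <first 2 hex>/<remaining 38>."""
    store = f"{obj_type} {len(payload)}".encode() + b"\x00" + payload
    sha = hashlib.sha1(store).hexdigest()
    path = Path(objects_dir) / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha


def read_loose_object(objects_dir, sha):
    """Decompress a loose object and split it into (type, payload)."""
    raw = zlib.decompress((Path(objects_dir) / sha[:2] / sha[2:]).read_bytes())
    header, _, payload = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


with tempfile.TemporaryDirectory() as tmp:
    sha = write_loose_object(tmp, "blob", b"hello world\n")
    print(sha[:2], sha[2:])              # directory name, file name
    print(read_loose_object(tmp, sha))   # ('blob', b'hello world\n')
```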
Refs: Names That Point to SHAs
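A ref is simply a name that maps to a SHA, and in its simplest form it is a file containing that SHA. refs/heads/main is a 41-byte file (40 hex characters plus a newline) holding the SHA of the latest commit on main. When you run git commit, Git writes the new commit object, then rewrites that file with the new SHA. Git may also consolidate refs into a single .git/packed-refs file, so a missing file under refs/ does not mean the ref is gone.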
- refs/heads/*: local branches
- refs/remotes/*: remote-tracking branches (read-only snapshots of what the remote had last time you fetched)
- refs/tags/*: lightweight tags (just a SHA file) or annotated tags (point to a tag object)
- HEAD: a symbolic ref pointing to the currently checked-out branch, or a bare SHA when detached
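Resolving HEAD to a commit is a matter of following these files. The sketch below builds a fake .git directory in a temporary folder and resolves HEAD through a symbolic ref. The helper names and the placeholder SHA are made up for illustration, and the sketch reads loose ref files only (it does not consult packed-refs).

```python
import tempfile
from pathlib import Path

# Placeholder value for illustration only; not a real commit in any repository.
PLACEHOLDER_SHA = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"


def make_fake_git_dir(root):
    """Create the minimal files needed to resolve HEAD -> refs/heads/main."""
    git_dir = Path(root) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_bytes(PLACEHOLDER_SHA.encode() + b"\n")
    (git_dir / "HEAD").write_bytes(b"ref: refs/heads/main\n")
    return git_dir


def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow 'ref: <target>' indirections until a bare SHA is found."""
    for _ in range(max_depth):
        content = (Path(git_dir) / name).read_text().strip()
        if not content.startswith("ref: "):
            return content          # a bare SHA: branch tip or detached HEAD
        name = content[len("ref: "):]
    raise ValueError("symbolic ref chain too deep")


with tempfile.TemporaryDirectory() as tmp:
    git_dir = make_fake_git_dir(tmp)
    print(resolve_ref(git_dir))                               # the placeholder SHA
    print((git_dir / "refs/heads/main").stat().st_size)       # 41
```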
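When a git reset --hard or an accidental force-push leaves engineers panicking about "lost" commits, the reflog almost always saves you. Reflog entries are kept for 90 days by default (gc.reflogExpire); entries for commits no longer reachable from the branch tip expire sooner, after 30 days by default (gc.reflogExpireUnreachable). On shared remotes (GitHub, GitLab), the reflog is not exposed, but locally you can usually recover the commit with git reflog as long as its entry has not expired and git gc has not pruned the object.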
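The reflog itself is a plain text file (for HEAD, .git/logs/HEAD) with one line per ref update, oldest first. As a rough sketch of how a recovery works, the code below parses that line format and finds the SHA that HEAD pointed to just before the most recent reset. The sample entries, identity, and function names are invented for illustration.

```python
import re

# <old sha> <new sha> <name and email> <unix time> <tz> TAB <message>
REFLOG_LINE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<who>.*) (?P<when>\d+) (?P<tz>[+-]\d{4})\t(?P<msg>.*)$"
)


def parse_reflog(text):
    """Return one dict per well-formed reflog line, in file order (oldest first)."""
    entries = []
    for line in text.splitlines():
        match = REFLOG_LINE.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def sha_before_last_reset(entries):
    """The 'old' SHA of the newest reset entry is the commit that was 'lost'."""
    for entry in reversed(entries):
        if entry["msg"].startswith("reset:"):
            return entry["old"]
    return None


# Hypothetical log: two commits, then a hard reset back to the first one.
ZERO, A, B = "0" * 40, "a" * 40, "b" * 40
WHO = "Ada Example <ada@example.com>"
SAMPLE = "\n".join([
    f"{ZERO} {A} {WHO} 1700000000 +0000\tcommit (initial): first",
    f"{A} {B} {WHO} 1700000100 +0000\tcommit: second",
    f"{B} {A} {WHO} 1700000200 +0000\treset: moving to HEAD~1",
])

entries = parse_reflog(SAMPLE)
print(sha_before_last_reset(entries) == B)
```

In a real repository the recovered SHA would then be handed to something like git branch or git reset to make the commit reachable again.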
How Packs and the Object Graph Enable Scale
In large monorepos (think Chromium at 900K commits, or Linux at 1.1M), the loose-object store would be unmanageable. Git uses packfiles and delta compression: instead of storing every version of a file in full, it stores one full copy and binary diffs (deltas) between similar objects. git gc (garbage collection) triggers packing, and Git also runs it automatically once enough loose objects accumulate. Hosting platforms such as GitHub repack server-side on their own schedule.
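The idea behind delta compression can be illustrated with copy and insert instructions against a base version. The sketch below uses Python's difflib to find matching regions; this is a stand-in chosen for brevity, not Git's actual delta algorithm or packfile encoding, and the function names and sample data are hypothetical.

```python
import difflib


def make_delta(base, target):
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # bytes not present in base
        # "delete": bytes of base that the target drops, so emit nothing
    return ops


def apply_delta(base, ops):
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Two versions of a file that differ in a single line.
v1 = b"".join(b"line %d\n" % i for i in range(50))
v2 = v1.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(v1, v2)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")

print(apply_delta(v1, delta) == v2)
print(len(v2), literal_bytes)  # full size versus bytes the delta must carry
```

Only the changed bytes travel as literals; everything else is a reference into the base, which is why many near-identical versions of a file cost little extra space.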
Avoid running git gc --aggressive on a live shared remote. It rewrites all pack delta chains and can take hours on large repos. On managed platforms (GitHub, GitLab, Bitbucket), let the platform handle repacking; they run it asynchronously while continuing to serve clones. On self-hosted Gitea or bare repos, schedule git gc during low-traffic windows.
The Content-Addressed Model in Production
Understanding that Git is content-addressed has direct implications for DevOps work:
- Reproducible builds: Pinning a dependency to a Git commit SHA (not a branch name) is deterministic, because the same SHA always means the exact same tree. This is why Kubernetes manifest repos, Terraform modules, and Go modules can all be pinned to commit SHAs.
- Integrity verification: git fsck can detect data corruption from disk failures. Any bit flip changes the object's computed hash so that it no longer matches the object's name, which immediately signals an error.
- Shallow clones in CI: git clone --depth=1 fetches only the tip commit and its tree, with no history. This is why CI systems such as GitHub Actions can check out a large repository quickly for a build job. The trade-off: git log shows only the tip commit, and git bisect has no history to search.
- Partial clones: git clone --filter=blob:none fetches commits and trees but lazily downloads blobs on demand. Combined with sparse checkout, this is how large monorepo teams work with subsets of a repository without downloading terabytes.
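The integrity check in the list above boils down to recomputing each object's hash and comparing it with the name it is stored under. The sketch below shows that check on an in-memory loose object, assuming SHA-1. It illustrates the principle only; git fsck performs many more checks than this, and the function names are hypothetical.

```python
import hashlib
import zlib


def encode_loose(obj_type, payload):
    """Return (object ID, compressed bytes) as they would be stored on disk."""
    store = f"{obj_type} {len(payload)}".encode() + b"\x00" + payload
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def verify_loose(expected_sha, compressed):
    """True only if the data decompresses and hashes back to its own name."""
    try:
        raw = zlib.decompress(compressed)
    except zlib.error:
        return False  # damaged compressed stream
    return hashlib.sha1(raw).hexdigest() == expected_sha


sha, data = encode_loose("blob", b"deploy: replicas=3\n")

# Case 1: the stored content was altered (a single character changed).
_, tampered = encode_loose("blob", b"deploy: replicas=2\n")

# Case 2: a bit flip in the last byte of the file, which holds zlib's checksum.
flipped = data[:-1] + bytes([data[-1] ^ 0x01])

print(verify_loose(sha, data))      # True
print(verify_loose(sha, tampered))  # False
print(verify_loose(sha, flipped))   # False
```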
Git can also create SHA-256 repositories (git init --object-format=sha256). GitHub has not yet migrated, but major internal platforms are evaluating it. The object model is identical; only the hash function changes. For now, nearly all production repos you will encounter use SHA-1. The known SHA-1 collision (SHAttered, 2017) is mitigated in Git via a collision-detection library; real-world attacks on Git repos remain theoretical.
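To make "only the hash function changes" concrete, the sketch below hashes the same object bytes with both algorithms. It assumes, as described above, that a SHA-256 repository keeps the same header-plus-payload layout; the function name object_id is made up for illustration.

```python
import hashlib


def object_id(obj_type, payload, algo="sha1"):
    """Hash the same '<type> <size>' + NUL + payload bytes with the chosen algorithm."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.new(algo, header + payload).hexdigest()


content = b"hello world\n"
sha1_id = object_id("blob", content, "sha1")
sha256_id = object_id("blob", content, "sha256")

print(len(sha1_id), len(sha256_id))  # 40 64
```

The practical consequence is that object IDs from a SHA-1 repository and a SHA-256 repository are not interchangeable, even for byte-identical content.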
With this foundation (blobs, trees, commits, refs, and the DAG), most other Git behavior becomes predictable. Branching is just writing a 41-byte file. Merging is creating a commit with two or more parents. Rebasing is rewriting commit objects with new parent SHAs. You now have the mental model to diagnose Git problems at the object level.