Git & Collaboration Workflows

Git Internals: Objects & Refs

Most engineers use Git daily without ever looking beneath the surface. But the engineers who truly master Git — the ones who rescue corrupted repositories, design branching strategies at scale, and build CI/CD pipelines that never lose work — understand the object model underneath. This lesson strips the magic away and shows you exactly what Git is: a content-addressed filesystem with a thin version-control layer on top.

The Object Store: Four Types That Explain Everything

Every piece of data Git tracks lives in .git/objects/. Git stores four types of objects, each identified by the SHA-1 hash (or SHA-256, in repositories created with that object format) of a short type-and-size header followed by the object's contents. The hash is the identity: change one byte and you get a completely different object. This is the content-addressed model.

Blob: raw file contents, nothing else. No filename, no permissions. Two files with identical contents share one blob.
Tree: a directory snapshot, stored as a list of (mode, name, SHA) entries pointing to blobs and other trees. It represents one directory at one moment in time.
Commit: a pointer to a root tree, zero or more parent commit SHAs, author/committer metadata, and a message. The commit is what gives Git its history graph.
Tag: an annotated tag object. It points to any object (usually a commit) and adds tagger identity, a date, a message, and optionally a GPG signature for release integrity.
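
The split between blobs (content only) and trees (names and modes) can be checked with plumbing commands. The following is a minimal sketch, assuming a POSIX shell, a SHA-1 repository, and the coreutils sha1sum tool; the throwaway repository and the file names a.txt and b.txt are made up for illustration.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q

# Two files with different names but identical contents
printf 'hello\n' > a.txt
cp a.txt b.txt

# hash-object -w writes a blob to the object store and prints its ID
id_a=$(git hash-object -w a.txt)
id_b=$(git hash-object -w b.txt)
echo "a.txt -> $id_a"
echo "b.txt -> $id_b"

# The ID is the hash of the header "blob <size>\0" followed by the file bytes
manual=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)
echo "manual -> $manual"

# Stage both files and write a tree: the names live here, not in the blob
git add a.txt b.txt
tree=$(git write-tree)
git cat-file -p "$tree"
shared=$(git cat-file -p "$tree" | grep -c "$id_a")
echo "tree entries pointing at the same blob: $shared"
```

Both files resolve to one blob ID, and the tree lists that ID twice under two different names.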

Key insight: Because every object is identified by its content hash, Git objects are immutable. You never modify an object; you create a new one. git commit --amend does not edit the old commit: it writes a brand-new commit object with a new SHA and moves the branch pointer. The old commit still exists, and stays reachable through the reflog, until its reflog entry expires and garbage collection prunes it.
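
The amend behavior described above can be observed directly. This is a small sketch under assumed conditions: a throwaway repository, a hypothetical file.txt, and a demo identity supplied through environment variables so it does not depend on your Git config.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q

echo one > file.txt
git add file.txt
git commit -q -m "first draft"
old=$(git rev-parse HEAD)

# Amend only the message: Git writes a new commit object
git commit -q --amend -m "final wording"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"

# The old commit object is still in the object store
git cat-file -t "$old"

# Both commits point at the same tree, because the files did not change
git rev-parse "$old^{tree}" "$new^{tree}"
```

The two commit IDs differ while the tree ID is shared, which is the immutability property at work: new metadata means a new commit, unchanged content means the same tree.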

Anatomy of the DAG

The commit history forms a Directed Acyclic Graph (DAG). Each commit points backward to its parent(s): a root commit has none, an ordinary commit has one, and a merge commit has two or more. This structure makes branching and merging cheap, because creating a branch copies no data and only writes a pointer.

Git object DAG: refs point to commits, commits point to trees, trees point to blobs. Unchanged files share blobs across commits — no redundant storage.
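
To see the graph structure in a concrete repository, the sketch below builds a tiny diverged history and merges it. The branch name feature, the file names, and the demo identity are assumptions made for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q

echo base > base.txt
git add base.txt
git commit -q -m "base"
trunk=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b feature
echo f > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q "$trunk"
echo t > trunk.txt
git add trunk.txt
git commit -q -m "trunk work"

# Histories have diverged, so this creates a real merge commit
git merge -q --no-edit feature

# A merge commit records one "parent" line per merged history
git cat-file -p HEAD
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
echo "parents of merge commit: $parent_count"

# base.txt never changed, so every commit refers to the same blob
git rev-parse HEAD:base.txt HEAD~2:base.txt

git log --oneline --graph
```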

Exploring the Object Store Directly

Everything below is runnable in any Git repository. Use git cat-file — the Swiss-army knife for plumbing — to inspect raw objects.

# Inspect the current commit object
git cat-file -t HEAD                  # prints: commit
git cat-file -p HEAD                  # pretty-print: tree SHA, parent SHA(s), author, message

# Inspect the root tree for that commit
git cat-file -p HEAD^{tree}           # lists mode, type, SHA, filename for every entry

# Inspect a blob (raw file contents)
git ls-tree -r HEAD                   # list all blobs recursively with SHAs
git cat-file -p <blob-sha>            # print raw file bytes — no metadata

# Walk the entire object graph manually
git log --oneline --graph --all       # DAG overview
git rev-list --objects HEAD           # every object reachable from HEAD

# Find the SHA for any ref
git rev-parse HEAD                    # full 40-char SHA
git rev-parse HEAD~3                  # three commits back
git rev-parse main@{yesterday}        # reflog-based: where main was yesterday

Every loose object under .git/objects/ is stored in a directory named after the first two hex characters of its hash, in a file named after the remaining characters (38 of them with SHA-1). Objects are zlib-compressed. Packfiles (in .git/objects/pack/) bundle many objects together for efficiency; you will see them in any cloned repo.
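
The on-disk layout of a loose object can be confirmed with a few lines of shell. This sketch assumes a fresh SHA-1 repository in a temporary directory; the variable names are illustrative.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q

# Write one blob from stdin and capture its ID
id=$(printf 'hello\n' | git hash-object -w --stdin)

# Split the ID into the directory part and the filename part
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
echo "object id : $id"
echo "stored at : .git/objects/$dir/$file"

# The file exists on disk; its bytes are zlib-compressed, so use cat-file to read it
ls -l ".git/objects/$dir/$file"
git cat-file -p "$id"
```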

Refs: Names That Point to SHAs

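In its simplest form, a ref is a file containing a SHA. refs/heads/main is a 41-byte file (40 hex characters plus a newline, with SHA-1) holding the SHA of the latest commit on main. When you run git commit, Git writes the new commit object, then rewrites that file with the new SHA. Refs that have not changed recently may instead be stored as lines in .git/packed-refs, so prefer git rev-parse over reading the files when scripting.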

refs/heads/* — local branches
refs/remotes/* — remote-tracking branches (read-only snapshots of what the remote had last time you fetched)
refs/tags/* — lightweight tags (just a SHA file) or annotated tags (point to a tag object)
HEAD — a symbolic ref pointing to the currently checked-out branch, or a bare SHA when detached
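
The ref mechanics above can be exercised with plumbing commands. This is a sketch, assuming the default files-based ref storage and a SHA-1 repository (a repository using the newer reftable format would not have these loose files); the branch name experiment is hypothetical.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q
echo x > x.txt
git add x.txt
git commit -q -m "initial"
head_sha=$(git rev-parse HEAD)

# Creating a branch is just writing a ref that holds a commit ID
git update-ref refs/heads/experiment "$head_sha"
cat .git/refs/heads/experiment
size=$(wc -c < .git/refs/heads/experiment | tr -d ' ')
echo "ref file size in bytes: $size"

# HEAD is a symbolic ref naming the checked-out branch
git symbolic-ref HEAD

# After packing, the loose file is gone but the ref still resolves
git pack-refs --all
cat .git/packed-refs
packed_sha=$(git rev-parse experiment)
echo "experiment -> $packed_sha"
```

The last step is why scripts are usually better off with git rev-parse or git for-each-ref than with cat on files under .git/refs/.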

# See all refs as raw files
ls .git/refs/heads/
cat .git/refs/heads/main              # prints the SHA directly

# HEAD as a symbolic ref
cat .git/HEAD                         # prints: ref: refs/heads/main (or a bare SHA if detached)

# Packed refs (Git packs many refs into a single file for performance)
cat .git/packed-refs                  # one "sha ref-name" line per packed ref

# Reflog: every position HEAD or a branch has ever pointed to
git reflog                            # every recorded HEAD movement, newest first (add -n 30 to limit)
git reflog show main                  # movements of the main branch tip

# Recover a "deleted" commit via reflog (critical rescue skill)
git reflog | grep <keyword>
git checkout -b rescue <sha-from-reflog>

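Production rescue pattern: When a git reset --hard or accidental force-push leaves engineers panicking about "lost" commits, the reflog almost always saves you. Reflog entries are kept for 90 days by default (gc.reflogExpire), but entries for commits no longer reachable from the current branch tip, which is exactly the situation after a reset, expire after 30 days (gc.reflogExpireUnreachable). On shared remotes (GitHub, GitLab), the reflog is not exposed, but any local clone that still holds the commits can recover them until those reflog entries expire and git gc prunes the objects.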
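
The rescue pattern can be rehearsed safely in a scratch repository. The sketch below assumes a hypothetical app.txt, a demo identity, and a branch named rescue; it uses git branch rather than git checkout -b so the current branch stays checked out.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q

echo v1 > app.txt
git add app.txt
git commit -q -m "v1"
echo v2 > app.txt
git commit -q -am "v2: important work"
lost=$(git rev-parse HEAD)

# Simulate the accident: the branch no longer points at the v2 commit
git reset -q --hard HEAD~1
git log --oneline

# The reflog still records where HEAD was before the reset
git reflog
found=$(git rev-parse 'HEAD@{1}')

# Put a branch on the "lost" commit to make it reachable again
git branch rescue "$found"
git log --oneline rescue
```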

How Packs and the Object Graph Enable Scale

In large repositories (think Chromium or the Linux kernel, each with on the order of a million commits), the loose-object store would be unmanageable. Git uses packfiles and delta compression: instead of storing every version of a file in full, it stores some objects whole and others as binary diffs (deltas) against similar objects. git gc (garbage collection) triggers packing. Hosting platforms such as GitHub repack repositories server-side on their own maintenance schedule rather than on every push.

# See pack statistics for your repo
git count-objects -vH

# Manually trigger a full repack with recomputed deltas (slow; local repos only)
git gc --aggressive --prune=now

# Verify the integrity of the entire object database
git fsck --full

# Find the largest objects in pack history (useful before git-lfs migration)
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sort -k3 -rn \
  | head -20
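
The move from loose objects to a packfile can be watched in a small scratch repository. This is a sketch with assumed names (notes.txt, three demo commits); it runs a plain git gc rather than the aggressive variant.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo"
git init -q

# Three commits, each with a new blob, a new tree, and a commit object
for i in 1 2 3; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# "count" is the number of loose objects, "in-pack" the number of packed ones
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc -q
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc : $loose_after"
echo "objects in pack: $packed"
ls .git/objects/pack/

# The object database should still verify cleanly after repacking
git fsck --full
```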

Do not run git gc --aggressive on a live shared remote. It recomputes all pack delta chains and can take hours on large repos. On managed platforms (GitHub, GitLab, Bitbucket), let the platform handle repacking; they run it asynchronously without interrupting clones and fetches. On self-hosted Gitea or bare repos, schedule git gc during low-traffic windows.

The Content-Addressed Model in Production

Understanding that Git is content-addressed has direct implications for DevOps work:

Reproducible builds: Pinning a dependency to a Git commit SHA (not a branch name) is deterministic, because the same SHA always names the exact same tree. This is why Kubernetes manifest repos, Terraform module sources, and Go module pseudo-versions can all be pinned to a specific commit.
Integrity verification: git fsck recomputes each object's hash and compares it with the object's name, so corruption from disk failures is detectable. Any bit flip produces a hash mismatch and is reported as an error.
Shallow clones in CI: git clone --depth=1 fetches only the tip commit with its trees and blobs, and no earlier history. This is why CI systems can clone a large repository quickly for a build job. The trade-off: git log shows only the fetched commit, and history-dependent commands such as git bisect and git blame have nothing to work with.
Partial clones: git clone --filter=blob:none creates a blobless partial clone that fetches commits and trees but downloads blobs lazily on demand. Combined with sparse checkout, a separate feature that limits which paths appear in the working tree, this lets large monorepo teams work with a subset of a repository without downloading all of it.
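
The clone variants and SHA pinning from the list above can be compared against a local source repository. This is a sketch under stated assumptions: the source repo is a throwaway with three demo commits, it is cloned over a file:// URL so that --depth and --filter are honored, and uploadpack.allowFilter is enabled on the source because a plain local repository does not serve filtered clones by default.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a small source repository with three commits
src=$(mktemp -d)
git init -q "$src"
cd "$src"
for i in 1 2 3; do
  echo "rev $i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
done
pinned=$(git rev-parse HEAD~1)
git config uploadpack.allowFilter true   # let this repo serve partial clones

work=$(mktemp -d)

# Full clone: all history
git clone -q "file://$src" "$work/full"
full_count=$(git -C "$work/full" rev-list --count HEAD)

# Shallow clone: only the tip commit
git clone -q --depth=1 "file://$src" "$work/shallow"
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "commits visible: full=$full_count shallow=$shallow_count"

# Blobless partial clone: commits and trees arrive, blobs are left for later
git clone -q --filter=blob:none --no-checkout "file://$src" "$work/partial"
missing=$(git -C "$work/partial" rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "blobs not yet downloaded in partial clone: $missing"

# Pinning to a SHA always yields the same content
git -C "$work/full" checkout -q --detach "$pinned"
cat "$work/full/file.txt"
```

Checking out the pinned SHA prints the content of the second commit regardless of where any branch points later.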

SHA-1 vs SHA-256: Git 2.29+ supports SHA-256 repositories (git init --object-format=sha256). GitHub has not yet migrated, and support on other hosting platforms varies. The object model is identical; only the hash function changes. For now, nearly all production repos you will encounter use SHA-1. The known SHA-1 collision (SHAttered, 2017) is mitigated in Git via a collision-detecting SHA-1 implementation; real-world attacks on Git repos remain theoretical.
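
A quick way to confirm that only the hash function changes is to hash the same blob in a SHA-256 repository. This sketch assumes Git 2.29 or newer and the coreutils sha256sum tool; the repository is a throwaway.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q --object-format=sha256

# Ask the repository which hash function it uses
format=$(git rev-parse --show-object-format)
echo "object format: $format"

# Same header-plus-content scheme as SHA-1, only with a longer digest
id=$(printf 'hello\n' | git hash-object --stdin)
manual=$(printf 'blob 6\0hello\n' | sha256sum | cut -d' ' -f1)
echo "git    : $id"
echo "manual : $manual"
echo "id length: ${#id}"
```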

With this foundation (blobs, trees, commits, refs, and the DAG), every other Git behavior becomes predictable. Branching is just writing a 41-byte file. A true merge creates a commit with two parents. Rebasing creates new commit objects with new parent SHAs and moves the branch ref to them. You now have the mental model to diagnose Git problems at the object level.