Git & GitHub

Understanding Git Architecture

To use Git effectively, it's crucial to understand how Git stores and manages your data. Unlike other version control systems, Git has a unique architecture that makes it fast, reliable, and powerful. In this lesson, we'll explore Git's internal structure and how it tracks changes.

The Three States of Git

Git has three main states that your files can be in:

1. Modified (Working Directory)
You have changed tracked files but haven't staged or committed the changes yet.
The changes exist only in your working directory.

2. Staged (Staging Area / Index)
You have marked modified files to go into your next commit.
Files are ready to be committed.

3. Committed (Repository / .git directory)
Data is safely stored in your local database.
Files are recorded in Git history as part of a commit.

The Git Workflow:

[Working Directory] → git add → [Staging Area] → git commit → [Repository]
     (Modified)                     (Staged)                   (Committed)

Example:
1. Edit file.txt (Modified)
2. git add file.txt (Staged)
3. git commit -m "Update file" (Committed)

Key Concept: The staging area (also called the "index") is one of Git's most distinctive features. It lets you prepare a commit deliberately by selecting exactly which changes to include.
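To watch a file move through the three states, the workflow above can be sketched in a throwaway repository. The temporary directory, the file name, and the "Example User" identity are placeholders chosen for this sketch. In the `git status --short` output, the first column describes the staging area and the second describes the working directory.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Start from a committed file so that later edits count as "modified".
echo "first version" > file.txt
git add file.txt
git commit -q -m "Add file"

echo "second version" > file.txt
modified=$(git status --short file.txt)    # changed in the working directory only

git add file.txt
staged=$(git status --short file.txt)      # change recorded in the staging area

git commit -q -m "Update file"
committed=$(git status --short file.txt)   # empty: nothing left to commit

printf 'modified:  [%s]\nstaged:    [%s]\ncommitted: [%s]\n' \
  "$modified" "$staged" "$committed"
```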

The Three Main Sections

1. Working Directory

What it is:
• A single checkout of one version of the project
• Files extracted from the compressed database in .git
• The actual files you see and edit on your computer

Location:
Your project folder (e.g., /Users/john/my-project/)

Purpose:
Where you make changes to your files

2. Staging Area (Index)

What it is:
• A file (stored in .git/index)
• Stores information about what will go into your next commit
• Also called the "index"

Purpose:
• Lets you prepare commits carefully
• You can stage some changes while leaving others unstaged
• Allows for atomic, logical commits

Example:
You modified 5 files but only want to commit 3:
git add file1.txt file2.txt file3.txt
(file4.txt and file5.txt remain unstaged)
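A runnable version of this scenario might look like the following sketch. It assumes a scratch repository with five committed files (the names and the placeholder identity are made up for illustration), edits all five, and stages only three.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Commit five files, then modify all of them.
for n in 1 2 3 4 5; do echo "draft $n" > "file$n.txt"; done
git add .
git commit -q -m "Add five files"
for n in 1 2 3 4 5; do echo "edit $n" >> "file$n.txt"; done

# Stage only the first three.
git add file1.txt file2.txt file3.txt

staged=$(git diff --cached --name-only)   # changes the next commit would include
unstaged=$(git diff --name-only)          # changes still only in the working directory

echo "Staged:";   echo "$staged"
echo "Unstaged:"; echo "$unstaged"
```

Running `git commit` at this point would record only the three staged files; the edits to file4.txt and file5.txt would stay in the working directory.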

3. Repository (.git directory)

What it is:
• Where Git stores the metadata and object database
• Located in the .git directory in your project root
• Contains complete history of your project

Location:
.git/ directory (hidden folder)

Purpose:
• Permanent storage of all commits
• Complete project history
• Branches, tags, configuration

Important: Never manually edit files in the .git directory unless you know exactly what you're doing. Git manages this directory automatically.

Inside the .git Directory

Let's explore the structure of the .git directory:

.git/
├── HEAD                  # Points to current branch
├── config               # Repository-specific configuration
├── description          # Repository description (for GitWeb)
├── index                # Staging area
├── hooks/               # Client and server-side hooks
├── info/                # Repository-local exclude patterns (info/exclude)
├── objects/             # Object database (commits, trees, blobs)
│   ├── pack/           # Packed objects for efficiency
│   └── info/           # Object database info
├── refs/                # References (branches and tags)
│   ├── heads/          # Local branches
│   ├── remotes/        # Remote-tracking branches
│   └── tags/           # Tags
└── logs/                # History of ref updates

Key Files and Directories:

HEAD:
Points to the current branch you're on.
Example content: ref: refs/heads/main

config:
Repository-specific configuration settings.
These override your global Git config for this repository.

objects/:
The object database - where Git stores all content.
Contains blobs, trees, commits, and tags.

refs/:
References to commits (branches and tags).
refs/heads/ contains local branches.
refs/remotes/ contains remote tracking branches.

index:
The staging area (binary file).

Git Objects: How Git Stores Data

Git stores all content as objects. There are four types of objects:

1. Blob (Binary Large Object)

What it stores:
• File content (the actual data in your files)
• No filename, no directory structure
• Just pure content

Example:
If you have a file "hello.txt" with content "Hello, World!"
Git creates a blob object containing "Hello, World!"
The blob is identified by its SHA-1 hash.

Key Point:
If two files have identical content, Git stores only ONE blob.
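The deduplication claim can be checked directly. In this sketch (file names are placeholders), two differently named files with the same content hash to the same blob ID, so Git only needs one object for both.

```shell
# Throwaway repository; the path comes from mktemp.
repo=$(mktemp -d) && cd "$repo"
git init -q

echo "Hello, World!" > hello.txt
cp hello.txt copy.txt                  # different name, identical content

id1=$(git hash-object -w hello.txt)    # -w also writes the blob into .git/objects
id2=$(git hash-object -w copy.txt)

echo "hello.txt -> $id1"
echo "copy.txt  -> $id2"

git cat-file -t "$id1"                 # prints the object type
git cat-file -p "$id1"                 # prints the stored content (no filename inside)
```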

2. Tree

What it stores:
• Directory structure
• References to blobs (files) and other trees (subdirectories)
• Filenames and permissions

Example:
project/
├── README.md (blob: abc123)
└── src/
    └── main.js (blob: def456)

Tree object contains:
- README.md → blob abc123
- src → tree 789fed (the tree for src/, which in turn lists main.js)
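To see real tree objects rather than placeholder IDs, a sketch like the following recreates the layout above in a scratch repository (the file contents and identity are made up) and prints the top-level tree of the latest commit.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

mkdir src
echo "# Project" > README.md
echo "console.log('hi');" > src/main.js
git add .
git commit -q -m "Add project files"

# Top-level tree: one entry per name, each with mode, type, object ID.
git cat-file -p 'HEAD^{tree}'

# The src entry is itself a tree; README.md is a blob.
src_type=$(git cat-file -t HEAD:src)
readme_type=$(git cat-file -t HEAD:README.md)
echo "src is a $src_type, README.md is a $readme_type"

# Walk all trees recursively to list every file path in the snapshot.
paths=$(git ls-tree -r --name-only HEAD)
echo "$paths"
```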

3. Commit

What it stores:
• Reference to a tree object (project snapshot)
• Author name and email
• Committer name and email
• Commit message
• Parent commit(s) reference
• Timestamp

Example commit object:
tree abc123def456...         (snapshot of project)
parent 789abc123...          (previous commit; absent on the very first commit)
author John Doe <john@example.com> 1234567890 -0500
committer John Doe <john@example.com> 1234567890 -0500

Commit message
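The same fields can be read from a real commit with `git cat-file -p`. This sketch makes two commits in a scratch repository (file name, messages, and identity are placeholders) so that the second one has a parent line.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version one" > file.txt
git add file.txt
git commit -q -m "First commit"

echo "version two, longer" > file.txt
git add file.txt
git commit -q -m "Second commit"

# Raw commit object: tree, parent, author, committer, blank line, message.
commit_text=$(git cat-file -p HEAD)
echo "$commit_text"

parent=$(git rev-parse HEAD~1)
echo "Parent should be: $parent"
```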

4. Tag (Annotated)

What it stores:
• Reference to a commit
• Tagger name and email
• Tag message
• Tag name

Used for marking specific points in history (releases).
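The word "Annotated" matters here: only annotated tags create a tag object. A lightweight tag is just a ref that points straight at a commit. The sketch below shows the difference; the tag names, message, and identity are placeholders.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "release notes" > notes.txt
git add notes.txt
git commit -q -m "Prepare release"

git tag -a v1.0 -m "First release"     # annotated: stores a tag object
git tag v1.0-light                     # lightweight: a plain ref to the commit

annotated_type=$(git cat-file -t v1.0)
light_type=$(git cat-file -t v1.0-light)
echo "v1.0 is a $annotated_type, v1.0-light is a $light_type"

# The tag object records the target, tag name, tagger, and message.
git cat-file -p v1.0
```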

Important: All Git objects are immutable. Once created, they never change. This is fundamental to Git's data integrity model.

Understanding SHA-1 Hashes

Every object in Git is identified by a SHA-1 hash:

What is SHA-1?
• Secure Hash Algorithm 1
• Produces a 40-character hexadecimal string
• Acts as a unique fingerprint for content

Example SHA-1 (the ID Git assigns to an empty blob):
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Properties:
• Same content always produces same hash
• Different content produces a different hash (accidental collisions are astronomically unlikely, although deliberately constructed SHA-1 collisions have been demonstrated)
• Even tiny change produces completely different hash

Example:
Content: "Hello, World!"  → Hash: 8ab686ea...
Content: "Hello, World."  → Hash: an unrelated 40-character value
(One character difference, completely different hash)
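To reproduce this yourself, the sketch below hashes both strings with `git hash-object` and then recomputes the first ID by hand. Git hashes a short header, `blob <size in bytes>` followed by a NUL byte, and then the content. The `sha1` helper is a convenience defined for this example; it assumes either `sha1sum` (GNU coreutils) or `shasum` (common on macOS) is installed.

```shell
# Hash two inputs that differ by a single character.
id_a=$(printf 'Hello, World!\n' | git hash-object --stdin)
id_b=$(printf 'Hello, World.\n' | git hash-object --stdin)
echo "Hello, World! -> $id_a"
echo "Hello, World. -> $id_b"

# Hypothetical helper: use whichever SHA-1 tool is available on this machine.
sha1() {
  if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi | cut -d' ' -f1
}

# Recompute the first ID manually: header "blob 14" + NUL + the 14 bytes of content.
manual=$(printf 'blob 14\0Hello, World!\n' | sha1)
echo "manual        -> $manual"
```

The manual value matching Git's output shows that an object ID depends only on the object type, size, and content, not on any filename.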

Why SHA-1 Matters:

1. Data Integrity:
   You can't change committed content without Git knowing.

2. Unique Identification:
   Every commit, file, and tree has a unique ID.

3. Efficient Storage:
   Git can quickly find and compare objects.

4. Content Addressing:
   Git stores content based on its hash, not filename.

Fun Fact: Git has added support for SHA-256 as an alternative object format for stronger security. SHA-1 is still what most repositories use, and it has worked remarkably well for Git's purposes.

How Git Stores Data: Snapshots, Not Deltas

This is where Git differs fundamentally from many older version control systems:

Traditional VCS (Delta-based):
Stores differences between versions.

File V1: "Hello"
File V2: +", World"  (stores the difference)
File V3: +"!"        (stores the difference)

To get V3, you need: V1 + changes to V2 + changes to V3

Git (Snapshot-based):
Stores complete snapshots of your project.

Commit 1: Full snapshot (tree + blobs)
Commit 2: Full snapshot (tree + blobs)
Commit 3: Full snapshot (tree + blobs)

To get any version: Just read that commit's tree.

But wait, isn't that wasteful?

No! Git is smart:

1. Unchanged files:
   If a file hasn't changed, Git doesn't create a new blob.
   It just points to the existing blob.

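2. Pack files:
   Git periodically compresses objects into pack files.
   Inside a pack, similar objects can be stored as deltas against each other.
   This is a storage optimization only; every commit still behaves as a full snapshot.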

Result:
• Fast operations (no chain of changes to replay to reach a version)
• Efficient storage (no duplicate content)
• Simple model (snapshots are easier to reason about)

Key Insight: Git's snapshot model makes branching and switching extremely fast. Creating or moving a branch only updates a pointer, and any version can be read directly from its commit's tree without replaying earlier changes.
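The claim that unchanged files are not stored again can be observed with `git rev-parse`, which resolves a `commit:path` pair to the blob ID stored in that snapshot. In this sketch (file names and identity are placeholders), one file changes between two commits and the other does not.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "stable content" > unchanged.txt
echo "draft one" > changing.txt
git add .
git commit -q -m "Commit 1"

echo "draft two, revised" > changing.txt
git add .
git commit -q -m "Commit 2"

# Blob IDs of each file in the previous and the current snapshot.
old_same=$(git rev-parse HEAD~1:unchanged.txt)
new_same=$(git rev-parse HEAD:unchanged.txt)
old_diff=$(git rev-parse HEAD~1:changing.txt)
new_diff=$(git rev-parse HEAD:changing.txt)

echo "unchanged.txt: $old_same -> $new_same"
echo "changing.txt:  $old_diff -> $new_diff"
```

Both snapshots point at the same blob for unchanged.txt, so the second commit only added a new blob for changing.txt plus a new tree and commit object.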

References: HEAD, Branches, and Tags

HEAD

What is HEAD?
• A pointer to your current location in the repository
• Usually points to a branch reference
• Determines what you see in your working directory

Example .git/HEAD content:
ref: refs/heads/main

This means HEAD → main branch → latest commit on main

When you commit:
HEAD → current branch → new commit

Branches

What are branches in Git?
• Just pointers to commits
• Stored in .git/refs/heads/
• Lightweight (a 41-byte file: a 40-character SHA-1 hash plus a newline)

Example .git/refs/heads/main (illustrative hash):
a1b2c3d4e5f60718293a4b5c6d7e8f9012345678

This is just the SHA-1 hash of the commit the branch points to.

Creating a branch:
Git just writes a new ref under refs/heads/ containing a commit hash.
That's it! No copying of files or history.
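A quick way to confirm this is to create a branch and read the file Git wrote. The sketch assumes a freshly created repository that uses Git's default file-based ref storage and SHA-1 object IDs; the branch name `feature` and the identity are placeholders.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "Hello Git" > test.txt
git add test.txt
git commit -q -m "First commit"

git branch feature                           # create a branch at the current commit

ref_content=$(cat .git/refs/heads/feature)   # the whole branch is this one line
ref_size=$(wc -c < .git/refs/heads/feature)  # 40 hex characters plus a newline
head_commit=$(git rev-parse HEAD)

echo "feature file contains: $ref_content"
echo "HEAD commit is:        $head_commit"
echo "file size in bytes:    $ref_size"
```

In long-lived repositories Git may move refs into a single `.git/packed-refs` file, so the individual file is not always present even though the branch exists.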

Git's Data Integrity Model

Git's Integrity Guarantees:

1. Checksum Everything:
   Every file and commit is checksummed before storage.
   You can't change content without Git knowing.

2. Content-Addressed Storage:
   Files are stored by their content hash.
   Same content = same hash = stored once.

3. Append-Only:
   Nearly all everyday Git operations only add data.
   Even "deleted" commits can often be recovered until garbage collection eventually prunes them.

4. Commit Chain:
   Each commit references its parent.
   Changing any commit changes its hash, and with it the hash of every commit that follows.

Result:
It's very hard to lose committed data or to have corruption go undetected.
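Two of these guarantees can be tried out in a scratch repository. The sketch below discards a commit with `git reset --hard`, shows that the object is still in the database and reachable through the reflog, and then runs `git fsck` to verify the stored objects. The file name, branch name `rescued`, and identity are placeholders.

```shell
# Throwaway repository; the path comes from mktemp and the identity is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "keep me" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "important work" >> notes.txt
git commit -q -am "Important work"
lost=$(git rev-parse HEAD)

git reset -q --hard HEAD~1           # the branch no longer points at $lost

git reflog -n 3                      # recent positions of HEAD, including the lost commit
lost_type=$(git cat-file -t "$lost") # the object itself is still stored
git branch rescued "$lost"           # give it a name again so it stays reachable

git fsck --full                      # checks object hashes and connectivity
fsck_status=$?
echo "lost commit is still a $lost_type; fsck exit status: $fsck_status"
```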

Exploration Exercise:

Let's explore the .git directory:

Create a new directory and initialize a Git repository:
```
mkdir git-test && cd git-test
git init
```
Look at the .git directory structure:
```
ls -la .git/
```
Check what HEAD points to:
```
cat .git/HEAD
```

Create a file, stage it, and commit:
```
echo "Hello Git" > test.txt
git add test.txt
git commit -m "First commit"
```

Look at the objects directory:
```
find .git/objects -type f
```
Check the commit hash:
```
git log --oneline
```

What you learned: You saw how Git creates objects and references as you work!

Summary

In this lesson, you learned:

• Git has three main states: Modified, Staged, and Committed
• The working directory, staging area, and repository form Git's architecture
• The .git directory contains all Git data and metadata
• Git stores data as four types of objects: blobs, trees, commits, and tags
• SHA-1 hashes uniquely identify all objects and ensure data integrity
• Git uses snapshots, not deltas, for fast and reliable operations
• HEAD, branches, and tags are just pointers to commits
• Git's architecture ensures data integrity and makes branching lightweight

Next Up: In the next lesson, we'll put this knowledge into practice by learning the basic Git workflow - creating repositories, staging changes, and making commits!