Git & GitHub
Understanding Git Architecture
To use Git effectively, it's crucial to understand how Git stores and manages your data. Unlike other version control systems, Git has a unique architecture that makes it fast, reliable, and powerful. In this lesson, we'll explore Git's internal structure and how it tracks changes.
The Three States of Git
Git has three main states that your files can be in:
1. Modified (Working Directory)
You have changed files but haven't committed them yet.
These are tracked files whose contents differ from the last committed version.
2. Staged (Staging Area / Index)
You have marked modified files to go into your next commit.
Files are ready to be committed.
3. Committed (Repository / .git directory)
Data is safely stored in your local database.
The snapshot is now part of your project's history.
The Git Workflow:
[Working Directory] → git add → [Staging Area] → git commit → [Repository]
(Modified) (Staged) (Committed)
Example:
1. Edit file.txt (Modified)
2. git add file.txt (Staged)
3. git commit -m "Update file" (Committed)
Key Concept: The staging area (also called the "index") is one of Git's most distinctive features. It lets you prepare a commit carefully by selecting exactly which changes to include.
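The three states can be pictured with three small dictionaries. The following is only a toy sketch in Python, not how Git is implemented: the names `working`, `index`, `committed`, and `file_state` are made up for illustration, and a plain SHA-1 of the content stands in for Git's real object IDs.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Fingerprint file content (a stand-in for Git's object ID)."""
    return hashlib.sha1(data).hexdigest()


def file_state(name, working, index, committed):
    """Classify one file by comparing the three areas (toy model)."""
    work_id = content_id(working[name])
    if index.get(name) != work_id:
        return "modified"   # working copy differs from what is staged
    if committed.get(name) != work_id:
        return "staged"     # staged, but not yet in the last commit
    return "committed"      # all three areas agree


# Hypothetical example data: one file that has just been edited.
working = {"file.txt": b"new text\n"}
index = {"file.txt": content_id(b"old text\n")}
committed = {"file.txt": content_id(b"old text\n")}

print(file_state("file.txt", working, index, committed))  # modified

index["file.txt"] = content_id(working["file.txt"])       # like: git add
print(file_state("file.txt", working, index, committed))  # staged

committed = dict(index)                                    # like: git commit
print(file_state("file.txt", working, index, committed))  # committed
```

In a real repository, `git status` reports these states for you.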
The Three Main Sections
1. Working Directory
What it is:
• A single checkout of one version of the project
• Files extracted from the compressed database in .git
• The actual files you see and edit on your computer
Location:
Your project folder (e.g., /Users/john/my-project/)
Purpose:
Where you make changes to your files
2. Staging Area (Index)
What it is:
• A file (stored in .git/index)
• Stores information about what will go into your next commit
• Also called the "index"
Purpose:
• Lets you prepare commits carefully
• You can stage some changes while leaving others unstaged
• Allows for atomic, logical commits
Example:
You modified 5 files but only want to commit 3:
git add file1.txt file2.txt file3.txt
(file4.txt and file5.txt remain unstaged)
3. Repository (.git directory)
What it is:
• Where Git stores the metadata and object database
• Located in the .git directory in your project root
• Contains complete history of your project
Location:
.git/ directory (hidden folder)
Purpose:
• Permanent storage of all commits
• Complete project history
• Branches, tags, configuration
Important: Never manually edit files in the .git directory unless you know exactly what you're doing. Git manages this directory automatically.
Inside the .git Directory
Let's explore the structure of the .git directory:
.git/
├── HEAD # Points to current branch
├── config # Repository-specific configuration
├── description # Repository description (for GitWeb)
├── index # Staging area
├── hooks/ # Client and server-side hooks
├── info/ # Per-repository exclude patterns (info/exclude)
├── objects/ # Object database (commits, trees, blobs)
│ ├── pack/ # Packed objects for efficiency
│ └── info/ # Object database info
├── refs/ # References (branches and tags)
│ ├── heads/ # Local branches
│ ├── remotes/ # Remote-tracking branches
│ └── tags/ # Tags
└── logs/ # History of ref updates
Key Files and Directories:
HEAD:
Points to the current branch you're on.
Example content: ref: refs/heads/main
config:
Repository-specific configuration settings.
Overrides global Git config for this repository.
objects/:
The object database - where Git stores all content.
Contains blobs, trees, commits, and tags.
refs/:
References to commits (branches and tags).
refs/heads/ contains local branches.
refs/remotes/ contains remote tracking branches.
index:
The staging area (binary file).
Git Objects: How Git Stores Data
Git stores all content as objects. There are four types of objects:
1. Blob (Binary Large Object)
What it stores:
• File content (the actual data in your files)
• No filename, no directory structure
• Just pure content
Example:
If you have a file "hello.txt" with content "Hello, World!"
Git creates a blob object containing "Hello, World!"
The blob is identified by its SHA-1 hash.
Key Point:
If two files have identical content, Git stores only ONE blob.
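A blob's ID can be reproduced outside Git. The sketch below assumes the standard SHA-1 object format, in which Git hashes a short header (`blob`, the content size, and a NUL byte) followed by the content; `blob_id` is a helper name chosen for this example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The filename plays no part in the ID, so identical content in two
# differently named files maps to one and the same blob.
hello_txt = b"Hello, World!\n"
copy_txt = b"Hello, World!\n"
print(blob_id(hello_txt) == blob_id(copy_txt))  # True
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (empty file)
```

Inside a repository, `git hash-object <file>` prints the same ID for a file with the same bytes.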
2. Tree
What it stores:
• Directory structure
• References to blobs (files) and other trees (subdirectories)
• Filenames and permissions
Example:
project/
├── README.md (blob: abc123)
└── src/
    └── main.js (blob: def456)
Tree object contains:
- README.md → blob abc123
- src → tree xyz789
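To make the tree structure concrete, here is a rough Python sketch that serializes tree entries and hashes them. It assumes the usual SHA-1 object format (each entry is a mode, a space, the name, a NUL byte, then the 20 raw bytes of the referenced object's ID). The file contents are invented, `object_id` and `tree_body` are hypothetical helper names, and the sorting is simplified to ASCII names.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Hash an object the way Git does: header, NUL byte, then the body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries into a tree object's body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts subdirectories as if their name ended with "/".
        return name + "/" if mode == "40000" else name

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return body


readme = object_id(b"blob", b"# Project\n")            # hypothetical content
main_js = object_id(b"blob", b"console.log('hi');\n")  # hypothetical content

src_tree = object_id(b"tree", tree_body([("100644", "main.js", main_js)]))
root_tree = object_id(b"tree", tree_body([
    ("100644", "README.md", readme),
    ("40000", "src", src_tree),
]))
print(root_tree)                # ID of the whole project snapshot
print(object_id(b"tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The last line prints the well-known ID of an empty tree. In a real repository, `git cat-file -p <tree-id>` shows a tree's entries in readable form.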
3. Commit
What it stores:
• Reference to a tree object (project snapshot)
• Author name and email
• Committer name and email
• Commit message
• Parent commit(s) reference
• Timestamp
Example commit object:
tree abc123def456... (snapshot of project)
parent 789xyz123... (previous commit)
author John Doe <john@example.com> 1234567890 -0500
committer John Doe <john@example.com> 1234567890 -0500
Update file (the commit message, separated from the headers by a blank line)
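The commit layout above can be assembled by hand. This is a minimal sketch, assuming the standard text format of a commit object (header lines, a blank line, then the message); `commit_id` is a helper invented for this example, and the author line reuses the sample identity from the text.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Build a commit object's text and return (object ID, text)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest(), body.decode()


who = "John Doe <john@example.com> 1234567890 -0500"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of an empty tree

# A root commit has no parent line; the second commit points back at it.
first_id, first_text = commit_id(empty_tree, [], who, who, "Initial commit")
second_id, second_text = commit_id(empty_tree, [first_id], who, who, "Update file")
print(second_text)
```

A merge commit would simply carry more than one `parent` line.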
4. Tag (Annotated)
What it stores:
• Reference to a commit
• Tagger name and email
• Tag message
• Tag name
Used for marking specific points in history (releases).
Important: All Git objects are immutable. Once created, they never change. This is fundamental to Git's data integrity model.
Understanding SHA-1 Hashes
Every object in Git is identified by a SHA-1 hash:
What is SHA-1?
• Secure Hash Algorithm 1
• Produces a 40-character hexadecimal string
• Acts as a unique fingerprint for content
Example SHA-1:
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0 (illustrative; only the digits 0-9 and the letters a-f can appear)
Properties:
• Same content always produces same hash
• Different content produces a different hash (an accidental collision is astronomically unlikely)
• Even tiny change produces completely different hash
Example:
Content: "Hello, World!" → Hash: 8ab686ea...
Content: "Hello, World." → Hash: (a completely unrelated value)
(One character difference, completely different hash)
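You can observe this effect yourself. The sketch below hashes two one-character variants as Git blobs and counts how many hex digits differ. It assumes each text is stored in a file ending with a newline, and `blob_id` is a helper name chosen for this example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over Git's blob header plus the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Assumption: each file ends with a newline, as `echo` would write it.
first = blob_id(b"Hello, World!\n")
second = blob_id(b"Hello, World.\n")

# Count the positions at which the two 40-character IDs differ.
differing = sum(1 for x, y in zip(first, second) if x != y)
print(first)
print(second)
print(f"{differing} of {len(first)} hex digits differ")
```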
Why SHA-1 Matters:
1. Data Integrity:
You can't change committed content without Git knowing.
2. Unique Identification:
Every commit, file, and tree has a unique ID.
3. Efficient Storage:
Git can quickly find and compare objects.
4. Content Addressing:
Git stores content based on its hash, not filename.
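Content addressing can be sketched in a few lines. The example assumes Git's loose-object layout for SHA-1 repositories: the object is zlib-compressed and stored in a file whose path is derived from its own hash (the first two hex characters form a directory, the remaining 38 the filename). A temporary directory stands in for `.git/objects`, and the function names are invented.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, kind: bytes, body: bytes) -> str:
    """Store an object under objects_dir, named after its own SHA-1."""
    raw = kind + b" " + str(len(body)).encode("ascii") + b"\0" + body
    oid = hashlib.sha1(raw).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]   # 2-char directory, 38-char file
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> bytes:
    """Look an object up by ID and return its body (header stripped)."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    return body


with tempfile.TemporaryDirectory() as tmp:      # stand-in for .git/objects
    objects_dir = Path(tmp)
    oid = write_loose_object(objects_dir, b"blob", b"Hello Git\n")
    stored = sorted(p.relative_to(objects_dir).as_posix()
                    for p in objects_dir.rglob("*") if p.is_file())
    body = read_loose_object(objects_dir, oid)

print(stored)   # one file: "<first 2 hex chars>/<remaining 38>"
print(body)     # b'Hello Git\n'
```

Because the path is derived from the content, looking up an object never requires knowing its filename in the project.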
Fun Fact: Git has been adding support for SHA-256 object names for stronger security, although SHA-1 is still what most repositories use and it has served Git's purposes remarkably well.
How Git Stores Data: Snapshots, Not Deltas
This is where Git differs fundamentally from other version control systems:
Traditional VCS (Delta-based):
Stores differences between versions.
File V1: "Hello"
File V2: +", World" (stores the difference)
File V3: +"!" (stores the difference)
To get V3, you need: V1 + changes to V2 + changes to V3
Git (Snapshot-based):
Stores complete snapshots of your project.
Commit 1: Full snapshot (tree + blobs)
Commit 2: Full snapshot (tree + blobs)
Commit 3: Full snapshot (tree + blobs)
To get any version: Just read that commit's tree.
But wait, isn't that wasteful?
No! Git is smart:
1. Unchanged files:
If a file hasn't changed, Git doesn't create a new blob.
It just points to the existing blob.
2. Pack files:
Git periodically compresses objects into pack files.
Inside a pack, similar objects can be stored as deltas against one another, purely as a storage optimization.
Result:
• Fast operations (no need to calculate deltas)
• Efficient storage (no duplicate content)
• Simple model (snapshots are easier to reason about)
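The reuse of unchanged files follows directly from content addressing. Here is a toy sketch, with a dictionary standing in for the object database; `snapshot` and `store` are hypothetical names and the file contents are invented.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over Git's blob header plus the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def snapshot(files: dict, store: dict) -> dict:
    """Record every file in the store and return a name -> blob ID mapping."""
    listing = {}
    for name, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)   # no-op if this content is already stored
        listing[name] = oid
    return listing


store = {}  # toy object database: blob ID -> content
commit1 = snapshot({"README.md": b"# Demo\n", "main.js": b"let x = 1;\n"}, store)
blobs_after_first = len(store)

# Second commit: only main.js changes; README.md is byte-for-byte the same.
commit2 = snapshot({"README.md": b"# Demo\n", "main.js": b"let x = 2;\n"}, store)

print(blobs_after_first, len(store))                 # 2 3
print(commit1["README.md"] == commit2["README.md"])  # True
```

Each snapshot lists every file, yet the second one added only a single new blob to the store.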
Key Insight: Git's snapshot model makes branching extremely cheap, because creating or moving a branch only updates a pointer, and checking out any commit only requires reading that commit's tree rather than replaying a chain of deltas.
References: HEAD, Branches, and Tags
HEAD
What is HEAD?
• A pointer to your current location in the repository
• Usually points to a branch reference
• Determines what you see in your working directory
Example .git/HEAD content:
ref: refs/heads/main
This means HEAD → main branch → latest commit on main
When you commit:
HEAD → current branch → new commit
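The chain HEAD → branch → commit can be followed with plain file reads. This sketch builds a mock `.git` layout in a temporary directory, so the commit ID is made up and `resolve_head` is a hypothetical helper. It also covers the case where HEAD holds a commit ID directly instead of naming a branch.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit ID, via a branch file if HEAD is symbolic."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_name = head[len("ref: "):]             # e.g. refs/heads/main
        return (git_dir / ref_name).read_text().strip()
    return head                                     # HEAD holds a commit ID directly


commit = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"  # made-up commit ID

with tempfile.TemporaryDirectory() as tmp:           # mock .git layout
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_text(commit + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    resolved = resolve_head(git_dir)

print(resolved == commit)  # True
```

In a real repository, prefer `git rev-parse HEAD`: Git may keep refs in a packed file rather than as individual files, which this sketch does not handle.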
Branches
What are branches in Git?
• Just pointers to commits
• Stored in .git/refs/heads/
• Lightweight (a 41-byte file: 40 hex characters plus a newline)
Example .git/refs/heads/main:
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0
This is just the SHA-1 hash of the commit the branch points to.
Creating a branch:
Just create a new file in refs/heads/ with a commit hash.
That's it! No copying of files or history.
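To see how little a branch involves, the sketch below imitates `refs/heads/` in a temporary directory and "creates a branch" by writing one small file. The commit ID and the branch name `feature` are invented; in a real repository, use `git branch` instead of writing ref files by hand.

```python
import tempfile
from pathlib import Path

commit = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"  # made-up commit ID

with tempfile.TemporaryDirectory() as tmp:           # mock .git/refs/heads
    heads = Path(tmp) / "refs" / "heads"
    heads.mkdir(parents=True)
    (heads / "main").write_bytes(commit.encode() + b"\n")

    # "Creating a branch" is writing one more small file with the same ID.
    (heads / "feature").write_bytes((heads / "main").read_bytes())

    branches = sorted(p.name for p in heads.iterdir())
    size = (heads / "feature").stat().st_size

print(branches)  # ['feature', 'main']
print(size)      # 41 (40 hex characters plus a newline)
```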
Tags
What are tags?
• References to specific commits
• Stored in .git/refs/tags/
• Unlike branches, tags don't move
Types:
1. Lightweight tag: Just a pointer (like a branch that doesn't move)
2. Annotated tag: A full object with message, tagger, date
Used for:
Marking releases (v1.0.0, v2.1.3, etc.)
Git's Data Integrity Model
Git's Integrity Guarantees:
1. Checksum Everything:
Every file and commit is checksummed before storage.
You can't change content without Git knowing.
2. Content-Addressed Storage:
Files are stored by their content hash.
Same content = same hash = stored once.
3. Append-Only:
Git generally only adds data, never removes.
Even "deleted" commits can often be recovered.
4. Commit Chain:
Each commit references its parent.
You can't change history without changing the hash of every commit that follows.
Result:
It's nearly impossible to lose data or have corruption go undetected.
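The commit-chain guarantee can be demonstrated with a simplified model. In this sketch a commit's ID depends only on its parent's ID and its message (real commits also include a tree, author, and timestamps); `commit_id` and `chain` are hypothetical helpers.

```python
import hashlib


def commit_id(parent: str, message: str) -> str:
    """Toy commit ID: a hash over the parent's ID and this commit's message."""
    body = f"parent {parent}\n\n{message}\n".encode()
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def chain(messages):
    """Return the IDs of a linear history built from the given messages."""
    ids, parent = [], ""
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids


original = chain(["First commit", "Add feature", "Fix bug"])
tampered = chain(["First commit (edited)", "Add feature", "Fix bug"])

# Editing the oldest commit changes its ID, and therefore every later ID too.
print([a == b for a, b in zip(original, tampered)])  # [False, False, False]
```

This is why rewriting an old commit produces a visibly different history rather than a silent change.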
Exploration Exercise:
Let's explore the .git directory:
- Create a new directory and initialize a Git repository:
mkdir git-test && cd git-test
git init
- Look at the .git directory structure:
ls -la .git/
- Check what HEAD points to:
cat .git/HEAD
- Create a file, stage it, and commit:
echo "Hello Git" > test.txt
git add test.txt
git commit -m "First commit"
- Look at the objects directory:
find .git/objects -type f
- Check the commit hash:
git log --oneline
What you learned: You saw how Git creates objects and references as you work!
Summary
In this lesson, you learned:
- Git has three main states: Modified, Staged, and Committed
- The working directory, staging area, and repository form Git's architecture
- The .git directory contains all Git data and metadata
- Git stores data as four types of objects: blobs, trees, commits, and tags
- SHA-1 hashes uniquely identify all objects and ensure data integrity
- Git uses snapshots, not deltas, for fast and reliable operations
- HEAD, branches, and tags are just pointers to commits
- Git's architecture ensures data integrity and makes branching lightweight
Next Up: In the next lesson, we'll put this knowledge into practice by learning the basic Git workflow - creating repositories, staging changes, and making commits!