How ZIP Compression Works — Behind the Scenes (With Visuals)
ZIP file compression explained
You click "Compress to ZIP" on a 50MB folder and seconds later you have a 12MB file. Same files. Same data. A quarter of the size.
How does that actually work?
This post goes behind the scenes — no fluff, just the real mechanics of how ZIP finds patterns, squeezes them down, and puts everything back perfectly when you unzip.
ZIP is not magic. It's a lossless compression format — meaning it shrinks your files but when you unzip, you get back exactly the same data, bit for bit. Nothing is lost or approximated.
It achieves this through two algorithms working together:

1. LZ77, which finds repeated data and replaces it with back-references
2. Huffman coding, which re-encodes what remains using fewer bits for common bytes
Let's look at each one.
Imagine you have a text file that says:
the cat sat on the mat and the cat sat

A human reading this notices "the cat sat" appears twice. LZ77 does the same thing — but on raw bytes, not just words.
LZ77 scans through your file using a sliding window — think of it like a magnifying glass moving across the data. As it moves forward, it looks back and asks:
"Have I seen this sequence of bytes before? If yes, how far back, and how long was the match?"
Instead of storing the repeated data again, it stores a reference:
"the cat sat on the mat and " → (go back 27 chars, copy 11 chars)| Original | LZ77 Output |
|---|---|
ABCABCABC |
ABC + (go back 3, copy 6) |
| 9 bytes | ~4 bytes |
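The idea can be sketched in a few lines of Python. This is a toy encoder, not Deflate itself: real implementations use a 32 KB window, hash tables for fast match-finding, and length/distance limits, but the back-reference logic is the same.

```python
def lz77_encode(data: bytes, window: int = 64) -> list:
    """Emit ('lit', byte) and ('ref', distance, length) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Look back through the window for the longest match starting at i.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:                       # worth a back-reference
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:                                   # emit a literal byte
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decode(tokens: list) -> bytes:
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:                                   # copy from earlier output
            _, dist, length = t
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)
```

Note that a match may overlap the position being encoded: `ABCABCABC` becomes `ABC` plus (back 3, copy 6), because the copy reads bytes it has just written.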
The more repetition in your file, the more LZ77 saves. This is why text files, source code, and CSVs compress so well: they are full of repeated words and structure.
After LZ77 has removed repetitions, Huffman Coding takes over. It handles the remaining data differently — by giving shorter binary codes to more common bytes.
In English text, the letter "e" appears far more often than "z". So why should both take up the same space?
Standard ASCII stores every character in 8 bits (1 byte), regardless of how common it is.
Huffman says: what if common characters used fewer bits, and rare ones used more?
The algorithm counts how often each byte appears in your data, then builds a binary tree:
            [root]
           /      \
     [common]    [rare]
      /    \      /   \
    'e'    't'  'z'   'q'
    (2)    (3)  (6)   (7)

(The numbers show each character's code length in bits.)

Frequent characters get shorter paths (fewer bits). Rare characters get longer paths. The result:
| Character | Frequency | Old (ASCII) | Huffman Code | Savings |
|---|---|---|---|---|
| 'e' | Very common | 8 bits | 2 bits | 75% |
| 't' | Common | 8 bits | 3 bits | 62% |
| 'z' | Rare | 8 bits | 6 bits | 25% |
The total file gets smaller because you're saving far more bits on common characters than you're spending on rare ones.
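The tree-building step can be sketched with a heap that repeatedly merges the two least-frequent nodes. This illustrates the classic algorithm; Deflate actually transmits canonical code *lengths* rather than the tree itself, but the resulting codes are equivalent.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Map each byte to its Huffman bit-string ('0'/'1' characters)."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreaker, tree).
    # A tree is either a symbol (leaf) or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least-frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))  # ...merged
        count += 1
    codes = {}
    def walk(node, prefix):                  # 0 = left branch, 1 = right
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

Running this on English text shows exactly the pattern in the table: common bytes get short codes, rare bytes get long ones, and no code is a prefix of another, so the bit stream decodes unambiguously.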
Here's the complete process from "click Compress" to the final .zip file:
Your original file (50MB)
│
▼
┌─────────────────────┐
│ 1. SCAN & ANALYZE │ ← ZIP reads your file and detects file types,
│ │ checks if compression will actually help
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 2. LZ77 ENCODING │ ← Sliding window finds repeated byte sequences
│ (De-duplication) │ and replaces them with back-references
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 3. HUFFMAN CODING │ ← Assigns shorter binary codes to frequent
│ (Re-encoding) │ bytes in the LZ77 output
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 4. ZIP CONTAINER │ ← Wraps compressed data with a header
│ (Packaging) │ containing: filename, date, checksum, sizes
└────────┬────────────┘
│
▼
final .zip file (12MB)

A .zip file isn't just compressed data. It has a specific internal structure:
┌────────────────────────────┐
│ LOCAL FILE HEADER │ ← filename, compression method, date
│ + COMPRESSED DATA │ ← the actual LZ77+Huffman compressed bytes
├────────────────────────────┤
│ (repeated for each file) │
├────────────────────────────┤
│ CENTRAL DIRECTORY │ ← index of all files and their locations
├────────────────────────────┤
│ END OF CENTRAL DIRECTORY │ ← total file count, size info
└────────────────────────────┘

The Central Directory at the end is like a table of contents. This is why ZIP files can be "browsed" before fully extracting — your computer reads the Central Directory first, sees what's inside, and only decompresses the files you choose to extract.
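You can see this with Python's standard zipfile module: infolist() reads only the Central Directory, so you get every file's name, sizes, and checksum without decompressing anything. The filenames here are made up for the example.

```python
import io
import zipfile

# Build a small in-memory archive with two (hypothetical) files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "the cat sat on the mat and the cat sat")
    zf.writestr("data.csv", "a,b,c\n" * 100)

# "Browse" it: only the Central Directory is read here.
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():          # one ZipInfo per directory entry
        print(info.filename, info.file_size, info.compress_size,
              hex(info.CRC))            # stored CRC-32 checksum
    text = zf.read("notes.txt")         # decompress just this one member
```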
Unzipping simply reverses the process:
Step 1 — Read the Central Directory
The unzipper jumps to the end of the .zip file, reads the index, and knows exactly where each file's compressed data begins.
Step 2 — Huffman Decoding
Using the Huffman code lengths stored at the start of each compressed block, each bit sequence is decoded back to literal bytes and LZ77 tokens.

Step 3 — LZ77 Decoding
Back-references are resolved: "go back 27 chars, copy 11" is expanded back to the actual repeated data.

Step 4 — Checksum Verification
The unzipper runs a CRC-32 checksum on the decompressed output and compares it to the checksum stored in the ZIP header. If they match, the file is verified identical to the original.
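The checksum step can be sketched with Python's zlib module, which exposes both raw Deflate and CRC-32. Here compression and decompression are simulated in one place rather than read from a real archive.

```python
import zlib

# What the zipper did: compress the data and record its CRC-32.
original = b"the cat sat on the mat and the cat sat"
stored_crc = zlib.crc32(original)            # checksum written to the header
compressed = zlib.compress(original)         # Deflate = LZ77 + Huffman

# What the unzipper does: decompress, then verify against the stored CRC.
decompressed = zlib.decompress(compressed)
assert zlib.crc32(decompressed) == stored_crc   # integrity verified
```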
ZIP file → Huffman decode → LZ77 expand → checksum check → original file ✅

You may have noticed that some ZIPs aren't much smaller than the originals. Here's why:
| File Type | Compressibility | Reason |
|---|---|---|
| .txt, .csv, .html | Excellent (60–90%) | High text repetition |
| .docx, .xlsx | Good (40–70%) | Already internally compressed, but with structure |
| .jpg, .mp3, .mp4 | Poor (0–5%) | Already heavily compressed |
| .png (lossless) | Moderate (10–30%) | Some redundancy possible |
| .exe, .dll | Moderate (30–50%) | Code has patterns but also random-looking sections |
| .zip inside .zip | Almost none | Already maximally compressed |
The golden rule: ZIP is most powerful on text, code, and uncompressed data. It cannot significantly compress files that are already compressed.
| Format | Algorithm | Compression | Speed | Best For |
|---|---|---|---|---|
| ZIP | Deflate (LZ77+Huffman) | Good | Fast | Universal compatibility |
| 7-ZIP (.7z) | LZMA | Better | Slower | Maximum compression |
| RAR | Proprietary | Good | Medium | Archives with recovery |
| GZIP | Deflate | Good | Fast | Linux/web servers |
| Brotli | Custom | Excellent | Variable | Web browser compression |
ZIP wins on compatibility — every operating system can open it without extra software.
When you compress a PDF using a tool like CommandPDF, a similar process happens: redundant data inside the file is found and re-encoded more compactly.
This is why a PDF with lots of images compresses differently than a text-heavy PDF — the text compresses well, the images less so (especially JPEGs already inside).
👉 Compress your PDF now — free at CommandPDF
| Step | What Happens |
|---|---|
| LZ77 | Finds and replaces repeated byte sequences with short back-references |
| Huffman Coding | Gives shorter binary codes to frequently-used bytes |
| ZIP container | Wraps compressed data with headers, filenames, and checksums |
| Unzip | Reverses both steps and verifies integrity via CRC checksum |
ZIP compression is elegant engineering: it exploits the natural redundancy in data to shrink files — then reconstructs everything perfectly on the other end. No data lost. No magic. Just math.