You click "Compress to ZIP" on a 50MB folder and seconds later you have a 12MB file. Same files. Same data. A quarter of the size.
How does that actually work?
This post goes behind the scenes — no fluff, just the real mechanics of how ZIP finds patterns, squeezes them down, and puts everything back perfectly when you unzip.
What ZIP Actually Is
ZIP is not magic. It's a lossless compression format — meaning it shrinks your files but when you unzip, you get back exactly the same data, bit for bit. Nothing is lost or approximated.
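To see "lossless" in action, here is a minimal sketch using Python's standard `zlib` module, which implements the same Deflate compression that ZIP normally uses. The sample text is made up for the demo.

```python
import zlib

# Made-up sample data: one sentence repeated many times.
original = b"the cat sat on the mat and the cat sat " * 1000

compressed = zlib.compress(original)     # Deflate-compress the bytes
restored = zlib.decompress(compressed)   # and expand them again

print(len(original), "->", len(compressed), "bytes")
assert restored == original              # identical, byte for byte
```

The assertion is the whole point: however small the compressed form gets, decompression has to return every original byte.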
ZIP achieves this through two algorithms working together, a combination known as Deflate:
- LZ77 — finds and eliminates repeated patterns in data
- Huffman Coding — replaces common bytes with shorter codes
Let's look at each one.
Step 1 — LZ77: Finding Repetitions
Imagine you have a text file that says:
the cat sat on the mat and the cat sat
A human reading this notices "the cat sat" appears twice. LZ77 does the same thing, but on raw bytes, not just words.
How LZ77 Works
LZ77 scans through your file using a sliding window — think of it like a magnifying glass moving across the data. As it moves forward, it looks back and asks:
"Have I seen this sequence of bytes before? If yes, how far back, and how long was the match?"
Instead of storing the repeated data again, it stores a reference:
"the cat sat on the mat and " + (go back 27 chars, copy 11 chars)
Real Example
| Original | LZ77 Output |
|---|---|
| ABCABCABC | ABC + (go back 3, copy 6) |
| 9 bytes | ~4 bytes |
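The back-reference idea can be sketched in a few lines of Python. This is a toy greedy encoder written for illustration, not the matcher any real ZIP tool uses; the function names, the 32 KB window, and the 3-byte minimum match are assumptions chosen to resemble Deflate's limits.

```python
def lz77_encode(data: bytes, window: int = 32 * 1024, min_len: int = 3) -> list:
    """Toy greedy LZ77: emit literal byte values or (distance, length) pairs."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlap); that is how
            # "ABC" + (go back 3, copy 6) can copy more than 3 bytes.
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out


def lz77_decode(tokens: list) -> bytes:
    """Expand literals and back-references into the original bytes."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte copy so overlaps work
        else:
            out.append(tok)
    return bytes(out)


tokens = lz77_encode(b"ABCABCABC")
print(tokens)  # [65, 66, 67, (3, 6)]
assert lz77_decode(tokens) == b"ABCABCABC"
```

The three literals are the byte values of "A", "B", and "C"; the final pair is the "go back 3, copy 6" reference from the table. Real encoders search far more efficiently (usually with hash chains) and may choose different matches.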
The more repetition in your file, the more LZ77 saves. This is why:
- Text files compress extremely well (lots of repeated words and patterns)
- Code files compress well (repeated keywords, syntax)
- Already-compressed files (like JPEGs or MP3s) barely shrink at all — the patterns are already gone
Step 2 — Huffman Coding: Shorter Codes for Common Things
After LZ77 has removed repetitions, Huffman Coding takes over. It handles the remaining data differently — by giving shorter binary codes to more common bytes.
The Idea
In English text, the letter "e" appears far more often than "z". So why should both take up the same space?
Plain ASCII text stores every character in one 8-bit byte, regardless of how common it is.
Huffman says: what if common characters used fewer bits, and rare ones used more?
Building the Huffman Tree
The algorithm counts how often each byte appears in your data, then builds a binary tree:
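          [root]
         /      \
   [common]    [rare]
    /   \       /   \
  'e'   't'   'z'   'q'
  (2)   (3)   (6)   (7)
The diagram is schematic: the numbers in parentheses are code lengths in bits, so in a real tree 'z' and 'q' would sit several levels deeper than 'e'.
Frequent characters get shorter paths (fewer bits). Rare characters get longer paths. The result: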
| Character | Frequency | Old (ASCII) | Huffman Code | Savings |
|---|---|---|---|---|
| 'e' | Very common | 8 bits | 2 bits | 75% |
| 't' | Common | 8 bits | 3 bits | 62.5% |
| 'z' | Rare | 8 bits | 6 bits | 25% |
The total file gets smaller because you're saving far more bits on common characters than you're spending on rare ones.
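Here is a small sketch of how those code lengths can be derived, assuming the textbook Huffman construction: repeatedly merge the two least frequent subtrees. The function name and sample text are made up for illustration, and Deflate's real encoder adds constraints (such as a maximum code length) that this sketch ignores.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {byte value: code length in bits} for a Huffman code built from data."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate case: only one distinct byte
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth so far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = b"the cat sat on the mat and the cat sat"
freq = Counter(text)
lengths = huffman_code_lengths(text)
total_bits = sum(freq[s] * lengths[s] for s in freq)

for sym, count in freq.most_common():
    print(repr(chr(sym)), "appears", count, "times, code length", lengths[sym], "bits")
print(len(text) * 8, "bits as plain bytes vs", total_bits, "bits Huffman-coded")
```

The space and "t" come out with the shortest codes because they occur most often, while one-off letters such as "m" get the longest.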
What Happens When You ZIP a File — Full Behind-the-Scenes Flow
Here's the complete process from "click Compress" to the final .zip file:
Your original file (50MB)
│
▼
┌─────────────────────┐
│ 1. SCAN & ANALYZE │ ← ZIP reads your file and detects file types,
│ │ checks if compression will actually help
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 2. LZ77 ENCODING │ ← Sliding window finds repeated byte sequences
│ (De-duplication) │ and replaces them with back-references
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 3. HUFFMAN CODING │ ← Assigns shorter binary codes to frequent
│ (Re-encoding) │ bytes in the LZ77 output
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ 4. ZIP CONTAINER │ ← Wraps compressed data with a header
│ (Packaging) │ containing: filename, date, checksum, sizes
└────────┬────────────┘
│
▼
final .zip file (12MB)
The ZIP File Structure — What's Inside
A .zip file isn't just compressed data. It has a specific internal structure:
┌────────────────────────────┐
│ LOCAL FILE HEADER │ ← filename, compression method, date
│ + COMPRESSED DATA │ ← the actual LZ77+Huffman compressed bytes
├────────────────────────────┤
│ (repeated for each file) │
├────────────────────────────┤
│ CENTRAL DIRECTORY │ ← index of all files and their locations
├────────────────────────────┤
│ END OF CENTRAL DIRECTORY │ ← total file count, size info
└────────────────────────────┘
The Central Directory at the end is like a table of contents. This is why ZIP files can be "browsed" before fully extracting: your computer reads the Central Directory first, sees what's inside, and only decompresses the files you choose to extract.
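You can watch this happen with Python's standard `zipfile` module. The sketch below builds a small archive in memory, lists its contents from the index, and then decompresses a single member. The file names and contents are made up for the demo.

```python
import io
import zipfile

# Build a small archive in memory (hypothetical file names and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "the cat sat on the mat\n" * 500)
    zf.writestr("data.csv", "id,name\n" + "1,cat\n" * 500)

# Opening the archive reads the central directory; no file data is decompressed yet.
with zipfile.ZipFile(buf) as zf:
    infos = zf.infolist()
    for info in infos:
        print(info.filename, info.file_size, "->", info.compress_size,
              "bytes, local header at offset", info.header_offset)
    first = zf.read("notes.txt")  # decompress just this one member

print(len(first), "bytes extracted from notes.txt")
```

Each `ZipInfo` entry carries the name, the sizes, and the offset of the member's local header, which is what lets a tool jump straight to one file.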
How Decompression (Unzipping) Works
Unzipping simply reverses the process:
Step 1 — Read the Central Directory
The unzipper jumps to the end of the .zip file, reads the index, and knows exactly where each file's compressed data begins.
Step 2 — Huffman Decoding
Using the Huffman tables stored with the compressed data (Deflate places them at the start of each block), each compressed bit sequence is decoded back into literal bytes and back-references.
Step 3 — LZ77 Decoding
Back-references are resolved: "go back 27 chars, copy 11" is expanded back to the actual repeated data.
Step 4 — Checksum Verification
The unzipper runs a CRC-32 checksum on the decompressed output and compares it to the checksum stored in the ZIP header. If they match, the file is, for all practical purposes, verified identical to the original.
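The checksum step is easy to reproduce by hand. This sketch, again using a made-up in-memory archive, reads the CRC-32 recorded for a member and compares it with one computed over the extracted bytes using `zlib.crc32`.

```python
import io
import zipfile
import zlib

payload = b"the cat sat on the mat\n" * 500  # made-up sample content

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", payload)

with zipfile.ZipFile(buf) as zf:
    stored_crc = zf.getinfo("notes.txt").CRC  # CRC-32 recorded in the archive
    extracted = zf.read("notes.txt")          # zipfile also checks the CRC while reading
    bad_member = zf.testzip()                 # None means every member passed

computed_crc = zlib.crc32(extracted)
print(f"stored {stored_crc:#010x}, computed {computed_crc:#010x}")
assert computed_crc == stored_crc
```

In everyday use you rarely need the manual comparison, because `zipfile` raises an error on a CRC mismatch while reading. It is shown here only to make the step visible.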
ZIP file → Huffman decode → LZ77 expand → checksum check → original file ✅
Why Some Files Don't Compress Well
You may have noticed that some ZIPs aren't much smaller than the originals. Here's why:
| File Type | Compressibility (typical size reduction) | Reason |
|---|---|---|
| .txt, .csv, .html | Excellent (60–90%) | High text repetition |
| .docx, .xlsx | Poor (often only a few percent) | These formats are already ZIP archives internally |
| .jpg, .mp3, .mp4 | Poor (0–5%) | Already heavily compressed |
| .png (lossless) | Poor (often only a few percent) | Image data is already Deflate-compressed; only poorly optimized files shrink further |
| .exe, .dll | Moderate (30–50%) | Code has patterns but also random-looking sections |
| .zip inside .zip | Almost none | Already compressed; little redundancy is left |
The golden rule: ZIP is most powerful on text, code, and uncompressed data. It cannot significantly compress files that are already compressed.
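A quick experiment makes the rule concrete. The sketch below compares three inputs with `zlib` (Deflate): word-based text, random bytes standing in for already-compressed media, and data that has already been through Deflate once. The vocabulary and the helper name `reduction` are made up for the demo, and the exact percentages will vary.

```python
import os
import random
import zlib


def reduction(data: bytes) -> float:
    """Percentage by which Deflate (zlib, default level) shrinks data."""
    return 100 * (1 - len(zlib.compress(data)) / len(data))


# Made-up "text": 20,000 words from a tiny vocabulary (seeded for repeatability).
rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "and", "dog", "ran", "to", "house"]
text = " ".join(rng.choice(vocab) for _ in range(20_000)).encode()

noise = os.urandom(len(text))  # stands in for JPEG/MP3-like data
once = zlib.compress(text)     # compressing this again is like a .zip inside a .zip

print(f"text:               {reduction(text):6.1f}% smaller")
print(f"random bytes:       {reduction(noise):6.1f}% smaller")
print(f"already compressed: {reduction(once):6.1f}% smaller")
```

A negative percentage means the "compressed" version came out slightly larger, which is typical for data with no patterns left to exploit.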
ZIP vs Other Compression Formats
| Format | Algorithm | Compression | Speed | Best For |
|---|---|---|---|---|
| ZIP | Deflate (LZ77+Huffman) | Good | Fast | Universal compatibility |
| 7-Zip (.7z) | LZMA | Better | Slower | Maximum compression |
| RAR | Proprietary | Good | Medium | Archives with recovery |
| GZIP | Deflate | Good | Fast | Linux/web servers |
| Brotli | LZ77 + Huffman + context modeling | Excellent | Variable | Web browser compression |
ZIP wins on compatibility: every major operating system can open it without extra software.
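Two of these algorithm families ship with Python's standard library, so a rough comparison is easy to run. This is only a sketch: the sample data is made up, `lzma` here writes an .xz stream rather than a .7z archive, and results on real files will differ.

```python
import gzip
import lzma
import random
import zlib

# Made-up sample: 50,000 words from a tiny vocabulary (seeded for repeatability).
rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "and", "dog", "ran", "to", "house"]
sample = " ".join(rng.choice(vocab) for _ in range(50_000)).encode()

sizes = {
    "original": len(sample),
    "Deflate (zlib, level 9)": len(zlib.compress(sample, 9)),
    "gzip (Deflate + gzip header)": len(gzip.compress(sample, compresslevel=9)),
    "LZMA (xz container)": len(lzma.compress(sample)),
}
for name, size in sizes.items():
    print(f"{name:30} {size:>8,} bytes")
```

The zlib and gzip results should differ by only a few bytes of header, since both wrap the same Deflate stream.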
Fun Facts About ZIP
- ZIP was invented by Phil Katz in 1989. He released the format as open, and it became the world standard.
- The DEFLATE algorithm (LZ77 + Huffman) is also used in PNG images, HTTP gzip compression, and PDF internal compression.
- A ZIP file can contain other ZIP files — but as explained above, double-zipping gives almost no benefit.
- The most extreme ZIP files are "ZIP bombs": tiny archives that expand to enormous amounts of data, sometimes used maliciously to overwhelm the software that unpacks them. A classic example is 42.zip, a 42KB file that expands to about 4.5 petabytes through nested compression of repetitive data.
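The last fact rests on how well Deflate handles pure repetition. As a harmless illustration, this sketch compresses one million zero bytes with `zlib` and prints the resulting ratio. The exact size depends on the zlib version, so no specific number is claimed here.

```python
import zlib

zeros = bytes(1_000_000)          # one million zero bytes: as repetitive as data gets
packed = zlib.compress(zeros, 9)  # Deflate at maximum compression level

print(f"{len(zeros):,} bytes -> {len(packed):,} bytes "
      f"(about {len(zeros) // len(packed)}:1)")
assert zlib.decompress(packed) == zeros
```

A single Deflate pass tops out at roughly a thousand to one, which is why the extreme ratios quoted for ZIP bombs depend on nesting archives inside archives.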
How This Relates to PDF Compression
When you compress a PDF using a tool like CommandPDF, a similar process happens:
- Text and vector content in the PDF is compressed using Deflate (the same LZ77+Huffman)
- Images inside the PDF are resampled to lower DPI or re-compressed with JPEG
- Metadata and unused objects (like deleted content still stored in the file) are removed
This is why a PDF with lots of images compresses differently than a text-heavy PDF — the text compresses well, the images less so (especially JPEGs already inside).
👉 Compress your PDF now — free at CommandPDF
Summary
| Step | What Happens |
|---|---|
| LZ77 | Finds and replaces repeated byte sequences with short back-references |
| Huffman Coding | Gives shorter binary codes to frequently-used bytes |
| ZIP container | Wraps compressed data with headers, filenames, and checksums |
| Unzip | Reverses both steps and verifies integrity via CRC checksum |
ZIP compression is elegant engineering: it exploits the natural redundancy in data to shrink files — then reconstructs everything perfectly on the other end. No data lost. No magic. Just math.
Need to bundle several PDFs into one download? Try the free PDFs to ZIP tool — it builds the archive right in your browser.