How ZIP Compression Works — Behind the Scenes

You click "Compress to ZIP" on a 50MB folder and seconds later you have a 12MB file. Same files. Same data. A quarter of the size.

How does that actually work?

This post goes behind the scenes — no fluff, just the real mechanics of how ZIP finds patterns, squeezes them down, and puts everything back perfectly when you unzip.

What ZIP Actually Is

ZIP is not magic. It's a lossless compression format — meaning it shrinks your files but when you unzip, you get back exactly the same data, bit for bit. Nothing is lost or approximated.

It achieves this through two algorithms working together, a combination called Deflate (ZIP's default compression method):

LZ77 — finds and eliminates repeated patterns in data
Huffman Coding — replaces common bytes with shorter codes

Let's look at each one.

Step 1 — LZ77: Finding Repetitions

Imagine you have a text file that says:

the cat sat on the mat and the cat sat

A human reading this notices "the cat sat" appears twice. LZ77 does the same thing — but on raw bytes, not just words.

How LZ77 Works

LZ77 scans through your file using a sliding window, like a magnifying glass moving across the data. In ZIP's Deflate method, that window covers the previous 32 KB. As it moves forward, it looks back and asks:

"Have I seen this sequence of bytes before? If yes, how far back, and how long was the match?"

Instead of storing the repeated data again, it stores a reference:

"the cat sat on the mat and " → (go back 27 chars, copy 11 chars)

Real Example

Original	LZ77 Output
`ABCABCABC`	`ABC` + `(go back 3, copy 6)`
9 bytes	~4 bytes
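
To make the back-reference idea concrete, here is a minimal toy sketch of LZ77 in Python. It is not the encoder ZIP tools actually ship: the function names and the token format (a plain int for a literal byte, a (distance, length) tuple for a reference) are assumptions made for illustration, and the brute-force search stands in for the much faster match finders real tools use.

```python
def lz77_compress(data: bytes, window: int = 32 * 1024,
                  min_len: int = 3, max_len: int = 258) -> list:
    """Greedy toy LZ77. Returns a list of tokens: an int is a literal byte,
    a (distance, length) tuple is a back-reference."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window (real encoders use hash chains).
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i: that is an overlapping copy.
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decompress(tokens: list) -> bytes:
    """Expand literals and back-references into the original bytes."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte by byte, so overlaps work
        else:
            out.append(tok)
    return bytes(out)


tokens = lz77_compress(b"ABCABCABC")
print(tokens)                   # [65, 66, 67, (3, 6)]
print(lz77_decompress(tokens))  # b'ABCABCABC'
```

The three literals are the byte values of A, B and C, and the single (3, 6) token is the "go back 3, copy 6" reference from the table. A greedy search like this one may split a longer sentence into different references than a human would pick, but the decoded output is always identical to the input.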

The more repetition in your file, the more LZ77 saves. This is why:

Text files compress extremely well (lots of repeated words and patterns)
Code files compress well (repeated keywords, syntax)
Already-compressed files (like JPEGs or MP3s) barely shrink at all — the patterns are already gone

Step 2 — Huffman Coding: Shorter Codes for Common Things

After LZ77 has removed repetitions, Huffman Coding takes over. It works on what is left (the literal bytes and the back-references themselves) and gives shorter binary codes to the more common symbols.

The Idea

In English text, the letter "e" appears far more often than "z". So why should both take up the same space?

Plain ASCII text stores every character in one 8-bit byte, regardless of how common it is.

Huffman says: what if common characters used fewer bits, and rare ones used more?

Building the Huffman Tree

The algorithm counts how often each byte appears in your data, then builds a binary tree:

      [root]
     /      \
  [common]  [rare]
   /    \     /   \
  'e'   't'  'z'  'q'
  (2)   (3)  (6)  (7)

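Frequent characters get shorter paths (fewer bits). Rare characters get longer paths. The diagram is schematic: the numbers in parentheses are code lengths in bits, so in a real tree 'z' and 'q' would sit several levels deeper than 'e'. The result: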

Character	Frequency	Old (ASCII)	Huffman Code	Savings
'e'	Very common	8 bits	2 bits	75%
't'	Common	8 bits	3 bits	62.5%
'z'	Rare	8 bits	6 bits	25%

The total file gets smaller because you're saving far more bits on common characters than you're spending on rare ones.
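
The tree-building step can be sketched in a few lines of Python. This is a simplified illustration, not Deflate's exact procedure (Deflate also limits code lengths and codes more than plain bytes); the function name huffman_code_lengths is an assumption chosen for this example, and it returns only the code length of each byte, which is enough to measure the savings.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte_value: code_length_in_bits} for a Huffman code over data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, sym, {sym: 0}) for sym, weight in freq.items()]
    heapq.heapify(heap)
    tie = 256
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol inside
        # them moves one level deeper, i.e. gains one bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = b"the cat sat on the mat and the cat sat"
lengths = huffman_code_lengths(text)
counts = Counter(text)
huffman_bits = sum(counts[s] * lengths[s] for s in counts)
print(huffman_bits, "bits with Huffman vs", 8 * len(text), "bits as plain bytes")
```

In this sample the space character is the most frequent byte, so it ends up with one of the shortest codes, while one-off letters like "m" and "d" get the longest.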

What Happens When You ZIP a File — Full Behind-the-Scenes Flow

Here's the complete process from "click Compress" to the final .zip file:

Your original file (50MB)
         │
         ▼
┌─────────────────────┐
│   1. SCAN & ANALYZE │  ← The ZIP tool reads each file; many tools store
│                     │    a file uncompressed if deflating would not help
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  2. LZ77 ENCODING   │  ← Sliding window finds repeated byte sequences
│  (De-duplication)   │    and replaces them with back-references
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ 3. HUFFMAN CODING   │  ← Assigns shorter binary codes to frequent
│  (Re-encoding)      │    bytes in the LZ77 output
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  4. ZIP CONTAINER   │  ← Wraps compressed data with a header
│  (Packaging)        │    containing: filename, date, checksum, sizes
└────────┬────────────┘
         │
         ▼
   final .zip file (12MB)
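
Steps 2 and 3 of the flow above can be tried directly with Python's standard zlib module, which implements Deflate. This is a small sketch under one assumption worth naming: passing wbits=-15 asks zlib for a raw Deflate stream (no zlib header or trailer), which is the form of data a ZIP entry holds. The sample text is made up for the demo.

```python
import zlib

original = b"the cat sat on the mat and the cat sat\n" * 1000

# Level 9 = best compression; wbits=-15 = raw Deflate with a 32 KB window.
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
deflated = compressor.compress(original) + compressor.flush()

# Decompression must be told it is reading a raw Deflate stream as well.
restored = zlib.decompress(deflated, -15)

print(len(original), "bytes ->", len(deflated), "bytes")
print("round trip ok:", restored == original)
```

Because the sample is one line repeated a thousand times, the output is a tiny fraction of the input. Real files land somewhere between this extreme and no savings at all.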

The ZIP File Structure — What's Inside

A .zip file isn't just compressed data. It has a specific internal structure:

┌────────────────────────────┐
│  LOCAL FILE HEADER         │  ← filename, compression method, date
│  + COMPRESSED DATA         │  ← the actual LZ77+Huffman compressed bytes
├────────────────────────────┤
│  (repeated for each file)  │
├────────────────────────────┤
│  CENTRAL DIRECTORY         │  ← index of all files and their locations
├────────────────────────────┤
│  END OF CENTRAL DIRECTORY  │  ← total file count, size info
└────────────────────────────┘

The Central Directory at the end is like a table of contents. This is why ZIP files can be "browsed" before fully extracting — your computer reads the Central Directory first, sees what's inside, and only decompresses the files you choose to extract.
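
Here is a short sketch of that "browse first, extract later" behaviour using Python's standard zipfile module. The archive is built in memory and the file names (notes.txt, data.csv) are invented for the example.

```python
import io
import zipfile

# Build a small archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "the cat sat on the mat\n" * 500)
    zf.writestr("data.csv", "id,name\n" + "1,alice\n" * 500)

with zipfile.ZipFile(buf) as zf:
    # infolist() comes from the Central Directory; nothing is decompressed yet.
    listing = [(i.filename, i.file_size, i.compress_size) for i in zf.infolist()]
    # Only this one member is actually decompressed.
    only_one = zf.read("data.csv")

for name, size, packed in listing:
    print(f"{name}: {size} bytes, {packed} bytes compressed")

# The structure is visible in the raw bytes: a local file header signature
# at the start and the End of Central Directory signature near the end.
raw = buf.getvalue()
print(raw[:4], raw.rfind(b"PK\x05\x06"))
```

The "PK" at the start of each signature is Phil Katz's initials, which is why every ZIP file begins with those two letters when opened in a hex viewer.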

How Decompression (Unzipping) Works

Unzipping simply reverses the process:

Step 1: Read the Central Directory. The unzipper jumps to the end of the .zip file, reads the index, and knows exactly where each file's compressed data begins.

Step 2: Huffman Decoding. The Huffman code tables travel inside the compressed data itself (Deflate stores them at the start of each block, or uses a predefined table), not in the ZIP header. Using them, each compressed bit sequence is decoded back into literal bytes and back-references.

Step 3: LZ77 Decoding. Back-references are resolved: "go back 27 chars, copy 11" is expanded back to the actual repeated data.

Step 4: Checksum Verification. The unzipper runs a CRC-32 checksum on the decompressed output and compares it to the CRC-32 stored in the ZIP headers. If they match, the data is almost certainly identical to the original; a mismatch means the archive is corrupted.

ZIP file → Huffman decode → LZ77 expand → checksum check → original file ✅
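
The checksum step is easy to reproduce by hand. The sketch below uses Python's standard zipfile and zlib modules; the archive and its single member (notes.txt) are made up for the example. The zipfile module already performs this check itself when reading, so the manual comparison is only there to show what is being compared.

```python
import io
import zipfile
import zlib

payload = b"the cat sat on the mat and the cat sat\n" * 200

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", payload)

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("notes.txt")    # metadata from the Central Directory
    data = zf.read(info)              # Huffman decode + LZ77 expand
    crc_matches = zlib.crc32(data) == info.CRC
    first_bad_member = zf.testzip()   # None means every member passed

# Changing even one byte of the output gives a different CRC-32.
tampered = data[:-1] + b"X"
tamper_detected = zlib.crc32(tampered) != info.CRC

print("stored CRC-32:", hex(info.CRC))
print("matches:", crc_matches, "| tampering detected:", tamper_detected)
```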

Why Some Files Don't Compress Well

You may have noticed that some ZIPs aren't much smaller than the originals. Here's why:

File Type	Compressibility	Reason
`.txt`, `.csv`, `.html`	Excellent (60–90%)	High text repetition
`.docx`, `.xlsx`	Poor (usually under 10%)	These files are already ZIP containers with Deflate-compressed contents
`.jpg`, `.mp3`, `.mp4`	Poor (0–5%)	Already heavily compressed
`.png` (lossless)	Poor (usually a few percent)	Pixel data is already Deflate-compressed inside the file
`.exe`, `.dll`	Moderate (30–50%)	Code has patterns but also random-looking sections
`.zip` inside `.zip`	Almost none	Already Deflate-compressed, so little redundancy is left

The golden rule: ZIP is most powerful on text, code, and uncompressed data. It cannot significantly compress files that are already compressed.
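
The golden rule is easy to check yourself. The following sketch compares three kinds of input using Python's standard zlib module: word-based text, the same text after it has already been compressed once, and random bytes. The helper name compressed_fraction and the small vocabulary are assumptions made for the demo; exact numbers will vary with the data.

```python
import os
import random
import zlib


def compressed_fraction(data: bytes) -> float:
    """Size after Deflate divided by size before (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the sample text is repeatable
words = ["zip", "file", "data", "window", "match", "code", "byte", "tree"]
text = " ".join(rng.choice(words) for _ in range(20_000)).encode()

already_compressed = zlib.compress(text, 9)
random_bytes = os.urandom(len(text))

for label, blob in [("text", text),
                    ("compressed once", already_compressed),
                    ("random bytes", random_bytes)]:
    print(f"{label:16s} shrinks to {compressed_fraction(blob):.0%} of its size")
```

The text shrinks substantially, while the already-compressed and random inputs stay at roughly their original size. That is the same reason a JPEG or a nested .zip barely changes when zipped.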

ZIP vs Other Compression Formats

Format	Algorithm	Compression	Speed	Best For
ZIP	Deflate (LZ77+Huffman)	Good	Fast	Universal compatibility
7-Zip (.7z)	LZMA / LZMA2	Better	Slower	Maximum compression
RAR	Proprietary	Good	Medium	Archives with recovery
GZIP	Deflate	Good	Fast	Linux/web servers
Brotli	LZ77 + Huffman + context modeling	Excellent	Variable	Web browser compression

ZIP wins on compatibility: every major operating system can open it without extra software.
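
Several of these algorithms ship with Python's standard library, so a rough comparison takes only a few lines. Treat this as a sketch: the CSV-like sample is invented, results depend heavily on the input, and lzma here writes an .xz container rather than a .7z archive (the standard library has no .7z support).

```python
import gzip
import lzma
import zlib

rows = "".join(f"{i},user{i},city{i % 50}\n" for i in range(5000))
sample = ("id,name,city\n" + rows).encode()

packed = {
    "zlib (Deflate)": zlib.compress(sample, 9),
    "gzip (Deflate)": gzip.compress(sample, compresslevel=9),
    "lzma (LZMA2, .xz)": lzma.compress(sample),
}

print("original:", len(sample), "bytes")
for name, blob in packed.items():
    print(f"{name:18s} {len(blob)} bytes")

# All three are lossless: each one restores the input exactly.
round_trips_ok = (
    zlib.decompress(packed["zlib (Deflate)"]) == sample
    and gzip.decompress(packed["gzip (Deflate)"]) == sample
    and lzma.decompress(packed["lzma (LZMA2, .xz)"]) == sample
)
print("round trips ok:", round_trips_ok)
```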

Fun Facts About ZIP

ZIP was invented by Phil Katz in 1989. He released the format as open, and it became the world standard.
The DEFLATE algorithm (LZ77 + Huffman) is also used in PNG images, HTTP gzip compression, and PDF internal compression.
A ZIP file can contain other ZIP files — but as explained above, double-zipping gives almost no benefit.
A "ZIP bomb" is a tiny ZIP file that expands to an enormous amount of data, typically built to crash or stall the software that unpacks it. A classic example is 42.zip, a 42KB file that expands to 4.5 petabytes through nested layers of highly repetitive data.
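
The reason a ZIP bomb is possible at all is that extremely repetitive data compresses at extreme ratios. Here is a harmless sketch of that single ingredient using Python's standard zlib module: 10 MB of zero bytes, compressed once. It does not build a nested archive.

```python
import zlib

zeros = bytes(10_000_000)            # 10 MB of identical bytes
packed = zlib.compress(zeros, 9)
ratio = len(zeros) / len(packed)

print(len(zeros), "bytes ->", len(packed), "bytes")
print(f"roughly {ratio:.0f}:1")
```

A single Deflate pass tops out at around a thousand to one, which is why petabyte-scale examples rely on nesting archives inside archives.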

How This Relates to PDF Compression

When you compress a PDF using a tool like CommandPDF, a similar process happens:

Text and vector content in the PDF is compressed using Deflate (the same LZ77+Huffman)
Images inside the PDF are resampled to lower DPI or re-compressed with JPEG (a lossy step, unlike anything ZIP does)
Metadata and unused objects (like deleted content still stored in the file) are removed

This is why a PDF with lots of images compresses differently than a text-heavy PDF — the text compresses well, the images less so (especially JPEGs already inside).


Summary

Step	What Happens
LZ77	Finds and replaces repeated byte sequences with short back-references
Huffman Coding	Gives shorter binary codes to frequently-used bytes
ZIP container	Wraps compressed data with headers, filenames, and checksums
Unzip	Reverses both steps and verifies integrity via CRC checksum

ZIP compression is elegant engineering: it exploits the natural redundancy in data to shrink files — then reconstructs everything perfectly on the other end. No data lost. No magic. Just math.

