
How ZIP Compression Works — Behind the Scenes (With Visuals)



Arslan Iqbal
Technical Writer · CommandPDF Blog · May 2025 · 8 min read

You click "Compress to ZIP" on a 50MB folder and seconds later you have a 12MB file. Same files. Same data. A quarter of the size.

How does that actually work?

This post goes behind the scenes — no fluff, just the real mechanics of how ZIP finds patterns, squeezes them down, and puts everything back perfectly when you unzip.


What ZIP Actually Is

ZIP is not magic. It's an archive format that uses lossless compression, meaning it shrinks your files but when you unzip, you get back exactly the same data, bit for bit. Nothing is lost or approximated.

Nearly every ZIP file you will meet uses a compression method called DEFLATE, which combines two algorithms:

  1. LZ77 — finds and eliminates repeated patterns in data
  2. Huffman Coding — replaces common bytes with shorter codes

Let's look at each one.


Step 1 — LZ77: Finding Repetitions

Imagine you have a text file that says:

the cat sat on the mat and the cat sat

A human reading this notices "the cat sat" appears twice. LZ77 does the same thing — but on raw bytes, not just words.

How LZ77 Works

LZ77 scans through your file using a sliding window — think of it like a magnifying glass moving across the data. As it moves forward, it looks back and asks:

"Have I seen this sequence of bytes before? If yes, how far back, and how long was the match?"

Instead of storing the repeated data again, it stores a reference:

the cat sat on the mat and the cat sat → "the cat sat on the mat and " + (go back 27 chars, copy 11 chars)

[Figure: LZ77 sliding window diagram]
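To make the back-reference concrete, here is a minimal sketch of the "look back for the longest match" step. The function name `longest_match` and the brute-force search are illustrative assumptions; real compressors use hash tables and other shortcuts rather than checking every earlier position.

```python
def longest_match(data: bytes, pos: int, window: int = 32 * 1024) -> tuple[int, int]:
    """Return (distance, length) of the longest earlier match for data[pos:].

    Brute-force search for illustration only. Returns (0, 0) if nothing matches.
    """
    best_dist, best_len = 0, 0
    start = max(0, pos - window)          # only look back as far as the window allows
    for cand in range(start, pos):
        length = 0
        # Compare byte by byte; the match may run past `pos` (overlap is allowed).
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len


text = b"the cat sat on the mat and the cat sat"
second = text.index(b"the cat sat", 1)    # start of the second occurrence
print(second, longest_match(text, second))  # prints: 27 (27, 11)
```

The 32 KB default window matches the maximum look-back distance that DEFLATE allows.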

Real Example

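Original     LZ77 Output
---------    -------------------------
ABCABCABC    ABC + (go back 3, copy 6)
9 bytes      ~4 bytes

Copying 6 bytes from only 3 bytes back works because the copy runs one byte at a time, so it can reuse bytes it has just written.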
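The expansion side can be sketched in a few lines. The token format here (a `bytes` literal or a `(distance, length)` tuple) is an assumption made for readability, not the bit layout ZIP actually stores.

```python
def lz77_decode(tokens):
    """Expand tokens that are either bytes literals or (distance, length) pairs."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok                      # literal bytes are copied as-is
        else:
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])  # one byte at a time, so overlapping copies work
    return bytes(out)


print(lz77_decode([b"ABC", (3, 6)]))  # prints: b'ABCABCABC'
```

Feeding it `[b"the cat sat on the mat and ", (27, 11)]` rebuilds the cat sentence from the earlier example.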

The more repetition in your file, the more LZ77 saves. This is why:

  • Text files compress extremely well (lots of repeated words and patterns)
  • Code files compress well (repeated keywords, syntax)
  • Already-compressed files (like JPEGs or MP3s) barely shrink at all — the patterns are already gone
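A quick way to see this difference is to run Python's `zlib` module (which implements DEFLATE) on repetitive text and on random bytes. The sample data is made up for the demonstration, and exact sizes will vary between runs because the random bytes change.

```python
import os
import zlib

text = b"the cat sat on the mat and the cat sat\n" * 1000
noise = os.urandom(len(text))   # random bytes have no patterns to exploit

for label, data in (("repetitive text", text), ("random bytes", noise)):
    packed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(packed)} bytes")
```

The random input should come out slightly larger than it went in, since the compressor still has to add a small amount of framing.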

Step 2 — Huffman Coding: Shorter Codes for Common Things

After LZ77 has removed repetitions, Huffman Coding takes over. It re-encodes what is left (the literal bytes and the back-references) by giving shorter binary codes to the symbols that occur most often.

The Idea

In English text, the letter "e" appears far more often than "z". So why should both take up the same space?

Plain ASCII text is normally stored with every character taking one byte (8 bits), regardless of how common it is.

Huffman says: what if common characters used fewer bits, and rare ones used more?

Building the Huffman Tree

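The algorithm counts how often each byte appears in your data, then builds a binary tree by repeatedly merging the two least frequent entries. The sketch below is simplified: the number under each character is the code length, in bits, that it might receive in a full tree with many more leaves.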

      [root]
     /      \
  [common]  [rare]
   /    \     /   \
  'e'   't'  'z'  'q'
  (2)   (3)  (6)  (7)

Frequent characters get shorter paths (fewer bits). Rare characters get longer paths. The result:

Character    Frequency      Old (ASCII)    Huffman Code    Savings
---------    -----------    -----------    ------------    -------
'e'          Very common    8 bits         2 bits          75%
't'          Common         8 bits         3 bits          62.5%
'z'          Rare           8 bits         6 bits          25%

The total file gets smaller because you're saving far more bits on common characters than you're spending on rare ones.

[Figure: Huffman tree visualization]
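The tree-building idea can be sketched with a priority queue. This is textbook Huffman coding that only computes code lengths; `huffman_code_lengths` is an illustrative name, and real DEFLATE additionally limits code lengths and uses a canonical code assignment, which this sketch does not attempt.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {byte value: code length in bits} for a Huffman code over `data`."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                        # a single distinct symbol still needs 1 bit
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)        # the two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # everything moves one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


sample = b"the cat sat on the mat and the cat sat"
lengths = huffman_code_lengths(sample)
total_bits = sum(lengths[b] for b in sample)
print(f"{len(sample) * 8} bits as plain bytes, {total_bits} bits Huffman-coded")
for sym, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(repr(chr(sym)), bits)
```

The most frequent symbols (here the space and 't') end up with the shortest codes, which is where the savings come from.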


What Happens When You ZIP a File — Full Behind-the-Scenes Flow

Here's the complete process from "click Compress" to the final .zip file:

Your original file (50MB)
         │
         ▼
┌─────────────────────┐
│   1. SCAN & ANALYZE │  ← The archiver reads each file; if compression
│                     │    would not help, it can store the data as-is
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  2. LZ77 ENCODING   │  ← Sliding window finds repeated byte sequences
│  (De-duplication)   │    and replaces them with back-references
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ 3. HUFFMAN CODING   │  ← Assigns shorter binary codes to frequent
│  (Re-encoding)      │    bytes in the LZ77 output
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  4. ZIP CONTAINER   │  ← Wraps compressed data with a header
│  (Packaging)        │    containing: filename, date, checksum, sizes
└────────┬────────────┘
         │
         ▼
   final .zip file (12MB)
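The whole flow can be tried with Python's standard `zipfile` module, which runs DEFLATE and writes the container for you. The file name `cats.txt` and the sample payload are placeholders, and the archive is built in memory so nothing touches the disk.

```python
import io
import zipfile

payload = b"the cat sat on the mat and the cat sat\n" * 1000

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("cats.txt", payload)      # compress and package in one call

archive = buf.getvalue()
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.getinfo("cats.txt")
    print(info.file_size, "->", info.compress_size, "bytes; archive is", len(archive), "bytes")
```

The archive is a little larger than the compressed data alone because of the headers and the directory described in the next section.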

The ZIP File Structure — What's Inside

A .zip file isn't just compressed data. It has a specific internal structure:

┌────────────────────────────┐
│  LOCAL FILE HEADER         │  ← filename, compression method, date
│  + COMPRESSED DATA         │  ← the actual LZ77+Huffman compressed bytes
├────────────────────────────┤
│  (repeated for each file)  │
├────────────────────────────┤
│  CENTRAL DIRECTORY         │  ← index of all files and their locations
├────────────────────────────┤
│  END OF CENTRAL DIRECTORY  │  ← total file count, size info
└────────────────────────────┘

The Central Directory at the end is like a table of contents. This is why ZIP files can be "browsed" before fully extracting — your computer reads the Central Directory first, sees what's inside, and only decompresses the files you choose to extract.
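As a rough illustration of that lookup, the sketch below builds a small archive in memory and then reads the End of Central Directory record by hand. It assumes an archive with no trailing comment and no ZIP64 extensions, so the record is exactly the last 22 bytes; real readers have to search backwards for the signature.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"alpha " * 100)
    zf.writestr("b.txt", b"beta " * 100)
raw = buf.getvalue()

# Fields: signature, disk numbers (2), entries on disk, total entries,
# directory size, directory offset, comment length. All little-endian.
sig, _, _, _, total, cd_size, cd_offset, comment_len = struct.unpack(
    "<4sHHHHIIH", raw[-22:]
)
print(sig, total, cd_size, cd_offset)

# zipfile performs the same lookup and lists entries without decompressing anything.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    for info in zf.infolist():
        print(info.filename, info.header_offset, info.compress_size)
```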


How Decompression (Unzipping) Works

Unzipping simply reverses the process:

Step 1: Read the Central Directory. The unzipper jumps to the end of the .zip file, reads the index, and knows exactly where each file's compressed data begins.

Step 2: Huffman Decoding. Using the Huffman code tables stored at the start of each compressed block inside the data stream (or the fixed tables defined by the DEFLATE format), each bit sequence is decoded back into literal bytes and back-references.

Step 3: LZ77 Decoding. Back-references are resolved: "go back 27 chars, copy 11" is expanded back to the actual repeated data. In practice steps 2 and 3 are interleaved, symbol by symbol.

Step 4: Checksum Verification. The unzipper runs a CRC-32 checksum on the decompressed output and compares it to the checksum stored in the ZIP headers. If they match, the file is almost certainly identical to the original; a mismatch means the data is corrupted.

ZIP file → Huffman decode → LZ77 expand → checksum check → original file ✅
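The same pipeline can be walked through by hand in Python. This sketch assumes a small in-memory archive with one Deflate-compressed entry named `cats.txt` (a placeholder); it finds the compressed bytes behind the local file header, inflates them with `zlib`, and checks the CRC-32 itself instead of letting `zipfile` do it.

```python
import io
import struct
import zipfile
import zlib

original = b"the cat sat on the mat and the cat sat\n" * 100
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("cats.txt", original)
raw = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.getinfo("cats.txt")           # entry read from the Central Directory

# The 30-byte local file header ends with the name length and extra-field length.
hdr = info.header_offset
name_len, extra_len = struct.unpack("<HH", raw[hdr + 26 : hdr + 30])
start = hdr + 30 + name_len + extra_len
compressed = raw[start : start + info.compress_size]

# wbits=-15 tells zlib to expect a raw DEFLATE stream, which is what ZIP stores.
restored = zlib.decompress(compressed, wbits=-15)
print(zlib.crc32(restored) == info.CRC, restored == original)
```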

Why Some Files Don't Compress Well

You may have noticed that some ZIPs aren't much smaller than the originals. Here's why:

File Type            Compressibility           Reason
-----------------    ----------------------    ------------------------------------------------------------
.txt, .csv, .html    Excellent (60–90%)        High text repetition
.docx, .xlsx         Poor (usually a few %)    Already ZIP containers whose contents are Deflate-compressed
.jpg, .mp3, .mp4     Poor (0–5%)               Already heavily compressed
.png                 Poor (usually a few %)    Pixel data is already Deflate-compressed internally
.exe, .dll           Moderate (30–50%)         Code has patterns but also random-looking sections
.zip inside .zip     Almost none               Already compressed

The golden rule: ZIP is most powerful on text, code, and uncompressed data. It cannot significantly compress files that are already compressed.
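The rule is easy to check by compressing the output of a compressor. The sketch below uses made-up text drawn from a small word list (with a fixed seed so runs are repeatable) and Python's `zlib`; the second pass should give essentially no further reduction.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sample text is the same on every run
words = [b"zip", b"file", b"data", b"byte", b"code", b"tree", b"match", b"window"]
data = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(data, 9)     # first pass removes the redundancy
twice = zlib.compress(once, 9)    # second pass has almost nothing left to find
print(len(data), len(once), len(twice))
```

One caveat: extremely repetitive input, such as one line repeated thousands of times, can leave a repetitive compressed stream that shrinks again on a second pass. That edge case is what ZIP bombs exploit.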


ZIP vs Other Compression Formats

Format      Algorithm                                    Compression    Speed       Best For
--------    -----------------------------------------    -----------    --------    ------------------------------
ZIP         Deflate (LZ77+Huffman)                       Good           Fast        Universal compatibility
7z (.7z)    LZMA / LZMA2                                 Better         Slower      Maximum compression
RAR         Proprietary                                  Good           Medium      Archives with recovery records
GZIP        Deflate                                      Good           Fast        Linux/web servers
Brotli      LZ77 variant + Huffman + context modeling    Excellent      Variable    Web browser compression

ZIP wins on compatibility: every major desktop operating system can open it without extra software.


Fun Facts About ZIP

  • ZIP was invented by Phil Katz in 1989. He published the format specification openly, and it became the de facto standard for archives.
  • The DEFLATE algorithm (LZ77 + Huffman) is also used in PNG images, HTTP gzip compression, and PDF internal compression.
  • A ZIP file can contain other ZIP files — but as explained above, double-zipping gives almost no benefit.
  • One notorious abuse of the format is the "ZIP bomb": a tiny archive that expands to an enormous amount of data, built to overwhelm antivirus scanners and other software that unpacks it. A classic example, 42.zip, is a 42KB file that expands to about 4.5 petabytes through nested archives of highly repetitive data.

How This Relates to PDF Compression

When you compress a PDF using a tool like CommandPDF, a similar process happens:

  • Text and vector content in the PDF is compressed using Deflate (the same LZ77+Huffman)
  • Images inside the PDF are resampled to lower DPI or re-compressed with JPEG
  • Metadata and unused objects (like deleted content still stored in the file) are removed

This is why a PDF with lots of images compresses differently than a text-heavy PDF — the text compresses well, the images less so (especially JPEGs already inside).

👉 Compress your PDF now — free at CommandPDF


Summary

Step              What Happens
--------------    ---------------------------------------------------------------------
LZ77              Finds and replaces repeated byte sequences with short back-references
Huffman Coding    Gives shorter binary codes to frequently-used bytes
ZIP container     Wraps compressed data with headers, filenames, and checksums
Unzip             Reverses both steps and verifies integrity via CRC checksum

ZIP compression is elegant engineering: it exploits the natural redundancy in data to shrink files — then reconstructs everything perfectly on the other end. No data lost. No magic. Just math.


About the Author

Arslan Iqbal

Technical Writer & Digital Tools Specialist at CommandPDF
Arslan writes in-depth guides on PDF tools, file formats, and web productivity.
He specializes in making complex technical topics easy to understand for everyday users.
