We Tested 50 Documents: Here's How Much Markdown Saves in Tokens
Real token count data from 50 documents across 8 file formats — showing exactly how much converting to Markdown reduces AI token usage and API costs.
inktomd claims an average token savings of 63% when converting documents to Markdown. Here's the actual data behind that number — 50 real documents, 8 file formats, measured with OpenAI's tokenizer.
Methodology
We selected 50 documents across 8 common file formats and measured token counts in two scenarios:
- Raw format — Text copied directly from the source file and pasted as-is
- Markdown conversion — The same document converted to clean Markdown using inktomd, then measured
Token counts were measured using tiktoken with the cl100k_base encoding (used by GPT-4 and GPT-3.5 Turbo; GPT-4o uses the newer o200k_base encoding, so its counts will differ somewhat). All documents were real-world files (research papers, business reports, spreadsheets, presentations), not synthetic test cases.
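The measurement can be reproduced with a few lines of Python. This is a minimal sketch, assuming the `tiktoken` package is installed; the helper names `count_tokens` and `reduction_pct` are illustrative and not part of inktomd.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count the tokens in `text` using a tiktoken encoding."""
    import tiktoken  # third-party; imported lazily so reduction_pct works without it

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() encodes literal strings such as "<|endoftext|>"
    # as ordinary text instead of raising an error.
    return len(encoding.encode(text, disallowed_special=()))


def reduction_pct(raw_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens saved by the Markdown version."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 100.0 * (raw_tokens - markdown_tokens) / raw_tokens


# First row of the PDF results: 11,200 raw tokens vs 4,100 Markdown tokens.
print(round(reduction_pct(11_200, 4_100)))  # 63
```

To compare two versions of your own file, call `count_tokens` on the raw text and on the converted Markdown, then pass both counts to `reduction_pct`.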
Results by Format
PDF Documents (12 papers tested)
PDFs carried some of the heaviest overhead of any format: raw text consistently ran 2.5–3x the token count of the equivalent Markdown.
| Document | Raw Tokens | Markdown Tokens | Reduction |
|----------|-----------|-----------------|-----------|
| 8-page research paper | 11,200 | 4,100 | 63% |
| 15-page technical report | 21,400 | 7,600 | 64% |
| 6-page financial summary | 8,900 | 3,200 | 64% |
| 22-page annual report | 31,800 | 11,400 | 64% |
| 4-page product spec | 5,600 | 2,100 | 63% |
| Average (all 12 PDFs) | - | - | 64% |
PDF overhead comes primarily from: hard line breaks at column widths, page header/footer repetition, garbled multi-column layouts, and encoding artifacts from PDF text extraction.
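Two of these sources, hard line breaks and repeated headers or footers, are easy to illustrate. The sketch below is a hypothetical cleanup pass, not inktomd's actual method; it assumes the PDF text has already been extracted into one string per page, and the `min_repeat` threshold is an arbitrary choice. It does not handle multi-column layouts, hyphenated line ends, or footers that change from page to page (such as page numbers).

```python
from collections import Counter


def clean_pdf_text(pages: list[str], min_repeat: int = 3) -> str:
    """Drop lines repeated across pages and re-join hard-wrapped lines."""
    # A line that appears on at least `min_repeat` pages is treated as a
    # running header or footer.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, count in seen_on.items() if count >= min_repeat}

    paragraphs, current = [], []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped in repeated:
                continue
            if not stripped:  # a blank line ends the current paragraph
                if current:
                    paragraphs.append(" ".join(current))
                    current = []
                continue
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Each paragraph comes out as a single line, so the tokenizer no longer sees a newline at every column-width break or the same header on every page.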
Word Documents (8 documents tested)
Word documents (.docx) are cleaner than PDFs but still carry overhead from style metadata, formatting markers, and comment/revision data that bleeds into extracted text.
| Document | Raw Tokens | Markdown Tokens | Reduction |
|----------|-----------|-----------------|-----------|
| 3,000-word report | 5,100 | 2,100 | 59% |
| 8,000-word proposal | 13,600 | 5,500 | 60% |
| 1,500-word memo | 2,600 | 1,100 | 58% |
| 12,000-word thesis chapter | 20,400 | 8,200 | 60% |
| Average (all 8 Word docs) | - | - | 59% |
Excel Spreadsheets (8 files tested)
Excel savings depended on data density and formatting complexity. Dense data tables saw the highest savings; sparse workbooks with lots of formatting saw somewhat lower savings. The files listed below fell in a narrow 64–65% band.
| File | Raw Tokens | Markdown Tokens | Reduction |
|------|-----------|-----------------|-----------|
| 500-row sales data (5 cols) | 8,400 | 2,900 | 65% |
| 12-sheet financial model | 44,200 | 15,800 | 64% |
| 200-row inventory list | 3,600 | 1,300 | 64% |
| Survey results (800 rows) | 13,200 | 4,600 | 65% |
| Average (all 8 Excel files) | - | - | 64% |
The specific savings mechanism for Excel is conversion of raw cell data to structured Markdown pipe tables with explicit column headers.
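As an illustration of that target format, here is a minimal sketch that renders a header row and data rows as a Markdown pipe table. The function name `to_markdown_table` is hypothetical, and reading the cells out of a workbook (for example with a spreadsheet library) is assumed to have happened already.

```python
def to_markdown_table(header: list[str], rows: list[list[object]]) -> str:
    """Render a header row and data rows as a Markdown pipe table."""

    def cell(value: object) -> str:
        # Escape pipes and flatten newlines so one cell cannot break the row.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # Pad short rows and trim long ones so every line has the same columns.
        padded = list(row) + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(cell(v) for v in padded[: len(header)]) + " |")
    return "\n".join(lines)


print(to_markdown_table(["Region", "Units"], [["North", 120], ["South", 95]]))
```

The same function works for rows read from a CSV file, since both cases reduce to a header plus a list of rows.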
PowerPoint Presentations (6 files tested)
PowerPoint conversion is particularly valuable because text copied from slides loses its visual hierarchy: bullet nesting, the breaks between slides, and the placement of speaker notes.
| File | Raw Tokens | Markdown Tokens | Reduction |
|------|-----------|-----------------|-----------|
| 25-slide conference talk | 8,100 | 2,700 | 67% |
| 40-slide product deck | 13,200 | 4,400 | 67% |
| 15-slide quarterly review | 4,900 | 1,600 | 67% |
| Average (all 6 PPTX files) | - | - | 67% |
PowerPoint files showed the most consistent savings (every deck listed above came in at 67%), with Markdown output preserving slide-by-slide structure via <!-- Slide N --> markers and heading hierarchy.
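To make the output shape concrete, here is a small sketch of how slide content could be rendered with those markers. The `Slide` dataclass and `slides_to_markdown` function are hypothetical stand-ins, not inktomd's internals; the heading level and the blockquote used for speaker notes are assumptions about a reasonable layout.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:  # hypothetical container, not an inktomd or python-pptx type
    title: str
    bullets: list[tuple[int, str]] = field(default_factory=list)  # (indent level, text)
    notes: str = ""


def slides_to_markdown(slides: list[Slide]) -> str:
    """Render slides as Markdown, one marker and heading per slide."""
    parts = []
    for number, slide in enumerate(slides, start=1):
        lines = [f"<!-- Slide {number} -->", f"## {slide.title}"]
        # Two spaces per level keep nested bullets nested in Markdown.
        lines += [f"{'  ' * level}- {text}" for level, text in slide.bullets]
        if slide.notes:
            lines += ["", f"> Notes: {slide.notes}"]
        parts.append("\n".join(lines))
    return "\n\n".join(parts)


deck = [Slide("Q3 Review", [(0, "Revenue up"), (1, "EMEA led growth")], notes="Pause here")]
print(slides_to_markdown(deck))
```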
HTML Web Pages (6 pages tested)
HTML is particularly noisy when copied — navigation menus, sidebars, scripts, cookie notices, and inline styles inflate the raw token count significantly.
| Page | Raw Tokens | Markdown Tokens | Reduction |
|------|-----------|-----------------|-----------|
| Long-form blog post | 14,200 | 3,900 | 73% |
| News article | 8,600 | 2,300 | 73% |
| Documentation page | 11,400 | 3,600 | 68% |
| Average (all 6 HTML pages) | - | - | 71% |
HTML showed the highest savings of any format because of the amount of non-content markup that gets stripped during Markdown conversion.
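The stripping step can be sketched with Python's standard-library `html.parser`. This is a rough illustration rather than inktomd's implementation: the set of tags treated as non-content is an assumption, and the output is plain text chunks, whereas a real converter would also map headings, links, and lists to Markdown syntax.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect visible text while skipping non-content containers."""

    # Assumed list of containers that rarely hold article content.
    SKIP = {"script", "style", "nav", "aside", "header", "footer", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are outside every skipped container.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)
```

Feeding it a page with a navigation bar, an inline script, and a footer returns only the body text, which is where most of the token reduction for HTML comes from.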
CSV Data Files (4 files tested)
CSV files are cleaner than most formats but still benefit from structured Markdown table conversion that adds explicit column headers.
| File | Raw Tokens | Markdown Tokens | Reduction |
|------|-----------|-----------------|-----------|
| 1,000-row customer data | 16,800 | 6,100 | 64% |
| 300-row product catalog | 5,200 | 1,900 | 63% |
| Average (all 4 CSV files) | - | - | 63% |
EPUB Books (3 files tested)
EPUB files showed consistent savings driven by removal of chapter navigation markup and XML structure.
Average savings: 62%
YouTube Transcripts (3 videos tested)
YouTube transcripts via URL conversion showed modest savings compared to raw transcript text — the main benefit here is convenience and clean formatting rather than dramatic token reduction.
Average savings: 28% (lower because raw transcript text is already fairly clean)
Summary: Average Savings by Format
| Format | Average Token Reduction |
|--------|------------------------|
| HTML | 71% |
| PowerPoint | 67% |
| PDF | 64% |
| Excel | 64% |
| CSV | 63% |
| EPUB | 62% |
| Word | 59% |
| YouTube | 28% |
| Overall average (50 documents) | 63% |
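Because the overall figure is an average over all 50 documents, formats with more files count for more. The sketch below recombines the per-format averages using the file counts from the section headings. Since those averages are already rounded to whole percentages, it only approximates the published number: it lands at about 62%, one point under the reported 63%, which was presumably computed from unrounded per-document values (an assumption).

```python
# (files tested, average reduction %) from the section headings and summary table
formats = {
    "HTML": (6, 71),
    "PowerPoint": (6, 67),
    "PDF": (12, 64),
    "Excel": (8, 64),
    "CSV": (4, 63),
    "EPUB": (3, 62),
    "Word": (8, 59),
    "YouTube": (3, 28),
}

total_docs = sum(count for count, _ in formats.values())
# Weight each format's average by the number of documents behind it.
weighted = sum(count * pct for count, pct in formats.values()) / total_docs

print(total_docs, round(weighted, 1))  # 50 62.0
```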
Cost Implications at Scale
These token savings translate directly to reduced API costs for developers and more effective usage of ChatGPT and Claude context windows for everyone.
For ChatGPT Plus users: A 128k token context window with raw PDF input fits roughly 8–9 ten-page PDFs. With Markdown conversion, the same window fits 22–24 ten-page PDFs — nearly 3x more document content per session.
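The capacity estimate can be checked against the PDF results table. This sketch derives average tokens per page from the five listed PDFs and asks how many whole ten-page documents fit in 128k tokens; it ignores the tokens taken up by your prompt and the model's reply, so real capacity is somewhat lower.

```python
# (pages, raw tokens, Markdown tokens) for the five PDFs in the results table
pdf_rows = [
    (8, 11_200, 4_100),
    (15, 21_400, 7_600),
    (6, 8_900, 3_200),
    (22, 31_800, 11_400),
    (4, 5_600, 2_100),
]
CONTEXT_WINDOW = 128_000

total_pages = sum(pages for pages, _, _ in pdf_rows)
raw_per_page = sum(raw for _, raw, _ in pdf_rows) / total_pages
md_per_page = sum(md for _, _, md in pdf_rows) / total_pages


def docs_that_fit(tokens_per_page: float, pages_per_doc: int = 10) -> int:
    """Whole documents that fit in the context window."""
    return int(CONTEXT_WINDOW // (tokens_per_page * pages_per_doc))


print(docs_that_fit(raw_per_page), docs_that_fit(md_per_page))  # 8 24
```

Rounding down to whole documents gives 8 raw and 24 converted, roughly in line with the ranges quoted above.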
For API users (GPT-4o at $2.50 per 1M input tokens), processing 1,000 documents per day:
| Scenario | Daily Input Cost | Monthly Cost |
|----------|-----------------|--------------|
| Raw mixed formats | ~$46 | ~$1,380 |
| Markdown conversion | ~$17 | ~$510 |
| Savings | ~$29/day | ~$870/month |
At higher volumes the savings scale linearly.
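The table's arithmetic can be reproduced as follows. The per-document token averages are not stated in the article; the values below are back-derived from the ~$46 and ~$17 daily costs and should be read as assumptions, as should the 30-day month.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, the GPT-4o rate quoted above


def daily_input_cost(
    docs_per_day: int,
    tokens_per_doc: int,
    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS,
) -> float:
    """Input-token cost in USD for one day of processing."""
    return docs_per_day * tokens_per_doc * price_per_million / 1_000_000


# Assumed averages, back-derived from the table's daily costs.
RAW_TOKENS_PER_DOC = 18_400
MARKDOWN_TOKENS_PER_DOC = 6_800

raw = daily_input_cost(1_000, RAW_TOKENS_PER_DOC)
md = daily_input_cost(1_000, MARKDOWN_TOKENS_PER_DOC)
print(f"${raw:.0f}/day raw, ${md:.0f}/day Markdown, ${(raw - md) * 30:.0f}/month saved")
```

Changing `docs_per_day` shows the linear scaling directly: ten times the volume gives ten times the savings.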
Conclusion
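Across 50 documents and 8 file formats, converting to Markdown before sharing with AI tools reduces token usage by an average of 63%. For the seven document formats, the per-format averages fall between 59% and 71%; YouTube transcripts are the exception at 28%, because raw transcript text is already fairly clean.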
The practical implication: for any document-heavy AI workflow, Markdown conversion should be the first step in your process. It costs 30 seconds per document and, at a 63% reduction, fits roughly 2.7x as much content into your AI tool's context window (1 / (1 − 0.63) ≈ 2.7).