We Tested 50 Documents: Markdown Token Savings

Real token count data from 50 documents across 8 file formats — showing exactly how much converting to Markdown reduces AI token usage and API costs.

Note: The figures below compare Markdown against manually copy-pasted text. For a comparison against native AI file-upload costs (what actually happens when you drag a PDF into Claude or ChatGPT), see our updated analysis.


inktomd's Markdown output uses 63% fewer tokens on average than the same content copy-pasted as raw extracted text. Here is the data behind that number: 50 real documents, 8 file formats, measured with OpenAI's tiktoken tokenizer. (For how this compares to uploading files directly to Claude or ChatGPT, see our token savings analysis.)

Methodology

We selected 50 documents across 8 common file formats and measured token counts in two scenarios:

Raw format — Text copied directly from the source file and pasted as-is
Markdown conversion — The same document converted to clean Markdown using inktomd, then measured

Token counts were measured using tiktoken with the cl100k_base encoding (used by GPT-4 and GPT-3.5 Turbo; GPT-4o uses the newer o200k_base encoding, so its absolute counts will differ somewhat). All documents were real-world files (research papers, business reports, spreadsheets, presentations), not synthetic test cases.
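The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used for the study: the function names `count_tokens` and `reduction_percent` are hypothetical, and tiktoken is a third-party package that must be installed separately.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens in `text` with tiktoken (pip install tiktoken)."""
    # Imported lazily so reduction_percent() works without tiktoken installed.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats strings such as "<|endoftext|>" as plain
    # text instead of raising, which is safer for arbitrary documents.
    return len(encoding.encode(text, disallowed_special=()))


def reduction_percent(raw_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens removed by converting raw text to Markdown."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 100.0 * (raw_tokens - markdown_tokens) / raw_tokens


if __name__ == "__main__":
    # Figures from the first PDF row in the results below.
    print(f"{reduction_percent(11_200, 4_100):.0f}%")
```

In practice you would call `count_tokens` once on the pasted raw text and once on the converted Markdown, then pass both counts to `reduction_percent`.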

Results by Format

PDF Documents (12 files tested)

PDFs showed substantial overhead: in the documents listed below, the raw extracted text consistently ran about 2.7–2.8x the token count of the equivalent Markdown.

Document	Raw Tokens	Markdown Tokens	Reduction
8-page research paper	11,200	4,100	63%
15-page technical report	21,400	7,600	64%
6-page financial summary	8,900	3,200	64%
22-page annual report	31,800	11,400	64%
4-page product spec	5,600	2,100	63%
Average (all 12 PDFs)	—	—	64%

PDF overhead comes primarily from: hard line breaks at column widths, page header/footer repetition, garbled multi-column layouts, and encoding artifacts from PDF text extraction.
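To make two of these sources concrete (repeated page headers/footers and hard line breaks), here is a minimal sketch of how they could be stripped from per-page extracted text. It is an illustration under simple assumptions, not inktomd's implementation: the function name `clean_extracted_pages` and the `min_repeats` threshold are hypothetical, and it only catches headers and footers that repeat verbatim (a footer like "Page 3" would slip through).

```python
from collections import Counter


def clean_extracted_pages(pages: list[str], min_repeats: int = 3) -> str:
    """Drop lines repeated on many pages, then unwrap hard line breaks."""
    # Count on how many pages each non-empty (stripped) line appears.
    pages_with_line = Counter()
    for page in pages:
        unique_lines = {ln.strip() for ln in page.splitlines() if ln.strip()}
        pages_with_line.update(unique_lines)
    repeated = {ln for ln, n in pages_with_line.items() if n >= min_repeats}

    paragraphs = []
    current = []  # lines of the paragraph being assembled
    for page in pages:
        for line in page.splitlines():
            text = line.strip()
            if text in repeated:
                continue  # likely a running header or footer
            if not text:
                # A blank line ends the current paragraph.
                if current:
                    paragraphs.append(" ".join(current))
                    current = []
                continue
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Because the paragraph buffer carries over from one page to the next, a paragraph split by a page break is rejoined once the header and footer between the two halves are removed.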

Word Documents (8 documents tested)

Word documents (.docx) are cleaner than PDFs but still carry overhead from style metadata, formatting markers, and comment/revision data that bleeds into extracted text.

Document	Raw Tokens	Markdown Tokens	Reduction
3,000-word report	5,100	2,100	59%
8,000-word proposal	13,600	5,500	60%
1,500-word memo	2,600	1,100	58%
12,000-word thesis chapter	20,400	8,200	60%
Average (all 8 Word docs)	—	—	59%

Excel Spreadsheets (8 files tested)

Excel savings varied with data density and formatting complexity: dense data tables saw the highest savings, and sparse workbooks with heavy formatting saw somewhat lower savings.

File	Raw Tokens	Markdown Tokens	Reduction
500-row sales data (5 cols)	8,400	2,900	65%
12-sheet financial model	44,200	15,800	64%
200-row inventory list	3,600	1,300	64%
Survey results (800 rows)	13,200	4,600	65%
Average (all 8 Excel files)	—	—	64%

The specific savings mechanism for Excel is conversion of raw cell data to structured Markdown pipe tables with explicit column headers.
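For readers unfamiliar with pipe tables, the conversion could look something like the sketch below. It assumes the cell values have already been read from the spreadsheet into Python lists; `to_markdown_table` is a hypothetical helper, not part of inktomd.

```python
def to_markdown_table(headers: list, rows: list) -> str:
    """Render a header row and data rows as a Markdown pipe table."""

    def cell(value) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the row.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    width = len(headers)
    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        # Pad short rows and truncate long ones to the header width.
        padded = (list(row) + [""] * width)[:width]
        lines.append("| " + " | ".join(cell(v) for v in padded) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_markdown_table(["Region", "Units"], [["North", 120], ["South", 95]]))
```

The explicit header and separator rows are what let a model associate each value with its column name.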

PowerPoint Presentations (6 files tested)

PowerPoint conversion is particularly valuable because raw slide content loses all visual hierarchy — bullet hierarchy, section breaks between slides, and speaker notes placement.

File	Raw Tokens	Markdown Tokens	Reduction
25-slide conference talk	8,100	2,700	67%
40-slide product deck	13,200	4,400	67%
15-slide quarterly review	4,900	1,600	67%
Average (all 6 PPTX files)	—	—	67%

PowerPoint files showed the most consistent savings, with Markdown output preserving slide-by-slide structure via slide separator markers and heading hierarchy.

HTML Web Pages (6 pages tested)

HTML is particularly noisy when copied — navigation menus, sidebars, scripts, cookie notices, and inline styles inflate the raw token count significantly.

Page	Raw Tokens	Markdown Tokens	Reduction
Long-form blog post	14,200	3,900	73%
News article	8,600	2,300	73%
Documentation page	11,400	3,600	68%
Average (all 6 HTML pages)	—	—	71%

HTML showed the highest savings of any format because of the amount of non-content markup that gets stripped during Markdown conversion.
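As a rough illustration of what "stripping non-content markup" means, the sketch below uses Python's standard-library `html.parser` to keep only text that sits outside scripts, styles, and navigation-style elements. It does not produce Markdown and is not inktomd's converter; the `SKIP_TAGS` set and the names `ContentExtractor` and `extract_content` are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements whose contents are treated as non-content (an assumed list).
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer", "noscript"}


class ContentExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A skipped element that is never closed would suppress everything after it, so real pages with malformed markup need a more forgiving parser.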

CSV Data Files (4 files tested)

CSV files are cleaner than most formats but still benefit from structured Markdown table conversion that adds explicit column headers.

File	Raw Tokens	Markdown Tokens	Reduction
1,000-row customer data	16,800	6,100	64%
300-row product catalog	5,200	1,900	63%
Average (all 4 CSV files)	—	—	63%

EPUB Books (3 files tested)

EPUB files showed consistent savings driven by removal of chapter navigation markup and XML structure.

Average savings: 62%

YouTube Transcripts (3 videos tested)

YouTube transcripts via URL conversion showed modest savings compared to raw transcript text — the main benefit here is convenience and clean formatting rather than dramatic token reduction.

Average savings: 28% (lower because raw transcript text is already fairly clean)

Summary: Average Savings by Format

Format	Average Token Reduction
HTML	71%
PowerPoint	67%
PDF	64%
Excel	64%
CSV	63%
EPUB	62%
Word	59%
YouTube	28%
Overall average (50 documents)	63%
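The overall figure can be roughly cross-checked by weighting each format's average by the number of documents tested. The sketch below uses only the rounded per-format averages from this article, so it is an approximation: it lands at about 62%, and the one-point gap from the reported 63% is consistent with rounding in the per-format figures or with a different weighting in the original calculation.

```python
# (documents tested, average reduction in percent), taken from the sections above.
FORMATS = {
    "PDF": (12, 64),
    "Word": (8, 59),
    "Excel": (8, 64),
    "PowerPoint": (6, 67),
    "HTML": (6, 71),
    "CSV": (4, 63),
    "EPUB": (3, 62),
    "YouTube": (3, 28),
}


def weighted_average(formats: dict) -> float:
    """Document-weighted mean of the per-format reductions."""
    total_docs = sum(count for count, _ in formats.values())
    weighted_sum = sum(count * pct for count, pct in formats.values())
    return weighted_sum / total_docs


if __name__ == "__main__":
    print(f"{weighted_average(FORMATS):.2f}%")
```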

Cost Implications at Scale

These token savings translate directly to reduced API costs for developers and more effective usage of ChatGPT and Claude context windows for everyone.

For ChatGPT Plus users: A 128k token context window with raw PDF input fits roughly 8–9 ten-page PDFs. With Markdown conversion, the same window fits 22–24 ten-page PDFs — nearly 3x more document content per session.
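The arithmetic behind these ranges can be sketched as follows. The per-document figures are assumptions derived from the PDF table above (roughly 1,400 raw tokens and 510 Markdown tokens per page), and the 8,000-token reserve for the prompt and response is a hypothetical allowance, not a figure from the study.

```python
def docs_per_window(window_tokens: int, tokens_per_doc: int,
                    reserved_tokens: int = 0) -> int:
    """How many whole documents fit in a context window."""
    usable = max(0, window_tokens - reserved_tokens)
    return usable // tokens_per_doc


WINDOW = 128_000
RAW_TEN_PAGE_PDF = 14_000       # assumed: ~1,400 raw tokens per page
MARKDOWN_TEN_PAGE_PDF = 5_100   # assumed: ~510 Markdown tokens per page

if __name__ == "__main__":
    for reserve in (0, 8_000):
        raw = docs_per_window(WINDOW, RAW_TEN_PAGE_PDF, reserve)
        md = docs_per_window(WINDOW, MARKDOWN_TEN_PAGE_PDF, reserve)
        print(f"reserve={reserve}: raw={raw}, markdown={md}")
```

Under these assumptions the raw figure lands at 8 or 9 documents and the Markdown figure in the low-to-mid 20s, in line with the ranges quoted above.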

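For API users (GPT-4o input at $2.50 per 1M tokens), processing 1,000 documents per day (the costs below imply an average of about 18,400 raw tokens per document, or about 6,800 after conversion):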

Format	Daily Input Cost	Monthly Cost
Raw mixed formats	~$46	~$1,380
Markdown conversion	~$17	~$510
Savings	~$29/day	~$870/month

At higher volumes the savings scale linearly.
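The cost table can be reproduced with a few lines of arithmetic. The per-document token averages below (18,400 raw and 6,800 Markdown) are not stated in the study; they are back-calculated from the daily costs in the table and should be treated as assumptions, as should the 30-day month.

```python
def daily_cost(docs_per_day: int, avg_tokens_per_doc: int,
               price_per_million: float) -> float:
    """Input-token cost in dollars for one day of processing."""
    return docs_per_day * avg_tokens_per_doc * price_per_million / 1_000_000


PRICE = 2.50          # dollars per 1M input tokens (GPT-4o, as quoted above)
DOCS_PER_DAY = 1_000
RAW_TOKENS = 18_400       # assumed average, inferred from the ~$46/day figure
MARKDOWN_TOKENS = 6_800   # assumed average, inferred from the ~$17/day figure

if __name__ == "__main__":
    raw = daily_cost(DOCS_PER_DAY, RAW_TOKENS, PRICE)
    md = daily_cost(DOCS_PER_DAY, MARKDOWN_TOKENS, PRICE)
    print(f"raw: ${raw:.0f}/day, ${raw * 30:.0f}/month")
    print(f"markdown: ${md:.0f}/day, ${md * 30:.0f}/month")
    print(f"savings: ${raw - md:.0f}/day, ${(raw - md) * 30:.0f}/month")
```

Because cost is a straight product of volume, tokens, and price, doubling `DOCS_PER_DAY` doubles every figure, which is the linear scaling noted above.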

Conclusion

Across 50 documents and 8 file formats, converting to Markdown produces output that uses 63% fewer tokens on average compared to manually copy-pasting the same content as raw text.

The practical implication: for any document-heavy AI workflow, Markdown conversion should be the first step in your process. It costs about 30 seconds per document and, at 63% fewer tokens, returns roughly 2.7x the effective capacity of your AI tool's context window.
