How to Feed Documents to an LLM: A Developer's Guide to Document Preprocessing

The quality of LLM responses on document-heavy tasks is determined mostly by what happens before the prompt — the preprocessing pipeline that transforms raw documents into text an LLM can effectively reason about.

Most tutorials on document RAG and LLM pipelines spend 90% of their time on embeddings, vector databases, and retrieval strategies. They spend almost nothing on the document preprocessing step that determines whether any of that infrastructure produces useful results.

This guide fills that gap. Here's what actually matters when feeding documents to language models, and why Markdown is the format to standardize on.

The Core Problem: Format Mismatch

LLMs are trained on text. The vast majority of that training text is clean, structured, human-readable text — Markdown, HTML, plain prose, code. What they were not trained on is binary document formats like PDF and DOCX.

When you feed a raw PDF to an LLM — either by uploading it directly to an API endpoint or by extracting text before indexing — you're giving the model a degraded representation of the document. PDF text extraction is inherently lossy:

Structural information is lost. PDF readers render text at specific positions on a page. Extraction engines attempt to reconstruct reading order from position data, but this often fails on multi-column layouts, documents with sidebars, forms, and any other complex visual structure.

Semantic structure is lost. PDF doesn't have a heading element. Headings are just text rendered at a larger font size or bold weight. Extractors can sometimes infer headings from formatting, but they miss them often enough that the resulting text has no reliable heading structure.

Tables collapse. PDF tables are rendered as positioned text. When extracted linearly, table rows become runs of space-separated values with no column delimiter. The model has to infer table structure from context — and it often gets it wrong.

Equation content is lost or corrupted. Mathematical notation renders from a typesetting system into PDF glyphs. Those glyphs extract as a mix of symbols that typically doesn't correspond to valid mathematical notation in text form.

The consequence for LLM applications: lower accuracy on structured content, higher token usage per document, and retrieval failures on table and equation content.
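To make the table problem concrete, the sketch below renders the same rows two ways: as space-joined text, which is roughly what linear extraction produces, and as a Markdown table. The rows and both helper names are made up for illustration; real extractors differ in the details.

```python
# Hypothetical rows; an empty string stands for a blank cell in the source table.
ROWS = [
    ["Region", "Q1 Revenue", "Notes"],
    ["North America", "1.2M", ""],
    ["EMEA", "", "pending audit"],
]

def linear_extract(rows):
    """Approximate linear PDF extraction: cell text joined by spaces, blanks vanish."""
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)

def to_markdown_table(rows):
    """Render the same rows with explicit column delimiters."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(linear_extract(ROWS))
print(to_markdown_table(ROWS))
```

In the space-joined version, "pending audit" in the last row could belong to either the second or the third column. The Markdown version keeps the empty cell, so the column is unambiguous.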

Why Markdown Solves This

Markdown is the right target format for LLM document preprocessing for four specific reasons:

Training distribution. LLMs were trained on enormous quantities of Markdown and closely related lightweight markup: GitHub repositories, technical documentation, Stack Overflow posts, wiki source markup, and a large volume of Markdown-formatted web content. Markdown is effectively a native text format for these models.

Explicit structure. Markdown's structural markers — # headings, | tables, ``` code blocks, - lists — are explicit text tokens, not inferred from visual layout. The model doesn't have to guess that "INTRODUCTION" is a heading. # Introduction is unambiguous.

Token efficiency. Clean Markdown uses up to 82% fewer tokens than equivalent PDF-extracted text to represent the same information. For production RAG systems processing thousands of documents, this is a significant cost factor.

Chunking quality. Document chunking — splitting documents into segments for vector indexing — works much better on Markdown than on raw extracted text. Markdown heading structure gives you natural, semantically meaningful chunk boundaries.

Document Preprocessing Pipeline

Here's a production-ready approach to document preprocessing for LLM applications.

Step 1: Format Normalization

The first step is converting all source documents to a single consistent format — Markdown. This simplifies every downstream step because you're working with one text format instead of handling PDF, DOCX, XLSX, PPTX, and other formats separately.

For rapid prototyping or smaller document sets, inktomd.com handles manual conversion across 28 formats — PDF, Word, Excel, PowerPoint, EPUB, HTML, CSV, JSON, XML, Jupyter notebooks, email files, ZIP archives, and URL-based sources including YouTube transcripts and ArXiv papers.

For production pipelines, Microsoft's open source MarkItDown library (which powers inktomd's backend) is available as a Python package:

pip install 'markitdown[all]'

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
markdown_text = result.text_content

MarkItDown is MIT licensed and handles the same 28 format conversions as inktomd's backend.
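For more than a handful of files, the conversion call usually ends up inside a batch loop. Here is a minimal sketch, assuming a hypothetical `normalize_directory` helper and a `convert` callable that you supply; neither is part of MarkItDown.

```python
from pathlib import Path
from typing import Callable

def normalize_directory(src: Path, dst: Path,
                        convert: Callable[[Path], str]) -> dict[str, str]:
    """Convert every file under src to a .md file under dst.

    `convert` is any callable that takes a path and returns Markdown text.
    Returns a mapping of failed source paths to their error messages.
    """
    failures: dict[str, str] = {}
    # Materialize the file list first so files written during the run are not revisited.
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        # Appending .md (report.pdf -> report.pdf.md) keeps same-named files
        # with different source formats from overwriting each other.
        target = dst / path.relative_to(src)
        target = target.with_name(target.name + ".md")
        try:
            text = convert(path)
        except Exception as exc:  # one bad file should not stop the batch
            failures[str(path)] = str(exc)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    return failures

# Usage with MarkItDown, using the same calls as the snippet above:
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   failures = normalize_directory(Path("raw"), Path("markdown"),
#                                  lambda p: md.convert(str(p)).text_content)
```

Passing the converter in as a callable keeps the loop testable without any documents on hand. Returning the failures means unsupported or corrupt files are recorded rather than silently skipped.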

Step 2: Markdown Cleaning

Raw Markdown conversion output sometimes contains noise that degrades downstream quality. Post-processing steps worth applying:

Remove repeated headers and footers. Page numbers and document headers that repeat on every page appear multiple times in extracted Markdown. Deduplicate these.

Normalize heading hierarchy. Some documents use inconsistent heading levels. A cleanup pass that ensures headings form a proper tree (H1 → H2 → H3, no skipped levels) improves chunking quality.

Flag low-confidence sections. Conversion from complex tables or multi-column content sometimes produces visibly degraded output. A simple heuristic (unusually high ratio of single-character tokens, repeated whitespace) can flag sections for human review.

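from collections import Counter

# Lines starting with these characters (headings, tables, lists, quotes, fences)
# legitimately repeat and are never treated as page headers or footers.
STRUCTURAL = ('#', '|', '-', '*', '>', '`')

def clean_markdown(text: str, min_repeats: int = 3) -> str:
    lines = text.split('\n')
    # Count short lines; page headers/footers repeat many times across a document
    counts = Counter(l.strip() for l in lines if 0 < len(l.strip()) < 60)
    cleaned, seen, in_code = [], set(), False

    for line in lines:
        s = line.strip()
        if s.startswith('```'):
            in_code = not in_code  # never remove lines inside fenced code
        elif (not in_code and counts[s] >= min_repeats
              and not s.startswith(STRUCTURAL)):
            if s in seen:
                continue  # drop every repeat after the first occurrence
            seen.add(s)
        cleaned.append(line)

    return '\n'.join(cleaned)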
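The other two cleanup passes described in this step can be sketched in the same spirit. This is a minimal sketch rather than a definitive implementation: `normalize_headings`, `flag_low_confidence`, and both threshold values are illustrative names and guesses, not part of MarkItDown.

```python
import re

HEADING = re.compile(r'^(#{1,6})[ \t]+(.*)$')

def normalize_headings(markdown: str) -> str:
    """Close gaps in heading depth so no level is skipped (H1 -> H3 becomes H1 -> H2)."""
    out, stack, in_code = [], [], False  # stack: original levels of the open ancestors
    for line in markdown.split('\n'):
        if line.lstrip().startswith('```'):
            in_code = not in_code
        match = None if in_code else HEADING.match(line)
        if match:
            level = len(match.group(1))
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            # New depth is the number of ancestors plus this heading itself.
            line = '#' * len(stack) + ' ' + match.group(2)
        out.append(line)
    return '\n'.join(out)

def flag_low_confidence(markdown: str, max_single_char_ratio: float = 0.3,
                        max_space_run: int = 4) -> list[int]:
    """Return indexes of blank-line-separated blocks that look degraded.

    Thresholds are illustrative guesses; tune them on your own documents.
    """
    flagged = []
    for i, block in enumerate(markdown.split('\n\n')):
        tokens = block.split()
        if not tokens:
            continue
        # Count single letters/digits only, so table pipes and dashes are ignored.
        single = sum(1 for t in tokens if len(t) == 1 and t.isalnum())
        too_many_singles = single / len(tokens) > max_single_char_ratio
        long_space_run = re.search(r'\S {%d,}\S' % max_space_run, block) is not None
        if too_many_singles or long_space_run:
            flagged.append(i)
    return flagged
```

Note that `normalize_headings` also promotes the first heading to H1 if the document starts deeper. That may or may not suit a corpus where files are fragments of a larger document.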

Step 3: Semantic Chunking

For RAG applications, how you chunk documents has as much impact on retrieval quality as the embedding model or vector database you choose.

Header-based chunking is the most effective strategy for well-structured documents. Split at heading boundaries (H1 through H3 in the code that follows), so that each chunk starts with its heading text, which provides context for what the chunk is about.

import re

def heading_of(section: str, default: str = 'Introduction') -> str:
    """Return the text of the heading that opens a section, or a default."""
    match = re.match(r'#{1,6}[ \t]+(.+)', section)
    return match.group(1).strip() if match else default

def chunk_by_headers(markdown: str, max_chunk_size: int = 1000) -> list[dict]:
    chunks = []
    # Split before every H1-H3 heading; the lookahead keeps the heading in its section.
    # Caveat: a '# comment' line inside a fenced code block also matches this pattern.
    sections = re.split(r'\n(?=#{1,3} )', markdown)

    for section in sections:
        if not section.strip():
            continue  # skip empty leading or trailing sections
        current_heading = heading_of(section)
        if len(section) <= max_chunk_size:
            chunks.append({'content': section, 'heading': current_heading})
            continue

        # Split large sections by paragraph, carrying the section heading along.
        # A single paragraph longer than max_chunk_size is kept whole.
        current_chunk = ''
        for para in section.split('\n\n'):
            if current_chunk and len(current_chunk) + len(para) > max_chunk_size:
                chunks.append({'content': current_chunk, 'heading': current_heading})
                current_chunk = ''
            current_chunk += para + '\n\n'
        if current_chunk:
            chunks.append({'content': current_chunk, 'heading': current_heading})

    return chunks

Table preservation. Tables should not be split mid-table. Identify Markdown table blocks and treat them as atomic chunks regardless of size.

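def preserve_tables(chunks: list[dict]) -> list[dict]:
    """Merge consecutive chunks when a table was split across them."""
    def edge_line(content: str, last: bool) -> str:
        # First or last non-blank line of a chunk, stripped of whitespace
        lines = [l.strip() for l in content.split('\n') if l.strip()]
        return (lines[-1] if last else lines[0]) if lines else ''

    result = []
    for chunk in chunks:
        # A split table shows up as a table row ending one chunk and starting the next
        if (result and edge_line(result[-1]['content'], last=True).startswith('|')
                and edge_line(chunk['content'], last=False).startswith('|')):
            merged = dict(result[-1])  # copy so the input chunks are not mutated
            merged['content'] = merged['content'].rstrip('\n') + '\n' + chunk['content']
            result[-1] = merged  # keeps merging if the table spans several chunks
        else:
            result.append(chunk)
    return result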
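An alternative to repairing splits after the fact is to identify table blocks up front, so the chunker never sees the inside of a table. A minimal sketch, assuming tables are written with leading pipes on every row (`split_table_blocks` is an illustrative name, not an existing API):

```python
def split_table_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split text into ('table', block) and ('text', block) segments, in order."""
    segments: list[tuple[str, list[str]]] = []
    for line in markdown.split('\n'):
        kind = 'table' if line.lstrip().startswith('|') else 'text'
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(line)    # extend the current run
        else:
            segments.append((kind, [line]))  # start a new run
    return [(kind, '\n'.join(lines)) for kind, lines in segments]
```

Each 'table' segment can then be emitted as a single chunk regardless of size, while 'text' segments go through the usual size-based splitting. Joining the segment texts with newlines reproduces the input, so nothing is lost in the split.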

Step 4: Metadata Enrichment

Each chunk should carry metadata that supports retrieval and citation:

{
    'content': '## Results\n\nThe model achieved 94.2% accuracy...',
    'heading': 'Results',
    'source_document': 'paper_2024_transformer_efficiency.md',
    'source_type': 'pdf',
    'page_estimate': 4,  # rough estimate from chunk position
    'chunk_index': 12,
    'total_chunks': 28,
    'contains_table': True,
    'contains_code': False
}

The contains_table and contains_code flags support filtered retrieval — useful when you need to specifically retrieve or exclude structured content.
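Most of these fields can be derived mechanically from the chunk list. The sketch below assumes chunks shaped like the `{'content', 'heading'}` dicts produced earlier; `enrich_chunks` and `filter_chunks` are illustrative names. It omits `page_estimate`, which needs page information that the Markdown text alone does not carry.

```python
def enrich_chunks(chunks: list[dict], source_document: str, source_type: str) -> list[dict]:
    """Return copies of the chunks with position and content-type metadata added."""
    enriched = []
    for index, chunk in enumerate(chunks):
        lines = chunk['content'].split('\n')
        enriched.append({
            **chunk,
            'source_document': source_document,
            'source_type': source_type,
            'chunk_index': index,
            'total_chunks': len(chunks),
            # Leading-pipe lines are treated as table rows; ``` marks fenced code
            'contains_table': any(l.lstrip().startswith('|') for l in lines),
            'contains_code': any(l.lstrip().startswith('```') for l in lines),
        })
    return enriched

def filter_chunks(chunks: list[dict], **flags: bool) -> list[dict]:
    """Keep chunks whose metadata matches every given flag, e.g. contains_table=True."""
    return [c for c in chunks if all(c.get(k) == v for k, v in flags.items())]
```

In a real system the filter would usually be pushed down into the vector store's metadata query. The list comprehension here only shows the shape of the condition.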

Token Cost Benchmarks

For teams making infrastructure decisions, the useful numbers are token counts from real documents, measured with a fixed tokenizer such as cl100k_base.

Converting a document to clean Markdown before sharing it with an LLM reduces token costs. Stripping out XML formatting, hidden metadata, and redundant structural elements preserves the human-readable text while discarding the bloat. For measured savings across file formats and the methodology behind them, read our token savings analysis.

At scale, document format normalization to Markdown is one of the highest-ROI infrastructure decisions available.
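Measuring the saving on your own documents takes only a few lines. The sketch below is a hedged illustration: `token_savings` and `approx_token_count` are made-up helpers, and the default counter uses the rough rule of thumb of about four characters per token, so swap in a real tokenizer before drawing conclusions.

```python
from typing import Callable

def approx_token_count(text: str) -> int:
    """Rough stand-in: about four characters per token for English text."""
    return len(text) // 4

def token_savings(raw_text: str, markdown_text: str,
                  count_tokens: Callable[[str], int] = approx_token_count) -> dict:
    """Compare token counts for the raw extraction and the cleaned Markdown."""
    raw = count_tokens(raw_text)
    md = count_tokens(markdown_text)
    percent = round(100 * (raw - md) / raw, 1) if raw else 0.0
    return {'raw_tokens': raw, 'markdown_tokens': md, 'percent_saved': percent}

# With tiktoken installed, pass a real counter instead of the approximation:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   token_savings(raw_text, markdown_text, lambda t: len(enc.encode(t)))
```

Run this over a representative sample of each document type rather than a single file, since savings vary widely with how much layout residue the raw extraction carries.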

The Simplest Starting Point

For teams evaluating Markdown preprocessing before building a pipeline, inktomd.com lets you test the conversion quality on your actual documents before committing to implementation.

Upload your most problematic document type — whatever has been causing retrieval failures or quality issues in your current pipeline — and check whether the Markdown output would actually solve the problem. Most teams find that it does, substantially.

For production implementation, MarkItDown (pip install 'markitdown[all]') is the same conversion engine, MIT licensed, with a simple Python API.
