How to Chunk Files for a RAG Model (with Python Example)
Retrieval-Augmented Generation (RAG) works best when your source data is split into smart, bite-sized chunks. Here’s how to do it well—and why it matters.
What is Chunking?
In a RAG workflow, documents (PDFs, web pages, reports, etc.) are split into smaller sections called
chunks. Each chunk is embedded and stored in a vector database (e.g., Pinecone, Chroma, FAISS).
At query time, the system retrieves only the most relevant chunks and feeds them to the LLM for a grounded answer.
If chunks are too large, retrieval becomes less precise and prompts can exceed token limits. If they're too small, you
lose semantic coherence. A common starting point is 200–500 words per chunk with a 10–20% overlap.
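Because the example chunker later in this post measures size in characters, it helps to convert the word-based rule of thumb. A rough sketch, assuming about six characters per English word (spaces included; adjust for your corpus):

```python
# Rough sizing: map the word-based guideline onto character counts.
# Assumes ~6 characters per English word, spaces included.
words_per_chunk = 400                 # within the 200-500 word guideline
chunk_size = words_per_chunk * 6      # ~2,400 characters
overlap = int(chunk_size * 0.15)      # 15% overlap -> 360 characters
```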
Why Chunking Matters
- Retrieval accuracy: Better segmentation improves semantic similarity search.
- Context quality: Overlap helps avoid cutting off important sentences.
- Token efficiency: Right-sized chunks keep prompts within model limits.
The Chunking Workflow
- Load the file — Read text from a source (e.g., .txt, .pdf, .docx); a minimal load-and-clean sketch follows this list.
- Clean the text — Normalize whitespace, remove artifacts, standardize line breaks.
- Split into chunks — Divide text into overlapping segments.
- Store embeddings — Embed each chunk and persist vectors + metadata for retrieval.
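For the first two steps, here is a minimal sketch for plain .txt files (PDF and DOCX sources would need an extractor such as pypdf or python-docx; `load_and_clean` is an illustrative helper, not a library function):

```python
import re
from pathlib import Path

def load_and_clean(path: str) -> str:
    """Read a plain-text file and normalize its whitespace."""
    raw = Path(path).read_text(encoding="utf-8")
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # standardize line breaks
    text = re.sub(r"[ \t]+", " ", text)                   # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                # cap runs of blank lines
    return text.strip()
```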
Python Example: Simple Text Chunking (Sliding Window)
```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks.

    Args:
        text (str): The input text.
        chunk_size (int): Number of characters per chunk.
        overlap (int): Number of overlapping characters between chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if not (0 <= overlap < chunk_size):
        raise ValueError("overlap must be between 0 and chunk_size-1")

    chunks: List[str] = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        # Slide window forward while preserving overlap
        start += chunk_size - overlap
    return chunks
```
Code Breakdown
1) Function Definition
`chunk_text()` accepts `text`, a `chunk_size` (characters per chunk), and an `overlap` (characters of context to preserve between chunks). It returns a list of chunks to embed.
2) Sliding Window Logic
We move a window of `chunk_size` characters through the text, advancing by `chunk_size - overlap` each step. This overlap keeps context flowing across boundaries.
3) Basic Validation & Trimming
The function validates parameters and trims whitespace from each produced chunk before returning them.
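For example, out-of-range arguments fail fast rather than silently producing malformed chunks:

```python
chunk_text("some text", chunk_size=100, overlap=100)
# ValueError: overlap must be between 0 and chunk_size-1
```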
Example Usage
```python
if __name__ == "__main__":
    # Example text (in practice, load from a file or extractor)
    text = (
        "Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs "
        "with external knowledge sources. To use RAG effectively, we split long documents "
        "into overlapping chunks to support high-quality retrieval and grounded responses."
    )
    chunks = chunk_text(text, chunk_size=120, overlap=20)
    for i, c in enumerate(chunks, start=1):
        print(f"--- Chunk {i} ---\n{c}\n")
```
Putting It All Together in a RAG Pipeline
- Embed each chunk using your embedding model (e.g., OpenAI, Cohere, etc.).
- Store embeddings in a vector DB (e.g., Pinecone, Chroma, FAISS) with metadata (source, page, etc.).
- Retrieve top-k chunks per query via vector similarity.
- Augment the prompt with retrieved chunks before sending it to the LLM, as in the sketch below.
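Here is a minimal end-to-end sketch of those four steps, assuming sentence-transformers as the embedding model and a plain NumPy array in place of a real vector DB; `retrieve()` and the example query are illustrative, not a prescribed API:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

# 1) Embed each chunk (chunk_text and the example `text` come from above).
chunks = chunk_text(text, chunk_size=120, overlap=20)
vectors = model.encode(chunks)  # ndarray of shape (num_chunks, dim)
metadata = [{"source": "example", "chunk": i} for i in range(len(chunks))]

# 2) "Store": this in-memory matrix stands in for Pinecone/Chroma/FAISS.
# 3) Retrieve the top-k chunks for a query by cosine similarity.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = model.encode([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# 4) Augment the prompt with the retrieved chunks before calling the LLM.
question = "How does RAG use chunks?"
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```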
Next Steps
- Experiment with sentence-aware splitting (e.g., NLTK) or recursive splitters (e.g., LangChain's RecursiveCharacterTextSplitter); see the sketch after this list.
- Tune chunk sizes based on your model's context window and domain.
- Attach metadata (document ID, section, page) for traceability and evaluation.
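As a starting point for the first item, a minimal sentence-aware variant might look like this (it uses a naive regex splitter; NLTK's sent_tokenize or a recursive splitter would handle edge cases such as abbreviations better):

```python
import re
from typing import List

def chunk_by_sentences(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow.
        # (A single sentence longer than max_chars becomes its own chunk.)
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```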