How to Chunk Files for a RAG Model (with Python Example)

Retrieval-Augmented Generation (RAG) works best when your source data is split into smart, bite-sized chunks. Here’s how to do it well—and why it matters.

What is Chunking?

In a RAG workflow, documents (PDFs, web pages, reports, etc.) are split into smaller sections called
chunks. Each chunk is embedded and stored in a vector database (e.g., Pinecone, Chroma, FAISS).
At query time, the system retrieves only the most relevant chunks and feeds them to the LLM for a grounded answer.

If chunks are too large, the model may miss context due to token limits. If they’re too small, you can lose semantic
coherence. A common starting point is 200–500 words per chunk with a 10–20% overlap.
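For example, a 3,000-word document chunked at 400 words with a 15% (60-word) overlap advances 340 words per step and yields roughly nine chunks.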

Why Chunking Matters

  • Retrieval accuracy: Better segmentation improves semantic similarity search.
  • Context quality: Overlap helps avoid cutting off important sentences.
  • Token efficiency: Right-sized chunks keep prompts within model limits.

The Chunking Workflow

  1. Load the file — Read text from a source (e.g., .txt, .pdf, .docx).
  2. Clean the text — Normalize whitespace, remove artifacts, standardize line breaks (steps 1–2 are sketched after this list).
  3. Split into chunks — Divide text into overlapping segments.
  4. Store embeddings — Embed each chunk and persist vectors + metadata for retrieval.
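
Steps 1 and 2 can be as simple as the following sketch. It assumes a plain .txt source and a hypothetical file name; PDF or DOCX input would need an extractor such as pypdf or python-docx before the cleaning step.

import re
from pathlib import Path

def load_text(path: str) -> str:
    """Step 1: read a plain-text file (PDF/DOCX would need an extractor)."""
    return Path(path).read_text(encoding="utf-8")

def clean_text(raw: str) -> str:
    """Step 2: standardize line breaks and collapse stray whitespace."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # standardize line breaks
    text = re.sub(r"[ \t]+", " ", text)                   # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                # limit blank-line runs
    return text.strip()

# Example (hypothetical file name):
# text = clean_text(load_text("report.txt"))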

Python Example: Simple Text Chunking (Sliding Window)

from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks.

    Args:
        text (str): The input text.
        chunk_size (int): Number of characters per chunk.
        overlap (int): Number of overlapping characters between chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if not (0 <= overlap < chunk_size):
        raise ValueError("overlap must be between 0 and chunk_size-1")

    chunks: List[str] = []
    start = 0
    n = len(text)

    while start < n:
        end = min(start + chunk_size, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Slide window forward while preserving overlap
        start += (chunk_size - overlap)

    return chunks

Code Breakdown

1) Function Definition

chunk_text() accepts text, a chunk_size (characters per chunk), and
an overlap (characters of context to preserve between chunks). It returns a list of chunks to embed.

2) Sliding Window Logic

We move a window of chunk_size characters through the text, advancing by
chunk_size - overlap each step. This overlap keeps context flowing across boundaries.
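For example, with chunk_size=500 and overlap=50, successive windows cover characters 0–500, 450–950, 900–1,400, and so on, so each new chunk repeats the last 50 characters of the previous one.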

3) Basic Validation & Trimming

The function validates parameters and trims whitespace from each produced chunk before returning them.
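For example, chunk_text(text, chunk_size=100, overlap=100) would never advance the window (the step would be zero), so the function rejects it up front rather than looping forever.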

Example Usage

if __name__ == "__main__":
    # Example text (in practice, load from a file or extractor)
    text = (
        "Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs "
        "with external knowledge sources. To use RAG effectively, we split long documents "
        "into overlapping chunks to support high-quality retrieval and grounded responses."
    )

    chunks = chunk_text(text, chunk_size=120, overlap=20)

    for i, c in enumerate(chunks, start=1):
        print(f"--- Chunk {i} ---\n{c}\n")
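
With chunk_size=120 and overlap=20, this example text (about 240 characters) yields three chunks, and adjacent chunks share roughly the last 20 characters of the preceding window before trimming.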

Putting It All Together in a RAG Pipeline

  1. Embed each chunk using your embedding model (e.g., OpenAI, Cohere, etc.).
  2. Store embeddings in a vector DB (e.g., Pinecone, Chroma, FAISS) with metadata (source, page, etc.).
  3. Retrieve top-k chunks per query via vector similarity.
  4. Augment the prompt with retrieved chunks before sending it to the LLM (a minimal end-to-end sketch follows this list).
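
The sketch below walks through those four steps using the chromadb package and its default embedding function (which downloads a small model on first run). The collection name, metadata fields, example chunks, and query are illustrative assumptions; any embedding model or vector store from the list above would slot in the same way.

import chromadb

# Illustrative chunks; in practice these come from chunk_text() above
chunks = [
    "RAG combines LLM reasoning with external knowledge sources.",
    "Documents are split into overlapping chunks before embedding.",
    "At query time the most relevant chunks are retrieved and added to the prompt.",
]

client = chromadb.Client()                           # in-memory vector store
collection = client.create_collection(name="docs")   # uses Chroma's default embedder

# Steps 1-2: embed and store each chunk with simple metadata
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    metadatas=[{"source": "example.txt", "position": i} for i in range(len(chunks))],
)

# Step 3: retrieve the top-k most similar chunks for a query
query = "How does RAG use external knowledge?"
results = collection.query(query_texts=[query], n_results=3)
retrieved = results["documents"][0]

# Step 4: augment the prompt with the retrieved context before calling the LLM
prompt = (
    "Answer using only the context below.\n\nContext:\n"
    + "\n---\n".join(retrieved)
    + f"\n\nQuestion: {query}"
)
print(prompt)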

Next Steps

  • Experiment with sentence-aware splitting (e.g., NLTK) or recursive splitters (e.g., LangChain’s RecursiveCharacterTextSplitter); a sentence-aware sketch follows this list.
  • Tune chunk sizes based on your model’s context window and domain.
  • Attach metadata (document ID, section, page) for traceability and evaluation.
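
As a starting point, here is a minimal sentence-aware sketch built on NLTK's sent_tokenize. It assumes the punkt tokenizer data is available (downloaded on first run) and packs whole sentences into chunks up to a character budget, so sentences are never cut mid-way; the max_chars default is an illustrative choice.

from typing import List
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data (newer NLTK versions may also need "punkt_tab")

def chunk_by_sentence(text: str, max_chars: int = 500) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = nltk.sent_tokenize(text)
    chunks: List[str] = []
    current: List[str] = []
    length = 0

    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the budget
        if current and length + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence) + 1

    if current:
        chunks.append(" ".join(current))
    return chunks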

In Summary

Chunking is the backbone of an effective RAG system. With a clean and balanced strategy, your model retrieves the right
context efficiently and produces more coherent, grounded responses.
