How to Chunk Files for a RAG Model (with Python Example)
Retrieval-Augmented Generation (RAG) works best when your source data is split into smart, bite-sized chunks. Here’s how to do it well—and why it matters.
What is Chunking?
In a RAG workflow, documents (PDFs, web pages, reports, etc.) are split into smaller sections called
chunks. Each chunk is embedded and stored in a vector database (e.g., Pinecone, Chroma, FAISS).
At query time, the system retrieves only the most relevant chunks and feeds them to the LLM for a grounded answer.
If chunks are too large, retrieval becomes less precise and prompts can exceed token limits. If they're too small, you
lose semantic coherence. A common starting point is 200–500 words per chunk with a 10–20% overlap.
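Because the example chunker later in this post measures size in characters, it helps to convert the word-based rule of thumb. A rough sketch, assuming about six characters per English word (spaces included; adjust for your corpus):

```python
# Rough sizing: map the word-based guideline onto character counts.
# Assumes ~6 characters per English word, spaces included.
words_per_chunk = 400                 # within the 200-500 word guideline
chunk_size = words_per_chunk * 6      # ~2,400 characters
overlap = int(chunk_size * 0.15)      # 15% overlap -> 360 characters
```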
Why Chunking Matters
- Retrieval accuracy: Better segmentation improves semantic similarity search.
- Context quality: Overlap helps avoid cutting off important sentences.
- Token efficiency: Right-sized chunks keep prompts within model limits.
The Chunking Workflow
- Load the file — Read text from a source (e.g., .txt, .pdf, .docx); a minimal load-and-clean sketch follows this list.
- Clean the text — Normalize whitespace, remove artifacts, standardize line breaks.
- Split into chunks — Divide text into overlapping segments.
- Store embeddings — Embed each chunk and persist vectors + metadata for retrieval.
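For the first two steps, here is a minimal sketch for plain .txt files (PDF and DOCX sources would need an extractor such as pypdf or python-docx; `load_and_clean` is an illustrative helper, not a library function):

```python
import re
from pathlib import Path

def load_and_clean(path: str) -> str:
    """Read a plain-text file and normalize its whitespace."""
    raw = Path(path).read_text(encoding="utf-8")
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # standardize line breaks
    text = re.sub(r"[ \t]+", " ", text)                   # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                # cap runs of blank lines
    return text.strip()
```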
Python Example: Simple Text Chunking (Sliding Window)
```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks.

    Args:
        text (str): The input text.
        chunk_size (int): Number of characters per chunk.
        overlap (int): Number of overlapping characters between chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if not (0 <= overlap < chunk_size):
        raise ValueError("overlap must be between 0 and chunk_size-1")

    chunks: List[str] = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        # Slide window forward while preserving overlap
        start += chunk_size - overlap
    return chunks
```
Code Breakdown
1) Function Definition
`chunk_text()` accepts `text`, a `chunk_size` (characters per chunk), and an `overlap` (characters of context to preserve between chunks). It returns a list of chunks to embed.
2) Sliding Window Logic
We move a window of `chunk_size` characters through the text, advancing by `chunk_size - overlap` each step. This overlap keeps context flowing across boundaries.
3) Basic Validation & Trimming
The function validates parameters and trims whitespace from each produced chunk before returning them.
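For example, out-of-range arguments fail fast rather than silently producing malformed chunks:

```python
chunk_text("some text", chunk_size=100, overlap=100)
# ValueError: overlap must be between 0 and chunk_size-1
```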
Example Usage
```python
if __name__ == "__main__":
    # Example text (in practice, load from a file or extractor)
    text = (
        "Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs "
        "with external knowledge sources. To use RAG effectively, we split long documents "
        "into overlapping chunks to support high-quality retrieval and grounded responses."
    )
    chunks = chunk_text(text, chunk_size=120, overlap=20)
    for i, c in enumerate(chunks, start=1):
        print(f"--- Chunk {i} ---\n{c}\n")
```
Putting It All Together in a RAG Pipeline
- Embed each chunk using your embedding model (e.g., OpenAI, Cohere, etc.).
- Store embeddings in a vector DB (e.g., Pinecone, Chroma, FAISS) with metadata (source, page, etc.).
- Retrieve top-k chunks per query via vector similarity.
- Augment the prompt with retrieved chunks before sending it to the LLM, as in the sketch below.
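Here is a minimal end-to-end sketch of those four steps, assuming sentence-transformers as the embedding model and a plain NumPy array in place of a real vector DB; `retrieve()` and the example query are illustrative, not a prescribed API:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

# 1) Embed each chunk (chunk_text and the example `text` come from above).
chunks = chunk_text(text, chunk_size=120, overlap=20)
vectors = model.encode(chunks)  # ndarray of shape (num_chunks, dim)
metadata = [{"source": "example", "chunk": i} for i in range(len(chunks))]

# 2) "Store": this in-memory matrix stands in for Pinecone/Chroma/FAISS.
# 3) Retrieve the top-k chunks for a query by cosine similarity.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = model.encode([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# 4) Augment the prompt with the retrieved chunks before calling the LLM.
question = "How does RAG use chunks?"
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```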
Next Steps
- Experiment with sentence-aware splitting (e.g., NLTK) or recursive splitters (e.g., LangChain's RecursiveCharacterTextSplitter); see the sketch after this list.
- Tune chunk sizes based on your model's context window and domain.
- Attach metadata (document ID, section, page) for traceability and evaluation.
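As a starting point for the first item, a minimal sentence-aware variant might look like this (it uses a naive regex splitter; NLTK's sent_tokenize or a recursive splitter would handle edge cases such as abbreviations better):

```python
import re
from typing import List

def chunk_by_sentences(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow.
        # (A single sentence longer than max_chars becomes its own chunk.)
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```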