Chunking
Chunking is the strategy for splitting a document into pieces before embedding and indexing it for RAG. Chunk size and overlap directly affect retrieval quality: chunks that are too small lose context, while chunks that are too large blur relevance.
Basic strategies:
- By character count (simple, fast, ignores meaning).
- By sentences or paragraphs (respects punctuation but not semantics).
- Smart / semantic — use structure (headings, sections) or meaning to keep related content together.
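The simplest of these, character-count chunking with overlap, can be sketched in a few lines. This is a minimal illustration (the function name and the size/overlap defaults are chosen here for the example, not taken from any library):

```python
def chunk_by_chars(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows with a sliding overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk, at the cost of some index redundancy.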
Smart approaches in practice:
- LangChain's RecursiveCharacterTextSplitter — splits on a cascade of separators (`\n\n`, `\n`, `.`, …) until chunks fit the size limit.
- Unstructured.io — document-aware partitioning that understands PDFs, tables, and titles.
- Embedding-based semantic chunking — embed each sentence, merge adjacent sentences while cosine similarity stays above a threshold.
- LLM-suggested chunking — ask a model to propose the best strategy for the document type.
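The embedding-based approach above can be sketched as follows. Here `embed` is a placeholder for any sentence-embedding model (e.g. a sentence-transformers call), and the 0.7 threshold is illustrative, not a recommended value:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Merge each sentence into the current chunk while its similarity
    to the previous sentence stays above the threshold; otherwise start
    a new chunk."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) >= threshold:
            chunks[-1].append(sent)  # still on the same topic: extend chunk
        else:
            chunks.append([sent])    # similarity dropped: topic boundary
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

Comparing each sentence only to its immediate predecessor keeps the pass linear in document length; a variant compares against the running mean of the current chunk's embeddings, which is more stable on long chunks.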