Chunking
Chunking is the strategy for splitting a document into pieces before embedding and indexing it for RAG. Chunk size and overlap directly affect retrieval quality: chunks that are too small lose context, while chunks that are too large blur relevance.
Basic strategies:
- By character count (simple, fast, ignores meaning).
- By sentences or paragraphs (respects punctuation but not semantics).
- Smart / semantic — use structure (headings, sections) or meaning to keep related content together.
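The simplest of these, character-count chunking with overlap, can be sketched in a few lines. This is a minimal illustration (the function name and the size/overlap defaults are chosen here for the example, not taken from any library):

```python
def chunk_by_chars(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows with a sliding overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk, at the cost of some index redundancy.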
Smart approaches in practice:
- LangChain's RecursiveCharacterTextSplitter — splits on a cascade of separators (`\n\n`, `\n`, `.`, …) until chunks fit the size limit.
- Unstructured.io — document-aware partitioning that understands PDFs, tables, and titles.
- Embedding-based semantic chunking — embed each sentence, merge adjacent sentences while cosine similarity stays above a threshold.
- LLM-suggested chunking — ask a model to propose the best strategy for the document type.
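The embedding-based approach above can be sketched as follows. Here `embed` is a placeholder for any sentence-embedding model (e.g. a sentence-transformers call), and the 0.7 threshold is illustrative, not a recommended value:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Merge each sentence into the current chunk while its similarity
    to the previous sentence stays above the threshold; otherwise start
    a new chunk."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) >= threshold:
            chunks[-1].append(sent)  # still on the same topic: extend chunk
        else:
            chunks.append([sent])    # similarity dropped: topic boundary
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

Comparing each sentence only to its immediate predecessor keeps the pass linear in document length; a variant compares against the running mean of the current chunk's embeddings, which is more stable on long chunks.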