Chunking

#ai

Chunking is the strategy for splitting a document into pieces before embedding and indexing it for RAG. Chunk size and overlap directly affect retrieval quality: chunks too small lose context, chunks too large blur relevance.

Basic strategies:

  • By character count (simple, fast, ignores meaning).
  • By sentences or paragraphs (respects punctuation but not semantics).
  • Smart / semantic — use structure (headings, sections) or meaning to keep related content together.
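The first strategy above can be sketched in a few lines. This is a minimal illustration, not production code: the function name and defaults are mine, and a sliding window with overlap is just one way to do fixed-size splitting.

```python
def chunk_by_chars(text, size=200, overlap=50):
    """Split text into fixed-size character chunks with overlapping windows.

    The overlap means each chunk repeats the tail of the previous one,
    which softens the context loss at chunk boundaries.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

# chunk_by_chars("abcdefghij", size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij", "ij"]
```

Note the trade-off the overlap parameter encodes: more overlap means more redundancy in the index but fewer ideas severed mid-thought.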

Smart approaches in practice:

  • LangChain's RecursiveCharacterTextSplitter — tries a cascade of separators (by default "\n\n", then "\n", then spaces, then individual characters), recursing to finer separators until chunks fit the size limit.
  • Unstructured.io — document-aware partitioning that understands PDFs, tables, titles.
  • Embedding-based semantic chunking — embed each sentence, merge adjacent sentences while cosine similarity stays above a threshold.
  • LLM-suggested chunking — ask a model to propose the best strategy for the document type.
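The embedding-based approach can be sketched as follows. To keep the example self-contained, a toy bag-of-words embedding stands in for a real sentence-embedding model; the function names and the 0.3 threshold are illustrative choices, not anything prescribed by a library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_embed(sentence, vocab):
    # Bag-of-words vector over a fixed vocabulary: a stand-in for a real
    # embedding model, just so the example runs without dependencies.
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def semantic_chunks(sentences, embed, threshold=0.3):
    """Merge adjacent sentences while similarity to the previous one
    stays above the threshold; start a new chunk when it drops."""
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) >= threshold:
            chunks[-1].append(sent)
        else:
            chunks.append([sent])
        prev = vec
    return [" ".join(c) for c in chunks]

vocab = ["the", "cat", "sat", "slept", "stock", "prices", "rose"]
embed = lambda s: toy_embed(s, vocab)
semantic_chunks(["the cat sat", "the cat slept", "stock prices rose"], embed)
# → ["the cat sat the cat slept", "stock prices rose"]
```

Swapping toy_embed for a real model (e.g. a sentence-transformers encoder) turns this sketch into the actual technique: topically related sentences cluster into one chunk, and topic shifts become chunk boundaries.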