Token Efficiency
Efficient RAG systems must balance the volume of retrieved context against the LLM’s finite context window. This guide details our chunking strategies and overlap algorithms.

Chunking Strategies
We employ two distinct chunking mechanisms depending on the source content type.

- Semantic Chunking
- Fixed-Size + Overlap
Semantic Chunking

Best for: Unstructured text, essays, and reports where maintaining narrative flow is critical.

Semantic chunking splits text based on meaning rather than character count. We use embedding similarity between sentences to determine break points.
- Algorithm: Calculate cosine similarity between adjacent sentences. If similarity drops below a threshold (e.g., 0.5), start a new chunk (see the sketch after this list).
- Pros: Keeps related concepts together.
- Cons: Produces variable chunk sizes, which makes downstream token budgeting less predictable.
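The break-point rule above can be sketched in a few lines. This is a minimal illustration rather than our production code: the embed callable (anything that maps a sentence to a numpy vector, e.g. a sentence-transformer) is an assumed input, and the 0.5 threshold mirrors the example above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.5) -> list[str]:
    # `embed` is an assumed caller-supplied function: sentence -> 1-D numpy vector.
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, vec) < threshold:
            # Similarity dropped below the threshold: close the current chunk.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Chunk boundaries emerge wherever adjacent sentences diverge in meaning, which is exactly why the resulting sizes vary.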
Token Budgeting
To prevent context_length_exceeded errors, we implement a strict token budget for the RAG input.
Hard Limit: The retrieved context must never exceed 60% of the model’s total context window.
Budget Allocation (Example: GPT-4o)
- Total Window: 128,000 tokens.
- System Prompt: 2,000 tokens (reserved).
- User Query: 1,000 tokens (max).
- Output Reservation: 4,000 tokens.
- Available for Context: ~70,000 tokens (the 60% cap gives 76,800 tokens; subtracting the 7,000 reserved above leaves roughly 69,800).
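A small helper makes this allocation concrete. The function name and default values below are illustrative, taken from the GPT-4o example above, not a library API:

```python
def context_budget(total_window: int,
                   system_prompt: int = 2_000,
                   user_query: int = 1_000,
                   output_reserve: int = 4_000,
                   hard_limit: float = 0.60) -> int:
    # Apply the 60% hard limit to the full window, then subtract the
    # reserved allocations to get the tokens left for retrieved context.
    cap = int(total_window * hard_limit)
    reserved = system_prompt + user_query + output_reserve
    return cap - reserved

print(context_budget(128_000))  # GPT-4o example: 69800 (~70,000 tokens)
```

Retrieval should truncate or drop the lowest-ranked chunks once this budget is exhausted.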