RAGBox Core

The Zero-Configuration, Self-Building Agentic RAG System.

Introduction

Welcome to the official documentation for RAGBox. RAGBox is a production-grade, auto-configuring asynchronous engine that abstracts away the complexity of modern Retrieval-Augmented Generation. It natively fuses Dense Vector Search, Knowledge Graphs, and Agentic Orchestration into a single, unified pipeline.

RAGBox is designed for engineers and data scientists who require high-performance, resilient document intelligence without the monolithic boilerplate of managing disparate stores, chunking algorithms, and LLM routers.

What It Does

RAGBox autonomously manages the entire lifecycle of document ingestion, semantic indexing, and hybrid retrieval. Specifically, the system performs:

  • Self-Healing Infrastructure: Watchdog daemons monitor a designated directory, parsing and indexing documents incrementally. SQLite-backed content-addressed storage (CAS) remembers what has already been indexed, so restarts never trigger redundant API spend.
  • Auto-Document Intelligence: Natively extracts content from unstructured formats including PDF, PPTX, Images, and Code. Applies OCR and layout analysis automatically.
  • Self-Optimizing Chunking: Rather than forcing hardcoded chunk boundaries, the engine samples the data and queries an LLM once per file type to choose an effective chunking strategy, then caches that strategy by extension so chunking never costs $O(N)$ LLM calls.
  • Agentic Routing: An orchestration layer intercepts user queries, expands the semantic intent, and routes it across six distinct sub-pipelines (Vector, Graph, Keyword, etc.) executing searches in parallel.
  • Reciprocal Rank Fusion (RRF): Results from the different vector topologies and graph queries are fused, aggressively pruned, and passed to a Cross-Encoder for accurate, deterministic reranking.
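
The RRF step above follows the standard formula: a document's fused score is the sum, over every result list it appears in, of 1/(k + rank). A minimal sketch (function and variable names are illustrative, not RAGBox's API):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several best-first ranked lists into one ranking.

    score(d) = sum over lists of 1 / (k + rank of d in that list);
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # dense retrieval order
keyword_hits = ["doc_a", "doc_d", "doc_b"]  # keyword retrieval order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents ranked highly by several retrievers (here doc_a) dominate the fused list even without comparable raw scores, which is why RRF works well across heterogeneous retrievers.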

Why RAGBox?

Building a robust RAG system in 2024–2025 is typically an exercise in managing brittle dependencies. Current frameworks require extensive manual intervention to bridge vector stores, knowledge graphs (Neo4j/NetworkX), and language models.

We need RAGBox because enterprise document intelligence cannot rely on "weekend project" orchestration. Without RAGBox, teams typically run into:

  • Systemic Amnesia: Re-indexing entire directories on every server restart, with LLM costs growing on each restart in proportion to corpus size.
  • Concurrency Bottlenecks: Synchronous cross-encoders locking up the event loop under load.
  • Schema Hallucinations: LLMs failing to properly output graph relationships, corrupting Neo4j databases.

RAGBox solves these through deep agentic workflows, strict JSON schema enforcement with self-healing retries, candidate pruning, and non-blocking asynchronous event handling.
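
The "strict JSON schema enforcement with self-healing retries" pattern can be pictured as a validate-and-reprompt loop. The sketch below uses a stubbed model; `extract_relations`, the triple schema, and the feedback wording are illustrative, not RAGBox's actual interface:

```python
import json

REQUIRED_KEYS = {"source", "relation", "target"}

def validate_triples(raw):
    """Parse model output and confirm every triple has the required keys."""
    triples = json.loads(raw)
    if not isinstance(triples, list):
        raise ValueError("expected a JSON list of triples")
    for t in triples:
        if not isinstance(t, dict) or not REQUIRED_KEYS <= t.keys():
            raise ValueError(f"malformed triple: {t!r}")
    return triples

def extract_relations(call_llm, max_retries=3):
    """Call the model; on invalid output, retry with the error as feedback."""
    feedback = ""
    for _ in range(max_retries):
        try:
            return validate_triples(call_llm(feedback))
        except ValueError as exc:       # JSONDecodeError subclasses ValueError
            feedback = f"Previous output invalid: {exc}. Return JSON only."
    raise RuntimeError("schema could not be satisfied after retries")

# Stub model: answers with prose once, then with valid JSON.
answers = iter(['Sure! Here are the triples...',
                '[{"source": "AcmeCo", "relation": "OWNS", "target": "AcmeSub"}]'])
triples = extract_relations(lambda feedback: next(answers))
```

Feeding the validation error back into the prompt is what makes the retry "self-healing": the second attempt is conditioned on exactly what went wrong, rather than being a blind re-roll.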

Use Cases

RAGBox is built for environments where data velocity is high and structural complexity is dense.

  • Financial Auditing: Ingesting dense SEC 10-K filings, maintaining exact tables through OCR, and mapping entity relations (e.g., subsidiaries, debt structures) via the Knowledge Graph.
  • Legal Discovery: Tracking hundreds of dynamic contract revisions incrementally without triggering full re-indexes, enabling exact cross-reference queries.
  • Technical Documentation: Bridging disparate codebases, markdown logs, and architecture diagrams into a single, highly accurate agentic interface that cites its sources.

Requirements

RAGBox is designed to run in a Python 3.11+ environment with native asynchronous support.

OS: Linux, macOS, or Windows (WSL recommended)
Python: >= 3.11
RAM: Minimum 8GB (16GB recommended for local LLM inference)

For advanced OCR and layout parsing, ensure system-level dependencies such as C++ build tools and Tesseract are installed for your OS.

Installation

RAGBox Core is available on PyPI. Install it using pip or poetry.

Using Pip

pip install -U ragbox-core

Using Poetry

poetry add ragbox-core

Setup & Quickstart

RAGBox utilizes an auto-detecting environment model. You do not need to construct complex configuration objects; simply supply API keys via environment variables.

1. Configure API Keys

Set one of the following environment variables to activate a cloud LLM provider. If none is detected, RAGBox attempts to fall back gracefully to a local quantized GGUF model.

export OPENAI_API_KEY="sk-..."
# OR
export ANTHROPIC_API_KEY="sk-ant-..."
# OR
export GROQ_API_KEY="gsk_..."
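
Conceptually, the auto-detection amounts to a priority scan over known key names. A sketch (the precedence order and the `detect_provider` helper are assumptions for illustration, not RAGBox's documented behavior):

```python
import os

PROVIDER_KEYS = [                 # checked in priority order
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("groq", "GROQ_API_KEY"),
]

def detect_provider(env=None):
    """Return the first provider whose API key is set, else 'local'."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_KEYS:
        if env.get(var):
            return provider
    return "local"                # fall back to a local GGUF model
```
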

2. Initialize the Engine

A functional RAG pipeline can be started in a few lines of code. This automatically starts the background watchdog daemons and begins processing documents dropped into the target directory.

from ragbox import RAGBox
import asyncio

async def main():
    # Initializes engine, sqlite CAS layer, and background indexing
    rag = RAGBox(watch_dir="./company-docs")
    
    # Send an agentic query through the hybrid router
    response = await rag.query("Synthesize the revenue metrics from Q3 across our entities.")
    
    print(response.answer)
    for src in response.sources:
        print(f"Source: {src.document_id} (Score: {src.score})")

if __name__ == "__main__":
    asyncio.run(main())

Core Architecture

RAGBox operates on a multi-layered, fault-tolerant infrastructure designed to decouple ingestion from retrieval.

1. Ingestion Layer

Production-grade watchdog implementation with debouncing, de-duplication caching, and batch processing, backed by an SQLite database that tracks file states.
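
The content-addressed de-duplication can be sketched with SQLite and SHA-256: a file whose bytes hash to a known digest is skipped entirely. Table and helper names here are illustrative:

```python
import hashlib
import sqlite3

def make_cas(path=":memory:"):
    """Create the state table that remembers already-indexed content."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS indexed (hash TEXT PRIMARY KEY)")
    return db

def needs_indexing(db, content: bytes) -> bool:
    """True only the first time a given byte content is seen."""
    digest = hashlib.sha256(content).hexdigest()
    if db.execute("SELECT 1 FROM indexed WHERE hash = ?", (digest,)).fetchone():
        return False              # unchanged content: skip the expensive pipeline
    db.execute("INSERT INTO indexed (hash) VALUES (?)", (digest,))
    return True

db = make_cas()
first = needs_indexing(db, b"Q3 report")
second = needs_indexing(db, b"Q3 report")   # same bytes, already indexed
```

Because the key is a hash of the content rather than the filename or mtime, renames and touch-only changes cost nothing, which is the property that prevents re-index storms on restart.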

2. Routing & Chunking

Self-optimizing routines determine the best split boundaries per document type, avoiding $O(N)$ per-file LLM calls while maintaining semantic coherence.
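
The "once per file type" economy can be pictured as a memoized strategy lookup keyed by extension. The `probe_llm` stub and the strategy fields below are assumptions for illustration:

```python
import os

_strategy_cache = {}   # extension -> chunking parameters

def probe_llm(extension, sample_text):
    """Stand-in for the one-time LLM call that picks chunk parameters."""
    if extension == ".md":
        return {"chunk_size": 512, "overlap": 64}
    return {"chunk_size": 1024, "overlap": 128}

def strategy_for(path, sample_text):
    """Return chunking params, consulting the LLM once per extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _strategy_cache:
        _strategy_cache[ext] = probe_llm(ext, sample_text)
    return _strategy_cache[ext]

md_params = strategy_for("notes.md", "sample text")
md_again = strategy_for("other.md", "different sample")   # cache hit, no LLM call
py_params = strategy_for("main.py", "sample code")
```

For a corpus of N files spread over a handful of extensions, the cache turns N potential LLM probes into one per extension.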

3. Graph Extraction

Uses strict structured LLM generation with automatic retry circuits to build Leiden-style graph communities over a NetworkX graph, while mitigating prompt-injection risks.
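
As a toy illustration of grouping extracted triples into communities, the sketch below uses connected components as a crude stand-in for Leiden partitioning (which in practice needs a dedicated library); the entity names are made up:

```python
from collections import defaultdict

def communities(edges):
    """Group nodes into connected components (simplified stand-in for Leiden)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:                     # iterative DFS over one component
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(adj[n] - group)
        seen |= group
        groups.append(group)
    return groups

edges = [("AcmeCo", "AcmeSub"), ("AcmeSub", "BondA"), ("Globex", "NoteB")]
parts = communities(edges)
```

Real Leiden partitioning splits dense components into multiple communities by optimizing modularity; connected components only capture the coarsest grouping, but the shape of the pipeline (triples in, node clusters out) is the same.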

4. Hybrid Retrieval

Executes dense vector search alongside graph queries. Fuses results via Reciprocal Rank Fusion, prunes aggressively, and passes the survivors to a Cross-Encoder for final deterministic scoring.
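
The final stage — prune the fused candidates, then rescore only the survivors with the expensive cross-encoder — can be sketched as follows. The scoring function here is a stub; a real cross-encoder scores each (query, passage) pair jointly:

```python
def prune_and_rerank(candidates, score_fn, keep=3):
    """Keep the top `keep` fused candidates, then rescore each with the
    (expensive) cross-encoder and sort by the new scores."""
    pruned = candidates[:keep]            # fused list is already best-first
    scored = [(doc, score_fn(doc)) for doc in pruned]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stub cross-encoder scores, keyed by document id (illustrative values).
fake_scores = {"doc_a": 0.41, "doc_b": 0.93, "doc_c": 0.17, "doc_d": 0.66}
final = prune_and_rerank(["doc_a", "doc_b", "doc_c", "doc_d"],
                         fake_scores.get, keep=3)
```

Pruning before reranking is the standard latency trade-off: the cross-encoder is far more accurate than the fused score but too slow to run over every candidate, so it only sees the short list.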