
RAG Q&A over Documentation


The code for this tutorial is available here.

In this tutorial, we'll build a Q&A system for our documentation using RAG (Retrieval-Augmented Generation). Our AI assistant will answer user queries by retrieving relevant sections from our documentation and using them as context when calling an LLM.

At the end, we will have:

  • A playground for testing different embeddings, adjusting top_k values (number of context chunks to include), and experimenting with various prompts and models
  • LLM-as-a-judge and RAG context relevancy evaluations for our Q&A application
  • Observability with Agenta to debug and monitor our application
  • A deployment that we can either directly invoke or fetch the configuration to run elsewhere

You can try our playground by creating a free account at https://cloud.agenta.ai and opening the demo.

Our stack

  • Agenta for playground, evaluation, observability, and deployment.
  • LiteLLM for interacting with language models and embeddings.
  • Qdrant as our vector database for storing and querying document embeddings.

Ingestion pipeline

The first step is to process our documentation and store it in a vector database for retrieval. Let's start by looking at how we ingest our documentation into Qdrant.

ingest.py

import os
import uuid
from datetime import datetime
from typing import Dict, List

import frontmatter
import tqdm
from dotenv import load_dotenv  # assumes python-dotenv, since the script reads a .env file
from litellm import embedding
from qdrant_client import QdrantClient, models

# calculate_doc_hash, get_doc_url, and get_all_docs are helper functions
# defined in the complete script linked in the repository.

load_dotenv()

# Collection name; assumed to match the default used later in query.py
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "docs_collection")

OPENAI_EMBEDDING_DIM = 1536  # For text-embedding-ada-002
COHERE_EMBEDDING_DIM = 1024  # For embed-english-v3.0

qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY")
)


def chunk_text(text: str, max_chunk_size: int = 1500) -> List[str]:
    """
    Split text into chunks based on paragraphs and size.
    Tries to maintain context by keeping paragraphs together when possible.
    """
    # Split by double newlines to preserve paragraph structure
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current_chunk = []
    current_size = 0

    for paragraph in paragraphs:
        paragraph_size = len(paragraph)

        # If a single paragraph is too large, split it by sentences
        if paragraph_size > max_chunk_size:
            sentences = [s.strip() + "." for s in paragraph.split(".") if s.strip()]
            for sentence in sentences:
                if len(sentence) > max_chunk_size:
                    # If even a sentence is too long, split it into fixed-size chunks
                    for i in range(0, len(sentence), max_chunk_size):
                        chunks.append(sentence[i : i + max_chunk_size])
                elif current_size + len(sentence) > max_chunk_size:
                    # Start a new chunk
                    chunks.append(" ".join(current_chunk))
                    current_chunk = [sentence]
                    current_size = len(sentence)
                else:
                    current_chunk.append(sentence)
                    current_size += len(sentence)
        # If adding this paragraph would exceed the limit, start a new chunk
        elif current_size + paragraph_size > max_chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [paragraph]
            current_size = paragraph_size
        else:
            current_chunk.append(paragraph)
            current_size += paragraph_size

    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


def process_doc(file_path: str, docs_path: str, docs_base_url: str) -> List[Dict]:
    """Process a single document into chunks with metadata."""
    with open(file_path, "r", encoding="utf-8") as f:
        # Parse frontmatter and content
        post = frontmatter.load(f)
        content = post.content

    # Calculate document hash
    doc_hash = calculate_doc_hash(content)

    # Get document URL
    doc_url = get_doc_url(file_path, docs_path, docs_base_url)

    # Create base metadata
    metadata = {
        "title": post.get("title", ""),
        "url": doc_url,
        "file_path": file_path,
        "last_updated": datetime.utcnow().isoformat(),
        "doc_hash": doc_hash,
    }

    # Chunk the content
    chunks = chunk_text(content)

    return [
        {"content": chunk, "metadata": metadata, "doc_hash": doc_hash}
        for chunk in chunks
    ]


def get_embeddings(text: str) -> Dict[str, List[float]]:
    """Get embeddings using both OpenAI and Cohere models via LiteLLM."""
    # Get OpenAI embedding
    openai_response = embedding(model="text-embedding-ada-002", input=[text])
    openai_embedding = openai_response["data"][0]["embedding"]

    # Get Cohere embedding
    cohere_response = embedding(
        model="cohere/embed-english-v3.0",
        input=[text],
        input_type="search_document",  # Specific to Cohere v3 models
    )
    cohere_embedding = cohere_response["data"][0]["embedding"]

    return {"openai": openai_embedding, "cohere": cohere_embedding}


def setup_qdrant_collection():
    """Create or recreate the vector collection."""
    # Delete if exists
    try:
        qdrant_client.delete_collection(COLLECTION_NAME)
    except Exception:
        pass

    # Create collection with two named vector types
    qdrant_client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config={
            "openai": models.VectorParams(
                size=OPENAI_EMBEDDING_DIM, distance=models.Distance.COSINE
            ),
            "cohere": models.VectorParams(
                size=COHERE_EMBEDDING_DIM, distance=models.Distance.COSINE
            ),
        },
    )


def upsert_chunks(chunks: List[Dict]):
    """Upsert document chunks to the vector store."""
    for i, chunk in enumerate(chunks):
        # Get both embeddings using LiteLLM
        embeddings = get_embeddings(chunk["content"])

        # Create payload
        payload = {**chunk["metadata"], "content": chunk["content"], "chunk_index": i}

        # Give each chunk a unique, deterministic ID (Qdrant point IDs must be
        # integers or UUIDs). Deriving it from the doc hash and chunk index keeps
        # chunks of the same document from overwriting each other.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{chunk['doc_hash']}-{i}"))

        # Upsert to Qdrant
        qdrant_client.upsert(
            collection_name=COLLECTION_NAME,
            points=[
                models.PointStruct(
                    id=point_id,
                    payload=payload,
                    vector=embeddings,  # Contains both 'openai' and 'cohere' embeddings
                )
            ],
        )


def main():
    # Get environment variables
    docs_path = os.getenv("DOCS_PATH")
    docs_base_url = os.getenv("DOCS_BASE_URL")

    if not docs_path or not docs_base_url:
        raise ValueError("DOCS_PATH and DOCS_BASE_URL must be set in .env file")

    # Create a fresh collection
    setup_qdrant_collection()

    # Process all documents
    all_docs = get_all_docs(docs_path)
    for doc_path in tqdm.tqdm(all_docs):
        print(f"Processing {doc_path}")
        chunks = process_doc(doc_path, docs_path, docs_base_url)
        upsert_chunks(chunks)


if __name__ == "__main__":
    main()

This script performs the following steps:

  1. Loads documentation files: Reads all .mdx files from the documentation directory.
  2. Processes documents: Chunks the text and adds metadata (e.g., the URL where each page can be found).
  3. Generates embeddings: Creates embeddings for each chunk using both OpenAI and Cohere models. We use both because we want to compare them in the playground.
  4. Stores embeddings in Qdrant: Upserts the embeddings into a Qdrant collection for later retrieval. We use named vectors to save multiple embeddings for the same document.

To run the ingestion pipeline, you first need a running Qdrant instance, and then you need to set the following environment variables (a sample .env file is sketched after this list):

  • QDRANT_URL: The URL of your Qdrant instance.
  • QDRANT_API_KEY: The API key for your Qdrant instance.
  • DOCS_PATH: The folder containing the documentation (in our case it's under agenta/docs/docs).
  • DOCS_BASE_URL: The base URL where the documentation can be found (in our case it's https://docs.agenta.ai).
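
For illustration, a minimal .env file could look like the following sketch (all values are placeholders; the OpenAI and Cohere keys are included because LiteLLM reads them for the embedding calls):

QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-qdrant-api-key
DOCS_PATH=./agenta/docs/docs
DOCS_BASE_URL=https://docs.agenta.ai
OPENAI_API_KEY=sk-...
COHERE_API_KEY=your-cohere-api-key
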
info

The complete ingestion script, along with a setup README, is available on GitHub.

The RAG query workflow

Now that we have ingested the documentation into the Qdrant vector database, let's create the query logic for our assistant. Parts related to the Agenta integrations are highlighted.

query.py
import os
from typing import Annotated, Dict, List

import agenta as ag
import litellm
from litellm import completion, embedding, rerank
from pydantic import BaseModel, Field
from qdrant_client import QdrantClient

from agenta.sdk.assets import supported_llm_models

system_prompt = """
You are a helpful assistant that answers questions based on the documentation.
"""
user_prompt = """
Here is the query: {query}

Here is the context: {context}
"""

ag.init()

litellm.callbacks = [ag.callbacks.litellm_handler()]

# Initialize Qdrant client
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY")
)


# We define here the configuration that will be used by the playground
class Config(BaseModel):
    system_prompt: str = Field(default=system_prompt)
    user_prompt: str = Field(default=user_prompt)
    embedding_model: Annotated[str, ag.MultipleChoice(["openai", "cohere"])] = Field(
        default="openai"
    )
    llm_model: Annotated[str, ag.MultipleChoice(choices=supported_llm_models)] = Field(
        default="gpt-3.5-turbo"
    )
    top_k: int = Field(default=10, ge=1, le=25)
    rerank_top_k: int = Field(default=3, ge=1, le=10)
    use_rerank: bool = Field(default=True)


def get_embeddings(text: str, model: str) -> List[float]:
    """Get the query embedding from the selected model (OpenAI or Cohere) via LiteLLM."""
    if model == "openai":
        return embedding(model="text-embedding-ada-002", input=[text])["data"][0][
            "embedding"
        ]
    elif model == "cohere":
        return embedding(
            model="cohere/embed-english-v3.0",
            input=[text],
            input_type="search_query",  # Use search_query for queries
        )["data"][0]["embedding"]

    raise ValueError(f"Unknown model: {model}")


@ag.instrument()
def search_docs(
    query: str, collection_name: str = os.getenv("COLLECTION_NAME", "docs_collection")
) -> List[Dict]:
    """
    Search the documentation using embeddings.

    Args:
        query: The search query
        collection_name: Name of the Qdrant collection to search

    Returns:
        List of dictionaries containing matched documents and their metadata
    """
    # Get the configuration for this request
    config = ag.ConfigManager.get_from_route(Config)

    # Search using embeddings
    results = qdrant_client.query_points(
        collection_name=collection_name,
        query=get_embeddings(query, config.embedding_model),
        using=config.embedding_model,
        limit=config.top_k,
    )

    # Format results
    formatted_results = []
    for result in results.points:
        formatted_result = {
            "content": result.payload["content"],
            "metadata": {
                "title": result.payload["title"],
                "url": result.payload["url"],
                "score": result.score,
            },
        }
        formatted_results.append(formatted_result)

    return formatted_results


@ag.instrument()
def llm(query: str, results: List[Dict]):
    config = ag.ConfigManager.get_from_route(Config)
    context = []
    for i, result in enumerate(results, 1):
        score = result["metadata"].get("rerank_score", result["metadata"]["score"])
        item = f"Result {i} (Score: {score:.3f})\n"
        item += f"Title: {result['metadata']['title']}\n"
        item += f"URL: {result['metadata']['url']}\n"
        item += f"Content: {result['content']}\n"
        item += "-" * 80 + "\n"
        context.append(item)

    # We store the context in the trace so that it can be used for evaluation
    ag.tracing.store_internals({"context": context})

    response = completion(
        model=config.llm_model,
        messages=[
            {"role": "system", "content": config.system_prompt},
            {
                "role": "user",
                "content": config.user_prompt.format(
                    query=query, context="".join(context)
                ),
            },
        ],
    )
    return response.choices[0].message.content


@ag.instrument()
def rerank_results(query: str, results: List[Dict]) -> List[Dict]:
    """Rerank the search results using Cohere's reranker."""
    config = ag.ConfigManager.get_from_route(Config)

    # Format documents for reranking
    documents = [result["content"] for result in results]

    # Perform reranking
    reranked = rerank(
        model="cohere/rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=config.rerank_top_k,
    )

    # Reorder the original results based on reranking
    reranked_results = []
    for item in reranked.results:
        # Each rerank result carries 'index' and 'relevance_score' keys
        reranked_results.append(results[item["index"]])
        # Add rerank score to metadata
        reranked_results[-1]["metadata"]["rerank_score"] = item["relevance_score"]

    return reranked_results


# We define here the route that will be used by the playground
@ag.route("/", config_schema=Config)
@ag.instrument()
def generate(query: str):
    config = ag.ConfigManager.get_from_route(Config)
    results = search_docs(query)
    if config.use_rerank:
        reranked_results = rerank_results(query, results)
        return llm(query, reranked_results)
    else:
        return llm(query, results)

Our system uses a standard RAG workflow consisting of three main steps:

  1. Searching the documentation: Uses the query to retrieve relevant documents from Qdrant.
  2. Optionally reranking results: Improves the relevance of results using Cohere's reranker.
  3. Generating the answer: Constructs a prompt with the query and context, then calls the LLM to generate the final answer.

To integrate this script with Agenta, we need to make two main adjustments:

  1. Instrumentation: Use @ag.instrument() decorator to trace inputs, outputs, and internal variables.
  2. Integration with the Playground: Use ag.route() to define a route and later create a service that will be used to test the app in the playground.

We'll discuss these in more detail in the next sections.

Instrumentation

Tracing captures the inputs and outputs of all functions and LLM calls in our app. This helps us debug multi-step workflows (for example, determining whether an incorrect response stems from the LLM call or from incorrect context) and monitor usage over time.

@ag.instrument()
def generate(query: str):
    ...

Instrumenting code in Agenta is straightforward. The @ag.instrument() decorator lets you capture function inputs and outputs to create a trace tree.

Agenta also provides auto-instrumentation for most frameworks and libraries. Since we're using litellm, we'll use Agenta's callback function to automatically instrument its calls.

For RAG evaluation of our applications, we need to evaluate the relevancy of retrieved context for each query. Since context isn't part of any function's input or output, we'll add it manually to a span using ag.tracing.store_internals({"context": context}), which stores internal variables in the ongoing span.
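
As a minimal sketch of how these two pieces fit together (the retrieve function and its placeholder contents are illustrative; the Agenta calls are the ones used in query.py above):

import agenta as ag

@ag.instrument()
def retrieve(query: str):
    # Placeholder retrieval logic; in query.py this is the Qdrant search
    context = ["chunk one ...", "chunk two ..."]
    # Attach the retrieved context to the current span so evaluators can access it
    ag.tracing.store_internals({"context": context})
    return context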

Playground integration

Agenta provides a custom playground for testing application parameters. Here, we can experiment with different embeddings, top_k values, and LLM models.

Using the Agenta SDK, we'll define a configuration schema for our application and create an endpoint to enable playground communication. Then, we'll deploy the application to Agenta Cloud using the Agenta CLI for testing. Agenta handles all infrastructure work needed to create our application service.

Defining the configuration

Let's define the configuration schema for our application. This schema will determine what elements appear in the playground UI and what parameters we can experiment with.

Our configuration includes:

  • System prompt: The system prompt template
  • User prompt: The user prompt template
  • Embedding model: Choice between OpenAI and Cohere
  • LLM model: Selection from supported language models
  • Top_k value: Number of document chunks to retrieve from the vector database
  • Use rerank: Toggle for Cohere's reranking feature
  • Rerank top_k value: Number of chunks the reranker should return (used for both reordering and filtering)

from pydantic import BaseModel, Field
from typing import Annotated
import agenta as ag
from agenta.sdk.assets import supported_llm_models


class Config(BaseModel):
    system_prompt: str = Field(default=system_prompt)
    user_prompt: str = Field(default=user_prompt)
    embedding_model: Annotated[str, ag.MultipleChoice(["openai", "cohere"])] = Field(
        default="openai"
    )
    llm_model: Annotated[str, ag.MultipleChoice(choices=supported_llm_models)] = Field(
        default="gpt-3.5-turbo"
    )
    top_k: int = Field(default=10, ge=1, le=25)
    rerank_top_k: int = Field(default=3, ge=1, le=10)
    use_rerank: bool = Field(default=True)

We implement this with a standard Pydantic Config class that inherits from BaseModel. The fields use simple types (str, int, or bool). Agenta requires each field to have a default value. For multiple-choice fields, we use Annotated[str, ag.MultipleChoice(choices=["choice1", "choice2"])] to specify the available options.
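
For example, if we wanted the playground to offer only a curated subset of models rather than the full LiteLLM list, we could pass our own choices. A small sketch (the model names here are assumptions and must be models LiteLLM supports):

llm_model: Annotated[
    str,
    ag.MultipleChoice(choices=["gpt-3.5-turbo", "gpt-4o", "claude-3-5-sonnet-20240620"]),
] = Field(default="gpt-3.5-turbo")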

info

supported_llm_models is a helper variable provided by Agenta that contains the list of models available in LiteLLM.

Creating the endpoint and using the configuration

Next, we'll create an endpoint to enable communication between the playground and our application.

@ag.route("/", config_schema=Config)
def generate(query: str):
    config = ag.ConfigManager.get_from_route(Config)
    ...

The decorator @ag.route("/", config_schema=Config) registers the generate function as an endpoint and uses the Config class to define the configuration schema. This creates a POST /playground/run endpoint that accepts the configuration as a parameter and runs the workflow. The playground uses this endpoint to interact with the service.

To get the configuration from the request, we use ag.ConfigManager.get_from_route(Config), which returns a Config object containing the values provided by the playground.

We can use these configuration values throughout our workflow. For instance, we can use config.use_rerank in the generate function to control the reranking feature.

Note that ag.ConfigManager.get_from_route(Config) is accessible in any function called within the generate function's execution path, as the configuration is preserved in the context.
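
For instance, a helper called from generate can read the same configuration without it being passed explicitly. A small sketch (the build_prompt helper is illustrative; the Agenta calls are the ones shown above):

@ag.instrument()
def build_prompt(query: str, context: str) -> str:
    # Works because build_prompt runs inside generate's execution path,
    # where the route configuration is available in the context
    config = ag.ConfigManager.get_from_route(Config)
    return config.user_prompt.format(query=query, context=context)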

Deploying the application to Agenta

We're now ready to deploy our application to Agenta. First, add the requirements.txt file to the same folder as your project files and populate the .env file with your environment variables. Then run these commands:

agenta init

agenta variant serve query.py

The first command creates a new application in Agenta, while the second command serves the application and creates a playground for testing.

info

Under the hood, agenta variant serve builds a Docker image of your application and sets up a service for it in Agenta Cloud.

Once complete, you can access the playground and begin testing your application.

Evaluating the assistant

To ensure our assistant provides accurate and relevant answers, we'll use evaluators to assess its performance. We will create two evaluators:

  1. RAG Relevancy Evaluator: Measures how relevant the assistant's answers are with respect to the retrieved context.
  2. LLM-as-a-Judge Evaluator: Rates the quality of the assistant's responses.

For the first, we use the RAG Relevancy evaluator as described in Agenta's evaluation documentation.

Configuration:

  • Question key: trace.generate.inputs.query
  • Answer key: trace.generate.outputs
  • Contexts key: trace.generate.llm.internals.context

This evaluator measures how relevant the assistant's answers are with respect to the retrieved context. Note that we use trace.generate.llm.internals.context, which we previously stored in the span, to get the context from the trace.

You can use the evaluator playground to configure the evaluator and identify the correct trace data to use in your configuration (see image below).

We set up and test an LLM-as-a-Judge evaluator in the same way to rate the quality of the assistant's responses. More details on setting up LLM-as-a-Judge evaluators can be found here.

Deploying the assistant

After iterating through various prompts and parameters and evaluating their performance, we can deploy the configuration we're satisfied with as an API endpoint using Agenta.

Simply click the Deploy button in the playground to accomplish this.

Agenta provides us with two endpoints to interact with our deployed application:

  • The first allows us to directly invoke the deployed application with the production configuration.
  • The second allows us to fetch the deployed configuration as JSON and use it in our self-deployed application (a sketch of this is shown below).
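
As a rough sketch of the second option: once the configuration has been fetched as JSON (for example from the endpoint shown on the deployment page — the exact call is omitted here), it can be validated and reused through the same Config model defined in query.py. The values below are placeholders:

# `fetched` stands in for the JSON returned when fetching the deployed configuration
fetched = {
    "system_prompt": "You are a helpful assistant ...",
    "user_prompt": "Here is the query: {query}\n\nHere is the context: {context}",
    "embedding_model": "cohere",
    "llm_model": "gpt-4o",
    "top_k": 10,
    "rerank_top_k": 3,
    "use_rerank": True,
}

config = Config(**fetched)  # validate against the schema, then use it in your own app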

Conclusion

In this tutorial, we built a documentation Q&A system using RAG, but more importantly, we created a comprehensive LLMOps workflow that includes:

  • A playground for testing different embeddings, prompts, and retrieval parameters in real time
  • Observability tools for debugging multi-step RAG workflows and monitoring production performance
  • Evaluation pipelines for assessing both RAG relevancy and response quality
  • Deployment capabilities for smoothly transitioning from experimentation to production

This workflow shows how to evolve beyond a basic RAG implementation to build a production-ready system with robust testing, monitoring, and iteration capabilities.