Build a PDF Knowledge Base
Most real-world RAG applications start with PDF files — product manuals, research papers, legal contracts, financial reports. This guide walks through ingesting a PDF, splitting it intelligently, storing vectors in a persistent Chroma database, and querying the result. What you'll build: A knowledge base from a PDF that survives process restarts and answers natural-language questions. Time: ~15 min. Difficulty: Beginner
Prerequisites
pip install synapsekit chromadb pypdf
export OPENAI_API_KEY="sk-..."
A sample PDF is used in the examples below. Replace the path with your own file.
What you'll learn
- How to load a PDF with
PDFLoader - How
RecursiveCharacterTextSplitterpreserves paragraph boundaries - How to persist vectors with
ChromaVectorStoreso they survive restarts - How to attach metadata (page number, source filename) to every chunk
- How to query with source attribution
Step 1: Load the PDF
from synapsekit.loaders import PDFLoader
# PDFLoader preserves page numbers as metadata so you can cite sources later.
# Async loading avoids blocking the event loop on large files.
loader = PDFLoader("company-handbook.pdf")
documents = await loader.aload()
print(f"Loaded {len(documents)} pages")
print(documents[0].metadata)
# {'source': 'company-handbook.pdf', 'page': 1}
PDFLoader returns a list of Document objects — one per page. Each Document has .page_content (the text) and .metadata (a dict you can query against later).
Step 2: Split pages into chunks
from synapsekit.splitters import RecursiveCharacterTextSplitter
# chunk_size=1000 fits comfortably in the context window while staying large
# enough to preserve complete sentences and paragraph structure.
# chunk_overlap=200 ensures a sentence cut at a boundary still appears in full
# in at least one chunk, preventing information loss.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])
RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then sentences, then words — preserving as much semantic coherence as possible. Raw character splitting would break sentences mid-word, degrading retrieval quality.
Step 3: Set up a persistent vector store
from synapsekit.vectorstores.chroma import ChromaVectorStore
from synapsekit.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# persist_directory keeps the vectors on disk so you only pay for embedding once.
# Subsequent restarts load from disk in milliseconds rather than re-embedding.
vectorstore = ChromaVectorStore(
collection_name="company-handbook",
embedding_function=embeddings,
persist_directory="./chroma_db",
)
Persisting to disk is the single most important change from the quickstart. Without persistence you re-embed every time the process restarts, which costs money and adds startup latency.
Step 4: Embed and ingest chunks
# Batched ingestion respects the OpenAI embedding API rate limit automatically.
await vectorstore.aadd_documents(chunks)
print("Ingestion complete.")
On first run this call sends your chunks to the OpenAI embeddings API and writes the resulting vectors to ./chroma_db. On subsequent runs you skip this step entirely and load directly from disk.
Step 5: Build the RAG pipeline
from synapsekit import RAGPipeline
from synapsekit.llms.openai import OpenAILLM
rag = RAGPipeline(
llm=OpenAILLM(model="gpt-4o-mini"),
embeddings=embeddings,
vectorstore=vectorstore,
)
The pipeline reuses the same vectorstore instance, so no data is copied. Swapping the LLM later (e.g., to AnthropicLLM) requires changing only one line here.
Step 6: Query with source attribution
# return_sources=True makes the pipeline return (answer, sources) instead of
# just the answer string, so you can show users where the answer came from.
answer, sources = await rag.aquery(
"What is the company's remote work policy?",
return_sources=True,
)
print("Answer:", answer)
print("\nSources:")
for doc in sources:
print(f" - {doc.metadata['source']}, page {doc.metadata['page']}")
Step 7: Re-use an existing Chroma database
# On subsequent runs, skip aadd_documents() and just point at the existing db.
# Chroma loads the index from disk without hitting the embeddings API.
existing_vectorstore = ChromaVectorStore(
collection_name="company-handbook",
embedding_function=embeddings,
persist_directory="./chroma_db",
)
rag = RAGPipeline(
llm=OpenAILLM(model="gpt-4o-mini"),
embeddings=embeddings,
vectorstore=existing_vectorstore,
)
answer = await rag.aquery("How many days of paid leave do employees get?")
print(answer)
Complete working example
import asyncio
from synapsekit import RAGPipeline
from synapsekit.loaders import PDFLoader
from synapsekit.splitters import RecursiveCharacterTextSplitter
from synapsekit.embeddings.openai import OpenAIEmbeddings
from synapsekit.llms.openai import OpenAILLM
from synapsekit.vectorstores.chroma import ChromaVectorStore
PERSIST_DIR = "./chroma_db"
PDF_PATH = "company-handbook.pdf"
async def ingest(vectorstore):
loader = PDFLoader(PDF_PATH)
documents = await loader.aload()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
await vectorstore.aadd_documents(chunks)
print(f"Ingested {len(chunks)} chunks from {len(documents)} pages.")
async def main():
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = ChromaVectorStore(
collection_name="company-handbook",
embedding_function=embeddings,
persist_directory=PERSIST_DIR,
)
await ingest(vectorstore)
rag = RAGPipeline(
llm=OpenAILLM(model="gpt-4o-mini"),
embeddings=embeddings,
vectorstore=vectorstore,
)
answer, sources = await rag.aquery(
"What is the company's remote work policy?",
return_sources=True,
)
print("Answer:", answer)
for doc in sources:
print(f" Source: {doc.metadata['source']}, page {doc.metadata['page']}")
asyncio.run(main())
Expected output
Ingested 142 chunks from 38 pages.
Answer: Employees may work remotely up to three days per week, subject to
manager approval. Full-remote arrangements require VP sign-off and a six-month
performance review on file.
Source: company-handbook.pdf, page 12
Source: company-handbook.pdf, page 13
How it works
PDFLoader uses pypdf under the hood to extract text page by page, attaching source and page keys to each Document's metadata. RecursiveCharacterTextSplitter walks a priority list of separators (\n\n, \n, , "") so it always tries the least-destructive split first. ChromaVectorStore wraps the Chroma client in SynapseKit's async interface and calls collection.persist() automatically after each aadd_documents(). At query time return_sources=True tells the pipeline to also return the raw Document objects that were injected into the prompt, giving you provenance without any extra work.
Variations
| Variation | Change required |
|---|---|
| Use a local embedding model | Replace OpenAIEmbeddings with OllamaEmbeddings |
| Use Pinecone instead of Chroma | Replace ChromaVectorStore with PineconeVectorStore |
| Larger chunks for dense technical text | Increase chunk_size to 1500–2000 |
| Smaller chunks for precise Q&A | Decrease chunk_size to 300–500 |
| Add custom metadata | Extend doc.metadata after loading, before splitting |
Troubleshooting
ModuleNotFoundError: No module named 'pypdf'
Run pip install pypdf. SynapseKit's PDF support uses pypdf as an optional dependency to keep the base install lean.
PdfReadError: EOF marker not found
The PDF is corrupted or password-protected. Try opening it in a PDF viewer first. For password-protected files pass password="..." to PDFLoader.
Chunks contain garbled text or missing spaces
PDF text extraction quality depends heavily on how the PDF was generated. Scanned PDFs need OCR preprocessing (e.g., with pytesseract) before loading.
InvalidDimensionException from Chroma
You changed the embedding model after the collection was created. Delete ./chroma_db and re-ingest so all vectors share the same dimensionality.
Next steps
- Multi-Format Document Ingestion — add DOCX, web pages, and CSVs alongside PDFs
- Choosing a Chunking Strategy — understand when to use a different splitter
- Metadata Filtering in Vector Search — scope queries to specific pages or sections