
Text Splitters

Text splitters break documents into chunks for embedding and retrieval. SynapseKit provides five splitters — all extend BaseSplitter and share the same split(text) → list[str] interface.

BaseSplitter

All splitters inherit from BaseSplitter:

from synapsekit import BaseSplitter

class BaseSplitter(ABC):
    def split(self, text: str) -> list[str]: ...

You can implement your own splitter by subclassing BaseSplitter and implementing split().

CharacterTextSplitter

Splits on a single separator string. Simple and fast.

from synapsekit import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=512,
    chunk_overlap=50,
)

chunks = splitter.split("Paragraph one.\n\nParagraph two.\n\nParagraph three.")
| Parameter | Default | Description |
| --- | --- | --- |
| separator | "\n\n" | The string to split on |
| chunk_size | 512 | Maximum characters per chunk |
| chunk_overlap | 50 | Characters of overlap between consecutive chunks |
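For intuition, the split-with-overlap behavior can be sketched in a few lines of plain Python. This is a simplified illustration, not SynapseKit's actual implementation; edge cases such as parts longer than chunk_size are not handled:

```python
def split_on_separator(text, separator="\n\n", chunk_size=512, chunk_overlap=50):
    """Greedily pack separator-delimited parts into chunks of at most
    chunk_size characters, carrying chunk_overlap characters forward."""
    parts = text.split(separator)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # seed the next chunk with the tail of the previous one
            current = (current[-chunk_overlap:] + separator + part) if current else part
    if current:
        chunks.append(current)
    return chunks
```

With chunk_size=8 and chunk_overlap=2, the text "aaa\n\nbbb\n\nccc" yields two chunks, the second starting with the last two characters of the first.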

RecursiveCharacterTextSplitter

Tries splitting by paragraphs → sentences → words → hard split until chunks fit. This is the default splitter used by RAGPipeline.

from synapsekit import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
)

chunks = splitter.split(long_document)
| Parameter | Default | Description |
| --- | --- | --- |
| chunk_size | 512 | Maximum characters per chunk |
| chunk_overlap | 50 | Characters of overlap between consecutive chunks |
| separators | ["\n\n", "\n", ". ", " "] | Tried in order; the first one that produces multiple parts is used |
Backward compatibility

TextSplitter from synapsekit.rag.pipeline is now an alias for RecursiveCharacterTextSplitter. Existing code works without changes.
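The recursive strategy can be sketched as follows. This is an illustrative reimplementation, not SynapseKit's code; it discards separators and omits overlap and merging for brevity:

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ")):
    """Try each separator in order; the first that yields multiple parts wins.
    Parts that still exceed chunk_size are split again recursively, and text
    with no usable separator is hard-split at chunk_size boundaries."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, chunk_size, separators))
            return chunks
    # hard split: no separator produced multiple parts
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

A paragraph that fits within chunk_size is kept whole; an oversized paragraph falls through to newline, sentence, and finally word-level splits.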

TokenAwareSplitter

Splits text so each chunk fits within a token budget. Uses a heuristic of ~4 characters per token and delegates to RecursiveCharacterTextSplitter.

from synapsekit import TokenAwareSplitter

splitter = TokenAwareSplitter(
    max_tokens=256,
    chunk_overlap=50,
)

chunks = splitter.split(long_document)
# Each chunk ≤ 256 × 4 = 1024 characters
| Parameter | Default | Description |
| --- | --- | --- |
| max_tokens | 256 | Maximum tokens per chunk |
| chunk_overlap | 50 | Characters of overlap between chunks |
| chars_per_token | 4 | Character-to-token ratio (override for non-English text) |
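The budget conversion is a straightforward multiplication. A sketch of the heuristic (illustrative only, not SynapseKit's internals):

```python
import math

def token_budget_to_chars(max_tokens, chars_per_token=4):
    """Character budget implied by a token budget (~4 chars/token for English)."""
    return max_tokens * chars_per_token

def estimated_tokens(text, chars_per_token=4):
    """Rough token count for a chunk under the same heuristic."""
    return math.ceil(len(text) / chars_per_token)
```

For non-English or code-heavy text, lowering chars_per_token makes the estimate more conservative.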

SemanticSplitter

Splits at semantic boundaries using sentence embeddings. Sentences whose cosine similarity to the next sentence drops below a threshold are treated as split points.

pip install synapsekit[semantic]

from synapsekit import SemanticSplitter

splitter = SemanticSplitter(
    model="all-MiniLM-L6-v2",
    threshold=0.5,
    min_chunk_size=50,
)

chunks = splitter.split(document)
| Parameter | Default | Description |
| --- | --- | --- |
| model | "all-MiniLM-L6-v2" | Sentence-transformers model for embeddings |
| threshold | 0.5 | Cosine similarity threshold; lower means more splits |
| min_chunk_size | 50 | Minimum characters before allowing a split |
Warning

SemanticSplitter requires sentence-transformers. Install with pip install synapsekit[semantic].
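The boundary-detection idea can be sketched without sentence-transformers by plugging in any sentence-to-vector function. The embed argument below is a stand-in for the real model, and the sketch ignores min_chunk_size:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def semantic_split(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk whenever similarity
    between a sentence and the next drops below the threshold."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, next_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, next_vec) < threshold:  # similarity drop -> boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

With a real model, embed would be something like a sentence-transformers encode call; here any callable returning vectors works.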

MarkdownTextSplitter

Splits markdown text respecting document structure. Headers define natural split points, and each chunk carries its parent header context for semantic completeness.

from synapsekit import MarkdownTextSplitter

splitter = MarkdownTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

chunks = splitter.split("""# User Guide
## Installation
Run pip install synapsekit to get started.

## Quick Start
Import RAG and create a pipeline.

### Configuration
Set your API key in the config.
""")
# Each chunk includes parent headers:
# "# User Guide\n## Installation\nRun pip install..."
# "# User Guide\n## Quick Start\nImport RAG and..."
# "# User Guide\n## Quick Start\n### Configuration\nSet your..."
| Parameter | Default | Description |
| --- | --- | --- |
| chunk_size | 512 | Maximum characters per chunk |
| chunk_overlap | 50 | Characters of overlap between consecutive chunks |
| headers_to_split_on | [("#", "Header1"), ("##", "Header2"), ("###", "Header3"), ("####", "Header4")] | Header markers and labels to split on |

Oversized sections without headers fall back to RecursiveCharacterTextSplitter, which uses "---", "\n\n", "\n", and ". " as separators.
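The header-context behavior can be sketched as follows. This is a simplified illustration that tracks a stack of active headers; SynapseKit's MarkdownTextSplitter additionally enforces chunk_size via the fallback described above:

```python
def split_markdown(text):
    """Split markdown at headers, prepending each chunk's parent headers
    so every chunk is semantically self-contained."""
    stack = {}            # header level -> header line, e.g. {1: "# User Guide"}
    chunks, body = [], []

    def flush():
        if body:
            context = [stack[level] for level in sorted(stack)]
            chunks.append("\n".join(context + body))
            body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            stack[level] = stripped
            # a new header invalidates any deeper headers on the stack
            for deeper in [l for l in stack if l > level]:
                del stack[deeper]
        elif stripped:
            body.append(stripped)
    flush()
    return chunks
```

Running this on the User Guide example above reproduces the three header-prefixed chunks shown in the comments.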

Using splitters with RAGPipeline

By default, RAGPipeline uses RecursiveCharacterTextSplitter with the chunk_size and chunk_overlap from RAGConfig. You can override this by passing any BaseSplitter to RAGConfig.splitter:

from synapsekit import RAGConfig, RAGPipeline, TokenAwareSplitter

config = RAGConfig(
    llm=llm,
    retriever=retriever,
    memory=memory,
    splitter=TokenAwareSplitter(max_tokens=200),
)

pipeline = RAGPipeline(config)
await pipeline.add("Your document text here...")

When splitter is set, it overrides chunk_size and chunk_overlap.

Writing a custom splitter

from synapsekit import BaseSplitter

class SentenceSplitter(BaseSplitter):
    def split(self, text: str) -> list[str]:
        text = text.strip()
        if not text:
            return []
        # Split on sentence endings, normalizing the trailing period
        return [s.strip().rstrip(".") + "." for s in text.split(". ") if s.strip()]

splitter = SentenceSplitter()
chunks = splitter.split("First sentence. Second sentence. Third sentence.")
# ["First sentence.", "Second sentence.", "Third sentence."]