The Document Service Mesh
for Serious Data Engineering
Stop building fragile Python script chains. Pipestream is a high-performance orchestration engine that combines the real-time latency of gRPC with the durability of Kafka. Type-safe, polyglot, and schema-first.
Kafka = "Save Game"
We treat Kafka as a checkpoint system. Pause, rewind, and replay your pipeline from any node. Perfect for DLQ handling or re-running expensive LLM steps without hitting the API again.
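A minimal sketch of what "rewinding to a save point" might look like from the outside, using the `confluent-kafka` client. The topic, group, and offset are illustrative, and `reprocess` stands in for handing the document back to a module; none of this is the actual Pipestream API.

```python
# Hypothetical sketch: replaying a pipeline node from its Kafka "save point".
from confluent_kafka import Consumer, TopicPartition

def reprocess(payload: bytes) -> None:
    """Stand-in for handing the PipeDoc back to the summarizer module."""
    ...

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-summarizer",   # fresh group so offsets start clean
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,       # we decide what counts as "done"
})

# Rewind the node's input topic to the offset where the expensive LLM step failed.
consumer.assign([TopicPartition("pipeline.summarize.in", 0, 1250)])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break
    if msg.error():
        continue
    reprocess(msg.value())   # re-run the step without re-ingesting anything
    consumer.commit(msg)     # checkpoint: this document is now "saved"
```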
Visual DAG Builder
Design pipelines with a drag-and-drop graph editor. Connect nodes via Kafka (async) or gRPC (sync). The engine validates the graph for cycles and supports multiple pipelines per client.
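A sketch of the kind of graph the visual editor could produce and the cycle check the engine would run on it. The node names, edge shape, and transport labels are illustrative, not the real Pipestream format.

```python
# Illustrative pipeline graph: nodes connected by async (Kafka) or sync (gRPC) edges.
pipeline = {
    "nodes": ["intake", "parse", "summarize", "index"],
    "edges": [
        {"from": "intake",    "to": "parse",     "transport": "grpc"},
        {"from": "parse",     "to": "summarize", "transport": "kafka"},
        {"from": "summarize", "to": "index",     "transport": "grpc"},
    ],
}

def has_cycle(edges: list[dict]) -> bool:
    """Depth-first search; the engine rejects any pipeline where this is True."""
    graph: dict[str, list[str]] = {}
    for e in edges:
        graph.setdefault(e["from"], []).append(e["to"])

    visiting, done = set(), set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True          # back-edge found: the graph has a cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in graph)

assert not has_cycle(pipeline["edges"])
```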
BYOK Encryption
Security is baked in. Data stored in the S3 "Claim Check" layer can be encrypted with your own keys (BYOK). We handle the secure envelope; you hold the keys.
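A hedged sketch of the envelope pattern described here, using the `cryptography` library. The per-document data key and the `wrap_with_customer_key` callback are illustrative; the actual key-wrapping service is whatever KMS the customer brings.

```python
# Hypothetical envelope-encryption sketch for the S3 "Claim Check" layer.
# Pipestream stores only the wrapped key and the ciphertext; the customer holds the BYOK key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_for_claim_check(document: bytes, wrap_with_customer_key) -> dict:
    data_key = AESGCM.generate_key(bit_length=256)            # fresh key per document
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, document, None)
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        # Only the customer's key material can unwrap this data key.
        "wrapped_key": wrap_with_customer_key(data_key),
    }
```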
Dual-Mode Transport
Why choose? Route high-priority UI requests via gRPC (milliseconds) and bulk ingestion via Kafka (terabytes). The engine handles the switching dynamically.
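The switching rule implied here can be pictured as a small heuristic; the threshold and labels below are illustrative, and the real engine makes this decision dynamically.

```python
# Illustrative routing rule only, not the engine's actual policy.
GRPC_MAX_PAYLOAD = 4 * 1024 * 1024   # keep synchronous hops small and fast

def pick_transport(priority: str, payload_size: int) -> str:
    if priority == "interactive" and payload_size <= GRPC_MAX_PAYLOAD:
        return "grpc"    # millisecond round trip, no broker hop
    return "kafka"       # durable, replayable, built for bulk ingestion
```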
Polyglot Service Mesh
Write LLM logic in Python. Write parsers in Rust. Connect them seamlessly in a Consul-backed service mesh. Modules are "dumb" endpoints; the Engine is the smart router.
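A sketch of how a module might announce itself to the Consul-backed mesh from Python, using the `python-consul` client. The service name, address, port, and health check are illustrative.

```python
# Hypothetical registration sketch; values are illustrative.
import consul

c = consul.Consul(host="localhost", port=8500)
c.agent.service.register(
    name="summarizer",                 # the Engine discovers modules by name
    service_id="summarizer-1",
    address="10.0.0.12",
    port=50051,                        # the module's gRPC port
    check=consul.Check.tcp("10.0.0.12", 50051, interval="10s"),
)
# The module stays "dumb": it only serves gRPC; routing lives in the Engine.
```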
Schema-First Governance
Prevent data corruption with strict Apicurio & Protobuf contracts. The Engine controls mapping, so a rogue script can never break your Search Index.
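A sketch of what enforcing that contract at the boundary could look like. `pipedoc_pb2` stands in for code generated from the registered Protobuf schema; it is not the actual Pipestream module.

```python
# Hypothetical contract check at the Engine boundary.
from google.protobuf.message import DecodeError
from pipedoc_pb2 import PipeDoc   # illustrative generated class, via Apicurio + protoc

def parse_or_reject(payload: bytes) -> PipeDoc:
    try:
        return PipeDoc.FromString(payload)   # strict binary contract
    except DecodeError as exc:
        # A rogue script producing off-schema bytes never reaches the index.
        raise ValueError("payload violates the PipeDoc contract") from exc
```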
Pipestream vs. The World
Existing tools are either too simple (libraries) or too heavy (ETL tools). Pipestream fills the gap.
| Feature | Pipestream | LangChain | Apache NiFi | Data Prepper |
|---|---|---|---|---|
| Primary Goal | Document Business Logic | Prototyping / RAG | Moving Raw Bits (ETL) | Log Aggregation |
| Architecture | Service Mesh (Hub & Spoke) | Library (Client-side) | Visual DAG Graph | Linear Pipeline |
| Transport | Hybrid (gRPC + Kafka) | In-Memory / HTTP | Disk-based | Memory Buffer |
| State Mgmt | Claim-Check (S3) | Manual | Heavy FlowFiles | Transient |
| Schema | Strict (Protobuf) | Loose (JSON) | Schemaless | Loose (JSON) |
The Engineering Flow
A Digital Asset Manager (DAM) for text, paired with a language-agnostic stream processor.
1. Ingest & Claim Check
The Intake Service accepts documents via gRPC. Large binaries (PDFs, videos) are immediately offloaded to encrypted S3 storage. Only the lightweight PipeDoc metadata enters the stream.
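A sketch of the claim-check handoff using `boto3`. The bucket, key layout, and metadata fields are made up for illustration.

```python
# Illustrative claim-check step: the heavy binary goes to S3,
# only a light reference travels through the pipeline.
import uuid
import boto3

s3 = boto3.client("s3")

def claim_check(pdf_bytes: bytes, title: str) -> dict:
    key = f"claims/{uuid.uuid4()}.pdf"
    s3.put_object(
        Bucket="pipestream-claims",
        Key=key,
        Body=pdf_bytes,
        ServerSideEncryption="aws:kms",   # encrypted at rest
    )
    # Only this lightweight metadata enters the Kafka/gRPC stream.
    return {"title": title, "claim_check": f"s3://pipestream-claims/{key}"}
```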
2. The Engine Router
The Engine acts as the "DNS" for your data. It checks the DAG, detects cycles, and routes to the next Node via gRPC (fast) or Kafka (durable save-point).
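A sketch of the next-hop lookup, reusing the illustrative `pipeline` shape from the DAG sketch above. `send_grpc` and `publish_kafka` are stand-ins, not Pipestream calls.

```python
# Hypothetical routing step inside the Engine: find the outgoing edge for the
# node that just finished and dispatch on its configured transport.
def send_grpc(node: str, payload: bytes) -> None: ...      # stand-in for a gRPC call
def publish_kafka(node: str, payload: bytes) -> None: ...  # stand-in for a Kafka produce

def route_next(pipeline: dict, current: str, pipedoc_bytes: bytes) -> None:
    for edge in pipeline["edges"]:
        if edge["from"] != current:
            continue
        if edge["transport"] == "grpc":
            send_grpc(edge["to"], pipedoc_bytes)      # fast, synchronous hop
        else:
            publish_kafka(edge["to"], pipedoc_bytes)  # durable save-point
```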
3. Polyglot Processing
Your Python Summarizer Module receives a typed Request. It processes the document and returns a typed Response. It knows nothing about the mesh, retries, or storage.
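A sketch of such a "dumb" module: a plain gRPC servicer. The `summarizer_pb2*` stubs come from an illustrative `summarizer.proto`, not from Pipestream itself, and the truncation stands in for a real LLM call.

```python
# Hypothetical module skeleton: request in, response out, nothing else.
from concurrent import futures
import grpc
import summarizer_pb2, summarizer_pb2_grpc   # illustrative generated stubs

class Summarizer(summarizer_pb2_grpc.SummarizerServicer):
    def Summarize(self, request, context):
        summary = request.text[:200]   # stand-in for the real LLM call
        # No mesh, retry, or storage logic here.
        return summarizer_pb2.SummarizeResponse(summary=summary)

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
summarizer_pb2_grpc.add_SummarizerServicer_to_server(Summarizer(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```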
4. Strict Indexing
The Engine maps the result to the OpenSearch schema. Decryption happens only at the last mile for indexing, ensuring security in transit.
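A sketch of what a strict OpenSearch mapping for that last mile might look like, via `opensearch-py`. The index name and fields are illustrative.

```python
# Illustrative strict mapping: fields not declared here are rejected,
# so a rogue module cannot silently mutate the search index.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="pipedocs",
    body={"mappings": {
        "dynamic": "strict",            # unknown fields fail the write
        "properties": {
            "title":       {"type": "text"},
            "summary":     {"type": "text"},
            "claim_check": {"type": "keyword"},
        },
    }},
)
```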
The "Data Commons" Vision
Future versions will allow meshes to cluster together.
🌍 Subscribe to the World
We are building standardized, pre-indexed subscriptions for Wikipedia, Gutenberg, Census, and IRS data. Connect your mesh to the public Pipestream cluster and ingest curated datasets instantly.
🤝 Mesh Clustering
Pipestream instances will be able to join via Consul to form larger processing clusters. Share load, share datasets, and process the world's information together.