The Document Service Mesh
for Serious Data Engineering
Stop building fragile Python script chains. Pipestream is a high-performance orchestration engine that combines the real-time latency of gRPC with the durability of Kafka. Type-safe, polyglot, and schema-first.
Kafka = "Save Game"
We treat Kafka as a checkpoint system. Pause, rewind, and replay your pipeline from any node. Perfect for DLQ handling or re-running expensive LLM steps without hitting the API again.
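A minimal sketch of what "rewinding to a save point" might look like from the outside, using the `confluent-kafka` client. The topic, group, and offset are illustrative, and `reprocess` stands in for handing the document back to a module; none of this is the actual Pipestream API.

```python
# Hypothetical sketch: replaying a pipeline node from its Kafka "save point".
from confluent_kafka import Consumer, TopicPartition

def reprocess(payload: bytes) -> None:
    """Stand-in for handing the PipeDoc back to the summarizer module."""
    ...

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-summarizer",   # fresh group so offsets start clean
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,       # we decide what counts as "done"
})

# Rewind the node's input topic to the offset where the expensive LLM step failed.
consumer.assign([TopicPartition("pipeline.summarize.in", 0, 1250)])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break
    if msg.error():
        continue
    reprocess(msg.value())   # re-run the step without re-ingesting anything
    consumer.commit(msg)     # checkpoint: this document is now "saved"
```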
Visual DAG Builder
Design pipelines with a drag-and-drop graph editor. Connect nodes via Kafka (async) or gRPC (sync). The engine validates the graph for cycles and supports multiple pipelines per client.
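A sketch of the kind of graph the visual editor could produce and the cycle check the engine would run on it. The node names, edge shape, and transport labels are illustrative, not the real Pipestream format.

```python
# Illustrative pipeline graph: nodes connected by async (Kafka) or sync (gRPC) edges.
pipeline = {
    "nodes": ["intake", "parse", "summarize", "index"],
    "edges": [
        {"from": "intake",    "to": "parse",     "transport": "grpc"},
        {"from": "parse",     "to": "summarize", "transport": "kafka"},
        {"from": "summarize", "to": "index",     "transport": "grpc"},
    ],
}

def has_cycle(edges: list[dict]) -> bool:
    """Depth-first search; the engine rejects any pipeline where this is True."""
    graph: dict[str, list[str]] = {}
    for e in edges:
        graph.setdefault(e["from"], []).append(e["to"])

    visiting, done = set(), set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True          # back-edge found: the graph has a cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in graph)

assert not has_cycle(pipeline["edges"])
```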
BYOK Encryption
Security is baked in. Data stored in the S3 "Claim Check" layer can be encrypted with your own keys (BYOK). We handle the secure envelope; you hold the keys.
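A hedged sketch of the envelope pattern described here, using the `cryptography` library. The per-document data key and the `wrap_with_customer_key` callback are illustrative; the actual key-wrapping service is whatever KMS the customer brings.

```python
# Hypothetical envelope-encryption sketch for the S3 "Claim Check" layer.
# Pipestream stores only the wrapped key and the ciphertext; the customer holds the BYOK key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_for_claim_check(document: bytes, wrap_with_customer_key) -> dict:
    data_key = AESGCM.generate_key(bit_length=256)            # fresh key per document
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, document, None)
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        # Only the customer's key material can unwrap this data key.
        "wrapped_key": wrap_with_customer_key(data_key),
    }
```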
Dual-Mode Transport
Why choose? Route high-priority UI requests via gRPC (milliseconds) and bulk ingestion via Kafka (terabytes). The engine handles the switching dynamically.
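The switching rule implied here can be pictured as a small heuristic; the threshold and labels below are illustrative, and the real engine makes this decision dynamically.

```python
# Illustrative routing rule only, not the engine's actual policy.
GRPC_MAX_PAYLOAD = 4 * 1024 * 1024   # keep synchronous hops small and fast

def pick_transport(priority: str, payload_size: int) -> str:
    if priority == "interactive" and payload_size <= GRPC_MAX_PAYLOAD:
        return "grpc"    # millisecond round trip, no broker hop
    return "kafka"       # durable, replayable, built for bulk ingestion
```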
Polyglot Service Mesh
Write LLM logic in Python. Write parsers in Rust. Connect them seamlessly in a Consul-backed service mesh. Modules are "dumb" endpoints; the Engine is the smart router.
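A sketch of how a module might announce itself to the Consul-backed mesh from Python, using the `python-consul` client. The service name, address, port, and health check are illustrative.

```python
# Hypothetical registration sketch; values are illustrative.
import consul

c = consul.Consul(host="localhost", port=8500)
c.agent.service.register(
    name="summarizer",                 # the Engine discovers modules by name
    service_id="summarizer-1",
    address="10.0.0.12",
    port=50051,                        # the module's gRPC port
    check=consul.Check.tcp("10.0.0.12", 50051, interval="10s"),
)
# The module stays "dumb": it only serves gRPC; routing lives in the Engine.
```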
Schema-First Governance
Prevent data corruption with strict Apicurio & Protobuf contracts. The Engine controls mapping, so a rogue script can never break your Search Index.
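A sketch of what enforcing that contract at the boundary could look like. `pipedoc_pb2` stands in for code generated from the registered Protobuf schema; it is not the actual Pipestream module.

```python
# Hypothetical contract check at the Engine boundary.
from google.protobuf.message import DecodeError
from pipedoc_pb2 import PipeDoc   # illustrative generated class, via Apicurio + protoc

def parse_or_reject(payload: bytes) -> PipeDoc:
    try:
        return PipeDoc.FromString(payload)   # strict binary contract
    except DecodeError as exc:
        # A rogue script producing off-schema bytes never reaches the index.
        raise ValueError("payload violates the PipeDoc contract") from exc
```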
Pipestream vs. The World
Existing tools are either too simple (libraries) or too heavy (ETL tools). Pipestream fills the gap.
| Feature | Pipestream | LangChain | Apache NiFi | Data Prepper |
|---|---|---|---|---|
| Primary Goal | Document Business Logic | Prototyping / RAG | Moving Raw Bits (ETL) | Log Aggregation |
| Architecture | Service Mesh (Hub & Spoke) | Library (Client-side) | Visual DAG Graph | Linear Pipeline |
| Transport | Hybrid (gRPC + Kafka) | In-Memory / HTTP | Disk-based | Memory Buffer |
| State Mgmt | Claim-Check (S3) | Manual | Heavy FlowFiles | Transient |
| Schema | Strict (Protobuf) | Loose (JSON) | Schemaless | Loose (JSON) |
The Engineering Flow
A Digital Asset Manager (DAM) for text, paired with a language-agnostic stream processor.
1. Ingest & Claim Check
The Intake Service accepts documents via gRPC. Large binaries (PDFs, videos) are immediately offloaded to encrypted S3 storage. Only the lightweight PipeDoc metadata enters the stream.
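A sketch of the claim-check handoff using `boto3`. The bucket, key layout, and metadata fields are made up for illustration.

```python
# Illustrative claim-check step: the heavy binary goes to S3,
# only a light reference travels through the pipeline.
import uuid
import boto3

s3 = boto3.client("s3")

def claim_check(pdf_bytes: bytes, title: str) -> dict:
    key = f"claims/{uuid.uuid4()}.pdf"
    s3.put_object(
        Bucket="pipestream-claims",
        Key=key,
        Body=pdf_bytes,
        ServerSideEncryption="aws:kms",   # encrypted at rest
    )
    # Only this lightweight metadata enters the Kafka/gRPC stream.
    return {"title": title, "claim_check": f"s3://pipestream-claims/{key}"}
```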
2. The Engine Router
The Engine acts as the "DNS" for your data. It checks the DAG, detects cycles, and routes to the next Node via gRPC (fast) or Kafka (durable save-point).
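A sketch of the next-hop lookup, reusing the illustrative `pipeline` shape from the DAG sketch above. `send_grpc` and `publish_kafka` are stand-ins, not Pipestream calls.

```python
# Hypothetical routing step inside the Engine: find the outgoing edge for the
# node that just finished and dispatch on its configured transport.
def send_grpc(node: str, payload: bytes) -> None: ...      # stand-in for a gRPC call
def publish_kafka(node: str, payload: bytes) -> None: ...  # stand-in for a Kafka produce

def route_next(pipeline: dict, current: str, pipedoc_bytes: bytes) -> None:
    for edge in pipeline["edges"]:
        if edge["from"] != current:
            continue
        if edge["transport"] == "grpc":
            send_grpc(edge["to"], pipedoc_bytes)      # fast, synchronous hop
        else:
            publish_kafka(edge["to"], pipedoc_bytes)  # durable save-point
```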
3. Polyglot Processing
Your Python Summarizer Module receives a typed Request. It processes the document and returns a typed Response. It knows nothing about the mesh, retries, or storage.
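A sketch of such a "dumb" module: a plain gRPC servicer. The `summarizer_pb2*` stubs come from an illustrative `summarizer.proto`, not from Pipestream itself, and the truncation stands in for a real LLM call.

```python
# Hypothetical module skeleton: request in, response out, nothing else.
from concurrent import futures
import grpc
import summarizer_pb2, summarizer_pb2_grpc   # illustrative generated stubs

class Summarizer(summarizer_pb2_grpc.SummarizerServicer):
    def Summarize(self, request, context):
        summary = request.text[:200]   # stand-in for the real LLM call
        # No mesh, retry, or storage logic here.
        return summarizer_pb2.SummarizeResponse(summary=summary)

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
summarizer_pb2_grpc.add_SummarizerServicer_to_server(Summarizer(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```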
4. Strict Indexing
The Engine maps the result to the OpenSearch schema. Decryption happens only at the last mile for indexing, ensuring security in transit.
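A sketch of what a strict OpenSearch mapping for that last mile might look like, via `opensearch-py`. The index name and fields are illustrative.

```python
# Illustrative strict mapping: fields not declared here are rejected,
# so a rogue module cannot silently mutate the search index.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="pipedocs",
    body={"mappings": {
        "dynamic": "strict",            # unknown fields fail the write
        "properties": {
            "title":       {"type": "text"},
            "summary":     {"type": "text"},
            "claim_check": {"type": "keyword"},
        },
    }},
)
```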
The "Data Commons" Vision
Future versions will allow meshes to cluster together.
🌍 Subscribe to the World
We are building standardized, pre-indexed subscriptions for Wikipedia, Gutenberg, Census, and IRS data. Connect your mesh to the public Pipestream cluster and ingest curated datasets instantly.
🤝 Mesh Clustering
Pipestream instances will be able to join via Consul to form larger processing clusters. Share load, share datasets, and process the world's information together.