Dunwich: Simplifying Data Warehouse Ingestion Without the Complexity
By Admin — 2025-11-05
Overview
Dunwich is a modern data ingestion platform designed to help mid-sized and large companies reliably move data into their data warehouse. Unlike traditional ETL approaches that force you to trade off simplicity, cost, and capability, Dunwich combines the best aspects of streaming architecture with operational simplicity.
The Data Ingestion Challenge
As data engineering becomes increasingly critical to business operations, the first architectural decision most organizations face is straightforward: how do we get data into our data warehouse?
From years of experience working with data-driven organizations, most companies follow a predictable journey. They begin with one solution, discover its limitations, then gradually explore alternatives—often cycling through multiple approaches before settling on something sustainable. This exploration process, while valuable, typically costs time, resources, and engineering effort.
The sections below detail four common approaches to data ingestion, their tradeoffs, and why Dunwich was created as a more practical alternative.
Four Traditional Data Ingestion Patterns
The Replicator: Database-to-Database Mirroring
The most obvious first choice is database replication. If you're running MariaDB in production, set up a MariaDB replica as your warehouse. If it's PostgreSQL, replicate PostgreSQL.
Why it appeals: Replication is battle-tested, operationally straightforward, and works for many organizations.
Why it falls short:
- Unnecessary data: Replication copies everything from source to destination. You end up ingesting thousands of tables when you need perhaps dozens.
- Multi-source complexity: Combining data from multiple databases (each with their own replication stream) into a single warehouse becomes non-trivial.
- Technology mismatch: When sources use different technologies (MariaDB and MongoDB, for example), a single replication strategy no longer works.
- Transactional vs. analytical mismatch: Operational databases are optimized for transactions, not analytics. Eventually, you'll need column-based storage (like Snowflake or BigQuery) to enable efficient analytical queries.
The Loader: Periodic Batch Queries
This approach addresses replication's limitations by using an orchestrator (like Airflow or Cron) to periodically query source systems and load specific tables into your warehouse.
Why it appeals: Simpler to implement than replication, and you can selectively ingest only the tables you need. The orchestrator also enables transformation logic (ETL or ELT patterns).
Why it falls short:
- Incremental loading complexity: Full-table loads become unmanageable as data volume grows. You're forced to implement windowing logic (filtering by created_at timestamps, for example; see the sketch after this list), but not all tables are designed with this pattern in mind.
- Fixed refresh intervals: You're bound to your batch schedule (e.g., every 4 hours), not to when data actually changes. For most use cases, this is acceptable—but business-critical events sometimes demand faster response times.
- Scalability ceiling: As your table count grows, orchestration logic becomes increasingly complex to maintain.
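To make the windowing point concrete, here is a minimal Python sketch of what an orchestrated incremental load tends to look like. It assumes the source table happens to have a created_at column and that a watermark is persisted between runs; the connection string, table name, and write_to_warehouse() helper are illustrative, not part of any real pipeline.

```python
# Minimal sketch, not production code: a "Loader"-style incremental pull that
# only works because this table has a created_at column.
from datetime import datetime

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@source-db/app")  # hypothetical source

def write_to_warehouse(table: str, rows: list) -> None:
    """Placeholder for a bulk load (COPY / INSERT) into the warehouse."""

def load_incrementally(table: str, watermark: datetime) -> datetime:
    """Pull rows created since the last run and return the new watermark."""
    query = sa.text(
        f"SELECT * FROM {table} WHERE created_at > :watermark ORDER BY created_at"
    )
    with engine.connect() as conn:
        rows = conn.execute(query, {"watermark": watermark}).fetchall()
    if rows:
        write_to_warehouse(table, rows)
        watermark = rows[-1].created_at  # rows are ordered, so the last one is newest
    return watermark
```

Every run depends on the previous watermark being stored somewhere (an orchestrator variable, a metadata table), and tables without a reliable created_at or updated_at column cannot be loaded this way at all.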
The SaaS Solution: Managed Ingestion Services
Companies like Fivetran offer a turnkey solution: point them at your data sources, and they handle the complexity.
Why it appeals: Simple to set up and operate. No infrastructure to maintain.
Why it falls short:
- Cost scaling with volume: Fivetran charges by the volume of data moved. For a primary data pipeline handling significant volume, costs can quickly exceed the budget for alternative solutions.
- Limited customization: You're constrained by the platform's capabilities and pricing model.
The Streamer: Real-Time Data Integration
To overcome batch latency, some organizations move toward streaming architectures. Most modern databases publish changes via change data capture (CDC) mechanisms. Why not consume that stream directly?
This approach often looks like: Database CDC → Kafka → Spark/Flink → Data Warehouse
Why it appeals: You get sub-second data freshness and eliminate the need to design windowing logic. It feels like the "most sophisticated" solution.
But it introduces new problems...
The Queue Infrastructure Problem
Streaming data needs somewhere to live. Most teams choose either managed services (AWS Kinesis, Azure Event Hubs, GCP Pub/Sub) or the industry standard: Apache Kafka.
Kafka, while powerful, brings operational overhead:
- Infrastructure management: You either maintain Kafka yourself or pay for a managed service (Confluent Cloud, AWS MSK).
- Conceptual burden: Your team needs to understand topics, partitions, replication factors, and consumer groups.
- Cardinality explosion: If you have 500 tables, do you create 500 topics? Topic maintenance becomes a significant operational problem.
The Streaming Processing Problem
Getting data from Kafka to your data warehouse seems straightforward until you actually do it. Most teams reach for Apache Spark, assuming the problem is solved.
Reality: With 500 Kafka topics and a single Spark job, you now need to list all 500 topics in code. If you consolidate into topic groups, you face new challenges: how do you batch writes to your warehouse? You end up managing in-memory buffers, tracking thresholds, and handling backpressure—quickly becoming far more complex than the simple hello_spark.py script you imagined.
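As a rough illustration of how quickly this grows beyond hello_spark.py, here is a hedged sketch of the buffering a plain Kafka consumer ends up needing. The topic names, flush threshold, and write_to_warehouse() helper are assumptions for illustration, not a reference implementation.

```python
# Sketch only: per-table buffers, a flush threshold, and manual commits are the
# bare minimum for a Kafka-to-warehouse loader; backpressure, retries, and
# partial failures are not even handled here.
from collections import defaultdict

from confluent_kafka import Consumer

FLUSH_THRESHOLD = 10_000  # rows buffered per table before a warehouse write

def write_to_warehouse(table: str, rows: list) -> None:
    """Placeholder for a bulk load into the warehouse."""

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",
    "enable.auto.commit": False,  # only commit after data is safely written
})
consumer.subscribe([f"cdc.table_{i}" for i in range(500)])  # every topic, listed in code

buffers: dict[str, list] = defaultdict(list)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    buffers[msg.topic()].append(msg.value())
    if len(buffers[msg.topic()]) >= FLUSH_THRESHOLD:
        write_to_warehouse(msg.topic(), buffers.pop(msg.topic()))
        # Note: this still commits offsets for rows sitting unflushed in other
        # tables' buffers (the state problem discussed in the next section).
        consumer.commit(asynchronous=False)
```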
The State Management Problem
The stream hasn't stopped. After writing a batch to your warehouse, you need to track which messages you've processed (Kafka offsets) for each topic. On restart, your application must resume from exactly where it left off.
This requires distributed state management, which most Spark deployments handle poorly.
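One hedged way to picture the problem: instead of trusting consumer-group commits alone, teams often end up persisting offsets next to the data and seeking back to them on startup. The load_saved_offsets() helper and the metadata table it implies are assumptions for illustration, not an existing API.

```python
# Sketch of restart logic when offsets are stored in the warehouse itself:
# each partition is seeked back to where the last successful write ended.
from confluent_kafka import Consumer, TopicPartition

def load_saved_offsets() -> dict:
    """Read {(topic, partition): next_offset} from a warehouse metadata table."""
    return {("cdc.orders", 0): 184_523}  # illustrative value

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",
    "enable.auto.commit": False,
})

consumer.assign([
    TopicPartition(topic, partition, offset)
    for (topic, partition), offset in load_saved_offsets().items()
])
# From here on, the consume loop must keep the metadata table and the warehouse
# writes in sync (ideally in one transaction) to avoid gaps or duplicates.
```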
The Flink Alternative
Apache Flink was purpose-built for streaming and solves many of these problems elegantly:
- Native streaming semantics: Built-in windowing, asynchronous sinks, and automatic offset management.
- Cleaner abstractions: State management is first-class.
However, Flink introduces its own costs:
- Documentation gaps: Flink's documentation, while improving, still lags behind Spark's.
- Language barriers: Python bindings are limited. You're effectively forced to write Java (Scala is largely deprecated).
- Operations complexity: Local clusters are unstable. Production Flink typically requires Kubernetes, introducing container orchestration complexity.
- Inherent complexity: Streaming data processing is fundamentally more complex than micro-batching.
Note: Apache Spark's Structured Streaming (introduced back in the 2.x releases) competes with Flink in this space, but it inherits similar complexity tradeoffs.
The Cost Reality
After summing up infrastructure, operational overhead, and engineering time, the total cost typically becomes prohibitive:
- Component costs: Kafka or managed streaming, Spark/Flink cluster, Kubernetes infrastructure.
- Complexity costs: Skilled engineers to operate these systems, debugging non-deterministic issues, handling edge cases in distributed systems.
For most organizations, this math doesn't work.
The Dunwich Approach: Streaming Without the Complexity
Dunwich takes inspiration from streaming architectures while eliminating unnecessary complexity and cost.
Core Design Principles
Use change data capture, but choose lightweight infrastructure: Instead of Kafka, Dunwich uses NATS with JetStream. NATS provides the essential benefits of Kafka—multiple consumers, persistent state, consumer groups—while being dramatically simpler to operate.
NATS elegantly solves the "500 tables problem" through a topic hierarchy: instead of managing 500 independent topics, you use wildcards to consume database.table.* patterns. This hierarchical approach scales naturally.
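A rough sketch with the nats-py client shows the idea. The database.<table>.<operation> subject layout and the stream and consumer names are assumptions for illustration, not Dunwich's documented naming scheme.

```python
# Sketch: one JetStream stream captures every table's subject, and a single
# wildcard subscription consumes them all, with no per-table topics to manage.
import asyncio

import nats

async def main() -> None:
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # One stream for all change events, however many tables exist.
    await js.add_stream(name="cdc", subjects=["database.>"])

    # One durable wildcard consumer instead of hundreds of separate subscriptions.
    sub = await js.subscribe("database.*.*", durable="warehouse-worker")
    async for msg in sub.messages:
        table = msg.subject.split(".")[1]  # e.g. database.orders.insert -> "orders"
        # buffer rows per table and flush to the warehouse here
        await msg.ack()

if __name__ == "__main__":
    asyncio.run(main())
```

With this layout, adding a new table on the source side requires no new queue configuration; its events simply match the existing wildcard.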
Intelligent data movement: Dunwich runs a worker service that visits each topic, determines which tables need flushing, and uses optimized connectors specific to your data warehouse technology. This eliminates the need to manage streaming-to-warehouse logic yourself.
Flexible source integration: Data can enter Dunwich through:
- Debezium (free, open-source CDC from most major databases)
- REST API for ad-hoc JSON uploads
- gRPC endpoint for efficient binary data ingestion and custom integrations
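For the REST path listed above, an ad-hoc upload can be as small as a single HTTP POST. The endpoint path, payload shape, and auth header below are assumptions, since the public API is not yet documented; treat this purely as a sketch of the idea.

```python
# Illustrative only: push one JSON record into Dunwich over the REST API.
import requests

record = {"order_id": 1042, "status": "shipped", "amount": 99.90}

resp = requests.post(
    "https://dunwich.example.internal/api/ingest/orders",  # hypothetical endpoint
    json=record,
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth scheme
    timeout=10,
)
resp.raise_for_status()
```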
Architecture Components
Dunwich is distributed as precompiled binaries (Go, compiled for AMD64 and ARM64 Linux). The platform consists of four key services:
Component | Purpose
API | Accepts incoming data from Debezium, REST, or gRPC endpoints and routes it to NATS
Registry | Central service storing schemas, metrics, locks, and system metadata
Admin (optional) | Web UI for schema management, data governance (field hashing, encryption, GDPR masking)
Worker | Reads from NATS and writes data to your data warehouse using optimized connectors
Deployment Options
All components can be run as:
- Standalone binaries directly on any Linux system
- Docker containers for containerized environments
- Nomad jobs (recommended) for simplified orchestration without Kubernetes complexity
Dunwich's multi-platform binaries also support FreeBSD and Windows (with additional testing) if needed.
Licensing and Business Model
Pricing Structure
Dunwich is free to evaluate in development and testing environments. For production use, a fixed annual license fee applies—not based on data volume. This predictable cost model contrasts sharply with consumption-based pricing (like Fivetran or Kinesis), where costs scale with data throughput.
Additional Services
- Support subscription: Ongoing technical support and assistance.
- Initial setup fee: Dunwich team handles production deployment and configuration according to best practices, ensuring your team can maintain the system independently.
Open Source Considerations
Dunwich itself is not open source. However, it integrates with open-source standards (Debezium, NATS) to avoid vendor lock-in and ensure long-term flexibility.
Getting Started
Dunwich is currently in alpha. Beta release will introduce public download links on the Dunwich website.
For more information and early access, visit starless.io/modules/dunwich.
Conclusion
Choosing a data ingestion strategy shouldn't require trading off simplicity, cost, and capability. Dunwich demonstrates that modern streaming architectures can be operationally lightweight and financially predictable.
If you're tired of managing complex Kafka clusters, overpaying for SaaS ingestion, or maintaining batch windows that don't align with business needs, Dunwich offers a practical alternative—the reliability of streaming data with the simplicity of traditional batch systems.
Ready to evaluate Dunwich for your organization? Request early access or reach out to learn how Dunwich can simplify your data warehouse ingestion.