Introducing Dunwich

By Admin — 2025-11-05

Dunwich is the first in a series of applications aimed at medium to large companies.
In short: it helps you get data into your Data Warehouse.


Okay, now we can go into the details.

Why?

I'm a Data Engineer by trade, and I normally get hired or contracted when a client starts getting serious about their data. The first step of this journey is to have a proper Data Warehouse.

From personal experience, companies either stick to one of the solutions below, or end up trying all of them.

The Replicator

The first and most obvious choice is to take the same type of database (e.g. if it's MariaDB, take MariaDB; if it's Postgres, take Postgres), set up replication, and call it done.

This solution does work and may be enough for quite a large chunk of companies. But there are a few glaring issues:
  1. With replication you take everything. Some tables may be useful; most are not.
  2. What do you do when there are multiple teams with multiple dedicated databases? Replicating two sources into a single target isn't a trivial task.
  3. What if the databases use different technologies? E.g. one is MariaDB, another is MongoDB.
  4. Even if all the issues above can be mitigated, there's one last problem: transactional databases are good for transactions, not for analytics. Eventually you need some sort of column-oriented database.

The Loader

The downsides of the previous solution can be mitigated, and the whole setup might even be simplified, with a bunch of periodic queries. Usually you need some kind of orchestrator process, then custom queries prepared for each source table, and eventually the Data Warehouse has up-to-date data. Bonus: sooner or later some sort of transformation process will be needed (Extract Transform Load, ETL, or the now quite popular Extract Load Transform, ELT), and the orchestrator comes in useful here as well.

It actually looks like an easy and solid approach, and might be sufficient for most cases. However, there are a few caveats:
  • A full table load may start taking too long. Data then needs to be processed in windows (usually by filtering on created_at columns). Sadly, not every table is designed with this in mind; usually you have at least a few of these evil ones. (See the sketch after this list.)
  • You are bound to data refresh intervals, not to when the data actually changes. This is OK for 99% of business cases, but there are always a few where you need to react quicker.
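
To make the windowed loading idea concrete, here is a minimal sketch of one such load step in Go. It assumes a Postgres source and warehouse, a hypothetical orders table with a created_at column, and a watermark that the orchestrator would normally persist between runs - none of these names come from a real setup.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // hypothetical choice: Postgres on both sides
)

// loadWindow copies rows created in (since, until] from the source table
// into the warehouse. Table and column names are illustrative only.
func loadWindow(src, dwh *sql.DB, since, until time.Time) error {
	rows, err := src.Query(
		`SELECT id, status, created_at FROM orders
		 WHERE created_at > $1 AND created_at <= $2`, since, until)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status string
		var createdAt time.Time
		if err := rows.Scan(&id, &status, &createdAt); err != nil {
			return err
		}
		// Upsert into the warehouse copy of the table.
		if _, err := dwh.Exec(
			`INSERT INTO dwh_orders (id, status, created_at)
			 VALUES ($1, $2, $3)
			 ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status`,
			id, status, createdAt); err != nil {
			return err
		}
	}
	return rows.Err()
}

func main() {
	src, err := sql.Open("postgres", "postgres://source-dsn")
	if err != nil {
		log.Fatal(err)
	}
	dwh, err := sql.Open("postgres", "postgres://warehouse-dsn")
	if err != nil {
		log.Fatal(err)
	}
	// A real orchestrator persists the watermark between runs;
	// here the window is simply the last hour, for illustration.
	until := time.Now()
	since := until.Add(-1 * time.Hour)
	if err := loadWindow(src, dwh, since, until); err != nil {
		log.Fatal(err)
	}
}
```

A real job would batch the inserts and store the watermark durably; the point here is only the window filter that breaks the load into manageable chunks, and it falls apart for tables that lack a usable timestamp column.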

The SaaS

Surely there are SaaS solutions for this, right? Yes, there are. One of the big players is Fivetran. It's simple, it works, and it can be a great solution if you need to pull some minor data from other sources.

However, when using it for your main data flow, the cost looks something like this (because you're charged by volume):


The Streamer

Most modern databases have a mechanism to publish changes. Usually it works at the system level and exists to make replication work. We can use it to feed the Data Warehouse with a constant stream of updates.

At first glance it sounds the most sophisticated and covers the disadvantages mentioned above. However, there is a price to pay... Let's go through it step by step.

First issue: a never-ending stream of data. Where do we put it?

The logical choice is to use some kind of queue. Usually people end up either with AWS Kinesis (or whatever alternative Azure and GCP have to offer), or with the industry standard: Kafka.

Kafka comes with its own territory:
  • you need someone to maintain it, or you pay for Kafka as a service
  • you need your team to get a basic understanding of how Kafka works: topics, partitions, replication factors, limits, consumer groups, etc.
  • you need to figure out a proper topic structure. If there are only a few tables, having a topic per table is a simple and great idea. What if you have 500 tables? Suddenly the overhead of topic maintenance becomes a real problem (see the sketch below).
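
To illustrate the grouping trade-off, here is a rough Go sketch (using the segmentio/kafka-go client) that pushes change events for many tables into one shared topic, keyed by table name so that events for the same table stay ordered within a partition. The topic and table names are made up for the example.

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// One shared topic for a whole group of tables instead of 500 topics.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "cdc.group-core", // hypothetical grouped topic
		Balancer: &kafka.Hash{},    // same key -> same partition, keeps per-table order
	}
	defer w.Close()

	// The key carries the table identity, so consumers can still
	// route each event to the right warehouse table.
	err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("public.orders"), Value: []byte(`{"op":"c","id":1}`)},
		kafka.Message{Key: []byte("public.users"), Value: []byte(`{"op":"u","id":7}`)},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```

You trade 500 topics for a handful, but the consumer now has to inspect the key to figure out which table each event belongs to, which leads straight to the next issue.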

Second issue: getting data from Kafka to the Data Warehouse

In most cases people just take Apache Spark and assume the problem is solved. However...
  • What do you do with the one-topic-per-table setup when you have 500 tables? List each of them in code?
  • OK, maybe you decided to group, and instead of 500 topics you have something like 5 groups. How do you batch writes then? You need to start grouping records into in-memory buckets per table: when a certain threshold is reached you write, otherwise you wait for more data (see the sketch after this list). From "hello_spark.py" it quickly becomes a jungle.
  • Somehow you figured it out, managed to read from Kafka and write into the Data Warehouse. What now? The stream hasn't stopped producing data. You need to store the offsets for every topic and partition somewhere, and continue from where you left off.
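
Stripped of the Spark specifics, the per-table buffering mentioned in the second bullet looks roughly like this; the Row type, the threshold and the flush callback are placeholders, not real pipeline code.

```go
package main

import "fmt"

// Row is a placeholder for one decoded change event.
type Row map[string]any

// Buffer groups incoming events per destination table and flushes a
// table's bucket once it reaches flushSize rows.
type Buffer struct {
	flushSize int
	buckets   map[string][]Row
}

func NewBuffer(flushSize int) *Buffer {
	return &Buffer{flushSize: flushSize, buckets: make(map[string][]Row)}
}

// Add appends a row for a table and flushes that table when the bucket
// is full. A real pipeline would also flush on a timer and only commit
// Kafka offsets after a successful write.
func (b *Buffer) Add(table string, r Row, flush func(table string, rows []Row) error) error {
	b.buckets[table] = append(b.buckets[table], r)
	if len(b.buckets[table]) >= b.flushSize {
		if err := flush(table, b.buckets[table]); err != nil {
			return err
		}
		b.buckets[table] = nil
	}
	return nil
}

func main() {
	buf := NewBuffer(2)
	writeBatch := func(table string, rows []Row) error {
		fmt.Printf("writing %d rows to %s\n", len(rows), table)
		return nil // a real flush would batch-insert into the warehouse
	}
	_ = buf.Add("orders", Row{"id": 1}, writeBatch)
	_ = buf.Add("orders", Row{"id": 2}, writeBatch) // triggers a flush
	_ = buf.Add("users", Row{"id": 7}, writeBatch)  // still buffered
}
```

Add timers, error handling and offset commits tied to successful flushes, and the "jungle" comment above starts to make sense.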

It's doable, but not ideal. The natural progression for dealing with streaming data is Apache Flink. You get quite similar functionality, but it's built from the ground up for streaming. You get concepts like processing windows, asynchronous sinks, and a built-in way to checkpoint all those Kafka offsets and continue after a restart. Sounds great? Yes, but...
  • The documentation is very lacking
  • There are Python bindings, but they are quite limited. You're basically forced to write Java code (even the Scala API is mostly deprecated by now)
  • There's a local cluster mode, but it is unstable and only feasible for local testing. In production you basically need Kubernetes...
  • Dealing with streaming data is an order of magnitude more complex than micro-batches
(Just for the sake of completeness: Spark has streaming support of its own, https://spark.apache.org/docs/latest/streaming-programming-guide.html, to rival Flink, but the same complexity is inherited.)

So, to summarize:

  • We pay for the components: Kafka plus a Spark or Flink cluster, which in most cases ends up as an operator on a Kubernetes cluster
  • We pay the cost of complexity
After doing all of the math, you end up with a huge number which can, usually quite easily, burn a hole in your budget.


The Dunwich

And finally, the solution we're introducing.

First of all, we take inspiration from "The Streamer" solution and use streaming data. For this, we suggest using Debezium (free, open source, works with most major databases), but there are also open REST and gRPC endpoints where we accept any kind of custom solution.
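
If you'd rather feed the REST endpoint from your own code instead of Debezium, the idea is simply to POST change events. The sketch below sends a Debezium-style envelope (op, before, after, source) to a placeholder URL; the actual endpoint path and accepted schema aren't documented here, so treat every name in it as illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// A Debezium-style change event envelope: "op" is c/u/d,
	// "before"/"after" hold the row images, "source" identifies the origin.
	event := map[string]any{
		"op":     "u",
		"before": map[string]any{"id": 42, "status": "new"},
		"after":  map[string]any{"id": 42, "status": "paid"},
		"source": map[string]any{"db": "shop", "table": "orders"},
	}
	body, err := json.Marshal(event)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder URL: the real ingest path is whatever the Dunwich API exposes.
	resp, err := http.Post("http://localhost:8080/ingest", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("ingest status:", resp.Status)
}
```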

But where do we put that streaming data?

NATS: https://nats.io/ (NATS + JetStream, to be more precise). It gives us the main benefits of Kafka:
  • it can have multiple consumers
  • consumers have state: we mark processed messages as acknowledged, and next time we know where to continue
  • instead of a single flat topic, NATS has a topic + subtopic concept (hierarchical subjects), where the subtopic can be a dynamic value and we can consume by providing wildcards. This quite gracefully solves the 500-tables problem (see the sketch below).
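
To show what that buys you in practice, here is a small sketch using the official nats.go client: one durable pull consumer covers every table under a cdc. prefix through a wildcard subject. The subject layout and names are assumptions for the example, not a description of Dunwich's internals, and it presumes a JetStream stream already captures those subjects.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical layout: subjects like "cdc.<database>.<table>".
	// One wildcard subscription covers all 500 tables.
	sub, err := js.PullSubscribe("cdc.>", "dwh-worker")
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(2*time.Second))
		if err != nil && err != nats.ErrTimeout {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// m.Subject tells us which table the change belongs to,
			// e.g. "cdc.shop.orders".
			log.Printf("got change for %s (%d bytes)", m.Subject, len(m.Data))
			m.Ack() // acknowledged messages are not redelivered
		}
	}
}
```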

How does data reach the Data Warehouse?

We have created a worker service which visits every topic+subtopic combination, checks which table needs to be flushed, and pushes the data using the most efficient pattern (every supported database has its own optimised connector).
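
Conceptually (this is a simplified sketch, not Dunwich's actual code), the "optimised connector per database" idea can be pictured as an interface that each warehouse backend implements with its own bulk-load strategy:

```go
package main

import "fmt"

// Row is one buffered change ready to be written.
type Row map[string]any

// Connector is implemented once per supported warehouse; each
// implementation uses that database's most efficient bulk-load path.
type Connector interface {
	Flush(table string, rows []Row) error
}

// ClickHouseConnector would batch rows into a single columnar INSERT.
type ClickHouseConnector struct{}

func (c *ClickHouseConnector) Flush(table string, rows []Row) error {
	fmt.Printf("clickhouse: bulk insert %d rows into %s\n", len(rows), table)
	return nil
}

// PostgresConnector would use COPY instead of row-by-row INSERTs.
type PostgresConnector struct{}

func (c *PostgresConnector) Flush(table string, rows []Row) error {
	fmt.Printf("postgres: COPY %d rows into %s\n", len(rows), table)
	return nil
}

func main() {
	var dwh Connector = &ClickHouseConnector{}
	_ = dwh.Flush("orders", []Row{{"id": 1}, {"id": 2}})
}
```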


How is the final solution hosted and distributed?

We provide binaries (compiled Golang code):
  • Registry - the central service where we keep schemas, metrics, locks and other vital info
  • Admin (optional) - a UI to manage schemas. Initial schemas are created automatically. What you can do is mark fields for hashing, encryption or hiding (see the sketch after this list). This is used to meet GDPR requirements and to avoid the problems of having sensitive data inside the Data Warehouse.
  • API - the service which receives incoming data and pushes it to NATS. It accepts Debezium events, direct JSON uploads through the REST API, and a gRPC endpoint which also accepts data, but in a more efficient way.
  • Worker - the service which does all of the heavy lifting of writing data into the Data Warehouse.
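
As a rough idea of what marking a field for hashing or hiding means in practice (the field names and the choice of SHA-256 here are illustrative, not a description of Dunwich's internals): the sensitive value is replaced or dropped before it ever reaches the Data Warehouse, and because the hash is deterministic, joins on the hashed value still work.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maskRow applies the configured treatment to sensitive fields before
// the row is written to the warehouse. Field names are illustrative.
func maskRow(row map[string]any, hashed, hidden []string) {
	for _, f := range hashed {
		if v, ok := row[f].(string); ok {
			sum := sha256.Sum256([]byte(v))
			row[f] = hex.EncodeToString(sum[:])
		}
	}
	for _, f := range hidden {
		delete(row, f)
	}
}

func main() {
	row := map[string]any{"id": 1, "email": "jane@example.com", "ssn": "123-45-6789"}
	maskRow(row, []string{"email"}, []string{"ssn"})
	fmt.Println(row) // email is now a stable hash, ssn is gone
}
```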

All of these are binaries compiled for AMD64 and ARM64 processors on Linux. (If there's a need for FreeBSD, or an actual use case for running on Windows - let us know; with some additional testing it should work as well.)
They can be hosted in numerous ways: run directly, dockerized, or deployed on Kubernetes.
Our currently preferred method is Nomad (https://developer.hashicorp.com/nomad), which can run both Docker containers and raw binaries (useful if there's a need to run on a FreeBSD server or anywhere Docker is not available), while handling distribution in a very simple way.

Ok, how do we get it?

Currently it's still in the Alpha stage. When we move into Beta, public download links will appear on this website.

So, is this free and open source?

No and no. It's free to test out; for proper production use you'll need a license. However, the license is a fixed yearly cost which doesn't depend on volume.

And as for open source, that's another rant for another day...

What's the business model and what are the additional costs?

The business model is:
  • sell licenses and use that money to improve the product
  • sell additional support as a subscription
  • an initial setup fee: we come and set everything up to our standards, so that later we can provide support

And finally, if you want to read more: https://starless.io/modules/dunwich