Tsang & Leng: Giving Back to the Community with Open Source Data Pipeline Automation

By Šarūnas Navickas — 2025-12-07

Tags: Open Source, Ruby, Clojure, Data Pipeline
Today, we're excited to announce that Starless.io is open-sourcing TNL (Tsang + Leng), our battle-tested data pipeline generation toolkit. Born from real-world data integration challenges faced by our customers, TNL automates the creation of production-ready data pipelines by transforming SQL queries into fully functional Clojure applications.

What is TNL?

TNL is a complete data pipeline development toolkit consisting of two powerful components:
  • Tsang - a Ruby-based SQL parser that converts SQL queries into an Abstract Syntax Tree (AST) and generates complete Clojure pipeline projects using Liquid templates.
  • Leng - a universal database library written in Clojure that provides adapters for heterogeneous database systems, enabling seamless data movement across different data stores.
The workflow is elegant: SQL Query → Tsang Parser → AST → Code Generator → Clojure Pipeline → Leng Library → Production Data Movement
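To make the first step concrete, the data structure below illustrates roughly what a parsed query might look like. The field names are hypothetical (run the parser yourself to see the real output); the point is that SQL becomes plain structured data the code generator can walk:

;; Hypothetical shape of a parsed query - illustrative only, not Tsang's
;; actual output format.
(def example-ast
  {:type    :select
   :columns [:*]
   :from    {:schema "legacy" :table "orders"}
   :where   {:op :> :column :created_at :value "2024-01-01"}})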

Why We Built TNL

At Starless.io, we encountered a recurring pattern: customers needed to move data between different database systems for migrations, analytics pipelines, and cross-platform synchronization. Each time, engineers would write boilerplate code for:
  • Database connections and authentication
  • Incremental loading with watermarks
  • Batch processing and error handling
  • Schema mapping and type conversions
  • Monitoring and logging
TNL eliminates this repetitive work. You write SQL to describe what data you need, and TNL generates the how - a complete, production-ready pipeline with all the infrastructure code.
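To make that concrete, here is a rough, hand-written sketch of the shape such a pipeline takes. None of the names below are Leng's real API; the source and sink calls are in-memory stand-ins for the generated connection and adapter code:

;; Illustrative skeleton only - not the code TNL actually emits.
(ns example.pipeline)

(defn read-batch [offset batch-size]
  ;; Stand-in for a source adapter; fakes 25 rows in memory.
  (->> (range 25) (map (fn [i] {:id i})) (drop offset) (take batch-size)))

(defn write-batch! [sink batch]
  ;; Stand-in for a sink adapter.
  (println "wrote" (count batch) "rows to" sink))

(defn run-pipeline [{:keys [batch-size sink]}]
  (loop [offset 0]
    (let [batch (read-batch offset batch-size)]
      (when (seq batch)
        (try
          (write-batch! sink batch)
          (catch Exception e
            ;; Error handling hook: log and carry on; generated code could
            ;; also retry or abort here.
            (println "batch failed:" (.getMessage e))))
        (recur (+ offset (count batch)))))))

(run-pipeline {:batch-size 10 :sink :analytics-db})

The generated projects flesh out each of these stand-ins with real connection handling, schema mapping, and monitoring hooks.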

When Ad-Hoc Ingestion Makes Sense

While orchestration platforms like Apache Airflow and Prefect, or commercial ETL suites, excel at ongoing, scheduled workloads, there are specific scenarios where TNL's ad-hoc approach shines:

1. One-Time Data Migrations

Moving from a legacy database to a modern analytics platform doesn't need a permanent ETL infrastructure. Generate a pipeline, run it once, validate the data, and move on.

./tsang/bin/tsang generate \
  --sql "SELECT * FROM legacy.orders WHERE created_at > '2024-01-01'" \
  --name orders-migration \
  --config migration-config.json

2. Rapid Prototyping and Proof-of-Concepts

When evaluating a new analytics database or testing data integration patterns, TNL lets you spin up working pipelines in minutes rather than days. Prototype fast, iterate quickly, validate assumptions.

3. Emergency Data Extractions

When you need to pull specific datasets quickly for compliance audits, incident investigations, or urgent business requests, TNL generates extraction pipelines on-demand without requiring complex workflow orchestration.

4. Cross-Database Synchronization for Small Teams

Not every organization needs a data platform team. Small engineering teams can use TNL to handle periodic synchronization between operational and analytical databases without maintaining heavy infrastructure.

5. Exotic Source/Sink Combinations

Need to move data from Cassandra to Druid? MongoDB to Elasticsearch? PostgreSQL to a specialized analytics store? TNL's universal adapter library (Leng) handles heterogeneous database combinations that might not have pre-built connectors in standard ETL tools.

6. Incremental Backfills

When historical data needs to be loaded incrementally with watermark tracking, TNL generates pipelines with built-in watermark management, letting you safely resume interrupted loads.
# Generated pipeline automatically handles watermarking
export WATERMARK_COLUMN=created_at
clj -M:run incremental
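The generated watermark handling lives in Leng's utilities; the hand-written sketch below only illustrates the resume-safe pattern behind it (the names are made up and the "source table" is an in-memory vector): remember the highest timestamp written so far and pull only newer rows on the next run.

;; Illustration of watermark-based incremental loading - not Leng's API.
(ns example.watermark)

(def source-rows
  ;; Stand-in source table; created_at is the watermark column.
  [{:id 1 :created_at 100} {:id 2 :created_at 200} {:id 3 :created_at 300}])

(def watermark
  ;; A real pipeline persists this value (file, table, ...) so an
  ;; interrupted load can resume; an atom is enough for the sketch.
  (atom 0))

(defn load-increment! []
  (let [new-rows (filter #(> (:created_at %) @watermark) source-rows)]
    (doseq [row new-rows]
      (println "loading row" (:id row)))
    (when (seq new-rows)
      (reset! watermark (apply max (map :created_at new-rows))))))

(load-increment!)  ; loads rows 1-3
(load-increment!)  ; loads nothing - the watermark is already at 300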

7. Custom Business Logic in Data Movement

Since TNL generates actual Clojure code, you can inspect, modify, and extend the pipeline logic before deployment. Add custom transformations, business rules, or data quality checks directly in the generated code.
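For example, if the generated project routes every batch through a transform step (an assumed but typical structure), adding a data quality filter or a business rule is a small, local edit:

;; Hypothetical hand edit to a generated pipeline's transform step.
(ns example.transform)

(defn valid-row? [row]
  ;; Added data quality check: drop rows without a user id.
  (some? (:user_id row)))

(defn enrich [row]
  ;; Added business rule: tag each row with its processing region.
  (assoc row :region "eu-west"))

(defn transform-batch [batch]
  (->> batch
       (filter valid-row?)
       (map enrich)))

(transform-batch [{:user_id 42 :event_type "click"}
                  {:user_id nil :event_type "view"}])
;; => ({:user_id 42, :event_type "click", :region "eu-west"})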

Real-World Example: Analytics Pipeline Generation

Here's how simple it is to create a production pipeline:
cd tsang
./bin/tsang generate \
  --sql "SELECT user_id, event_time, event_type FROM events.tracking" \
  --name tracking-pipeline \
  --config config.json
The configuration file specifies your infrastructure:
{
  "batch_size": 5000,
  "watermark_enabled": true,
  "timestamp_column": "created_at",
  "source_type": "cassandra",
  "sink": {
    "type": "druid",
    "table": "users_analytics"
  }
}
TNL generates a complete project in ./build/tracking-pipeline/ with connection handling, batch processing, watermarking, error recovery, and monitoring hooks. Set your environment variables and run:

cd build/tracking-pipeline
export CASSANDRA_HOST=localhost
export DRUID_URL=http://localhost:8888
clj -M:run incremental

Supported Data Sources and Sinks

Sources:
  • PostgreSQL
  • Cassandra
  • MongoDB
  • CSV (yes, you can treat local CSV files as a source)
  • (Extensible adapter system)
Sinks:
  • Druid
  • PostgreSQL
  • Elasticsearch
  • (Extensible adapter system - see the sketch below)
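Leng's real adapter interfaces live in the repository under leng/src/gm/; the sketch below is not that code, just an illustration of the protocol-plus-record pattern a new Clojure adapter would typically follow:

;; Illustrative only - see leng/src/gm/sink/ for the actual sink adapters.
(ns example.adapter)

(defprotocol Sink
  (open!  [this config] "Open a connection to the target store.")
  (write! [this batch]  "Write one batch of row maps.")
  (close! [this]        "Release the connection."))

(defrecord PrintSink []
  Sink
  (open!  [this config] (println "connected with" config) this)
  (write! [this batch]  (println "writing" (count batch) "rows") this)
  (close! [this]        (println "closed") this))

(-> (->PrintSink)
    (open! {:url "stdout"})
    (write! [{:id 1} {:id 2}])
    (close!))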

When NOT to Use TNL

To be clear, TNL isn't designed to replace comprehensive data platforms for organizations with complex, ongoing data workflows. You probably shouldn't use TNL if:
  • You need sophisticated DAG orchestration with complex dependencies
  • You require extensive monitoring, alerting, and SLA management
  • You're building a long-term, production data platform with dozens of pipelines
  • You need visual workflow builders or no-code/low-code interfaces
  • You require extensive built-in data quality and governance features
For those scenarios, modern data orchestration platforms are the right choice. TNL excels at the "I need this data moved, and I need it now" use cases.

Getting Started

TNL is now available on GitHub under an open-source license:
Repository: github.com/Griaustinis-Media/tnl
Quick start:
# Clone the repository
git clone https://github.com/Griaustinis-Media/tnl
cd tnl

# Install Tsang dependencies
cd tsang
bundle install
chmod +x bin/tsang

# Test the parser
./bin/tsang parse --sql "SELECT * FROM users" --pretty

# Generate your first pipeline
./bin/tsang generate \
  --sql "SELECT * FROM your_table" \
  --name my-pipeline \
  --config config.json

Why Open Source?

Starless.io was built by engineers who benefited enormously from open source software. TNL represents hundreds of hours of development, debugging, and refinement through real-world usage. By open-sourcing it, we hope to:
  • Help teams solve data integration challenges faster
  • Enable experimentation with heterogeneous database combinations
  • Contribute back to the data engineering community
  • Foster collaboration and improvements from diverse perspectives
We developed TNL to solve our own problems and our customers' problems. Now it's available to solve yours.

Project Structure

tnl/
├── tsang/              # SQL Parser & Code Generator
│   ├── bin/tsang      # CLI tool
│   ├── lib/           # Parser, lexer, AST, codegen
│   └── templates/     # Liquid templates for code generation
└── leng/              # Universal Database Library
    └── src/gm/
        ├── source/    # Source adapters (Cassandra, Postgres, MongoDB)
        ├── sink/      # Sink adapters (Druid, Postgres, Elasticsearch)
        └── utils/     # Utilities (watermarking, batching)

Contributing

We welcome contributions! Whether it's new database adapters, improved code generation templates, bug fixes, or documentation improvements, we'd love your help making TNL better.
Check out the repository, open issues, submit pull requests, and join us in making data pipeline generation accessible to everyone.

Conclusion

Data engineering shouldn't require rebuilding the same infrastructure patterns repeatedly. TNL takes the SQL you already know and generates the infrastructure code you don't want to write.
We built it for ourselves. We refined it with our customers. Now we're sharing it with you.
Welcome to TNL. Let's move some data.