About
An open-source DataOps and dataframe-centric ETL framework for building lakehouses.
What is it?
Laktory is your all-in-one solution for defining both data transformations and Databricks resources. Imagine if Terraform, Databricks Asset Bundles, and dbt combined forces and added support for the DataFrame API—that’s essentially Laktory.
This open-source framework streamlines the creation, deployment, and execution of data pipelines while adhering to essential DevOps practices such as version control, code reviews, and CI/CD integration. Powered by Narwhals, Laktory enables seamless transitions between Apache Spark, Polars, and other frameworks to perform data transformations reliably and at scale. Its modular and flexible design allows you to effortlessly combine SQL statements with DataFrame operations, reducing complexity and enhancing productivity.
Since Laktory pipelines are built on top of Narwhals, they can run in any environment that supports Python—from your local machine to a Kubernetes cluster. They can also be deployed and orchestrated as Databricks Jobs or Lakeflow Declarative Pipelines, offering a simple, fully managed, and low-maintenance solution.
But Laktory goes beyond data pipelines. It empowers you to define and deploy your entire Databricks data platform—from Unity Catalog and access grants to compute and quality monitoring—providing a complete, modern solution for data platform management.
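As a rough illustration, a stack can declare governance resources alongside pipelines in the same YAML file. The sketch below is only indicative: the resource keys, group name, and fields are assumptions made for illustration, not verbatim Laktory schema.
# Illustrative stack sketch; resource keys and fields are assumed, not verbatim schema
name: my-lakehouse
resources:
  databricks_catalogs:
    dev:
      name: dev
      grants:
        - principal: data-engineers   # assumed group name
          privileges:
            - USE_CATALOG
            - CREATE_SCHEMA
  pipelines:
    pl-stock-prices:
      name: pl-stock-prices
      nodes: []   # pipeline nodes, as shown in the Declare section below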
Why Laktory?
Laktory is designed to accelerate value generation and reduce the total cost of operations by focusing on four key areas: simplicity, speed, scalability and quality.
Simplicity
With Laktory, managing both Databricks resources and data transformations is streamlined into a single tool. This empowers your data team to take full ownership of the solution, eliminating the need to juggle multiple technologies. Say goodbye to relying on external Terraform experts to handle compute, workspace configuration, and Unity Catalog, while your data engineers and analysts try to combine Databricks Asset Bundles and dbt to build data pipelines. Laktory consolidates these functions, simplifying the entire process and reducing the overall cost.
Speed
Adopting a DataOps approach and writing data pipelines as code enables faster, more reliable production releases. With Laktory, you can run and debug your pipelines directly from your favorite IDE before deploying to Databricks, allowing for rapid iteration. With Laktory taking care of read, write and DAG generation, you can concentrate on what matters most—business logic and creating value—while minimizing boilerplate code.
Scalability
Scale the right job to the right level. Run on a powerful Spark cluster when dealing with large datasets, or experiment locally with Polars on your machine with smaller ones. Switching between environments is effortless with a simple configuration change, ensuring that Laktory adapts to both small and large-scale needs without complexity.
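For instance, switching the pipeline node shown in the Declare section from Spark to Polars could be as simple as changing the backend and pointing the source at local files. This is a minimal sketch; the POLARS backend value and the local path are assumptions for illustration.
name: slv_stock_prices
dataframe_backend: POLARS   # assumed backend value (the Declare example below uses PYSPARK)
source:
  path: ./data/brz_stock_prices/   # hypothetical local path
  format: JSON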
Quality
Elevate data quality to a first-class citizen. Laktory integrates robust data quality expectations at both row and aggregate levels, ensuring that only trusted data reaches production analytics. Invalid records can trigger pipeline failure, be dropped, or automatically quarantined for later review.
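As a rough sketch of what this can look like on a pipeline node, expectations can be declared alongside the source and sinks; the field names, expression syntax, and action values below are illustrative assumptions rather than verbatim schema.
name: slv_stock_prices
expectations:
  - name: positive_volume
    expr: volume >= 0        # row-level check (assumed expression syntax)
    action: DROP             # assumed action; invalid rows could also fail the run
  - name: no_null_symbol
    expr: symbol IS NOT NULL
    action: QUARANTINE       # assumed action; set invalid rows aside for review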
Who is using it?
A growing number of businesses rely on Laktory to build, deploy, and scale their lakehouses with confidence.
Curious about their experience? Read their testimonials.
How does it work?
Using YAML configuration files or Python code, define a stack of resources, such as a data pipeline with SQL-based and Spark-based transformations.
Declare
name: slv_stock_prices
dataframe_backend: PYSPARK
source:
  path: /Volumes/dev/sources/landing/tables/brz_stock_prices/
  format: JSON
sinks:
  - schema_name: finance
    table_name: slv_stock_prices
transformer:
  nodes:
    - expr: |
        SELECT
          data.created_at AS created_at,
          data.symbol AS symbol,
          data.open AS open,
          data.close AS close,
          data.high AS high,
          data.low AS low,
          data.volume AS volume
        FROM
          {df}
    - func_name: drop_duplicates
      func_kwargs:
        subset:
          - symbol
          - created_at
...
from laktory import models

node = models.PipelineNode(
    name="slv_stock_prices",
    dataframe_backend="PYSPARK",
    source=models.FileDataSource(
        path="/Volumes/dev/sources/landing/tables/brz_stock_prices/",
        format="JSON",
    ),
    sinks=[
        models.UnityCatalogDataSink(
            schema_name="finance",
            table_name="slv_stock_prices",
        )
    ],
    transformer=models.DataFrameTransformer(
        nodes=[
            {
                "expr": """
                SELECT
                  data.created_at AS created_at,
                  data.symbol AS symbol,
                  data.open AS open,
                  data.close AS close,
                  data.high AS high,
                  data.low AS low,
                  data.volume AS volume
                FROM
                  {df}
                """
            },
            {
                "func_name": "drop_duplicates",
                "func_kwargs": {"subset": ["symbol", "created_at"]},
            },
        ]
    ),
)
Debug
Execute your pipeline from your IDE using Python and a local or remote Spark session.
import yaml

from databricks.connect import DatabricksSession

from laktory import models
from laktory import register_spark_session

# Load the pipeline node declared in YAML
with open("pipeline_node.yaml") as fp:
    node = models.PipelineNode.model_validate(yaml.safe_load(fp))

# Register a (local or remote) Spark session for Laktory to use
spark = DatabricksSession.builder.getOrCreate()
register_spark_session(spark)

node.execute()
node.output_df.laktory.display()
node.output_df.to_native().show()
from databricks.connect import DatabricksSession

from laktory import register_spark_session

# Register a (local or remote) Spark session for Laktory to use
spark = DatabricksSession.builder.getOrCreate()
register_spark_session(spark)

# Pipeline node defined in Python (see the Declare section above)
node = ...

node.execute()
node.output_df.laktory.display()
node.output_df.to_native().show()
[laktory] Executing pipeline node slv_stock_prices
[laktory] Reading `PipelineNodeDataSource` brz_stock_prices with DataFrameBackends.PYSPARK
[laktory] Reading pipeline node brz_stock_prices from output DataFrame
[laktory] Read completed.
[laktory] Executing DataFrame Transformer
[laktory] Executing DataFrame transformer node 0 (DataFrameExpr).
[laktory] DataFrame as
SELECT
CAST(data.created_at AS TIMESTAMP) AS created_at,
data.symbol AS name,
data.symbol AS symbol,
data.open AS open,
data.close AS close,
data.high AS high,
data.low AS low,
data.volume AS volume
FROM
{df}
[laktory] Executing DataFrame transformer node 1 (DataFrameMethod).
[laktory] Applying df.unique(subset=['symbol', 'created_at'],keep=any)
[laktory] Writing static df to ./data/slv_stock_prices.parquet with format 'PARQUET' and {}
[laktory] Write completed.
┌─────────────────────┬───────┬────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ created_at          ┆ name  ┆ symbol ┆ open       ┆ close      ┆ high       ┆ low        ┆ volume     │
│ ---                 ┆ ---   ┆ ---    ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ str   ┆ str    ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
╞═════════════════════╪═══════╪════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 2023-04-04 13:30:00 ┆ AMZN  ┆ AMZN   ┆ 103.175003 ┆ 103.129997 ┆ 103.419998 ┆ 103.072502 ┆ 3.998703e6 │
│ 2023-04-06 13:30:00 ┆ MSFT  ┆ MSFT   ┆ 288.859985 ┆ 290.360107 ┆ 291.070007 ┆ 288.700012 ┆ 4.355298e6 │
│ 2023-03-02 12:30:00 ┆ AMZN  ┆ AMZN   ┆ 90.589996  ┆ 90.860001  ┆ 91.004997  ┆ 90.529999  ┆ 4.143251e6 │
│ 2023-10-03 11:30:00 ┆ AMZN  ┆ AMZN   ┆ 125.150002 ┆ 125.580002 ┆ 125.699997 ┆ 124.800003 ┆ 5.416032e6 │
│ 2024-03-05 12:30:00 ┆ AAPL  ┆ AAPL   ┆ 170.785004 ┆ 170.470001 ┆ 170.828705 ┆ 170.149994 ┆ 4.315788e6 │
│ …                   ┆ …     ┆ …      ┆ …          ┆ …          ┆ …          ┆ …          ┆ …          │
│ 2023-11-30 13:30:00 ┆ AAPL  ┆ AAPL   ┆ 188.335007 ┆ 188.789993 ┆ 188.830002 ┆ 188.289993 ┆ 2.75059e6  │
│ 2023-03-22 12:30:00 ┆ MSFT  ┆ MSFT   ┆ 275.850006 ┆ 277.049988 ┆ 277.149994 ┆ 275.764008 ┆ 1.790997e6 │
│ 2023-01-12 09:30:00 ┆ GOOGL ┆ GOOGL  ┆ 91.480003  ┆ 91.510002  ┆ 91.870003  ┆ 89.75      ┆ 7.320764e6 │
│ 2023-07-20 12:30:00 ┆ GOOGL ┆ GOOGL  ┆ 120.459999 ┆ 120.059998 ┆ 120.470001 ┆ 119.764999 ┆ 2.94552e6  │
│ 2023-08-22 10:30:00 ┆ AAPL  ┆ AAPL   ┆ 176.922897 ┆ 177.214996 ┆ 177.429993 ┆ 176.310104 ┆ 6.823573e6 │
└─────────────────────┴───────┴────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Deploy
Preview, deploy, and run your stack using the CLI:
laktory deploy --env dev
laktory run --env dev --dbks-job slv_stock_prices
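Planned infrastructure changes can also be reviewed before deploying; the command below assumes the CLI exposes a preview step for the selected environment.
laktory preview --env dev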

Watch it
See it in action in this introduction video:
Who is it for?
Laktory is built by data engineers for data engineers. It does not have a visual interface, nor does it let you set up pipelines with a few clicks and some drag and drop. On the other hand, with minimal DevOps or programming experience, you will be able to scale efficiently, deploying hundreds of datasets and models without compromising data governance. It also provides you with the tools for building a UI within your organization for a fully self-served experience.
Okube
Laktory is one of the kubes developed by Okube, a company dedicated to building open-source frameworks, empowering businesses to build, deploy, and operate highly scalable data platforms and AI models.
Partners
Okube and Laktory are official Databricks Technology Partners.
Learn more in our Partners section.