Spark Extension
API Documentation
Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It provides a fast, general-purpose cluster-computing framework and supports several programming languages, including Scala, Java, Python, and R.
To facilitate the transformation of your data, Laktory extends Spark's native functions and DataFrame methods through a laktory namespace.
Functions
The first extension is a library of functions for building columns from other columns or constants.
import laktory  # noqa: F401
import pyspark.sql.functions as F

df = spark.createDataFrame([{"x": 1}, {"x": 2}, {"x": 3}])

# Build a new column by converting "x" from feet to meters
df = df.withColumn("y", F.laktory.convert_units("x", "ft", "m"))
convert_units
is a Laktory-specific function and is available because of the import laktory
statement. All other custom functions are also available from the pyspark.sql.functions.laktory
namespace.
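For comparison, the same conversion written in plain PySpark reduces to a manual multiplication by the foot-to-meter factor. This is a minimal sketch of the equivalent logic, not Laktory's implementation:

import pyspark.sql.functions as F

# 1 ft = 0.3048 m, so a feet-to-meters conversion is a constant scaling
df = df.withColumn("y_manual", F.col("x") * 0.3048)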
DataFrame methods
In this case, the methods are designed to be applied directly to a Spark DataFrame.
import laktory  # noqa: F401

df = spark.createDataFrame([{"x": 1}, {"x": 2}, {"x": 3}])

# Returns True, since the DataFrame has a column named "x"
df.laktory.has_column("x")
Laktory monkey patches Spark's DataFrame class by assigning all the custom methods under the laktory
namespace at runtime.
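This is similar to the accessor pattern used by pandas extensions. A minimal sketch of how such a runtime patch could work is shown below; it is illustrative only, and the real has_column likely handles more cases (such as nested fields):

from pyspark.sql import DataFrame

class _LaktoryAccessor:
    """Illustrative accessor grouping custom methods under df.laktory."""

    def __init__(self, df):
        self._df = df

    def has_column(self, name):
        # Simplified check, shown only to demonstrate the pattern
        return name in self._df.columns

# Assigning a property on the DataFrame class at import time makes
# df.laktory available on every DataFrame instance.
DataFrame.laktory = property(lambda self: _LaktoryAccessor(self))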
Some methods of interest are:
laktory.spark.dataframe.groupby_and_agg
: Apply a groupby and create aggregation columns.

laktory.spark.dataframe.smart_join
: Join tables, clean up duplicated columns, and support watermarking for streaming joins. This is the recommended Spark function for all joins defined in pipelines.

laktory.spark.dataframe.window_filter
: Apply Spark window-based filtering.
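To illustrate the kind of operation window_filter performs, here is a window-based filter written in plain PySpark that keeps the most recent row per key. This is a conceptual sketch of the technique, not Laktory's API:

import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame(
    [
        {"id": 1, "ts": 1},
        {"id": 1, "ts": 2},
        {"id": 2, "ts": 1},
    ]
)

# Rank rows within each id partition, most recent timestamp first,
# then keep only the top-ranked row of each partition.
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
df_latest = (
    df.withColumn("_row", F.row_number().over(w))
    .filter(F.col("_row") == 1)
    .drop("_row")
)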