Spark Extension
API Documentation
Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It provides a fast, general-purpose cluster-computing framework and supports several programming languages, including Scala, Java, Python, and R.
To facilitate the transformation of your data, Laktory extends Spark's native functions and DataFrame methods through a laktory namespace.
Functions
The first extension is a library of functions for building columns from other columns or constants.
import laktory  # noqa: F401
import pyspark.sql.functions as F

df = spark.createDataFrame([{"x": 1}, {"x": 2}, {"x": 3}])

# Build a new column by converting "x" from feet to meters
df = df.withColumn("y", F.laktory.convert_units("x", "ft", "m"))
convert_units
is a Laktory-specific function and is available because of the import laktory
statement. All other custom functions are also available from the pyspark.sql.functions.laktory
namespace.
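For comparison, the same conversion written in plain PySpark reduces to a manual multiplication by the foot-to-meter factor. This is a minimal sketch of the equivalent logic, not Laktory's implementation:

import pyspark.sql.functions as F

# 1 ft = 0.3048 m, so a feet-to-meters conversion is a constant scaling
df = df.withColumn("y_manual", F.col("x") * 0.3048)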
DataFrame methods
In this case, the methods are designed to be applied directly to a Spark DataFrame.
import laktory  # noqa: F401

df = spark.createDataFrame([{"x": 1}, {"x": 2}, {"x": 3}])

# Returns True, since the DataFrame has a column named "x"
df.laktory.has_column("x")
Laktory monkey patches Spark's DataFrame class by assigning all the custom methods under the laktory
namespace at runtime.
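This is similar to the accessor pattern used by pandas extensions. A minimal sketch of how such a runtime patch could work is shown below; it is illustrative only, and the real has_column likely handles more cases (such as nested fields):

from pyspark.sql import DataFrame

class _LaktoryAccessor:
    """Illustrative accessor grouping custom methods under df.laktory."""

    def __init__(self, df):
        self._df = df

    def has_column(self, name):
        # Simplified check, shown only to demonstrate the pattern
        return name in self._df.columns

# Assigning a property on the DataFrame class at import time makes
# df.laktory available on every DataFrame instance.
DataFrame.laktory = property(lambda self: _LaktoryAccessor(self))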
Some methods of interest are:
laktory.spark.dataframe.groupby_and_agg
: Apply a groupby and create aggregation columns.

laktory.spark.dataframe.smart_join
: Join tables, clean up duplicated columns, and support watermarking for streaming joins. This is the recommended Spark function for all joins defined in pipelines.

laktory.spark.dataframe.window_filter
: Apply Spark window-based filtering.
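To illustrate the kind of operation window_filter performs, here is a window-based filter written in plain PySpark that keeps the most recent row per key. This is a conceptual sketch of the technique, not Laktory's API:

import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame(
    [
        {"id": 1, "ts": 1},
        {"id": 1, "ts": 2},
        {"id": 2, "ts": 1},
    ]
)

# Rank rows within each id partition, most recent timestamp first,
# then keep only the top-ranked row of each partition.
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
df_latest = (
    df.withColumn("_row", F.row_number().over(w))
    .filter(F.col("_row") == 1)
    .drop("_row")
)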