FileDataSource

laktory.models.datasources.FileDataSource

Bases: BaseDataSource

Data source reading files from disk, such as data events (JSON/CSV) and DataFrame exports (Parquet). It is generally used in the context of a data pipeline.

ATTRIBUTES

format
    Format of the data files.
    TYPE: Literal['AVRO', 'BINARYFILE', 'CSV', 'DELTA', 'EXCEL', 'JSON', 'JSONL', 'NDJSON', 'ORC', 'PARQUET', 'TEXT', 'XML']

read_options
    Additional options passed to spark.read.options.
    TYPE: dict[str, Any]

schema
    Target schema, specified as a list of columns, as a dict, or as a JSON serialization. Only used when reading data from non-strongly-typed files such as JSON or CSV files.

schema_location
    Path used for schema inference when reading data as a stream. If None, the parent directory of path is used.
    TYPE: str

Examples:

from laktory import models

source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=False,
)
# df = source.read(spark)

# With Explicit Schema
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=False,
    schema=[
        {"name": "description", "type": "string", "nullable": True},
        {"name": "close", "type": "double", "nullable": False},
    ],
)
# df = source.read(spark)
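
The attributes above can also be combined for streaming reads or for formats that need extra reader options. The sketch below is illustrative rather than taken from the laktory documentation: the schema_location path and the CSV read_options values are assumptions, and the option keys are standard Spark CSV reader options.

# As a stream, with an explicit schema inference location
# (the schema_location path below is a hypothetical example)
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=True,
    schema_location="/Volumes/sources/landing/events/yahoo-finance/",
)
# df = source.read(spark)

# CSV files, with additional options forwarded to spark.read.options
# (the option values shown are assumptions for illustration)
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="CSV",
    as_stream=False,
    read_options={"header": "true", "sep": ","},
)
# df = source.read(spark)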