FileDataSource

laktory.models.datasources.FileDataSource

Bases: BaseDataSource

Data source reading files from disk, such as data events (JSON/CSV) and DataFrame exports (Parquet). It is generally used in the context of a data pipeline.

ATTRIBUTES

format
    Format of the data files.
    TYPE: Literal['AVRO', 'BINARYFILE', 'CSV', 'DELTA', 'EXCEL', 'JSON', 'JSONL', 'NDJSON', 'ORC', 'PARQUET', 'TEXT', 'XML']

read_options
    Additional options passed to spark.read.options.
    TYPE: dict[str, Any]

schema
    Target schema, specified as a list of columns, as a dict, or as a JSON serialization. Only used when reading data from non-strongly-typed files such as JSON or CSV files.

schema_location
    Path used for schema inference when reading data as a stream. If None, the parent directory of path is used.
    TYPE: str

Examples:

from laktory import models

source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=False,
)
# df = source.read(spark)

# With Explicit Schema
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=False,
    schema=[
        {"name": "description", "type": "string", "nullable": True},
        {"name": "close", "type": "double", "nullable": False},
    ],
)
# df = source.read(spark)
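
The attributes above can also be combined for streaming reads or for formats that need extra reader options. The sketch below is illustrative rather than taken from the laktory documentation: the schema_location path and the CSV read_options values are assumptions, and the option keys are standard Spark CSV reader options.

# As a stream, with an explicit schema inference location
# (the schema_location path below is a hypothetical example)
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="JSON",
    as_stream=True,
    schema_location="/Volumes/sources/landing/events/yahoo-finance/",
)
# df = source.read(spark)

# CSV files, with additional options forwarded to spark.read.options
# (the option values shown are assumptions for illustration)
source = models.FileDataSource(
    path="/Volumes/sources/landing/events/yahoo-finance/stock_price",
    format="CSV",
    as_stream=False,
    read_options={"header": "true", "sep": ","},
)
# df = source.read(spark)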