Delta Live Tables (DLT)

saurav kumar
4 min read · Jun 30, 2023


Databricks DLT

Delta Live Tables are materialized views for the Lakehouse. With Delta Live Tables, we can easily define end-to-end data pipelines in SQL or Python.

Delta Live Tables: Python vs SQL

Now let's see how we can define a Delta Live Table in Python and what its SQL equivalent looks like:

Python code:

The @dlt.create_table decorator (an older alias of @dlt.table) tells Delta Live Tables to create a table that contains the result of the DataFrame returned by the decorated function.

Python code for Delta Live Table
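A minimal sketch of such a definition, assuming an illustrative table name, a JSON source format, and a placeholder input path (the original screenshot may have used different names):

import dlt

@dlt.table(
    comment="Raw taxi trips (illustrative example)."
)
def taxi_raw():
    # The DataFrame returned by this function becomes the contents of the live table.
    # `spark` is provided by the DLT pipeline runtime.
    return spark.read.format("json").load("/data/raw/taxi/")  # placeholder input path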

And its SQL equivalent is:

SQL code for Delta Live Table
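A rough SQL sketch under the same illustrative assumptions (table name and path are placeholders):

CREATE OR REFRESH LIVE TABLE taxi_raw
COMMENT "Raw taxi trips (illustrative example)."
AS SELECT * FROM json.`/data/raw/taxi/`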

Let us see how we can manage data quality with Delta Live Tables:

We use expectations to define data quality constraints on the contents of a dataset. Expectations allow you to guarantee data arriving in tables meets data quality requirements and provide insights into data quality for each pipeline update. We apply expectations to queries using Python decorators or SQL constraint clauses.

Delta Live Tables (DLT) supports three types of expectations for handling bad data in a DLT pipeline:

  • Retain invalid records: warn and keep the violating rows, recording the violation in the pipeline metrics.
  • Drop invalid records: remove the violating rows before they are written to the target.
  • Fail on invalid records: stop the pipeline update as soon as a record violates the constraint.
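In Python these map to the @dlt.expect, @dlt.expect_or_drop, and @dlt.expect_or_fail decorators; a minimal sketch, assuming illustrative column names and constraints:

import dlt

@dlt.table(comment="Cleaned taxi trips with quality checks (illustrative example).")
@dlt.expect("valid_distance", "trip_distance > 0")            # warn: keep violating rows, record the metric
@dlt.expect_or_drop("valid_fare", "fare_amount IS NOT NULL")  # drop: remove violating rows from the table
@dlt.expect_or_fail("valid_id", "trip_id IS NOT NULL")        # fail: stop the pipeline update on violation
def taxi_clean():
    return dlt.read("taxi_raw")

In SQL, the same checks are written as CONSTRAINT ... EXPECT (...) clauses, optionally with ON VIOLATION DROP ROW or ON VIOLATION FAIL UPDATE.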

Let us see how we use Delta Live Tables (DLT):

Delta Live Table Pipeline Modes (Development and Production):

In development mode the pipeline reuses a long-running cluster and skips automatic retries so you can iterate quickly, while in production mode it spins up a fresh cluster for each update and automatically retries on recoverable failures.

Let us see what the Pipeline UI looks like:

DLT Pipeline UI

Auto Loader with Delta Live Tables:

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.

We can use Auto Loader to process billions of files to migrate or backfill a table.

Auto Loader scales to support near real-time ingestion of millions of files per hour.

How does Auto Loader track ingestion progress?

  • As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline.
  • This key-value store ensures that data is processed exactly once.
  • In case of failure, Auto Loader can resume from where it left off using the information stored in the checkpoint location, and it continues to provide exactly-once guarantees when writing data into Delta Lake.
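For contrast, in a plain Structured Streaming job outside of DLT the checkpoint location is specified explicitly; a rough sketch, assuming a JSON source and placeholder paths:

# Illustrative standalone Auto Loader stream; all paths are placeholders.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # where the inferred schema is tracked
        .load("/data/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/tmp/checkpoints/events")  # file-discovery state (RocksDB) lives here
   .trigger(availableNow=True)
   .toTable("events_bronze"))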

Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows us to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with:

  1. Autoscaling compute infrastructure for cost savings.
  2. Data quality checks with expectations.
  3. Automatic schema evolution handling.
  4. Monitoring via metrics in the event log.

NOTE:

In Delta Live Tables you do not need to provide a schema or checkpoint location, because Delta Live Tables (DLT) automatically manages these settings for your pipelines.

Auto Loader syntax for DLT

In Python

The @dlt.table decorator tells Delta Live Tables to create a table that contains the result of a DataFrame returned by a function.
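A minimal sketch, assuming a JSON source and a placeholder landing path:

import dlt

@dlt.table(comment="Bronze events ingested with Auto Loader (illustrative example).")
def events_bronze():
    # No explicit schema or checkpointLocation: DLT manages both for the pipeline.
    return (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .load("/data/landing/events"))  # placeholder input path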

Or in SQL you can write:
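A rough SQL sketch under the same assumptions:

CREATE OR REFRESH STREAMING LIVE TABLE events_bronze
COMMENT "Bronze events ingested with Auto Loader (illustrative example)."
AS SELECT * FROM cloud_files("/data/landing/events", "json")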

DLT automatically optimizes data for performance and ease of use.

My LinkedIn profile: https://www.linkedin.com/in/saurav-kumar-919a70109/

Happy Learning !!!

Sources:

  1. https://www.databricks.com/product/delta-live-tables
  2. Databricks Academy self-paced learning

Written by saurav kumar

Associate Data Scientist @Yash Technologies, Databricks Certified Data Engineer Associate, Microsoft Certified Data Scientist Associate
