
Making Data Engineering Declarative

2023
Y. Papakonstantinou, Michael Armbrust, A. Ghodsi


Abstract

ETL and ELT are required to prepare comprehensive, clean, and correct derived data that can fuel successful analytics and ML. Based on our observations from thousands of customers processing data in the cloud at Databricks, the preparation of derived data typically involves a complex DAG of transformations, which is split into two activities:

(a) Ingestion: At the sources of the DAG, raw data are fetched from streaming platforms, such as Apache Kafka™ and Amazon Kinesis, and from cloud storage that stages incoming data. This data typically resides in blob stores such as AWS S3. The majority of our customers store it in Delta Lake tables, the data format that enables transaction processing on data lakes [1,2].

(b) Downstream ELT: Multiple transformations clean, enrich, and aggregate the ingested data. The downstream ELT typically comprises many transformations, either expressed in popular frameworks such as DBT or implemented as individual jobs. In the former case, the transformations are typically written in SQL as CREATE-TABLE-AS (CTAS) statements. In all cases, the underlying processing engine sees a series of seemingly independent SQL queries. Delta
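To make the downstream-ELT pattern concrete, here is a minimal sketch of a two-step CTAS pipeline. An in-memory SQLite database stands in for the actual warehouse engine, and the table and column names (`raw_events`, `clean_events`, `daily_counts`) are illustrative only, not taken from the paper. The point is that each pipeline stage materializes the next table in the DAG, yet the engine receives each CTAS as an independent query:

```python
import sqlite3

# SQLite stands in for the processing engine; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INTEGER, event_day TEXT, payload TEXT);
    INSERT INTO raw_events VALUES
        (1, '2023-01-01', 'a'),
        (1, '2023-01-01', NULL),   -- dirty row, filtered out downstream
        (2, '2023-01-02', 'b');
""")

# Step 1 of the DAG: clean the ingested data via a CTAS statement.
conn.execute("""
    CREATE TABLE clean_events AS
    SELECT user_id, event_day, payload
    FROM raw_events
    WHERE payload IS NOT NULL
""")

# Step 2 of the DAG: aggregate the cleaned data via another CTAS statement.
conn.execute("""
    CREATE TABLE daily_counts AS
    SELECT event_day, COUNT(*) AS n_events
    FROM clean_events
    GROUP BY event_day
""")

print(conn.execute("SELECT * FROM daily_counts ORDER BY event_day").fetchall())
# → [('2023-01-01', 1), ('2023-01-02', 1)]
```

Note that nothing in the two CTAS statements tells the engine they form a dependency chain; that dependency is only implicit in the table names, which is precisely the situation the abstract describes.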