Utilising Delta Lake and Azure Synapse to deliver same-day analytics for data exploration

In this article, I describe how we reduced data delivery times from weeks to the same day.

Vinny Paluch
4 min read · Nov 1, 2022

Currently, the company offers three tools for the retail market. Those applications consume data from individual client data warehouses based on the legacy SQL Server architecture. The architecture is over ten years old and has both vertical and horizontal scalability issues.

In order to provide business consultants with access to new client data or data from a new sub-brand, the following must be considered:

  • Usually, the client sends the data to be imported via SFTP, but it can also be done via a direct database connection using a B2B VPN.
  • An engineer must be allocated, most often transferring time from another project or customer activity based on priority and availability.
  • Provided data sets may contain transactional and dimensional data.
  • Volumes can range from 300M to 1B rows per year, depending on client size.
  • Data consultants can only access the newly added data through the existing data warehouse solution.
  • The data must be prepared in the same manner as any other active client (with a few exceptions).
  • Because the current DW model uses SURROGATE KEYS, an extremely expensive lookup process is needed to resolve FOREIGN KEYS.
  • As part of the import process, POLYBASE requires a specific SQL SERVER FORMAT FILE for each type of file. Creating this document is a very time-consuming and error-prone process.
  • Resources such as VMs, databases, credentials and storage have to be allocated.

After those considerations and actions, delivering the new dataset for data analysis can take several weeks, consume many data-engineering hours and require significant computing power.

The Implemented Solution

High-level implementation diagram.

We have implemented a new import process which runs in parallel with the existing/legacy framework. This approach aims to maintain compatibility for existing clients as we migrate from a SQL Server based data warehouse to a Data Lake architecture.

We use a main pipeline and sub-pipelines to leverage reusability through parameterisation.

Here’s an overview of the full process:

a) Files are dropped in the SFTP folder.

b) A data pipeline (Azure ADF or Synapse Analytics) is used to move data from the SFTP into a Data Lake Gen2 under a `/raw` container.

  • When running in event-driven mode, each new file dropped by Airflow into the `/dropfolder` automatically starts a pipeline run.

c) The pipeline extracts metadata from the file name, such as table name, year, month, date and file extension.

d) The file is copied without changes to: /raw/originals/TableName/Year/Month/Date/filename_file_id.txt

  • A GUID matching the pipeline run ID is appended to the filename. This is used for data lineage and stays with the data throughout the process (see the sketch after this step).
  • If the file is compressed, it is uncompressed at this stage.
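For illustration, here is a minimal sketch of the kind of filename parsing and path construction behind steps (c) and (d). In the real pipeline this is handled by ADF/Synapse expressions; the `TableName_YYYYMMDD.<ext>` naming convention, the function names and the example GUID below are assumptions.

```python
# Illustrative sketch of steps (c) and (d): the real pipeline uses ADF/Synapse
# expressions, and the TableName_YYYYMMDD.<ext> naming convention is an assumption.
import os
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<table>[A-Za-z0-9]+)_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})\.(?P<ext>\w+)$"
)


def parse_drop_filename(filename: str) -> dict:
    """Extract table name, year, month, day and extension from the file name."""
    match = FILENAME_PATTERN.match(os.path.basename(filename))
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match.groupdict()


def raw_original_path(filename: str, run_id: str) -> str:
    """Build the /raw/originals path, appending the pipeline run ID (GUID) for lineage."""
    meta = parse_drop_filename(filename)
    stem, _ = os.path.splitext(os.path.basename(filename))
    return (
        f"/raw/originals/{meta['table']}/{meta['year']}/{meta['month']}/{meta['day']}/"
        f"{stem}_{run_id}.txt"
    )


# raw_original_path("Sales_20221101.csv", "1f2e3d4c-...")
# -> "/raw/originals/Sales/2022/11/01/Sales_20221101_1f2e3d4c-....txt"
```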

e) A Databricks notebook is then used to “normalize” the file. Normalisation consists of converting files to the same code page (UTF-8), replacing multi-byte delimiters with a pipe, and adding metadata to each row, including a HASH column (SHA-256) used for data deduplication. A simplified sketch follows the output path below.

  • The resultant file is saved to:

/raw/normalized/TableName/Year/Month/Date/filename_file_id.csv
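A simplified, single-file sketch of what this normalisation might look like. The source code page, the multi-byte delimiter and the function name are assumptions; the real notebook is parameterised per client and file type.

```python
# Simplified sketch of the normalisation in step (e); constants below are assumptions.
import hashlib
from datetime import datetime, timezone

SOURCE_ENCODING = "cp1252"   # assumed client code page
SOURCE_DELIMITER = "~|~"     # assumed multi-byte delimiter
TARGET_DELIMITER = "|"


def normalize_file(src_path: str, dst_path: str, file_id: str) -> None:
    """Re-encode to UTF-8, switch to a pipe delimiter and append metadata columns."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    with open(src_path, "r", encoding=SOURCE_ENCODING) as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            fields = line.rstrip("\r\n").split(SOURCE_DELIMITER)
            # SHA-256 over the row content, later used for deduplication
            row_hash = hashlib.sha256(
                TARGET_DELIMITER.join(fields).encode("utf-8")
            ).hexdigest()
            dst.write(TARGET_DELIMITER.join(fields + [file_id, loaded_at, row_hash]) + "\n")
```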

f) Next, the normalized file is appended to a Delta table. This table is classified as the Bronze layer, since no data has been changed from the original dataset other than the addition of the new metadata columns. Data types are inferred.
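A minimal PySpark sketch of this append, assuming an illustrative normalized path and target table name (and that the `bronze` schema already exists):

```python
# Minimal sketch of the Bronze append in step (f); path and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")   # data types are inferred, as described above
    .csv("/raw/normalized/Sales/2022/11/01/Sales_20221101_<file_id>.csv")
)

# Append-only: Bronze keeps the original data plus the added metadata columns.
df.write.format("delta").mode("append").saveAsTable("bronze.sales")
```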

g) As a final step, a Synapse serverless database is created (if it does not already exist). A SQL Script activity is also used to create everything required to connect a view to an external data source. Users query the data through this view.
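In the pipeline this T-SQL is issued by a SQL Script activity; the sketch below wraps roughly equivalent statements in `pyodbc` purely for illustration, and the endpoint, database, data source, container and view names are all placeholders. Depending on the authentication set-up, a database scoped credential (e.g. SAS) may also be needed.

```python
# Illustrative only: in the pipeline this T-SQL runs in a Synapse SQL Script
# activity; pyodbc and all names below are placeholders/assumptions.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"   # serverless SQL endpoint
    "Database=client_db;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cursor = conn.cursor()

# External data source pointing at the client's Data Lake container.
exists = cursor.execute(
    "SELECT 1 FROM sys.external_data_sources WHERE name = 'ClientDataLake'"
).fetchone()
if exists is None:
    cursor.execute(
        "CREATE EXTERNAL DATA SOURCE ClientDataLake "
        "WITH (LOCATION = 'https://<storageaccount>.dfs.core.windows.net/<container>');"
    )

# View over the Bronze Delta table that business consultants will query.
cursor.execute("""
CREATE OR ALTER VIEW dbo.vw_sales AS
SELECT *
FROM OPENROWSET(
    BULK 'bronze/sales/',
    DATA_SOURCE = 'ClientDataLake',
    FORMAT = 'DELTA'
) AS rows;
""")
```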

Bulk Execution vs Event-Driven

Batch and event-driven data loads differ slightly in their processing. In bulk operations, we run step (d) individually for each file in the batch in order to achieve better performance.

As the code for the first notebook does not scale for parallel execution, we run it sequentially. However, it still processes thousands of files in a few minutes.

Finally, we run the ‘Bronze’ table feed notebook in a loop, sending 10* files per run. The Bronze Delta table is built by reading these files into a Spark dataframe and appending them to the table (see the sketch after the note below).

*Note: The number of files per run is parameterised. The parallel execution capacity depends on input file size and the cluster nodes’ memory. In the future, we can automate this parameter.
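A rough sketch of that loop, assuming illustrative paths, a hypothetical target table and the Databricks `dbutils` notebook utility:

```python
# Rough sketch of the bulk "Bronze feed" loop; paths, table name and batch size
# are illustrative, and dbutils is the Databricks notebook utility (a global).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

FILES_PER_RUN = 10   # parameterised: depends on input file size and cluster memory


def chunks(items, size):
    """Yield successive batches of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


normalized_files = [f.path for f in dbutils.fs.ls("/raw/normalized/Sales/2022/11/01/")]

for batch in chunks(normalized_files, FILES_PER_RUN):
    df = (
        spark.read
        .option("header", "true")
        .option("delimiter", "|")
        .option("inferSchema", "true")
        .csv(batch)   # the CSV reader accepts a list of paths
    )
    df.write.format("delta").mode("append").saveAsTable("bronze.sales")
```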

Other automations

Each client has its own Storage account, Synapse and Databricks Workspaces, Key Vaults, etc., to ensure total data isolation. The platform deployment process was standardised using Terraform.

Connections between Databricks, Synapse and the storage accounts use managed credentials whenever possible, but some limitations require the use of SAS credentials when users are unable to log in to Synapse using AAD authentication.

Conclusion

As of the time I left the project, the event-driven process was in production, and the bulk load was being implemented for a customer with more than 1 billion rows over a two-year period.

Our architecture enables business consultants to query data and perform exploratory analytics within a few days, instead of the weeks it used to take.

By automating the creation of views, permissions, data sources, and connections in Azure Synapse Analytics, data access can be made easier for users.

With Databricks notebook ETLs, file manipulation and data transformations can be accomplished more effectively and efficiently than with stored procedures alone.


Vinny Paluch

Expert in the use of Microsoft’s BI technology stack and Business Intelligence projects with more than 20 years of experience