Synapse — Same-Day Analytics Data Exploration — Implementation Guide
Implementation Guide — Part 1
Following the positive feedback on my previous post, I have decided to create a series describing how we implemented the solution in greater detail.
Please see my previous post:
Utilising Delta Lake and Azure Synapse to deliver same-day analytics for data exploration | by Vinny Paluch | Nov, 2022 | Medium
When a new file is created
Choosing the best service to receive the client’s data
My first step was to evaluate alternatives to the customer’s SFTP service, looking for a service that could begin processing the files as soon as they arrive.
After weighing the options, our team decided to continue using the corporate SFTP server and simply add an intercept step to the existing Airflow DAG, copying the files before the legacy process runs.
Airflow DAGs are schedule-driven rather than event-driven; however, at that point no other alternative could be implemented.
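To make the intercept step concrete, here is a minimal sketch of how it could be added to the DAG. The DAG name, schedule, paths, storage account, and SAS token are assumptions for illustration, not the actual production values.

```python
# Hypothetical intercept step added to the existing Airflow DAG.
# All names, paths, and credentials below are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sftp_ingest_legacy",           # existing DAG (name assumed)
    start_date=datetime(2022, 11, 1),
    schedule_interval="*/15 * * * *",      # schedule-driven, not event-driven
    catchup=False,
) as dag:

    # New intercept step: copy the incoming files to the Data Lake drop folder
    # before the legacy process picks them up. AzCopy with a SAS token is one
    # option; any copy mechanism the environment already supports would do.
    copy_to_drop_folder = BashOperator(
        task_id="copy_to_drop_folder",
        bash_command=(
            "azcopy copy '/sftp/incoming/*' "
            "'https://<storage-account>.dfs.core.windows.net/raw/dropfolder?<sas-token>' "
            "--recursive"
        ),
    )

    # Placeholder for the existing legacy processing task.
    legacy_process = BashOperator(
        task_id="legacy_process",
        bash_command="echo 'legacy processing happens here'",
    )

    copy_to_drop_folder >> legacy_process
```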
Note: New clients who are not tied to the legacy solution will be able to use AzCopy instead of SFTP, which would support fully event-driven ingestion.
Azure SFTP was dismissed early on because it lacked required features such as resumable uploads and Event Grid support.
Azure Data Factory or Synapse Pipelines?
During our implementation I found several differences between Azure Data Factory and Synapse Pipelines. Based on those differences, I re-evaluated the design and ultimately decided that, for data pipelines, Azure Data Factory would be a better fit for our environment than Synapse.
In this demo scenario it will not make any difference, but you should know that the two services’ feature sets differ, and ADF is the more mature of the two.
Azure Data Factory vs Azure Synapse Decision Matrix
We chose ADF over Azure Synapse Pipelines because of the following features:
- Integration runtime sharing (ADF only)
- ADF global parameters (ADF only)
- Power Query Activity
The Event-driven Part
The event-driven flow starts when a file is dropped into the Data Lake. For this purpose, we created a dedicated directory, `/raw/dropfolder`.
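As one possible way to wire this up, the sketch below uses the azure-mgmt-datafactory Python SDK to register a blob-created storage event trigger scoped to that folder. The subscription, resource group, factory, storage account, pipeline name, and parameters are placeholders I have assumed for illustration; the same trigger can equally be created from the ADF Studio UI.

```python
# Hypothetical sketch: register an ADF storage event trigger that fires a
# pipeline whenever a blob lands under /raw/dropfolder. All resource names
# and IDs are placeholders, not the actual deployment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Fire on blob-created events; ADF expects the path filter in the form
# '/<container>/blobs/<folder>/'.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/raw/blobs/dropfolder/",
    ignore_empty_blobs=True,
    scope=(
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="pl_ingest_dropfolder"),
            parameters={"sourceFolder": "raw/dropfolder"},  # pipeline parameter (assumed)
        )
    ],
)

client.triggers.create_or_update(
    resource_group, factory_name, "trg_dropfolder_blob_created",
    TriggerResource(properties=trigger),
)
# The trigger still needs to be started (e.g. client.triggers.begin_start(...))
# before it begins firing.
```

Inside the triggered pipeline, the file that fired the event is available through the trigger outputs (for example `@triggerBody().fileName` and `@triggerBody().folderPath`).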
The building blocks of the solution are:
a) Storage Event-Driven Pipelines on ADF/Synapse
b) Fully Parameterised Connections in ADF/Synapse
c) A Generic Databricks Notebook to convert RAW data into Delta Tables (see the first sketch after this list)
d) Automating Synapse serverless object creation for Exploratory Analytics (see the second sketch after this list)
e) Source-Aligned Domain Data Products — an early stage of a Data Mesh implementation.
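For item c, here is a minimal sketch of what a generic, parameterised notebook could look like, assuming CSV input and a Delta table registered in the metastore. The widget names, paths, and bronze-layer naming are my own assumptions, not the actual notebook.

```python
# Sketch of a generic Databricks notebook: parameters come in as widgets,
# the RAW file is read, written out as Delta, and registered as a table.
# `dbutils` and `spark` are provided by the Databricks runtime.
dbutils.widgets.text("source_path", "abfss://raw@<storage-account>.dfs.core.windows.net/dropfolder/sales.csv")
dbutils.widgets.text("target_path", "abfss://delta@<storage-account>.dfs.core.windows.net/sales")
dbutils.widgets.text("target_table", "bronze.sales")

source_path = dbutils.widgets.get("source_path")
target_path = dbutils.widgets.get("target_path")
target_table = dbutils.widgets.get("target_table")

# Read the RAW file; schema inference keeps the notebook generic,
# at the cost of an extra pass over the data.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(source_path)
)

# Convert to Delta and register the table so it can be queried by name.
df.write.format("delta").mode("overwrite").save(target_path)
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql(f"CREATE TABLE IF NOT EXISTS {target_table} USING DELTA LOCATION '{target_path}'")
```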
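For item d, one way to automate the object creation is to generate and execute the view definitions from Python against the Synapse serverless SQL endpoint, using OPENROWSET with FORMAT = 'DELTA' to read the Delta tables in place. The endpoint, database, and object names below are illustrative assumptions.

```python
# Sketch: create a Synapse serverless view over a Delta table so it can be
# explored immediately. Endpoint, database, and object names are placeholders,
# and the target serverless database is assumed to already exist.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "DATABASE=exploration;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)

table_name = "sales"
delta_path = "https://<storage-account>.dfs.core.windows.net/delta/sales/"

# OPENROWSET with FORMAT = 'DELTA' lets serverless SQL read the Delta table
# in place, so no data is copied into the SQL pool.
create_view = f"""
CREATE OR ALTER VIEW dbo.{table_name} AS
SELECT *
FROM OPENROWSET(
    BULK '{delta_path}',
    FORMAT = 'DELTA'
) AS rows;
"""

conn.cursor().execute(create_view)
```

Looping this over the Delta tables produced by the notebook gives analysts a SQL surface over the lake without any data movement.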
Next steps
This post is the first in a series describing the implementation we deployed in a previous project. In the following posts I will provide enough detail for you to reproduce the scenario in your own environment.
I hope you find it useful.