Synapse — Same-day analytics Data Exploration — Implementation Guide

Implementation Guide — Part 1

Vinny Paluch
3 min read · Nov 15, 2022

Following the positive feedback on my previous post, I have decided to create a series describing how we implemented the solution in greater detail.

Please see my previous post:
Utilising Delta Lake and Azure Synapse to deliver same-day analytics for data exploration | by Vinny Paluch | Nov, 2022 | Medium

In this post, we will focus on the “event-driven” aspects of the implementation.

When a new file is created

Choosing the best service to receive the client’s data

My first step was to look for an alternative to the customer’s SFTP service: I evaluated several options for a service that could begin processing the files as soon as they arrive.

Solution Evaluation Matrix

After weighing all the considerations, our team decided to continue using the corporate SFTP server and simply add an intercept step to the existing Airflow DAG that copies the files before the legacy process runs.
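
To illustrate the idea, here is a minimal, hypothetical sketch of such an intercept task in Airflow 2.x, assuming the client’s files are staged on a local path and pushed to ADLS Gen2 with the `azure-storage-file-datalake` SDK; the DAG, task, path, and environment-variable names are all made up for illustration.

```python
# Hypothetical intercept task added to the existing Airflow DAG: copy newly
# arrived SFTP files into the data lake drop folder before the legacy process runs.
import glob
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from azure.storage.filedatalake import DataLakeServiceClient

STAGING_DIR = "/data/sftp/incoming"   # hypothetical local folder fed by the SFTP server
FILESYSTEM = "raw"                    # ADLS Gen2 container
DROP_FOLDER = "dropfolder"            # folder watched by the storage event trigger


def copy_to_dropfolder(**_):
    """Push every staged file to /raw/dropfolder in the data lake."""
    service = DataLakeServiceClient.from_connection_string(
        os.environ["ADLS_CONNECTION_STRING"]
    )
    filesystem = service.get_file_system_client(FILESYSTEM)
    for path in glob.glob(os.path.join(STAGING_DIR, "*")):
        file_client = filesystem.get_file_client(
            f"{DROP_FOLDER}/{os.path.basename(path)}"
        )
        with open(path, "rb") as data:
            file_client.upload_data(data, overwrite=True)


with DAG(
    dag_id="client_sftp_ingest",
    start_date=datetime(2022, 11, 1),
    schedule_interval="*/15 * * * *",  # still schedule-driven, not event-driven
    catchup=False,
) as dag:
    intercept = PythonOperator(
        task_id="copy_to_dropfolder",
        python_callable=copy_to_dropfolder,
    )
    # In the real DAG this task would be wired upstream of the legacy step:
    # intercept >> legacy_process
```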

Admittedly, Airflow’s scheduled DAGs are not event-driven; however, at that point no other alternative could be implemented.

Note: new clients who are not tied to the legacy solution will be able to use the “AzCopy” approach instead of SFTP, which supports full event-driven capabilities.
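
For such clients, the upload could be as simple as an AzCopy call into the drop folder. The sketch below wraps the CLI from Python; the account and file names are placeholders, and it assumes AzCopy is installed and already authenticated (via `azcopy login` or a SAS token).

```python
# Hypothetical AzCopy-based upload for new clients: writing straight into the
# drop folder raises a BlobCreated event, so processing can start immediately.
import subprocess

source_file = "sales_2022-11-15.csv"  # placeholder local file
destination = "https://mydatalake.dfs.core.windows.net/raw/dropfolder/"  # placeholder account

# Requires azcopy on PATH and prior authentication (azcopy login or a SAS token).
subprocess.run(["azcopy", "copy", source_file, destination], check=True)
```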

Azure SFTP was dismissed early due to its lack of required features such as resumable uploads and Event Grid support.

Azure Data Factory or Synapse Pipelines?

During our implementation I found several differences between Azure Data Factory and Synapse Pipelines. Based on these differences, I re-evaluated the design and ultimately decided that, for data pipelines, Azure Data Factory would be a better fit for our environment than Synapse.

In this demo scenario it will not make any difference, but you should be aware that the feature sets of the two services differ, and ADF is the more mature of the two.

Azure Data Factory vs Azure Synapse Decision Matrix

We chose ADF over Azure Synapse Pipelines because of the following features:

  • Integration runtime sharing (ADF only)
  • Global parameters (ADF only)
  • Power Query activity (ADF only)

The Event-driven Part

The event-driven part begins when a file is dropped into the Data Lake. For this purpose, we created a dedicated directory, `/raw/dropfolder`.
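
As a quick preview of item (a) below, a storage event trigger watching that folder could be created programmatically. The following is a minimal sketch using the `azure-mgmt-datafactory` SDK; the subscription, resource group, factory, storage account, pipeline, and trigger names are all hypothetical.

```python
# Minimal sketch: an ADF storage event trigger that fires whenever a blob is
# created under /raw/dropfolder. All names below are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-analytics"          # hypothetical
FACTORY_NAME = "adf-sameday-analytics"   # hypothetical
STORAGE_ACCOUNT_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

trigger = BlobEventsTrigger(
    scope=STORAGE_ACCOUNT_ID,
    events=["Microsoft.Storage.BlobCreated"],
    # Blob event paths follow the /<container>/blobs/<folder>/ convention.
    blob_path_begins_with="/raw/blobs/dropfolder/",
    ignore_empty_blobs=True,
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference",
                reference_name="pl_ingest_dropfolder",
            ),
            parameters={"fileName": "@triggerBody().fileName"},
        )
    ],
)

adf.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "trg_dropfolder_created",
    TriggerResource(properties=trigger),
)
# The trigger still has to be started (which provisions the Event Grid
# subscription) before it fires, e.g. with adf.triggers.begin_start(...).
```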

The implementation is built from the following pieces, each of which will be covered in this series:

a) Storage Event-Driven Pipelines on ADF/Synapse

b) Fully Parameterised Connections in ADF/Synapse

c) A Generic Databricks Notebook to convert RAW data into Delta Tables

d) Automating Synapse Serverless object creation for Exploratory Analytics

e) Source-Aligned Domain Data Products — an early stage of a Data Mesh implementation

Next steps

This post is part of a series describing the implementation we deployed in a previous project. In the following posts I will provide enough implementation detail for you to reproduce the scenario in your own environment.

I hope you find it useful.

