Storage Event-Driven with ADF or Synapse Pipelines
Implementation Guide — Part 2
Parameterised Connections
To facilitate reuse, I try to parameterise my connections and pipelines as much as possible, including the connections used by the linked services. In addition, I use an Azure Key Vault to store the information required to connect to the storage accounts.
How Event Driven Pipelines Work
Everything begins with the Pipeline Trigger definition. When creating a trigger, you need to provide the following information (a JSON sketch follows this list):
Type: Storage Events
Storage Account and Container names
Path: a base path string, like “parentfolder/folder”
Event: Create and/or Delete
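Behind the designer UI, the trigger is stored as JSON. A minimal sketch of such a definition, assuming a container named raw and a dropfolder path (the trigger name and resource IDs below are placeholders, not values from this project):

{
  "name": "TRG_StorageEvents",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
      "blobPathBeginsWith": "/raw/blobs/dropfolder/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated", "Microsoft.Storage.BlobDeleted" ]
    },
    "pipelines": []
  }
}

Note that the path is prefixed with the container name and a literal blobs segment. The pipelines array is where the trigger is attached to a pipeline, which we will look at in the next section.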
File Information
When a new file is copied to or deleted from the Data Lake, an event is fired through Azure Event Grid. The ADF/Synapse trigger will receive the following parameters from Event Grid:
@triggerBody().folderPath
@triggerBody().fileName
We will get back to this later.
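To make these values available inside the pipeline, define two pipeline parameters (for example folderPath and fileName) and map the trigger outputs to them when attaching the trigger. A sketch of the pipelines section of the trigger definition, assuming a pipeline named Main with those two parameters:

"pipelines": [
  {
    "pipelineReference": { "referenceName": "Main", "type": "PipelineReference" },
    "parameters": {
      "folderPath": "@triggerBody().folderPath",
      "fileName": "@triggerBody().fileName"
    }
  }
]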
Event Grid Resource Provider not Registered
If you get the error message below, you need to register the Microsoft.EventGrid resource provider in your subscription before continuing.
Navigate to your subscription’s Resource providers blade and register the ‘Microsoft.EventGrid’ provider, or use the Azure CLI:
az provider register --namespace 'Microsoft.EventGrid'
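Registration can take a few minutes; you can check its state with:

az provider show --namespace 'Microsoft.EventGrid' --query registrationState --output tsv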
Linked Services
Basically, we will need two linked services:
Azure Key Vault Linked Service
Create the Key Vault linked service. I use parameters for the Key Vault as well; this will simplify the CI/CD process when deploying into UAT or PRD, and we won’t have to update the deployment scripts.
Of course, we still have to change the parameter value in each Workspace, but this could be done as a global parameter (ADF only).
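A parameterised Key Vault linked service could look like the sketch below, where keyVaultName is the only value that changes between DEV, UAT and PRD (the linked service and parameter names are illustrative):

{
  "name": "LS_KeyVault",
  "properties": {
    "type": "AzureKeyVault",
    "parameters": {
      "keyVaultName": { "type": "string" }
    },
    "typeProperties": {
      "baseUrl": "https://@{linkedService().keyVaultName}.vault.azure.net/"
    }
  }
}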
Note: the Synapse/ADF managed identity must be assigned the ‘Key Vault Secrets User’ RBAC role on the vault (a role-assignment sketch follows the link below).
Azure Key Vault Linked Service — Two steps to grant service access — Vinny Paluch — Medium
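The role can be granted in the portal or with the Azure CLI. A sketch, assuming you already have the workspace managed identity object ID and the Key Vault resource ID at hand (both are placeholders below):

az role assignment create \
  --role "Key Vault Secrets User" \
  --assignee-object-id <managed-identity-object-id> \
  --assignee-principal-type ServicePrincipal \
  --scope <key-vault-resource-id>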
Storage Account Secrets
When I create a new storage account, I automatically save the connection information into the Key Vault; this is usually done through a Terraform script. In this scenario, however, we will be creating those secrets manually.
Required secrets: Data Lake Endpoint, SAS Key and Primary Key
Those secrets will be used in other stages of this implementation (a CLI sketch for creating them follows this list):
a) During the copy phase, when the connection to the Data Lake is not made using the MSI (managed identity) credential.
b) During the last stage, Synapse object creation, we will use the SAS Key to register the Data Lake as a SQL Data Source.
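A sketch of creating the three secrets with the Azure CLI; the storage account, resource group, vault and secret names are placeholders, and the SAS permissions and expiry should be adjusted to your scenario:

# Data Lake (DFS) endpoint
DFS_ENDPOINT=$(az storage account show --name <storage-account> --resource-group <resource-group> --query primaryEndpoints.dfs --output tsv)
az keyvault secret set --vault-name <key-vault> --name DataLakeEndpoint --value "$DFS_ENDPOINT"

# Primary (account) key
PRIMARY_KEY=$(az storage account keys list --account-name <storage-account> --resource-group <resource-group> --query "[0].value" --output tsv)
az keyvault secret set --vault-name <key-vault> --name DataLakePrimaryKey --value "$PRIMARY_KEY"

# Account-level SAS token (read/list on blob service, in this example)
SAS_TOKEN=$(az storage account generate-sas --account-name <storage-account> --account-key "$PRIMARY_KEY" --services b --resource-types sco --permissions rl --expiry 2025-12-31T23:59Z --output tsv)
az keyvault secret set --vault-name <key-vault> --name DataLakeSasKey --value "$SAS_TOKEN"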
Data Lake Storage Linked Service
The connection between ADF/Synapse and the Data Lake will use a managed identity. If that’s not possible in your scenario, use the Primary Key stored in the Key Vault.
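A sketch of a parameterised ADLS Gen2 linked service; when no credential property is specified, the workspace managed identity is used (the names are illustrative):

{
  "name": "LS_DataLake",
  "properties": {
    "type": "AzureBlobFS",
    "parameters": {
      "storageAccountName": { "type": "string" }
    },
    "typeProperties": {
      "url": "https://@{linkedService().storageAccountName}.dfs.core.windows.net"
    }
  }
}

To fall back to the Primary Key instead, add an accountKey property of type AzureKeyVaultSecret that references the Key Vault linked service and the secret created earlier.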
Pipeline Design Logic
I plan to create four pipelines; this will allow better reuse.
The Main pipeline will be triggered by any file dropped into /raw/dropfolder in my storage account.
The sub-pipelines are independent and can be reused later by another process, e.g., a bulk process pipeline.
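Inside the Main pipeline, each sub-pipeline is called with an Execute Pipeline activity that forwards the folder and file information received from the trigger. A sketch, assuming a sub-pipeline named PL_Copy that accepts the same two parameters:

{
  "name": "Run copy sub-pipeline",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "PL_Copy", "type": "PipelineReference" },
    "waitOnCompletion": true,
    "parameters": {
      "folderPath": "@pipeline().parameters.folderPath",
      "fileName": "@pipeline().parameters.fileName"
    }
  }
}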
The implementation process is described in this post series.
References:
Create event-based triggers — Azure Data Factory & Azure Synapse | Microsoft Learn
NIFTY 100 Stocks — Kaggle sample dataset | by Vinny Paluch | Nov, 2022 | Medium