Scaling Azure Data Factory & Azure Synapse Analytics Pipelines

Context

Back in May 2020 I wrote a blog post about 'When You Should Use Multiple Azure Data Factory's'. Following on from this post with a full year+ now passed and having implemented many more data platform solutions for some crazy massive (technical term) enterprise customers I've been reflecting on these scenario's. Specifically considering:

  • The use of having multiple regional Data Factory instances and integration runtime services.
  • The decoupling of wider orchestration processes from workers.

Furthermore, to supplement this understanding and for added context, in December 2020 I wrote about Data Factory Activity Concurrency Limits – What Happens Next? and Pipelines – Understanding Internal vs External Activities. Both of which now add to a much clearer picture regarding the ability to scale pipelines for the purposes of large-scale extraction and transformation processes.


Before getting into the meat of this, one further point of clarity, I choose to use the term 'data integration pipelines' in the title because 'pipelines' can now be implemented in both Data Factory and Synapse Analytics. Therefore, I wanted to refer to something that is basically orchestration resource agnostic within Azure. Or, at least for now, who knows! A H2 heading also added to be explicit 🙂

Pipelines – a now interchangeable term for different Azure data resources. Still not to be confused with Azure DevOps pipelines though!

Moving on…


Scenario

The scenario I'm proposing for this blog isn't exclusive to "big data" or "big enterprises". It could occur anywhere that requires a scalable data platform solution, involving data sources that are scattered far and wide (in a physical sense). Something in the magnitude of (let's say) 150+ data sources at 50 different locations, each with 250+ datasets that need processing.

To help make the point, let's also say we have an array of source systems on every continent coming from a range of different countries and ultimately localised network infrastructure. So, a multi-national corporation with lots of different operating companies around the world. Or similar.

How do you bring that data together in a scalable way, taking maximum advantage of Microsoft's cloud platform and avoiding any limitations in pipeline process concurrency?


Design Pattern

As you probably know if you follow my blog, I'm a visual person, so below is a pretty picture to address the stated scenario and allude to the design pattern I want to introduce.

In this pattern we have regional data extraction hubs pulling data from source systems via Self Hosted Integration Runtimes (IR's) and a central set of data transformation pipelines handling the flow of data into our Data Lake or Delta Lake-House (Data Warehouse).

We could think of this as a data orchestration hub and spoke architecture. Regional hubs. IR spokes.

Let's now dissect the picture (in GitHub as a PDF here if you prefer) with the bullet points below.

Another caveat, there are lots of variables here that can be subjective depending on your requirements. But let's tackle the generic design pattern, rather than thinking about every bit of low-level detail.

  • A set of regional orchestration resources, 1 per continent in this case. Deploy as and where required.
  • Source system IR's deployed either on site or in Azure as well if Express Route circuits exist.
  • Self Hosted IR's deployed as multi node clusters, to offer maximum load balancing and failover capabilities for locations where job concurrency is a factor.
  • A set of raw Azure Storage accounts used as initial data landing zones. Having this separated across regions can be useful next to the respective pipelines. However, be mindful of egress charges when transferring data between Azure Regions. A source to single region copy pattern might be better for you where the Self Hosted IR's does the work of writing directly to the central transformation area.
  • In the case of this picture, a pair of Azure IR's used to support internal pipeline activities, typically aligned to the Azure Regional failover zones. Whatever your requirement, don't use the default auto resolving IR, be explicit and be in control allowing for maximum activity concurrency.
  • From a security perspective ensure Managed Identities are used when writing data to the raw storage to avoid firewall rules.
  • Consider VNet's and Private Endpoint usage as Service Endpoints don't currently work across Azure Regions if you want to avoid public connections.
  • To overcome bandwidth limitations, breaking out of the public internet might be preferable from source infrastructure, then using the Microsoft backbone from your regional storage accounts to your central data lake.
  • Generic reusable pipelines wherever possible to simply troubleshooting and test overheads. Boiler plate code and metadata driven copying of data given the scope of your source system database technologies.
  • A central processing framework to handle all pipeline executions globally. Maybe procfwk.com 😉 . This could/should operate using a single UTC trigger for all processing, or be decoupled with regional triggers that align to metadata execution batches with downstream dependencies that check extraction has completed. Once you cross all time zones, data integrity and meaning becomes much harder so have a clear control flow is important.

Platform Hierachy

As a hierarchy we could label the items in the architecture as follows:

  • Microsoft Azure
    • Data platform solution.
      • Central processing framework, bootstrap resource.
      • Transformed data within our Lake House.
        • Regional processing resources.
        • Azure IR's supporting regional processing.
          • Locally hosted IR's for source system data pulls.
            • Multi source systems per location.

Summary

Scaling data integration pipelines in Azure (using Synapse or Data Factory) is possible and can work well when utilising different Azure Regions to create a set of 'hub and spoke' processing resources. This pattern also maximises pipeline execution parallelism given current IR activity concurrency limitations.


I hope you found this design pattern interesting and useful.

Many thanks for reading.