Azure Data Factory: What it is and how to use it

Move, transform, process and distribute business data from different sources with a single, convenient tool

What you'll find in this article

  • Azure Data Factory: What is it?
  • Azure Data Factory: How does it work?
  • Azure Data Factory connectors: the most used in the business environment
  • Azure Data Factory: a practical example
  • Azure Data Factory pricing

Azure Data Factory: What is it?

Azure Data Factory is the data movement, integration, and distribution service (Data Integration Service) offered by the Microsoft Azure cloud computing platform.

One of the biggest problems for a company is having to manage a large amount of data from different sources, which may include relational databases, file-based data, data in cloud services, and data from various external sources that struggle to communicate with each other.

Often, moving, managing, and distributing this data requires additional effort from developers and business data analysts to implement ad hoc solutions, with significantly longer development times and higher design and management costs.

Azure Data Factory makes this process much simpler and faster, allowing users to move, transform, and load data from different sources in a scalable and reliable way, without compatibility issues and without having to write dedicated code for each operation.

Azure Data Factory: How does it work?

Let's take a general look at how Data Factory works before looking at its characteristics in a little more detail.

The service provides intuitive visual tools based on a drag-and-drop interface and makes it possible to perform a series of transformation operations, such as joins, filtering, aggregation, and custom transformations, to prepare data for analysis or for loading into a destination without compatibility problems between individual data sources.

Monitoring and management tools let you observe the progress of workflow execution, identify any problems or anomalies, and optimize performance immediately.
Azure Data Factory also provides several security features, including integration with Azure Active Directory for authentication and authorization, data encryption at rest and in transit, and role-based access control (RBAC) to manage access to data and pipelines.

Data integration, as we have already seen, involves collecting data from one or more sources and then cleaning, transforming, or enriching it with additional data. Finally, the combined data is stored in a data platform service suited to the type of analysis we want to perform.

This process can be automated with Azure Data Factory in a configuration known as Extract, Transform, and Load (ETL). Let's see what each of these three steps involves.

Extract

Companies have data of various types, such as structured, unstructured and semi-structured data.

The first step consists of picking up all the data from the different sources and then moving it to a centralized location for subsequent processing, configuring connections to the data sources you want to access. Azure Data Factory offers preconfigured connectors for a wide range of data sources, and we'll take a closer look at them in the next section.

We then continue by defining activities and pipelines within Azure Data Factory. An activity is a single action, such as loading a CSV file into a database, while a pipeline is a series of activities linked together in a workflow, as in the sketch below.
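To make the distinction concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK: a single Copy activity grouped into a one-step pipeline. The subscription ID, resource names, and the "InputCsv" and "OutputTable" datasets are hypothetical and would need to exist in your factory.

```python
# Minimal sketch with the azure-mgmt-datafactory Python SDK. Resource names,
# subscription ID, and the "InputCsv"/"OutputTable" datasets are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A single activity: copy a CSV file from Blob Storage into an Azure SQL table.
copy_csv = CopyActivity(
    name="LoadCsvIntoDatabase",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputCsv")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputTable")],
    source=BlobSource(),
    sink=AzureSqlSink(),
)

# A pipeline: one or more activities chained into a single workflow.
adf.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "IngestCsvPipeline",
    PipelineResource(activities=[copy_csv]),
)
```

Chaining more steps is simply a matter of adding further activities to the pipeline's activity list and, where needed, declaring dependencies between them.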

Transform

Once connections to data sources have been established, data movement and transformation activities can be created within pipelines.
Possible operations include combining, splitting, adding, deriving, removing, or copying data from one source to another, transforming data using mappings and business rules, and executing SQL queries on relational databases. For example, we can use the Copy Activity in a data pipeline to move data from both cloud sources and on-premises data stores to a centralized data warehouse in the cloud.

You can also schedule pipelines to run at predefined times or in response to specific events, ensuring that data is always up to date and available to users when needed. Execution can be monitored through Azure Monitor, PowerShell, APIs, Azure Monitor logs, and the status panels in the Azure portal, collecting information on the results, displaying any errors or alerts, and analyzing the overall performance of the workflow.
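As an illustration, the sketch below continues the earlier example (same hypothetical `adf` client and pipeline name): it starts a run on demand and polls its status, which is the same information the portal's monitoring views and Azure Monitor expose.

```python
import time

# Start a run of the pipeline defined earlier (on demand rather than scheduled).
run = adf.pipelines.create_run(
    "my-resource-group", "my-data-factory", "IngestCsvPipeline", parameters={}
)

# Poll the run until it reaches a terminal state (Succeeded, Failed, Cancelled).
while True:
    status = adf.pipeline_runs.get("my-resource-group", "my-data-factory", run.run_id)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Pipeline run {run.run_id} finished with status: {status.status}")
```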

Azure Data Factory also supports external activities to perform our transformations on computing services such as Spark, HDInsight Hadoop, Machine Learning and Data Lake Analytics.

Load

Once the execution of the pipelines has been completed, the processed data can be delivered to the desired destination, such as a database, a data warehouse, or a cloud service. Users can then use this data for analysis, reporting, application development, and other business purposes.

During an upload, many Azure destinations can accept data formatted as files, JSON, or blobs.
Azure Data Factory also offers full CI/CD support (continuous integration and delivery) for our data pipelines using GitHub and Azure DevOps.

Continuous integration (CI) is the frequent, automatic integration of code changes into a shared repository. Continuous delivery (CD) is a two-part process during which code changes are integrated, tested, and deployed.

What does "orchestration" mean?

Orchestration refers to the process of coordinating and managing the execution of data workflows. This involves scheduling and controlling the activities within pipelines, ensuring that they are executed in the right order and at the right time.

The Azure Data Factory orchestrator is responsible for the execution of activities within the pipelines. This includes coordinating data extraction, transformation, and loading activities, as well as managing dependencies between them.

Sometimes Data Factory will instruct another service to do the actual work on its behalf, such as Databricks to execute a transformation query. Data Factory then orchestrates the execution of the query and prepares the pipelines to move the data to the destination or to the next step.

The orchestrator therefore ensures that the data is processed correctly and consistently, and also allows execution to be scheduled based on a variety of triggers, such as a specific time, an event, or a trigger based on specific data, as in the sketch below.
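Continuing the earlier sketch, here is roughly what a simple time-based trigger looks like with the Python SDK. The trigger name and recurrence are illustrative, and a recent azure-mgmt-datafactory version (where long-running operations are exposed as begin_* methods) is assumed.

```python
from datetime import datetime, timedelta

from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

# Run the pipeline once a day, starting shortly after the trigger is activated.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15),
    time_zone="UTC",
)

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="IngestCsvPipeline"),
        parameters={},
    )],
)

adf.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "DailyIngest",
    TriggerResource(properties=trigger),
)
# Triggers are created in a stopped state and must be started explicitly.
adf.triggers.begin_start("my-resource-group", "my-data-factory", "DailyIngest").result()
```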

Did you know that we help our customers manage their Azure tenants?

We have created the Infrastructure & Security team, focused on the Azure cloud, to better respond to the needs of our customers who involve us in technical and strategic decisions. In addition to configuring and managing the tenant, we also take care of:

  • optimization of resource costs
  • implementation of scaling and high availability procedures
  • creation of application deployments through DevOps pipelines
  • monitoring
  • and, above all, security!

With Dev4Side, you have a reliable partner that supports you across the entire Microsoft application ecosystem.

Azure Data Factory connectors: the most used in the business environment

Connectors in Azure Data Factory are software components designed to simplify and facilitate data integration, allowing you to connect and interact with a wide range of services and data platforms, both inside and outside Azure.

In a nutshell, they act as a bridge between Data Factory and the various sources and destinations, allowing you to acquire data from external sources such as databases, files, cloud services, and software applications, and to load it to the desired destinations, or vice versa.

The service provides around 100 connectors and robust resources for both code-based and code-free users to meet their data movement and transformation needs.

Having given a brief definition of what connectors are, let's look together at the types most used in Azure Data Factory for business data management and processing.

Connectivity to relational databases

Azure Data Factory offers a wide range of relational database connectors, including SQL Server, MySQL, Oracle, PostgreSQL, and others. This allows Azure Data Factory to connect directly to relational databases and perform operations such as extracting data, executing queries, updating and inserting data.
For example, you could use the Data Factory SQL Server connector to automatically extract data from a specific table in the corporate SQL Server database and load it into an Azure SQL Data Warehouse for further analysis.
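As a rough sketch of what sits behind that scenario, the two linked services might be defined like this with the Python SDK (reusing the `adf` client from the earlier sketches; connection strings, credentials, and the self-hosted integration runtime name are placeholders):

```python
from azure.mgmt.datafactory.models import (
    AzureSqlDWLinkedService, IntegrationRuntimeReference, LinkedServiceResource,
    SecureString, SqlServerLinkedService,
)

# Source: an on-premises SQL Server reached through a self-hosted integration runtime.
on_prem_sql = LinkedServiceResource(properties=SqlServerLinkedService(
    connection_string="Data Source=CORP-SQL01;Initial Catalog=Sales;Integrated Security=False;",
    user_name="etl_reader",
    password=SecureString(value="<secret>"),
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference", reference_name="SelfHostedIR"),
))

# Destination: an Azure SQL Data Warehouse (Azure Synapse Analytics) instance.
sql_dw = LinkedServiceResource(properties=AzureSqlDWLinkedService(
    connection_string=SecureString(value="<azure-sql-dw-connection-string>"),
))

adf.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "OnPremSqlServer", on_prem_sql)
adf.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "SqlDataWarehouse", sql_dw)
```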

Integration with file storage

Data Factory supports several connectors for file storage, such as Azure Blob Storage, Azure Data Lake Storage, and the local file system (provided it is reachable from Azure). Among the different options available, we can, for example, access files stored on these platforms and perform operations such as reading and writing files and processing structured and unstructured data.

Let's say we need to process and merge data from different CSV files stored in Azure Blob Storage. With Data Factory, we can define a workflow that reads these files, performs transformation operations such as merging the data, and finally loads it into an Azure SQL database for analysis.
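A hypothetical dataset for that workflow could point at the folder of CSV files in Blob Storage; the copy or data flow activity that merges the files would then reference it. The linked service name "BlobStorage" and the container path below are assumptions.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation, DatasetResource, DelimitedTextDataset,
    LinkedServiceReference,
)

# All CSV files under raw-data/sales/2024 in the storage account behind "BlobStorage".
sales_csv = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobStorage"),
    location=AzureBlobStorageLocation(container="raw-data", folder_path="sales/2024"),
    column_delimiter=",",
    first_row_as_header=True,
))

adf.datasets.create_or_update(
    "my-resource-group", "my-data-factory", "SalesCsvFiles", sales_csv)
```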

Interaction with Microsoft Azure cloud services

Azure Data Factory is natively integrated with Microsoft Azure cloud services, including Azure SQL Database, Azure Cosmos DB, and Azure Table Storage, allowing it to interact directly with these services and giving users the opportunity to transfer and process data and to integrate with other solutions offered by the Azure platform.

Take, for example, the use of the Azure Cosmos DB connector to perform periodic processing of data stored in a NoSQL database such as Cosmos DB. This would make it possible to define a workflow that extracts data from Cosmos DB, performs transformation operations, and loads the results into a data analysis platform such as Azure Synapse Analytics.

Access to software applications (via API)

Azure Data Factory offers connectors for many popular software applications, not only in the Microsoft ecosystem (such as Dynamics 365) but also third-party apps such as Salesforce and Google Analytics, allowing you to directly access the data stored in these applications and perform operations such as extracting data, updating resources, and synchronizing data between applications, without having to develop connection solutions yourself.

For example, if we want to integrate data from a customer management system such as Salesforce with our data analysis environment in Azure, we can use the Salesforce connector in Data Factory to define a workflow that automatically extracts data from Salesforce, transforms it according to our needs, and then loads it into an Azure data lake.

Support for data transfer protocols

Finally, Azure Data Factory also supports various data transfer protocols, including FTP, SFTP, and HTTP, allowing data to be transferred to and from external systems through these protocols and making it easy to integrate data from different sources and systems.

For example, you could use the Secure File Transfer Protocol (SFTP) connector in Azure Data Factory to extract files from a third-party SFTP server, such as those used by external data providers. This connector allows you to define a workflow that automatically retrieves the files, processes them according to specific requirements, and loads them into a business analysis system.
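A linked service for such a provider might be sketched as follows, again reusing the `adf` client from the earlier examples; the host, credentials, and authentication settings are purely illustrative.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, SecureString, SftpServerLinkedService,
)

# Basic (username/password) authentication against the provider's SFTP endpoint.
sftp_provider = LinkedServiceResource(properties=SftpServerLinkedService(
    host="sftp.dataprovider.example.com",
    port=22,
    authentication_type="Basic",
    user_name="acme_ingest",
    password=SecureString(value="<secret>"),
))

adf.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "ProviderSftp", sftp_provider)
```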

Azure Data Factory: a practical example

Let's now walk through a practical example of how to create an Azure Data Factory and experience its speed and ease of use first-hand, integrating data from a local SQL Server database into an Azure SQL data warehouse.

Creating the resource

To begin with, let's log in to the Azure portal using our account. If you don't have an account, there's nothing to worry about since a few simple clicks will be enough to create one for free. Let's go to the portal menu and click on Create a resource.

Then select Analytics and, in the list of analysis services, look for Azure Data Factory and click on it.

Let's click on Create to reach the page for filling in the basic details. Here we enter the required details such as name, subscription, resource group, and region. Once completed, click on Git configuration.

We then select the option to configure integration with a Git repository to manage the pipeline code. Once finished, we click on Next.

On the Networking page, we verify the default settings and, if necessary, adjust permissions and network rules according to our needs.

On the Review and create page, we check the settings and details of our Data Factory and, when satisfied with the result, we finally click on Create to start creating our resource.
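The same resource-creation step can also be scripted. Below is a minimal sketch with the azure-mgmt-datafactory Python SDK; subscription, resource group, factory name, and region are placeholders, and the resource group is assumed to already exist.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) the Data Factory resource in an existing resource group.
factory = adf.factories.create_or_update(
    "my-resource-group", "my-data-factory", Factory(location="westeurope"))

print(factory.name, factory.provisioning_state)
```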

Activating the user interface

Once our resource is created, click on Go to resource and select Launch Studio to access the Azure Data Factory user interface in a separate tab.

Here we can configure data pipelines to extract data from the local SQL Server database and load it into the Azure SQL data warehouse.

Using the available visual components, we can define copy, transformation, and load activities, such as extracting data from a SQL Server database, transforming it with Azure Databricks, or loading data from local files into Azure SQL Data Warehouse.

Verification and optimization

Finally, we make any necessary improvements and corrections to the configurations defined above and, after making sure that the pipelines have been organized correctly, we start monitoring activities and workflows to ensure that everything continues to work as expected.

As you can see, the procedures listed above are quick and simple, can be repeated for any other similar operation, and allow users and companies to save a great deal of time and resources in managing and moving their data.

Azure Data Factory pricing

The Azure Data Factory pricing system adopts a payment model based on actual use of the service, in which users pay for the computing and storage resources used during the execution of data pipelines, the orchestration and execution of said pipelines, and the execution and debugging of data flows.

Costs are calculated automatically based on the minutes that activities run, the data integration units used (and the type of compute performed), and the storage space occupied by the data (and the storage class used).
Some connectors and resources, as well as the use of external cloud services such as Azure SQL Database or Azure Blob Storage, may entail additional costs based on usage and the rates of those services.
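As a back-of-the-envelope illustration of how these meters combine, the sketch below mirrors the billing dimensions just described. The unit rates are placeholders, not real prices: the current ones must be taken from the official Azure Data Factory pricing page for your region.

```python
# Placeholder unit rates: replace with the current ones from the official
# Azure Data Factory pricing page for your region and currency.
DIU_HOUR_RATE = 0.0          # per Data Integration Unit-hour of copy activity
ORCHESTRATION_RATE = 0.0     # per 1,000 activity/orchestration runs
STORAGE_GB_MONTH_RATE = 0.0  # per GB/month in the chosen storage class

def estimate_monthly_cost(diu_hours: float, activity_runs: int, storage_gb: float) -> float:
    """Rough monthly estimate: execution time x DIUs + orchestration + storage."""
    return (
        diu_hours * DIU_HOUR_RATE
        + (activity_runs / 1000) * ORCHESTRATION_RATE
        + storage_gb * STORAGE_GB_MONTH_RATE
    )

# Example inputs: 120 DIU-hours of copy, 3,000 activity runs, 500 GB stored.
print(estimate_monthly_cost(diu_hours=120, activity_runs=3_000, storage_gb=500))
```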

It's important to keep in mind that actual costs may vary based on current rates, actual usage, and other factors specific to the use case. We therefore recommend that you go deeper by consulting the official Azure documentation.

It is also worth noting the convenient Azure pricing calculator offered by Microsoft (which you can find here), which lets you estimate the specific costs for your company based on geographical area, payment currency, and the length of use of each service (calculated by hours or by months), and find the most suitable solution for your needs.

FAQ on Azure Data Factory

What is Azure Data Factory?

Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and orchestrate data workflows across various data sources and destinations. It enables the movement and transformation of data at scale.

How does Azure Data Factory handle data integration?

Azure Data Factory facilitates data integration by enabling the creation of data pipelines that can ingest, transform, and load data across different on-premises and cloud sources. It supports multiple data sources and provides connectors for seamless data flow.

Can Azure Data Factory be used for ETL processes?

Yes, Azure Data Factory is ideal for ETL (Extract, Transform, Load) processes. It allows you to extract data from various sources, apply transformations, and load the processed data into a destination, such as a database or data warehouse.

What types of data sources can Azure Data Factory connect to?

Azure Data Factory can connect to a wide range of data sources, including on-premises databases, cloud-based services, SaaS applications, and various data storage systems like Azure Blob Storage, SQL Server, and more.

How does Azure Data Factory ensure data security?

Azure Data Factory ensures data security through encryption, secure data transfer protocols, and integration with Azure's security features. It supports role-based access control (RBAC) and manages sensitive data with encryption both in transit and at rest.

Is it possible to monitor and manage data pipelines in Azure Data Factory?

Yes, Azure Data Factory provides robust monitoring and management capabilities. You can monitor the status of data pipelines, set up alerts for failures, and view detailed logs to troubleshoot issues.

What are the key components of an Azure Data Factory pipeline?

An Azure Data Factory pipeline consists of linked services, datasets, and activities. Linked services define connections to data sources, datasets represent data structures, and activities define the operations performed on the data, such as copying, transforming, or moving it.

Can Azure Data Factory be integrated with other Azure services?

Yes, Azure Data Factory can be seamlessly integrated with other Azure services such as Azure Synapse Analytics, Azure Databricks, and Azure Machine Learning to enhance data processing, analytics, and machine learning workflows.

How does Azure Data Factory handle scheduling and automation?

Azure Data Factory provides built-in scheduling capabilities that allow you to automate the execution of data pipelines based on time or event triggers. This enables the automation of data workflows without manual intervention.

What are the pricing considerations for using Azure Data Factory?

Azure Data Factory pricing is based on several factors, including the number of pipeline activities, data movement, and integration runtime usage. It offers a pay-as-you-go pricing model, allowing flexibility based on usage.

Get in touch with the team

Modern Apps

The Modern Apps team specializes in development and integration across the entire Microsoft 365 ecosystem. We design native applications for Microsoft and Azure platforms, and implement business processes that integrate with and maximize the investment in Microsoft 365.