Azure Databricks: What it is, architecture, and costs

Azure Databricks is the version of Apache Spark optimized for the Azure cloud and one of the leading analytics platforms. It allows data scientists, engineers, and business analysts to process large volumes of data efficiently. Azure Databricks combines the power of Apache Spark with the scalability and security of Azure, providing a collaborative environment for big data management. In this article, we will take a closer look at what Azure Databricks is, the peculiarities of its architecture, what differentiates it from other Azure data processing services, and the costs associated with its use.

What you'll find in this article

  • What is Azure Databricks
  • Azure Databricks Architecture: main components and operation
  • Azure Databricks: differences with similar Microsoft Azure services
  • Azure Databricks Pricing: How can you optimize costs?

What is Azure Databricks

Organizations accumulate huge amounts of data related to operations, marketing, sales, and more every day. To exploit their full potential, it is essential for a company to rely on platforms and services that allow the integration and analysis of this data, in order to obtain information useful to the health and growth of its business. This is where Azure Databricks comes in, an advanced data analysis platform based on Apache Spark and integrated into the Microsoft Azure cloud.

The service integrates all the components and capabilities of Databricks, also allowing integration with other Azure services. This allows businesses to take advantage of Databricks's advanced data processing and machine learning capabilities, along with the robustness, scalability, and integration options of Microsoft's cloud computing platform.

Thanks to its ability to manage complex datasets and to support various programming languages such as Python, Scala and R, Azure Databricks is particularly suitable for applications that require high scalability and reliability, making it an ideal choice for companies looking to optimize their data analysis processes in the cloud.

Azure Databricks gives users a collaborative environment based on Apache Spark technology, backed by the full computing power of Microsoft's data centers. It lets them work on large amounts of data in an abstracted, simplified way, and create, manage, and distribute data processing workflows and perform machine learning tasks efficiently. Let's see how below.

Azure Databricks Architecture: main components and operation

Azure Databricks is a lakehouse platform that offers a cost-effective solution for data management and governance. A data lakehouse is a modern data management architecture that combines the strengths of two traditional data storage and analysis models: the data lake and the data warehouse.

Traditionally, a data lake is a storage space for raw, unstructured data from different sources. It is ideal for advanced analytics, machine learning, and big data, thanks to its ability to manage large volumes of data in any format (text, audio, video, etc.). However, this flexibility often comes with less stringent data management, with potential quality and governance issues.

On the other hand, a Data Warehouse is designed for structured data analysis. It stores data in an orderly manner, allowing quick and precise queries on structured data. It's perfect for business reporting and operational analysis, but it can be costly and limiting when it comes to managing unstructured or semi-structured data.

The data lakehouse combines these two approaches and within it the data can initially be stored raw, as a data lake would, but it is managed and organized in such a way that it can be transformed, indexed and optimized for structured analysis, typical of a data warehouse.

This approach allows organizations to create a single continuous data management system, avoiding the complexity of having to manage multiple data architectures simultaneously.

Azure Databricks architecture overview

Core components of Azure Databricks

In order to understand how Azure Databricks works, you first need to be familiar with some fundamental elements of its structure. Here is a list of the most important ones, with a brief description of each:

  • Workspace: the central hub of Azure Databricks, where all users can collaborate. This environment allows teams to create and manage notebooks, libraries, and data in an integrated, interactive context. Teams can share and work on projects in real time, facilitating collaboration and knowledge sharing.
  • Notebooks: interactive web interfaces that allow users to write and execute code, visualize data and document their analyses. They are essential tools for iterative processing and rapid prototyping, and they support various programming languages such as Python, Scala, R, and SQL.
  • Clusters: the Azure Databricks computational units. These clusters run code written in notebooks and can be configured to scale automatically based on processing needs, ensuring optimal performance even with variable workloads. Clusters can be temporary for performing specific tasks or persistent for continuous use.
  • Jobs: allow users to schedule and automate the execution of notebooks or scripts. Users can define parameters for jobs, set schedule intervals, and monitor the status of jobs through the Azure portal or REST APIs. Especially useful for recurring tasks or for managing complex workflows.
  • Libraries: reusable packages or dependencies that can be installed on clusters to extend the functionality of Azure Databricks. These may include machine learning libraries, data visualization tools, or other resources needed for specific data analysis projects.
  • Data Sources: Azure Databricks supports a wide range of data sources, both structured and unstructured, including Azure Blob Storage, Azure Data Lake Store, SQL Data Warehouse, and many others.
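To make the Jobs component more concrete, here is a minimal sketch of a job definition expressed as a Python dictionary, in the shape of the Databricks Jobs REST API; the notebook path, cluster sizing, and cron schedule are illustrative assumptions, not recommendations:

```python
# Hypothetical job definition: run a notebook nightly on a small job cluster.
# Field names follow the shape of the Databricks Jobs API; all values
# (paths, VM type, schedule) are illustrative placeholders.
job_definition = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Shared/etl/ingest"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    # Quartz cron expression: every day at 02:00 in the given time zone.
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "Europe/Rome",
    },
}
```

In practice, a payload like this would be submitted through the Jobs REST API or built interactively in the workspace UI, after which the job's runs can be monitored from the Azure portal or via the same API.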

The Azure Databricks architecture

Azure Databricks is designed around two main architectural components: the Control Plane and the Computing Plane.

The control plane is the management layer where Azure Databricks runs the workspace application and manages notebooks, configurations, and clusters. It includes the backend services that Azure Databricks operates. For example, the web application you interact with is part of the control plane.

The compute plane is where data processing tasks take place. It is divided into two categories based on usage:

  • Classic compute plane: lets you use Azure Databricks computing resources within your own Azure subscription. Compute resources are created inside each workspace's virtual network, located in the customer's Azure subscription. The classic compute plane therefore offers intrinsic isolation, since it runs inside an environment the customer controls.
  • Serverless compute plane: in the serverless model, Azure Databricks manages computing resources within a shared infrastructure. This plane is designed to simplify operations by eliminating the need to manage the underlying compute resources, and it has multiple layers of security to protect data and isolate workspaces, helping to ensure that, even though the infrastructure is shared, each customer's data and resources remain private.

Did you know that we help our customers manage their Azure tenants?

We have created the Infrastructure & Security team, focused on the Azure cloud, to better respond to the needs of our customers who involve us in technical and strategic decisions. In addition to configuring and managing the tenant, we also take care of:

  • optimization of resource costs
  • implementation of scaling and high availability procedures
  • creation of application deployments through DevOps pipelines
  • monitoring
  • and, above all, security!

With Dev4Side, you have a reliable partner that supports you across the entire Microsoft application ecosystem.

Azure Databricks: differences with similar Microsoft Azure services

Microsoft Azure's significant offer includes other services dedicated to data management and processing in addition to Azure Databricks, each useful to cover different types of tasks and meet the specific needs of your company.

This is good news for users of Microsoft's cloud computing platform, but the sheer number of tools can confuse newcomers, who may get lost among the many services on offer without a clear sense of when it is preferable to use one service rather than another.

In this section, we will therefore clarify the differences between Azure Databricks and two of Azure's other most important services, Synapse and Data Factory, to better understand in which situations Databricks is the best choice for your needs.

Azure Synapse vs. Azure Databricks

Azure Synapse and Azure Databricks offer similar functionality, but there are distinct differences between them. Databricks is a batch and streaming data processing engine built on Apache Spark, which allows processing to be distributed across multiple nodes. Azure Synapse (formerly Azure SQL Data Warehouse), by contrast, while able to exploit Spark's capabilities, is better thought of as a unified data analysis platform spanning big data systems and data warehouses.

If the end user experience is more oriented towards data science, if you work with many open source libraries or work more closely in the field of machine learning, Databricks could be the ideal choice.

Not only does it have its own utilities for widgets, dashboards, and graphics, all based on Jupyter-style notebooks, but it is also fairly universal in terms of supported programming languages, allowing you to run Python, Scala, SQL, R, and more. In addition, Databricks is designed to work as a centralized platform: it has its own UI and systems for connecting through various endpoints, such as JDBC connectors.

Among Synapse's strengths, on the other hand, is its complete suite of tools. Essentially, Microsoft took the traditional Azure SQL Data Warehouse and integrated the Data Factory components for ETL and ELT data movement, along with Power BI for data visualization.

Because Synapse is based on traditional SQL, that familiarity can benefit organizations already experienced with the platform. However, its Spark integration tends to lag a little behind Databricks when it comes to updates and new features.

The delay in new features is partly offset by one of Synapse's main tools, Purview: essentially a cataloguing system that can be used for data governance. With it, a user can take data from two sources and insert it into a data lake, where the information can be transformed, curated, and even cleaned before being distributed to other users for analysis.

This integrated governance capability makes it easy to trace data lineage, allowing someone to easily see a table schema, whether on a file system or in a database, and understand the movement of data between source and destination.

Modern data warehouse architecture proposed by Microsoft

Azure Data Factory vs. Azure Databricks

Azure Data Factory is primarily used for data integration, executing ETL processes and orchestrating large-scale data movements, while Azure Databricks is more oriented toward providing a collaborative platform where data engineers and data scientists can perform ETL and build machine learning models on a single platform.

Databricks uses Python, Spark, R, Java, or SQL to perform data engineering and data science tasks in notebooks. ADF, on the other hand, provides drag-and-drop functionality and a more user-friendly graphical interface for visually creating and managing data pipelines, allowing solutions to be delivered faster.

Although Data Factory facilitates the ETL pipeline process using GUI tools, developers have less flexibility as they cannot modify the backend code. Instead, Databricks implements a programmed approach that provides the flexibility to optimize performance by adjusting the code.

Companies often rely on batch processing or streaming when working with large volumes of data. While batch processing handles data in bulk, streaming handles real-time or recent data (typically less than twelve hours old), depending on the application.

Both services support batch workloads, but ADF doesn't support live streaming, unlike Databricks, which supports both batch and streaming through the Spark API.

If a company wants a no-code ETL pipeline for data integration, Azure Data Factory is the better choice, whereas Databricks provides its more technical users with a unified analytics platform that integrates various ecosystems for BI reporting, data science, and machine learning, at the price of less ease of use.

Workflow and pipeline planning with Azure Databricks and Azure Data Factory

Azure Databricks Pricing: How can you optimize costs?

Azure Databricks uses a “pay-as-you-go” payment model, which allows you to pay only for the resources you actually use without long-term commitments. This model is very flexible and suitable for companies that have varying workloads.

However, for those with predictable needs and stable usage volumes, Azure also offers "pre-purchase" options with significant discounts for long-term commitments. These plans grant lower rates than the pay-as-you-go model in exchange for a commitment over a certain period (typically one or three years) and can lead to savings of up to 37% on the cost of the service.

Pricing is mainly based on the concept of 'DBU' (Databricks Unit), which is a measure of computing power. Each computational activity on Azure Databricks consumes a certain number of DBUs, and the cost per DBU varies depending on the service level and the type of cluster instance used.
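To make the model concrete, here is a small back-of-the-envelope calculation of a cluster's cost; all rates below are hypothetical placeholders, not current Azure list prices:

```python
# Illustrative cost model: total cost = DBU charges + VM charges.
# Every number here is a hypothetical example, not an actual Azure price.
dbu_rate = 0.40       # $ per DBU (assumed rate for the chosen tier/workload)
dbus_per_hour = 2.0   # DBUs consumed per node per hour (depends on VM size)
vm_rate = 0.60        # $ per VM-hour for the chosen instance type (assumed)
nodes = 4             # cluster size
hours = 10            # total runtime

dbu_cost = dbu_rate * dbus_per_hour * nodes * hours  # Databricks charge
vm_cost = vm_rate * nodes * hours                    # separate Azure VM charge
total = dbu_cost + vm_cost                           # ~56.0 in this example

print(f"DBU: ${dbu_cost:.2f}, VMs: ${vm_cost:.2f}, total: ${total:.2f}")
```

The point of the split is that resizing the cluster or shortening runtime reduces both terms at once, while switching tier or instance type changes only one of the two rates.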

In addition to the cost per DBU, you must consider the cost of the cloud resources used by the clusters. Azure charges separately for the virtual machines (VMs) that provide compute and for the storage used. The price of VMs depends on the instance type chosen (for example, standard, memory-optimized, or compute-optimized instances) and can vary significantly. Storage also has a variable cost, based on the amount of data stored and the type of storage (for example, SSD or HDD).

There are two service tiers available for purchase:

  • Standard: suitable for general workloads and provides a good balance between cost and performance. It's the most affordable option and is ideal for development and test projects.
  • Premium: at a slightly higher cost, it offers advanced security and management features, such as role-based access control and support for two-factor authentication. It is suitable for production environments that require a higher level of security and management.

Azure Databricks also offers several add-ons and complementary services that may affect the final cost. These include support for Data Engineering, which is optimized for large scale ETL processing, and support for Data Science and Machine Learning. These components are also charged based on usage and can add an additional level of complexity to the overall pricing model.

For more detailed information on service prices, please refer to the official Microsoft page (available here), where the provided tool will let you calculate pricing options based on geographical region, currency, type of service, and duration of use (calculated in hours or months).

Best practices for optimizing Azure Databricks costs

As with all pay-as-you-go services, it is essential for companies to learn how to optimize costs while using the platform, in order to avoid waste and to modulate overall spending in an intelligent way.

To help, here is a list of best practices and strategies you can implement to use Azure Databricks' capabilities without wasting your budget:

  • Choosing the right level of service: it is necessary to evaluate the needs of your organization and select the level that offers the necessary functionality without incurring unnecessary costs.
  • Autoscaling: activating autoscaling for your Databricks clusters, in order to dynamically adjust the number of work nodes based on the workload, helps ensure optimal use of resources, reducing costs thanks to a more careful management of computing resources.
  • Terminate inactive clusters: Setting automatic cluster termination policies after a period of inactivity avoids paying for unused resources when clusters are inactive.
  • Use spot instances: spot instances are unused Azure resources offered at discounted rates. They can be used for your Databricks clusters to reduce compute costs, even though Azure can reclaim them at short notice. They are suitable for fault-tolerant workloads or for development and test environments.
  • Optimize data storage: use efficient data formats such as Parquet or Delta Lake, which offer built-in compression and reduce storage costs, and keep your data in low-cost storage services such as Azure Blob Storage or Azure Data Lake Storage.
  • Cache data: caching frequently accessed data reduces disk I/O and improves performance, leading to faster execution, significantly shorter cluster run times and, as a result, lower usage costs.
  • Optimize queries and transformations: to minimize data movement and reduce the execution time of your jobs, you can optimize the writing of queries and transformations using techniques such as partition pruning, predicate pushdown and broadcast joins to improve performance.
  • Monitor and analyze usage: it may go without saying, but you should keep usage under control and analyze it regularly with the tools Azure provides, such as Cost Management, Azure Monitor, and Databricks usage metrics, to quickly identify areas of inefficiency and optimize operating costs accordingly.
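Several of the practices above come together in the cluster configuration itself. The sketch below expresses them as a Python dictionary in the shape of the Databricks Clusters API; the specific VM type, bounds, and timeout are illustrative assumptions, not recommendations:

```python
# Sketch of a cost-aware cluster spec combining autoscaling, auto-termination,
# and spot instances. Field names follow the shape of the Databricks Clusters
# API; all values are hypothetical examples.
cluster_spec = {
    "cluster_name": "cost-aware-etl",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    # Autoscaling: let Databricks size the cluster between these bounds
    # instead of pinning a fixed (and possibly oversized) worker count.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Auto-termination: shut the cluster down after 30 idle minutes
    # so you stop paying for compute nobody is using.
    "autotermination_minutes": 30,
    # Spot instances: prefer discounted spot VMs, falling back to
    # on-demand capacity if Azure reclaims them.
    "azure_attributes": {"availability": "SPOT_WITH_FALLBACK_AZURE"},
}
```

A spec like this would typically be supplied when creating a cluster through the REST API or mirrored in the cluster-creation UI; the important point is that every cost lever in the list above corresponds to a concrete setting.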

Conclusions

Azure Databricks is currently among the most widely used and appreciated Azure analytics services for companies of all types and sizes, meeting the needs of both data engineers and data scientists in building complete end-to-end big data solutions and deploying them to production.

Data engineers use it to configure the entire architecture, creating clusters, planning and executing jobs, establishing connections with data sources, and so on, while data scientists use it to perform machine learning and real-time analysis. For both, Azure Databricks provides a wealth of customization options and the versatility of a platform compatible with the major programming languages used in the sector.

The platform is therefore confirmed as one of the best choices for all those companies that need to perform complex data analysis or machine learning tasks in an efficient and scalable way, allowing them to meet their needs by exploiting the power of the data centers provided by Microsoft's cloud computing platform.

FAQ on Azure Databricks

What is Azure Databricks?

Azure Databricks is an advanced analytics platform built on Apache Spark and integrated into the Microsoft Azure cloud. It facilitates data engineering, machine learning, and big data analytics in a collaborative environment.

What are the main components of Azure Databricks?

Azure Databricks includes components like Workspaces, Notebooks, Clusters, Jobs, Libraries, and Data Sources for seamless data processing and collaboration.

How does Azure Databricks architecture work?

Azure Databricks architecture is divided into the Control Plane (managing workspaces) and the Computing Plane (handling data processing), with options for both classic and serverless computing.

How does Azure Databricks differ from Azure Synapse?

While both are used for data analytics, Azure Databricks is optimized for data science and machine learning with Apache Spark, whereas Azure Synapse is a more comprehensive data analysis platform, including SQL data warehousing and ETL processes.

How does Azure Databricks compare to Azure Data Factory?

Azure Data Factory is primarily a data integration service for ETL processes, while Azure Databricks provides a collaborative platform for data engineering and data science tasks.

What is the pricing model for Azure Databricks?

Azure Databricks uses a "pay-as-you-go" model based on Databricks Units (DBUs) and also offers discounted pre-purchase options for long-term commitments.

How can I optimize costs in Azure Databricks?

Optimize costs by choosing the right service level, using autoscaling, terminating inactive clusters, and leveraging spot instances and efficient data formats like Parquet.

What programming languages does Azure Databricks support?

Azure Databricks supports Python, Scala, R, and SQL, allowing flexible development for various data processing and machine learning tasks.

What are some best practices for using Azure Databricks?

Best practices include optimizing queries, caching frequently accessed data, and regularly monitoring usage to avoid inefficiencies and reduce costs.

Find out why to choose the team

Infra & Sec

The Infra & Security team focuses on the management and evolution of our customers' Microsoft Azure tenants. Besides configuring and managing these tenants, the team is responsible for creating application deployments through DevOps pipelines. It also monitors and manages all security aspects of the tenants and supports Security Operations Centers (SOC).