Understand Azure data solutions

By Danilo Dominici, Francesco Milano, Daniel A. Seara
10/11/2021

Summary

Data engineers are responsible for data-related implementation tasks. These include provisioning data-storage services, ingesting streaming and batch data, transforming data, implementing security requirements, implementing data-retention policies, identifying performance bottlenecks, and accessing external data sources.
There are three main types of data: structured, semi-structured, and unstructured.
Structured data is data that adheres to a strict schema that defines field names, data types, and the relationship between objects. It is commonly known as relational data.
Semi-structured data does not conform to the formal structure of tables. It does, however, have structures associated with it, such as tags and metadata, which allow for records, fields, and hierarchies within the data. JSON and XML documents are examples of semi-structured data.
Unstructured data has no usable schema at all. Examples of unstructured data include video and image files.
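To make the distinction between the three types concrete, here is a minimal Python sketch. The product records, field names, and placeholder bytes are invented for illustration and do not come from any Azure service.

```python
import json

# Structured: every row follows the same fixed schema (column names and order).
columns = ("id", "name", "price")
rows = [
    (1, "keyboard", 49.90),
    (2, "monitor", 199.00),
]

# Semi-structured: each JSON document carries its own field names, and
# documents in the same collection may differ in fields and nesting.
documents = [
    json.loads('{"id": 1, "name": "keyboard", "tags": ["usb", "wireless"]}'),
    json.loads('{"id": 2, "name": "monitor", "specs": {"size_in": 27}}'),
]

# Unstructured: raw bytes with no usable schema (hypothetical placeholder
# bytes standing in for the contents of an image file).
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])


def field_names(doc):
    """Return the sorted top-level field names of one document."""
    return sorted(doc.keys())


# The structured rows all share one shape; the documents do not.
semi_structured_fields = [field_names(d) for d in documents]
```

Every tuple in `rows` has exactly the fields listed in `columns`, whereas `semi_structured_fields` differs from one document to the next.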
Azure SQL Database is a fully managed platform as a service (PaaS) database engine. It is designed for cloud applications when near-zero administration and enterprise-grade capabilities are needed. It always runs on the most recent stable version of the SQL Server database engine and on a patched OS.
Azure SQL Database is the best solution to store relational data. It uses advanced query processing features, such as high-performance in-memory technologies and intelligent query processing.
Azure SQL Managed Instance (MI) is a cloud database service that combines the broadest SQL Server database engine compatibility with all the benefits of a fully managed PaaS model.
Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, brings together enterprise data warehousing and big data analytics. It combines the power of an analysis engine with Apache Spark, an integration engine (based on Azure Data Factory), and a user interface to create and manage projects.
NoSQL databases, used to store semi-structured data, allow for various types of data stores, including documents, columns, key-value pairs, and graphs.
An Azure storage account is the simplest form of storage in Azure. It can be set to Blob storage, Azure Data Lake Storage (ADLS) Gen1, or ADLS Gen2.
Azure Blob storage is optimized for storing massive amounts of unstructured data.
ADLS is designed for customers who require the ability to store massive amounts of data for big data analytics use cases.
ADLS Gen2 is meant for big data analytics. It is built on top of Azure Blob storage.
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database service. It can use several APIs to access data, including SQL, MongoDB, Cassandra, Gremlin, and Table.
One key concept in Cosmos DB is data distribution. You can configure Cosmos DB to replicate data geographically in real time. Applications can then access data that is stored nearest to them.
Azure Cosmos DB guarantees less than 10 ms latency at the 99th percentile for reads (indexed) and writes in every Azure region worldwide.
Data processing is the conversion of raw data to meaningful information. Depending on the type of data you have and how it is ingested into your system, you can process each data item as it arrives (stream processing) or store it in a buffer for later processing in groups (batch processing).
Batch processing is the processing of a large volume of data all at once. It is an extremely efficient way to process large amounts of data collected over a period of time.
Stream, or real-time, processing enables you to analyze data almost instantaneously as it streams from one device to another.
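The difference between the two processing styles can be sketched in a few lines of plain Python. This is an illustration only: the function names, the sample readings, and the choice of an average as the computation are assumptions, not part of any Azure API.

```python
from typing import Iterable, Iterator, List


def stream_process(readings: Iterable[float]) -> Iterator[float]:
    """Stream processing: emit a running average as each item arrives."""
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        yield total / count


def batch_process(readings: Iterable[float], batch_size: int) -> List[float]:
    """Batch processing: buffer items and compute one average per group."""
    averages: List[float] = []
    buffer: List[float] = []
    for value in readings:
        buffer.append(value)
        if len(buffer) == batch_size:
            averages.append(sum(buffer) / len(buffer))
            buffer = []
    if buffer:  # process the final, possibly smaller, group
        averages.append(sum(buffer) / len(buffer))
    return averages


readings = [10.0, 20.0, 30.0, 40.0]
streamed = list(stream_process(readings))        # one result per item
batched = batch_process(readings, batch_size=2)  # one result per group
```

The streaming version produces a result for every reading, while the batch version produces one result only after a full group has been collected.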
Lambda architecture is a specialized data-processing architecture that defines two paths where data flows from a source to a destination: a hot path and a cold path. The hot path is used for stream processing to analyze data while it flows through the pipeline. The cold path is used to store data as is for later recomputation or analysis.
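As a rough illustration of the two paths, the following Python sketch sends every incoming event down both a hot path and a cold path. The class name, the event shape, and the per-device count are hypothetical choices made for the example; a real system would use separate services for each path.

```python
from collections import defaultdict


class LambdaPipeline:
    """Toy Lambda-style pipeline: each event follows a hot and a cold path."""

    def __init__(self):
        self.cold_store = []                # cold path: raw events kept as is
        self.hot_counts = defaultdict(int)  # hot path: result updated in flight

    def ingest(self, event):
        self.cold_store.append(event)          # cold path: store unchanged
        self.hot_counts[event["device"]] += 1  # hot path: analyze immediately

    def recompute_from_cold(self):
        """Rebuild the per-device counts later from the stored raw events."""
        counts = defaultdict(int)
        for event in self.cold_store:
            counts[event["device"]] += 1
        return dict(counts)


pipeline = LambdaPipeline()
events = [
    {"device": "a", "temp": 20},
    {"device": "b", "temp": 22},
    {"device": "a", "temp": 21},
]
for event in events:
    pipeline.ingest(event)

hot_view = dict(pipeline.hot_counts)         # available while data flows
cold_view = pipeline.recompute_from_cold()   # recomputed from raw storage
```

Because the cold path keeps the raw events, a different analysis can be run over `cold_store` later without having planned for it at ingestion time.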
Kappa architecture enables you to build a streaming- and batch-processing system on a single technology. This means you can build a stream-processing application to handle real-time data, and if you need to modify your output, you can update your code and run it again in a batch manner.
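The idea of a single code path that can be rerun over the same data can be sketched as follows. The event log, the `run_job` helper, and the temperature conversions are assumptions made for this example rather than a reference implementation.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]


def run_job(log: Iterable[Event], transform: Callable[[Event], float]) -> List[float]:
    """The single processing path: apply `transform` to events in log order."""
    return [transform(event) for event in log]


# Append-only event log (hypothetical readings in degrees Celsius).
event_log: List[Event] = [{"temp_c": 0.0}, {"temp_c": 100.0}]


def to_celsius(event: Event) -> float:
    """Original logic: pass the reading through unchanged."""
    return event["temp_c"]


def to_fahrenheit(event: Event) -> float:
    """Updated logic: convert the reading to degrees Fahrenheit."""
    return event["temp_c"] * 9 / 5 + 32


# First run: the job handles events with the original logic.
live_output = run_job(event_log, to_celsius)

# Requirements change: update the code and replay the same log through the
# same job, now in a batch manner.
replayed_output = run_job(event_log, to_fahrenheit)
```

The same `run_job` function serves both the live case and the replay, which is the point of building on a single technology.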
Azure Stream Analytics is a real-time analytics and complex event-processing engine that is designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously.
Azure Stream Analytics can ingest data from Azure Event Hubs (including Azure Event Hubs for Apache Kafka), Azure IoT Hub, and Azure Blob storage.
Azure Data Factory (ADF) is a hybrid data-integration service that enables you to quickly and efficiently create automated data-driven workflows (called pipelines) without writing code.
Pipelines created using ADF can ingest data from disparate data stores and write plain or transformed data to a destination (or sink).
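ADF pipelines are authored without writing code, so the following Python sketch is not ADF itself. It only illustrates the flow described above: records are read from disparate sources, optionally transformed, and written to a sink. All function names and sample data are hypothetical.

```python
import csv
import io
import json


def read_csv_source(text):
    """Source: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_source(text):
    """Source: yield one dict per element of a JSON array."""
    yield from json.loads(text)


def normalize(record):
    """Transform: keep a common shape with a numeric amount."""
    return {"customer": record["customer"], "amount": float(record["amount"])}


def run_pipeline(sources, transform, sink):
    """Copy every record from each source through `transform` into `sink`."""
    for source in sources:
        for record in source:
            sink.append(transform(record))


csv_data = "customer,amount\nalice,10.5\nbob,3\n"
json_data = '[{"customer": "carol", "amount": 7}]'

sink = []  # destination: a list standing in for a real data store
run_pipeline(
    [read_csv_source(csv_data), read_json_source(json_data)],
    normalize,
    sink,
)
```

Passing a function that returns the record unchanged instead of `normalize` would correspond to writing plain rather than transformed data.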
Azure Databricks is a fully managed version of the popular open-source Apache Spark analytics and data-processing engine. It provides an enterprise-grade and secure cloud-based big data and machine learning (ML) platform.
Azure Databricks can be used with ML or real-time analysis applications because it goes deeper into ML features within Spark and provides a more comfortable developer experience.