Prepare and serve data
- By Daniil Maslyuk, Stepan Resl, Johnny Winter
- 8/5/2024
The data preparation process should be done with the most efficient means possible, ensuring optimal usage of your Microsoft Fabric resources. In this sample chapter, you will learn to create objects in a lakehouse or warehouse, copy and transform data, and optimize performance. Skills 2.1 to 2.4 from Exam DP-600 are covered.
For data to be used to get answers, it needs to be modeled, prepared, cleaned, orchestrated, and provided so that the right users can get to it when they need it. As part of the preparation process, you often need to decide what transformation tool to use, whether to duplicate the data, how the partitions should be created, which transformations to use, and how to control the transformations to ensure the correct continuity of individual operations between individual states. At the same time, you want to undertake these steps via the most efficient means possible, ensuring optimal usage of your Microsoft Fabric resources.
Skills covered in this chapter:
Skill 2.1: Create objects in a lakehouse or warehouse
Skill 2.2: Copy data
Skill 2.3: Transform data
Skill 2.4: Optimize performance
Skill 2.1: Create objects in a lakehouse or warehouse
With data coming from various sources and in multiple formats, you need somewhere to store it so that it can then be processed into the appropriate form, format, and style of storage most suitable for its subsequent use. You need to get the data into a unified OneLake environment, which then makes it possible to use no-code, low-code, and even full-code transformations to process the data. Once the data is in OneLake, it is not duplicated in multiple places, because OneLake uses shortcuts to point to specific locations of data rather than creating additional instances of that data when it’s needed by other items.
Ingest data by using a data pipeline, dataflow, or notebook
Fabric provides three basic ways to retrieve data from existing storage and systems: data pipelines, dataflows, and notebooks. Each item offers a different working approach and targets different types of users.
Data pipelines
A data pipeline is an item from the Data Factory experience that acts as an orchestration component. It can run other items and services and be scheduled to run at specific times. To start creating a pipeline, select New > Data pipeline, give the pipeline a name, then select Create. In Figure 2-1, you can see the blank canvas of the data pipeline editor.
FIGURE 2-1 Blank data pipeline canvas in the data pipeline editor
Within the Fabric services, the Copy Activity feature uses the Data Movement service and allows you to get data from source systems, bring it into the Fabric environment, and save it. You can specify whether the data should be saved in its original format, first converted to another format, or written directly to Delta Parquet tables within Lakehouse Explorer. You can set up Copy Activity by selecting either:
Add pipeline activity (works on the canvas)
Copy data (launches a wizard)
Neither method requires code from you to retrieve the data and convert it to the desired format. In addition to lakehouses, Copy Activity can also work with and ingest data from other Fabric items, such as warehouses.
EXAMPLE OF HOW TO INGEST DATA TO LAKEHOUSE BY PIPELINE
To upload data, open or create a pipeline in the workspace. In this pipeline, you can ingest your data by following these steps:
Select Copy data from the blank canvas (Figure 2-2).
FIGURE 2-2 Quick actions appear only if the pipeline is empty.
Search and choose data sources, such as Azure Data Lake Storage Gen2 shown in Figure 2-3.
FIGURE 2-3 Filtered data sources
Select Create a new connection (Figure 2-4), fill in your URL, and sign in.
FIGURE 2-4 Create a new connection. Blank fields in the wizard preview require values in the proper format to help you.
Choose data to import. For example, Figure 2-5 shows a file named Sales selected.
FIGURE 2-5 CSV preview of a selected file
Select Lakehouse as a data destination. You can use the search feature to quickly find a specific data destination (Figure 2-6).
FIGURE 2-6 Searching for lakehouse in data destinations
Select the existing lakehouse or create a new one (Figure 2-7).
FIGURE 2-7 Creating a new lakehouse
Map data to columns in a table or create a new table and define column names and their data types (Figure 2-8).
FIGURE 2-8 Preview of column mapping
Select Save + Run.
The pipeline will immediately be queued among the items to be run. After a while, the status of the ongoing data migration and the operation’s result will be displayed.
Dataflows
An item that uses Power Query Online, DataFlow Gen2 allows you to use all existing data connectors, including a connection to on-premises data using an on-premises data gateway. Figure 2-9 shows how DataFlow Gen2 looks when you open it for the first time.
FIGURE 2-9 Empty canvas of DataFlow Gen2
While working with data, DataFlow Gen2 uses two additional items for staging: An automatically generated lakehouse serves as Staging Storage, and a warehouse serves as Staging Compute and can be used to write to the data destination. Figure 2-10 illustrates the process.
FIGURE 2-10 Dataflow engine schema
When data enters the dataflow engine, the mashup engine, depending on whether staging is enabled, can distribute the data to Staging Storage and Staging Compute or directly transform it and then save it to a destination. You can set data destinations within dataflows separately for each query. Your current destination choices are lakehouse, warehouse, Azure SQL database, and Azure Data Explorer (Kusto). If staging is disabled for a query, these staging items are not used and everything is calculated in memory (Figure 2-11), which on a smaller data sample can reduce the consumed capacity units (CUs) and, at the same time, improve speed. If you have a larger sample of data, or one requiring more transformations or even combining data from different sources, the impact can be precisely the opposite.
FIGURE 2-11 Disabled staging
Because dataflows use Power Query Online, you can create transformations using a graphical interface or directly with the M language in Advanced editor, Script view, or Step script.
EXAMPLE OF HOW TO INGEST DATA TO LAKEHOUSE BY DATAFLOWS
To open a new DataFlow Gen2:
Select the “Get data from another source” link in DataFlow Gen2.
Search for Azure Data Lake Storage Gen2 (Figure 2-12) and select it from the New sources section.
FIGURE 2-12 Choose a data source wizard
Create a new connection in a new window (Figure 2-13), select Next, and then select Create.
FIGURE 2-13 Connection settings
Filter the files. If you know the file names, you can filter by the Name column, as shown in Figure 2-14; otherwise, you can use the Folder Path column to select the data container/folder destination.
FIGURE 2-14 Filter using the Name column.
If you are selecting just one file, you can directly select the value [Binary] in the Content column, and Power Query will extract data for you. Otherwise, use the Combine icon next to the column name (two arrows pointing down), set the file origin if necessary, and select OK. Figure 2-15 shows a preview of the data.
FIGURE 2-15 Data preview
Prepare the data, select the plus icon (right corner) to add a data destination, and then choose Lakehouse (Figure 2-16).
FIGURE 2-16 Possible data destinations
Create your connection for all lakehouses, or use the one you already have.
Search for your lakehouse (Figure 2-17), and choose a table to insert data or create a new one.
FIGURE 2-17 Destination target
Disable Use automatic settings. Select Replace and Dynamic schema. Set data types of all inserted columns or select columns that will be used with selected data types as a schema for the new table. You can see the full settings in Figure 2-18.
FIGURE 2-18 Destination settings
Disable staging of the query by right-clicking a query in the Queries pane and toggling off Enable staging.
Publish the dataflow, and refresh it.
The table is created and filled with data as soon as the dataflow is refreshed. You can refresh your dataflow manually, or you can schedule regular refreshes. Because users often need to branch out individual transformation processes, scheduling all items separately can be problematic. That’s why the orchestration capabilities of a data pipeline are useful, as pipelines can also run dataflows.
Notebooks
Notebooks are items that are used primarily for creating Apache Spark jobs and machine learning experiments. A notebook itself does not allow you to perform data transformations using the UI. Instead, you must use one of the supported scripting languages:
PySpark (Python)
Spark SQL
Spark (Scala)
SparkR (R)
You can use these languages in individual code cells, which can be executed independently, regardless of their order, or run sequentially. If a notebook is started from a pipeline, for example, all cells are executed in order. Individual code cells can reuse the variables and outputs of previous cells, so it is possible to combine scripting languages to obtain a result. In addition to these languages, you can use Markdown cells for notes, and you can also create comments in code cells according to the rules of the chosen language.
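For example, the following is a minimal sketch of reusing outputs across cells, assuming a Spark DataFrame named df was created in an earlier cell (the view name sales_tmp is only illustrative):
# Cell 1 (PySpark): register the DataFrame as a temporary view
df.createOrReplaceTempView("sales_tmp")

# Cell 2 (PySpark): a later cell can query the view created above through Spark SQL
top_rows = spark.sql("SELECT * FROM sales_tmp LIMIT 10")
top_rows.show()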
Notebooks are extended with the Data Wrangler tool, which allows you to perform transformation and exploration operations on data using a graphical interface similar to Power Query. It currently allows editing data loaded as a pandas DataFrame or a Spark DataFrame.
Notebooks allow the use of many libraries, which are ready-made collections of code for the user. You can use three types of libraries:
Built-in These are pre-installed libraries for each Fabric Spark runtime, according to its settings. For specific details, consult “Apache Spark Runtimes in Fabric” at learn.microsoft.com/en-us/fabric/data-engineering/runtime.
Public These libraries are stored in public repositories like PyPI or Conda. Public libraries must be installed within individual notebook runs (as shown in the sketch after this list) or in advance in the runtime via a custom environment or workspace default environment.
Custom These are libraries created within the organization or provided by any developer. You can use .whl libraries for Python, .jar for Java, or .tar.gz for R.
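As a minimal sketch of the public-library scenario, a package from PyPI can be installed inline for the current notebook session (seaborn is used here only as an example package name):
# Install a public library from PyPI for the current notebook session only
%pip install seaborn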
You can use the code below to call pieces of libraries inline for notebook purposes. The first line imports the full library, and the second imports only specifically named functions from a library:
import {name-of-library} [as {user-defined-alias}]
from {name-of-library} import {name-of-function}
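For instance, here is a concrete version of this pattern, assuming the pandas library and the pyspark.sql.functions module, both of which are available among the built-in runtime libraries:
# Import the full library under a user-defined alias
import pandas as pd

# Import only specifically named functions from a library
from pyspark.sql.functions import col, explode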
Also, thanks to libraries, notebooks can get data from a large number of source locations and write it to many destinations. Notebooks use Fabric capacity for their operation, and the admin should monitor this use of capacity to prevent a possible shortage.
EXAMPLE OF HOW TO INGEST DATA TO A LAKEHOUSE BY A NOTEBOOK
Open a blank notebook, and follow these steps:
Insert the following code into the first cell:
azure_data_lake_gen2_name = "<name-of-your-ADLG2>"
blob_container_name = "<container-name>"
file_name = "<file-name>"
path = f'abfss://{blob_container_name}@{azure_data_lake_gen2_name}.dfs.core.windows.net/{file_name}'
Fill in the variables with your own values.
Create a new Code cell.
Insert the following code:
df = spark.read.format('csv').options(header='True',inferSchema='True').load(path)
Add a lakehouse by selecting Lakehouse in Explorer.
Select Add on the left of the window (Figure 2-19).
FIGURE 2-19 Preview of data in a notebook
In the popup, choose if you want to create a new lakehouse or use an existing one (Figure 2-20).
FIGURE 2-20 Add lakehouse options
Your decision in step 7 will stop the current Spark session. You need to confirm this by selecting the Stop now button.
Create a new Code cell, and insert the following code:
df.write.mode("overwrite").format("delta").saveAsTable('salesByNotebook')
Select Run all.
Once Run all is selected, the lakehouse will create a new table named salesByNotebook with the schema defined by the DataFrame. In addition to the function saveAsTable, you can use save; the input would then look like save('Tables/salesByNotebook'). There is often no need to overwrite all data stored in tables, so you can use mode('append') just to add new rows. If you want to save the data not as a table but as a file, you can use save('Files/<name-of-folder-for-files>'). The result would then look like:
df.write.format("csv").save("Files/SalesData") df.write.format("parquet").save("Files/SalesData")
Create and manage shortcuts
Shortcuts are objects in OneLake that point to other storage locations. They appear as folders in OneLake; any experience or service with access to OneLake can use them. OneLake shortcuts behave similarly to Microsoft Windows shortcuts. They’re independent objects from the target to which they are just pointing. If you delete a shortcut, the target remains unaffected. The shortcut can break if you move, rename, or delete a target path.
Shortcuts can be created in lakehouses or KQL (Kusto Query Language) databases, and you can use them as data directly in OneLake. Any Fabric service can use them without necessarily copying data directly from a data source. Shortcuts can be created as:
Table shortcut
File shortcut
Thanks to the ability to create shortcuts with data stored directly in OneLake, you can reuse data between lakehouses stored in different workspaces. These shortcuts can be generated from a lakehouse, warehouse, or KQL database. They can also provide data for notebook transformations or for other Fabric items.
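For example, a table shortcut surfaced in a lakehouse can be read in a notebook like any other Delta table. The following is a minimal sketch, where salesShortcut is a hypothetical shortcut name in the attached lakehouse:
# Read a table shortcut exactly as if it were a regular Delta table
df = spark.read.format("delta").load("Tables/salesShortcut")

# Or query it through Spark SQL when the lakehouse is the notebook's default
df = spark.sql("SELECT * FROM salesShortcut")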
EXAMPLE OF HOW TO CREATE A SHORTCUT INSIDE A LAKEHOUSE
You can create a shortcut if you own a lakehouse by following these steps:
Open Lakehouse Explorer.
Right-click a directory within the Explorer pane, or select the ellipsis icon that appears when you hover over the Tables or Files main folder.
Select New shortcut (Figure 2-21).
FIGURE 2-21 Creating a new shortcut
Select a source of data, such as Azure Data Lake Storage Gen2 (Figure 2-22).
FIGURE 2-22 Shortcut wizard
Fill in the connection settings.
Name the shortcut, and set the subpath to your data.
Select the new shortcut folder to preview the data (Figure 2-23).
FIGURE 2-23 Data preview
As you can see in Figure 2-24, shortcuts are indicated by a folder icon similar to the one used for Tables and Files but with an added link symbol. Because this symbol is attached to the original icon, you can recognize data connected as shortcuts. To delete a shortcut, select the ellipsis icon displayed after hovering over the shortcut name and select the Delete option.
FIGURE 2-24 Icon previews
Implement file partitioning for analytics workloads in a lakehouse
A crucial technique in data management, data partitioning involves dividing a large dataset into smaller, more manageable subsets known as partitions. Each partition holds a portion of the data, which can be stored and processed independently.
Partitions are represented as folders that contain Parquet files that all meet the same partition condition. A partition condition uses data in a selected column (or columns, because multiple partition columns are supported). Based on these values, partition folders are created with an exact name pattern: <partition-name>=<value>. For example, Figure 2-25 shows a preview of COUNTRY partitions. Note that a partition folder must contain at least one file; empty partition folders are automatically removed.
FIGURE 2-25 Deployed COUNTRY partitions
Not every column can be used as a partition column, because partition columns must have one of the following data types:
String
Integer
Boolean
DateTime
If a column contains empty values, one more partition with a condition equal to __HIVE_DEFAULT_PARTITION__ will be created.
Delta tables are also composed of files, so the same principle can be applied to them. However, of the ingestion options (Copy Activity within pipelines, DataFlow Gen2, and notebooks), only Copy Activity (and only for tables) and notebooks (for tables and files) currently allow you to create partitions.
EXAMPLE OF IMPLEMENTING PARTITIONS BY COPY ACTIVITY IN A PIPELINE
Open a new data pipeline in the same workspace as the lakehouse that will be used as the data destination, and then follow these steps:
Add a Copy Activity by selecting Add pipeline activity > Copy data. Under Source, select Sample dataset, select Browse, and then choose a dataset, such as NYC Taxi – Green (Parquet), as shown in Figure 2-26.
FIGURE 2-26 Inserting a sample dataset as a data store type
Under Destination, select a destination lakehouse.
Open Advanced options.
Enable partitions.
Add partition columns (Figure 2-27).
FIGURE 2-27 Enabling partitions and assigned columns from the data source
Select Run.
This run’s result will look the same in the Lakehouse Explorer as the run without partitions. The difference occurs when you select the created table’s ellipsis and select View files. The result will then look similar to Figure 2-28.
FIGURE 2-28 Implemented partitions on a table with a blank value
EXAMPLE OF IMPLEMENTING PARTITIONS USING FABRIC NOTEBOOKS
Create and open a new notebook in a workspace that also contains a lakehouse that can be used as a data destination, and then follow these steps:
Connect to the lakehouse with the data ingested by the previous pipeline.
Delete all default cells, and create one new Code cell.
Insert the following code:
df = spark.sql("SELECT * FROM nyc") df.write.partitionBy("vendorID","tripType","paymentType").mode("overwrite"). parquet("Files/nyc")
Select Run.
The function partitionBy from step 3 creates partitions based on the column names inserted, which in this case are vendorID, tripType, and paymentType. These appear both in Lakehouse Explorer (Figure 2-29) and a notebook’s Lakehouse Preview.
FIGURE 2-29 Preview of partitions in Lakehouse Explorer
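One benefit of this folder layout is that readers can target only the partitions they need. The following is a minimal sketch, assuming the Files/nyc folder written in the previous step:
from pyspark.sql.functions import col

# Read only the rows belonging to a single vendorID partition;
# Spark prunes the remaining partition folders instead of scanning them
df_vendor1 = spark.read.parquet("Files/nyc").filter(col("vendorID") == 1)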
Create views, functions, and stored procedures
Both Fabric lakehouses, through their SQL analytics endpoints, and warehouses support the creation of views, functions, and stored procedures. These three components are integral parts of a SQL database:
View A virtual table whose content is pre-defined by a query
Function A user-defined function that accepts parameters, performs an action, such as a complex calculation, and returns the result as a scalar value or a table
Stored procedure A block of T-SQL code stored in a database that the client can execute
All of them can be created using the SQL query editor, either directly in the SQL analytics endpoint or warehouse interface via a SQL query, or by using SQL Server Management Studio or Azure Data Studio. You can also create views and stored procedures using templates. To access them, hover over their respective folder, and then select the ellipsis icon that appears (Figure 2-30). Note that the Refresh option in the resulting menu refreshes only the preview of the data, not the data itself.
FIGURE 2-30 Quick options for creating a Stored Procedure
Views
You can create a view in two ways: using a SQL command in the SQL query editor or a visual query, which is a no-code option that uses Diagram view in Power Query and is shown in Figure 2-31.
FIGURE 2-31 Diagram experience with data preview of a visual query
You can open the entire Power Query Online window, but you risk using an operation that cannot be translated to SQL. Power Query provides you with a data preview during each transformation step, allowing you to navigate your data easily and see what is happening. If you only use operations that Power Query can convert to SQL, you can save your results by selecting the Save as view button. (If you use a nontranslatable operation, an information banner will immediately tell you.) The Save as view popup is shown in Figure 2-32.
FIGURE 2-32 SQL Preview in the Save as view
Another approach is to use the SQL query editor, where you can write and execute all your queries. These queries can be stored as personal queries, under My queries, or as Shared queries, which all users with access to that item (SQL analytics endpoint or warehouse) can see and potentially use if they have permission to execute SQL queries. This option provides a Save as view button next to the Run button. You can save any selected part of the code to create a new view; you can, therefore, test even more complex queries or run different queries simultaneously. When you find a specific part of the code that suits you and returns the correct results, you can create a view from it, as shown in Figure 2-33.
FIGURE 2-33 Selected T-SQL that will be used as a view
As with a visual query, you still need to set the view’s name (Figure 2-34). You can also take another look at the code that will be used.
FIGURE 2-34 T-SQL preview in Save as view window
Of course, you also have the option to create views directly with the CREATE VIEW statement, using the following syntax:
CREATE [ OR ALTER ] VIEW [ schema_name . ] view_name [ ( column_name [ ,...n ] ) ]
AS <select_statement>
[;]

<select_statement> ::=
    [ WITH <common_table_expression> [ ,...n ] ]
    <select_criteria>
Functions
Functions cannot be defined using a visual query, so you must use T-SQL directly within the SQL query option, using the following function syntax:
CREATE FUNCTION [ schema_name. ] function_name
    ( [ { @parameter_name [ AS ] parameter_data_type [ = default ] } [ ,...n ] ] )
RETURNS TABLE
    [ WITH SCHEMABINDING ]
[ AS ]
RETURN [ ( ] select_stmt [ ) ]
[ ; ]
Alternatively, you could use another tool, such as Azure Data Studio or SQL Server Management Studio.
Stored Procedures
Stored procedures cannot be created using a visual query either. T-SQL syntax must be used here as well:
CREATE [ OR ALTER ] { PROC | PROCEDURE } [ schema_name.] procedure_name
    [ { @parameter data_type } [ OUT | OUTPUT ] ] [ ,...n ]
AS
{ [ BEGIN ] sql_statement [;] [ ,...n ] [ END ] }
[;]
You can also create stored procedures using a shortcut that prepares the piece of code. To use this shortcut, follow these steps:
Open Warehouse Explorer.
Right-click the Stored Procedures folder, or select the ellipsis that appears after hovering over it.
Select New stored procedure.
Alternatively, to create the same template from a new SQL query in Warehouse Explorer:
Open Warehouse Explorer.
Expand the options next to New SQL query.
Select Stored procedure.
Both approaches create the code shown in Figure 2-35 as a new SQL query. You can then edit your code and prepare it to do exactly what you need.
FIGURE 2-35 Create a stored procedure template
Enrich data by adding new columns or tables
As you prepare the data, depending on the input scenario, it may sometimes already correspond to how it will need to look in the end. Often, though, additional columns or tables will need to be added and existing ones modified or removed. Microsoft Fabric, within its ingestion items and T-SQL, allows you to enrich the data and thus compile resulting views containing precisely what is needed.
Remember that Warehouse Explorer and the SQL analytics endpoint currently do not support ALTER TABLE ADD COLUMN within the lakehouse and warehouse data items. Therefore, extending tables with additional columns by T-SQL is not possible; instead, you must delete the table and create it again. Within Lakehouse Explorer, however, you can edit the Delta table schema using notebooks or Spark jobs.
The withColumn() and select() functions can extend existing DataFrames. The select() function allows you to define the columns you want to keep from the existing DataFrame. At the same time, it allows the selected columns to be renamed or retyped, or to have a function applied to them, such as explode(). It also allows you to select the same column twice, expanding the DataFrame with a new column. The function withColumn() returns a new DataFrame by adding a column or replacing an existing one with the same name.
Consider a few examples of using these functions:
df.select("id","name") # Will select only two columns from whole DataFrame df.select("*",(df.UnitPrice-df.ProductionPrice).alias('ProductMargin')) # Calculates new column based on two existing and will contain all previous columns df.withColumn('ProductMargin', df.UnitPrice – df.ProductionPrice) # Calculates new column based on two existingwithColumn('UnitPrice', df.UnitPrice + 10) # Replaces current column UnitPrice with new values
To modify the existing schema of existing tables using the schema of an extended DataFrame, you can use two parameters of the option() function:
mergeSchema Expands the existing schema with the schema of the DataFrame being written
overwriteSchema Overwrites the schema of the existing Delta table with the schema of the DataFrame being written
For example, you could use:
df.write.mode("overwrite").option("mergeSchema", "true").saveAsTable("sales") df.write.model("overwrite").option("overwriteSchema", "true").saveAsTable("sales")
As shown in the previous section, you can create a table directly from the data you ingest using Copy Activity, DataFlow Gen2, or notebooks. At the same time, you can create a new table using T-SQL within Warehouse Explorer:
Select New SQL query to expand it.
Select Table (Figure 2-36).
FIGURE 2-36 All current T-SQL template options in Warehouse Explorer
Edit the inserted code as shown in Figure 2-37.
FIGURE 2-37 Creating a table with T-SQL
Select Run to create the table.
Unfortunately, this same approach does not work within the SQL analytics endpoint in Lakehouse Explorer. If you want to create a table without data there, you can use PySpark with a similar syntax in Spark SQL. Changing from SQL data types to Spark data types is important; for example, use the STRING data type instead of VARCHAR (Figure 2-38).
FIGURE 2-38 Creating a table with PySpark
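As a minimal sketch of what such a notebook cell might resemble (the table and column names here are hypothetical; the actual code is shown in Figure 2-38):
# Create an empty Delta table in the lakehouse using Spark SQL;
# note STRING instead of VARCHAR and Spark-native data types
spark.sql("""
    CREATE TABLE IF NOT EXISTS salesTargets (
        TargetID INT,
        Region STRING,
        TargetAmount DECIMAL(18, 2),
        TargetDate DATE
    ) USING DELTA
""")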