Describe fundamental principles of machine learning on Azure

Machine learning (ML) is the current focus of AI in computer science. ML identifies and makes sense of the patterns and structures in data and uses those patterns in software for reasoning and decision making. ML uses past experiences to make future predictions.

ML allows computers to consistently perform repetitive and well-defined tasks that are difficult for humans to accomplish. Over the past few years, machine learning algorithms have proved that computers can learn tremendously complicated tasks and have demonstrated that ML can be employed across a wide range of scenarios and industries.

This chapter explains several machine learning algorithms, including clustering, classification, regression, deep learning techniques, and the Transformer architecture. It then explains how machine learning works in terms of organizing datasets and applying algorithms to train a machine learning model, and finally looks at the process of building a machine learning model and the tools available in Azure.

Skills covered in this chapter:

  • Skill 2.1: Identify common machine learning types

  • Skill 2.2: Describe core machine learning concepts

  • Skill 2.3: Identify core tasks in creating a machine learning solution

  • Skill 2.4: Describe capabilities of no-code machine learning with Azure Machine Learning

Skill 2.1: Identify common machine learning types

Machine learning requires lots of data to build and train models to make predictions and inferences based on the relationships in data. You can use machine learning to predict a new value based on historical values and trends, to categorize a new piece of data based on data the model has already seen, and to find similarities by discovering the patterns in the data.

Humans can often see patterns in small datasets with a few parameters. For example, consider Figure 2-1: a small set of data regarding students studying for this exam.

FIGURE 2-1 Sample data

You can probably see a pattern that shows studying more hours leads to a higher exam score and passing the exam. However, can you see a pattern between the students’ academic backgrounds and whether they pass or fail? Can you answer the question of how much completing the labs affects their score? What if you were to have more information about the student, and what if there were many more records of data? This is where machine learning can help.

Understand machine learning model types

The amount of data created by businesses, people, their devices, and applications in ordinary daily life has grown exponentially and will grow even more as sensors are embedded into machinery in factories, personal devices, and homes. The volume of data is now such that each of us can leverage it to improve the way we make decisions and how we operate.

When you decide that you want to use machine learning, one of the first decisions is which type of learning your model will use:

  • Supervised

  • Unsupervised

  • Reinforcement

The type of learning determines how your model will use data to determine its outcome.

Supervised learning

In supervised learning, the existing data contains the desired outcome, called a label in machine learning. The labeled value is the output you want your model to determine for new data. A label can either be a value or a distinct category.

The other data supplied as inputs to the model are called features. A supervised learning model uses the features and the label to train the model to fit the label to the features. After the model is trained, supplying it with the features for new data will predict the value, or category, of the label.

Use supervised learning when you already have existing data that contains both the features and the label.
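The split between features and label can be sketched with student data like that in Figure 2-1. The field names below are hypothetical stand-ins for the figure's columns, chosen only to illustrate the idea:

```python
# Each row describes one student; "score" is the label we want the model
# to learn to predict, and the remaining fields are the features.
students = [
    {"hours_studied": 10, "completed_labs": True,  "score": 620},
    {"hours_studied": 20, "completed_labs": True,  "score": 740},
    {"hours_studied": 5,  "completed_labs": False, "score": 560},
]

# Separate the feature columns (model inputs) from the label column (output).
X = [[s["hours_studied"], int(s["completed_labs"])] for s in students]
y = [s["score"] for s in students]

print(X)  # features the model trains on
print(y)  # labels the model learns to predict
```

A supervised learning algorithm is given both `X` and `y` during training; at prediction time it receives only new feature rows and returns the label.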

Unsupervised learning

In unsupervised learning, you do not have the outcome or label in the data. Instead, you use machine learning to determine the structure of the data and to look for commonalities or similarities in the data. Unsupervised learning separates the data based on the features.

Use unsupervised learning when you are trying to discover something about your data that you do not already know.

Reinforcement learning

Reinforcement learning uses feedback to improve the outcomes from the machine learning model. Reinforcement learning does not have labeled data. Reinforcement learning uses a computer program, an agent, to determine if the outcome is optimal or not and feeds that back into the model so it can learn from itself.

Use reinforcement learning when trying to model a task, such as building a model to play chess. This type of learning is commonly used in robotics.

Describe regression models

You have probably used regression in school to draw a best-fit line through a series of data points on a graph. Using the data from Figure 2-1, the hours studied are plotted against the exam scores in Figure 2-2.

FIGURE 2-2 Regression graph

Regression is an example of supervised machine learning where the features (in this case, the hours studied) and the label (the exam score) are known and are both used to make the model fit the features to the label.

This graph is a very simple example of linear regression with just one feature. Regression models in machine learning can have many features, and algorithms other than simple linear regression can be used.

Regression is used to predict a numeric value using a formula that is derived from historic data. Regression predicts continuous values, not distinct categories. For Figure 2-2, the model generated the formula y = 9.6947x + 545.78, which implies that every hour of studying increases the exam score by almost 10 points. Suppose you asked the model how many hours a student should study to pass the exam. It would predict 16 hours of studying would achieve a passing score of 700. (Microsoft exams require a score of at least 700 to pass.)
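As a quick sketch, the fitted line can be evaluated directly and inverted to find the hours needed for a passing score. The coefficients come from the formula quoted in the text; the underlying data points are not reproduced here:

```python
import math

# Regression formula from Figure 2-2: y = 9.6947x + 545.78
slope, intercept = 9.6947, 545.78

def predict_score(hours):
    """Predicted exam score after studying for `hours`."""
    return slope * hours + intercept

def hours_to_pass(passing_score=700):
    """Invert the line: whole hours of study needed to reach the passing score."""
    return math.ceil((passing_score - intercept) / slope)

print(hours_to_pass())  # 16, matching the prediction in the text
```

Solving 700 = 9.6947x + 545.78 gives x ≈ 15.9, which rounds up to the 16 hours stated above.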

However, this is when you need to start considering how the data can affect the model. If a student who studied for 30 hours and scored 650 was added to the data, the regression formula would change to y = 6.7243x + 579.49, as shown in Figure 2-3.

FIGURE 2-3 Regression graph with additional data

With this change to the model, the result changes to 18 hours of studying required to pass the exam. In machine learning, one of the major concerns is how data can bias the model (for more detail see the “Bias” section later in this chapter).

Describe classification models

Classification machine learning models are used to predict mutually exclusive categories, or classes. Classification uses labeled data to learn how to classify new data and is an example of supervised machine learning.

Classification is used to make predictions when you do not require continuous values but need distinct categories, such as Pass or Fail.

Using the data from Figure 2-3, you could build and train a classification model to use the hours studied to predict whether a student passes the exam or not. Using this data, a simple two-class model would likely predict that someone studying for less than 18 hours will fail and 18 hours or more will pass.

In a classification model, you can compare the actual labels with the prediction from the model, as shown in the table in Figure 2-4.

FIGURE 2-4 Classification model

You can see that the classification model correctly predicts all but two of the results. If the model predicts a pass and the actual is a pass, this is a true positive. If the model predicts a fail and the actual is a fail, this is a true negative.

To refine a classification model, you need to understand where the model gets it wrong. For example, the model predicts a pass for Student L, but the actual result was a fail—this is a false positive. Student E actually passed, but the model predicted that the student will fail—this is a false negative.
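These four outcomes can be counted directly from the actual and predicted labels. The pass/fail pairs below are illustrative, not the exact table from Figure 2-4:

```python
# Hypothetical (actual, predicted) pairs, where True = pass, False = fail.
results = [
    (True, True),    # true positive: predicted pass, actually passed
    (True, False),   # false negative: predicted fail, actually passed
    (False, False),  # true negative: predicted fail, actually failed
    (False, True),   # false positive: predicted pass, actually failed
    (True, True),
    (False, False),
]

tp = sum(1 for actual, pred in results if actual and pred)
tn = sum(1 for actual, pred in results if not actual and not pred)
fp = sum(1 for actual, pred in results if not actual and pred)
fn = sum(1 for actual, pred in results if actual and not pred)

print(tp, tn, fp, fn)  # 2 2 1 1
```

These four counts form the confusion matrix used to evaluate and refine a classification model.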

Describe clustering models

Clustering machine learning models learn by discovering similarities, patterns, and relationships in the data without the data being labeled. Clustering is an example of unsupervised learning, where the model attempts to discover structure in the data or tell you something about the data that you didn’t know.

Clustering analyzes unlabeled data to find similarities in data points and groups them together into clusters. A clustering algorithm could be used, for example, to segment customers into multiple groups based on similarities in the customers’ details and history.

A clustering model predicts mutually exclusive categories, or classes. K-means clustering is a common clustering algorithm, where K is the number of distinct clusters you want the model to group the data into. Clustering works by calculating the distance between each data point and the center of each cluster and then adjusting the cluster assignments and centers to minimize the distance of each data point to the center of its cluster.
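The distance-minimizing loop just described can be sketched in plain Python. This is a minimal illustration of k-means (fixed initial centers, no handling of empty clusters), not production code:

```python
def kmeans(points, k, iterations=10):
    # Start with the first k points as initial cluster centers
    # (a real implementation would choose them randomly).
    centers = points[:k]
    for _ in range(iterations):
        # Assign each point to the nearest center (minimize squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Move each center to the mean of the points assigned to it.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            for cluster in clusters if cluster
        ]
    return centers

# Two obvious groups of 2-D points; k-means finds one center near each group.
data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9)]
print(sorted(kmeans(data, k=2)))
```

Each iteration alternates between assigning points to their nearest center and recomputing centers, which is exactly the distance-minimization process described above.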

Consider the sample data but assume no one has taken the exam yet, so you do not have the scores or pass/fail. Instead, you have unlabeled data. To determine if a relationship exists between the background of the students and the hours studied, you could plot these as shown in Figure 2-5.

FIGURE 2-5 Clustering data

If you were to create a clustering model with K = 3, then it might group the data into the three clusters—A, B, and C—as shown in Figure 2-6.

A common example of a clustering model is the recommender model, shown in Figure 2-6 for a company that wants to generate personalized, targeted recommendations for its users. A recommender model looks for similarities between customers. For example, a video streaming service knows which movies customers watch and can group customers by the types of movies they watch. A customer can then be shown other movies watched by customers with similar viewing histories.

FIGURE 2-6 Clustering model

Describe deep learning techniques

Deep learning is a machine learning technique inspired by the way the human brain learns. It is based on artificial neural networks, which are designed to emulate the behavior of biological neurons. Both biological and artificial neurons are connected to other neurons in multilayered structures, and these connections propagate signals. Whereas a biological neuron activates and sends signals to connected neurons when stimulated, a neuron in a neural network applies a mathematical function to an input value, combined with a weight, and passes the result through an activation function to decide whether to propagate it to the next layer.

Complex tasks such as natural language processing and image recognition are made possible by deep learning techniques, although they can also be used to solve traditional machine learning problems like classification and regression.

Just like supervised learning models, deep learning models learn from labeled data by finding a function that, given input features (x), can predict the corresponding output label (y). What makes deep learning different is the complexity of this learned function, which results from nesting the functions computed by the network’s layers; this is why deep learning often outperforms traditional approaches when large amounts of data and computational resources are available.

During training, the model takes input data and passes it through the network to produce a prediction (ŷ), compares this with the known true label (y), and calculates an error based on their difference (measured by a loss function). Then, through a process called backpropagation, the model updates the weights associated with each neuron to minimize this loss. Figure 2-7 illustrates this training mechanism.

FIGURE 2-7 Deep learning model training
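The forward pass, loss calculation, and weight update described above can be sketched for a single neuron with one weight. This toy example, which assumes a squared-error loss and plain gradient descent, learns the relationship y = 2x:

```python
# Toy model with a single weight: y_hat = w * x, trained to learn y = 2x.
# Squared-error loss L = (y_hat - y)^2 has gradient dL/dw = 2 * (y_hat - y) * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (feature x, label y) pairs

w = 0.0    # initial weight
lr = 0.05  # learning rate
for _ in range(200):                # training epochs
    for x, y in data:
        y_hat = w * x               # forward pass: make a prediction
        grad = 2 * (y_hat - y) * x  # backpropagation: gradient of the loss
        w -= lr * grad              # update the weight to reduce the loss

print(round(w, 3))  # converges to 2.0
```

A real network repeats this same cycle over millions of weights across many layers, with backpropagation applying the chain rule to compute each weight’s gradient.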

To better understand how deep learning works, consider a deep learning classification model trained to classify the species of a bird based on such quantifiable features as color, wing length, and body weight. Each observation of a bird is represented as a feature vector with a numerical value for each characteristic considered and is passed into the input layer of the neural network.

Each neuron of the input layer computes a weighted sum of the inputs using weights learned during training (w) and passes the result to the activation function, which determines whether it meets the threshold to be passed to the next layer. Repeating this process, data transformation at each layer is propagated through the network to the output layer.

To determine the bird class, when the data reaches the output layer of the neural network, a function such as SoftMax is applied to convert the layer’s results into a set of probabilities, one for each possible category. The class with the highest probability is selected as the prediction.
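The SoftMax step converts raw output-layer scores into probabilities that sum to 1. A minimal sketch, using three hypothetical bird classes and made-up scores:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, then normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw output-layer scores for three hypothetical classes.
classes = ["sparrow", "robin", "finch"]
probs = softmax([2.0, 1.0, 0.1])
prediction = classes[probs.index(max(probs))]
print(prediction)  # "sparrow" — the class with the highest probability
```

The probabilities always sum to 1, so the output can be read directly as the model’s confidence in each class.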

Describe features of the Transformer architecture

Another breakthrough in the evolution of machine learning, especially in natural language processing, has been the development of Transformer models. These models form the underlying structure of today’s large language models (LLMs). What makes Transformers so revolutionary is their ability to model semantic relationships between words and their consequent ability to generate text that is hard to distinguish from human writing.

Transformer architecture is based on two main multilayered networks: an encoder and a decoder.

The encoder receives text as input, breaks it into tokens (which may be words or parts of words), and generates vector representations of the input text known as embeddings.

Embeddings capture the meaning of every token in relation to its use. In this high-dimensional semantic space, words with similar meanings are represented by vectors that point in similar directions. Cosine similarity is usually used to determine how similar two words are in this semantic space.
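Cosine similarity between two embedding vectors can be computed directly. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "dog" and "puppy" point in similar directions,
# "car" points elsewhere in the semantic space.
dog   = [0.8, 0.3, 0.1]
puppy = [0.7, 0.4, 0.1]
car   = [0.1, 0.1, 0.9]

print(cosine_similarity(dog, puppy) > cosine_similarity(dog, car))  # True
```

Values close to 1 mean nearly parallel vectors (similar meanings), while values near 0 mean unrelated meanings.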

The decoder uses these embeddings to generate new sequences of text, predicting one token at a time to form coherent responses. Figure 2-8 provides a simplified schematic illustration of the process.

FIGURE 2-8 Basic schema of encoder and decoder

The reason why the Transformer architecture outperforms earlier machine learning models for natural language processing, such as RNNs (recurrent neural networks) and LSTMs (long short-term memory networks), is the attention mechanism. The attention mechanism captures the relationships and dependencies between tokens within a sentence, regardless of their distance from each other. This enables Transformers to learn semantic rules more deeply than previous neural networks did. In modern architectures, the so-called self-attention mechanism is used, which computes how much each token in the input sequence should attend to (or influence) every other token (including itself) to build a contextualized representation of meaning.
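At its core, self-attention computes a weighted mix of all token embeddings, with weights derived from each token’s similarity to every other token. A stripped-down sketch follows; it omits the learned query/key/value projections and multi-head structure of a real Transformer:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Each output vector is a weighted average of ALL input vectors,
    weighted by scaled dot-product similarity — so every token attends
    to every other token, regardless of distance."""
    d = len(embeddings[0])
    outputs = []
    for q in embeddings:
        # Attention scores: similarity of this token to every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)  # attention weights sum to 1
        # Contextualized representation: weighted sum of all embeddings.
        outputs.append([sum(w * v[i] for w, v in zip(weights, embeddings))
                        for i in range(d)])
    return outputs

# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Each row of `contextual` now reflects not just its own token but the influence of every other token in the sequence, which is what “contextualized representation” means.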

Both the encoder and the decoder use attention layers.

In the encoder block, the attention mechanism is used when generating embeddings. It allows the creation of vectors that account for all the relationships and dependencies within the input text, regardless of the distance between words. This context-sensitive approach allows the same word to have different embeddings depending on the context in which it is used. To further improve embedding generation, each token embedding is enriched with a positional encoding, representing the position of the token in the sequence.

In the decoder block, the attention mechanism is used in the process of predicting the next token in a sentence. It evaluates the previous tokens’ embeddings to determine the most influential tokens for the prediction of the next token. The attention layer assigns weights to each token (precisely, to each token’s embedding), generating an attention score that influences the prediction of the next token. During training, the model attempts to predict the next token based on the past tokens.

In practice, a technique called multi-head attention applies multiple attention mechanisms in parallel on different learned projections of each token’s embedding, allowing the model to capture diverse contextual relationships.

The output of each step is passed through a SoftMax layer to produce a probability distribution over the vocabulary, from which the most likely next token is sampled or selected. This process repeats, using the previously generated tokens as input to generate the sequence one token at a time, and continues until the model generates an EOS (end-of-sequence) token, indicating that the sequence is complete.

As in any neural network, the prediction of the output is compared to the actual value, and the loss is calculated. The weights are then incrementally adjusted to reduce the loss and improve the model. Through this training, the model learns not only individual word meanings but also how meanings shift depending on context.

Transformers’ power lies in the combination of massive training data, the use of embeddings to capture semantic relationships, and the attention mechanism to model contextual meaning. This is what enables LLMs like GPT-4 to produce human-like responses.