Data Aggregation
- 8/5/2024
Lists and Sets
We’ve been returning lots of interesting data so far in our KQL journey. What if we needed to temporarily store it to do some additional processing? For example, let’s say when we returned all the UserAgent strings, we wanted to check them against a known set of known malicious user agents. Another scenario would be a compromised user account, and we want to be able to quickly determine all the unique applications they have accessed from the time of known compromise until we regained control of the account.
To be able to temporarily store some of these results or even create our own dataset, we’ll use a common programming concept called a dynamic array. We’ll cover more details of leveraging arrays in Chapter 3, “Advanced KQL Operators,” and Chapter 5, “Security and Threat Hunting,” but we’ll use two very common functions—lists and sets—to get you started.
Lists
A list is pretty simple. You’ll add items to the list either manually or as part of a summarize query. Let’s first create our own list manually. Again, we’ll cover this more in Chapter 5, “Security and Threat Hunting.” Here, we’re just looking at a simple example to get you started. Run the following query; your output will be similar to Figure 2-28:
let worldSeriesChampions = datatable (teamName: string, yearWon: int) [ "New York Yankees", 2000, "Arizona Diamondback", 2001, "Anaheim Angels", 2002, "Florida Marlins", 2003, "Boston Red Sox", 2004, "Chicago White Sox", 2005, "St. Louis Cardinals", 2006, "Boston Red Sox", 2007, "Philadelphia Phillies", 2008, "New York Yankees", 2009, "San Francisco Giants", 2010, "St. Louis Cardinals", 2011, "San Francisco Giants", 2012, "Boston Red Sox", 2013, "San Francisco Giants", 2014, "Kansas City Royals", 2015 ]; worldSeriesChampions | summarize mylist = make_list(teamName)
FIGURE 2.28 MLB World Series winners 2000–2015
Here, we can see the values—World Series winners from 2000 to 2015—inputted into this list. The New York Yankees and St. Louis Cardinals appear twice in the output. The list will store whatever is inputted, including multiple values of the same thing. But you can now manipulate this data as we’ve done throughout this chapter. Let’s group these winners by even and odd years. Update your query; the output should be similar to Figure 2-29.
let worldSeriesChampions = datatable (teamName: string, yearWon: int) [ "New York Yankees", 2000, "Arizona Diamondback", 2001, "Anaheim Angels", 2002, "Florida Marlins", 2003, "Boston Red Sox", 2004, "Chicago White Sox", 2005, "St. Louis Cardinals", 2006, "Boston Red Sox", 2007, "Philadelphia Phillies", 2008, "New York Yankees", 2009, "San Francisco Giants", 2010, "St. Louis Cardinals", 2011, "San Francisco Giants", 2012, "Boston Red Sox", 2013, "San Francisco Giants", 2014, "Kansas City Royals", 2015 ]; worldSeriesChampions | summarize mylist = make_list(teamName) by isEvenYear= yearWon % 2 == 0
FIGURE 2.29 MLB World Series winners 2000–2015, by even- or odd-numbered years
The San Francisco Giants sure seem to do well in even-numbered years. This data is just for fun but demonstrates you can input your own dataset and perform different aggregate techniques. Let’s go back to our built-in sample data and use a different function to make a list—the make_list_if() function. This will work similarly to the previous if functions we’ve seen throughout this chapter, where an expression evaluated as true will be added to the list. Run the following query; your output will be similar to Figure 2-30:
SigninLogs | where TimeGenerated > ago (14d) | summarize RiskLevels= make_list_if(RiskEventTypes_V2, RiskState == "atRisk") by AppDisplayName
FIGURE 2.30 Applications with associated sign-in risk events
If the RiskState of a sign-in had risk indicated by the atRisk value, we then added the RiskEventType to the list. We then summarized this by application. In the output, we can see Azure Portal, Microsoft Office 365 Portal, and Microsoft 365 Security and Compliance Center have risky signs taking place. The other apps did not, so no risk events were added to their lists, essentially null lists. Depending on what you are trying to determine, you might want to remove the duplicate values. In other words, you might want only to store the distinct values. For that, we’ll need to use sets.
Sets
The make_set() function works very similarly to the make_list, except it only stores the distinct values. Let’s rerun our previous World Series champions query, but instead of making a list, let’s make a set. The output should be similar to Figure 2-31.
let worldSeriesChampions = datatable (teamName: string, yearWon: int) [ "New York Yankees", 2000, "Arizona Diamondback", 2001, "Anaheim Angels", 2002, "Florida Marlins", 2003, "Boston Red Sox", 2004, "Chicago White Sox", 2005, "St. Louis Cardinals", 2006, "Boston Red Sox", 2007, "Philadelphia Phillies", 2008, "New York Yankees", 2009, "San Francisco Giants", 2010, "St. Louis Cardinals", 2011, "San Francisco Giants", 2012, "Boston Red Sox", 2013, "San Francisco Giants", 2014, "Kansas City Royals", 2015 ]; worldSeriesChampions | summarize myset = make_set(teamName) by isEvenYear= yearWon % 2 == 0
FIGURE 2.31 Distinct MLB World Series winners from 2000–2015, broken out by even- and odd-numbered years
Notice that each team only appears once in that set, whereas previously, the San Francisco Giants appeared multiple times in the even-year list. This is because only distinct values are stored.
The make_set_if() function works similarly to make_list_if(), but once again, it will only store distinct values. Let’s rerun our previous make_list_if() query but store it as a set instead. The output should be similar to Figure 2-32:
SigninLogs | where TimeGenerated > ago (14d) | summarize RiskLevels= make_set_if(RiskEventTypes_V2, RiskState == "atRisk") by AppDisplayName
FIGURE 2.32 Distinct risk event sign-ins per application
If you compare this to the previous list, you will see that each RiskEventType_v2 is only stored once. The RiskEventTypes_V2 produces results with multiple event types, so it might look like some of these events are repeating. They are not. If you look at the Microsoft Office 365 Portal risk levels, three distinct results exist between the brackets []:
First is a sign-in that has been flagged for unlikely travel.
A second risk event type shows when a sign-in has been flagged for unfamiliar features.
A third set of event types shows that a sign-in has been flagged for both unfamiliar features and unlikely travel.
We’ll use these distinct results in future chapters.