Data Aggregation
- 8/5/2024
- We Are Dealing with a Lot of Data Here
- Obfuscating Results
- Distinct and Count
- Min, Max, Average, and Sum
- Bins, Percentages, and Percentiles
- Lists and Sets
After completing this chapter, you will be able to:
Perform common statistical analysis on data such as counting totals, distinct counts, and the first and last time an event takes place
Group your data by common time delimitations such as week, day, or hour
Visualize your dataset in various graph types
We Are Dealing with a Lot of Data Here
In the previous chapter, we stressed how critical it is to filter down the initial starting data to your desired dataset. There were many ways to do this: by time, by specific values in a column, and by when a specific value was not present. Despite being able to filter down millions of records to a subset you want to look at, you’re often left with, well, a lot of data—too much to deal with manually.
For example, let’s say you work at a 45,000-user company based in Chicago. You have large offices in New York, Atlanta, and Seattle. You also have smaller offices in New Orleans and Denver and a few international offices in London, Paris, and Tokyo. A phishing message is sent to all your users. It’s a very good message, and many of them fall for it. Your leadership team wants to know how many fell for it and which offices are impacted the most. You filter based on that specific message in the last 14 days and your heart drops; it’s 12,139.
Reporting on that number to your leadership team isn’t good enough. They need to know which office was most affected because the New York office has much of the finance team, and quarterly earnings will be posted in 10 days. The Chicago office is the home to the main research and development team. The Paris office is closing a strategic deal with a partner. Knowing which users at these locations are possibly compromised is critical because some parts of the business could suffer more impact if those compromises are not remediated quickly. With 12,139 users affected, that’s far too many to sort into regions manually.
In an attempt to reduce the dataset, you apply another filter to those locations, and the number drops to 7,013. However, in the sign-in logs, you notice that the same user is shown three times because multiple sign-ins have occurred. How do you determine if the user or the threat actor did those sign-ins? You also still have too many users to determine which region was hit the hardest.
Your leadership team needs to give a status update to the company’s senior leadership team. You have a few choices. First, you can just scroll down the list, trying to get a rough estimate based on what users you recognize. That is no way to make a critical and strategic decision. You can try exporting this data to another tool like Excel, where you can do additional deduplication filtering, but some data types don’t export cleanly, so many of your tools won’t work. So, to fully use the data export, more work must be done on those 7,013 records.
Or you can use another strength of KQL, data aggregation. In this chapter, we will show you how to answer these questions quickly and include much more information, such as the first and last time this was witnessed. You will turn your dataset into insights and actions. You can also convert them into one of the things managers love most: pretty charts. Many of the functions discussed in this chapter will be used as building blocks to answer questions like those in our scenario and many more!