Introduction to columnar databases
VertiPaq is an in-memory columnar database. Being in-memory means that all of the data handled by a model reside in RAM, and it is an easy concept to learn. We briefly introduced the concept of a columnar database in Chapter 5, “Understanding CALCULATE and CALCULATETABLE.” Now it is time to perform a deeper analysis of it.
We think of a table as a list of rows, where each row is divided into columns. Let’s take, as an example, the Product table in Figure 13-1.
FIGURE 13-1 The figure shows the Product table, with four columns and nine rows.
If you think of a table as a set of rows, then you are using the most natural visualization of a table structure, also known as a row store. In a row store, data is organized in rows and, when stored in memory, you might think that the value of the column Name in the first row is adjacent to the columns ID and Color in the same row. On the other hand, the column Name in the second row is slightly farther from Name in the first row because, in the middle, there are Color, Unit Price in the first row, and ID in the second row.
For example, if you need to compute the sum of Unit Price, you have to scan the entire table, reading many values that you are not interested in seeing. You can imagine that scanning the memory of the database sequentially: in order to read the first value of Unit Price, you have to read (and skip) ID, Name, and Color of the first row. Only then you will find an interesting value. The same process repeats for all the rows: You need to read and ignore many columns to find the interesting values to sum.
Reading and ignoring values takes time. In fact, if we asked you to compute the sum of Unit Price, you would not follow that algorithm. Instead, as a human being, you would probably scan the first row searching for the position of Unit Price, and then move your eyes vertically, reading only the values one at a time and mentally accumulating their values to produce the sum. The reason for this very natural behavior is that you save time by reading vertically instead of on a row-by-row basis.
In a columnar database, data is organized in such a way to optimize vertical scanning. To obtain this result, you need a way to make the different values of a column adjacent one to the other. In Figure 13-2 you can see the same Product table as organized by a columnar database.
FIGURE 13-2 The Product table organized on a column-by-column basis.
When stored in a columnar database, each column has its own data structure, and it is physically separated from the others. Thus, the different values of Unit Price are adjacent one to the other and distant from Color, Name, and ID.
On this data structure, computing the sum of Unit Price is much easier, because you immediately go to the column containing the Unit Price and then find, very near, all the values you need to perform the computation. In other words, you do not have to read and ignore other column values: In a single scan, you obtain only useful numbers and you can quickly aggregate them.
Now, imagine that, instead of asking you the sum of Unit Price, we asked you to compute the sum of Unit Price for only the Red products. Try that before you continue reading; it will help in understanding the algorithm.
This is not so easy anymore, because you cannot obtain such a number by simply scanning the Unit Price column. What you probably did is a scan of the Color column and, whenever it was Red, you grabbed the corresponding value of Unit Price. At the end, you summed up all the values to compute the result. This algorithm, although very natural, required you to constantly move your eyes from one column to another, possibly guiding your finger to keep the last scanned position of Color. It is definitely not an optimized way of computing the value! A better way, that only computers use, is to first scan the Color column, find the row numbers where the color is Red and then, once you know the row numbers, scan the Unit Price column summing only the rows you identified in the previous step.
This last algorithm is much better, because it lets you perform one scan of the first column and one scan of the second column, always accessing memory locations that are adjacent one to the other (apart from the jump between the scan of the first and second column).
For a more complex expression, such as the sum of all products that are either Blue or Black Console with a price higher than USD 50, things are even worse. This time, you have no chance of scanning the column one at a time, because the condition depends on many columns. As usual, if you try it on paper, it helps in understanding it better.
The simplest algorithm to produce such a result is to scan the table not on a column basis but, instead, on a row basis. You probably scanned the table row by row, even if the storage organization is column by column. Although it is a very simple operation when executed on paper by a human, the same operation is extremely expensive if executed by a computer in RAM, because it requires a lot of random reads of memory, leading to a worse performance than if computed doing a sequential scan.
Columnar databases provide very quick access to a single column but, as soon as you need a calculation that involves many columns, they need to spend some time—after having read the column content—to reorganize the information in such a way that the final expression can be computed. Even if this example was a very simple one, it is already very useful to highlight the most important characteristics of column stores:
- Single-column access is very fast, because it reads a single block of memory, and then it computes whatever aggregation you need on that memory block.
- If an expression uses many columns, the algorithm is more complex because it requires the engine to access different memory areas at different times, keeping track of the progress in some temporary area.
- The more columns you need to compute an expression, the harder it becomes to produce a final value, up to a point where it is easier to rebuild the row storage out of the column store to compute the expression.
Column stores aim to reduce the read time. However, they spend more CPU cycles to rearrange the data when many columns from the same table are used. Row stores, on the other hand, have a more linear algorithm to scan data, but they result in many useless reads. As a general rule, reducing reads at the cost of increasing CPU usage is a good deal, because with modern computers it is always easier (and cheaper) to increase the CPU speed versus reducing I/O (or memory access) time.
Moreover, as you learn in the next sections, columnar databases have more options to reduce the amount of time spent scanning data, that is, compression.