Columnstore Indexes
Columnstore indexes are used for reporting purposes especially when dealing with large amounts of data. They are based on the concept of columnar databases whereby the data is stored in a column together rather than storing it in a row together. Each column is stored independently, but the rows in the table are kept in the same order in each segment.
This format is suited best when you only need a few columns from the table but need a large percentage of the rows of the table. As an example of the following query SELECT SUM(COL1) FORM TableName; would only need to scan Col1 and would never need to touch Col2, Col3 etc…
Columnstore indexes are not ordered and thus are not useful for picking only a few rows out of a table, another limitation is that there are several data types that are not supported.
- Varchar(max) and nvarchar(max)
- Rowversion (also known as timestamp)
- Sql_variant
- CLR based types (hierarchyid and spatial types)
- Xml
With the introduction to SQL Server 2016 you can now apply columnstore indexes to your OLTP tables for real-time analytics. While the maintenance of theses indexes is more costly and complex compared to rowstore indexes the benefit gained is that one columnstore index can replace all of your rowstore indexes that you have applied for analytics.
A rowgroup is a group of rows that are compressed into columnstore format at the same time, the maximum rows that each row group usually contains is 1,048,756, broken down into segments that are all ordered physically the same, though in no logical order. Each rowgroup contains one column segment for every column in the table, each column segment is compressed together and stored on physical media.
As part of compression values that are duplicated over and over are replaced with a smaller value to look up the actual value. Each of the segments have information stored about the values to let the query processor know if the segment can be skipped in processing. If you have a WHERE clause that is looking for Col1 > 200 and the max value is a 100 the segment can be skipped.
The deltastore is used when modifying the data in a table with a columnstore index applied, when new rows are inserted they are added to the delatstore in a heap structure until the rows have been compressed in the deltastore and moved into compressed row group in column segments.
DELETE operations simply mark the row as removed from the column segment informing the query processor to ignore the row, with UPDATE operations the process is the same as for INSERT and DELETE.
The process that moves rows from the deltastore to compressed rowgroups is called the tuple mover. It is a background process that runs periodically and does most of the maintenance for your columnstore index, depending on how you use this table you will need to do some maintenance tasks as well.
Useage of Columnstore Indexes
When creating a clustered columnstore index the base rows of the table are compressed, removing the rowbased storage altogether. With a Non-clustered columnstore index leaves the heap or clustered rowstore index and adds a separate compressed structure. Columnstore indexes support two scenarios, one of which works with each type of index.
- Dimensional formatted data warehouses (Clustered Columnstore Indexes)
- Analytics on OLTP tables (Nonclustered Columnstore Indexes)
Create Columnstore Index
To Create a clustered columnstore index on a table use the command:
1 |
CREATE CLUSTERED COLUMNSTORE INDEX [<INDEX_NAME_HERE>] ON [<TABLE_NAME_INCLUDING SCHEMA>] |
As the the above example name implies this is the command to create a clustered columnstore index which in turns changes the internal structure of the able that it’s created on to be columnar structure. If you still had rowstore indexes on this table then the row locator of the rowstore index has been changed from the physical location in the heap, to the position in the columnstore structure (the row group, and the position in the row group). This was in simple terms the process is a lot more complex that what I have just described.
In nearly all data warehousing application the clustered columnstore is a useful structure for fact table where the table is large enough. With the main copy of the data being compressed you will have large space savings even up to 10 percent of the tables original size. With minimal changes being made to historical data this makes the clustered columnstore index the ideal choice, a non-clustered columnstore index would have to be chosen for the limitations of the data types such as nvarchar(max) and varchar(max).
Depening if a clustered columnstore index will be useful for dimensions tables or not will come down to how it is use. If the joins in your queries do not use a Nested Loop operator, there is a good chance it could be useful.
Another benefit that makes columnstore indexes better for queries for data warehousing is batch execution mode, when the query processor is scanning data in a columnstore index it’s possible for it to process rows in chunks of 900 rows at a time, rather than 1 row at a time like you would find in typical row execution mode.
Using non-clustered columnstore indexes on OLTP tables.
By applying a non-clustered columnstore index to a table that you wish to perform real time analytics on, you can enable tremendous performances with little additional query tuning. You can also tune depending on your concurrency needs as to how the columnstore index is maintained.
The command to create a non-clustered columnstore index is:
1 |
CREATE NONCLUSTERED COLUMNSTORE INDEX [<INDEX_NAME>] ON [<TABLE_NAME>](<COLUMNS LISTED HERE); |
Columnstore indexes have to be maintained in the same transaction with the modification statement just like normal indexes. However the modifications are done in a multi-step process that is optimized for the loading of the data. As already described all modification are done as an insert in the delta store, a delete from a column segment or the delta store, or both for an update to a row. The data is organised into compressed segments over time which is a burden on a busy system. In an OLTP system many rows can be updated multiple times soon after the rows are created, over time these rows become static for many systems.
With this there is a setting that lets you control the amount of time the data stays in the deltastore. This setting is called COMPRESSION_DELAY and it’s units is in minutes, this means that the data stays in the delta rowroup for at least a certain number of minutes.
1 |
CREATE NONCLUSTERED COLUMNSTORE INDEX [<INDEX_NAME>] ON [<TABLE_NAME>] () WITH (COMPRESSION_DELAY = 5); |
With the above example command this means that columns stay in the deltastore for 5 minutes allowing modifications to happen to a said column multiple times that is part of the index to be carried out in the deltastore, rather than wasting space in the compressed rowgroup being deleted and added over and over.