Note that diagnosing the query plan in MS SQL Server is not hugely different to examing Spark query plans. Some operations are conceptually the same.
Similarly, putting indexed columns into a function will remove any benefit they may bring just as they do for Spark's predicate pushdowns. In both, queries must be sargeable [Brent Ozar's blog]. SARGs are search arguments and "a sargable predicate is one of form (or which can be put into the form) column comparison-operator value". [SO]
Query Plans
"You shouldn't rely too much on cost percentages in execution plans" [SO] says SQL Server luminary, Paul White.
Heap Scan
Before we look at the scan aspect, what is a heap? "A heap is a table without a clustered index... One or more nonclustered indexes can be created on tables stored as a heap."
What's the advantage of a heap? "Because data is inserted without enforcing a strict order, the insert operation is usually faster than the equivalent insert into a clustered index."
There are many disadvantages to a heap, notably, "Do not use a heap when the data is frequently grouped together." [Microsoft]
Nested Loop Joins
"The database usually performs this matching step in a nested-loops join by indexed lookups on the join key, repeated for each row from the driving table" [1] The DB filters as it goes.
Hash Joins
"Based on table and index statistics, the cost-based optmizer estimates which of these two independent tables will retyurn fewer rows after discarding filtered rows. It chooses to has the complete results from that singe-table query...
"It then executes the larger query ... returning the driving rowset. As eadh rows exits this step, the database executes thte same hash function in its join key and uses the hash-function result to go directly to the corresponding hash bucket for the other rowset. When it reaches the right hash bucket, the database searches the tiny list of rows in that bucket to find matches."
The catch with this approach is you "hope those hash buckets fit entirely in memory, but if necessary, the database temporarily dedicates scracth disk space to hold the buckets... A large prehashed rowset could require unexpected disk scratch space, performing poorly and possibly even running out of space."
"It is the most memory-intensive of any of the joins" [SO].
Sort-merge Joins
Spark does exactly this. This is where it "reads the two tables independently but, instead of matching rows by hashing, it presorts rowsets on the join key and merges the sorted lists... Once the two lists are sorted, the matching process is fast but presorting lists is expensive and nonrobust." [1] For this reason, hash joins are preferred. They don't have this downside but have all the same advantages.
In the event of a nonrobust query, SQL Server may throw a error 701 "There is insufficient memory to run this query" [Microsoft docs].
Indices
You can see all the indexes by executing:
select * from sys.indexes
select * from sys.indexes
But remember that "cost-based optimizers often do a terrible job if they do not have statistics on all the tables and indexes involved in the query. It is therefore imperative to maintain statistics on tables and indexes reliably; this includes regenerating statistics anytime table volumes change much or anytime tables or indexes are rebuilt." [1]
Clustered vs. Non Clustered Indexes
Clustered indexes actually change where rows are stored on disk. As a result, "there can be only one clustered index per table" [Microsoft]. Non-clustered indexes by contrast are just pointers to the row data.
Adding a clustered index can be an intense operation that is measured in minutes or even hours. For instance, adding a clustered index to a table of 765k rows and about 20 columns that are dates and varchars (13 columns totally a size of about 1800) takes about 15 minutes on a 12 core Azure SQL Server. But this one index reduced the TotalSubtreeCost from c. 131k to 71k.
Bit-map indexes
"Each stored value of a bit-mapped index points to what amount to a list of yes/no bits that map to the whole list of table rows. These bit strings are ways to AND and OR together with other bit stroings of other bit-mapped indexes... The big catch is that such bit strings are expensive to maintain in sync with frequently changing table contents... Bit-mapped indexes work best for tables that are mostly read-only... The best scenario for success is precisely the data-warehouse scenario for which bit-mapped indexes were designed." [1]Columnstore
SQL Server seems to be stealing ideas from other big data tool as it now allows columnar storage. "Columnstore indexes are the standard for storing and querying large data warehousing fact tables." [Microsoft] Adding this to one of my tables made the cost drop two orders of magnitude... but the query still took over an hour before I killed it. I guess you should never judge a query by its cost [Brent Ozar].
[Aside: I eventually made the query work in a reasonable if not stellar duration by dropping a clustered index and having only the columnstore index, not the two together as I previously had.]
[Finally, some nice people on Discord gave some links to their favourite authors for all things SQL. They are Kimberly Tripp, Itzik Ben-Gan, Niko Neugebauer, and Paul White).
[1] SQL Tuning, Dan Tow
No comments:
Post a Comment