Coming Home

Almost 2 months ago I had my first day at Avanade. For those of you who don’t know, Avanade was created as a joint venture between Microsoft and Accenture. Avanade has its own business development streams, but 99.9% of the Microsoft projects Accenture wins are sent to the Avanade team for execution.

Well let me just say what an absolute joy it has been to come back to the Microsoft family of products. After 13 months of wasting my life away fighting with Open Source garbage, I’ve come home to integrated enterprise solutions that work as advertised or at least have some reliable sources for support when they don’t. I was actually told to stop blogging about how much the Open Stack is a waste of time and money… Anyway, that’s behind me.

To add to the good vibes, Avanade is connected to Microsoft in so many ways. We’ve actually had advance looks at new technologies before the rest of the community. There are 20+ MVPs in just the Midwest region, Avanade requires 80+ hours of training every year, and employees are encouraged to participate in developer community organizations.

I’m excited to talk about the first area of expertise they’d like me to look at, Avanade Touch Analytics (ATA). I haven’t completed the training yet, but this offering is fantastic. It has the easiest interface I’ve ever used to create dashboards that look and feel like Tableau or Spotfire, but perform light years ahead of both. Once the data sources are made available to the ATA server for any customer’s instance, the dashboards can be authored for or on any device. Switch between layout views to see how your dashboards will look on any device before releasing them. Publish multiple dashboards to different Active Directory security groups and let your users pick the information that’s important to them. It’s exciting, and I’m glad to see an offering addressing the shortcomings of the competition in hosted or onsite installations.

Well, that’s enough advertising. Now that my censorship is at an end, I’ll be blogging more often. I really want to discuss SQL Server’s memory resident database product, interesting things I’ve learned about the SSIS Service recently, and Service Broker.

I’m not a DBA, But I Play One on TV: Part 3 – Database Files

When a customer invites me to review their SQL Server or Oracle databases and server architecture, I start with the servers. I review the hard disk layout and a few server settings. The very next thing I do is review the data files and log files for the databases. In the case of SQL Server, when I see one data file and one log file in the same directory and the database has one file group called Primary, I know I am once again presiding over amateur hour at the local chapter of the Jr. Database Developer Wannabe Club.

 

One file pointing to one file group indicates to me:

  1. Someone went through the “create new database” wizard.
  2. There wasn’t any pre-development design analysis done before the database was created.
  3. No one bothered to check readily available best practices for SQL Server.
  4. I can anticipate equally uninformed approaches to table and index design and query authoring.
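
If you want to spot these databases before opening each one, a query along these lines can help. It is only a sketch: it assumes you have permission to read sys.master_files, and it skips the four system databases by assuming they hold database_id 1 through 4.

```sql
-- Count data files, log files, and filegroups for every user database.
-- A row showing 1 / 1 / 1 is the "create new database wizard" layout.
SELECT DB_NAME(mf.database_id) AS database_name,
       SUM(CASE WHEN mf.type_desc = 'ROWS' THEN 1 ELSE 0 END) AS data_files,
       SUM(CASE WHEN mf.type_desc = 'LOG'  THEN 1 ELSE 0 END) AS log_files,
       -- Log files have no filegroup, so only data files are counted here.
       COUNT(DISTINCT CASE WHEN mf.type_desc = 'ROWS' THEN mf.data_space_id END) AS filegroups
FROM sys.master_files AS mf
WHERE mf.database_id > 4   -- assumption: 1-4 are master, tempdb, model, msdb
GROUP BY mf.database_id
ORDER BY database_name;
```

The physical_name column in the same view shows whether the data and log files share a directory.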

 

This will antagonize the hardware striping advocacy group, but there are reasons to split up your data files and log files. Specifically in the case of TempDB, you can greatly improve performance by creating the same number of data files as you have processors. With this configuration, page allocations are spread across the files instead of every processor queuing up on a single file.

 

Check out number 8 here: http://technet.microsoft.com/en-US/library/cc966534
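
To compare what you have against that guidance, something like the following should do. It is a minimal sketch that only reads metadata; it assumes you have VIEW SERVER STATE permission for the first query.

```sql
-- Number of logical processors SQL Server sees.
SELECT cpu_count AS logical_processors
FROM sys.dm_os_sys_info;

-- Current TempDB files. size is stored in 8KB pages, so convert to MB.
SELECT name,
       type_desc,              -- ROWS = data file, LOG = log file
       physical_name,
       size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files
ORDER BY type_desc, file_id;
```

If the count of ROWS files is 1 and the processor count is not, that is the configuration the article is warning about.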

 

In addition to performance, recovery processes greatly benefit from splitting up the database files. Previously, if a data file failed, SQL Server would take the database offline whether everything was in one file or not. With SQL Server 2012 a new feature was added that will leave your database accessible, just not the data located in the corrupt or otherwise unavailable file. Of course, if all the data is in that one file, your database is down until you can recover. If that data file contains only a subset of the data in a table, the rest of the data in that table is still available for querying.

 

Now, you might say, OK, we’re going to have a separate file for every table and multiple files for some. I’ve seen that configuration and there isn’t anything wrong with it. If your IT department isn’t using SQL Server to manage their backups, and instead they’re backing up the actual files across all the drives, they’re going to be annoyed with you. However, this configuration gives you maximum flexibility. For instance, tables that are commonly used at the same time can be placed on different spindles so they don’t compete for disk I/O.
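
Here is roughly what that looks like at creation time. Treat it as a sketch: the database name, filegroup names, table, drive letters, paths, and sizes are all made up for illustration, and the folders would have to exist on your server.

```sql
-- Hypothetical database with the primary filegroup plus two user filegroups,
-- each on its own drive, and the log on a fourth drive.
CREATE DATABASE SalesDemo
ON PRIMARY
    (NAME = SalesDemo_primary, FILENAME = 'D:\SQLData\SalesDemo_primary.mdf',
     SIZE = 256MB, FILEGROWTH = 128MB),
FILEGROUP FG_Orders
    (NAME = SalesDemo_orders,  FILENAME = 'E:\SQLData\SalesDemo_orders.ndf',
     SIZE = 1GB, FILEGROWTH = 256MB),
FILEGROUP FG_History
    (NAME = SalesDemo_history, FILENAME = 'F:\SQLData\SalesDemo_history.ndf',
     SIZE = 1GB, FILEGROWTH = 256MB)
LOG ON
    (NAME = SalesDemo_log,     FILENAME = 'L:\SQLLogs\SalesDemo_log.ldf',
     SIZE = 512MB, FILEGROWTH = 256MB);
GO

USE SalesDemo;
GO

-- The ON clause decides which filegroup (and therefore which drive) holds the table.
CREATE TABLE dbo.Orders (
    OrderId   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    OrderDate datetime NOT NULL
) ON FG_Orders;
```

GO is the batch separator used by Management Studio and sqlcmd, not a T-SQL statement.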

 

Splitting up your log files is also beneficial. Log files are filled one after another. When one reaches the size you’ve set, SQL Server starts filling up the next, and eventually wraps back around to the first. Hopefully you have at least 4 and they are of a sufficient size. This gives you time to back up the transaction log between full backups, making sure no transactions are lost due to the log wrapping around before a backup has cleared out the completed transactions. Keep in mind that a log backup frees space inside the log for reuse; it does not shrink the file itself.
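
To keep an eye on how full the logs are getting between backups, these two read-only commands are a reasonable starting point. This is a sketch, not a monitoring solution.

```sql
-- Log size and percentage used for every database on the instance.
DBCC SQLPERF(LOGSPACE);

-- Why each log cannot currently reuse its space
-- (for example LOG_BACKUP means it is waiting on a log backup).
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
ORDER BY name;
```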

 

Next episode will cover backup basics. The purpose of all these posts is to provide the understanding to apply the best configuration to the database system you’re building.

 

I’m not a DBA, But I Play One on TV: Part 2 – CPU and RAM

In Part 1 I discussed SQL Server and Hard Disk configurations. Now let’s have a look at CPU and RAM. This topic is actually kind of easy. More is better… most of the time.

CPU

It’s my opinion that most development environments should have a minimum of four 2.5+ GHz processors. Whether that’s one socket with two cores, one socket with four cores, or two sockets with two cores doesn’t really make that much of a difference. For a low utilization production system you’ll need eight 2.5+ GHz processors. Look, you can get this level of chip in a mid-high grade laptop. Now if you’re looking at a very high utilization system, it’s time to think about 16 or 32 processors split up over 2 or more sockets. Once you get to the land of 32 processors, advanced SQL Server configuration knowledge is required. In particular you will need to know how to tweak the MAXDOP (Maximum Degree of Parallelism) setting.

Here’s a great read for setting a query hint: http://blog.sqlauthority.com/2010/03/15/sql-server-maxdop-settings-to-limit-query-to-run-on-specific-cpu/

And here are instructions for a system wide setting: http://technet.microsoft.com/en-us/library/ms189094(v=sql.105).aspx

What does this setting do? It controls the maximum number of processors SQL Server will use in parallel when servicing a single query. So why don’t we want SQL Server to maximize the number of parallel processes all the time? There is another engine involved in the process that is responsible for determining which processes can and cannot be done in parallel and the order of the parallel batches. In a very highly utilized SQL Server environment this engine can get bogged down. Think of it like air traffic control at a large airport… but there’s only one controller in the tower and it’s Thanksgiving, the biggest air travel holiday in the US. Well, the one air traffic controller has to assign the runway for every plane coming in and going out. Obviously, he/she becomes the bottleneck for the whole airport. If this individual only had one or two runways to work with, they wouldn’t be the bottleneck; the airport architecture would be. I have seen 32 processor systems grind to a halt with MAXDOP set at 0 because the parallelism rule processing system was overwhelmed.

For more information on the parallel processing process: http://technet.microsoft.com/en-us/library/ms178065(v=sql.105).aspx
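
For reference, the two ways of setting it look something like this. It is a sketch only: the values 8 and 2 are placeholders rather than recommendations, and the query against sys.objects is just there so the hint has something to attach to.

```sql
-- System-wide setting. 'max degree of parallelism' is an advanced option,
-- so advanced options have to be visible first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;   -- placeholder value; 0 means "use all"
RECONFIGURE;

-- Per-query hint. This overrides the system-wide setting for this statement only.
SELECT type_desc, COUNT(*) AS object_count
FROM sys.objects
GROUP BY type_desc
ORDER BY object_count DESC
OPTION (MAXDOP 2);                                  -- placeholder value
```

The hint is handy for testing a value against one troublesome query before changing the whole server.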

RAM

RAM is always a “more is better” situation. Keep in mind that if you don’t set the size and location of the page file manually, the O/S is going to try and take 1.5 times the RAM from the O/S hard drive. The more RAM on the system, the less often the O/S will have to utilize the much slower page file. For a development system 8GB will probably be fine, but nowadays you can get a mid-high level laptop with 16GB, and even 32GB is getting pretty cheap. For production 16GB is the minimum, but I’d really urge you to get 24GB. And like I said, 32GB configurations are becoming very affordable.

I’m not a DBA, But I Play One on TV: Part 1 – Hard Drives

This is the first in a series of posts relating to hardware considerations for a SQL Server 2008 R2 or later server. In Part 1 – Hard Drives I’m going to discuss RAID levels and what works for the Operating System (O/S) versus what works for various SQL Server components.

As a consultant I always go through the same hardware spec dance. It sounds like this:

Q: How much disk space does your application database require?

A: Depends on your utilization.

Q: Ok, what’s the smallest server we can give you for a proof of concept or 30 day trial?

A: Depends on your utilization.

Q: Well we have this VM with a 40 GB disk, 8 GB RAM, and a dual Core virtual processor available. Will that work?

A: Depends on your utilization, but I seriously doubt it.

SQL Server 2008 R2, depending on the flavor, will run on just about any Windows Server O/S from 2003 on, as well as Windows 7 and Windows 8. This isn’t really a discussion about the O/S, more of how the O/S services SQL Server hardware requests. At the hardware level the O/S has two main functions: managing memory and the hard disks, and servicing applications’ requests for those resources.

In a later post we’ll look at memory in a little more depth, but for the hard disk discussion we’ll need to understand the page file. The page file has been part of Microsoft’s O/S products since NT, maybe Windows for Workgroups, but I don’t want to go look it up. The page file is an extension of the physical memory that resides on one or more of the system’s hard disks. The O/S will decide when to access this portion of the memory available to services and applications (processes) requesting memory resources. Many times when a process requires more memory than is currently available, the O/S will use the page file to virtually increase the size of the memory on the system in a manner transparent to the requesting process.

Let’s sum that up. The page file is a portion of disk space used by the O/S to expand the amount of memory available to processes running on the system. The implication here is that the O/S will be performing some tasks meant for lightning fast chip RAM, on the much slower hard disk virtual memory because there is insufficient chip RAM for the task. By default the O/S wants to set aside 1.5 times the physical chip RAM in virtual memory disk space. For 16GB of RAM that’s a 24GB page file. On a 40GB drive that doesn’t leave much room for anything else. The more physical chip RAM on the server the bigger the O/S will want to make the page file, but the O/S will actually access it less often.
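
You can check the numbers for your own server from inside SQL Server (2008 and later). This is a rough sketch: the 1.5 multiplier is simply the default described above, the query needs VIEW SERVER STATE permission, and the commit limit column is RAM plus page file as reported by the O/S, not the page file alone.

```sql
SELECT total_physical_memory_kb / 1048576.0        AS physical_ram_gb,
       -- What a default 1.5 x RAM page file would cost in disk space.
       total_physical_memory_kb * 1.5 / 1048576.0  AS default_page_file_estimate_gb,
       -- Commit limit: physical RAM plus the page file currently configured.
       total_page_file_kb / 1048576.0              AS commit_limit_gb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
```

On the 16GB example above, the second column comes out at 24.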

Now let’s talk RAID settings! You may find voluminous literature arguing the case for software RAID versus hardware RAID. I’ll leave that to the true server scientists. I’m just going to give a quick list of which RAID configurations the O/S and SQL Server components will perform well with and which will cause issues. I’m going for understanding here. There are plenty of great configuration lists you can reference, but if you don’t understand how this stuff works you’re relying on memorization or constantly referencing the lists.

Summarization from: http://en.wikipedia.org/wiki/RAID

But this has better pictures: http://technet.microsoft.com/en-us/library/ms190764(v=SQL.105).aspx

RAID 0 – Makes multiple disks act like one, disk size is the sum of all identical disk sizes and there isn’t any failover or redundancy. One disk dies and all info is lost on all drives.

RAID 1 – Makes all the disks act like one, disk size is that of one of the identical disks in the array. Full fail over and redundancy.

RAID 2 – Theoretical, not used. Ha!

RAID 3 – Not very popular. Similar to RAID 0, except that the data is striped byte by byte across the disks and one dedicated disk holds parity information.

RAID 4 – Like RAID 3, but the data is striped in blocks rather than bytes. One dedicated drive holds the parity for all the others, and the array is accessed by one drive letter.

RAID 5 – Requires at least 3 identical drives. Data and parity are striped across all of them. You give up one disk’s worth of space to parity, and the array survives the failure of any one drive.

RAID 6 – Like RAID 5, except you need at least 4 identical disks, two disks’ worth of space goes to parity, and the array survives two drive failures.

RAID 10 or 1+0 – A tiered approach where two groups of RAID 1 arrays form a RAID 0 array. So two fully redundant RAID 1 arrays of 500GB made up of 3 500GB disks come together to form 1 RAID 0 array of 1TB. Sounds expensive, 3TB in physical disks to get 1TB accessible drive.
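
The capacity math is easy to play with. The snippet below is just arithmetic, using the six 500GB disks from the RAID 10 example; the mirror width of 3 matches that example (two mirrors of three disks each), and the formulas are the usual rules of thumb for usable space, ignoring controller overhead.

```sql
DECLARE @disks        int = 6;    -- total identical disks
DECLARE @disk_gb      int = 500;  -- size of each disk
DECLARE @mirror_width int = 3;    -- disks per RAID 1 mirror inside the RAID 10

SELECT raid_level, usable_gb
FROM (VALUES
    ('RAID 0',  @disks * @disk_gb),                    -- 3000: every disk holds data
    ('RAID 1',  @disk_gb),                             --  500: every disk is a copy
    ('RAID 5',  (@disks - 1) * @disk_gb),              -- 2500: one disk's worth of parity
    ('RAID 6',  (@disks - 2) * @disk_gb),              -- 2000: two disks' worth of parity
    ('RAID 10', (@disks / @mirror_width) * @disk_gb)   -- 1000: one disk's worth per mirror
) AS r (raid_level, usable_gb);
```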

At this point I’ll paraphrase the information found here: http://technet.microsoft.com/en-US/library/cc966534

SQL Server logs are written sequentially, one byte after the other. There aren’t any random or asynchronous read requests performed against these files by SQL Server. RAID 1 or 1+0 is recommended for this component for two reasons: 1. Having a full redundant backup of the log files for disaster recovery. 2. RAID 1 mirrored drives support the sequential write I/O (I/O is short for disk read and write Input and Output. I’m not going to write that 50 times.) of the log file process better than a RAID configuration that splits one file over multiple disks.

TempDB is the workhorse of SQL Server. When a query is sent to the database engine, much of the work of collecting, linking, grouping, aggregating and ordering happens in TempDB before the results are sent to the requestor. This makes TempDB a heavy write I/O process. So the popular recommendation is RAID 1+0. Here’s the consideration: TempDB is temporary, and that’s where it gets its name from. So redundancy isn’t required for disaster recovery. However, if the disk your TempDB files are on fails, no queries can be processed until the disk is replaced and TempDB restored/rebuilt. RAID 1+0 helps fast writes and ensures uptime. RAID 5 provides the same functionality with fewer disks, but decreased performance when a disk fails.

TempDB and the Logs should NEVER EVER reside on the same RAID arrays, so if we’re talking about a minimum of two RAID 1+0 arrays, it might be more cost effective to put TempDB on RAID 5.

Application OLTP (On-line Transaction Processing) databases will benefit the most from RAID 5, which equally supports read and write I/O. Application databases should NEVER EVER reside on the same arrays as the Log files and co-locating with TempDB is also not recommended.

SQL Server comes with other database engine components like the master database and MSDB. These are SQL Server configuration components and mostly utilize read I/O. It’s good to have these components on a mirrored RAID configuration that doesn’t need a lot of write performance, like RAID 1.

A best practice production SQL Server configuration minimally looks like this:

Drive 1: O/S or C: Drive where the virtual memory is also serviced – RAID 1, 80 to 100 GB.

Drive 2: SQL Server Components (master, MSDB, and TempDB) data files – RAID 1+0, 100-240 GB

Drive 3: SQL Server Logs – RAID 1+0, 100-240 GB

Drive 4: Application databases – RAID 5, As much as the databases need…

Where to skimp on a development system? Maybe RAID isn’t available either?

Drive 1: O/S or C: Drive where the virtual memory is also serviced, 80 to 100 GB.

Drive 2: SQL Server Components (master, MSDB, and TempDB) data files and Application database files, As much as the databases need…

Drive 3: SQL Server Logs, 100-240 GB

Optimal Production configuration?

Drive 1: O/S or C: Drive – RAID 1, 60 GB.

Drive 2: SQL Server Components (master, MSDB) data files – RAID 5, 100GB

Drive 3: SQL Server Logs – RAID 1+0, 100-240 GB

Drive 4: Application databases – RAID 5, As much as the databases need…

Drive 5: TempDB – RAID 1+0, 50–100 GB

Drive 6: Dedicated Page File only RAID 1, 40GB. You don’t want to see what happens to a Windows O/S when the page file is not available.

Buffer I/O is the bane of my existence. I have left no rock unturned on the internet trying to figure out how this process works, so if someone reading can leave a clarifying comment for an edit, I’d appreciate it. This I do know: the buffer pool is SQL Server’s own in-memory cache of 8KB data pages. Pages are read from the data files on disk into the buffer pool before a query can use them, and changed pages are written back to disk later. When the activity monitor in SQL Server Management Studio shows Buffer I/O waits, queries are waiting on pages to move between disk and the buffer pool. If your system is low on memory and using the page file extensively, those waits get worse, because pages are constantly pushed out and read back in while the O/S is hammering the same disks for the page file. In fact, if the query result set is big enough, the whole system will begin to die a slow and horrible death as information cannot move in and out of memory because so much information is going in and out of the page file. This is why I highly recommend splitting up the disks so that SQL Server does not have to fight with the page file for disk I/O.
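
If you would rather see numbers than watch the activity monitor, the wait statistics view is one place to look. This sketch assumes the Buffer I/O category corresponds to the PAGEIOLATCH wait types, and the counters it reads are cumulative since the last service restart.

```sql
-- Time spent waiting on data pages to move between disk and memory.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       -- NULLIF avoids a divide-by-zero for wait types that never occurred.
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;
```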

Look, if you have 10 records in one table used by one user 2 times a day, that VM with a 40 GB disk, 8 GB RAM, and a dual core virtual processor is going to do just fine. But you might as well save some cash and move that sucker onto Access or MySQL or some other non-enterprise level RDBMS.

 

 

To Proc or Not to Proc

I’ve had some interesting conversations and fun arguments about how to author queries for SQL Server Reporting Services (SSRS) reports. There are a lot of professionals out there who really want hard and fast answers on best practices. The challenge with SSRS is the multitude of configurations available for the system. Is everything (Database Engine, SSAS, SSRS, and SSIS) on one box? Is every service on a dedicated box? Is SSRS integrated with a SharePoint cluster? Where are the hardware investments made in the implementation?

Those are a lot of variables to try and make universal best practices for. Lucky for us Microsoft provided a tool to help troubleshoot report performance. Within the Report Server database there is a view called ExecutionLog3. ExecutionLog3 links together various logging tables in the Report Server database. Here are some of the more helpful columns exposed.

  • ItemPath – The path and name of the report that was executed in this record.
  • UserName – The user the report was run as.
  • Format – Format the report was rendered in (PDF, CSV, HTML4.0, etc.).
  • Parameters – Prompt selections made.
  • TimeStart – Server local date and time the report was executed.
  • TimeEnd – Server local date and time the report finished rendering.
  • TimeDataRetrieval – Amount of time in milliseconds to get report data from the data source.
  • TimeProcessing – Amount of time in milliseconds SSRS took to process the results.
  • TimeRendering – Amount of time in milliseconds required to produce the final output (PDF, CSV, HTML4.0, etc.).
  • Status – Succeeded, Failed, Aborted, etc.

I always provide two reports based on the information found in this view. The first report utilizes the time columns to give me insight into how the reports are performing and when the system’s utilization peaks. The second report focuses on which users are using what reports to gauge the effectiveness of the reports to the audience.
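
The queries behind those two reports can be as simple as the following. This is a sketch under a couple of assumptions: the Report Server database uses the default name ReportServer, and a 30 day window is just an example.

```sql
USE ReportServer;   -- assumption: default Report Server database name
GO

-- Report 1a: where does each report spend its time?
SELECT ItemPath,
       COUNT(*)               AS executions,
       AVG(TimeDataRetrieval) AS avg_data_retrieval_ms,
       AVG(TimeProcessing)    AS avg_processing_ms,
       AVG(TimeRendering)     AS avg_rendering_ms
FROM dbo.ExecutionLog3
WHERE TimeStart >= DATEADD(DAY, -30, GETDATE())
GROUP BY ItemPath
ORDER BY avg_data_retrieval_ms DESC;

-- Report 1b: which hours of the day are busiest?
SELECT DATEPART(HOUR, TimeStart) AS hour_of_day,
       COUNT(*)                  AS executions
FROM dbo.ExecutionLog3
WHERE TimeStart >= DATEADD(DAY, -30, GETDATE())
GROUP BY DATEPART(HOUR, TimeStart)
ORDER BY executions DESC;

-- Report 2: who is running what, and how recently?
SELECT UserName,
       ItemPath,
       COUNT(*)       AS executions,
       MAX(TimeStart) AS last_run
FROM dbo.ExecutionLog3
WHERE TimeStart >= DATEADD(DAY, -30, GETDATE())
GROUP BY UserName, ItemPath
ORDER BY executions DESC;
```

A report with a high TimeDataRetrieval points at the query or the data source; a high TimeProcessing or TimeRendering points at work being done inside SSRS.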

Generally I’m a big fan of stored procedures, mostly because my reports are usually related to a common data source and stored procedures provide me with a lot of code reuse. Standardizing the report prompt behavior with stored procedures is also a handy tool. A simple query change can cascade to all the reports that use a stored procedure, alleviating the need to open each report and perform the same change. Additionally, I like to order the result sets in SQL, not after the data is returned to the report. But that doesn’t mean that you’re not going to find better performance moving some functionality between tiers based on the results you find in ExecutionLog3.
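
As an illustration of the pattern, here is a small sketch. The table, columns, procedure name, and the optional-parameter convention are all hypothetical; the point is only that the prompt handling and the ORDER BY live in one place that several reports can share.

```sql
-- Hypothetical source table for the example.
CREATE TABLE dbo.SalesOrder (
    SalesOrderId int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Region       nvarchar(50) NOT NULL,
    OrderDate    datetime     NOT NULL,
    TotalDue     money        NOT NULL
);
GO

-- One procedure shared by every report that prompts for a region.
CREATE PROCEDURE dbo.rpt_SalesByRegion
    @Region nvarchar(50) = NULL        -- NULL means "all regions"
AS
BEGIN
    SET NOCOUNT ON;

    SELECT Region, OrderDate, TotalDue
    FROM dbo.SalesOrder
    WHERE (@Region IS NULL OR Region = @Region)
    ORDER BY Region, OrderDate;        -- sorted in SQL, not in the report
END;
GO

-- How a report data set would call it.
EXEC dbo.rpt_SalesByRegion @Region = N'Midwest';
```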

I’m sorry, there just isn’t a one size fits all recommendation for how SSRS reports are structured. Which means: 1. you’ll have to do some research on your configuration, and 2. don’t accept a consultant’s dogma on the topic.

SQL Server Indexes: Using the Clustered Index

If you really want to understand SQL Server indexing I suggest following Stairway to SQL Server Indexes. My first blog post on SQL Server indexes is going to focus on clustered indexes. I’m going to paraphrase a lot of the information found in those articles to save those of us who don’t want an intimate scientific knowledge on this topic, and address what is pertinent to re-factoring indexes to reduce deadlocks.

Simply put, indexes are smaller, concise reference tables associated with the data tables that tell SQL Server where some of the information requested in a query is located. Without indexes, the query engine performs a full table scan (sequentially looks at every Page in a table) to retrieve the requested data rather than jumping around segmented portions of the data where the requested rows are stored. In some cases where all the information requested resides in an index, SQL Server will simply return the row from the index and not access the main data table at all.

Non-clustered indexes are separate objects with separate storage. A clustered index instructs SQL Server how to sort the data in the main data table itself and creates a logical hidden permanent key. This is why identity columns are popular to create clustered indexes on. These integer or big integer column values are created sequentially by SQL Server when a new row is inserted; there is basically never any resorting when rows are inserted or deleted.

SQL Server saves data tables and non-clustered indexes in 8KB blocks called Pages within their data files. As CRUD actions are performed on your tables the data within these pages must be re-sorted. SQL Server determines which rows go into which Pages. In the case where sequentially created pages do not contain the data sequentially, external fragmentation is created. The percentage of empty page space is called internal fragmentation. The fixed size nature of pages means some data types are better for indexes than others (creating less fragmentation). Additionally, the number of fields included in an index can adversely affect index performance. I’m going to address fragmentation in a later post and explain why creating clustered indexes on GUID data types is death to a database.
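
Both kinds of fragmentation can be measured for the current database with a query like this one. It is a sketch: the SAMPLED mode and the 100 page cut-off are arbitrary choices, and I am mapping the terms above onto the two columns the function exposes.

```sql
SELECT OBJECT_NAME(ips.object_id)            AS table_name,
       i.name                                AS index_name,
       ips.index_type_desc,
       ips.page_count,
       -- External fragmentation: pages out of logical order.
       ips.avg_fragmentation_in_percent,
       -- Page fullness; 100 minus this is the internal fragmentation.
       ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
INNER JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id  = ips.index_id
WHERE ips.page_count > 100                   -- skip tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;
```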

From Stairway to SQL Server Indexes: Level 3, Clustered Indexes:

The clustered index key can be comprised of any columns you chose; it does not have to be based on the primary key.

Keep in mind these additional points about SQL Server clustered indexes:

  • Because the entries of the clustered index are the rows of the table, there is no bookmark value in a clustered index entry. When SQL Server is already at a row, it does not need a piece of information that tells it where to find that row.
  • A clustered index always covers the query. Since the index and the table are one and the same, every column of the table is in the index.
  • Having a clustered index on a table does not impact your options for creating non-clustered indexes on that table.

There can be only one clustered index on a table; the data in a table can’t be sorted two different ways at the same time. Clustering does not require uniqueness. In the case where the clustered index is made up of non-unique fields the sorting results in grouping for these fields.

If the preceding is clear, it should be obvious that creating a clustered index on a particular column and then also creating a non-clustered index with the same column is a waste of resources. The creation of the clustered index enforces sorting of the data and creates a permanent key for SQL Server to quickly locate the Page where every row is stored.

Additionally, creating indexes in this manner can lead to deadlocks. Consider a large number of rows being inserted into a table in a batch. Even if your clustered index is created on an identity column, SQL Server will perform a sort check on the clustered index. After the clustered sort is finished, the non-clustered indexes that also include this column will be filled, and the batch will not be committed until the indexes are ready. The reason is the non-clustered indexes can’t create their bookmarks until the page where each record is going to reside is determined. When a select against this table happens at the same time, the query engine may decide to access the non-clustered index, but the insert batch has it locked. The insert can’t complete because the select is accessing the non-clustered index that the insert also has to write to.

Now wait, if the clustered column is removed from the non-clustered index, the select query is just going to use the clustered index (if the clustered index column is part of the table links or where clause of the select statement) and then the deadlock will occur due to the clustered index being locked. So how do we get the query engine to use the non-clustered index without using it to create the bookmark in the index? We use a covering index, which is created with the INCLUDE clause.
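
A minimal sketch of a covering index, using a hypothetical table (the table, column, and index names are made up for illustration):

```sql
-- Hypothetical table with the clustered index on the identity column.
CREATE TABLE dbo.CustomerOrder (
    OrderId    int IDENTITY(1,1) NOT NULL,
    CustomerId int      NOT NULL,
    OrderDate  datetime NOT NULL,
    TotalDue   money    NOT NULL,
    CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (OrderId)
);

-- CustomerId is the only key column; OrderDate and TotalDue ride along
-- as included columns and do not affect the order of the index entries.
CREATE NONCLUSTERED INDEX IX_CustomerOrder_CustomerId
    ON dbo.CustomerOrder (CustomerId)
    INCLUDE (OrderDate, TotalDue);

-- Every column this query touches is in the non-clustered index,
-- so it can be answered without visiting the table itself.
SELECT CustomerId, OrderDate, TotalDue
FROM dbo.CustomerOrder
WHERE CustomerId = 42;
```

Note that the clustered key (OrderId) is not listed in the non-clustered index definition at all.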

From Stairway to SQL Server Indexes: Level 5, Included Columns:

Columns that are in a non-clustered index, but are not part of the index key, are called included columns. These columns are not part of the key, and so do not impact the sequence of entries in the index. Also, as we will see, they cause less overhead than key columns.

There are several advantages in having these columns in the index but not in the index key, such as:

  • Columns that are not part of the index key do not affect the location of the entry within the index. This, in turn, reduces the overhead of having them in the index.
  • The effort required to locate an entry(s) in the index is less.
  • The size of the index will be slightly smaller.
  • The data distribution statistics for the index will be easier to maintain.

Deciding whether an index column is part of the index key, or just an included column, is not the most important indexing decision you will ever make. That said, columns that frequently appear in the SELECT list but not in the WHERE clause of a query are best placed in the included columns portion of the index.

-- Step 1: collect the key columns of every clustered index on a user table.
select objt.object_id, idxc.index_id, clmn.column_id
into #ClustIdxCols
from sys.objects objt
inner join sys.indexes idx on objt.object_id = idx.object_id
inner join sys.index_columns idxc on objt.object_id = idxc.object_id and idx.index_id = idxc.index_id
inner join sys.columns clmn on objt.object_id = clmn.object_id and idxc.column_id = clmn.column_id
where objt.type = 'U' and idx.type = 1 -- 'U' = user table, 1 = clustered index
order by objt.object_id, idxc.index_id, clmn.column_id

-- Step 2: list the non-clustered indexes that use one of those columns as a key column.
select distinct objt.name as table_name, idx.name as index_name
from sys.objects objt
inner join sys.indexes idx on objt.object_id = idx.object_id
inner join sys.index_columns idxc on objt.object_id = idxc.object_id and idx.index_id = idxc.index_id
inner join sys.columns clmn on objt.object_id = clmn.object_id and idxc.column_id = clmn.column_id
inner join #ClustIdxCols test on objt.object_id = test.object_id and idxc.column_id = test.column_id
where idx.type = 2 and idxc.is_included_column = 0 -- 2 = non-clustered, key columns only
order by table_name, index_name

This script identifies non-clustered indexes using clustered index columns (include columns are filtered out). In some cases the tables only contain two indexes, a clustered index on the identity column for the table and a non-clustered index on the same column.

How are you coming with those TPS reports?

Does anyone remember the original “Weekend at Bernie’s”? When the two accountants are poring over the green and white dot matrix printouts of the accounts on the hot tar roof of their apartment building? That’s the traditional report, pages and pages of numbers. Until the invention of spreadsheets, this was the means by which accountants reviewed the accounts. Larger companies have since outgrown even spreadsheets and demanded larger data storage, like databases. However, a majority of the reporting provided from these robust data stores still looks like a spreadsheet.

Detailed row data has its uses. Financial transactions and system audit logs are very useful when displayed as uniform rows of data for visual scanning. You can easily find the row that doesn’t look like the others when searching for an error, but how easy is it to determine transaction volume, or the frequency of a particular event? Are you going to count the lines and keep a tick mark tally on another sheet? You can calculate some of these statistics and group them by date, and compare the groups if all the data is still available at the source. Hopefully the query doesn’t slow down the system while users are trying to do their work on it. Save the data in monthly spreadsheets that are backed up regularly? In most cases, the generation of these reports just becomes a meaningless process and waste of paper.

Business Intelligence (BI), I don’t know who coined the term, is meant to communicate the difference between a report (any formatted delivery of data) and the display of information in a way that aids in the business decision making process. BI reporting answers questions like how are this month’s sales compared to last month’s? Or has there been a statistically significant increase in defects with the new modifications to our product?

Many professionals familiar with BI reporting make the assumption that it’s really only applicable to data collected and aggregated over a large period of time. Contact center management is the best example of why this isn’t the case. A contact center is much like an old Amateur Radio that requires constant tuning to produce the best receiving and transmitting signals. These machines come with a panel full of dials and switches used to make sure the radio and the antenna are in perfect attunement. Similarly, contact center managers are constantly monitoring the call handle and queue times making sure the correct proportion of agents are staffed for email, voice, or chat processing. These managers require timely 15 or 30 minute latent reports to determine short term staffing levels. Most companies see the customer service departments as necessary expenses to keep their customers happy. Decision makers need nearly real-time information to make constant adjustments maximizing the efficiency of the staff and keeping their customers happy.

The challenge for BI professionals is understanding the users’ needs well enough to deliver the correct solution for the need. There isn’t a one size fits all approach to BI delivery. The assembly manager needs metrics on how many completed plastic toys are failing inspection every half hour. Management needs to compare this month’s inspection failures to the samples before switching to the new vendor, perhaps a few times a week. The executive might want to know how sales are going this year compared to the last five, but she only needs this information on the first of the month when she first walks into the office. Each one of these examples has different requirements for the size of the data set, how long the report needs to be displayed, and how far back the data needs to reach.

What’s the point? Go run a search on any technology job board for Business Intelligence or BI. Employers are looking for qualified BI professionals to deliver reporting solutions in a way that aids in the business decision making process. It’s a growing space/niche on par with security and mobile development. If you can get past the stigma placed on this practice by developers that “Reporting Work” is somehow inferior to software development, there is a lot of opportunity to be had.

 

 

GUIDs – Never for Clustered Indexes

Globally Unique Identifiers have their place in software development. They’re great for identifying a library in the GAC or the Windows registry. They are, however, huge data types from the database perspective.

Oracle, MySQL, Sybase, and DB2 do not provide any special data type for fields storing GUIDs. For these vendors a GUID is a 32-38 character string (depending on whether dashes and “{}” are included). SQL Server provides a Unique Identifier (uniqueidentifier) data type, which has some benefits in storage and access speed over a 36 character varchar or nvarchar field. However, they’re still huge…

Unique Identifier Data Type

http://msdn.microsoft.com/en-us/library/ms190215(v=sql.105).aspx

SQL Server’s Unique Identifier displays as a 36 character string (dashes and no “{}”) and stores a GUID as a 16 byte binary value. There’s no argument that it’s nearly impossible (though not mathematically impossible) to create a duplicate GUID, but how many data sources are going to outgrow a bigint, which ranges from -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)? That’s only 8 bytes, half the size of a Unique Identifier. Hard disk space has gotten cheap, so why do we care about data type size anyway? The article linked above notes that indexes created on Unique Identifier fields perform slower than indexes built on integer fields. That statement hardly scratches the surface of the performance implications of Unique Identifier indexes, and it’s all related to the size.
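To make the sizes concrete, here is a small sketch in Python (used purely for illustration; it has nothing to do with SQL Server itself). It compares the text forms of a GUID with its 16 byte binary form and with an 8 byte signed integer, which is the same width as a bigint. The variable names are my own.

```python
import struct
import uuid

# Any GUID will do; uuid4() generates a random one.
guid = uuid.uuid4()

text_with_dashes = str(guid)                 # the 36 character display form
text_with_braces = "{" + str(guid) + "}"     # the 38 character registry-style form
binary_form = guid.bytes                     # the raw 16 byte value
bigint_form = struct.pack("<q", 2**63 - 1)   # largest signed 64-bit value, 8 bytes

print("text with dashes:", len(text_with_dashes), "characters")
print("text with braces:", len(text_with_braces), "characters")
print("binary GUID:     ", len(binary_form), "bytes")
print("bigint:          ", len(bigint_form), "bytes")
```

Even in its most compact binary form, the GUID is twice the width of the largest integer type.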

Pages and Extents

http://msdn.microsoft.com/en-us/library/ms190969(SQL.105).aspx

The above article explains how SQL Server stores data and indexes in 8KB (8,192 byte) pages. 96 bytes are reserved for the page header, a row offset array at the end of the page tracks where each row starts, and a single row can hold at most 8,060 bytes. Treating those 8,060 bytes as the usable space and ignoring per-row overhead, a table consisting of just one column could fit roughly 503 GUIDs, 1,007 bigints, or 2,015 ints on a page. Put another way, the fewer bytes in a row, the more rows you can store in one page. SQL Server doesn’t control where the pages are written on the hard disk; the O/S and hardware decide. The more pages a table or index needs, the greater the chance that consecutive pages end up in distant disk sectors. As the number of index pages grows, the more out of sync they become with the data pages, leading to index fragmentation.
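The per-page numbers above are simple integer division, and a quick Python sketch shows how they add up across a table. This is back-of-the-envelope arithmetic only: it assumes the 8,060 byte figure from the text and ignores the per-row overhead a real page carries, so actual counts will be lower. The function names and the one million row example are my own.

```python
# Usable bytes per page, using the figure quoted in the text (an approximation).
USABLE_BYTES = 8060

# Storage size in bytes of each single-column key type being compared.
COLUMN_SIZES = {"uniqueidentifier": 16, "bigint": 8, "int": 4}


def values_per_page(column_bytes, usable_bytes=USABLE_BYTES):
    """How many whole values of this size fit in one page."""
    return usable_bytes // column_bytes


def pages_needed(row_count, column_bytes):
    """How many pages are needed to hold row_count values (rounded up)."""
    per_page = values_per_page(column_bytes)
    return -(-row_count // per_page)  # ceiling division


for name, size in COLUMN_SIZES.items():
    print(name, values_per_page(size), "per page,",
          pages_needed(1_000_000, size), "pages for 1,000,000 rows")
```

Under these assumptions, a million GUID keys need about twice as many pages as a million bigint keys and four times as many as a million int keys.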

Index Fragmentation

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-1-the-basics/

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/

Let’s recap what we have so far:

  1. GUIDs are randomly generated values without any sequential nature or restrictions.
  2. GUIDs are twice as big as the biggest integer data types.
  3. The larger a table’s rows are, the more pages have to be created to store the data.
  4. The more pages an index has, the more fragmented it gets.
  5. The more fragmented the indexes get, the more frequently they have to be rebuilt.

Clustered Index Implications

Clustered indexes set the organization and sorting of the actual data in the table. Non-clustered indexes created on a table with a clustered index have to be updated as records are inserted or deleted, or when the clustered index value is updated, because these changes require the data pages to be re-sorted and new keys generated. SQL Server Identity columns of an integer data type avoid a lot of I/O overhead and SQL Server processing because new rows are always inserted in order, at the end of the table. GUID values are almost never inserted in the correct order because of their random nature. Thus, with a GUID clustered index, nearly any insert, delete, or update of that field can require data page reorganization, non-clustered index updates, more pages to be created, and increased fragmentation.
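The difference between ordered and random inserts is easy to see in a toy model. The Python sketch below is not how SQL Server is implemented; it is a simplified simulation under my own assumptions: pages hold a fixed 100 keys, a full page in the middle of the table splits in half, and a key larger than everything else simply starts a new page at the end. Random 128-bit integers stand in for GUIDs, and the function name is hypothetical.

```python
import bisect
import random


def simulate_inserts(keys, capacity=100):
    """Insert keys into sorted, fixed-capacity pages.

    Returns (number of mid-table page splits, total number of pages).
    """
    pages = [[]]   # pages kept in key order; each page is a sorted list
    lows = []      # lowest key on each page, used to locate the target page
    splits = 0
    for key in keys:
        # Find the page whose key range should hold the new key.
        idx = max(bisect.bisect_right(lows, key) - 1, 0)
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Key belongs after the last row: start a new page, nothing moves.
            pages.append([key])
        else:
            # Full page in the middle of the table: move half its rows out.
            half = capacity // 2
            new_page = page[half:]
            del page[half:]
            pages.insert(idx + 1, new_page)
            bisect.insort(new_page if key >= new_page[0] else page, key)
            splits += 1
        lows = [p[0] for p in pages]
    return splits, len(pages)


ROWS = 10_000
rng = random.Random(42)  # fixed seed so the run is repeatable

sequential_keys = list(range(ROWS))                        # identity-style keys
random_keys = [rng.getrandbits(128) for _ in range(ROWS)]  # GUID-like keys

seq_splits, seq_pages = simulate_inserts(sequential_keys)
rand_splits, rand_pages = simulate_inserts(random_keys)

print("sequential keys:", seq_splits, "splits,", seq_pages, "pages")
print("random keys:    ", rand_splits, "splits,", rand_pages, "pages")
```

In this model the sequential keys never split a page and pack every page full, while the random keys split pages repeatedly and leave them partly empty, so the same number of rows ends up spread across more pages.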