Kimball Dimensional Modeling Practices Waterfall Only?

Kimball Dimensional Modeling theory and practices are the most widely accepted processes for consolidating data from different sources into a central “Delivery Area” for consolidated cross functional reporting. Or simply a process for normalizing and standardizing data from several data marts into a data warehouse for Business Intelligence (BI) reporting.

From “The Data Warehouse Tool Kit”; Second Edition; Kimball, Ross; 2002; John Wiley and Sons, Inc.; Page 22:

Finally, dimensional models are gracefully extensible to accommodate change. The predictable framework of a dimensional model withstands unexpected changes in user behavior. Every dimension is equivalent; all dimensions are symmetrically equal entry points into the fact table. The logical model has no built-in bias regarding expected query patterns. There are no preferences for the business questions we’ll ask this month versus the questions we’ll ask next month. We certainly don’t want to adjust our schemas if business users come up with new ways to analyze the business.

The main tool used to discover the applicable Dimensions for a model is the Business Process Dimensions Matrix, see Figure 1. Each row represents a Business Process and Each Column a Dimension available within the Delivery Area that is populated manually or through the consolidation of the source data.

BPM

Figure 1. Kimball, Ross; et al.; Page 79.

On this surface this looks like a Waterfall approach… Identifying all the requirements upfront before development starts. I disagree. This document should be an organic repository that is constantly updated with changes like new Dimensions or Business process as additional data sources are added to the system. Further, I believe this matrix is the perfect primer for authoring User Stories. For Instance the first Business Process would translate to:

As an Inbound Contact Center Supervisor I want to see Voice, Chat, Email and Fax metrics summarized by Date, Time, Agents (Users), Goals and Locations So that I can…

The last section of the User Story where the reason or benefit is recorded also derive from a central tenant of BI practices. That central tenant is, “Every report must answer a question to aid in the conclusion of one or more business decisions.” The question we’re answering does not show up on the matrix, but should be part of the BI project management artifacts and the User story is the perfect place to record it.

If you’re reading this you may be a BI solution developer and suffered the frustrations of pointless and repetitive presentations (reports and dashboards) because your customers don’t know what they want. Someone on the project must take it upon themselves to get the stake holders to commit to the questions they want to answer. In Agile Scrum, it would make sense that the Product Owner maintains the matrix and the user stories and therefore should be responsible for those commitments. Here’s an Example of the resulting user story.

As an Inbound Contact Center Supervisor I want to see Voice, Chat, Email and Fax metrics summarized by Date, Time, Agents (Users), Goals and Locations So that I can more accurately forecast future staffing needs.

The wide acceptance of Kimball practices predates the wide acceptance of Agile Iterative Development practices. Therefore, several professionals in the space are unwilling to adapt their practice of Kimball methodologies. Hopefully this discussion will aid in efforts to convince these BI resources to modify their approach to conform to the Software Development Life Cycle (SDLC) methodology the rest of the development team uses.

 

Job Req. Sanity Check

Let me start by saying I am not an HR guy. Nor have I ever been a full-time recruiter of any sort. So perhaps, I’m way off base with my thoughts on this topic. PLEASE straighten me out if I am because there are a lot of practices within this space that make no sense to me.

I.  The Skill Set Years Experience Mismatch

Lately I have seen a flood of open position postings on the various job boards that will say something to the effect of “Jr. Developer\Recent College Grad\1-2 Years experience” as the headline of the posting. Only to find in the requirements section, experience (which to me means more than just exposure or reading a help doc online) for some 30 different technologies. Maybe, yes maybe with the right set of circumstances a Jr. resource as described in the headline might have started in an environment where he or she was given free rein to provide solutions through whatever means. I was lucky enough to have started my career as the only software developer for a successful Insurance company where I was able to explore whatever new technology came along and experiment with different techniques. I think this is pretty rare. Some companies spend the first 6 months breathing over a new resources shoulder with weekly code reviews before they’re promoted to level on and the code reviews come when the developer is ready. Many companies only let their resources sustain existing code and teach them just the basics to troubleshoot the existing technologies while the more senior staff works on innovation.

So are the hiring managers or recruiters looking for 80% of the required skills? One or two? Software design and development professionals are detail oriented and precise personalities. If I can’t talk about every skill listed, I’m not going to apply for a position.

II.   Competing Technologies

Another favorite of mine is when the laundry list of experience includes market competitors. The posting is looking for someone with 5 years experience and expert knowledge of Oracle, DB2 and SQL Server, or Expert level .NET and Java. First, can you really become expert in 5 years, especially if the maybe 2 of those you were just doing maintenance work (i.e. spell checking websites)? Secondly how many companies invest tens of thousands of dollars in SQL Server and more tens of thousands on Oracle? As a vendor software developer your product may need to support more than one database platform. However, what percentage of the candidates the job market hail from vendor software companies? Are there really any transferrable skills between .NET and Java? It seems to me trying to grow one resource into an expert of both is far more expense than cultivating two specialists and most companies would do the latter.

These types or requirements lead to a lot of confusion for candidates. They don’t know if they should bother applying or not. The recruiters are inundated with resumes that don’t fit the request from the hiring customer.

III.   Automated Recruitment Phone Recruiting

This year in particular I have been flooded with outsourced call center recruiter calls. These calls always follow the same format.

  • I answer the phone to silence
  • A few seconds later someone in a very thick accent says, “Hello may I speak to George?”
  • “Yes this is George.”
  • Faster than any normal human being should be able to speak -“Uh hi. My name is gibberish. gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish …”
  • Me, “Whatever you’re talking about I’m not interested. Thanks.”
  • Hang up.

It’s as bad as the campaign calls around supper time during an election cycle. Who in their right mind thinks this is in any way an effective means to find a qualified candidate? I seriously doubt these individuals understand the technical requirements well enough to successfully phone screen much less are able to fight through the language barrier well enough to have a real conversation about the candidate or the opportunity.

IV.   Don’t Read the Resume

Another new interesting fishing tactic is the mail blast, or I guess that’s what’s going on. Why else am I getting emails for Jr. or Intermediate 5 years or less positions from the job boards where my resume clearly showing 16 years of experience are posted? Or the expert Java Architect roles I was sent when Java J2EE doesn’t appear anywhere on my resume? Recruiters, does this tactic work?

I understand there is a perception in the US job market right now that a lot of people are out of work and some companies are hoping to cash in on getting better qualified candidates for less compensation. This perception has created a recruiter feeding frenzy atmosphere. The truth is most of the top ranked talent is aware of what’s going on and they’re sitting this cycle out, or contracting. The unemployment rate among software development professionals is not nearly as high as other skill sets like manufacturing and construction. I believe this tactics will not be successful, and my land your corporation with a lot of negative feedback on a site like GlassDoor.com.

Work Yourself Out of a Job

Has anyone in your management structure ever told you that if they could just clone you 4 or 5 times the company would be worry free? That is great feedback. You have earned yourself many years of job security. In fact you might say you’ve made yourself so indispensable at your current position, you’ll never be able to grow out of it.

 

For a lot of professionals in the world that’s exactly the position they want to be in. Joe is the best administrator BigCo Corp has ever had. No issues with any of the servers under his care. His management gets a little nervous when he takes a vacation because there is no one else who can do what he does the way he does it. Management doesn’t want to add any new variables to the mix and Joe will never be burdened with management training, team building, Agile methodology, mentoring, or other distractions. Likewise, Joe isn’t really all that motivated to share his expertise, both because he doesn’t like trying to explain what he does to others, and because he’s worried about losing some of his organizational expertise currency.

 

If you’re like Joe, read no further, because we’re going to talk about Sally. Sally, another administrator at BigCo Corp, likes having uninterrupted vacations. She has back filled her own position by training a few of her colleagues to do support her when she needs a break. Sally has taken the time to create process and systems documentation and publish it where others can reference it. Management notices that Sally can take care of her normal duties and build a team to add more redundancies to the processes. Management decides to send Sally to team building and management training. Sally becomes Joe’s new boss.

 

More than once in my career I worked really work and waited for my supervisors to recognize my talent and give me that promotion to management. I gotten the “Clone you” comment many times and sat on it. It took some advice for a mentor to open my eyes to the power of creating your own opportunities. You may not be able to personally improve the company’s bottom line and create more revenue that results in more growth and upward movement opportunities. If you do a great job, create a backfill for yourself and haven’t made yourself a pariah in your organizations politics, there’s a great chance you’ll get top consideration when an opportunity does show up.

To Proc or Not to Proc

I’ve had some interesting conversations and fun arguments about how to author queries for SQL Server Report Services (SSRS) reports. There are a lot of professionals out there who really want hard fast answers on best practices. The challenge with SSRS is the multitude of configurations available for the system. Is everything (Database Engine, SSAS, SSRS, and SSIS) on one box? Is every service on a dedicated box? Is SSRS integrated with a SharePoint cluster? Where are the hardware investments made in the implementation?

Those are a lot of variables to try and make universal best practices for. Lucky for us Microsoft provided a tool to help troubleshoot report performance. Within the Report Server database there is a view called ExecutionLog3. ExecutionLog3 links together various logging tables in the Report Server database. Here are some of the more helpful columns exposed.

  •          ItemPath – The path and report names that was executed in this record.
  •          UserName – The User the report was ran as.
  •          Format – Format the report was rendered in (PDF, CSV, HTML4.0, etc.)?
  •          Parameters – Prompt selections made.
  •          TimeStart – Server local date and time the prport was executed.
  •          TimeEnd – Server local date and time the report finished rendering.
  •          TimeDataRetrieval – Amount of time in milliseconds to get report data from data source.
  •          TimeProcessing – Amount of time in milliseconds SSRS took to process the results.
  •          TimeRendering – Amount of time in milliseconds Required to produce the final output (PDF, CSV, HTML4.0, etc.)
  •          Status – Succeeded, Failed, Aborted, etc.

I always provide two reports based on the information found in this view. The first report utilizes the time columns to give me insight into how the reports are performing and when the systems peaks utilization. The second report focuses on which users are using what reports to gauge the effectiveness of the reports to the audience.

Generally I’m a big fan for stored procedures, mostly because my reports are usually related to a common data source and stored procedures provide me with a lot of code reuse. Standardizing, the report prompt behavior with stored procedures is also a handy tool. A simple query change can cascade to all the reports that use a stored procedure, alleviating the need to open each report and perform the same change. Additionally, I like to order the result sets in SQL not after the data is returned to the report. But that doesn’t mean that you’re not going to find better performance moving some functionality between tiers based on the results you find in ExecutionLog3.

I’m sorry there just isn’t a one size fits all recommendation for how SSRS reports are structured. Which means; 1 you’ll have to do some research on your configuration, and 2 don’t accept a consultant’s dogma on the topic.

Agile: The Consultant’s Savior

Thanks to Jeff Nall for contributing to this post.

How many times have you delivered to spec and hit your milestone only to get the “That’s not what I asked for.” feedback? Guess what, your customers don’t always know what they want much less what they really need. They might think they do, but if that were the case they would have been able to staff the project internally or with some new direct hires.

I once worked on a project where the company actually hired a consultant to translate corporate jargon into generic tech and software development terms. Apparently, they were struggling to find new hires or consultants with the skills they were looking for. Additionally, the resources they would gamble on were so lost trying to understand requirements, the development departments turned into revolving doors. The “Demystification Consultant” had a full time job translating specs and RFP’s so vendors could understand what to bid on. It might have been cheaper to adopt language learning methodologies and switch to a common communication device, illustrations.

In Agile that translates to frequent demonstrations of development progress. Rather than placing the entire project’s success on a nearly finalized demonstration of a product 2 weeks from delivery, Agile iterative development practices frequent illustrations of what the product will be so stake holders can approve or make changes with enough lead time to actually see the modifications implemented.

How does a consultant benefit from this process? Well like it or not, no matter what the contract says, an unhappy customer can remove a consultant and withhold payment if the customer believes they can argue in court that the contract was violated. Consultants are burdened with not just delivering what was promised on the agreed schedule, but also executing the contract in such a way that the customer wants to work with them again or act as a reference for other potential clients. Conducting frequent demonstrations: illustrates your responsiveness to your customers’ needs, reaffirms progress made on the deliverables  and keeps the lines of communication open for timely reactions to change. Agile is the best defense against “That’s not what I asked for” in the final days of your project.

The customer isn’t always right–they’re hiring you to answer the question for them. It’s your job to read between the lines of what the customer says they want and give them what they need. The point is the answers you get from someone who isn’t an expert in YOUR field can’t logically be the solution to the problem. You’re dealing with breadcrumbs, not road maps.

SQL SERVER – All Table & its Index sizes in MBs

SQL Server Portal

Today, I received a query enquiring how to calculate the table Size, its index size along with the total number of rows, as the client’s database file size suddenly grew from MBs to GBs and they wanted to know what was inserted in which tables that caused this instant growth.

So, I started searching for solution over the web, but most of the solutions use cursor to calculate each table & its index size along with the total number of rows. I could not get what I was looking for, so, I thought of writing this and calculate it with an optimized query and convert the table & its index size to MBs instead of KBs because windows keep the file sizes in MBs also.

Given below is the Script :

View original post

SQL Server Indexes: Using the Clustered Index

If you really want to understand SQL Server indexing I suggest following Stairway to SQL Server Indexes. My first blog post on SQL Server indexes is going to focus on clustered indexes I’m going to paraphrase a lot of the information found in the articles linked above to save those of us who don’t want an intimate scientific knowledge on this topic and address what is pertinent to re-factoring indexes for to reduce deadlocks.

Simply put, indexes are smaller, concise reference tables associated with the data tables that tell SQL Server where some of the information requested in a query is located. Without indexes, the query engine performs a full table scan (sequentially looks at every Page in a table) to retrieve the requested data rather than jumping around segmented portions of the data where the requested rows are stored. In some cases where all the information requested resides in an index, SQL Server will simply return the row from the index and not access the main data table at all.

Non-clustered indexes are separate objects with separate storage. A clustered index instructs SQL Server how to sort the data in the main data table itself and creates a logical hidden permanent key. This is why identity columns are popular to create clustered indexes on. These integer or big integer column values are created sequentially by SQL Server when a new row is inserted; there is basically never any resorting when rows are inserted or deleted.

SQL Server saves data tables and non-clustered indexes in 8k byte blocks called Pages within their data files. As CRUD actions are performed on your tables the data within these pages must be re-sorted. SQL Server determines which rows go into which Pages. In the case where sequentially created pages do not contain the data sequentially, external fragmentation is created. The percentage of empty page space is called internal fragmentation. The fixed size nature of pages means some data types are better for indexes then others (creating less fragmentation), additionally the number of field included in an index can adversely affect index performance. I’m going to address fragmentation in a later post and explain why creating clustered indexes on GUID data types is death to a database.

From Stairway to SQL Server Indexes: Level 3, Clustered Indexes:

The clustered index key can be comprised of any columns you chose; it does not have to be based on the primary key.

Keep in mind these additional points about SQL Server clustered indexes:

  • Because the entries of the clustered index are the rows of the table, there is no bookmark value in a clustered index entry. When SQL Server is already at a row, it does not need a piece of information that tells it where to find that row.
  • A clustered index always covers the query. Since the index and the table are one and the same, every column of the table is in the index.
  • Having a clustered index on a table does not impact your options for creating non-clustered indexes on that table.

There can be only one clustered index on a table; the data in a table can’t be sorted two different ways at the same time. Clustering does not require uniqueness. In the case where the clustered index is made up of non-unique fields the sorting results in grouping for these fields.

If the preceding is clear, it should be obvious that creating a clustered index on a particular column and then also creating a non-clustered index with the same column is a waste of resources. The creation of the clustered index enforces sorting of the data and created a permanent key for SQL server to quickly locate the Page where every row is stored.

Additionally, creating indexes in this matter can lead to deadlocks. Consider a large number of rows being inserted into a table in a batch. Even if your clustered index is created on an identity column, SQL Server will perform a sort check on the clustered index. After the Clustered sort is finished the non-clustered indexes that also include this column will be filled and the batch will not be committed until the indexes are ready. The reason is the non-clustered indexes can’t create their bookmarks until page each record is going to reside is determined. When a select against this table happens at the same time, the query engine may decide to access the non-clustered index, but the insert batch has it locked. The insert can’t complete because the select is accessing the non-clustered index is the insert also has to write to.

Now wait, if the clustered column is removed from the non-clustered index, the select query is just going to use the clustered index (if the clustered index column is part of the table links or where clause of the select statement) and then the deadlock will occur due to the clustered index being locked. So how do we get the query engine to use the non-clustered index without using it to create the bookmark in the index? We use a covering index, with is created with the include statement.

From Stairway to SQL Server Indexes: Level 5, Included Columns:

Columns that are in a non-clustered index, but are not part of the index key, are called included columns. These columns are not part of the key, and so do not impact the sequence of entries in the index. Also, as we will see, they cause less overhead than key columns.

There are several advantages in having these columns in the index but not in the index key, such as:

  • Columns that are not part of the index key do not affect the location of the entry within the index. This, in turn, reduces the overhead of having them in the index.
  • The effort required to locate an entry(s) in the index is less.
  • The size of the index will be slightly smaller.
  • The data distribution statistics for the index will be easier to maintain.

Deciding whether an index column is part of the index key, or just an included column, is not the most important indexing decision you will ever make. That said, columns that frequently appear in the SELECT list but not in the WHERE clause of a query are best placed in the included columns portion of the index.

select objt.object_id, idxc.index_id, clmn.column_id

into #ClustIdxCols

from  sys.objects objt

inner join sys.indexes idx on objt.object_id = idx.object_id

inner join sys.index_columns idxc on objt.object_id = idxc.object_id and idx.index_id = idxc.index_id

inner join sys.columns clmn on objt.object_id = clmn.object_id and idxc.column_id = clmn.column_id

where objt.type = ‘U’ and idx.type = 1

order by objt.object_id, idxc.index_id, clmn.column_id

select distinct objt.name, idx.name

from  sys.objects objt

inner join sys.indexes idx on objt.object_id = idx.object_id

inner join sys.index_columns idxc on objt.object_id = idxc.object_id

and idx.index_id = idxc.index_id

inner join sys.columns clmn on objt.object_id = clmn.object_id

and idxc.column_id = clmn.column_id

inner join #ClustIdxCols test on objt.object_id = test.object_id

and idxc.column_id = test.column_id

where idx.type = 2 and idxc.is_included_column = 0

order by objt.name, Idx.name

This script identifies non-clustered indexes using clustered index columns (include columns are filtered out). In some cases the tables only contain two indexes, a clustered index on the identity column for the table and an non-clustered index on the same column.

How are you coming with those TPS reports?

Does anyone remember the original “Weekend at Bernie’s”? When the two accountants are pouring over the green and white dot matrix printouts of the accounts on the hot tar roof of their apartment building? That’s the traditional report, pages and pages of numbers. Until the invention of spreadsheets, this was the means by which accountants reviewed the accounts. Larger companies have since outgrown even spreadsheets and demanded larger data storage, like databases. However a majority of the reporting provided from these robust data stores still looks like a spreadsheet.

Detailed row data has its uses. Financial transactions and system audit logs are very useful when displayed as uniform rows of data for visual scanning. You can easily find the row that doesn’t look like the others when searching for an error, but how easy is it to determine transaction volume, or the frequency of a particular event? Are you going to count the lines and keep a tick mark tally on another sheet? You can calculate some of these statistics and group them by date, and compare the groups if all the data is still available at the source. Hopefully the query doesn’t slow down the system while users are trying to do their work on it. Save the data in monthly spreadsheets that are backed up regularly? In most cases, the generation of these reports just becomes a meaningless process and waste of paper.

Business Intelligence (BI), I don’t know who coined the term, is meant to communicate the difference between a report (any formatted delivery of data) and the display of information in a way that aides in the business decision making process. BI reporting answers questions like how are this month’s sales compared to last month’s? Or has there been a statistically significant increase in defects with the new modifications to our product?

Many professionals familiar with BI reporting make the assumption that it’s really only applicable to data collected and aggregated over a large period of time. Contact center management is the best example of why this isn’t the case. A contact center is much like an old Amateur Radio that requires constant tuning to produce the best receiving and transmitting signals. These machines come with a panel full of dials and switches used to make sure the radio and the antenna are in perfect attunement. Similarly, contact center managers are constantly monitoring the call handle and queue times making sure the correct proportion of agents are staffed for email, voice, or chat processing. These managers require timely 15 or 30 minute latent reports to determine short term staffing levels. Most companies see the customer service departments as necessary expenses to keep their customers happy. Decision makers need nearly real-time information to make constant adjustments maximizing the efficiency of the staff and keeping their customers happy.

The challenge for BI professionals is, understanding the users’ needs well enough to deliver the correct solution for the need. There isn’t a one size fits all approach to BI delivery. The assembly manager needs metrics on how many completed plastic toys are failing inspection every half hour. Management needs to compare this month’s inspection failures to the samples before switching to the new vendor, perhaps a few times a week. The executive might want to know how sales are going this year compared to the last five, but she only needs this information on the first of the month when she first walks into the office. Each one of these examples has different requirements for the size of the data set, the amount of time the report needs to be displayed for, and the near or distant data term period access.

What’s the point? Go run a search on any technology job board for Business Intelligence or BI. Employers are looking for qualified BI professionals to deliver reporting solutions way that aide in the business decision making process. It’s a growing space/niche on par with security and mobile development. If you can get past the stigma placed on this practice by developers that “Reporting Work” is somehow inferior to software development, there is a lot of opportunity to be had.

 

 

Politics be Damned! Fess up When You Mess up.

When I was quite young, between 3 and 5 I think, I made a mistake and lied about it. My father has always worked on and collected various instruments for commercial and amateur radio communications. One of these radios was on the floor by the couch one morning and I was playing with it. I remember at some point dropping a dime in it. I was scared and my father came over to ask what I was doing.

I remember asking, “What would happen if a dime fell in those slots?”

He asked, “Did you drop a dime in one of those slots?”

“No.” I answered and walked away. I entered the dining room some time later and saw him screwing the case back on the radio we had been discussing.

He turned to me and said, “You know if I had plugged this in it could have blown up in my face because of where that dime landed. It’s always better to admit a mistake so someone can help you fix it faster before anything really bad happens.”

I started my professional carrier at Tower Hill Insurance Group, Inc. in Gainesville, FL in 1997. In the early 1990’s the state of Florida had entered the insurance business because so many carrier had decided the hurricane risk was too great after the devastation left by Amelia. Five years or so big carriers were again willing to take risks in Florida and the new insurance commissioner wanted insurance carriers to buy the policies issued by the state in order to earn the privilege to again sell property casualty insurance products in the state that they abandoned. Tower Hill negotiated an agreement between a carrier and the state where Tower Hill would administrate the transfer, premium collection and claim processing for the policies this carrier purchased. My role in this arrangement was to import the policy information from a huge text file sent on CD once a month for the policies that were renewing for the next month. Those policies would renew with the new carrier.

I was… still pretty new at a lot of this stuff. I had only been out of college and at this job for a year. There was still an ample amount of self-directed on the job learning taking place. Well somehow, the details are a little foggy on wither I misunderstood the file specification or I got things switched around when I was creating the values, but I managed to get all the days and months for the effective and expiration dates switched. Yes I managed to create the European formatted dates. Worse yet… it was 3 months into the imports that I found the problem.

Give me some credit, I had a self-imposed sanity check, and found my own mistake before our policy management software vendor ever detected an issue, and just 5 days before we had a scheduled audit… Do I take vacation, abruptly quit, or just go get absurdly drunk?

Actually, I thought of my father’s advice. I took my problem to my manager who was gifted with a saintly proportion of patience and understanding. Mr. Chris Allen as I remember it laughed at me, shook his head, and said, “Common we need to go tell Keaton”. That’s Mr. Benson VP of IT.

“Wait, I have to go with you and explain what happened?” I said.

“Yup.”

Other than thinking I was either getting fired or possibly shot, I only remember one other detail of that situation. As we were walking to the vendor reps office in our building Keaton stopped me and said, “Look George, if you’re not making any mistakes… you’re not working hard.”

That was the biggest motivator of all time. Yeah it was hard, and involved a lot of manual checking from the data entry team, and a torrid of sarcasm form the vendor rep who was threatened by the fact I could write robust GUI apps in 1/10th the time his company could put out a patch. But we got it fixed, and we had a great story of how our processes averted a disaster to give to the carrier’s auditor which earned us shiny gold stars on the audit report for honestly, ethics, compliance, and dedication to accuracy.

GUID’s – Never for Clustered Indexes

Globally Unique Identifiers have their place in software development. They’re great for identifying a library in the GAC or windows registry. They are, however, huge data types from the database perspective.

Oracle, MYSQL, Sybase, and DB2 do not provide any special data type for fields storing GUID’s, for these vendors a GUID is a 34-38 character string (depending on including dashes and “{}”). SQL Server has provided a Unique Identifier data type which has some benefits in storage and access speeds over a 36 character varchar, or nvarchar field. However, they’re still huge…

Unique Identifier Data Type

http://msdn.microsoft.com/en-us/library/ms190215(v=sql.105).aspx

SQL Server’s Unique Identifier displays as a 36 character string (dashes and no “{}”) and stores a GUID as 16 byte binary value. There’s no argument that it’s nearly impossible (not mathematically impossible) to create a duplicate GUID, but how many data sources are going to outgrow a bigint (-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)) data type? That’s only 8 bytes, half a Unique Identifier. Hard disk space has gotten cheap, why do we care about data type size anyway?  In the article mentioned above, it’s mentioned that indexes created on Unique Identifier fields are going to perform slower than indexes built on integer fields. That statement hardly scratches the surface of performance implications with Unique Identifier indexes, and it’s all related to the size.

Pages and Extents

http://msdn.microsoft.com/en-us/library/ms190969(SQL.105).aspx

The above article explains how SQL stores data and indexes in 8KB pages. 96 bytes are reserved for the page header, there’s a 36 byte row offset and then 8,060 bytes remain for the data or index storage. If your table consisted of just one column, a page could store 503 GUID’s,  or  1007 BigInt’s, or 2015 int’s. Put another way, the smaller the amount of bytes in a row, the more you can store in one page. SQL Server doesn’t control where the Pages are written on the hard disk, the O/S and hardware decide. The chances of consecutive or sequential pages being stored in distant disk sectors increases with the more pages stored for each table or index in the system. As the number of index pages grows, the more out of sync they become with the data pages leading to index fragmentation.

Index Fragmentation

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-1-the-basics/

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/

Let’s recap what we have so far,

  1. GUID’s are randomly generated values without any sequential nature or restrictions.
  2. GUID’s are twice as big as the biggest integer data types.
  3. The larger a tables rows are the more pages have to be created to store the data.
  4. The more pages an index has, the more fragmented they get.
  5. The more fragmented the indexes get the more frequently they have to be rebuilt.

Clustered Index Implications

Clustered indexes set the organization and sorting of the actual data in the table. Non-clustered indexes created on a table with a clustered index have to be updated with pointer changes as records are inserted or deleted, or the clustered index value updated because these changes require the data pages to be resorted and new keys generated. SQL Server Identity columns of an integer data type reduce a lot of I/O overhead and SQL server processing because the rows are always inserted in the correct order. GUID values are almost never inserted in the correct order because of their random nature. Thus, with a GUID clustered index every insert or delete or update of that field requires data page reorganization, non-clustered index updates, more pages to be created, and increased fragmentation.