Coming Home

Posted by mysticstache on December 29, 2014

Almost 2 months ago I had my first day at Avanade. For those of you who don’t know, Avanade was created as a joint venture between Microsoft and Accenture. Avanade has their own business development streams, but 99.9% of the Microsoft projects Accenture wins are sent to the Avanade team for execution.

Well let me just say what an absolute joy it has been to come back to the Microsoft family of products. After 13 months of wasting my life away fighting with Open Source garbage, I’ve come home to integrated enterprise solutions that work as advertised or at least have some reliable sources for support when they don’t. I was actually told to stop blogging about how much the Open Stack is a waste of time and money… Anyway, that’s behind me.

To add to the good vibes, Avanade is connected to Microsoft in so many ways. We’ve actually had advance looks at new technologies before the rest of the community. There are 20+ MVPs in just the Midwest region, Avanade requires 80+ hours of training every year, and employees are encouraged to participate in developer community organizations.

I’m excited to talk about the first area of expertise they’d like me to look at, Avanade Touch Analytics (ATA). I haven’t completed the training yet, but this offering is fantastic. It has the easiest interface I’ve ever used to create dashboards that look and feel like Tableau or Spotfire, but perform light-years ahead of both. Once the data sources are made available to the ATA server for any customer’s instance, the dashboards can be authored for or on any device. Switch between layout views to see how your dashboards will look on any device before releasing them. Publish multiple dashboards to different Active Directory security groups and let your users pick the information that’s important to them. It’s exciting, and I’m glad to see an offering addressing the shortcomings of the competition in hosted or onsite installations.

Well that’s enough advertising. Now that my censorship is at an end, I’ll be blogging more often. I really want to discuss SQL Server’s memory-resident database product, interesting things I’ve learned about the SSIS Service recently, and Service Broker.

PostgreSQL, AWS, and Musical Bottlenecks

Posted by mysticstache on May 16, 2014

I have had the misfortune of working with PostgreSQL for the last 8 months. Working is a relative term; for me, little work has been done. Mostly I’ve been kicking off queries, waiting forever for the returns, and then trying to run down the bottleneck.

I am not a Linux professional and have to rely on those professionals to diagnose what’s going on with the AWS instance that runs PostgreSQL 9.3. Everyone who looked at the situation has had a different opinion. One person looked at one set of performance data and said the system isn’t being utilized at all, someone else would say it’s IO bound, still someone else would say it’s the network card… So we went through all these suppositions: we added more RAM, then more processors, then we used the SSD drives more. Finally, switching from Non-provisioned IOPS to Provisioned IOPS got the system roughly as far as we could push it, to the point where the complex queries would drive one CPU core to 100%.

Now those of you who work with real enterprise RDBMSs might say, “Wait… One CPU core reached 100%?” Well yes, of course, because you see PostgreSQL does not have parallel processing. Yeah…

No matter how many CTEs or subqueries are present in a query statement sent to PostgreSQL, the processing of said query will happen in a synchronous, single-threaded fashion on one CPU core. I’m thinking SQL Server had parallel processing in the late 90’s or early 2000’s? It’s 2014 for crying out loud.

And it gets better! According to my observations, the Postgres process is also single-threaded. This process is responsible for writing to the transaction logs. So there isn’t any benefit to creating multiple log files for software striping and efficient log writing. In fact, one big insert seemed to back up all the smaller transactions while the first insert wrote to the transaction log.

This is one of the joys of Open Source offerings. If the development community doesn’t think a feature is important you have to fork the code and write the feature yourself. What blows me away is that companies are willing to gamble the success of their products and implementations on something so hokey.

I’m not a DBA, But I Play One on TV: Part 3 – Database Files

Posted by mysticstache on September 3, 2013

When a customer invites me to review their SQL Server or Oracle databases and server architecture, I start with the servers. I review the hard disk layout and a few server settings. The very next thing I do is review the data files and log files for the databases. In the case of SQL Server, when I see one data file and one log file in the same directory and the database has one file group called Primary, I know I am once again presiding over amateur hour at the local chapter of the Jr. Database Developer Wannabe Club.

One file pointing to one file group indicates to me:

Someone went through the “create new database” wizard.
There wasn’t any pre-development design analysis done before the database was created.
No one bothered to check readily available best practices for SQL Server.
I can anticipate equally uninformed approaches to table and index design and query authoring.

This will antagonize the hardware striping advocacy group, but there are reasons to split up your data files and log files. Specifically in the case of TempDB, you can greatly improve performance by creating the same number of data files as you have processors. With this configuration each processor has its own file to work against instead of all of them contending for one.

Check out number 8 here: http://technet.microsoft.com/en-US/library/cc966534
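
As a rough illustration of the TempDB advice above, the sketch below builds the T-SQL that would add one TempDB data file per processor. It assumes the per-processor files are data files and that the default tempdev file already exists. The drive path, sizes, and logical file names are placeholders I made up, and the script only builds and prints the statements; review them before running anything against a real server.

```python
def tempdb_add_file_statements(processor_count, directory="T:\\TempDB",
                               size_mb=1024, growth_mb=256):
    """Build ALTER DATABASE statements so TempDB ends up with one data
    file per processor. Assumes the default file (tempdev) already
    exists, so only processor_count - 1 extra files are generated.
    The directory, sizes, and file names are hypothetical."""
    statements = []
    for n in range(2, processor_count + 1):
        statements.append(
            "ALTER DATABASE tempdb ADD FILE ("
            f"NAME = tempdev{n}, "
            f"FILENAME = '{directory}\\tempdev{n}.ndf', "
            f"SIZE = {size_mb}MB, FILEGROWTH = {growth_mb}MB);"
        )
    return statements


if __name__ == "__main__":
    for statement in tempdb_add_file_statements(processor_count=4):
        print(statement)
```

Keeping every file the same size and growth increment is deliberate here, so no single file ends up carrying more of the load than the others.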

In addition to performance, recovery processes greatly benefit from splitting up the database files. Previously, if a data file failed, whether everything was in one file or not, SQL Server would take the database offline. With SQL Server 2012 a new feature was added that will leave your database accessible, just not the data located in the corrupt or otherwise unavailable file. Well, if all the data is in that one file your dataset is down until you can recover. But if that data file contains only a subset of the data in a table, the rest of the data in that table is still available for querying.

Now, you might say ok, we’re going to have a separate file for every table and multiple files for some. Ok, I’ve seen that configuration and there isn’t anything wrong with it. If your IT department isn’t using SQL Server to manage their backups, and instead they’re backing up the actual files across all the drives, they’re going to be annoyed with you. However, this configuration gives you maximum flexibility. For instance, tables that are commonly used at the same time can be placed on different spindles so they won’t compete for disk I/O.

Splitting up your log files is also beneficial. Log files are populated in a round robin fashion. When one reaches the level you’ve set it starts filling up the next. Hopefully you have at least 4 and they are of a sufficient size. This gives you time to archive the transaction logs between backups making sure no transactions are lost due to the file rolling over before the backup removes completed transactions and shrinks the file.

Next episode will cover backup basics. The purpose of all these posts is to provide the understanding to apply the best configuration to the database system you’re building.

I’m not a DBA, But I Play One on TV: Part 2 – CPU and RAM

Posted by mysticstache on August 30, 2013

In Part 1 I discussed SQL Server and Hard Disk configurations. Now let’s have a look at CPU and RAM. This topic is actually kind of easy. More is better… most of the time.

CPU

It’s my opinion that most development environments should have a minimum of 4 2.5+ GHz processors. Whether that’s one socket with two cores, one socket with 4 cores, or two sockets with 2 cores doesn’t really make that much of a difference. For a low utilization production system you’ll need 8 2.5+ GHz processors. Look, you can get this level of chip in a mid-high grade laptop. Now if you’re looking at a very high utilization system it’s time to think about 16 or 32 processors split up over 2 or more sockets. Once you get to the land of 32 processors, advanced SQL Server configuration knowledge is required. In particular you will need to know how to tweak the MAXDOP (Maximum Degree of Parallelism) setting.

Here’s a great read for setting a query hint: http://blog.sqlauthority.com/2010/03/15/sql-server-maxdop-settings-to-limit-query-to-run-on-specific-cpu/

And here are instructions for a system wide setting: http://technet.microsoft.com/en-us/library/ms189094(v=sql.105).aspx

What does this setting do? It controls the number of parallel processes SQL Server will use when servicing your queries. So why don’t we want SQL Server to maximize the number of parallel processes all the time? There is another engine involved in the process that is responsible for determining which processes can and cannot be done in parallel and the order of the parallel batches. In a very highly utilized SQL Server environment this engine can get bogged down. Think of it like air traffic control at a large airport… but there’s only one controller in the tower and it’s Thanksgiving, the biggest air travel holiday in the US. Well, the one air traffic controller has to assign the runway for every plane coming in and going out. Obviously, the controller becomes the bottleneck for the whole airport. If the controller only had one or two runways to work with, they wouldn’t be the bottleneck; the airport architecture would be. I have seen 32 processor systems grind to a halt with MAXDOP set at 0 because the parallelism rule processing system was overwhelmed.

For more information on the parallel processing process: http://technet.microsoft.com/en-us/library/ms178065(v=sql.105).aspx
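
To make the two MAXDOP options above concrete, here is a minimal sketch that builds the T-SQL for the system wide setting (through sp_configure) and for a per-query hint. The table name dbo.Orders and the chosen values are made up for illustration; the script only produces text and does not connect to a server.

```python
def maxdop_server_setting(max_degree):
    """Return the T-SQL batch that sets the server-wide
    'max degree of parallelism' option through sp_configure.
    A value of 0 lets SQL Server use all available processors."""
    if max_degree < 0:
        raise ValueError("MAXDOP cannot be negative")
    return "\n".join([
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max degree of parallelism', {max_degree};",
        "RECONFIGURE;",
    ])


def with_maxdop_hint(query, max_degree):
    """Append an OPTION (MAXDOP n) hint to a single SELECT statement."""
    return f"{query.rstrip().rstrip(';')} OPTION (MAXDOP {max_degree});"


if __name__ == "__main__":
    print(maxdop_server_setting(8))
    # dbo.Orders is a hypothetical table name.
    print(with_maxdop_hint("SELECT * FROM dbo.Orders;", 2))
```

The server-wide value acts as the default, while the hint overrides it for one statement, which is handy when only a handful of queries misbehave.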

RAM

RAM is always a “more is better” situation. Keep in mind that if you don’t set the size and location of the page file manually, the O/S is going to try and take 1.5 times the size of the RAM from the O/S hard drive. The more RAM on the system, the less often the O/S will have to utilize the much slower page file. For a development system 8GB will probably be fine, but nowadays you can get a mid-high level laptop with 16GB, and even 32GB is getting pretty cheap. For production 16GB is the minimum, but I’d really urge you to get 24GB. And like I said, 32GB configurations are becoming very affordable.

I’m not a DBA, But I Play One on TV: Part 1 – Hard Drives

Posted by mysticstache on August 27, 2013

This is the first in a series of posts relating to hardware considerations for a SQL Server 2008 R2 or later server. In Part 1 – Hard Drives I’m going to discuss RAID levels and what works for the Operating System (O/S) versus what works for various SQL Server components.

As a consultant I always go through the same hardware spec dance. It sounds like this:

Q: How much disk space does your application database require?

A: Depends on your utilization.

Q: Ok, what’s the smallest server we can give you for a proof of concept or 30 day trial?

A: Depends on your utilization.

Q: Well we have this VM with a 40 GB disk, 8 GB RAM, and a dual Core virtual processor available. Will that work?

A: Depends on your utilization, but I seriously doubt it.

SQL Server 2008 R2, depending on the flavor, will run on just about any Windows Server O/S from 2003 and newer, as well as Windows 7 and Windows 8. This isn’t really a discussion about the O/S, more of how the O/S services SQL Server hardware requests. At the hardware level the O/S has two main functions: managing memory and the hard disks, and servicing requests for those resources from applications.

In a later post we’ll look at memory in a little more depth, but for the hard disk discussion we’ll need to understand the page file. The page file has been part of Microsoft’s O/S products since NT, maybe Windows for Workgroups, but I don’t want to go look it up. The page file is an extension of the physical memory that resides on one or more of the system’s hard disks. The O/S will decide when to access this portion of the memory available to services and applications (processes) requesting memory resources. Many times when a process requires more memory than is currently available, the O/S will use the page file to virtually increase the size of the memory on the system in a manner transparent to the requesting process.

Let’s sum that up. The page file is a portion of disk space used by the O/S to expand the amount of memory available to processes running on the system. The implication here is that the O/S will be performing some tasks meant for lightning fast chip RAM, on the much slower hard disk virtual memory because there is insufficient chip RAM for the task. By default the O/S wants to set aside 1.5 times the physical chip RAM in virtual memory disk space. For 16GB of RAM that’s a 24GB page file. On a 40GB drive that doesn’t leave much room for anything else. The more physical chip RAM on the server the bigger the O/S will want to make the page file, but the O/S will actually access it less often.
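
The page file arithmetic above can be sketched in a few lines. This only reproduces the 1.5 times rule of thumb from this post; the function names are mine, and the actual size Windows picks depends on the version and settings.

```python
def page_file_gb(ram_gb, multiplier=1.5):
    """Default page file size as described in the post: 1.5 times RAM."""
    return ram_gb * multiplier


def space_left_gb(disk_gb, ram_gb, multiplier=1.5):
    """Disk space remaining after the page file is carved out.
    A negative result means the page file does not fit on the disk."""
    return disk_gb - page_file_gb(ram_gb, multiplier)


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(f"{ram} GB RAM: {page_file_gb(ram):g} GB page file, "
              f"{space_left_gb(40, ram):g} GB left on a 40 GB disk")
```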

Now let’s talk RAID settings! You may find voluminous literature arguing the case for software RAID versus hardware RAID. I’ll leave that to the true server scientists. I’m just going to give a quick list of which RAID configurations O/S and SQL Server components will perform well with and which will cause issues. I’m going for understanding here. There are plenty of great configuration lists you can reference, but if you don’t understand how this stuff works you’re relying on memorization or constantly referencing the lists.

Summarization from: http://en.wikipedia.org/wiki/RAID

But this has better pictures: http://technet.microsoft.com/en-us/library/ms190764(v=SQL.105).aspx

RAID 0 – Stripes data across multiple disks so they act like one. Disk size is the sum of all the identical disk sizes and there isn’t any failover or redundancy. One disk dies and all info is lost on all drives.

RAID 1 – Mirrors the disks so they all act like one. Disk size is that of one of the identical disks in the array. Full failover and redundancy.

RAID 2 – Theoretical, not used. Ha!

RAID 3 – Not very popular. Similar to RAID 0, except that data is striped byte by byte across the disks and one dedicated disk holds parity.

RAID 4 – Like RAID 3, but data is striped in blocks rather than bytes. One dedicated drive holds the parity for all the others, and the whole array is accessed by one drive letter.

RAID 5 – Requires at least 3 identical drives. Data and parity are striped across all of them, so you lose one disk’s worth of capacity and the array survives the failure of any one drive.

RAID 6 – Like RAID 5, except you need at least 4 identical disks, two disks’ worth of capacity go to parity, and the array survives two drive failures.

RAID 10 or 1+0 – A tiered approach where two or more RAID 1 arrays form a RAID 0 array. So two fully redundant RAID 1 arrays of 500GB, each made up of 3 500GB disks, come together to form 1 RAID 0 array of 1TB. Sounds expensive: 3TB in physical disks to get a 1TB accessible drive.
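
To put numbers on the list above, here is a small sketch that works out usable capacity for the common levels, given identical disks. The function and parameter names are my own, and it ignores controller overhead and hot spares.

```python
def usable_capacity_gb(level, disk_gb, disk_count, mirror_copies=2):
    """Usable capacity for an array of identical disks.

    level is one of "0", "1", "5", "6", "10". mirror_copies is only
    used for RAID 10 and is the number of disks in each mirrored group.
    """
    if level == "0":
        return disk_gb * disk_count
    if level == "1":
        if disk_count < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_gb
    if level == "5":
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return disk_gb * (disk_count - 1)
    if level == "6":
        if disk_count < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return disk_gb * (disk_count - 2)
    if level == "10":
        if disk_count % mirror_copies != 0:
            raise ValueError("disk_count must be a multiple of mirror_copies")
        return disk_gb * (disk_count // mirror_copies)
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # The RAID 10 example from the list: six 500GB disks in two 3-way mirrors.
    print(usable_capacity_gb("10", 500, 6, mirror_copies=3))
    # The same six disks as RAID 5, for comparison.
    print(usable_capacity_gb("5", 500, 6))
```

Comparing the same disk count across levels makes the cost of redundancy easy to see when you are arguing for a hardware budget.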

At this point I’ll paraphrase the information found here: http://technet.microsoft.com/en-US/library/cc966534

SQL Server Logs are written sequentially, one byte after the other. There aren’t any random or asynchronous read requests performed against these files by SQL Server. RAID 1 or 1+0 is recommended for this component for two reasons: 1. Having a full redundant backup of the log files for disaster recovery. 2. RAID 1 mirrored drives support the sequential write I/O (I/O is short for disk read and write Input and Output. I’m not going to write that 50 times.) of the log file process better than a RAID configuration that splits one file over multiple disks.

TempDB is the workhorse of SQL Server. When a query is sent to the database engine, all the work of collecting, linking, grouping, aggregating and ordering happens in TempDB before the results are sent to the requestor. This makes TempDB a heavy write I/O process. So the popular recommendation is RAID 1+0. Here’s the consideration: TempDB is temporary; that’s where it gets its name. So redundancy isn’t required for disaster recovery. However, if the disk your TempDB files are on fails, no queries can be processed until the disk is replaced and TempDB restored/rebuilt. RAID 1+0 helps fast writes and ensures uptime. RAID 5 provides the same functionality with fewer disks, but decreased performance when a disk fails.

TempDB and the Logs should NEVER EVER reside on the same RAID arrays, so if we’re talking a minimum of two RAID 1+0 arrays, it might be more cost effective to put TempDB on RAID 5.

Application OLTP (On-line Transaction Processing) databases will benefit the most from RAID 5, which equally supports read and write I/O. Application databases should NEVER EVER reside on the same arrays as the Log files and co-locating with TempDB is also not recommended.

SQL Server comes with other database engine components like the master database and MSDB. These are SQL Server configuration components and mostly utilize read I/O. It’s good to have these components on a mirrored RAID configuration that doesn’t need a lot of write performance, like RAID 1.

A best practice production SQL Server configuration minimally looks like this:

Drive 1: O/S or C: Drive where the virtual memory is also serviced – RAID 1, 80 to 100 GB.

Drive 2: SQL Server Components (master, MSDB, and TempDB) data files – RAID 1+0, 100-240 GB

Drive 3: SQL Server Logs – RAID 1+0, 100-240 GB

Drive 4: Application databases – RAID 5, As much as the databases need…

Where to skimp on a development system? Maybe RAID isn’t available either?

Drive 1: O/S or C: Drive where the virtual memory is also serviced, 80 to 100 GB.

Drive 2: SQL Server Components (master, MSDB, and TempDB) data files and application database files, As much as the databases need…

Drive 3: SQL Server Logs, 100-240 GB

Optimal Production configuration?

Drive 1: O/S or C: Drive – RAID 1, 60 GB.

Drive 2: SQL Server Components (master, MSDB) data files – RAID 5, 100GB

Drive 3: SQL Server Logs – RAID 1+0, 100-240 GB

Drive 4: Application databases – RAID 5, As much as the databases need…

Drive 5: TempDB – RAID 1+0, 50–100 GB

Drive 6: Dedicated Page File only – RAID 1, 40GB. You don’t want to see what happens to a Windows O/S when the page file is not available.

Buffer I/O is the bane of my existence. I have left no rock unturned on the internet trying to figure out how this process works. So if someone reading can leave a clarifying comment for an edit I’d appreciate it. This I do know, the buffer is kind of like SQL Server’s own page file. A place on a hard disk where information is staged before it is written to the memory pool managed by the O/S. If your system is low on memory and using the page file extensively you will see Buffer I/O waits in the SQL Server Management Studio activity monitor. Basically, this indicates that the staging process is waiting on memory to become available to move data out of the buffer and into the memory pool. The query can’t write more information to the buffer until there is space open in the buffer for it. In fact if the query resultset is big enough, the whole system will begin to die a slow and horrible death as information cannot move in and out of memory or in and out of the buffer because so much information is going in and out of the page file. This is why I highly recommend splitting up the disks so that SQL Server does not have to fight with the page file for Disk I/O.

Look, if you have 10 records in one table used by one user 2 times a day, that VM with a 40 GB disk, 8 GB RAM, and a dual core virtual processor is going to do just fine. But you might as well save some cash and move that sucker onto Access or MySQL or some other non-enterprise level RDBMS.

Open Suck… I mean Open Source

Posted by mysticstache on June 17, 2013

If you’re reading this from a socialist country, I’m sorry but you’re going to struggle to understand the basic premise of this discussion. The common cliché in capitalist societies, “You get what you pay for,” is, I believe, universally appropriate. From my father-in-law, who bought the cheapest satellite service and complains incessantly about how much he wishes he had the same cable service I have but is unwilling to pay the higher service charges, to outsourcing call centers to regions of the world that speak a different language than the users of the service, to booking a cheaper hotel near the Orlando amusements with free shuttle service that’s just a glorified, overcrowded city bus without the graffiti: going cheap is almost always going to disappoint. But this is a technical blog and my focus is Business Intelligence.

I’m working on a favor for a friend and I wanted to take this opportunity to explore some new technology. This friend of mine doesn’t have any budget for this project, so I’m looking for cost effective components for this application, which is simply a client front end to an RDBMS. My friend runs a small collection of Windows 7 desktops, I love Entity Framework, I’m proficient in Visual Studio, and I don’t need a “Big Data” solution. So I start thinking Open Source. Alright, hurdle 1: I’m not a Java guy, and some of you might start harping about how Ruby, Rails, PHP running on Apache, Beans and Java are all vastly different things…. I’m not into any of them; they’re all Java to me. A lifetime ago I played with Swing and it sucked on Windows. Most Java apps I see running on Windows are crap.

I don’t want to go into an in-depth discussion on all the options, but I decided to investigate PostgreSQL based on a recommendation from someone in my network who swears by it. One of the things I liked is the multi-OS support. Just in case the world turns upside down and I want to install the database on something other than a Microsoft OS, I thought I’d work with an RDBMS that would work the same no matter where it was installed, with one common client. The installation was smooth enough. I installed everything and clicked next, next, next… no errors. Good. Then I started researching ADO .NET clients to support Entity Framework, and that’s where the wheels fell off.

In the realm of free providers to go with the free RDBMS, there is an OLEDB provider pgnpoledb, multiple JDBC drivers, and one ODBC/.NET provider npgsql. Now, I’m a skeptical man, and before I went down the path of actually trying to connect Entity Framework to the PostgreSQL database I decided to read the npgsql wiki. Pages were devoted to all the different issues and bugs, and to what was or wasn’t being submitted for acceptance in GitHub. From the headache mounting on my cranium, I could tell this option was going to require maybe a bit more effort than I was willing to invest in a favor for a friend. A lot of posters were pointing to the .NET provider for PostgreSQL from DevArt. Long story short, $199 for what I wanted… Wait a second, I thought this crap was all Open Source and free!

Let’s just explore this concept, which has long been my complaint with the Open Source stack. If your goal is to create a mission critical high availability enterprise application with the Open Source offerings, you must be prepared to not only code your application, but also the platform on which it runs, or abandon the “Potentially Free” benefits of Open Source by purchasing licensed products to augment and stabilize the Open Source platforms. Option 1 means roughly doubling your workforce or your time to market. You need resources to code the platform and resources to code the application, or resources that do both, but really only one at a time. Option 2 cuts into your equipment and tools budget, and you need to verify what the vendor’s royalty and redistribution requirements are. No one wants to depend on a component that requires a $1000 royalty for every user on a 40,000 seat client server application, right?

There are other Open Source challenges I love to joke about with the diehard apologists I know. Like the fact that your favorite platform was written by one talented foreigner who doesn’t speak your language and only responds to email questions once a week when the internet service satellite flies over his bunker. I like a challenge as much as the next person, and I sympathize with the desire to revolt against the powerful software companies that are so slow to accommodate user needs. But I’m just not willing to chance providing a service, where contractually I have to pay a refund for every minute of down time, that depends on a platform developed by hobbyists and amateurs.

Look at the example I stated above where the free provider has lots of challenges and the paid one is stable and supports all features of the toolset it’s meant to service. Developers whose livelihood (paycheck) is dependent on the successful execution of a project are naturally going to be more motivated to generate a better product than those who are working merely to support a community. Likewise, those tasks that facilitate the collection of said paycheck will take priority over the needs of a community, which leads you to have more down time as you wait for someone to get off from work (or high school marching band practice and homework) to fix a bug in the platform your product depends on and publish it to GitHub.

Job Req. Sanity Check

Posted by mysticstache on January 17, 2013

Let me start by saying I am not an HR guy. Nor have I ever been a full-time recruiter of any sort. So perhaps, I’m way off base with my thoughts on this topic. PLEASE straighten me out if I am because there are a lot of practices within this space that make no sense to me.

I. The Skill Set Years Experience Mismatch

Lately I have seen a flood of open position postings on the various job boards that will say something to the effect of “Jr. Developer\Recent College Grad\1-2 Years experience” as the headline of the posting, only to list in the requirements section experience (which to me means more than just exposure or reading a help doc online) with some 30 different technologies. Maybe, yes maybe, with the right set of circumstances a Jr. resource as described in the headline might have started in an environment where he or she was given free rein to provide solutions through whatever means. I was lucky enough to have started my career as the only software developer for a successful insurance company, where I was able to explore whatever new technology came along and experiment with different techniques. I think this is pretty rare. Some companies spend the first 6 months breathing over a new resource’s shoulder with weekly code reviews before they’re promoted to level one and the code reviews come when the developer is ready. Many companies only let their resources sustain existing code and teach them just the basics to troubleshoot the existing technologies while the more senior staff works on innovation.

So are the hiring managers or recruiters looking for 80% of the required skills? One or two? Software design and development professionals are detail oriented and precise personalities. If I can’t talk about every skill listed, I’m not going to apply for a position.

II. Competing Technologies

Another favorite of mine is when the laundry list of experience includes market competitors. The posting is looking for someone with 5 years experience and expert knowledge of Oracle, DB2 and SQL Server, or expert level .NET and Java. First, can you really become expert in 5 years, especially if for maybe 2 of those you were just doing maintenance work (i.e. spell checking websites)? Secondly, how many companies invest tens of thousands of dollars in SQL Server and more tens of thousands in Oracle? As a vendor software developer your product may need to support more than one database platform. However, what percentage of the candidates in the job market hail from vendor software companies? Are there really any transferable skills between .NET and Java? It seems to me that trying to grow one resource into an expert in both is far more expensive than cultivating two specialists, and most companies would do the latter.

These types of requirements lead to a lot of confusion for candidates. They don’t know if they should bother applying or not. The recruiters are inundated with resumes that don’t fit the request from the hiring customer.

III. Automated Phone Recruiting

This year in particular I have been flooded with outsourced call center recruiter calls. These calls always follow the same format.

I answer the phone to silence
A few seconds later someone in a very thick accent says, “Hello may I speak to George?”
“Yes this is George.”
Faster than any normal human being should be able to speak -“Uh hi. My name is gibberish. gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish gibberish …”
Me, “Whatever you’re talking about I’m not interested. Thanks.”
Hang up.

It’s as bad as the campaign calls around supper time during an election cycle. Who in their right mind thinks this is in any way an effective means to find a qualified candidate? I seriously doubt these individuals understand the technical requirements well enough to successfully phone screen, much less fight through the language barrier well enough to have a real conversation about the candidate or the opportunity.

IV. Don’t Read the Resume

Another new interesting fishing tactic is the mail blast, or I guess that’s what’s going on. Why else am I getting emails for Jr. or Intermediate positions requiring 5 years or less from the job boards where my resume, clearly showing 16 years of experience, is posted? Or the expert Java Architect roles I was sent when Java J2EE doesn’t appear anywhere on my resume? Recruiters, does this tactic work?

I understand there is a perception in the US job market right now that a lot of people are out of work, and some companies are hoping to cash in by getting better qualified candidates for less compensation. This perception has created a recruiter feeding frenzy atmosphere. The truth is most of the top ranked talent is aware of what’s going on and they’re sitting this cycle out, or contracting. The unemployment rate among software development professionals is not nearly as high as in other fields like manufacturing and construction. I believe these tactics will not be successful, and may land your corporation with a lot of negative feedback on a site like GlassDoor.com.

To Proc or Not to Proc

Posted by mysticstache on December 20, 2012

I’ve had some interesting conversations and fun arguments about how to author queries for SQL Server Reporting Services (SSRS) reports. There are a lot of professionals out there who really want hard and fast answers on best practices. The challenge with SSRS is the multitude of configurations available for the system. Is everything (Database Engine, SSAS, SSRS, and SSIS) on one box? Is every service on a dedicated box? Is SSRS integrated with a SharePoint cluster? Where are the hardware investments made in the implementation?

Those are a lot of variables to try and make universal best practices for. Lucky for us Microsoft provided a tool to help troubleshoot report performance. Within the Report Server database there is a view called ExecutionLog3. ExecutionLog3 links together various logging tables in the Report Server database. Here are some of the more helpful columns exposed.

ItemPath – The path and name of the report that was executed in this record.
UserName – The user the report was run as.
Format – Format the report was rendered in (PDF, CSV, HTML4.0, etc.).
Parameters – Prompt selections made.
TimeStart – Server local date and time the report was executed.
TimeEnd – Server local date and time the report finished rendering.
TimeDataRetrieval – Amount of time in milliseconds to get report data from data source.
TimeProcessing – Amount of time in milliseconds SSRS took to process the results.
TimeRendering – Amount of time in milliseconds required to produce the final output (PDF, CSV, HTML4.0, etc.).
Status – Succeeded, Failed, Aborted, etc.

I always provide two reports based on the information found in this view. The first report uses the time columns to give me insight into how the reports are performing and when system utilization peaks. The second report focuses on which users are running which reports, to gauge how effective the reports are for their audience.
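As a rough sketch of what those two reports could be built on, the queries below read ExecutionLog3 directly. They assume the catalog database still has its default name, ReportServer, and that a 30-day window is a useful one. Both are assumptions of mine, so adjust them to your installation.

```sql
-- Assumption: the SSRS catalog database uses the default name ReportServer.

-- Report 1a: where each report spends its time (milliseconds), last 30 days.
SELECT  ItemPath,
        COUNT(*)               AS Executions,
        AVG(TimeDataRetrieval) AS AvgDataRetrievalMs,
        AVG(TimeProcessing)    AS AvgProcessingMs,
        AVG(TimeRendering)     AS AvgRenderingMs
FROM    ReportServer.dbo.ExecutionLog3
WHERE   TimeStart >= DATEADD(DAY, -30, GETDATE())
GROUP BY ItemPath
ORDER BY AvgDataRetrievalMs DESC;

-- Report 1b: executions per hour of the day, to spot peak utilization.
SELECT  DATEPART(HOUR, TimeStart) AS HourOfDay,
        COUNT(*)                  AS Executions
FROM    ReportServer.dbo.ExecutionLog3
WHERE   TimeStart >= DATEADD(DAY, -30, GETDATE())
GROUP BY DATEPART(HOUR, TimeStart)
ORDER BY HourOfDay;

-- Report 2: which users run which reports, and how recently.
SELECT  UserName,
        ItemPath,
        COUNT(*)       AS Executions,
        MAX(TimeStart) AS LastRun
FROM    ReportServer.dbo.ExecutionLog3
GROUP BY UserName, ItemPath
ORDER BY Executions DESC;
```

A report whose AvgDataRetrievalMs dominates points at the query or data source. One dominated by processing or rendering points at the report design or the SSRS box itself.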

Generally I’m a big fan of stored procedures, mostly because my reports are usually related to a common data source and stored procedures provide me with a lot of code reuse. Standardizing the report prompt behavior with stored procedures is also a handy tool. A simple query change can cascade to all the reports that use a stored procedure, alleviating the need to open each report and perform the same change. Additionally, I like to order the result sets in SQL, not after the data is returned to the report. But that doesn’t mean that you’re not going to find better performance moving some functionality between tiers based on the results you find in ExecutionLog3.
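To make that concrete, here is a minimal sketch of the kind of report procedure I mean. Every name in it (dbo.SalesOrder, its columns, and the procedure itself) is hypothetical and only there for illustration.

```sql
-- Hypothetical names throughout: dbo.SalesOrder, its columns, and the procedure.
CREATE PROCEDURE dbo.rpt_SalesByRegion
    @StartDate date,
    @EndDate   date,
    @Region    nvarchar(50) = NULL   -- NULL means "all regions"
AS
BEGIN
    SET NOCOUNT ON;

    SELECT  Region,
            OrderDate,
            SUM(TotalDue) AS TotalDue
    FROM    dbo.SalesOrder
    WHERE   OrderDate >= @StartDate
      AND   OrderDate <  DATEADD(DAY, 1, @EndDate)   -- include all of the end date
      AND   (@Region IS NULL OR Region = @Region)
    GROUP BY Region, OrderDate
    ORDER BY Region, OrderDate;      -- sort in SQL, not in the report
END;
```

Every report that shares this dataset calls the same procedure, so a change to the filter or the sort is made once.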

I’m sorry, there just isn’t a one-size-fits-all recommendation for how SSRS reports are structured. Which means: 1) you’ll have to do some research on your configuration, and 2) don’t accept a consultant’s dogma on the topic.

SQL Server Indexes: Using the Clustered Index

Posted by mysticstache on December 12, 2012

If you really want to understand SQL Server indexing, I suggest following Stairway to SQL Server Indexes. My first blog post on SQL Server indexes is going to focus on clustered indexes. I’m going to paraphrase a lot of the information found in the articles linked above, to save those of us who don’t want an intimate scientific knowledge of this topic, and address what is pertinent to refactoring indexes to reduce deadlocks.

Simply put, indexes are smaller, concise reference tables associated with the data tables that tell SQL Server where some of the information requested in a query is located. Without indexes, the query engine performs a full table scan (sequentially looks at every Page in a table) to retrieve the requested data rather than jumping around segmented portions of the data where the requested rows are stored. In some cases where all the information requested resides in an index, SQL Server will simply return the row from the index and not access the main data table at all.

Non-clustered indexes are separate objects with separate storage. A clustered index instructs SQL Server how to sort the data in the main data table itself and creates a logical hidden permanent key. This is why identity columns are popular to create clustered indexes on. These integer or big integer column values are created sequentially by SQL Server when a new row is inserted; there is basically never any resorting when rows are inserted or deleted.
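A minimal sketch of that pattern follows. The table and column names are hypothetical.

```sql
-- Hypothetical table for illustration.
CREATE TABLE dbo.CustomerOrder (
    OrderId    int IDENTITY(1,1) NOT NULL,   -- generated sequentially by SQL Server
    CustomerId int               NOT NULL,
    OrderDate  datetime          NOT NULL,
    -- The clustered index defines the physical sort order of the table itself.
    CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (OrderId)
);
```

Because OrderId only ever grows, each new row lands at the end of the table.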

SQL Server saves data tables and non-clustered indexes in 8 KB blocks called Pages within its data files. As CRUD actions are performed on your tables, the data within these pages must be re-sorted. SQL Server determines which rows go into which Pages. When sequentially created pages do not contain the data sequentially, external fragmentation is created. The percentage of empty page space is called internal fragmentation. The fixed-size nature of pages means some data types are better for indexes than others (creating less fragmentation); additionally, the number of fields included in an index can adversely affect index performance. I’m going to address fragmentation in a later post and explain why creating clustered indexes on GUID data types is death to a database.
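If you want to see both kinds of fragmentation for the current database, a query along these lines against the sys.dm_db_index_physical_stats function is one way to do it. The 100-page cut-off is an arbitrary choice of mine to skip tiny indexes.

```sql
-- Run in the database you want to inspect.
-- 'SAMPLED' mode is used because avg_page_space_used_in_percent is NULL in the default LIMITED mode.
SELECT  OBJECT_NAME(ips.object_id)          AS TableName,
        idx.name                            AS IndexName,
        ips.index_type_desc,
        ips.page_count,
        ips.avg_fragmentation_in_percent,   -- external (logical) fragmentation
        ips.avg_page_space_used_in_percent  -- page fullness; low values mean internal fragmentation
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
INNER JOIN sys.indexes AS idx
        ON  idx.object_id = ips.object_id
        AND idx.index_id  = ips.index_id
WHERE   ips.page_count > 100                -- arbitrary cut-off: ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;
```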

From Stairway to SQL Server Indexes: Level 3, Clustered Indexes:

The clustered index key can be comprised of any columns you chose; it does not have to be based on the primary key.

Keep in mind these additional points about SQL Server clustered indexes:

Because the entries of the clustered index are the rows of the table, there is no bookmark value in a clustered index entry. When SQL Server is already at a row, it does not need a piece of information that tells it where to find that row.
A clustered index always covers the query. Since the index and the table are one and the same, every column of the table is in the index.
Having a clustered index on a table does not impact your options for creating non-clustered indexes on that table.

There can be only one clustered index on a table; the data in a table can’t be sorted two different ways at the same time. Clustering does not require uniqueness. In the case where the clustered index is made up of non-unique fields the sorting results in grouping for these fields.

If the preceding is clear, it should be obvious that creating a clustered index on a particular column and then also creating a non-clustered index with the same column is a waste of resources. The creation of the clustered index enforces sorting of the data and creates a permanent key for SQL Server to quickly locate the Page where every row is stored.

Additionally, creating indexes in this manner can lead to deadlocks. Consider a large number of rows being inserted into a table in a batch. Even if your clustered index is created on an identity column, SQL Server will perform a sort check on the clustered index. After the clustered sort is finished, the non-clustered indexes that also include this column will be filled, and the batch will not be committed until the indexes are ready. The reason is that the non-clustered indexes can’t create their bookmarks until the page where each record is going to reside is determined. When a select against this table happens at the same time, the query engine may decide to access the non-clustered index, but the insert batch has it locked. The insert can’t complete because the select is accessing the non-clustered index that the insert also has to write to.

Now wait, if the clustered column is removed from the non-clustered index, the select query is just going to use the clustered index (if the clustered index column is part of the table joins or where clause of the select statement) and then the deadlock will occur due to the clustered index being locked. So how do we get the query engine to use the non-clustered index without using it to create the bookmark in the index? We use a covering index, which is created with the INCLUDE clause.
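As a sketch of the syntax, assume a hypothetical table dbo.CustomerOrder that is clustered on OrderId and is usually searched by CustomerId. The index and column names are made up for the example.

```sql
-- Hypothetical table: dbo.CustomerOrder, clustered on OrderId.
CREATE NONCLUSTERED INDEX IX_CustomerOrder_CustomerId
    ON dbo.CustomerOrder (CustomerId)   -- key column: determines the order of index entries
    INCLUDE (OrderDate);                -- included column: stored in the index, not part of the key

-- A query this index covers. CustomerId and OrderDate are in the index,
-- and OrderId comes along as the clustered key every non-clustered entry carries.
SELECT  CustomerId, OrderDate, OrderId
FROM    dbo.CustomerOrder
WHERE   CustomerId = 42;
```

The clustered column is not listed in the key of the non-clustered index, yet the query can still be answered from the index alone.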

From Stairway to SQL Server Indexes: Level 5, Included Columns:

Columns that are in a non-clustered index, but are not part of the index key, are called included columns. These columns are not part of the key, and so do not impact the sequence of entries in the index. Also, as we will see, they cause less overhead than key columns.

There are several advantages in having these columns in the index but not in the index key, such as:

Columns that are not part of the index key do not affect the location of the entry within the index. This, in turn, reduces the overhead of having them in the index.
The effort required to locate an entry(s) in the index is less.
The size of the index will be slightly smaller.
The data distribution statistics for the index will be easier to maintain.

Deciding whether an index column is part of the index key, or just an included column, is not the most important indexing decision you will ever make. That said, columns that frequently appear in the SELECT list but not in the WHERE clause of a query are best placed in the included columns portion of the index.

-- Step 1: collect the key columns of every clustered index on a user table.
select objt.object_id, idxc.index_id, clmn.column_id
into #ClustIdxCols
from sys.objects objt
inner join sys.indexes idx on objt.object_id = idx.object_id
inner join sys.index_columns idxc on objt.object_id = idxc.object_id and idx.index_id = idxc.index_id
inner join sys.columns clmn on objt.object_id = clmn.object_id and idxc.column_id = clmn.column_id
where objt.type = 'U' and idx.type = 1 -- 'U' = user table, 1 = clustered index
order by objt.object_id, idxc.index_id, clmn.column_id;

-- Step 2: list non-clustered indexes whose key columns include a clustered index column.
select distinct objt.name, idx.name
from sys.objects objt
inner join sys.indexes idx on objt.object_id = idx.object_id
inner join sys.index_columns idxc on objt.object_id = idxc.object_id
and idx.index_id = idxc.index_id
inner join sys.columns clmn on objt.object_id = clmn.object_id
and idxc.column_id = clmn.column_id
inner join #ClustIdxCols test on objt.object_id = test.object_id
and idxc.column_id = test.column_id
where idx.type = 2 and idxc.is_included_column = 0 -- 2 = non-clustered index, key columns only
order by objt.name, idx.name;

This script identifies non-clustered indexes using clustered index columns (include columns are filtered out). In some cases the tables only contain two indexes, a clustered index on the identity column for the table and a non-clustered index on the same column.

GUID’s – Never for Clustered Indexes

Posted by mysticstache on November 30, 2012

Globally Unique Identifiers have their place in software development. They’re great for identifying a library in the GAC or Windows registry. They are, however, huge data types from the database perspective.

Oracle, MySQL, Sybase, and DB2 do not provide any special data type for fields storing GUID’s; for these vendors a GUID is a 32-38 character string (depending on whether dashes and “{}” are included). SQL Server provides a Unique Identifier data type, which has some benefits in storage and access speed over a 36 character varchar or nvarchar field. However, they’re still huge…

Unique Identifier Data Type

http://msdn.microsoft.com/en-us/library/ms190215(v=sql.105).aspx

SQL Server’s Unique Identifier displays as a 36 character string (dashes and no “{}”) and stores a GUID as a 16-byte binary value. There’s no argument that it’s nearly impossible (not mathematically impossible) to create a duplicate GUID, but how many data sources are going to outgrow a bigint (-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)) data type? That’s only 8 bytes, half a Unique Identifier. Hard disk space has gotten cheap, so why do we care about data type size anyway? The article linked above mentions that indexes created on Unique Identifier fields are going to perform slower than indexes built on integer fields. That statement hardly scratches the surface of the performance implications of Unique Identifier indexes, and it’s all related to the size.
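You can check those sizes yourself with DATALENGTH, which returns the number of bytes used to store a value. This is just a quick sanity check, not a benchmark.

```sql
SELECT  DATALENGTH(NEWID())                         AS GuidBytes,       -- 16
        DATALENGTH(CAST(1 AS bigint))               AS BigIntBytes,     -- 8
        DATALENGTH(CAST(1 AS int))                  AS IntBytes,        -- 4
        DATALENGTH(CONVERT(varchar(36), NEWID()))   AS GuidAsVarchar,   -- 36
        DATALENGTH(CONVERT(nvarchar(36), NEWID()))  AS GuidAsNVarchar;  -- 72
```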

Pages and Extents

http://msdn.microsoft.com/en-us/library/ms190969(SQL.105).aspx

The above article explains how SQL Server stores data and indexes in 8 KB pages. 96 bytes are reserved for the page header, a row offset table sits at the end of the page, and the most data a single row can hold is 8,060 bytes. If your table consisted of just one column, and ignoring per-row overhead, those 8,060 bytes could hold at most 503 GUID’s, 1,007 bigint’s, or 2,015 int’s. Put another way, the fewer bytes in a row, the more rows you can store in one page. SQL Server doesn’t control where the pages are written on the hard disk; the O/S and hardware decide. The chances of consecutive pages being stored in distant disk sectors increase as more pages are stored for each table or index in the system. As the number of index pages grows, the more out of sync they become with the data pages, leading to index fragmentation.
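The per-page figures are plain integer division of 8,060 bytes by the size of each data type. They are upper bounds, since real rows also carry some per-row overhead that this arithmetic ignores.

```sql
-- Integer division: how many fixed-size values fit in 8,060 bytes,
-- ignoring the per-row overhead SQL Server adds in practice.
SELECT  8060 / 16 AS GuidsPerPage,    -- 503
        8060 / 8  AS BigIntsPerPage,  -- 1007
        8060 / 4  AS IntsPerPage;     -- 2015
```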

Index Fragmentation

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-1-the-basics/

http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/

Let’s recap what we have so far,

GUID’s are randomly generated values without any sequential nature or restrictions.
GUID’s are twice as big as the biggest integer data types.
The larger a table’s rows are, the more pages have to be created to store the data.
The more pages an index has, the more fragmented they get.
The more fragmented the indexes get the more frequently they have to be rebuilt.

Clustered Index Implications

Clustered indexes set the organization and sorting of the actual data in the table. Non-clustered indexes created on a table with a clustered index have to be updated with pointer changes as records are inserted or deleted, or when the clustered index value is updated, because these changes require the data pages to be re-sorted and new keys generated. SQL Server identity columns of an integer data type avoid a lot of I/O overhead and SQL Server processing because the rows are always inserted in the correct order. GUID values are almost never inserted in the correct order because of their random nature. Thus, with a GUID clustered index, every insert, delete, or update of that field can require data page reorganization, non-clustered index updates, more pages to be created, and increased fragmentation.
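If you’d rather see this than take my word for it, the following is a rough experiment to run in a scratch database. The table names, the 100-byte payload column, and the 50,000-row count are all arbitrary choices of mine; the exact numbers you get will vary by system.

```sql
-- Hypothetical demo tables: identical except for the clustered key type.
CREATE TABLE dbo.DemoGuidKey (
    Id      uniqueidentifier NOT NULL DEFAULT NEWID(),   -- random key
    Payload char(100)        NOT NULL DEFAULT 'x',
    CONSTRAINT PK_DemoGuidKey PRIMARY KEY CLUSTERED (Id)
);

CREATE TABLE dbo.DemoIntKey (
    Id      int IDENTITY(1,1) NOT NULL,                  -- sequential key
    Payload char(100)         NOT NULL DEFAULT 'x',
    CONSTRAINT PK_DemoIntKey PRIMARY KEY CLUSTERED (Id)
);

-- Insert the same number of rows into each table, one row at a time.
SET NOCOUNT ON;
DECLARE @i int = 0;
WHILE @i < 50000
BEGIN
    INSERT INTO dbo.DemoGuidKey DEFAULT VALUES;
    INSERT INTO dbo.DemoIntKey  DEFAULT VALUES;
    SET @i += 1;
END;

-- Compare page counts and fragmentation of the two clustered indexes.
SELECT  OBJECT_NAME(ips.object_id) AS TableName,
        ips.page_count,
        ips.avg_fragmentation_in_percent,
        ips.avg_page_space_used_in_percent
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
WHERE   ips.object_id IN (OBJECT_ID('dbo.DemoGuidKey'), OBJECT_ID('dbo.DemoIntKey'))
  AND   ips.index_id = 1;   -- the clustered index

-- Clean up.
DROP TABLE dbo.DemoGuidKey;
DROP TABLE dbo.DemoIntKey;
```

I’d expect the GUID-keyed table to come back with more pages, emptier pages, and much higher fragmentation than the identity-keyed one.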
