High Availability means different things to different people in the IT industry. IMHO, High Availability refers to keeping downtime as low as possible during the hours of operation. In a utopian world where nothing goes wrong and everything works as planned, our SQL Server installation would be available 100% of the time the end users need it. That uptime requirement could be 24/7 for 365 days a year, 24/7 for 5 days of the week, or 9-6 on weekdays. As long as those business requirements for uptime are met, the availability solution is a high availability solution. Not every shop requires a 24/7 solution for 365 days, or even 5 days of the week. In reality, failures, mistakes made by personnel and upgrades will cause outages, so the achievable uptime might be 99.999%, 90% or any other number. What the uptime requirement actually is needs to be ascertained by the business users and the management.
Only after that decision has been made should one start looking into which HA (High Availability) and DR (Disaster Recovery) solution to opt for. If this brainstorming is not done, you might end up spending too much when a low-cost solution would have sufficed; likewise, you might end up choosing a low-cost option when you actually needed better uptime for your database servers.
The failure types that bring uptime down can be categorized as:
• Natural Causes – floods, earthquake, fire etc.
• Hardware failures – CPU, Memory, Network Card, System Board etc.
• Network Infrastructure
• Electrical Power
• Upgrades of the OS or SQL Server
• Human errors and scheduled maintenance
Different options available for HA:
1) Failover Clustering:
This is an expensive option, but also one that provides very good high availability in terms of keeping downtime low when a node in the cluster fails. Failover to the passive node is fairly fast (anywhere between 1 and 5 minutes, depending on the state of the cluster). Note that clustering provides server redundancy only; it does not protect the data that resides on the shared SAN.
This is very well documented and has been available for quite some time as the failover solution for MS SQL Server. Since clustering operates at the instance level, all instance-level resources are available after the failover happens. There are reams of documentation on this, and we have provided those links in the References section towards the end of the document.
The reason this is an expensive option is that, unlike Oracle RAC (Real Application Clusters), clustering for MS SQL Server follows a shared-nothing architecture: at any given point in time, only one node can own the resources. This means that your passive server (which should match the hardware configuration of the active one) sits idle most of the time, until disaster strikes and a failover happens from the primary server to the passive one. One way to offset this cost is an Active-Active cluster, which is essentially two Active-Passive cluster combinations. It does drive the processor and memory requirements up a bit on each server, since in the event of a failure the surviving node will house two instances; however, this option works out well for a lot of shops that have more than one critical application requiring SQL Server as the back-end.
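Since clustering itself is configured at the Windows level rather than in T-SQL, a quick sanity check from the database side is simply to ask the instance whether it is clustered and which physical node currently owns it:

```sql
-- Returns 1 if this instance is a failover cluster instance,
-- and the NetBIOS name of the node it is currently running on.
SELECT SERVERPROPERTY('IsClustered')                  AS IsClustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNode;
```

Running this before and after a failover is an easy way to confirm which node the instance landed on.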
2) Database Mirroring:
This is a new feature that was introduced with SQL Server 2005 and became fully supported as of SP1. It works at the database level and allows you to duplicate the production database (synchronously, if you choose to) from one server to another. If the principal server goes down for some reason, the system will fail over to the mirrored database instance.
This is less costly than the failover clustering option and can also be combined with failover clustering to provide a very robust solution. When combined with failover clustering, the local failover is done by the clustering solution and the site failover is automated using database mirroring.
The risk is that this is a fairly new feature and is not yet 100% proven in the field.
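At its core, setting up mirroring is a matter of creating mirroring endpoints on each partner, restoring the database on the mirror WITH NORECOVERY, and pointing the partners at each other. A minimal sketch follows; the database name (Sales), server names and port 5022 are assumptions for illustration:

```sql
-- On each partner instance: create a mirroring endpoint.
CREATE ENDPOINT Mirroring
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = PARTNER);

-- On the mirror server (after restoring the database WITH NORECOVERY):
ALTER DATABASE Sales SET PARTNER = 'TCP://principal.corp.local:5022';

-- On the principal server:
ALTER DATABASE Sales SET PARTNER = 'TCP://mirror.corp.local:5022';

-- Optional: add a witness to enable automatic failover (high-safety mode).
ALTER DATABASE Sales SET WITNESS = 'TCP://witness.corp.local:5022';
```

Without a witness, failover is a manual step; the witness is what makes the automatic site failover mentioned above possible.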
3) Database Snapshots:
This is also a new feature in SQL Server 2005 and also operates at the database level. It can be used to create snapshots, which are static, read-only, point-in-time copies of your production database. Note that a snapshot must reside on the same instance as its source database; offloading reporting to another server with snapshots is therefore normally done by taking snapshots of a mirror database. Either way, moving the reporting load off the production database will boost your production server's performance.
Even though some shops use snapshots as their HA mechanism, we do not recommend it. Snapshots are a useful option, but not for HA: building a good HA solution around them involves a lot of manual steps.
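For reference, creating and reverting to a snapshot looks like this (the database name, logical file name and paths are assumptions; a snapshot needs one file entry per data file in the source database):

```sql
-- Create a static, read-only snapshot of the Sales database.
CREATE DATABASE Sales_Snapshot
ON (NAME = Sales_Data, FILENAME = 'D:\Snapshots\Sales_Snapshot.ss')
AS SNAPSHOT OF Sales;

-- Reporting queries can now run against Sales_Snapshot.

-- Reverting the source database to the snapshot is one of the
-- manual recovery steps mentioned above:
RESTORE DATABASE Sales FROM DATABASE_SNAPSHOT = 'Sales_Snapshot';
```

The revert only works if Sales_Snapshot is the sole snapshot of Sales at that time, which illustrates why snapshot-based "HA" requires careful manual housekeeping.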
4) Log Shipping:
This is a very low-cost option, but also one that can cause data loss. However, some shops are OK with that, since the synchronization interval is configurable and they know how much data is at risk. Database Mirroring is a much better option than log shipping, and as it becomes more robust we believe it will see more use than log shipping.
Log shipping can be used in conjunction with failover clustering to provide good site-failure redundancy at low cost. There will be manual recovery steps when a site failure occurs, though.
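Under the covers, log shipping is just a scheduled backup/copy/restore cycle of transaction log backups. A sketch of the core commands (database name, paths and file names are assumptions):

```sql
-- On the primary, on a schedule (e.g. every 15 minutes):
BACKUP LOG Sales TO DISK = 'D:\LogShip\Sales_0930.trn';

-- On the secondary, after the file has been copied across:
-- STANDBY keeps the database readable between restores.
RESTORE LOG Sales FROM DISK = 'E:\LogShip\Sales_0930.trn'
    WITH STANDBY = 'E:\LogShip\Sales_undo.dat';

-- At failover time, the manual step is to finish recovery:
RESTORE DATABASE Sales WITH RECOVERY;
```

In practice the schedule, copy job and monitoring are set up through SQL Server's log shipping wizard or its system stored procedures rather than by hand, but this is what they do, and the gap between the last restored log backup and the failure is exactly the data loss discussed above.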
5) Back-ups and Restore:
The old backup-and-restore technique. This is really not an HA solution. Every shop needs a good set of backup and restore policies and procedures in place to ensure that its data is safe, but this is not an HA option, even though shops that are OK with some data loss and want the cheapest option available do at times use it for their non-critical systems.
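The baseline every shop should have, regardless of which HA option it picks, is a full backup plus periodic log backups and a tested restore sequence (names and paths below are assumptions):

```sql
-- Full backup plus a periodic transaction log backup.
BACKUP DATABASE Sales TO DISK = 'D:\Backups\Sales_full.bak' WITH INIT;
BACKUP LOG Sales TO DISK = 'D:\Backups\Sales_log1.trn';

-- Recovery: restore the full backup without recovery,
-- then roll the log backups forward in sequence.
RESTORE DATABASE Sales FROM DISK = 'D:\Backups\Sales_full.bak' WITH NORECOVERY;
RESTORE LOG Sales FROM DISK = 'D:\Backups\Sales_log1.trn' WITH RECOVERY;
```

The time it takes to run this restore sequence end to end is your real downtime with this option, which is why it is a recovery plan rather than an HA solution.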
The bottom line is that you need to evaluate the criticality of the application you plan to support and for which you are putting an HA plan in place, and then choose from the options mentioned above.