High Availability in Microsoft Azure - Part I


Objective

This article covers the fundamentals of High Availability (HA): what availability means, why it is important, and how the service offerings in the Microsoft Azure cloud can be used to achieve HA in an IaaS (Infrastructure as a Service) deployment. It also notes the key design points to keep in mind for each of those services.

What is High Availability and how is it measured?

In IT, availability refers to the percentage of time, typically measured over a year, that a system is functional and working. Today, High Availability (HA) is a requirement for most cloud-based systems: businesses need continuous service delivery and cannot afford downtime. HA is one of the most significant architectural concerns when designing solutions and applications.

HA refers to an agreed level of operational uptime for a system, as defined in a Service Level Agreement (SLA). The permitted downtime can range from days down to milliseconds per year. Some common levels are listed below.

Availability %                  Downtime per year
99% ("two nines")               3.65 days
99.9% ("three nines")           8.76 hours
99.99% ("four nines")           52.60 minutes
99.999% ("five nines")          5.26 minutes
99.9999% ("six nines")          31.56 seconds
99.9999999% ("nine nines")      31.56 milliseconds

 

The downtime figures follow directly from the availability percentage. For example, taking a month as 30 days, or 43,200 minutes, 0.05% downtime (99.95% availability) corresponds to 21.6 minutes per month. Yearly figures are calculated the same way. When an application depends on several services that must all be up, its overall availability is the product of the individual availabilities:

(Availability Service #1/100) * (Availability Service #2/100) * (Availability Service #3/100) *…

For example:

(99.95/100) * (99.9/100) * (99.9/100) = 0.9975 or an overall availability of 99.75%.
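The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the yearly figure assumes a 365-day year, so results can differ in the last digit from tables that use a slightly different year length.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Allowed downtime, in minutes, for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


def composite_availability(*availabilities_pct):
    """Overall availability (%) of services that must all be up at once."""
    overall = 1.0
    for pct in availabilities_pct:
        overall *= pct / 100
    return overall * 100


if __name__ == "__main__":
    minutes_per_month = 30 * 24 * 60  # 43,200 minutes, as in the text
    print(f"{downtime_minutes(99.95, minutes_per_month):.1f} minutes per month")
    print(f"{composite_availability(99.95, 99.9, 99.9):.2f}% overall")
```

Note that the composite figure is always lower than the weakest individual service, which is why every dependency in the request path matters.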

 

Why is it important?

High availability matters for general IT and software products because of their business importance, but it is also a primary requirement for control systems and many real-time systems: health and medical systems, defense and security systems, unmanned vehicles, drones, autonomous maritime vessels, and self-driving cars. Just think about the consequences to life and safety if these systems go down or become unavailable during operation.

What causes downtime?

Downtime can be caused by any of the events below. A true HA system accounts for scheduled as well as unscheduled downtime, although how each is treated should be clarified with the provider.

  • Planned maintenance or periodic software update/patching
  • Unplanned hardware maintenance or failure
  • Unexpected downtime or failures due to malfunctioning software or unforeseen circumstances, ranging from a power outage to a natural calamity or disaster

 

High Availability in Microsoft Azure Cloud IaaS (Infrastructure as a Service) Environment

  • Use Availability Sets when you want to deploy reliable VM-based solutions in Azure. An Availability Set is a logical grouping of two or more virtual machines in the same data center. It ensures that at least one of the virtual machines hosted on Azure will be available if something happens. This configuration offers a 99.95% SLA.
    This grouping provides redundancy for unplanned downtime events and isolates VM resources from each other when they are deployed. When you place your VMs in an Availability Set, Azure distributes them across fault domains (up to 2 or 3, depending on the region) and update domains (5 by default). If a hardware or software failure occurs, or an OS update or reboot takes place, only a subset of your VMs is affected and your overall solution stays operational.
    Update domains ensure that not all VMs are rebooted at the same time during planned maintenance of the Azure infrastructure. Only one update domain is rebooted at a time.
    Fault domains ensure that VMs are deployed on different physical racks that do not share a common power source or network switch. When servers, a network switch, or a power source suffer unplanned downtime, only the VMs in that fault domain are affected.

    Avoid leaving a single-instance virtual machine in an availability set by itself. VMs in this configuration do not qualify for the availability set SLA guarantee and face downtime during Azure planned maintenance events.
    Implementation Note
    -You can create an availability set using the New-AzAvailabilitySet PowerShell cmdlet.
    -A VM can be placed in an availability set only when the VM is created. You can't add an existing VM to an availability set afterwards.
    Design Note - Place each application tier in its own availability zones or availability set
    In the above diagram, we see a typical VM-based solution with three front-end web servers and two back-end VMs. With Azure, you would define two availability sets before you deploy your VMs: one for the web tier (the three front-end web servers) and one for the back tier (the two back-end VMs). When you create a new VM, you specify the availability set as a parameter.
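To make the effect of fault and update domains concrete, here is a small simulation. It is only an illustrative model: the round-robin assignment and the function names are assumptions made for this sketch, and the real placement is decided by the Azure platform, not by your code.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair, round-robin.

    Assumption: round-robin is a simplification used for illustration only.
    """
    return {
        name: (index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    }


def survivors(placement, failed_fault_domain=None, rebooting_update_domain=None):
    """Return the VMs still running after a rack failure or a planned reboot."""
    return [
        name
        for name, (fault_domain, update_domain) in placement.items()
        if fault_domain != failed_fault_domain
        and update_domain != rebooting_update_domain
    ]


if __name__ == "__main__":
    web_tier = place_vms(["web-0", "web-1", "web-2"])
    print(survivors(web_tier, failed_fault_domain=0))      # rack failure
    print(survivors(web_tier, rebooting_update_domain=0))  # planned maintenance

    single = place_vms(["solo"])
    print(survivors(single, failed_fault_domain=0))        # nothing left running
```

With three VMs spread over two fault domains, losing either rack still leaves at least one web server running, while the single-instance case leaves none. That is the reason a lone VM in an availability set gains nothing from it.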

  • VM Scale Sets - Azure VM scale sets let us create and manage a group of load-balanced VMs together. Scale sets provide high availability for our VMs and at the same time let us manage all the VMs as a group. A scale set can also be configured to scale automatically based on load, which further helps us achieve high availability for VMs.
  • Availability Zones (AZ) - This is the next level of high availability for Azure Virtual Machines, because the Virtual Machines are placed in different physical locations within an Azure region. The 99.99% SLA applies when two or more Virtual Machines are deployed across two or more Availability Zones in the same region. Availability Zones are unique physical locations with independent power, network, and cooling. Each Availability Zone is made up of one or more datacenters and houses infrastructure to support highly available, mission-critical applications. Availability Zones are tolerant to datacenter failures through redundancy and logical isolation of services.
    (Some services, such as SQL Database and zone-redundant storage, replicate automatically across zones.)


Design Note - There are some things to consider when using Availability Zones:

-You can't deploy Azure Availability Sets within an Availability Zone. You need to choose either an Availability Zone or an Availability Set as the deployment frame for a VM.

-You can't use the Basic Load Balancer to create failover cluster solutions based on Windows Failover Cluster Services or Linux Pacemaker. Instead, you need to use the Azure Standard Load Balancer SKU.

-Azure Availability Zones do not guarantee a particular distance between the zones within a region.

-The network latency between Availability Zones within a region may differ from one Azure region to another.

 

  • Azure Storage redundancy - The data in your storage account is always replicated to ensure durability and high availability, meeting the Azure Storage SLA even in the face of transient hardware failures. Data can be replicated within the same data center (LRS, Locally Redundant Storage: 3 copies), across availability zones within the same region (ZRS, Zone Redundant Storage: 3 copies), or across regions (GRS, Geo Redundant Storage: 3 copies in the primary region and 3 copies in a secondary region, replicated asynchronously). Choose the appropriate replication type when you create the storage account.
     
  • Azure Managed Disks - Managed Disks is a resource type in Azure Resource Manager that is recommended over virtual hard disks (VHDs) stored in Azure storage accounts. Managed disks automatically align with the availability set of the virtual machine they are attached to. This increases the availability of your virtual machine and the services running on it.

  • Azure Load Balancer (LB) - Combine a load balancer with availability zones or sets to get the most application resiliency. The Azure Load Balancer distributes traffic between multiple virtual machines. For our Standard tier virtual machines, the Azure Load Balancer is included. If the load balancer is not configured to balance traffic across multiple virtual machines, any planned maintenance event affects the only traffic-serving virtual machine and causes an outage in your application tier. Placing multiple virtual machines of the same tier under the same load balancer and availability set enables traffic to be continuously served by at least one instance.

  • SQL Server Always On Availability Group - This is a collection of high availability and disaster recovery features introduced in SQL Server 2012. If you are using SQL Server, Always On Availability Groups are strongly recommended for high availability. Create a single availability group that includes the SQL Server instances in both regions. Prior to Windows Server 2016, SQL Server Always On Availability Groups require a domain controller, and all nodes in the availability group must be in the same Active Directory (AD) domain. In IaaS, you need to implement additional mechanisms yourself to ensure the availability of your databases. With Always On availability groups, you can reach an HA solution at 99.99% by adding a second SQL Server VM (SQL VMs SQL-1 and SQL-2 in the figure below).

  • Multi-Region Deployment, Regional Pairing and Failover Mechanism - Use two regions, a primary and a secondary, to achieve higher availability. The primary region serves traffic; the secondary region is used for failover.
      Design Notes:

  • Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair (for example, East US 2 and Central US). Benefits of doing so include:
      - Planned Azure system updates are rolled out to paired regions sequentially, to minimize possible downtime.
      - Pairs reside within the same geography, to meet data residency requirements.
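As a rough way to see why a second region raises availability, the standard redundancy formula can be applied. This is a sketch under strong assumptions: the regions fail independently, failover is instantaneous and always succeeds, and the availability of the routing layer itself is ignored. The function name is illustrative.

```python
def redundant_availability(single_pct, copies):
    """Availability (%) of a system that is up while any one copy is up.

    Assumptions: copies fail independently and failover is perfect.
    """
    all_down = (1 - single_pct / 100) ** copies
    return (1 - all_down) * 100


if __name__ == "__main__":
    for regions in (1, 2):
        print(regions, f"{redundant_availability(99.9, regions):.4f}%")
```

In practice the real figure is lower than this estimate, because failover takes time and the component that routes traffic between regions has its own SLA.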
 
  • BCDR (Business Continuity and Disaster Recovery using Azure Backup and Site Recovery) - As an organization, you need to adopt a business continuity and disaster recovery (BCDR) strategy that keeps your data safe, and your apps and workloads up and running, when planned and unplanned outages occur.

    Azure Recovery Services contribute to your BCDR strategy:

  • Site Recovery service: Site Recovery helps ensure business continuity by keeping business apps and workloads running during outages. Site Recovery replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. When an outage occurs at your primary site, you fail over to the secondary location and access apps from there. After the primary location is running again, you can fail back to it.
  • Backup service: The Azure Backup service keeps your data safe and recoverable by backing it up to Azure.

    Site Recovery can manage replication for:

  • Azure VMs replicating between Azure regions.
  • On-premises VMs, Azure Stack VMs and physical servers.

  • Virtual networks - Create a separate virtual network for each region. Make sure the address spaces do not overlap.
  • Virtual network peering - Peer the two virtual networks to allow data replication from the primary region to the secondary region.
  • Azure Traffic Manager - Traffic Manager routes incoming requests to one of the regions. Traffic Manager supports several routing algorithms. For the scenario described in this article, use priority routing (formerly called failover routing). With this setting, Traffic Manager sends all requests to the primary region, unless the primary region becomes unreachable or any regional outage makes it unavailable. At that point, it automatically fails over to the secondary region.
    Design Note - Traffic Manager is a possible failure point in the system. If the Traffic Manager service fails, clients cannot access your application during the downtime. Consider adding another traffic management solution as a fallback. If the Azure Traffic Manager service fails, change your CNAME records in DNS manually to point to the other traffic management service.
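The priority routing behavior described above can be sketched as a simple selection rule. This is not Traffic Manager's implementation, only a model of the decision it makes; the Endpoint class and pick_endpoint function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int   # lower number means higher priority
    healthy: bool   # result of the most recent health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the highest priority, or None."""
    healthy = [endpoint for endpoint in endpoints if endpoint.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda endpoint: endpoint.priority)


if __name__ == "__main__":
    regions = [
        Endpoint("primary-region", priority=1, healthy=True),
        Endpoint("secondary-region", priority=2, healthy=True),
    ]
    print(pick_endpoint(regions).name)  # all traffic goes to the primary

    regions[0].healthy = False          # simulate a regional outage
    print(pick_endpoint(regions).name)  # traffic fails over to the secondary
```

The case where the function returns None corresponds to both regions being down, which no routing layer can fix and which is why the fallback advice in the design note still matters.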
  • Resource groups - Create separate resource groups for the primary region, the secondary region, and for any Load Balancer / Traffic Manager. This gives the flexibility to manage each region as a single collection of resources. For example, you could redeploy one region without taking down the other one. Link the resource groups, so that you can run a query to list all the resources for the application.
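The earlier requirement that the virtual network address spaces of the two regions must not overlap is easy to check before deployment. The sketch below uses Python's standard ipaddress module; the region names and CIDR ranges are made-up examples, not values from this article.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(address_spaces):
    """Return pairs of network names whose CIDR ranges overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()
    }
    return [
        (first, second)
        for first, second in combinations(networks, 2)
        if networks[first].overlaps(networks[second])
    ]


if __name__ == "__main__":
    # Hypothetical address plan for the two regional virtual networks.
    plan = {"primary-vnet": "10.0.0.0/16", "secondary-vnet": "10.1.0.0/16"}
    conflicts = overlapping_pairs(plan)
    print(conflicts if conflicts else "No overlapping address spaces")
```

Running a check like this against the planned address spaces catches conflicts early, since overlapping ranges prevent the two networks from being peered.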
Based on the concepts outlined above, a reference architecture (taken from MSDN) is depicted below to give a complete picture of the High Availability design.

Summary
This concludes the article. As we can see, designing High Availability solutions involves some important considerations, which I have tried to bring together here in a comprehensive manner. Thank you.
