A new era for SQL Server FCI on Azure VMs

Published by Marco Obinu

Recently, I’ve blogged about several different ways to achieve high availability with SQL Server on Azure VMs. SQL Server Failover Cluster Instances are the right choice if you’re tight on budget, but they have their drawbacks. In that context, Azure Shared Disks and Distributed Network Names are the game-changers that help reduce the complexity of SQL FCI deployments on Azure.

But how can they help you?

Finally, the shared storage!

Traditionally, the lack of shared storage in Azure was one of the main blockers for SQL FCI deployments. You could work around the issue by using Storage Replica or Azure File Shares to back your storage subsystem, but in any case you had some disadvantages, as reported in the article.

Azure Shared Disks are here to fill this gap. Built on Azure managed disks, both Premium SSD and Ultra Disk, they support SCSI Persistent Reservations and can be shared between different VMs, just like you usually do with SAN LUNs on-premises.

Once attached to the Azure VMs, configure these disks as Cluster Shared Volumes for your FCI, and use them to host data and log files. By adding multiple disks with appropriate caching settings, you can achieve high performance.

Load balancer? No, thanks!

The need for a load balancer in front of a SQL FCI is another point that usually hurts customers, and it’s a big difference between on-prem and Azure deployments.

Until now, you had to provision a load balancer to manage the cluster IPs of your solution. Without it, Azure is not aware of the IP addresses you assign at the guest OS level, which may cause IP conflicts.

The load balancer adds a layer of complexity to your solution and can cause issues in your event log due to a lack of connectivity.

Distributed Network Names act differently. Instead of exposing a clustered server through a Virtual Network Name with its own dedicated IP, the DNN registers the DNS name you assign to the resource with the IP addresses of all the cluster nodes. So your app won’t resolve the clustered network name to a load-balanced IP; it will obtain the IPs of both nodes, and it can try each one to connect to the resource.

Microsoft introduced DNNs some years ago, to support Scale-Out File Server scenarios if I recall correctly. Windows Server 2019 uses them by default to expose the cluster network name of your WSFC when you’re on Azure VMs. Now, starting with SQL Server 2019 CU2 and currently in preview, you can use them to manage SQL Server FCI clustered network names as well.

It’s a huge benefit since you can avoid the load balancer for SQL FCI when using both Windows Server 2019 and SQL Server 2019.

Let’s deploy a SQL FCI that uses them!

You can quickly deploy a two-node cluster with Azure Shared Disks and DNNs by following my previous article. Here, we’ll focus on some aspects of the deployment.

In my provisioning script, I first deployed a couple of VMs from my SQL-optimized ARM template, and then I added two shared disks via PowerShell:

# Deploy two cluster nodes
New-AzResourceGroupDeployment -Name "SHDFCI-Node01" `
    -ResourceGroupName $rgName `
    -TemplateUri "https://raw.githubusercontent.com/OmegaMadLab/OptimizedSqlVm-v2/master/azuredeploy.json" `
    -TemplateParameterFile ".\SHDFCI\azuredeploy.parameters.shdfci.cl01.json" `
    -AsJob

Start-Sleep -Seconds 30

New-AzResourceGroupDeployment -Name "SHDFCI-Node02" `
    -ResourceGroupName $rgName `
    -TemplateUri "https://raw.githubusercontent.com/OmegaMadLab/OptimizedSqlVm-v2/master/azuredeploy.json" `
    -TemplateParameterFile ".\SHDFCI\azuredeploy.parameters.shdfci.cl02.json"
    

# Create two Shared Disks to host database data and log file
$diskConfig = New-AzDiskConfig `
                -Location $location `
                -DiskSizeGB 512 `
                -MaxSharesCount 2 `
                -AccountType Premium_LRS `
                -CreateOption Empty

$dataDisk = New-AzDisk -ResourceGroupName $rgName `
                -DiskName 'SharedDataDisk' -Disk $diskconfig

$logDisk = New-AzDisk -ResourceGroupName $rgName `
                -DiskName 'SharedLogDisk' -Disk $diskconfig

The MaxSharesCount parameter defines the maximum number of cluster nodes that can attach the disk at the same time.
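If you want to double-check the setting after deployment, an optional query like the following should show it (it assumes the disk names used above):

# Verify the shared disks and their maxShares value
'SharedDataDisk', 'SharedLogDisk' | % {
    Get-AzDisk -ResourceGroupName $rgName -DiskName $_ |
        Select-Object Name, DiskSizeGB, MaxShares
}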

If you prefer doing everything via ARM templates, you can find a sample here.

Add these disks to both VMs. Again, I did it via PowerShell:

$vmNames = @()
$vmNames += "SqlShdClNode01"
$vmNames += "SqlShdClNode02"

$vm = $vmNames | % { Get-AzVm -Name $_ -ResourceGroupName $rgName }
$vm | Stop-AzVM -Force
                
$vm | % { 
    Add-AzVMDataDisk -VM $_ `
        -Name $dataDisk.Name `
        -CreateOption Attach `
        -ManagedDiskId $dataDisk.Id `
        -Lun 0 `
        -Caching None # Using cache=none for data disks due to Shared Disks preview limits
    
    Add-AzVMDataDisk -VM $_ `
        -Name $logDisk.Name `
        -CreateOption Attach `
        -ManagedDiskId $logDisk.Id `
        -Lun 1 `
        -Caching None
}

$vm | Update-AzVm

$vm | Start-AzVM -asJob 

As you may have noticed, I used caching=none on both disks: this is a preview limitation; I expect that at GA I’ll be able to differentiate the cache settings between data and log disks.
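Just as a hedged sketch of what that could look like once the limitation is lifted (not something you can do with shared disks today, and the post-GA behavior is an assumption on my side), you would switch the data disk to read-only host caching and leave the log disk on None, with the VMs deallocated:

# Hypothetical post-GA tuning: read-only cache on data, no cache on log
$vm | Stop-AzVM -Force
$vm | % {
    Set-AzVMDataDisk -VM $_ -Name $dataDisk.Name -Caching ReadOnly | Out-Null
    Set-AzVMDataDisk -VM $_ -Name $logDisk.Name -Caching None | Out-Null
}
$vm | Update-AzVM
$vm | Start-AzVM -AsJob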

You can find the above steps in the 06-SHDFCI.ps1 script.

Now you’re ready to connect to the first cluster node and proceed with the setup.

Configuring Cluster Shared Volumes

I added the disks to the failover cluster as Cluster Shared Volumes, using this portion of PowerShell:

# Initialize shared disks
$disk = Get-Disk | ? PartitionStyle -eq "RAW" 
$disk | Initialize-Disk -PartitionStyle GPT 
$disk | Set-Disk -IsOffline:$false
$disk | 
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 64KB -Force -Confirm:$false

# Create WSFC  
Test-Cluster -Node $sqlNodes -Include "Inventory", "Network", "System Configuration"
New-Cluster -Name $clusterName -Node $sqlNodes -NoStorage

# Change cluster quorum configuration to storage account
Set-ClusterQuorum -CloudWitness -AccountName $witnessSaName -AccessKey $witnessSaKey

# Add available storage to the cluster as CSV
Get-ClusterAvailableDisk | Add-ClusterDisk | Add-ClusterSharedVolume

The New-Cluster cmdlet automatically provisions the DNN for the cluster network name, if executed in an Azure VM.
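If you want to verify it, a quick optional check from one of the nodes should list the DNN resource and the DNS name it registers:

# List the Distributed Network Name created for the cluster name
Get-ClusterResource |
    ? ResourceType -eq "Distributed Network Name" |
    Get-ClusterParameter -Name DnsName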

After that, you can install your SQL Server FCI as usual, taking care to use the CSVs as the path for your database files.
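As a minimal sketch, assuming the two CSVs mount as Volume1 and Volume2 under C:\ClusterStorage (the volume names depend on your cluster, so adjust them), the folders to point the setup wizard at could look like this:

# Example folder layout on the CSVs for data and log files
New-Item -ItemType Directory -Path "C:\ClusterStorage\Volume1\SQLData" -Force
New-Item -ItemType Directory -Path "C:\ClusterStorage\Volume2\SQLLog" -Force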

You can find the above part, as well as automated setup commands for SQL Server, in the SHDFCI-GuestClusterConfig.ps1 script.

Configuring the DNN for SQL Server

To configure the DNN for SQL Server, you have to proceed manually: the setup wizard provisions the usual Virtual Network Name, and you have to switch from the VNN to a DNN afterwards.

# Update the SQL Server cluster group to use a DNN
$sqlClusterGroup = Get-ClusterGroup | Where-Object Name -Like "*SQL SERVER*"
$sqlClusterVNN = $sqlClusterGroup | Get-ClusterResource | ? ResourceType -eq "Network Name"

# Get current DNS Name assigned to the SQL FCI
$sqlDnsName = $sqlClusterVNN | Get-ClusterParameter -Name DnsName | Select -ExpandProperty Value

# Stop and rename (both name and DnsName) the VNN
$sqlClusterVNN | Stop-ClusterResource
$sqlClusterVNN.Name = "$sqlDnsName-VNN"
$sqlClusterVNN | Set-ClusterParameter -Name "DnsName" -value "SQLFCI-VNN"
$sqlClusterVNN | Start-ClusterResource

# Create a DNN and assign it the original FCI DnsName
$sqlClusterDNN = Add-ClusterResource -Name "$sqlDnsName-DNN" `
                       -ResourceType "Distributed Network Name" `
                       -Group $sqlClusterGroup

$sqlClusterDNN | Set-ClusterParameter -Name "DnsName" -value $sqlDnsName

# Create a SQL Alias from VNN to DNN on both cluster nodes
# https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/failover-cluster-instance-dnn-interoperability
New-DbaClientAlias -ComputerName $sqlNodes[0] -Alias "SQLFCI-VNN" -ServerName $sqlDnsName
New-DbaClientAlias -ComputerName $sqlNodes[1] -Alias "SQLFCI-VNN" -ServerName $sqlDnsName

$sqlClusterGroup | Stop-ClusterGroup
$sqlClusterGroup | Start-ClusterGroup

You need to rename the out-of-the-box VNN and create a new DNN. The DNN receives the SQL FCI DNS name that the VNN was previously using.

You can’t remove the VNN, since it’s an integral part of the clustered role. You should also define a SQL alias on both nodes that redirects the renamed VNN name to the DNN: some legacy features have hard-coded references to the VNN, as stated in the documentation, and the alias keeps them working. To do this, I used the fantastic dbatools module.

That’s all: after a restart of the cluster group, your SQL FCI will accept connections on the DNN DNS name.

As you can see here, the DNN resolves to the IP addresses of both cluster nodes:

DNN IP configuration
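You can reproduce the check with a simple DNS query from a node or a client in the same VNet; the output should contain one A record per cluster node:

# The DNN DNS name should resolve to the IP of every cluster node
Resolve-DnsName -Name $sqlDnsName -Type A | Select-Object Name, IPAddress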

And this is the behavior while failing over the nodes. You can see the active node is switching after a failover to the secondary node:

SQL FCI with DNN failover behavior

The outage due to the failover is very short, shorter than with a load balancer, which needs to probe the new active node before routing traffic to it.
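If you want to reproduce the test, you can simply move the SQL Server clustered role between nodes; the group name filter below matches the one used earlier in the script:

# Trigger a failover of the SQL Server role to the second node
Get-ClusterGroup | ? Name -like "*SQL Server*" |
    Move-ClusterGroup -Node $sqlNodes[1]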

With the DNN, you can also use the MultiSubnetFailover keyword in the client connection string, as you usually do with stretched clusters, to speed up the reconnection process.
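As a minimal sketch (the server name comes from the variables above; database and security settings are placeholders), a client connection could look like this:

# Example connection using the DNN name and MultiSubnetFailover
$connString = "Server=$sqlDnsName;Database=master;Integrated Security=True;MultiSubnetFailover=True"
$conn = New-Object -TypeName System.Data.SqlClient.SqlConnection -ArgumentList $connString
$conn.Open()
$conn.State   # should report 'Open'
$conn.Close()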

A final thought

There’s a step in this procedure that is a bit unclear to me.

We renamed the VNN to assign its DNS name to the new DNN, but we left its clustered IP address untouched. Can we remove it? The documentation doesn’t mention it, at least for now.

While waiting for clarification from Microsoft, I left it alone in my cluster. But it’s something that should be addressed before GA, since leaving a cluster IP configured only at the guest OS level is not a good idea: Azure isn’t aware of it, and this may lead to IP conflicts, as in the early days of clustering on Azure.

I’ll update you on this topic once I find the correct way to address it!


Marco Obinu

Curious by nature, a talkative geek who can speak in front of an audience or a camera, in love with technology, especially SQL Server. I want to understand how things work and to solve problems by myself, at work as well as in my many hobbies.

5 Comments

Thierry Van Durme · 7 August 2020 at 13:34

Thanks for sharing. As for your question about the clustered IP: could it be required by some of the other features that need special attention? Of course I suppose that would mean we still need the load balancer next to the DNN so that would be a pain…
What I’m also missing in the documentation is whether or not we need to add a dependency to the SQL Server resource on the DNN, just like it’s depending on the Network Name resource in a usual on-prem setup?
Cheers
Thierry

    Marco Obinu · 7 August 2020 at 15:00

    Hi Thierry, thanks for your feedback.
    With Windows Server 2019, we can avoid the load balancer in front of the cluster: both the cluster VNN and the SQL VNN can be replaced by DNNs. If you look at the official documentation, you must create SQL aliases on both cluster nodes to support those features that are still referencing the SQL VNN: with the alias, you “redirect” them to the DNN.
    I opened an issue on the doc to obtain a better indication about how to manage the VNN IP: even if the official page doesn’t reflect the change yet, it appears the product team agreed with my proposal of using a link-local IP, just to avoid messing with subnet IPs.
    Also, the doc doesn’t mention anything about dependencies, but it probably makes sense to add them; I’ll do some tests ASAP.

      Thierry Van Durme · 7 August 2020 at 15:31

      Hi Marco,
      thanks for the quick reply!
      Your suggestion about the link-local IP is very nice.
      I’m not sure about the dependency, one might prefer to not see SQL go down if the DNN somehow fails. I’m just starting to find my way around in Azure, so looking forward to your test results and insights.
      Thanks again!
      Thierry

David Bermingham · 25 September 2020 at 00:03

I’ve seen documentation on using DNN on SOFS and SQL Listeners. Have you seen anyone use it for traditional file servers?

    Marco Obinu · 25 September 2020 at 07:43

    Hi David, I really appreciate your blog; I’ve read a lot of interesting posts over time. Regarding the question, I know there’s support for SQL availability groups on the roadmap, but I didn’t see anything about “standard” file servers. I’ll try to have a look around.
