Channel: Infrastructure – rakhesh.com

Notes on NLB, VMware, etc


Just some notes to myself so I am clear about this while reading up on it. In the context of this VMware KB article – Microsoft NLB not working properly in Unicast mode.

Before I get to the article I better talk about a regular scenario. Say you have a switch with a couple of devices connected to it. A switch is a layer 2 device – meaning it has no knowledge of IP addresses, networks, etc. All devices connected to a switch are in the same network. The devices have IPv4 (or IPv6) addresses, yes, but they communicate with each other via MAC addresses.

Say Server A (IPv4 address 10.136.21.12) wants to communicate with Server B (IPv4 address 10.136.21.22). Both are connected to the same switch, hence on the same LAN, so communication between them happens at layer 2. Here the machines identify each other via MAC addresses, so first Server A checks whether it knows the MAC address of Server B. If it does (usually coz Server A has communicated with Server B recently and the MAC address is cached in its ARP table) then there’s nothing to do; if it does not, Server A finds the MAC address via something called ARP (Address Resolution Protocol). The way this works is that Server A broadcasts to the whole network that it wants the MAC address of the machine with IPv4 address 10.136.21.22 (the address of Server B). This message goes to the switch, the switch sends it to all the devices connected to it, Server B replies with its MAC address, and that is sent to Server A. The two now communicate – I’ll come to that in a moment.
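As an aside, you can inspect this ARP cache yourself on a Windows machine. A quick sketch using the built-in Get-NetNeighbor cmdlet (Windows 8/Server 2012 and later):

# View the ARP (neighbor) cache – IPv4 entries with their resolved MAC addresses
Get-NetNeighbor -AddressFamily IPv4 | Where-Object State -ne Unreachable |
    Select-Object IPAddress, LinkLayerAddress, State

# The older way, which works on any Windows version
arp -a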

When it’s communication from devices in a different network to Server A or Server B, the idea is similar, except that you have a router connected to the switch. The router receives traffic for a device on this network – it knows the IPv4 address – so it finds the MAC address as above and passes the traffic on to that device. Simple.

Now, how does the switch know which port a particular device is connected to? Say the switch gets traffic addressed to MAC address 00:eb:24:b2:05:ac – how does the switch know which port that is on? Here’s how that happens –

  • First the switch checks if it already has this information cached. Switches have a table called the CAM (Content Addressable Memory) table which holds this cached info.
  • Assuming the CAM table doesn’t have this info, the switch will send the frame (containing the packets for the destination device) out all its ports. Note, this is not like ARP where a question is sent asking for the device to respond; instead the frame is simply flooded to all ports on the network.
  • When a switch receives frames on a port it notes the source MAC address and port, and that’s how it keeps the CAM table up to date. Thus when Server A sends data to Server B, the MAC address and switch port of Server A are stored in the switch’s CAM table. These entries are only stored for a brief period before they age out.

Now let’s talk about NLB (Network Load Balancing).

Consider two machines – 10.136.21.11 with MAC address 00:eb:24:b2:05:ac and 10.136.21.12 with MAC address 00:eb:24:b2:05:ad. NLB is a form of load balancing wherein you create a Virtual IP (VIP) such as 10.136.21.10 such that any traffic to 10.136.21.10 is sent to either 10.136.21.11 or 10.136.21.12. Thus the traffic is load balanced between the two machines; and not only that, if either machine goes down nothing is affected, because the other machine can continue handling the traffic.

But now we have a problem. If we want a VIP 10.136.21.10 that should send traffic to either host, how will this work when it comes to MAC addresses? That depends on the type of NLB. There are two sorts – Unicast and Multicast.

In Unicast mode the NIC that is used for clustering on each server has its MAC address changed to a new unicast MAC address that’s the same for all hosts. Thus, for example, the NICs that hold the NLB IP address 10.136.21.10 in the scenario above will have their MAC addresses changed from 00:eb:24:b2:05:ac and 00:eb:24:b2:05:ad respectively to (say) 00:eb:24:b2:05:af. Note that this is a unicast MAC (which basically means the MAC address looks like a regular MAC address, such as one assigned to a single machine). Since a unicast MAC address can by definition only be assigned to one machine/switch port, the NLB driver on each machine cheats a bit and changes the source MAC address of outgoing frames to whatever the original NIC MAC address was. That is to say –

  • Server IP 10.136.21.11
    • Has MAC address 00:eb:24:b2:05:ac
    • Which is changed to a MAC address of 00:eb:24:b2:05:af as part of enabling Unicast NLB
    • However when traffic is sent out from this machine the MAC address is changed back to 00:eb:24:b2:05:ac
  • Same for Server 10.136.21.12

Why does this happen? This is because –

  • When a device wants to send data to the VIP address, it will try to find the MAC address using ARP. That is, it sends a broadcast over the network asking for the device with this IP address to respond. Since both servers now have the same MAC address for their NLB NIC, either server will respond with this common MAC address.
  • Now the switch receives frames for this MAC address. The switch does not have this in its CAM table, so it floods the frame to all ports – reaching both servers.
  • But why change the source MAC address of outgoing traffic? Because if outgoing frames carried the common MAC address, the switch would associate the common MAC address with that port – resulting in all future traffic to the common MAC address going to only one of the servers. By changing the outgoing frame’s MAC address back to the server’s original MAC address, the switch never gets to store the common MAC address in its CAM table, and frames for the common MAC address are always flooded to all ports.

In the context of VMware what this means is that (a) the port group to which the NLB NICs connect must allow MAC address changes and forged transmits; and (b) by default, when a VM is powered on, the port group notifies the physical switch of the VM’s MAC address – since this would expose the cluster MAC address to the switch, this notification too must be disabled. Without these changes NLB will not work in Unicast mode with VMware.
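Here’s a minimal PowerCLI sketch of those settings, assuming a standard vSwitch port group – the host and port group names are hypothetical, so adapt as needed:

# Connect to the host/vCenter first with Connect-VIServer

# Hypothetical port group used by the NLB NICs
$pg = Get-VMHost "esx01.mydomain" | Get-VirtualPortGroup -Name "NLB"

# (a) Allow MAC address changes and forged transmits on the port group
$pg | Get-SecurityPolicy | Set-SecurityPolicy -MacChanges $true -ForgedTransmits $true

# (b) Stop the port group notifying the physical switch of the VM's MAC address
$pg | Get-NicTeamingPolicy | Set-NicTeamingPolicy -NotifySwitches $false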

(This is a good post to read more about NLB).

Apart from Unicast NLB there’s also Multicast NLB. In this form the NLB NIC’s MAC address is not changed. Instead, a new multicast MAC address is assigned to the NLB NIC, in addition to the regular MAC address of the NIC. The advantage of this method is that since each host retains its existing MAC address, communication between the hosts is unaffected. However, since the new MAC address is a multicast MAC address – and switches by default ignore such addresses – some changes are needed on the switch side to get Multicast NLB working.

One thing to keep in mind is that it’s important to add a default gateway address to your NLB NIC. At work, for instance, the NLB IPv4 address was reachable within the network but not from across networks. Turns out that’s coz Windows Server 2008 onwards uses a strong host model – traffic coming in via one NIC does not go out via a different NIC, even if both are in the same subnet and the second NIC has a default gateway set. In our case I added the same default gateway to the NLB NIC too, and it was then reachable across networks.
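If you’re curious you can view (and change) this strong/weak host behaviour per interface. A sketch using the NetTCPIP cmdlets on Server 2012 and later – the interface alias is hypothetical:

# Show the strong/weak host settings for each interface
Get-NetIPInterface -AddressFamily IPv4 |
    Select-Object InterfaceAlias, WeakHostSend, WeakHostReceive

# An alternative to adding a gateway: make the NLB NIC a weak host
Set-NetIPInterface -InterfaceAlias "NLB" -WeakHostSend Enabled -WeakHostReceive Enabled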


Unable to login to vSphere because the admin@system-domain password cannot be reset


vSphere 5.1 has admin@system-domain as the default admin account. vSphere 5.5 changes that to administrator@vsphere.local. However, if you upgrade from 5.1 to 5.5 the default admin account remains admin@system-domain. Which is fine and dandy until the password for this account expires. Then you can neither reset it nor login! See below. :)

Trying to login as usual, I’m told the password has expired and needs a reset. The reset fails though, coz you can only reset passwords for the vsphere.local domain.

Missed out on taking a screenshot but if you were to try and login with administrator@vsphere.local instead you get an error that the credentials are invalid (because that account doesn’t exist!). So you are stuck!

What do you do?

The solution is to reset the admin password. When you do this vSphere automatically creates the administrator@vsphere.local account. Follow the steps in this KB article.

Now you can login with administrator@vsphere.local and the generated password.

Power cycle/ Reset an HP blade server


Was getting the following error on one of our servers. It’s from ESXi. None of the NICs were working on the server (the NICs themselves seemed fine, just that the driver wasn’t loading).


Power cycle required. 

I switched the server off and on but that didn’t help. Turns out that doesn’t actually power cycle the server (because the server still has power – doh!). What you need to do is something called an e-fuse reset, which power cycles the blade. You do this by opening an SSH session to the Onboard Administrator, finding the bay number of the blade you want to power cycle, and typing the command reset server <bay number>

Good to know!

Note: The command does not appear when you type help, but it’s there:

reset server

Invalid Arguments

RESET SERVER { <bay number> }: Reset the server bay by momentarily removing all
power.  If a double dense blade is present, a single side cannot be reset.  Only
the entire server bay can be reset.

Also, to get a list of your bays and servers use the show server list command. To do the same for interconnects use the show interconnect list command.

Using Solarwinds to monitor Windows Performance Monitor (perfmon) Counters


Had a request from our Exchange admin to setup Solarwinds alerts for some of our Exchange servers based on Performance Monitor counters.

MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length (above 200)
MSExchangeTransport Queues(_total)\Largest Delivery Queue Length (above 200)
MSExchangeTransport Queues(_total)\Messages Queued For Delivery (above 200)
MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length (above 20)
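As a sanity check you can confirm such counters exist and return data by querying one with PowerShell first (the server name here is hypothetical):

# Query one of the Exchange transport queue counters remotely
Get-Counter -ComputerName "EXCH01" -Counter '\MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length'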

Before setting up alerts I need to add them to Solarwinds first. Here’s how you do that.

First, open up the Solarwinds web console, go to Applications, and then SAM Settings.


Then go to Component Monitor Wizard.


Select Windows Performance Counter Monitor.


Notice that it says the data is collected using RPC. This means (1) the server must be monitored by Solarwinds using WMI and not SNMP – in case of the latter, switch to monitoring via WMI; and (2) RPC ports must be open between the Solarwinds server and the target server. If not, monitoring will fail.
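A quick way to check the RPC side from the Solarwinds server is to test the RPC endpoint mapper port (the server name is hypothetical):

# The RPC endpoint mapper listens on TCP 135; the actual services use dynamic high ports
Test-NetConnection -ComputerName "EXCH01" -Port 135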

Enter the name of a server you wish to target. This would be a server that contains the perfmon counters you are interested in; you use it to set up monitoring for those counters. Change to 64bit if 32bit doesn’t work.


Change the “Choose Credential” drop down according to your environment. If Solarwinds complains that it cannot find the name you type in, it’s better to click “Browse” and find the server you are interested in.

Note: The next step will fail if you have not opened the required RPC ports.

Select the counters you are interested in. First select the object you want to monitor (MSExchangeTransport Queues, in the screenshot below) and then the counters.


The next screen will list all the counters you selected and give you a chance to set warning and critical thresholds. Customize these.

 


Select where you would like these counters added – a new application monitor/monitor template, or an existing application monitor/monitor template. I am going with a new application monitor template; it’s easier to make changes to templates than to individual application monitors.


 

Choose any more nodes you would like to assign this application monitor to (am skipping this screenshot). This step is optional, as you can assign the application monitor to nodes later too.

An optional step – I also went to Manage Application Templates screen after the above steps, selected the template I created, and assigned it some tags and set a custom view.


A custom view lets you define what details are shown when anyone clicks this application monitor template on a particular node in the Solarwinds web console. You can customize the view by going to Settings (of Solarwinds) and selecting Manage Views.

Next step is to create an alert. For that you have to logon to the Solarwinds server itself, go to Alert Manager, and create a new alert (skipping screenshots for these) whose trigger condition is as follows –

Note that the type of property to monitor is “APM: Component”. This is important for the correct variables to be visible in the alert message. Also, note that I am triggering on each component (with an “any” condition) and not on the application monitor itself. This lets me get alerts for individual components; if I instead triggered on the application monitor, I would get alert emails for every component, including the ones that don’t have an issue.

Here’s the alert message:

[screenshot]

Mute Solarwinds alerts during reboots/ maintenance windows


I wanted to mute Solarwinds alerts during our patch weekends, when all servers are rebooted and our mailboxes get flooded with Solarwinds alerts. I decided to use custom properties for this purpose. Here’s what I did.

Login to the Solarwinds web console. Go to the “Settings” page, and then “Manage Custom Properties” under “Node & Group Management”.

Click “Add Custom Property”, select the default of “Nodes” from the drop down, and create something along the following lines –

[screenshot]

Select the nodes you’d like to apply this custom property to. I chose to apply it on all my Windows and VMware nodes. Set the value to “false” for now.


Now login to Orion Alerts Manager and pick an alert you’d like to mute during patch weekends. Go to its “Alert Suppression” tab and add a condition on the custom property we created earlier.


And that’s it, really!

Update: Not sure why, but the above didn’t seem to work for me. So I added the Mute_Alerts check as part of the trigger condition itself.

[screenshot]

Note: If you don’t see the custom property in Orion, close and restart it as an administrator (i.e. right click and do “Run as Administrator” even if you are already running it with an admin account). Not sure why, but until I did that the custom property didn’t get picked up. You only need to do this once; after that you can launch Orion normally.

Next time your server estate is being rebooted/undergoing maintenance, login to the Solarwinds web console and change the “Mute_Alerts” custom property to “true” for the nodes you want to mute alerts for. Below I show how I mute the alerts for all my Windows nodes.

Go to “Manage Nodes”. Group by “Vendor” and select Windows. Then select all nodes. (The checkbox to select all nodes got blanked out in the screenshot below but it’s easy to find).

[screenshot]

Then click on “Custom Property Editor” to get to the screen below.

[screenshot]

Here too select all the nodes and click “Edit multiple values”.

From the drop down, change the value for “Mute_Alerts” to “true”. Then save changes and that’s it. :)

 

Using Solarwinds to monitor Windows Services


This is similar to how I monitored performance counters with Solarwinds.

I want to monitor a bunch of AppSense services.

Similar to the performance counters, where I created an application monitor template so I could define threshold values, here I need to create an application monitor so that the service appears in the alert manager. This part is easy (and similar to what I did for the performance counters) so I’ll skip it and just put a screenshot –

[screenshot]

I created a separate application monitor template, but you could very well add it to one of the standard templates (if it’s a service that’s to be monitored on all nodes, for instance).

Now for the part where you create alerts.

Initially I thought this would be a case of creating a trigger for when the above application monitor goes down. Something like this –

[screenshot]

And create an alert message like this –

[screenshot]

With this I was hoping to get alert messages only for the services that actually went down. Instead, whenever any one service went down I’d get an alert for that service and also a message for the services that were up. Am guessing that since I was triggering on the application monitor, Solarwinds helpfully sent the status of each of its components – up or down.


The solution is to modify your trigger such that you target each component.

[screenshot]

Now I get alerts the way I want.

Hope this helps!

Removing a monitored resource from multiple nodes in Solarwinds


Had to remove a drive from being monitored on multiple servers in Solarwinds. Rather than go to each node, edit its properties, and untick the drive, I figured that if you do a search for the nodes and expand each of them, it’s possible to tick the volumes you don’t need and click “Delete”.

[screenshot]

Nice!

Create Solarwinds account limitations based on Custom Properties


I wanted to create account limitations in Solarwinds based on Custom Properties but the web console by default doesn’t give an option to do that.

Then I came across this helpful video.

The trick is to logon to your Solarwinds server and find the “Account Limitation Builder”. Then click “Add” and create a new limitation similar to this –

[screenshot]

Now the limitation you create will come in the list –

[screenshot]

Select that, and in the next screen you can choose the value you want to limit to –

[screenshot]


Solarwinds – “The WinRM client cannot process the request”


Added the Exchange 2010 Database Availability Group application monitor to a couple of our Exchange 2010 servers and got the following error –

[screenshot]

Clicking “More” gives the following –

[screenshot]

This is because Solarwinds is trying to run a PowerShell script on the remote server and the script is unable to run due to authentication errors. That’s because Solarwinds is connecting to the server using its IP address, so instead of Kerberos authentication it resorts to Negotiate authentication (which is disabled). The error message says the same, but you can verify it for yourself from the Solarwinds server too. Try the following command –

New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri http://<yourexchsvr>/PowerShell -Authentication Negotiate -Credential (get-credential)

This is what’s happening behind the scenes and as you will see it fails. Now replace “Negotiate” with “Kerberos” and it succeeds –

New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri http://<yourexchsvr>/PowerShell -Authentication Kerberos -Credential (get-credential)

So, how to fix this? Logon to the remote server and launch IIS Manager. It’s under “Administrative Tools” and may not be there by default (my server only had “Internet Information Services (IIS) 6.0 Manager”), in which case add it via Server Manager/ PowerShell –

> Import-Module ServerManager
> Add-WindowsFeature -Name RSAT-Web-Server

Then open IIS Manager, go to Sites > PowerShell and double click “Authentication”.


Select “Windows Authentication” and click “Enable”.


Now Solarwinds will work.

Solarwinds AppInsight for IIS – doing a manual install – and hopefully fixing invalid signature (error code: 16007)


AppInsight from Solarwinds is pretty cool. At least the one for Exchange is. Trying out the one for IIS now. Got it configured on a few of our servers easily but it failed on one. Got the following error –

[screenshot]

Bummer!

Manual install it is then. (Or maybe not! Read on and you’ll see a hopeful fix that worked for me).

First step in that is to install PowerShell (easy) and the IIS PowerShell snap-in. The latter can be downloaded from here. This downloads the Web Platform Installer (a.k.a. “webpi” for short), which then connects to the Internet to download the goods. In theory it should be easy; in practice the server doesn’t have connectivity to the Internet except via a proxy, so I have to feed it that information first. Go to C:\Program Files\Microsoft\Web Platform Installer, find a file called WebPlatformInstaller.exe.config, open it in Notepad or similar, and add the following lines –

<system.net>
  <defaultProxy>
    <proxy proxyaddress="http://<myproxy>:8080" bypassonlocal="true" />
  </defaultProxy>
</system.net>

This should be within the <configuration></configuration> block. Didn’t help though, same error.


Time to look at the logs. Go to %localappdata%\Microsoft\Web Platform Installer\logs\webpi for those.

From the logs it looked like the connection was going through –

DownloadManager Information: 0 : Loading product xml from: https://go.microsoft.com/?linkid=9842185
DownloadManager Information: 0 : https://go.microsoft.com/?linkid=9842185 responded with 302
DownloadManager Information: 0 : Response headers:
HTTP/1.1 302 Moved Temporarily
Cache-Control: no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Expires: -1
Location: https://www.microsoft.com/web/webpi/5.0/webproductlist.xml
Server: Microsoft-IIS/8.0
X-AspNetMvc-Version: 4.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 175
Date: Tue, 10 May 2016 12:10:12 GMT
Connection: keep-alive

But the problem was this –

DownloadManager Error: 0 : WebClient download error: System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.

If I go to the link – https://www.microsoft.com/web/webpi/5.0/webproductlist.xml – via IE on that server I get the following –

[screenshot]

However, when I visit the same link on a different server there’s no error.

Interesting. I viewed the untrusted certificate from IE on the problem server and compared it with the certificate from the non-problem server.

Certificate on the problem server: [screenshot]

Certificate on a non-problem server: [screenshot]

Comparing the two, I can see that the non-problem server has a VeriSign certificate at the root of the path, because of which there’s a chain of trust.


If I open Certificate Manager on both servers (open mmc > Add/Remove Snap-Ins > Certificates > Add > Computer account) and navigate to the “Trusted Root Certification Authorities” store, I can see that the problem server doesn’t have the VeriSign certificate in its store while the other server does.


So here’s what I did. :) I exported the certificate from the server that had it and imported it into the “Trusted Root Certification Authorities” store of the problem server. Then I closed and reopened IE and went to the link again, and bingo! the website opens without any issues. Then I tried the Web Platform Installer again and this time it loads. Bam!
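If you’d rather script the export/import than click through the MMC, here’s a rough sketch using the PKI cmdlets (Windows 8/Server 2012 and later; the certificate subject filter is just an example):

# On the good server: export the VeriSign root certificate (public part only)
$cert = Get-ChildItem Cert:\LocalMachine\Root | Where-Object Subject -like "*VeriSign*" | Select-Object -First 1
Export-Certificate -Cert $cert -FilePath C:\temp\verisign-root.cer

# On the problem server: import it into the Trusted Root store
Import-Certificate -FilePath C:\temp\verisign-root.cer -CertStoreLocation Cert:\LocalMachine\Root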

The problem though is that it can’t find the IIS PowerShell snap-in. Grr!


That sucks!

However, at this point I had an idea. The SolarWinds error message was about an invalid signature, and what do we know can cause an invalid signature? Certificate issues! So now that I had installed the required CA certificate for the Web Platform Installer, maybe that sorts out SolarWinds too? I went back and clicked “Configure Server” again and bingo! it worked this time. :)

Hope this helps someone.

Using SolarWinds to highlight servers in a pending reboot status


Had a request to use SolarWinds to highlight servers in a pending reboot status. Here’s what I did.

Sorry, this is currently broken. After implementing this I realized I need to enable PowerShell remoting on all servers for it to work, else the script just returns the result from the SolarWinds server. Will update this post after I fix it at my workplace. If you come across this post before that, all you need to do is enable PowerShell remoting across all your servers and change the script execution to “Remote Host”.

SolarWinds has a built-in application monitor called “Windows Update Monitoring”. It does a lot more than what I want, so I disabled all the components I am not interested in. (I could have also just created a new application monitor, I know; I was just lazy.)


The part I am interested in is the PowerShell Monitor component. By default it checks for the reboot required status by checking a registry key: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired. Here’s the default script –

$Error.Clear();
$regpath = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired";
$res = Test-Path $regpath;
if ($Error.Count -ne 0) 
 {
 Write-Host "Message: $($Error[0])";
 exit 1;
 }
if ($res) { 
 write-host "Message: Reboot required."
 write-host "Statistic: 1"
 exit 0
 }
else
 {
 write-host "Message: Reboot not required."
 write-host "Statistic: 0"
 exit 0
 }

Inspired by this blog post which monitors three more registry keys and also queries ConfigMgr, I replaced the default PowerShell script with the following –

$res = $false

# Component Based Servicing (component installs/uninstalls pending)
if (Get-ChildItem "HKLM:\Software\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending" -EA Ignore) { $res = $true }

# Windows Update
if (Get-Item "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired" -EA Ignore) { $res = $true }

# Pending file rename operations (e.g. files that were locked during an install)
if (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager" -Name PendingFileRenameOperations -EA Ignore) { $res = $true }

# Ask the ConfigMgr client too, if it's present
try { 
   $util = [wmiclass]"\\.\root\ccm\clientsdk:CCM_ClientUtilities"
   $status = $util.DetermineIfRebootPending()
   if (($status -ne $null) -and $status.RebootPending) { $res = $true }
} catch { }

if ($res) { 
   write-host "Message: Reboot required."
   write-host "Statistic: 1"
   exit 0
} else {
   write-host "Message: Reboot not required."
   write-host "Statistic: 0"
   exit 0
}

Then I added the application monitor to all my Windows servers. The result is that I can see the following information on every node –

[screenshot]

Following this I created alerts to send me an email whenever the status of the above component (“Machine restart status …”) went down for any node. And I also created a SolarWinds report to capture all nodes for which the above component was down.

[screenshot]

Then I assigned this to a schedule to run once a month, after our patching window, to email me a list of nodes that require reboots.

 

Find which bay an HP blade server is in


So here’s the situation. We have a bunch of HP blade enclosures. Some blade servers were moved from one enclosure to another, but the person doing the move forgot to note down the new location. I knew the iLO IP address but didn’t know which enclosure it was in. Rather than login to each enclosure OA, expand the device bays and iLO info, and find the blade I was interested in, I wrote this batch file that makes use of SSH/PLINK to quickly find the enclosure the blade was in.

@echo off
FOR %%I IN (R1E1-OA1,R1E2-OA1,R2E1-OA1,R2E2-OA1) DO (
  echo Checking %%I
  echo y | PLINK.EXE admin@%%I -pw <password> "show ebipa" | findstr "<iLO-Address>"
)

Put this in a batch file in the same folder as PLINK and run it.

Note that this does depend on SSH access being allowed to your enclosures.

Update: An alternative way if you want to use PowerShell for the looping –

"R1E1-OA1","R1E2-OA1","R2E1-OA1","R2E2-OA1"| %{ 
  $EncName = $_; 
  Write-Host -ForegroundColor Yellow "Checking $EncName"; 
  .\PLINK.EXE -ssh $EncName -l admin -pw <passwd> "show ebipa" | ?{ $_ -match "<iLO Address>" }
}

 

Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors


Had this issue at work today. Notice the disk usage data below in Solarwinds –

[screenshot]

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

[screenshot]

I searched Google for the error message “Connection timeout. Job canceled by Scheduler.” and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing to me, in that it gives three resolutions and I wasn’t sure if I was to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

[screenshot]

I tried to get the counters via PowerShell to double check and got an error as expected –

[screenshot]
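Something along these lines, if you want to try it yourself (the counter path and server name are just examples):

# Query a disk counter from the affected server; with corrupt counters this
# throws an error instead of returning samples
Get-Counter -ComputerName "PROBLEMSVR" -Counter '\LogicalDisk(*)\% Free Space'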

Ok, so a performance counter issue indeed. Since the Solarwinds KB article didn’t make much sense to me, I searched for the Event ID 3001 in the screenshot and came across a TechNet article. The solution seemed simple – open up a command prompt as an admin and run the command lodctr /R. This command rebuilds the performance counters from scratch based on current registry settings and backup INI files (that’s what the help message says). The command completed straightforwardly too.


With this the performance counters started working via PowerShell.


Event Logs still had some errors, but those were to do with the performance counters of ASP.NET and Oracle etc.


The fix for those seemed to be a bit more involved and requires rebooting the server. I decided to skip it for now, as I don’t think these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took the patient approach of waiting and making it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then waiting again for Solarwinds to poll it etc), but eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupt. Solarwinds uses counters to get the disk info etc. Due to this corruption the poller spent more time than usual when fetching info from the server. This resulted in the Application Monitor components not getting a chance to run as the poller had run out of time to poll the server. Thus the Application Monitors gave the timeout errors above. In reality the timeout was not from those components, it was from the corrupt performance counters.

Installing a new license key in KMS


KMS is something you login to once in a blue moon, and then you wonder how the heck you’re supposed to install a license key and verify that it got added correctly. So, as a reminder to myself.

To install a license key:

slmgr.vbs /ipk XXXXX-XXXXX-XXXXX-XXXXX-XXXXX

Then activate it:

slmgr.vbs /ato

If you want to check that it was added correctly:

cscript slmgr.vbs /dli all

I use cscript so that the output comes in the command prompt itself and I can scroll up and down (or put into a text file) as opposed to a GUI window which I can’t navigate.

Add-DnsServerZoneDelegation with multiple nameservers


The only reason I am creating this post is coz I Googled for the above and didn’t find any relevant hits.

I know I can use the Add-DnsServerZoneDelegation cmdlet to create a new delegated zone (basically a sub-domain of a zone hosted by our DNS server, wherein some other DNS server hosts that sub-domain and we merely delegate any requests for it to this other DNS server). But I wasn’t sure how I’d add multiple name servers. The switches give an option to pass an array of IP addresses, but that’s just an array of IP addresses for a single name server. What I wanted was an array of name servers, each with their own IP.

Anyways, turns out all I had to do was run the command once for each name server.

Add-DnsServerZoneDelegation -ComputerName myDNS_Server -ChildZoneName "subzone" -Name "parentzone.com" -NameServer "DNS01.somedomain" -IPAddress "192.168.1.1"
Add-DnsServerZoneDelegation -ComputerName myDNS_Server -ChildZoneName "subzone" -Name "parentzone.com" -NameServer "DNS02.somedomain" -IPAddress "192.168.1.2"
Add-DnsServerZoneDelegation -ComputerName myDNS_Server -ChildZoneName "subzone" -Name "parentzone.com" -NameServer "DNS03.somedomain" -IPAddress "192.168.1.3"

Above I create a delegation from my zone “parentzone.com” to 3 DNS servers “DNS0[1-3].somedomain” (also specified by their respective IPs) for the sub-domain “subzone.parentzone.com”.


Bug in Server 2016 and DNS zone delegation with CNAME records


A colleague at work mentioned a Server 2016 DNS zone delegation bug he had found. I found just one post on the Internet when I searched for this. According to my colleague, Microsoft has now confirmed it as a bug in the support call he raised.

DNS being an area of interest I wanted to replicate the issue and post it – so here goes. Hopefully it makes sense. :)

Imagine a split-DNS scenario for a zone rockylabs.zero

  • This zone is hosted externally by a DNS server (doesn’t matter what OS/ software) called data01
  • This zone is hosted internally by two DNS servers: a Server 2012R2 (called DC2012-01), and a Server 2016 (called DC2016-01). 

Now say there’s a record rakhesh.rockylabs.zero that is the same both internally and externally. As in, we want both internal and external users to get the same (external) IP address for this record. 

What you would typically do is add this record to your external DNS server and create a delegation from your two internal DNS servers, for this record, to the external DNS server. Here’s some screenshots:

The zone on my external DNS server. Notice I have an A record for rakhesh.rockylabs.zero.  

Ignore the rakhesh2.rockylabs.zero record for now. That comes in later. :)

Here’s a look at the delegation from my Server 2012R2 internal DNS server to the external DNS server for the rakhesh.rockylabs.zero record. Basically I create a delegation within the rockylabs.zero zone on the internal server, for the rakhesh domain, and point it to the external DNS server. On the external DNS server rakhesh.rockylabs.zero is defined as an A record so that will be returned as an answer when this delegation chain is followed. 

In my case both the internal DNS servers are also DCs, and the rockylabs.zero zone is AD integrated, so a similar delegation is automatically created on the other DNS server too. 

As would be expected, I am able to resolve this A record correctly from both internal DNS servers.

Now for the fun part!

Notice the rakhesh2.rockylabs.zero record on my external DNS server? Unlike rakhesh.rockylabs.zero this one is a CNAME record. This too should be common for both internal and external users. Shouldn’t be a problem really, as it should work similarly to the A record. Following the chain of delegation, when my DNS server resolves rakhesh2.rockylabs.zero to a CNAME pointing at rakhesh.com, it should automatically resolve the A record for rakhesh.com and give me its address as the answer for rakhesh2.rockylabs.zero. It works with the Server 2012R2 internal DNS server as expected –

But breaks for the 2016 internal DNS server!

And that’s it! That’s the bug basically. 

Here’s the odd bit though. If I were to query rakhesh.com (the domain to which the CNAME record points), and then try to resolve the delegated record, it works!

If I go ahead and clear the cache on that 2016 internal server and try the name resolution again, it’s broken as before.

So the issue is that the 2016 DNS server is able to follow the delegation for rakhesh2.rockylabs.zero to the external DNS server and resolve it to rakhesh.com, but it doesn’t then go ahead and lookup rakhesh.com to get its A record. But if the A record for rakhesh.com is already in its cache, it is sensible enough to return that address.

I dug a bit more into this by enabling debug logging on the 2016 server. Here’s what I found.

The 2016 server receives my query:

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF8DCA2D10 UDP Rcv 10.10.1.2       0002   Q [0001   D   NOERROR] A      (8)rakhesh2(9)rockylabs(4)zero(0)
UDP question info at 000001FF8DCA2D10
  Socket = 668
  Remote addr 10.10.1.2, port 63329
  Time Query=2087079, Queued=0, Expire=0
  Buf length = 0x0fa0 (4000)
  Msg length = 0x0029 (41)
  Message:
    XID       0x0002
    Flags     0x0100
      QR        0 (QUESTION)
      OPCODE    0 (QUERY)
      AA        0
      TC        0
      RD        1
      RA        0
      Z         0
      CD        0
      AD        0
      RCODE     0 (NOERROR)
    QCOUNT    1
    ACOUNT    0
    NSCOUNT   0
    ARCOUNT   0
    QUESTION SECTION:
    Offset = 0x000c, RR count = 0
    Name      "(8)rakhesh2(9)rockylabs(4)zero(0)"
      QTYPE   A (1)
      QCLASS  1
    ANSWER SECTION:
      empty
    AUTHORITY SECTION:
      empty
    ADDITIONAL SECTION:
      empty

It passes this on to the external server (10.10.1.11 is data01 – external DNS server where rakhesh2.rockylabs.zero is delegated to). FYI, I am truncating the output here:

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF8DCA02F0 UDP Snd 10.10.1.11      fbff   Q [0000       NOERROR] A      (8)rakhesh2(9)rockylabs(4)zero(0)
UDP question info at 000001FF8DCA02F0
  Socket = 15140
  Remote addr 10.10.1.11, port 53
...

It gets a reply with the CNAME record. So far so good. 

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF908D10B0 UDP Rcv 10.10.1.11      fbff R Q [8084 A  R  NOERROR] A      (8)rakhesh2(9)rockylabs(4)zero(0)
UDP response info at 000001FF908D10B0
  Socket = 15140
  Remote addr 10.10.1.11, port 53
  Time Query=2087079, Queued=0, Expire=0
  Buf length = 0x0fa0 (4000)
  Msg length = 0x004d (77)
  Message:
    XID       0xfbff
    Flags     0x8480
      QR        1 (RESPONSE)
      OPCODE    0 (QUERY)
      AA        1
      TC        0
      RD        0
      RA        1
      Z         0
      CD        0
      AD        0
      RCODE     0 (NOERROR)
    QCOUNT    1
    ACOUNT    1
    NSCOUNT   0
    ARCOUNT   1
    QUESTION SECTION:
    Offset = 0x000c, RR count = 0
    Name      "(8)rakhesh2(9)rockylabs(4)zero(0)"
      QTYPE   A (1)
      QCLASS  1
    ANSWER SECTION:
    Offset = 0x0029, RR count = 0
    Name      "[C00C](8)rakhesh2(9)rockylabs(4)zero(0)"
      TYPE   CNAME  (5)
      CLASS  1
      TTL    3600
      DLEN   13
      DATA   (7)rakhesh(3)com(0)
    AUTHORITY SECTION:
      empty
    ADDITIONAL SECTION:
    Offset = 0x0042, RR count = 0
    Name      "(0)"
      TYPE   OPT  (41)
      CLASS  4000
      TTL    32768
      DLEN   0
      DATA
		Buffer Size  = 4000
		Rcode Ext    = 0
		Rcode Full   = 0
		Version      = 0
		Flags        = 80 DO

Now it queries the external DNS server (data01 – 10.10.1.11) asking for the A record of rakhesh.com! That’s two wrong things: 1) why ask the external DNS server (which, as far as the internal DNS server knows, is only delegated the rakhesh2.rockylabs.zero zone and has nothing to do with rakhesh.com), and 2) why ask it for the A record instead of the NS record, so it could find the name servers for rakhesh.com and ask those for the IP address of rakhesh.com?

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF8DCA02F0 UDP Snd 10.10.1.11      d43a   Q [0000       NOERROR] A      (7)rakhesh(3)com(0)
UDP question info at 000001FF8DCA02F0
  Socket = 14836
  Remote addr 10.10.1.11, port 53
  Time Query=0, Queued=0, Expire=0
  Buf length = 0x0fa0 (4000)
  Msg length = 0x0028 (40)
  Message:
    XID       0xd43a
    Flags     0x0000
      QR        0 (QUESTION)
      OPCODE    0 (QUERY)
      AA        0
      TC        0
      RD        0
      RA        0
      Z         0
      CD        0
      AD        0
      RCODE     0 (NOERROR)
    QCOUNT    1
    ACOUNT    0
    NSCOUNT   0
    ARCOUNT   1
    QUESTION SECTION:
    Offset = 0x000c, RR count = 0
    Name      "(7)rakhesh(3)com(0)"
      QTYPE   A (1)
      QCLASS  1
    ANSWER SECTION:
      empty
    AUTHORITY SECTION:
      empty
    ADDITIONAL SECTION:
    Offset = 0x001d, RR count = 0
    Name      "(0)"
      TYPE   OPT  (41)
      CLASS  4000
      TTL    32768
      DLEN   0
      DATA
		Buffer Size  = 4000
		Rcode Ext    = 0
		Rcode Full   = 0
		Version      = 0
		Flags        = 80 DO

It’s pretty much downhill from there, coz as expected the external DNS server replies saying “doh I don’t know” and gives it a list of root servers to contact. FYI, I truncated the output:

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF907E4950 UDP Rcv 10.10.1.11      d43a R Q [8080    R  NOERROR] A      (7)rakhesh(3)com(0)
UDP response info at 000001FF907E4950
  Socket = 14836
  Remote addr 10.10.1.11, port 53
  Time Query=2087079, Queued=0, Expire=0
  Buf length = 0x0fa0 (4000)
  Msg length = 0x01d7 (471)
  Message:
    XID       0xd43a
    Flags     0x8080
      QR        1 (RESPONSE)
      OPCODE    0 (QUERY)
      AA        0
      TC        0
      RD        0
      RA        1
      Z         0
      CD        0
      AD        0
      RCODE     0 (NOERROR)
    QCOUNT    1
    ACOUNT    0
    NSCOUNT   13
    ARCOUNT   14
    QUESTION SECTION:
    Offset = 0x000c, RR count = 0
    Name      "(7)rakhesh(3)com(0)"
      QTYPE   A (1)
      QCLASS  1
    ANSWER SECTION:
      empty
    AUTHORITY SECTION:
    Offset = 0x001d, RR count = 0
    Name      "(0)"
      TYPE   NS  (2)
      CLASS  1
      TTL    3600
      DLEN   20
      DATA   (1)d(12)root-servers(3)net(0)
    Offset = 0x003c, RR count = 1
    Name      "[C01D](0)"
      TYPE   NS  (2)
      CLASS  1
      TTL    3600
      DLEN   4
      DATA   (1)e[C02A](12)root-servers(3)net(0)
    Offset = 0x004c, RR count = 2
    Name      "[C01D](0)"
      TYPE   NS  (2)
      CLASS  1
      TTL    3600
      DLEN   4
      DATA   (1)f[C02A](12)root-servers(3)net(0)
    Offset = 0x005c, RR count = 3
...

The 2016 internal DNS server now replies to the client with a fail message. As expected. 

4/25/2017 8:56:35 PM 0FC0 PACKET  000001FF8DCA2D10 UDP Snd 10.10.1.2       0002 R Q [8281   DR SERVFAIL] A      (8)rakhesh2(9)rockylabs(4)zero(0)
UDP response info at 000001FF8DCA2D10
  Socket = 668
  Remote addr 10.10.1.2, port 63329
  Time Query=2087079, Queued=2087079, Expire=2087082
  Buf length = 0x0200 (512)
  Msg length = 0x0029 (41)
  Message:
    XID       0x0002
    Flags     0x8182
      QR        1 (RESPONSE)
      OPCODE    0 (QUERY)
      AA        0
      TC        0
      RD        1
      RA        1
      Z         0
      CD        0
      AD        0
      RCODE     2 (SERVFAIL)
    QCOUNT    1
    ACOUNT    0
    NSCOUNT   0
    ARCOUNT   0
    QUESTION SECTION:
    Offset = 0x000c, RR count = 0
    Name      "(8)rakhesh2(9)rockylabs(4)zero(0)"
      QTYPE   A (1)
      QCLASS  1
    ANSWER SECTION:
      empty
    AUTHORITY SECTION:
      empty
    ADDITIONAL SECTION:
      empty

So now we know why the 2016 server is failing. 

Until this is fixed, one workaround would be to create the CNAME record directly on the internal DNS server, pointing to whatever the external record points to. That is, don’t delegate to external – just create the same record internally too. This is only needed for CNAME records; A records are fine. Here’s an example of it working with a record called rakhesh3.rockylabs.zero, where I simply made a CNAME on the internal 2016 DNS server.
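A rough sketch of that workaround via the DnsServer PowerShell module – server and target names are from the example above, adjust to suit:

# Create rakhesh3.rockylabs.zero directly on the internal 2016 DNS server,
# pointing at the same target the external record uses
Add-DnsServerResourceRecordCName -ComputerName "DC2016-01" -ZoneName "rockylabs.zero" -Name "rakhesh3" -HostNameAlias "rakhesh.com"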

That’s all for now!

Citrix breaks after removing the root zone from your DNS server?


Two years ago I had removed the root zone on our DNS servers at work. Coz who needs root zones if your DC is only answering internal queries, i.e. for zones it has. Right?

Well, that change broke our Citrix environment. :) Users could connect to our NetScaler gateway but couldn’t launch any resource after that. 

Our Citrix chaps logged a call with our vendor and they gave some bull about the DNS server not responding to TCP queries. Yours truly wasn’t looking after Citrix or NetScalers back then, so the change was quietly rolled back, as no one had any clue why it broke Citrix.

Fast forward to yesterday, I had to do the change again coz now we want our DNS servers to resolve external names too – i.e. use root hints and all that goodness! I did the change, and Citrix broke! Damn. 

But luckily now Citrix has been rolled into our team and I know way more about how Citrix works behind the scenes. Plus I keep dabbling with NetScalers, so I am not totally clueless (or so I’d like to think!). 

I went into the DNS section of the NetScaler to see what’s up. Turns out the DNS virtual server was marked as down. Odd, coz I could SSH into the NetScaler and do name lookups against that DNS virtual server (which pointed to my internal DC basically). And yes, I could do dig +notcp to force it to do UDP queries only, and nothing was broken. So why was the virtual server marked as down?!
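(If you don’t have dig handy, the same UDP-vs-TCP check can be done from a Windows box with Resolve-DnsName – the names here are hypothetical:)

# Query the DC directly; Resolve-DnsName uses UDP by default, like the monitor would
Resolve-DnsName -Name "rockylabs.zero" -Type SOA -Server "dc01.rockylabs.zero"

# Force TCP to test the vendor's "not responding to TCP queries" theory
Resolve-DnsName -Name "rockylabs.zero" -Type SOA -Server "dc01.rockylabs.zero" -TcpOnly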

I took a look at the monitor on the DNS service.

Ok, so what exactly does this monitor do? Click “Edit Monitor” – nothing odd there – click on “Special Parameters” and what do I find? 

Yup, it was set to query for the root zone. Doh! No wonder it broke. 

I have no idea why the DNS monitor was assigned to this service. By default DNS-UDP has the ping-default monitor assigned to it, while DNS-TCP has the tcp-default monitor. Am guessing that since our firewall blocks ICMP from the NetScalers to the DCs, someone decided to use the DNS monitor instead and left it at its default of querying for the root zone. When I removed the root zone that monitor failed, the DNS virtual server was marked as down, and the NetScaler could no longer resolve DNS names for the resources users were trying to connect to. Hence the STA error above. Nice, huh!

The fix is simple. Change the query in the DNS monitor to a zone your DNS servers actually host – preferably the zone your resources are in. Easy peasy. Made that change, and Citrix began working.

As might be noticeable from the tone of the post, I am quite pleased at having figured this out. Yes, I know it’s not a biggie, but it makes me happy coz I went down a logical path instead of just throwing up my hands and saying I have no idea why the DNS service is down or why the monitor is red. So I am pleased at that! :)

 

Notes on ADFS Certificates


Was trying to wrap my head around ADFS and certificates this morning. Before I close all my links I thought I should make a note of them here. Whatever I say below is more or less based on those links (plus my understanding).

There are three types of certificates in ADFS. 

The “Service communications” certificate is also referred to as the “SSL certificate” or “Server Authentication certificate”. This is the certificate of the ADFS server/service itself.

  • If there’s a farm of ADFS servers, each must have the same certificate
  • We have the private key for this certificate too, and can export it if it needs to be added to other ADFS servers in the farm.
  • The Subject Name must contain the federation service name. 
  • This is the certificate that end users will encounter when they are redirected to the ADFS page to sign-on, so this must be a public CA issued certificate. 

The “Token-signing” certificate is the crucial one –

  • This is the certificate used by the ADFS server to sign SAML tokens.
  • We have the private key for this certificate too, but it cannot be exported – there’s no option in the GUI to export the private key. What we can do is export the public key/certificate.
  • The exported public certificate can be loaded to the 3rd party provider who would be using our ADFS server for authentication.
  • The ADFS server signs tokens using this certificate (i.e. it encrypts a hash of the token with its private key). The 3rd party using the ADFS server for authentication can verify the signature via the public certificate (i.e. decrypt the hash using the public key and thus verify that it was signed by the ADFS server). This doesn’t protect the SAML tokens from being viewed (they are signed, not encrypted) but it does protect against tampering (and verifies that the ADFS server signed them).
  • This can be a self-signed certificate. 
    • By default this certificate expires 1 year after the ADFS server was set up. A new certificate is automatically generated 20 days prior to expiration, and promoted to primary 15 days prior to expiration.
    • This auto-renewal can be disabled. PowerShell cmdlet (Get-AdfsProperties).AutoCertificateRollover tells you current setting. 
    • The lifetime of the certificate can be changed to 10 years to avoid this yearly renewal. 
    • No need to worry about this on the WAP server. 

The “Token-decrypting” certificate.

  • This one’s a bit confusing to me. Mainly coz I haven’t used it in practice I think. 
  • This is the certificate used if a 3rd party wants to send us encrypted SAML tokens. 
    • Take note: 1) sending us, 2) not signed – encrypted.
  • As with “Token-signing”, we export the public part of this certificate, upload it to the 3rd party, and they can use it to encrypt tokens they send us. Since only we have the private key, no one can decrypt them en route.
  • This can be a self-signed certificate. 
    • Same renewal rules etc. as the “Token-signing” certificate. 
  • Important thing to remember (as it confused me until I got my head around it): this is not the certificate used by the 3rd party to send us signed SAML tokens. The certificate for that can be found in the Signature tab of that 3rd party’s relying party trust (see screenshot below).
  • Another thing to take note of: A different certificate is used if we want to send encrypted info to the 3rd party. This is in the encryption tab of the same screenshot. 

So – just to put it all in a table:

  • Clients accessing the ADFS server – Service communications certificate
  • ADFS server signing data and sending to a 3rd party – Token-signing certificate
  • 3rd party encrypting data and sending to the ADFS server – Token-decrypting certificate
  • 3rd party signing data and sending to the ADFS server – certificate in the Signature tab of the trust
  • ADFS server encrypting data and sending to a 3rd party – certificate in the Encryption tab of the trust
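To see all of these on an ADFS server, the ADFS PowerShell module is handy – a quick sketch:

# List the service communications, token-signing and token-decrypting certificates
Get-AdfsCertificate | Format-Table CertificateType, IsPrimary, Thumbprint

# Or just one type
Get-AdfsCertificate -CertificateType Token-Signing

# Check whether the yearly auto-rollover is on
(Get-AdfsProperties).AutoCertificateRollover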

[Aside] Useful CA/ Certificates info


Certificates, Subject Alternative Names, etc.


I had encountered this in my testlab but never bothered much coz it was just my testlab after all. But now I am dabbling with certificates at work and hit upon the same issue. 

The issue is that if I create a certificate for mymachine.fqdn but visit the machine at just mymachine, I get an error. So how can I tell the certificate that the shorter name (and any other aliases I may have) are also valid? Turns out you need the Subject Alternative Name (SAN) field for that!

You can’t add a SAN field to an existing certificate; you’ve got to create a new one. In my case I had simply requested a domain certificate from my IIS server, and that doesn’t give any option to specify the SAN.

Instructions for creating a new certificate with the SAN field are here and here. The latter has screenshots, so check that out first. In my case, at the step where I select “Web Server” I wasn’t getting “Web Server” as an option; I was only getting “Computer”. Looking into this, I realized it’s coz of a permissions difference. The “Web Server” template only has Domain Admins and Enterprise Admins in its ACLs, while the “Computer” template has Domain Computers too, with “Enroll” rights. The fix is simple – go to Manage Templates and change the ACL of “Web Server” accordingly. (You could also use ADSI Edit and edit the ACL in the Configuration section.)
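As an aside, on Windows 8/Server 2012 and later you can also request a SAN certificate from an AD CS CA entirely via PowerShell – a sketch, assuming a template named “WebServer” that you have enroll rights on:

# Request a certificate with both the FQDN and the short name as SANs
Get-Certificate -Template "WebServer" -SubjectName "CN=mymachine.fqdn" -DnsName "mymachine.fqdn","mymachine" -CertStoreLocation Cert:\LocalMachine\My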
