Archive for the 'Troubleshooting' Category


Exchange 2010 DAG quick tips

Thanks to Exchange DAG, having a mail database that can recover from a failure is no longer a dream.

Admins no longer have to rely on complicated shared-disk clusters, which avoids the headaches of troubleshooting a failed cluster, setting up a complicated system, or even buying expensive shared disks. (However, if you are running Exchange, the cost of shared disk should be negligible these days.)

Completely off topic, you should be able to buy an HP/3PAR SAN or Dell EqualLogic for less than $50K (AUD).

Now to the point

These points are for Exchange admins who run systems with fewer than 1000 mailboxes. They may fit even 5000, but I would recommend following the best practice guide if you’re an org of that size.

1 MAPI and replication traffic on separate networks?

If you’ve read TechNet or several blog posts, you will have found that MAPI traffic should be separated from replication traffic. The keyword is SHOULD; it is not required. In my setup, we’ve got 10G Ethernet and run on top of hypervisors. With fewer than 1000 users and a 10G access layer network, you can imagine I didn’t bother with dual NICs.

A single DAG network is a SUPPORTED config, as is a dual-NIC config. It all depends on the MAPI traffic, but frankly, I would MOVE users and create new MBX servers once more than 500 users are hitting a single MBX server. How big is your basket, and how many eggs do you want to keep in it? Eh?

2 Cross-site DAG?

Continuing from point 1: if you’re spread over multiple data centres and generally have a single path over the WAN (thanks to route costs or STP, most traffic goes via a single pipe, not multiple), then again dual NICs were pointless for me.

3 DAG limits

A server can be a member of only one DAG. A DAG database can be copied to up to 16 DAG members. If you have an even number of DAG members, you must have a file server as the cluster witness server.

4 The alternate witness WILL NOT take over when the main witness server is down.

Read here; it’s a great blog post from MS (why can’t they make TechNet docs more like this?) explaining how the witness fits into the grand scheme of things.

BTW, if you have an odd number of servers, the witness may not get used. Use the CLUSTER command to verify which quorum mode your cluster is set to. There are several posts you can find with Google where the cluster quorum was configured without a witness.

5 Cross Site DAG

If the majority of your servers (that’s including the witness server) are at a data centre that’s no longer reachable, your databases on the local network WILL SHUT DOWN.

E.g. DC A has two mail servers and a file share as a witness; DC B has two mail servers.

If DC B loses connectivity to DC A and the mail servers in DC B aren’t able to communicate with the witness, they will treat the situation as loss of quorum and the cluster service will dismount the mail databases.

Cross-site DAG isn’t perfect, which is why some recommend creating multiple DAGs (e.g. DAG1 for DC A’s users, DAG2 for DC B’s users).
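
The DC A / DC B example above comes down to simple vote counting. Here is a minimal sketch of that arithmetic, assuming a plain "more than half of all votes" rule where every mail server and the witness count as one vote each. The function name has_quorum is made up for illustration; this is not how the cluster service is implemented, just the maths behind it.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Majority rule: a partition keeps quorum only with MORE than half of all votes."""
    return reachable_votes > total_votes // 2


# Votes from the example: DC A = 2 mail servers + 1 witness, DC B = 2 mail servers.
TOTAL_VOTES = 2 + 1 + 2

# After the WAN link drops, each side only sees its own votes.
dc_a_keeps_quorum = has_quorum(reachable_votes=3, total_votes=TOTAL_VOTES)
dc_b_keeps_quorum = has_quorum(reachable_votes=2, total_votes=TOTAL_VOTES)

print(f"DC A keeps quorum: {dc_a_keeps_quorum}")
print(f"DC B keeps quorum: {dc_b_keeps_quorum}")
```

With 5 votes in total, DC A still sees 3 and carries on, while DC B sees only 2 and loses its databases even though both of its servers are perfectly healthy.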

Now to troubleshoot

c:\>cluster <dagname> /quorum

If you’re monitoring the server’s services (like many admins do), make sure you add the Cluster service as well. If the Cluster service crashes, the DAG mail databases will be dismounted along with it. (oh joy..)

I’ll try to add a few more later on.


XenServer Idea Dump

1 Don’t go over 95% disk usage on a XenServer Storage Repository.

Running systems will stop responding and you will not be able to start/stop servers correctly.

Make sure you MOVE the VMs to another SR if storage space is running short.

2 How to find orphaned VDIs

xe vdi-list sr-uuid=<sr id> params=uuid
This command will display the list of registered virtual disks.

e.g. xe vdi-list sr-uuid=a6811919-143b-e8f8-7767-17bd1af1e968 params=uuid

Then run ls -alh on the SR mount point:

[root@/var/run/sr-mount/a6811919-143b-e8f8-7767-17bd1af1e968]ls -alh

If there are MORE files than vdi-list entries, the extra file may be an orphaned disk. (Don’t delete it yet.)
Run xe vbd-list vdi-uuid=<disk uuid> params=all and check whether the disk has any association with a VM.
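
The comparison above is just a set difference between the file names on the SR and the UUIDs that xe knows about. A rough sketch, assuming a file-based SR where each disk is stored as <uuid>.vhd and assuming you have already captured both listings as Python lists. The function name and the sample values are placeholders for illustration, not real output.

```python
def find_orphan_candidates(sr_filenames, registered_uuids):
    """Return UUIDs of .vhd files on the SR that xe vdi-list did not report."""
    registered = {u.strip().lower() for u in registered_uuids}
    candidates = []
    for name in sr_filenames:
        if not name.lower().endswith(".vhd"):
            continue  # ignore anything that is not a VHD file
        uuid = name[:-len(".vhd")].lower()
        if uuid not in registered:
            candidates.append(uuid)
    return sorted(candidates)


# Placeholder data: two VHD files on disk, only one of them registered.
files_on_sr = ["aaaa-1111.vhd", "bbbb-2222.vhd", "filelog.txt"]
uuids_from_vdi_list = ["aaaa-1111"]

for uuid in find_orphan_candidates(files_on_sr, uuids_from_vdi_list):
    print(f"candidate orphan: {uuid} (check with xe vbd-list before deleting)")
```

Anything this prints is only a candidate. The vbd-list check is still needed before you delete a file.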

If the name-label of the VDI reads “name-label ( RW): base copy”, that disk is the original (parent) of the linked snapshot disks.

See if you can find vhd-parent in the sm-config of the other disks.

See the example below.

uuid ( RO)                    : c7bcc486-86c1-48d6-9645-8d897df72d19

name-label ( RW): Citrix Profiler 5.2 C Drive

sm-config (MRO): vhd-parent: 82a7fbe1-ffb1-445d-a80d-45e355511e2d

uuid ( RO)                    : 82a7fbe1-ffb1-445d-a80d-45e355511e2d

name-label ( RW): base copy

================ example of snapshot VHDs ================

-rw-r--r--  1   96   96 6.6G Apr 12 14:26 82a7fbe1-ffb1-445d-a80d-45e355511e2d.vhd – base image

-rw-r--r--  1   96   96  35K Apr 12 14:26 1803ec76-d3b3-4a24-a805-b7422937ea9b.vhd – DIFF DISK

-rw-r--r--  1   96   96  35K Apr 12 14:26 c7bcc486-86c1-48d6-9645-8d897df72d19.vhd – Newly Created snapshot active image
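
To see which disks hang off which base copy, you can follow the vhd-parent links. A small sketch, assuming you have collected the sm-config vhd-parent values into a dict by hand or by script. The function name parent_chain is made up; the two UUIDs are the ones from the example output above.

```python
def parent_chain(vdi_uuid, parents):
    """Follow vhd-parent links from a VDI up to its base copy.

    parents maps a VDI uuid to its vhd-parent uuid; a VDI with no entry is a root.
    """
    chain = [vdi_uuid]
    seen = {vdi_uuid}
    while parents.get(chain[-1]):
        parent = parents[chain[-1]]
        if parent in seen:
            raise ValueError("cycle in vhd-parent links")
        chain.append(parent)
        seen.add(parent)
    return chain


# vhd-parent link taken from the sm-config line in the example above.
vhd_parents = {
    "c7bcc486-86c1-48d6-9645-8d897df72d19": "82a7fbe1-ffb1-445d-a80d-45e355511e2d",
}

chain = parent_chain("c7bcc486-86c1-48d6-9645-8d897df72d19", vhd_parents)
print(" -> ".join(chain))
```

The last UUID in the chain is the base copy. A base copy that still has children is in use and is not an orphan.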


3 Disable Large Receive Offload on 10G NIC Bonding

Disable the LRO (Large Receive Offload) feature for all 10 Gigabit Ethernet NIC member interfaces of a bond.

Identify the 10 Gigabit NICs:
Open XenCenter and connect to the XenServer. Click on the XenServer and navigate to the NICs tab of the XenServer. Identify the 10 Gigabit NICs either by their speed of “10000 Mbit/s” or their Device name, such as “82599EB 10-Gigabit Network Connection”.
The Linux device name for NIC0 will be eth0, NIC1 will be eth1, and so on.

Edit the /etc/rc.local file of the XenServer and add the following text to the end of the file for each of the above identified 10 Gigabit NICs:
ethtool -K <interface> lro off

If the bond consists of eth2 and eth3, add the following two lines:

ethtool -K eth2 lro off
ethtool -K eth3 lro off

Repeat the /etc/rc.local edit above on all XenServers with 10 Gigabit NICs, and reboot the servers after modifying the file.
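
If you have a lot of hosts, it helps to make the rc.local edit repeatable so that running it twice does not add duplicate lines. A sketch of that idea, assuming you pass in the current file content as text and the list of bond member interfaces. The function name add_lro_off is made up, and it only builds the new text; writing it back to /etc/rc.local is left to you.

```python
def add_lro_off(rc_local_text, bond_members):
    """Return rc.local content with one 'ethtool -K <nic> lro off' line per NIC.

    Lines that are already present are not added again.
    """
    lines = rc_local_text.splitlines()
    for nic in bond_members:
        command = f"ethtool -K {nic} lro off"
        if command not in lines:
            lines.append(command)
    return "\n".join(lines) + "\n"


# Bond made of eth2 and eth3, as in the example above.
original = "#!/bin/sh\n"
updated = add_lro_off(original, ["eth2", "eth3"])
print(updated, end="")
```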

4 Enable PortFast on XenServer connected ports.

1. PortFast allows a switch port running Spanning Tree Protocol (STP) to go directly from blocking to forwarding mode, skipping the listening and learning states.

PortFast should only be enabled on ports connected to a single host.

The port cannot be a trunk port; it must be in access mode.

Ports used for storage should have PortFast enabled.

Note: It is important that you enable PortFast with caution, and only on ports that do not connect to multi-homed devices such as hubs or switches.

2. Disable Port Security on XenServer connected ports.

Port security prevents multiple MACs from being presented to the same port. In a virtual environment, the VMs present multiple MACs to the same port, which causes the port to shut down if you have Port Security enabled.

3. Disable Spanning Tree Protocol on XenServer connected ports.

Spanning Tree Protocol should be disabled if you are using bonded or teamed NICs in a virtual environment, to avoid failover delays when a bond switches over.

4. Disable BPDU guard on XenServer connected ports.

BPDU guard is a protection setting, part of STP, that prevents you from attaching a network device to a switch port. When you attach a network device, the port shuts down and has to be re-enabled by an administrator.

A PortFast port should never receive configuration BPDUs.

5 Considerations for IP Addressing in XenServer for Storage and Management Networks

6 XenServer 5.6 FP1 may FREEZE (no hotfix yet, Apr 2011)

1. Log in to the control domain (dom0) on the affected box and execute the command below
echo "NR_DOMAIN0_VCPUS=1" > /etc/sysconfig/unplug-vcpus
2. Reboot the server
Hosts Become Unresponsive with XenServer 5.6 on Nehalem and Westmere CPUs

6 XS snapshot notes

XenServer: Understanding Snapshots

Deleting a snapshot will not recover the disk space if the storage is connected via iSCSI/FC.

NFS will have a smaller footprint, but the main disk + delta disk will still remain even when there are no more snapshots.

You can export and import the VM to collapse the disks into one and save the space.



7 How to create a virtual router/isolated network on Xenserver



8 vcpu tuning

The following section describes the procedure to modify the default setup with some example commands.

The virtual CPU (vCPU) behaviour can be modified by altering the VCPUs-params parameter of a virtual machine like the following:

vCPU pinning is the term for mapping vCPUs of a VM to specific physical resources.

You can tune a vCPU’s pinning with the following command:

[root@xenserver ~]# xe vm-param-set uuid=<VM UUID> VCPUs-params:mask=1,3,7

The VM from the above example will then run on physical CPUs 1, 3, and 7 only.

The VCPU priority weight parameters can also be modified to grant a specific VM more CPU time than others.

[root@xenserver ~]# xe vm-param-set uuid=<VM UUID> VCPUs-params:weight=512

The VM from the above example with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a busy XenServer Host where all CPU resources are in use.

Valid weights range from 1 to 65535 and the default is 256.
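
To get a feel for what a weight buys you, here is the proportional-share arithmetic as a sketch. It assumes the simplest case: every VM is busy and competing for the same CPUs, with no caps or pinning involved. The function name cpu_shares and the VM names are made up for illustration.

```python
def cpu_shares(weights):
    """Map each VM to its fraction of contended CPU time, proportional to weight."""
    for name, weight in weights.items():
        if not 1 <= weight <= 65535:
            raise ValueError(f"weight for {name} must be between 1 and 65535")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# One VM bumped to 512, one left at the default of 256.
shares = cpu_shares({"vm-heavy": 512, "vm-default": 256})
for name, share in shares.items():
    print(f"{name}: {share:.1%} of contended CPU time")
```

On an idle host the weight makes no difference; it only matters once VMs are fighting over CPU time.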

The CPU cap optionally fixes the maximum amount of CPU a VM can use.

[root@xenserver ~]# xe vm-param-set uuid=<VM UUID> VCPUs-params:cap=80



9 Renaming a volume or changing its description on Dell EqualLogic results in connection failure.


In summary, do not rename volumes and do not change descriptions. Both values must match the XenServers’ data for the SR to operate correctly.


10 VM not starting up with Error: “Starting VM ‘Name-of-VM’ – This operation cannot be performed because the specified VDI could not be found on the storage substrate”.

A mapped disk image, including a DVD/CD, may be missing (the SR is down or the DVD image is missing). Unmount the disk, or find the disconnected SR and repair it.


11 XS 5.6 can be configured to support different STEPPINGS of CPU for XenMotion/Live Migration.

You can simply join the pool using XenCenter. It will perform CPU masking if it is possible to do so; otherwise manual configuration may be required.

12 Can’t delete NIC/BOND? – Cannot connect to server error

XenCenter WILL run the matching command on ALL the XenServers.

If any server has a problem, your command WILL fail and leave the pool in an inconsistent state.

If any member of the pool is down, do not perform maintenance tasks in XenCenter. You will cause more damage if you do. REMOVE the dead server if you have to.

13 Pool master is DOWN!

If you lose the pool master, perform pool master recovery ONLY, before running any OTHER command. Running any other xe command may cause more harm. (Remember, VMs are not affected by the pool master being down.)

  1. Select any running XenServer within the pool that will be promoted. (Each
    member server has a copy of the management database and can take control of
    the pool without issue.)
  2. From the server’s command line, issue the following command:
    xe pool-emergency-transition-to-master
  3. Once the command has completed, recover connections to the other member
    servers using the following command:
    xe pool-recover-slaves
  4. Verify that pool management has been restored by issuing a test command at
    the CLI (xe host-list)

14 I want to install a new driver

Assuming the rpm is driver.rpm:
1 COPY/BACKUP the modprobe files, EVERY FILE under (you will have to restore them if you wish to REMOVE the driver.)
on 5.6 SP2
then use RPM to install
>rpm -ivh driver.rpm

15 I want to remove a newly installed driver.

>rpm -e driver.rpm will remove the binary files; however, it does not remove the modprobe.dep file, and as a result you will find that the OLD driver will not start up.
Find the old line (e.g. bnx2x) and make sure the path is set correctly.

16 Iperf test is really slow, why?

The R610’s Broadcom 57711 10G card only achieves about 2.3Gb/sec. We opened a support ticket, but all we found out was that increasing the thread count significantly gives higher throughput (e.g. 40 threads manage about 6Gb/sec).

You don’t know your own software?

It’s a strange world indeed. My company uses software that’s written in VB6 and runs within a Citrix environment.

This software vendor decided to call me up asking for help integrating their software into ANOTHER customer’s Citrix environment.
Well, as a friendly Citrix dude, I did give him a few pointers, but isn’t it rather strange for a vendor to call one customer for help with another customer’s issue?

It’s YOUR own software, ppl. If you don’t think it runs on Citrix or don’t know how to troubleshoot software integration, LEARN how to do it, seriously.

I really get frustrated when a developer calls an infrastructure engineer (usually me) saying, “I don’t know anything about OS, network, blah blah, can you fix it?”
I have spent enough time learning C#, app debugging and networking (switching/routing) that I can help myself when integrating/debugging the issues I face.

Why is it always someone else’s problem when the problem is escalated to the developer?

This is like a trend: devs who don’t know SMTP, IIS, DBMS, firewalls, TCP/IP.

Or it must be just me working with horrible developers..



It’s not the first time I have seen STOP AB.
It seems to be an endless battle for Microsoft to patch up the STOP AB issue.

To my knowledge, there are numerous STOP AB patches (post W2k3 SP2),
and I found last week that they released a brand new patch in March 09.

yay… talking about patching servers again….

If you’re another person who’s seen STOP AB and STOP 50:
use the kb959828 patch
make sure there is no NT4 printer driver on your terminal server

(that kb959828 may be valid for another 6 months or so..)


System hung? or dead?

You gotta be kidding me, another dead server?

Well, for the past 4 weeks or more, several servers were locking up in the middle of the day or morning (luckily not at night, as I am on call some days), especially when there were several users on the server. Sadly, by the time the issue was reported to me by the help desk, I had no remote access to the server.
Using iLO doesn’t work, as I do not see the login screen, and remote procedures (such as event logs) just don’t respond. (You know what to do when you’re at this stage: reboot the server and get complaints from users + managers. In other words, SHIT happens.)

A complete LOCK-UP would be much better, and even a BSOD would give me less grief. BUT no: as RPC never times out and IMASRV.EXE never dies but somehow still responds, the data collector redirects users to the dead server. Even worse, a user gets disconnected but cannot be redirected to other “working” servers.

Yeah, 2nd blow to me (again, I curse at the air and apologise to users). Until the problem server is rebooted, or IMA finally realises (after 1-5 mins) that the server is truly DEAD, the user is STUCK in the middle of nowhere.

First things first, what do you do? Several options (other than reboot):
1 read the event log (yes, general rule of thumb, but not in this case: the event log was long dead by the time the issue was discovered)
2 gather perfdata (well, using RM, I have started to gather more metrics such as memory, threads, etc.
again, by the time the issue is reported, perfmon too is dead and the data seems all green on the RM graph. yay.. no perfmon..)
3 login! (went to the data centre to see if I could log on physically, but no.. the login screen doesn’t come up)
4 telnet? (screw security, I enabled the telnet service on all the servers and tried to log in, but no, it failed too)
5 RDP, ICA, VNC, Radmin, iLO (list continues): no REMOTE tool works, regardless.

By the time I reached step 5, more than 3 weeks had passed and I had around 10 reports (incidents), and about the same number of issues were reported but never logged (call me stupid, but I never tracked them as I thought the problem could be a one-off).

6 log in to ALL the servers and wait for the problem to hit.
Yes, this worked and I FINALLY observed a server dying in front of my eyes.
Task Manager just stopped responding, Process Explorer was hung when I flipped the screen to the problem server, Explorer was frozen too, and no new process could be started. BUT funnily enough the GUI was not frozen and I actually didn’t get kicked out (disconnected).

The problem seems to be somewhere at a LOW level that I have no idea about.

Also, I had a force-system-crash tool ready to kick in, in order to gather a crash dump.

Well, you may have guessed by now: this tool didn’t work either. It failed to crash the system and simply complained that files were not found.

7 last-minute jump… call MS.

I didn’t know what MS was going to say, but they shed some light on how to crash the server using NMI.
By then I was jumping for joy (ppl in the office were looking at me like I was some weirdo).

Finally! I can CRASH the system!!!

Well, the rest is history. I obtained a debug analysis from MS saying that a registry lock may be the cause, and applied the patch from MS KB 935926.

How annoying….