Archive for the 'Maintenance' Category


Exchange 2010 DAG quick tips

Thanks to Exchange DAGs, having a mail database that can recover from a failure is no longer a dream.

Admins no longer have to rely on complicated shared-disk clusters, avoiding the headaches of troubleshooting a failed cluster, setting up a complicated system, or buying expensive shared disks. (That said, if you are running Exchange, the cost of shared disk should be negligible these days.)

Completely off topic: you should be able to buy an HP/3PAR SAN or a Dell EqualLogic for less than $50K (AUD).

Now to the point

These points are for Exchange admins who run systems with fewer than 1000 mailboxes. They may fit even 5000, but if your org is that size I would recommend you follow the best-practice guides.

1 MAPI and replication traffic on separate networks?

If you’ve read TechNet or various blog posts, you’ll find that MAPI traffic should be separated from replication traffic. Now, the keyword is SHOULD; it is not mandatory. In my setup we’ve got 10G Ethernet and run on top of hypervisors. With fewer than 1000 users over a 10G access-layer network, you can imagine I didn’t bother with dual NICs.

A single DAG network is a SUPPORTED configuration, as is the dual-NIC configuration. It all depends on the MAPI traffic, but frankly, I would MOVE users and create new MBX servers once more than 500 users are hitting a single MBX server. How big is your basket, and how many eggs do you want to keep in it? Eh?
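For reference, you can see how your DAG networks are laid out from the Exchange Management Shell. The DAG and network names below are hypothetical (the defaults are usually along these lines):

```shell
# List the networks the DAG knows about, their subnets, and whether
# replication is enabled on each (single-network setups show one entry):
Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 | Format-List Name,Subnets,ReplicationEnabled

# If you later add a dedicated replication NIC, you would disable
# replication on the MAPI network so log shipping moves off it:
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\MapiDagNetwork -ReplicationEnabled:$false
```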

2 Cross-site DAG?

Continuing on from point 1: if you span multiple data centres and generally have a single path over the WAN (thanks to route costs or STP, most traffic goes via a single pipe, not multiple), again, dual NICs were pointless for me.

3 DAG limits

A server can be a member of only one DAG. A DAG database can be copied to up to 16 DAG members. If you have an even number of DAG members, you must have a file server as the cluster witness server.
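The witness is just a parameter on the DAG itself. A sketch with hypothetical server and directory names, from the Exchange Management Shell:

```shell
# Create the DAG with a file server as witness (the witness vote is only
# actually used when the member count is even):
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer FS01 -WitnessDirectory C:\DAG1_FSW

# Add mailbox servers to it (up to 16 members per DAG):
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX01
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX02
```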

4 The Alternate Witness WILL NOT take over when the main witness server is down.

Read here (opens in a new window); it’s a great blog post from MS (why can’t they make the TechNet docs more like this?), explaining how the witness fits into the grand scheme of things.

BTW, if you have an odd number of servers, the witness may not be used at all. Use the CLUSTER command to verify how your cluster’s quorum is configured. There are several posts you can find with Google where the cluster quorum ended up configured as node majority, without a witness.

5 Cross-site DAG

If the majority of servers (and that includes the witness server) sit at a data centre that is no longer reachable, your databases on the local network WILL SHUT DOWN.

e.g. DC A with two mail servers and a file share as the witness; DC B with two mail servers.

If DC B loses connectivity to DC A and the mail servers in DC B can’t communicate with the witness, they will treat the situation as a loss of quorum, and the cluster service will shut down the mail databases.
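The node-majority arithmetic behind that example can be sketched like this (the vote counts match the hypothetical DC A / DC B scenario above):

```shell
# Node-majority quorum sketch for the DC A / DC B example.
votes_dc_a=3   # 2 mailbox servers + 1 file share witness
votes_dc_b=2   # 2 mailbox servers
total=$((votes_dc_a + votes_dc_b))
majority=$((total / 2 + 1))
echo "total votes: $total, needed for quorum: $majority"

# DC B alone sees only its own 2 votes; 2 < 3, so its cluster service
# stops and the databases there dismount.
if [ "$votes_dc_b" -lt "$majority" ]; then
  echo "DC B loses quorum"
fi
```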

Cross-site DAG isn’t perfect; this is why some recommend creating multiple DAGs (e.g. DAG1 for DC A’s users, DAG2 for DC B’s users).

Now to troubleshoot

c:\>cluster <dagname> /quorum

If you’re monitoring the server’s services (like many admins do..), make sure you add the CLUSTER SERVICE as well. If the cluster service crashes, the DAG mail databases will be dismounted along with it. (Oh joy..)
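For your monitoring config, the service’s short name is ClusSvc; a quick check from the command line on a DAG member:

```shell
REM Current state of the cluster service:
sc query clussvc
REM Its startup configuration (should be auto-start):
sc qc clussvc
```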

I’ll try to add a few more later on.


Cleaning up user profiles

Having a large number of user profiles on the file server creates problems in the long run.
The number of cookies on the share can cause headaches of its own.
Using Citrix User Profile Manager reduces the headache, but it’s still not good enough to leave unused cookies lying around.

To clean them up from the file server, I’ve written some simple scripts.

For a normal single company

For a Citrix provider (multi-tenant)
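In case the linked scripts disappear, the core idea can be sketched in a couple of lines of batch. The share path and folder layout here are hypothetical, and this is only a sketch of the approach, not the scripts linked above — test with echo before letting it delete anything:

```shell
:: For each profile folder on the share, delete cookie files
:: that haven't been touched in 30 days.
for /D %%U in (\\fileserver\profiles$\*) do (
  if exist "%%U\Cookies" forfiles /P "%%U\Cookies" /M *.txt /D -30 /C "cmd /c del @path"
)
```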



It’s not the first time I have seen STOP AB.
It seems to be an endless battle for Microsoft to patch up the STOP AB issue.

To my knowledge, there have been numerous STOP AB patches (post W2K3 SP2),
and I found out last week that they released a brand-new patch on March 09.

Yay… talking about patching servers again….

If you’re another person who has seen STOP AB and STOP 50:
use the KB959828 patch, and
make sure there are no NT4 printer drivers on your terminal server.

(That KB959828 may be valid for another 6 months or so..)
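One quick way to look for leftover kernel-mode (NT4-style) printer drivers is to query the print environment registry key — kernel-mode drivers are the “Version-2” entries:

```shell
REM Any entries listed here are kernel-mode (Version-2) printer drivers,
REM the kind you want gone from a terminal server:
reg query "HKLM\SYSTEM\CurrentControlSet\Control\Print\Environments\Windows NT x86\Drivers\Version-2"
```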


Juniper Firewall.

We have a High Availability firewall cluster configured in our environment.
For some reason, it was configured to use 10Mbit for the DMZ interface. Most would argue that 10Mb on a DMZ where we don’t host many services (other than Citrix) is more than adequate for the needs of the environment.

Well, firstly, I didn’t know that NSRP requires its “monitored” interface settings to be changed as well.
And second, the NSRP monitored interface had to be changed MANUALLY on both firewalls.

If you get an error such as “system not in sync”, check below and make sure the monitored interfaces are identical on both.
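From the ScreenOS CLI, the commands look roughly like this (the interface names are hypothetical; run the same changes on BOTH cluster members):

```shell
# Show which interfaces NSRP is currently monitoring:
get nsrp monitor

# Remove the old interface from monitoring and add the right one
# (the list must be identical on both firewalls):
unset nsrp monitor interface ethernet3
set nsrp monitor interface ethernet1
save
```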


vmware nightmare

Firstly, I love virtualisation: ESX, Hyper-V, XenServer, VirtualBox, anything.
But when it’s not working as designed (due to whatever cause), I just have to scream out that I am utterly disappointed.

Just two days ago, we had a SAN array upgrade on the mirrored cage. The SAN is mirrored, hence we were able to run the whole system without shutting down ESX, as it fell back to the second array without issue (thanks to DataCore SANmelody).

The problem became apparent when ESX failed to log back in to the primary SAN array and only logged in to the secondary SAN array.

OK… what do we do?

1 panic
2 panic
3 panic

Well, let’s put the jokes aside: the conclusion after speaking to the vendors was to reboot the ESX server or to reset the SAN switch.

Well, we have VMotion, so shutting down ESX wasn’t a big issue (or so we thought).

Moral of the story: don’t perform VMotion when the system is unstable.

First we performed a migration on a test server without issue (four times, in fact); then we performed the migration on the mail (Exchange) server, and it failed. By this time VC had lost its connection to ESX and we were forced to connect directly to the ESX host via the VI client or SSH.

Luckily, we still had one stable ESX host, which managed to get the mail server migrated and started up OK in 40 minutes.

Aftermath: complaints from executives, and blame from the vendors that the SAN switch firmware is old and the VMware update (3.5u2) is not installed.

I guess we’ve got a lot of work to do…


sorry, its out of warranty

There are people who think they will NEVER have a problem.
There are people who think support is perfect.
There are people who think software never breaks.

I know these are bad assumptions, but usually the decision maker just takes the risk and says “I’ll take the risk and move on for now”.

OK… what exactly is the risk?
“I won’t be here when it’s broken”?
“I don’t care, I’m not using it”?
“It works great, why worry about something later?”

I’ve seen way too many software roll-outs that can potentially cause issues in the future and make end-users suffer badly. The people who make the call usually aren’t the end-users; they just make decisions, and sometimes bad ones.

Point is, I just discovered that we’re running an application designed for Win9x on a W2K3 Terminal Services environment. (Great..) Why worry? Because the vendor has no idea about the software’s faults, and the usual request has come back. Guess what… REINSTALL.

1 I have an error message that I believe would help to debug this; why can’t you tell us the meaning of the error code?
(Answer: I don’t know, reinstall.)

2 I have a backup of the file that has been restored, but it still fails; what makes you think a reinstall is the fix?
(Answer: I don’t know, reinstall.)

I’m afraid that when we run out of support options, we’ll have to inform the users that the software just no longer works, and there is no way to FIX it.

Well, well, well. Now who is going to pay for bad support and a bad app? I’m 100% sure it’s not the manager who made the decision to use this piece of S**T software..


operation timed out but you can’t do anything!

I mean… shouldn’t a program give OPTIONS when an operation fails? Or when there are previous tasks queued?
This is about VMware Virtual Infrastructure for ESX..

I asked my colleague to take a snapshot of the server and then install the .NET Framework 3.5 on a corporate web server.
Shut down the server, take the snapshot, start the server, install, reboot.

It should all have been a 5-minute task.


The snapshot failed, and we miserably spent more than an hour trying to get the stuck server to start up. I knew immediately that the VM task or the VM’s process needed to be KILLED on the ESX host, but since my colleague was more of a GUI guy than a CLI guy, it never occurred to him that we’d need to log on to ESX via the CLI and issue a kill command.

OK.. I like GUIs, I do, but when the CLI provides more options (in fact, the REQUIRED operation in this case) and the GUI doesn’t, I must say the GUI sucks.

I think VI should

1 list all queued tasks
2 list all processes that can be killed
3 provide a wrapper for every command (instead of a Cancel option that’s greyed out..)

I know.. I’m asking too much…
I’ll keep using PuTTY for now..

For others who may have suffered the same issue and think you need to manually KILL the virtual guest machine, read this.
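In case that link dies: on ESX 3.x, the rough sequence from the service console looks like this. The datastore path and VM name are hypothetical, and killing a VM process is a last resort:

```shell
# List registered VMs and their .vmx config paths:
vmware-cmd -l

# Check the stuck VM's state, then try a hard stop first:
vmware-cmd /vmfs/volumes/storage1/mailvm/mailvm.vmx getstate
vmware-cmd /vmfs/volumes/storage1/mailvm/mailvm.vmx stop hard

# If that hangs too, find the VM's process and kill it:
ps auxwww | grep mailvm.vmx
kill -9 <pid>
```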