I have recently run into a very strange and seemingly random issue on our Exchange 2010 server that first presented itself as a real problem on Monday. I have spent the past 2 days searching high and low, and have only found partial posts on this from google-fu-ing my face off. Nothing has been a solid eureka moment though, and I'm no closer to finding the root cause than I was on Monday night. I apologize in advance for the long post, but I feel like I need to give as much detail as I can. Since this is one of those issues where the cause doesn't have a bull's-eye on it, I'm hoping to gain some insight and advice on how I should be troubleshooting it. I will attach the event logs I found following the dismounts and restart on Monday, but none of these events/errors have returned since then.
History:
We upgraded from Exchange 2003 to 2010 back in February. The transition went smoothly without any major issues. Since it is a transition, we needed to leave the 2003 box on so that all Outlook clients could redirect to the new server once they reconnected to the network after we migrated all mailboxes. We have many remote workers, so we planned for a month or 2 of coexistence to be safe. This also went off without any issues that took me more than half an hour to resolve. Being a one-man shop here, I also had 40 new PCs coming in at the same time, and a new backup server to implement. I went ahead and got to work on those projects, all of which I finished between late May and the first week of June.
I think it's ironic, as I was just planning to do the complete removal of Exchange 2003 in a few weeks. I hadn't seen any issues at all while in coexistence with 2003/2010, so I figured a few more weeks wouldn't hurt anybody, as it had been rock solid so far. Until last Thursday.
Issue:
I first noticed something was 'different' last Thursday night. I use Thursday nights to do server patching, reboots, PC moves/installs around the office, etc. I've been doing this same routine for years, and when you do something that long you tend to get 'used' to the way things flow during the process. So, I'm moving along archiving logs and restarting boxes. I get to the old Exchange 2003 server, and as soon as I restarted it my Outlook client promptly lost its connection to Exchange 2010 and gave the usual username/password dialog.
Monday:
I was in an executive meeting with the CEO at around 4:30 pm when phone calls started flooding the CEO's office saying, "Where's Dave?! Do you know the system is down?!" We both exclaimed, "NO?!"
These are the kind of issues that freak me out. There is no rhyme or reason as to why the stores automatically dismounted themselves (or at least I haven't put all the pieces together yet to clearly see the cause). I have found some interesting event logs about it. It seemingly started with a single VSS error at around 3:30 pm Monday, and some events seem to hint at a hardware or "I/O problem," as the logs put it, but I'm not seeing anything regarding physical disk issues, and all lights are "green" in the IMM: no predictive failures, nothing. Some of the events mention possible JetDB corruption, and one says dirty shutdown. I haven't been able to run eseutil /mh yet, as I'm unsure if I should do that on a live DB. To be safe, I'm assuming no. Does anyone out there have any experience otherwise? I also see some posts mentioning DB size, and "...if its physical size minus the logical size exceed 1024GB.. the DB will dismount on a regular basis". My DBs are nowhere near 1 TB in size, and I have plenty of disk space left on this brand new server (IBM x3650 M4).
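The size rule quoted above comes down to simple arithmetic. Here is a minimal sketch of that check, taking the 1024 GB figure from the quoted text at face value. The function names and the example sizes are hypothetical placeholders, not real numbers from this server.

```python
# Sketch of the quoted rule: "physical size minus logical size exceeds 1024GB".
# The 1024 GB threshold comes from the quoted post; the sizes below are made up.

def whitespace_gb(physical_gb, logical_gb):
    """Return the gap between a DB's size on disk and its logical size, in GB."""
    return physical_gb - logical_gb

def exceeds_threshold(physical_gb, logical_gb, threshold_gb=1024):
    """True if the physical-minus-logical gap is above the quoted threshold."""
    return whitespace_gb(physical_gb, logical_gb) > threshold_gb

# Hypothetical example: a 180 GB file holding 150 GB of logical data.
print(whitespace_gb(180, 150))
print(exceeds_threshold(180, 150))
```

With databases well under 1 TB, the gap cannot get anywhere near the threshold, which is why this rule looks like an unlikely explanation here.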
So, my question is: could this possibly be related to firmware? What else could be causing this? I will know by tonight without a doubt whether this is a persistent, patterned issue. From last Thursday night to Monday at 4:30 pm is just under 96 hours, and I've been racking my brain all week trying to get to the bottom of it before the next '96 hour mark' comes to pass. I have seen a server in the past where we could literally predict when the next crash was going to happen, to the hour, and that was a definite firmware issue/bug (but on a much older IBM BladeCenter S). Again, there were no warnings/events generated by the server before each crash. For some reason, I feel like this might happen again at any moment, and it's haunting me every minute of every day! lol.. /facepalm
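To make the timing concrete, here is a rough sketch of how the gap between the two events, and the next window if the pattern repeats, could be worked out. The calendar dates and the Thursday time are placeholders chosen for illustration, since only "Thursday night" and "Monday around 4:30 pm" are known; the helper names are hypothetical too.

```python
from datetime import datetime

# Placeholder timestamps (assumed): any Thursday evening and the following
# Monday at 4:30 pm would do. These are not the real dates of the incidents.
first_event = datetime(2024, 1, 4, 18, 0)    # a Thursday, 6:00 pm (assumed time)
second_event = datetime(2024, 1, 8, 16, 30)  # the following Monday, 4:30 pm

def hours_between(start, end):
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600

def next_expected(last_event, interval):
    """Project the next occurrence, assuming the same interval repeats."""
    return last_event + interval

interval = second_event - first_event
print(hours_between(first_event, second_event))
print(next_expected(second_event, interval))
```

If the stores dismount again close to the projected time, that would point toward something periodic rather than random.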
I've done a few Exchange migrations/setups before, and have been able to manage and maintain them so far, but I am definitely not a 'pro' when it comes to the down and dirty troubleshooting of not-so-apparent problems in Exchange. This is a first for me! Any help is greatly appreciated!
I'm sure there will be questions, and I will try to check this post as often as I can today.
I really appreciate any advice/guidance on what you guys think this could be, and how I should move ahead in determining what the heck happened!
Respectfully,
Dave