Had a major issue with one of our Open Directory replicas.
It had always been somewhat problematic, even though it was getting updates from the Open Directory master. After a reboot, the server came back up with the Kerberos service not running, and I could not access users in Workgroup Manager or connect to the Kerberos realm at all. This affected only one specific replica; another Open Directory replica was fine. Unfortunately, the broken one acted as the password server for a major network service, so we were offline until I fixed it.
The typical fix for a misbehaving Open Directory replica seems to be to demote it to standalone, which should clear out whatever has gone wrong, and then promote it back into its replica role. But Apple’s Server Admin GUI seems to be pretty fickle about this. Unless there is a working connection between the replica and the master, Server Admin doesn’t really make the switch. It is just like the Active Directory world: when you want to demote or remove a server, you basically need to check it out of the network, BUT you need a viable connection to do so. If you do not have that, you are SOL.
I was pretty sure that nuking the replica database was in order, but the demotion to Standalone through Server Admin was NOT happening. I even disabled the service entirely, but when I turned it back on through Server Admin, nothing had changed. I have seen issues before with how Server Admin reports Open Directory status from server to server; from what I have seen, it often provides incorrect or incomplete information about what is happening with the Open Directory / LDAP server. With a server that was dishing out passwords to critical network services, something had to happen quickly.
Apple Support, when you contact them, will try to walk you down a road that focuses on DNS. Each time I have talked with them, they have really focused on hostname resolution. I suppose they get a lot of calls from people who have broken something in DNS, but that was NOT my issue. Of course, you need to make sure your DNS / hostname resolution is working and your network is viable. We were way past that. After a little round and round, we finally got to the main issue: something had freaked out with Kerberos, and we could either try to fix that or rebuild the replica. I opted for the latter, and since Server Admin on the replica wasn’t helping me, we had to do something else. Here is what we did.
1. Quit Server Admin on the damaged replica server
2. Just in case, run an Archive backup of the Open Directory master through Server Admin on the master server, and confirm that the backup completed.
3. Open a terminal on the replica server
4. In the terminal on the messed-up replica server, enter:
$ sudo slapconfig -destroyldapserver
This does exactly what it says. Nothing nice about it. After a few seconds, it nukes the LDAP server on that machine.
5. Reboot the replica server.
6. After the reboot, go into Server Admin on the replica. My Open Directory service was not active (grey icon rather than green) and was defaulting to ‘Standalone’ mode, which made it clear that the destroy had worked.
7. From Server Admin, promote the server back to an Open Directory replica of the master.
8. This put us back in business.
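For what it is worth, the terminal part of this (steps 4 and 5) can be wrapped in a small script with a dry-run switch, so you can see exactly what will be executed before committing to it. This is only a sketch under my own assumptions: the `run` and `rebuild_replica` helpers and the `DRY_RUN` variable are names I made up for illustration, and the only server command in it is the `slapconfig -destroyldapserver` call from step 4. Everything else (the backup and the re-promotion) still happens by hand in Server Admin.

```shell
#!/bin/sh
# Sketch of steps 4 and 5 above. DRY_RUN defaults to 1, which only prints
# the commands; set DRY_RUN=0 on the damaged replica to really run them.
# DRY_RUN, run and rebuild_replica are illustrative names, not Apple tools.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Always show the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

rebuild_replica() {
    # Step 4: destroy the local LDAP server (destructive, replica only).
    run sudo slapconfig -destroyldapserver || return 1
    # Step 5: reboot so the machine comes back up in Standalone mode.
    run sudo reboot
}
```

Calling `rebuild_replica` with the default settings just echoes the two commands. Only once the master backup from step 2 is confirmed would you set `DRY_RUN=0` and call it for real, and only on the replica.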
Again, your experience may differ, but I opted to destroy the replica, since there was nothing of value on it, and just rebuild it from scratch off the master, which was operating fine. Since another replica was doing fine as well, it was clear something was up with this specific server and not the overall architecture or permissions. Since Server Admin was not helping at all on the damaged replica, it was time, in my estimation, to go ‘terminal’ with it.
After promotion back to replica, I got my password server back and users were authenticating again. This sort of thing may not work for you, so don’t try it unless you are SOL like I was. Go through Apple Support first; there might be an easier fix for you.
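Since Apple Support will start with hostname resolution anyway, it may save a round of the call to rule that out yourself first. Below is a rough sketch of a forward/reverse lookup check. It assumes the standard `host` tool is installed, and the function names (`normalize`, `names_match`, `check_host`) and the hostname in the usage line are hypothetical, made up for this example. As far as I know, Mac OS X Server also has `sudo changeip -checkhostname` for a similar check, so treat this as a generic illustration and not Apple's procedure.

```shell
#!/bin/sh
# Normalize a hostname: lowercase it and drop one trailing dot, if present.
normalize() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Succeed only if the two names are identical after normalizing.
names_match() {
    [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

# check_host NAME: resolve NAME to an address, resolve that address back
# to a name, and compare the result with NAME. Needs the `host` tool.
check_host() {
    addr="$(host "$1" | awk '/has address/ {print $NF; exit}')"
    if [ -z "$addr" ]; then
        echo "no address found for $1"
        return 1
    fi
    back="$(host "$addr" | awk '/domain name pointer/ {print $NF; exit}')"
    if names_match "$1" "$back"; then
        echo "ok: $1 <-> $addr"
    else
        echo "mismatch: $1 -> $addr -> ${back:-nothing}"
        return 1
    fi
}
```

On both the master and the replica, you would run something like `check_host replica.example.com` (substituting your real server names). If either side reports a mismatch, that is worth fixing before you touch Open Directory at all.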