I was working the other day to rebuild my 32 bit Linux cluster in preparation for doing some tests between OEM and PAO for some internal training. I planned to use raw devices for the CRS cluster configuration and voting files, and ASM for the rest of the shared files. The 32 bit cluster had been up and operating fine, and once I had completed the testing I was doing, I dropped the database, turned off ASM and retasked the drives to build a 64 bit cluster.
Now comes this week and I want to rebuild the 32 bit cluster. I reassigned the drives to the 32 bit zone and re-partitioned them to clear any flags set during the (unsuccessful) attempt at a 64 bit cluster using RedHat 5.0. I then attempted to restart CRS. Well, nothing worked, so I reinstalled CRS from scratch. I could get one side to see the raws, but then the other couldn’t. After battling the raws for a day or so and filing a Sev 2 SR (still no response, by the way), I decided to chuck it all and use OCFS2. This would also allow me to retask several drives back into the ASM array. I set up the OCFS2 shared drives (2 of them) with no problems other than the normal first-time issues, which were resolved within a day. I then tried to build a database.
First I checked the headers of the various drives and mapped drives SDA-SDS (skipping SDM and SDO, as these were my OCFS2 drives) as ASM devices by marking the headers using /etc/init.d/oracleasm. Those of you with rudimentary math skills should have an alarm going off right about now. As a hint, remember that I only had 18 drives in my disk array. Next, I ran the DBCA and created an ASM RAC instance using the ASM-marked drives in 3 disk groups: DATA, INDEX and RECOVERY.
Now I attempted to create a RAC instance. As soon as I started I realized something was amiss. The DBCA splash screen jumped over the choice screen for RAC versus normal databases. Thinking perhaps Oracle had changed the order of screens, I went ahead and followed through, getting all the way to the last screen, and still no RAC choice. After several unsuccessful attempts at troubleshooting I decided to just reload the database software, so I deinstalled using the OUI and then cleaned out the residual files, followed by a reboot.
Nothing is as heart-wrenching to a Linux SA as expecting a normal startup and getting the “grub:>” prompt instead. Getting the “grub:>” bootloader prompt means that for some reason the system recognizes that it should be a Linux system but can’t mount the filesystem it needs to make it so. Searching through my mounds of CDs I found the first RedHat 4.0 install disk and rebooted into Linux Rescue mode. Sure enough, it couldn’t mount the filesystem that contained the boot files for Linux. Looking at what it had managed to load, I noted that the /boot directory was empty. My first thought was that maybe I had run the “rm -rf *” command in the wrong directory. I used scp to restore the files from my second 32 bit node and rebooted: still a “grub:>” prompt. Back into Linux Rescue, and still it complained that it couldn’t mount the filesystem.
Then the alarm bells started going off as the little voice in the back of my head got louder and louder. I had corrupted the header of the boot filesystem by marking it as an ASM disk. Counting devices from SDA to SDR yields 18; SDS makes it 19, and it was actually the first internal system drive, which I had marked as DISK8 for use with ASM…oops.
I felt a cold chill. How to recover the partition I had corrupted? Looking at it under Linux Rescue all the files were still there, but it wouldn’t mount. First, I went into fdisk and dropped and recreated the first primary partition. No go. It still wouldn’t boot and the files were still there (thank goodness). But what now?
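In hindsight, the counting mistake is easy to check mechanically before marking anything for ASM. The snippet below is only a sketch of that arithmetic: outside_array is a made-up helper name, and it assumes the array's drives are enumerated first (sda, sdb and so on) with any internal drives after them, which is how things happened to line up on this box and may not hold on yours.

```shell
#!/bin/sh
# Sketch only: assumes the shared array's drives are enumerated first
# (sda, sdb, ...) and that internal drives come after them.

# outside_array ARRAY_SIZE LETTER...
# Prints every sdX name whose position in the list exceeds ARRAY_SIZE.
outside_array() {
  array_size=$1
  shift
  index=0
  for letter in "$@"; do
    index=$((index + 1))
    if [ "$index" -gt "$array_size" ]; then
      echo "sd$letter"
    fi
  done
}

# 18 drives in the array, device letters a through s as in the story.
outside_array 18 a b c d e f g h i j k l m n o p q r s
```

With 18 array drives and letters a through s, the only name printed is sds, the 19th device. It is a sanity check on the count, not a substitute for looking at what is actually on each device.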
Thank goodness for Google. I found the solution at:
First, I tried e2fsck with no options against the SDS volume:
# e2fsck /dev/sds1
It complained about a bad superblock, so I used mke2fs with the “-n” argument, which only reports what it would do without actually creating a filesystem, to list the alternate superblocks:
# mke2fs -n /dev/sds1
Armed with a list of alternative superblocks I then used the e2fsck command specifying it use an alternative superblock:
# e2fsck -b 32788 /dev/sds1
And voilà! It repaired the various things wrong with my filesystem. I was then able to reboot and everything came up OK. Let me tell you, I was not looking forward to reloading RedHat 4.0, then downloading several years of patches and applying them, and re-downloading and installing Oracle11g from scratch, so I was glad it worked.
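For anyone who ends up in the same spot, here is a rough sketch of how the superblock hunt could be made a little less manual. The helper names backup_superblocks and suggest_fsck are my own inventions, and the parsing assumes the mke2fs -n output has a "Superblock backups" line followed by comma-separated block numbers that end at a blank line or the end of the output. It deliberately prints the candidate e2fsck commands rather than running them, so nothing touches the disk until you decide to.

```shell
#!/bin/sh
# Sketch only: backup_superblocks and suggest_fsck are hypothetical helpers.

# Read `mke2fs -n` output on stdin and print one backup superblock per line.
backup_superblocks() {
  awk '
    /Superblock backups/ { found = 1; next }   # header line; numbers follow
    found && NF == 0     { exit }              # blank line ends the list
    found                { gsub(/,/, " "); for (i = 1; i <= NF; i++) print $i }
  '
}

# Print (do not run) one e2fsck command per backup superblock for DEVICE.
suggest_fsck() {
  device=$1
  while read -r block; do
    echo "e2fsck -b $block $device"
  done
}

# Intended use, as root, on an unmounted partition:
#   mke2fs -n /dev/sds1 | backup_superblocks | suggest_fsck /dev/sds1
```

Two cautions. Leave off the -n and mke2fs really will build a new filesystem over the old one. Also, the backup locations depend on the block size, so the list is only right if mke2fs -n picks the same parameters the filesystem was originally created with.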
Maybe next week I will be able to do my testing before I run off to RMOUG. If you are going to RMOUG, come by my talk on “The New Tuning Universe of Oracle 11g”. See you there!