This is mostly written for Google in the hopes that others may save some time.
A customer with an IBM xSeries 346 machine (product 8840AC1) was running a very old 32-bit Fedora Core 4, and we wanted to upgrade it to a modern 64-bit CentOS 5.6. I've done these upgrades a lot, mostly with Dell servers (which I know quite well), but have done a few IBMs in the past.
The 346 is a competent system, though dated, and the install went very smoothly after backing up the key data.
But the system would panic a few minutes after firing up any disk-intensive process, and it happened every time. This did not bode well.
NMI Watchdog detected LOCKUP on CPU 2
CPU 2
Modules linked in: ip_conntrack_netbios_ns xt_comment ipt_REJECT ipt_LOG xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables ext3 jbd dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 ide_cd e752x_edac tg3 edac_mc i2c_core serio_raw pcspkr sg tpm_tis cdrom tpm floppy shpchp tpm_bios dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata ips sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Pid: 512, comm: scsi_eh_0 Not tainted 2.6.18-238.9.1.el5 #1
RIP: 0010:[ ]  [ ] __delay+0x8/0x10
RSP: 0018:ffff810327357db8  EFLAGS: 00000097
RAX: 00000000a26607b3 RBX: ffff81032717dcf8 RCX: 00000000a2653b89
RDX: 00000000000000ff RSI: ffff810037fff528 RDI: 000000000033dd3e
RBP: 00000000000071bd R08: ffff8102fedd9820 R09: 0000000000000000
R10: ffff810037fff528 R11: 00000000000000d8 R12: 0000000000000001
R13: ffff810327357ea0 R14: ffff81032717d800 R15: ffff810327357e90
FS:  0000000000000000(0000) GS:ffff81010b1dce40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000083dac78 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_eh_0 (pid: 512, threadinfo ffff810327356000, task ffff8102fedd9820)
Stack:  ffffffff8807379a ffff810037fff528 ffff81032717dcf8 ffff81032717dd50
 ffffffff880760f1 000000167ad254b7 ffffffff880310d8 ffff810327189080
 ffff810327189080 0000000000002003 ffffffff880773f6 ffff810327189098
Call Trace:
 [ ] :ips:ips_send_wait+0x81/0x96
 [ ] :ips:__ips_eh_reset+0xf7/0x36e
 [ ] :scsi_mod:scsi_device_get+0x26/0x72
 [ ] :ips:ips_eh_reset+0x1b/0x2e
 [ ] :scsi_mod:scsi_try_host_reset+0x4c/0xb4
 [ ] :scsi_mod:scsi_eh_ready_devs+0x38e/0x493
 [ ] keventd_create_kthread+0x0/0xc4
 [ ] :scsi_mod:scsi_error_handler+0x323/0x4ac
 [ ] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [ ] keventd_create_kthread+0x0/0xc4
 [ ] kthread+0xfe/0x132
 [ ] child_rip+0xa/0x11
 [ ] keventd_create_kthread+0x0/0xc4
 [ ] kthread+0x0/0x132
 [ ] child_rip+0x0/0x11
Code: 29 c8 48 39 f8 72 f5 c3 41 54 83 3d d5 74 44 00 00 49 89 f4
Kernel panic - not syncing: nmi watchdog
WARNING: at kernel/panic.c:137 panic()
A complicating factor here was that we couldn't see the full panic message on the console, but the customer was fortunate to have had a null modem cable around so we could set up serial console on a nearby Windows machine. Did I mention I was working remotely, a thousand miles away?
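For reference, this is roughly what that serial-console setup looks like on the Linux side of the cable. This is a sketch, not the customer's actual config: the port (ttyS0) and speed (115200) are assumptions and must match whatever the terminal program on the Windows machine is set to.

```
# /boot/grub/grub.conf (CentOS 5 / GRUB legacy): send boot output and the
# kernel console to the first serial port as well as the local screen, so
# panic messages are captured by the remote terminal.
serial --unit=0 --speed=115200
terminal --timeout=5 serial console

title CentOS (2.6.18-238.9.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-238.9.1.el5 ro root=LABEL=/ console=tty0 console=ttyS0,115200n8
        initrd /initrd-2.6.18-238.9.1.el5.img
```

The last `console=` argument wins as the primary console, so putting `ttyS0` last keeps the full panic text flowing out the serial port even when the VGA console cuts it off.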
The key point is that the ips driver—IBM's RAID controller driver—was always present at the top of the call stack. Though a CPU lockup could be a lot of things, including heat issues, this was clearly pointing to the ServeRAID 7k component.
The fix — Updating the BIOS and the RAID controller firmware from 7.10.20 to the latest (7.12.14) seems to have resolved this entirely. The release notes don't have anything directly on point, but there are references to fixes for 64-bit systems.
I don't really know the IBM product line at all, but I can pass along the link I used to perform the update: IBM ServeRAID BIOS and Firmware Diskettes v7.12.14 for SCSI - Servers
I downloaded all five floppy images to the /tmp directory, had the customer put the first floppy in the drive, then ran
dd if=/tmp/ibm_fw1_ips_7.12.14_dos_noarch.img of=/dev/fd0 bs=9k
to write the image to the floppy (repeating for all five). There are almost certainly Windows-based tools—probably from IBM—that can do this from a PC.
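For anyone scripting this, a small sanity check before each dd is cheap insurance against a truncated download producing an unbootable update diskette. This is my own sketch, not part of the IBM package; `check_floppy_image` is a hypothetical helper name.

```shell
# Refuse to write any image that isn't exactly 1,474,560 bytes
# (1440 KB), the size of a standard 3.5" 1.44 MB floppy image.
check_floppy_image() {
    [ "$(wc -c < "$1")" -eq 1474560 ]
}

# Usage, repeated for each of the five images (swap floppies between runs):
#   check_floppy_image /tmp/ibm_fw1_ips_7.12.14_dos_noarch.img &&
#       dd if=/tmp/ibm_fw1_ips_7.12.14_dos_noarch.img of=/dev/fd0 bs=9k
```

The bs=9k in the dd line is just a transfer block size; 1,474,560 bytes is exactly 160 blocks of 9216 bytes, so the image divides evenly.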
The customer booted from floppy #1, and the upgrade went without incident. CentOS ran just fine after this.
What a relief.
NOTE — searching around the IBM site shows a Linux-based update package for this firmware upgrade, but I couldn't get it to work. Unlike other IBM updaters whose shell scripts were trivially fixed, this one was much more complicated and was asking for trouble. Use the floppies.
Update — six weeks later, the machine has been completely happy, so this firmware update was obviously the thing to do.