Solaris and ATA and SCSI errors

After upgrading from snv_88 to snv_92 (Solaris 11 or Nevada b92) my home server started spewing the following errors:

Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: abort request, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: abort device, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: reset target, target=0 lun=0
Jun 25 14:23:53 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1 (ata3):
Jun 25 14:23:53 xeon timeout: reset bus, target=0 lun=0
Jun 25 14:23:54 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@1/cmdk@0,0 (Disk1):
Jun 25 14:23:54 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:23:54 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:23:54 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: abort request, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: abort device, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: reset target, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: reset bus, target=0 lun=0
Jun 25 14:24:31 xeon scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata2):
Jun 25 14:24:31 xeon timeout: early timeout, target=0 lun=0
Jun 25 14:24:31 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
Jun 25 14:24:31 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
Jun 25 14:24:31 xeon gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
Jun 25 14:24:31 xeon Error for command 'read sector' Error Level: Informational
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Sense Key: aborted command
Jun 25 14:24:31 xeon gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3

UFS and ZFS errors followed, then broken logging, fsck runs and eventually a reboot. My first thought was a failing disk, but iostat -E showed no errors, and "bad sectors" had suddenly appeared on all four disks. Four disks are unlikely to fail at the same moment, so this pointed to a software error.
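The "all disks at once" observation can be checked against the log itself. Below is a minimal sketch, not a tested Solaris tool: count_ata_timeouts is a hypothetical helper name, it assumes the two-line message layout shown in the excerpt above (a WARNING line naming the controller, then a "timeout:" line), and it assumes the messages live in /var/adm/messages. On Solaris the default awk may be the legacy one, so you may need to set AWK=nawk.

```shell
# Hypothetical helper: count "timeout:" events per ATA controller.
# Assumes each timeout line is preceded by a WARNING line ending in "(ataN):".
count_ata_timeouts() {
    logfile="${1:-/var/adm/messages}"
    "${AWK:-awk}" '
        # Remember the controller named on the most recent WARNING line.
        /WARNING:.*\(ata[0-9]+\):/ {
            ctrl = $NF                 # e.g. "(ata3):"
            gsub(/[():]/, "", ctrl)    # strip punctuation -> "ata3"
            next
        }
        # Credit every timeout line to that controller.
        / timeout: / && ctrl != "" { count[ctrl]++ }
        END { for (c in count) print c, count[c] }
    ' "$logfile" | sort
}
```

Run against the excerpt above, this should report timeouts on both ata2 and ata3. Errors spread across several controllers are more consistent with a driver or kernel problem than with one bad disk.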

And indeed, it turned out to be a bug with a not-so-obvious workaround: disable the "Intel Microcode Update" feature:

# rm -rf /platform/i86pc/ucode
# reboot
or
# mv /platform/i86pc/ucode /platform/i86pc/orig.ucode
# reboot

The other option is to disable multiprocessing, which is absolutely unacceptable. Removing the Intel microcode did help. According to the bug description (see Sources below), the bug was fixed in snv_72; however, I never saw it before snv_92.
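Since the rename variant is reversible, it can be wrapped in a pair of small functions. This is only a sketch: ucode_disable and ucode_restore are hypothetical names, the default path is the one used in the commands above, and the optional argument exists so the logic can be tried on a scratch directory first. A reboot is still needed afterwards, as above.

```shell
# Hypothetical helpers: park the microcode directory instead of deleting it.
# Default location is the one from the post; pass another root for a dry run.
ucode_disable() {
    platform="${1:-/platform/i86pc}"
    # Only move if there is something to move and no earlier backup exists.
    if [ -d "$platform/ucode" ] && [ ! -e "$platform/orig.ucode" ]; then
        mv "$platform/ucode" "$platform/orig.ucode"
    else
        echo "nothing to do in $platform" >&2
        return 1
    fi
}

ucode_restore() {
    platform="${1:-/platform/i86pc}"
    # Put the directory back only if the backup exists and the slot is free.
    if [ -d "$platform/orig.ucode" ] && [ ! -e "$platform/ucode" ]; then
        mv "$platform/orig.ucode" "$platform/ucode"
    else
        echo "nothing to restore in $platform" >&2
        return 1
    fi
}
```

Both functions refuse to overwrite an existing directory, so running one of them twice is harmless.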

Update 2008-07-07 @00:05:13: It looks like this bug is fixed in snv_93.

Update 2008-07-14 @18:33:56: Nope. The bug is not fixed, and the workaround does not work either. The system hangs and crashes three or four times a day regardless of the microcode. If you have snv_88 installed, I would say don't rush to upgrade.

Update 2008-07-18 @14:03:38: I suspect it's related to a zone with an exclusive TCP/IP stack. I shut the zone down for several days and there were no crashes. It is probably also related to X forwarding from the zone, heavy traffic or memory use. I also noticed that the speed of the interface (rge0, dedicated to the zone) has dropped from 1 Gbit/s to 100 Mbit/s since I upgraded from b88 to b92.

Update 2008-08-23 @15:11:26: Upgrading to b95 solved the ATA/SCSI problem. I also figured out that the I/OAT module crashes the system. I am not sure whether the two bugs are related, though.

Sources:
http://www.opensolaris.org/jive/thread.jspa?messageID=234154
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6586621
