[LUGOS] follow-up: md array hecnosti

Uroš Golja goljau at comcom.si
Wed Oct 24 23:40:05 CEST 2007


Pozdrav,

pred kakim mesecem sem pisal na tale seznam, ker sem imel težave z nekim 
md arrayjem. Če vas zanimajo podrobnosti, poiščite po arhivu. Šlo je za 
to, da se mi je na SATA krmilniku vedno "zaskočil" en port, na katerega 
je bil zapet eden od diskov v md arrayju (RAID5). Nekateri ste bili 
mnenja, da disk na tem portu krepuje, in mi svetovali, naj ga zamenjam.

Pa ga nisem; raje sem zamenjal ves ostali hardver (matično, procesor, 
pomnilnik, grafično, napajalnik, SATA krmilnik etc). Po rošadi je 
strežnik deloval brez prej omenjenih težav ves mesec, vsak dan, 24 ur na 
dan.

Do včeraj. Strežnik je na omrežju opravljal tudi funckijo routerja (dve 
mrežni kartici, PPPoe, NAT, dnsmasq, firewall ipd), sedaj pa sem dobil 
"dedicated" router in na strežniku to funkcionalnost upokojil. Edina 
hardverska sprememba med prej in sedaj je bila, da sem iz mrežne kartice 
za WAN izpulil ethernet kabel, katerega drugi konec je bil prej zapičen 
v ADSL modem.

Prižgem, preizkusim, vse deluje kot mora. Dve uri pozneje se je začelo 
dogajati tole (se opravičujem, ampak v takih primerih se mi zdi smiselno 
priložit ves dnevnik):
Oct 24 20:03:42 floki kernel: irq 18: nobody cared (try booting with the 
"irqpoll" option)
Oct 24 20:03:42 floki kernel:  [<c014da8e>] __report_bad_irq+0x36/0x75
Oct 24 20:03:42 floki kernel:  [<c014dc86>] note_interrupt+0x1b9/0x1f7
Oct 24 20:03:42 floki kernel:  [<c014d1cf>] handle_IRQ_event+0x23/0x51
Oct 24 20:03:42 floki kernel:  [<c014e3fe>] handle_fasteoi_irq+0x85/0xa4
Oct 24 20:03:42 floki kernel:  [<c010639e>] do_IRQ+0x57/0x71
Oct 24 20:03:42 floki kernel:  [<c0135f8d>] hrtimer_start+0xe7/0xf1
Oct 24 20:03:42 floki kernel:  [<c0102c2c>] default_idle+0x0/0x55
Oct 24 20:03:42 floki kernel:  [<c010476b>] common_interrupt+0x23/0x28
Oct 24 20:03:42 floki kernel:  [<c0102c2c>] default_idle+0x0/0x55
Oct 24 20:03:42 floki kernel:  [<c0119b92>] native_safe_halt+0x2/0x3
Oct 24 20:03:42 floki kernel:  [<c0102c65>] default_idle+0x39/0x55
Oct 24 20:03:42 floki kernel:  [<c01023a6>] cpu_idle+0xab/0xcc
Oct 24 20:03:42 floki kernel:  [<c035599a>] start_kernel+0x350/0x358
Oct 24 20:03:42 floki kernel:  [<c035517b>] unknown_bootoption+0x0/0x20c
Oct 24 20:03:42 floki kernel:  =======================
Oct 24 20:03:42 floki kernel: handlers:
Oct 24 20:03:42 floki kernel: [<d08497f3>] (pdc_interrupt+0x0/0x267 
[sata_promise])
Oct 24 20:03:42 floki kernel: Disabling IRQ #18
Oct 24 22:57:42 floki kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x2 frozen
Oct 24 22:57:42 floki kernel: ata4.00: cmd 
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Oct 24 22:57:42 floki kernel:          res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 24 22:57:42 floki kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x2 frozen
Oct 24 22:57:42 floki kernel: ata3.00: cmd 
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Oct 24 22:57:42 floki kernel:          res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 24 22:57:42 floki kernel: ata2.00: exception Emask 0x10 SAct 0x0 
SErr 0x4190002 action 0x2 frozen
Oct 24 22:57:42 floki kernel: ata2.00: cmd 
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Oct 24 22:57:42 floki kernel:          res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Oct 24 22:57:43 floki kernel: ata4: soft resetting port
Oct 24 22:57:43 floki kernel: ata3: soft resetting port
Oct 24 22:57:43 floki kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:57:43 floki kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:57:43 floki kernel: ata2: soft resetting port
Oct 24 22:57:43 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:13 floki kernel: ata4.00: qc timeout (cmd 0x27)
Oct 24 22:58:13 floki kernel: ata4.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:13 floki kernel: ata4.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:13 floki kernel: ata4: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:13 floki kernel: ata3.00: qc timeout (cmd 0x27)
Oct 24 22:58:13 floki kernel: ata3.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:13 floki kernel: ata3.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:13 floki kernel: ata3: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:13 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 22:58:13 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:13 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:13 floki kernel: ata2: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:18 floki kernel: ata4: hard resetting port
Oct 24 22:58:18 floki kernel: ata3: hard resetting port
Oct 24 22:58:18 floki kernel: ata2: hard resetting port
Oct 24 22:58:18 floki kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:18 floki kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:19 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:48 floki kernel: ata4.00: qc timeout (cmd 0x27)
Oct 24 22:58:48 floki kernel: ata4.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:48 floki kernel: ata4.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:48 floki kernel: ata4.00: limiting speed to UDMA/133:PIO3
Oct 24 22:58:48 floki kernel: ata4: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:48 floki kernel: ata3.00: qc timeout (cmd 0x27)
Oct 24 22:58:48 floki kernel: ata3.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:48 floki kernel: ata3.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:48 floki kernel: ata3.00: limiting speed to UDMA/133:PIO3
Oct 24 22:58:48 floki kernel: ata3: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:49 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 22:58:49 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:58:49 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:58:49 floki kernel: ata2.00: limiting speed to UDMA/133:PIO3
Oct 24 22:58:49 floki kernel: ata2: failed to recover some devices, 
retrying in 5 secs
Oct 24 22:58:53 floki kernel: ata4: hard resetting port
Oct 24 22:58:53 floki kernel: ata3: hard resetting port
Oct 24 22:58:54 floki kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:54 floki kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:58:54 floki kernel: ata2: hard resetting port
Oct 24 22:58:55 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:59:24 floki kernel: ata4.00: qc timeout (cmd 0x27)
Oct 24 22:59:24 floki kernel: ata4.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:59:24 floki kernel: ata4.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:59:24 floki kernel: ata4.00: disabled
Oct 24 22:59:24 floki kernel: ata3.00: qc timeout (cmd 0x27)
Oct 24 22:59:24 floki kernel: ata3.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:59:24 floki kernel: ata3.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:59:24 floki kernel: ata3.00: disabled
Oct 24 22:59:24 floki kernel: ata4: EH complete
Oct 24 22:59:24 floki kernel: ata3: EH complete
Oct 24 22:59:24 floki kernel: sd 2:0:0:0: [sdb] Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Oct 24 22:59:24 floki kernel: end_request: I/O error, dev sdb, sector 
398283327
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Oct 24 22:59:24 floki kernel: end_request: I/O error, dev sdc, sector 
398283327
Oct 24 22:59:24 floki kernel: md: super_written gets error=-5, uptodate=0
Oct 24 22:59:24 floki kernel: raid5: Disk failure on sdb1, disabling 
device. Operation continuing on 2 devices
Oct 24 22:59:24 floki kernel: md: super_written gets error=-5, uptodate=0
Oct 24 22:59:24 floki kernel: raid5: Disk failure on sdc1, disabling 
device. Operation continuing on 1 devices
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] READ CAPACITY failed
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Sense not available.
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Write Protect is off
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 00 00 00
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Asking for cache data failed
Oct 24 22:59:24 floki kernel: sd 3:0:0:0: [sdc] Assuming drive cache: 
write through
Oct 24 22:59:25 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 22:59:25 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:59:25 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:59:25 floki kernel: ata2.00: disabled
Oct 24 22:59:26 floki kernel: ata2: soft resetting port
Oct 24 22:59:26 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 22:59:56 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 22:59:56 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 22:59:56 floki kernel: ata2.00: ATA-7: Maxtor 6L200S0, BANC1G10, 
max UDMA/133
Oct 24 22:59:56 floki kernel: ata2.00: 398297088 sectors, multi 0: LBA48 
NCQ (depth 0/32)
Oct 24 22:59:56 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 22:59:56 floki kernel: ata2: failed to recover some devices, 
retrying in 5 secs
Oct 24 23:00:01 floki kernel: ata2: hard resetting port
Oct 24 23:00:02 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 23:00:32 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 23:00:32 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 23:00:32 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 23:00:32 floki kernel: ata2.00: limiting speed to UDMA/133:PIO3
Oct 24 23:00:32 floki kernel: ata2: failed to recover some devices, 
retrying in 5 secs
Oct 24 23:00:37 floki kernel: ata2: hard resetting port
Oct 24 23:00:38 floki kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 24 23:01:08 floki kernel: ata2.00: qc timeout (cmd 0x27)
Oct 24 23:01:08 floki kernel: ata2.00: ata_hpa_resize 1: hpa sectors (0) 
is smaller than sectors (398297088)
Oct 24 23:01:08 floki kernel: ata2.00: failed to set xfermode 
(err_mask=0x40)
Oct 24 23:01:08 floki kernel: ata2.00: disabled
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: rejecting I/O to offline device
Oct 24 23:01:09 floki last message repeated 2 times
Oct 24 23:01:09 floki kernel: md: super_written gets error=-5, uptodate=0
Oct 24 23:01:09 floki kernel: raid5: Disk failure on sda1, disabling 
device. Operation continuing on 0 devices
Oct 24 23:01:09 floki kernel: ata2: EH complete
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0
Oct 24 23:01:09 floki kernel:  disk 0, o:0, dev:sdc1
Oct 24 23:01:09 floki kernel:  disk 1, o:0, dev:sda1
Oct 24 23:01:09 floki kernel:  disk 2, o:0, dev:sdb1
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 620
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: Aborting journal on device md0.
Oct 24 23:01:09 floki kernel: EXT3-fs error (device md0) in 
ext3_reserve_inode_write: Journal has aborted
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 0
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: EXT3-fs error (device md0) in 
ext3_dirty_inode: Journal has aborted
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 0
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: EXT3-fs error (device md0) in 
ext3_dirty_inode: Journal has aborted
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 0
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: rejecting I/O to offline device
Oct 24 23:01:09 floki last message repeated 2 times
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] READ CAPACITY failed
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Sense not available.
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: rejecting I/O to offline device
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Write Protect is off
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Mode Sense: 00 00 00 00
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: rejecting I/O to offline device
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Asking for cache data failed
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Assuming drive cache: 
write through
Oct 24 23:01:09 floki kernel: ata2.00: detaching (SCSI 1:0:0:0)
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Stopping disk
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] START_STOP FAILED
Oct 24 23:01:09 floki kernel: sd 1:0:0:0: [sda] Result: 
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Oct 24 23:01:09 floki kernel: journal commit I/O error
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 27
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 10256386
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: Buffer I/O error on device md0, logical 
block 59506690
Oct 24 23:01:09 floki kernel: lost page write due to I/O error on md0
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0
Oct 24 23:01:09 floki kernel:  disk 1, o:0, dev:sda1
Oct 24 23:01:09 floki kernel:  disk 2, o:0, dev:sdb1
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0
Oct 24 23:01:09 floki kernel:  disk 1, o:0, dev:sda1
Oct 24 23:01:09 floki kernel:  disk 2, o:0, dev:sdb1
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0
Oct 24 23:01:09 floki kernel:  disk 1, o:0, dev:sda1
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0
Oct 24 23:01:09 floki kernel:  disk 1, o:0, dev:sda1
Oct 24 23:01:09 floki kernel: RAID5 conf printout:
Oct 24 23:01:09 floki kernel:  --- rd:3 wd:0

Skrajšana verzija, kakor si jo predstavljam: od nekod se pojavi 
interrupt, ki gonilnik za SATA krmilnik noče. Krmilnik, ki dejansko tudi 
je na IRQ18, se ustavi. Ko nekdo dostopi do datotečnega sistema na 
arrayu /dev/md0, kernel ugotovi, da nobena komponenta (sda1, sdb1, sdc1) 
arrayja ni več dosegljiva, zato ga ustavi. En reboot kasneje je vse 
zopet v delujočem stanju. Sreča v nesreči je, da vse komponente arrayja 
(sda, sdb in sdc) pocrkajo hkrati, zato array ostane v konsistentnem 
stanju (ni ga treba rebuildat).

Mislim, da gre za neko obskurno hardversko težavo; niti sanja pa se mi 
ne, kako bi jo izsledil, kaj šele odpravil. Fantje, ima kdo kakšno -- pa 
čeprav obskurno -- idejo?

LP,
U.


More information about the lugos-list mailing list