Intermittent Problem, Likely Disk IO Related – Mptscsih: Ioc0: Attempting Task Abort!


NOTE: this is happening on CentOS 6 x86_64 (kernel 2.6.32-504.3.3.el6.x86_64), not CentOS 5.

Dell PowerEdge 2970, Seagate SATA drive, non-raid.

This server has been dying at random, leaving nothing in the logs.

I had a tail -f running over SSH for a week when this finally showed up:

Feb 8 00:10:21 thirteen-230 kernel: mptscsih: ioc0: attempting task abort! (sc


  • Here is a console picture.

    http://i.imgur.com/ZYHlB82.jpg

    # smartctl -a /dev/sda
    smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-504.3.3.el6.x86_64] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
    Device Model: ST1500DM003-9YN16G
    Serial Number: W24153R0
    LU WWN Device Id: 5 000c50 05d03cc1d
    Firmware Version: CC82
    User Capacity: 1,500,301,910,016 bytes [1.50 TB]
    Sector Sizes: 512 bytes logical, 4096 bytes physical
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: 8
    ATA Standard is: ATA-8-ACS revision 4
    Local Time is: Sat Feb 7 23:41:00 2015 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline data collection: ( 600) seconds.
    Offline data collection capabilities: (0x73) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    No Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine recommended polling time: ( 1) minutes.
    Extended self-test routine recommended polling time: ( 194) minutes.
    Conveyance self-test routine recommended polling time: ( 2) minutes.
    SCT capabilities: (0x3085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f 118   099   006    Pre-fail Always  -           181943016
      3 Spin_Up_Time            0x0003 092   092   000    Pre-fail Always  -           0
      4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           17
      5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
      7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           39599363
      9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           821
     10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
     12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           17
    183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
    184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
    187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
    188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
    189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
    190 Airflow_Temperature_Cel 0x0022 067   062   045    Old_age  Always  -           33 (Min/Max 30/33)
    191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
    192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           16
    193 Load_Cycle_Count        0x0032 098   098   000    Old_age  Always  -           4551
    194 Temperature_Celsius     0x0022 033   040   000    Old_age  Always  -           33 (0 21 0 0 0)
    197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
    198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
    199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
    240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           267112606073648
    241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           2764453802303
    242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           3442873711291

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    sk-abort-success-rv-2002-causes-30-seconds-freezing
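
For anyone who wants to check a report like this without reading every row, here is a minimal Python sketch that parses the attribute table from saved `smartctl -a` output and lists the rows worth a second look. The `WATCHED` set and the helper names are assumptions made for this sketch, not something taken from the thread; the only rule borrowed from smartmontools is that an attribute counts as failed when its normalized VALUE is at or below a non-zero THRESH.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


# Attributes whose non-zero raw value usually deserves attention.
# This selection is an assumption of the sketch, not a list from the thread.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(text: str) -> List[SmartAttr]:
    """Extract attribute rows (ID NAME FLAG VALUE WORST THRESH ... RAW)."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns, a numeric ID and a hex FLAG.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attrs.append(SmartAttr(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            raw=" ".join(fields[9:]),
        ))
    return attrs


def suspicious(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Rows that are at/below threshold, or watched rows with a non-zero raw count."""
    flagged = []
    for a in attrs:
        nonzero_raw = a.name in WATCHED and a.raw.split()[0] != "0"
        at_or_below = a.thresh > 0 and a.value <= a.thresh
        if nonzero_raw or at_or_below:
            flagged.append(a)
    return flagged


# A few rows copied from the report above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 181943016
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0022 067 062 045 Old_age Always - 33 (Min/Max 30/33)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

if __name__ == "__main__":
    for attr in suspicious(parse_attributes(SAMPLE)):
        print(attr.attr_id, attr.name, attr.raw)
```

On the rows posted here nothing is flagged, which matches the "PASSED" health result: the large raw values for Raw_Read_Error_Rate and Seek_Error_Rate are left out of the watch list on purpose, since on Seagate drives those raw numbers are not simple error counts.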

  • Thanks to netconsole, I have the panic to post:

    Feb 16 06:06:56 BUG: soft lockup - CPU#0 stuck for 67s! [ksmd:88]
    Feb 16 06:06:56 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 CPU 0
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 Pid: 88, comm: ksmd Not tainted 2.6.32-504.8.1.el6.CentOS.plus.x86_64 #1 Dell Inc. PowerEdge 2970/0JKN8W
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 RIP: 0010:[] [] __bitmap_empty+0x41/0x90
    Feb 16 06:06:56 RSP: 0018:ffff88021831dcb0 EFLAGS: 00000202
    Feb 16 06:06:56 RAX: 0000000000000000 RBX: ffff88021831dcb0 RCX: 0000000000000010
    Feb 16 06:06:56 RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffff81e2f198
    Feb 16 06:06:56 RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
    Feb 16 06:06:56 R10: ffffea0006679c20 R11: 0000000000000000 R12: 0000000000000000
    Feb 16 06:06:56 R13: ffff8801c1b8f650 R14: 0000000198152467 R15: ffffffffa03af44a
    Feb 16 06:06:56 FS: 00007fc4756b09a0(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
    Feb 16 06:06:56 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    Feb 16 06:06:56 CR2: 000000c641faeff0 CR3: 0000000001a85000 CR4: 00000000000007f0
    Feb 16 06:06:56 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Feb 16 06:06:56 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Feb 16 06:06:56 Process ksmd (pid: 88, threadinfo ffff88021831c000, task ffff880218310040)
    Feb 16 06:06:56 Stack:
    Feb 16 06:06:56 ffff88021831dd00 ffffffff81052268 00007f30249b8000 ffffffff81e2f180
    Feb 16 06:06:56 <d> 8000000198152025 ffff880219ade700 00007f30249b8000 ffff880219ade9c8
    Feb 16 06:06:56 <d> ffffea0006679c20 ffff880219e57ed0 ffff88021831dd30 ffffffff810522e6
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 Call Trace:
    Feb 16 06:06:56 [] ? flush_tlb_others_ipi+0x128/0x130
    Feb 16 06:06:56 [] ? native_flush_tlb_others+0x76/0x90
    Feb 16 06:06:56 [] ? flush_tlb_page+0x5e/0xb0
    Feb 16 06:06:56 [] ? try_to_merge_with_ksm_page+0x532/0x660
    Feb 16 06:06:56 [] ? ksm_scan_thread+0xeb4/0x1120
    Feb 16 06:06:56 [] ? autoremove_wake_function+0x0/0x40
    Feb 16 06:06:56 [] ? ksm_scan_thread+0x0/0x1120
    Feb 16 06:06:56 [] ? kthread+0x9e/0xc0
    Feb 16 06:06:56 [] ? child_rip+0xa/0x20
    Feb 16 06:06:56 [] ? kthread+0x0/0xc0
    Feb 16 06:06:56 [] ? child_rip+0x0/0x20
    Feb 16 06:06:56 Code: c0 7e 24 48 83 3f 00 48 89 f8 74 13 eb 5c 0f 1f 40 00 48 8b 48 08 48 83 c0 08 48 85 c9 75 4b 83 c2 01 41 39 d0 7f eb 40 f6 c6 3f <b8> 01 00 00 00 75 08 c9 c3 66 0f 1f 44 00 00 89 f0 48 63 d2 c1
    Feb 16 06:06:56 192.168.13.230
    Feb 16 06:06:56 Call Trace:
    Feb 16 06:06:56 [] ? flush_tlb_others_ipi+0x128/0x130
    Feb 16 06:06:56 [] ? native_flush_tlb_others+0x76/0x90
    Feb 16 06:06:56 [] ? flush_tlb_page+0x5e/0xb0
    Feb 16 06:06:56 [] ? try_to_merge_with_ksm_page+0x532/0x660
    Feb 16 06:06:56 [] ? ksm_scan_thread+0xeb4/0x1120
    Feb 16 06:06:56 [] ? autoremove_wake_function+0x0/0x40
    Feb 16 06:06:56 [] ? ksm_scan_thread+0x0/0x1120
    Feb 16 06:06:56 [] ? kthread+0x9e/0xc0
    Feb 16 06:06:56 [] ? child_rip+0xa/0x20
    Feb 16 06:06:56 [] ? kthread+0x0/0xc0
    Feb 16 06:06:56 [] ? child_rip+0x0/0x20
    Feb 16 06:07:01 Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
    Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.CentOS.plus.x86_64 #1
    Feb 16 06:07:01 Call Trace:
    Feb 16 06:07:01
    Feb 16 06:07:01 [] ? panic+0xa7/0x16f
    Feb 16 06:07:01 [] ? sched_clock+0x9/0x10
    Feb 16 06:07:01 [] ? watchdog_overflow_callback+0xcd/0xd0
    Feb 16 06:07:01 [] ? __perf_event_overflow+0xa7/0x240
    Feb 16 06:07:01 [] ? perf_event_update_userpage+0x24/0x110
    Feb 16 06:07:01 [] ? perf_event_overflow+0x14/0x20
    Feb 16 06:07:01 [] ? x86_pmu_handle_irq+0x1eb/0x250
    Feb 16 06:07:01 [] ? perf_event_nmi_handler+0x39/0xb0
    Feb 16 06:07:01 [] ? notifier_call_chain+0x55/0x80
    Feb 16 06:07:01 [] ? atomic_notifier_call_chain+0x1a/0x20
    Feb 16 06:07:01 [] ? notify_die+0x2e/0x30
    Feb 16 06:07:01 [] ? do_nmi+0x1bb/0x340
    Feb 16 06:07:01 [] ? nmi+0x20/0x30
    Feb 16 06:07:01 [] ? _spin_lock+0x1e/0x30
    Feb 16 06:07:01 < >
    Feb 16 06:07:01 [] ? handle_pte_fault+0x833/0xb00
    Feb 16 06:07:01 [] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
    Feb 16 06:07:01 [] ? handle_mm_fault+0x22a/0x300
    Feb 16 06:07:01 [] ? __do_page_fault+0x138/0x480
    Feb 16 06:07:01 [] ? update_curr+0xe1/0x1f0
    Feb 16 06:07:01 [] ? perf_event_task_sched_out+0x33/0x70
    Feb 16 06:07:01 [] ? invalidate_interrupt0+0xe/0x20
    Feb 16 06:07:01 [] ? finish_task_switch+0x4c/0xf0
    Feb 16 06:07:01 [] ? do_page_fault+0x3e/0xa0
    Feb 16 06:07:01 [] ? page_fault+0x25/0x30
    Feb 16 06:07:01 [] ? copy_user_generic_string+0x32/0x40
    Feb 16 06:07:01 [] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
    Feb 16 06:07:01 [] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
    Feb 16 06:07:01 [] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
    Feb 16 06:07:01 [] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
    Feb 16 06:07:01 [] ? futex_wake+0x93/0x150
    Feb 16 06:07:01 [] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
    Feb 16 06:07:01 [] ? perf_event_task_sched_out+0x33/0x70
    Feb 16 06:07:01 [] ? apic_timer_interrupt+0xe/0x20
    Feb 16 06:07:01 [] ? vfs_ioctl+0x22/0xa0
    Feb 16 06:07:01 [] ? do_vfs_ioctl+0x3aa/0x580
    Feb 16 06:07:01 [] ? sys_ioctl+0x81/0xa0
    Feb 16 06:07:01 [] ? __audit_syscall_exit+0x25e/0x290
    Feb 16 06:07:01 [] ? system_call_fastpath+0x16/0x1b
    Feb 16 06:07:01 drm_kms_helper: panic occurred, switching back to text console
    Feb 16 06:07:01 BUG: scheduling while atomic: qemu-kvm/1950/0x14010000
    Feb 16 06:07:01 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
    Feb 16 06:07:01 192.168.13.230
    Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.CentOS.plus.x86_64 #1
    Feb 16 06:07:01 Call Trace:
    Feb 16 06:07:01
    Feb 16 06:07:01 [] ? __schedule_bug+0x66/0x70
    Feb 16 06:07:01 [] ? thread_return+0x6ac/0x7d0
    Feb 16 06:07:01 [] ? write_msg+0xfd/0x110 [netconsole]
    Feb 16 06:07:01 [] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
    Feb 16 06:07:01 [] ? __cond_resched+0x2a/0x40
    Feb 16 06:07:01 [] ? _cond_resched+0x30/0x40
    Feb 16 06:07:01 [] ? __kmalloc+0x138/0x230
    Feb 16 06:07:01 [] ? __module_text_address+0x12/0x60
    Feb 16 06:07:01 [] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
    Feb 16 06:07:01 [] ? r100_mm_wreg+0x67/0x90 [radeon]
    Feb 16 06:07:01 [] ? radeon_crtc_cursor_set+0x92/0x6e0 [radeon]
    Feb 16 06:07:01 [] ? drm_mode_set_config_internal+0x5c/0xe0 [drm]
    Feb 16 06:07:01 [] ? drm_fb_helper_restore_fbdev_mode+0xb3/0xe0 [drm_kms_helper]
    Feb 16 06:07:01 [] ? drm_fb_helper_panic+0x78/0xa0 [drm_kms_helper]
    Feb 16 06:07:01 [] ? notifier_call_chain+0x55/0x80
    Feb 16 06:07:01 [] ? atomic_notifier_call_chain+0x1a/0x20
    Feb 16 06:07:01 [] ? panic+0xd2/0x16f
    Feb 16 06:07:01 [] ? sched_clock+0x9/0x10
    Feb 16 06:07:01 [] ? watchdog_overflow_callback+0xcd/0xd0
    Feb 16 06:07:01 [] ? __perf_event_overflow+0xa7/0x240
    Feb 16 06:07:01 [] ? perf_event_update_userpage+0x24/0x110
    Feb 16 06:07:01 [] ? perf_event_overflow+0x14/0x20
    Feb 16 06:07:01 [] ? x86_pmu_handle_irq+0x1eb/0x250
    Feb 16 06:07:01 [] ? perf_event_nmi_handler+0x39/0xb0
    Feb 16 06:07:01 [] ? notifier_call_chain+0x55/0x80
    Feb 16 06:07:01 [] ? atomic_notifier_call_chain+0x1a/0x20
    Feb 16 06:07:01 [] ? notify_die+0x2e/0x30
    Feb 16 06:07:01 [] ? do_nmi+0x1bb/0x340
    Feb 16 06:07:01 [] ? nmi+0x20/0x30
    Feb 16 06:07:01 [] ? _spin_lock+0x1e/0x30
    Feb 16 06:07:01 < >
    Feb 16 06:07:01 [] ? handle_pte_fault+0x833/0xb00
    Feb 16 06:07:01 [] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
    Feb 16 06:07:01 [] ? handle_mm_fault+0x22a/0x300
    Feb 16 06:07:01 [] ? __do_page_fault+0x138/0x480
    Feb 16 06:07:01 [] ? update_curr+0xe1/0x1f0
    Feb 16 06:07:01 [] ? perf_event_task_sched_out+0x33/0x70
    Feb 16 06:07:01 [] ? invalidate_interrupt0+0xe/0x20
    Feb 16 06:07:01 [] ? finish_task_switch+0x4c/0xf0
    Feb 16 06:07:01 [] ? do_page_fault+0x3e/0xa0
    Feb 16 06:07:01 [] ? page_fault+0x25/0x30
    Feb 16 06:07:01 [] ? copy_user_generic_string+0x32/0x40
    Feb 16 06:07:01 [] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
    Feb 16 06:07:01 [] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
    Feb 16 06:07:01 [] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
    Feb 16 06:07:01 [] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
    Feb 16 06:07:01 [] ? futex_wake+0x93/0x150
    Feb 16 06:07:01 [] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
    Feb 16 06:07:01 [] ? perf_event_task_sched_out+0x33/0x70
    Feb 16 06:07:01 [] ? apic_timer_interrupt+0xe/0x20
    Feb 16 06:07:01 [] ? vfs_ioctl+0x22/0xa0
    Feb 16 06:07:01 [] ? do_vfs_ioctl+0x3aa/0x580
    Feb 16 06:07:01 [] ? sys_ioctl+0x81/0xa0
    Feb 16 06:07:01 [] ? __audit_syscall_exit+0x25e/0x290
    Feb 16 06:07:01 [] ? system_call_fastpath+0x16/0x1b
    Feb 16 06:07:01 Clocksource tsc unstable (delta = -77309385171 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
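
Netconsole delivers the kernel's output in small pieces, and the receiving syslog stamps each piece, which is why a single "Modules linked in:" line can arrive as dozens of timestamped fragments. The Python sketch below shows one way to strip those stamps and stitch the module list back together. The function names and the timestamp pattern are assumptions made for this example; the pattern only covers the classic "Mon DD HH:MM:SS" syslog prefix seen in this capture.

```python
import re
from typing import Iterable, Iterator, List

# Classic syslog timestamp, e.g. "Feb 16 06:06:56 " (assumed format).
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}(?: |$)")


def fragments(lines: Iterable[str]) -> Iterator[str]:
    """Yield message pieces with every embedded timestamp removed."""
    for line in lines:
        for part in TIMESTAMP.split(line.strip()):
            part = part.strip()
            if part:
                yield part


def module_list(lines: Iterable[str]) -> List[str]:
    """Return the names from the first 'Modules linked in:' list."""
    header = "Modules linked in:"
    modules: List[str] = []
    collecting = False
    for frag in fragments(lines):
        if frag.startswith(header):
            if collecting:
                break  # a second list begins; keep only the first one
            collecting = True
            frag = frag[len(header):].strip()
        if not collecting or not frag:
            continue
        if frag.startswith("[last unloaded"):
            break  # trailer that closes the module list
        modules.extend(frag.split())
    return modules


# A shortened excerpt in the same shape as the capture above.
SAMPLE = [
    "Feb 16 06:06:56 Modules linked in:",
    "Feb 16 06:06:56 nf_nat Feb 16 06:06:56 mpt3sas Feb 16 06:06:56 mpt2sas",
    "Feb 16 06:06:56 jbd2",
    "Feb 16 06:06:56 dm_log Feb 16 06:06:56 dm_mod Feb 16 06:06:56 [last unloaded: dell_rbu]",
    "Feb 16 06:06:56 CPU 0",
]

if __name__ == "__main__":
    print("Modules linked in:", " ".join(module_list(SAMPLE)))
```

The same `fragments` helper can be run over the whole capture to get an untimestamped stream, but deciding where the original line breaks were still needs a human eye.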

  • I think the panic is the consequence of a drive write failure, so the actual problem is before the panic call trace. I’d post the entire dmesg somewhere wrap-safe (either your mail agent or the forum is hard-wrapping it, which makes it a pain to read).

    What do you get for smartctl -x?

    In the meantime, check or replace the cables. Usually it’s the connectors that are faulty, not the cable itself. Or replace the drive.

    Chris Murphy

  • At least part of the problem happens before this log starts.

    OK, no SMART extended self-test has been run, but there are also no pending, bad, or reallocated sectors, and no phy event errors either. So the write (10) error seems isolated, but it’s still really suspicious, so I’d start replacing hardware.

    The only thing that suggests it might not be hardware is all the kvm-related messages in the kernel panic. So if you’ve changed kernels or VM configuration recently, then I’d revert. That’s the limit of the most likely software explanation. If there are no recent software changes, then it must be hardware.
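
Since the report above shows that no self-test has ever been logged, a natural next step is to start an extended test and then re-read the logs. The Python sketch below only assembles the smartctl invocations and, optionally, runs them. The helper names and the default device path are assumptions for illustration; `-t long`, `-l selftest`, and `-x` are standard smartctl options, and the drive above reports roughly 194 minutes for the extended test.

```python
import shutil
import subprocess
from typing import List, Optional


def selftest_commands(device: str = "/dev/sda") -> List[List[str]]:
    """Command lines for an extended self-test and the follow-up reports."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log once it finishes
        ["smartctl", "-x", device],              # extended report, incl. SATA phy event counters
    ]


def run(cmd: List[str]) -> Optional[str]:
    """Run one command and return its stdout, or None if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Printed rather than executed: starting a self-test needs root and a real disk.
    for cmd in selftest_commands():
        print(" ".join(cmd))
```

The test runs inside the drive while the system stays up, so it can be started on a live host, though an unexpected reboot will interrupt it.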

  • Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
    Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
    Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
    Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
    Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
    Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
    Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
    Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
    Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
    Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
    Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
    Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
    Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.

    A Dell tech is en route, with a new system board and disk controller.

    How so? Each of the results I find says these are to be ignored.

    No changes from install out of the box.

  • Doesn’t seem related.

    I’m curious what they replace.

    Well, I found two older kernel bugs similar to this. In one, the problem stopped when kvm ran with 1 vCPU; in the other, it stopped when the VM was rebuilt as 32-bit instead of 64-bit. But my ability to read kernel call traces is very limited, so I really don’t know what’s going on.

    If it’s a kernel bug, though, you could maybe clobber it with a substantially newer kernel. You might check out the elrepo kernels. 2.6.32 is really old. Granted, the CentOS one you’re running has a huge pile of backports that makes it less “ancient” from a stability perspective, but anything really new that’s hard to backport likely isn’t in that kernel. While you’re waiting for Dell you could try either:

    kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
    kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

    What’s running in the VM?

  • Both, but the backplane is not on the replacement list.

    I can say we have about 20 identical systems doing the same work: PE2970s running RHEL6/CentOS6 and libvirtd.

    We should start looking at CentOS7/RHEL7 (ugh, systemd...). But these machines are ancient too.

    Unlikely, since I do not have a test plan. If I could reproduce the error on demand, then it would be a valid experiment. Some of the systems are running RHEL6 and are under support, while the others are CentOS6. The configs are kept as close as possible to each other.

    Besides I am doing the migration right now to another host.

    Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system was handling most of the CipherShed.org Jenkins CI farm. I can say the resources are oversubscribed by about 15x, but the system runs at below 0.10 at any random time.

    Thanks for the thoughts on this.

    -Jason

  • Twenty other identical systems doing the same work, with a single outlier, strongly suggests a hardware problem.

    I’ve been using it since Fedora 15, and I find it easier to use for troubleshooting boot and service startup problems. systemd-analyze blame and plot are quite useful for optimizing boot performance. The journal on Fedora these days is persistent; on CentOS it’s volatile, with rsyslog running by default. But I like being able to run journalctl -b -2 or -b -3 to view previous boots, point all systems to a single server, seal the journal logs against tampering, etc. It’s certainly different, but it wasn’t onerous to get used to, and these days I prefer it.

    I’d say it’s unnecessary at this point. It’s almost certainly a hardware problem given the numerous identical setups not having this problem. But, seeing as it panics every 30-40 hours, it can hardly be much worse with a new kernel running for a couple days… but my bet is there’d be no change.
