The Community Forums

Severe performance problem requiring reboots

Discussion in 'General Discussion' started by fugtruck, Apr 29, 2011.

  1. fugtruck

    fugtruck Member

    Joined:
    Apr 27, 2010
    Messages:
    21
    Likes Received:
    0
    Trophy Points:
    1
    I am having a recurring (and seemingly random) problem where one of my servers becomes virtually unresponsive. I can SSH into the server, and some commands execute properly while others just hang. For example, the 'uptime' command executes and returns data, but 'w' just hangs. Some details about the server: CentOS 5.6 x64 with kernel 2.6.18-238.9.1.el5 installed. It has 4 CPU cores and 6GB RAM. It is a virtual machine running on VMware ESXi on a VMFS datastore. The partitions are a combination of ext3 and ext4. The server had been performing fine for the past year and a half and only started having problems within the past week or two.

    The uptime command shows a load average greater than 300 (normal for this server is between 5 and 10); however, the vSphere client shows CPU activity dropping to almost nothing when the problem occurs. When it does occur, I have to power the server off and back on, as the restart or shutdown command just hangs. Once the server boots back up, it goes back to performing just fine.
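
A load average this high with almost no CPU activity is consistent with many tasks stuck in uninterruptible sleep (state D), which Linux counts toward the load average. A minimal sketch for listing such tasks while the server is hung, assuming `ps` itself still returns (it may hang too, as 'w' does). The function name `list_blocked` is hypothetical:

```shell
#!/bin/bash
# list_blocked (hypothetical helper): read "STATE PID COMMAND" lines on
# stdin and print "PID COMMAND" for tasks whose state starts with D
# (uninterruptible sleep).
list_blocked() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Typical use on a live system (the trailing "=" suppresses column headers):
#   ps -eo state=,pid=,comm= | list_blocked
```

Reading the process list from stdin keeps the filter separate from `ps`, so it can also be run against output captured earlier.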

    I found the following errors show up in the syslog when the problem occurs (see below). Any suggestions on what to do about this?

    Apr 28 11:55:29 servername kernel: INFO: task pdflush:339 blocked for more than 120 seconds.
    Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr 28 11:55:29 servername kernel: pdflush D ffff8101bbc4f000 0 339 71 340 338 (L-TLB)
    Apr 28 11:55:29 servername kernel: ffff8101bf369b70 0000000000000046 ffff81003bf1ed98 ffff8101bf369be8
    Apr 28 11:55:29 servername kernel: 0000000000000001 000000000000000a ffff8101bf9a80c0 ffff81000a502080
    Apr 28 11:55:29 servername kernel: 00007b9830595b62 00000000000568d1 ffff8101bf9a82a8 000000038004817a
    Apr 28 11:55:29 servername kernel: Call Trace:
    Apr 28 11:55:29 servername kernel: [<ffffffff800f6300>] write_cache_pages+0x2ac/0x332
    Apr 28 11:55:29 servername kernel: [<ffffffff8839f066>] :ext4:__mpage_da_writepage+0x0/0x162
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ec9e>] :jbd2:start_this_handle+0x2e9/0x3b3
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff800dde58>] alternate_node_alloc+0x70/0x8c
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ee09>] :jbd2:jbd2_journal_start+0xa1/0xd8
    Apr 28 11:55:29 servername kernel: [<ffffffff883a081c>] :ext4:ext4_da_writepages+0x296/0x4fc
    Apr 28 11:55:29 servername kernel: [<ffffffff8005ae89>] do_writepages+0x20/0x2f
    Apr 28 11:55:29 servername kernel: [<ffffffff8002fe0e>] __writeback_single_inode+0x19e/0x318
    Apr 28 11:55:29 servername kernel: [<ffffffff800bfe94>] delayacct_end+0x5d/0x86
    Apr 28 11:55:29 servername kernel: [<ffffffff8008dc33>] dequeue_task+0x18/0x37
    Apr 28 11:55:29 servername kernel: [<ffffffff800210db>] sync_sb_inodes+0x1b5/0x26f
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff80051308>] writeback_inodes+0x82/0xd8
    Apr 28 11:55:29 servername kernel: [<ffffffff800cbbd5>] wb_kupdate+0xd4/0x14e
    Apr 28 11:55:29 servername kernel: [<ffffffff8005689e>] pdflush+0x0/0x1fb
    Apr 28 11:55:29 servername kernel: [<ffffffff800569ef>] pdflush+0x151/0x1fb
    Apr 28 11:55:29 servername kernel: [<ffffffff800cbb01>] wb_kupdate+0x0/0x14e
    Apr 28 11:55:29 servername kernel: [<ffffffff80032afc>] kthread+0xfe/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff800329fe>] kthread+0x0/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
    Apr 28 11:55:29 servername kernel:
    Apr 28 11:55:29 servername kernel: INFO: task kswapd0:340 blocked for more than 120 seconds.
    Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr 28 11:55:29 servername kernel: kswapd0 D ffff8101bbc4f000 0 340 71 341 339 (L-TLB)
    Apr 28 11:55:29 servername kernel: ffff8101bf36dc10 0000000000000046 0000000000000001 ffffffff800c8994
    Apr 28 11:55:29 servername kernel: ffff8101063420c0 000000000000000a ffff8101bf9a8820 ffff8100136b2820
    Apr 28 11:55:29 servername kernel: 00007b95f5267fab 000000000009b63d ffff8101bf9a8a08 0000000200000001
    Apr 28 11:55:29 servername kernel: Call Trace:
    Apr 28 11:55:29 servername kernel: [<ffffffff800c8994>] __remove_from_page_cache+0x1f/0x6c
    Apr 28 11:55:29 servername kernel: [<ffffffff80023940>] __pagevec_free+0x21/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff8000b26f>] release_pages+0x14d/0x15a
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ec9e>] :jbd2:start_this_handle+0x2e9/0x3b3
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff800dde58>] alternate_node_alloc+0x70/0x8c
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ee09>] :jbd2:jbd2_journal_start+0xa1/0xd8
    Apr 28 11:55:29 servername kernel: [<ffffffff883ab6f7>] :ext4:ext4_release_dquot+0x42/0x7f
    Apr 28 11:55:29 servername kernel: [<ffffffff80103c27>] dqput+0x1be/0x200
    Apr 28 11:55:29 servername kernel: [<ffffffff801041f3>] dquot_drop+0x30/0x5e
    Apr 28 11:55:29 servername kernel: [<ffffffff80022ff9>] clear_inode+0xb4/0x123
    Apr 28 11:55:29 servername kernel: [<ffffffff80035164>] dispose_list+0x41/0xe0
    Apr 28 11:55:29 servername kernel: [<ffffffff8002d8b3>] shrink_icache_memory+0x1b7/0x1e6
    Apr 28 11:55:29 servername kernel: [<ffffffff8003f73f>] shrink_slab+0xdc/0x153
    Apr 28 11:55:29 servername kernel: [<ffffffff80057e74>] kswapd+0x35d/0x495
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff80057b17>] kswapd+0x0/0x495
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff80032afc>] kthread+0xfe/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff800329fe>] kthread+0x0/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
    Apr 28 11:55:29 servername kernel:
    Apr 28 11:55:29 servername kernel: INFO: task jbd2/sdg1-8:2610 blocked for more than 120 seconds.
    Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr 28 11:55:29 servername kernel: jbd2/sdg1-8 D ffff8101bffa29c0 0 2610 71 2611 2600 (L-TLB)
    Apr 28 11:55:29 servername kernel: ffff8101b8623d60 0000000000000046 0000000000000282 ffffffff8002232e
    Apr 28 11:55:29 servername kernel: ffff8101b8623cf0 000000000000000a ffff8101bf31b7a0 ffff810137ed2820
    Apr 28 11:55:29 servername kernel: 00007b95b09f8cf9 0000000000000be6 ffff8101bf31b988 0000000300000001
    Apr 28 11:55:29 servername kernel: Call Trace:
    Apr 28 11:55:29 servername kernel: [<ffffffff8002232e>] __up_read+0x19/0x7f
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ff6d>] :jbd2:jbd2_journal_commit_transaction+0x191/0x1068
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff8003dd6e>] lock_timer_base+0x1b/0x3c
    Apr 28 11:55:29 servername kernel: [<ffffffff8004b3fb>] try_to_del_timer_sync+0x7f/0x88
    Apr 28 11:55:29 servername kernel: [<ffffffff88383f7c>] :jbd2:kjournald2+0x9a/0x1ec
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff88383ee2>] :jbd2:kjournald2+0x0/0x1ec
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff80032afc>] kthread+0xfe/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
    Apr 28 11:55:29 servername kernel: [<ffffffff800a26db>] keventd_create_kthread+0x0/0xc4
    Apr 28 11:55:29 servername kernel: [<ffffffff800329fe>] kthread+0x0/0x132
    Apr 28 11:55:29 servername kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
    Apr 28 11:55:29 servername kernel:
    Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Apr 28 11:55:29 servername kernel: ffff8101891cbd78 0000000000000082 ffff810168a210f8 ffffffff8000d044
    Apr 28 11:55:29 servername kernel: ffff8101bf595780 000000000000000a ffff8101b48b3820 ffff810011f72860
    Apr 28 11:55:29 servername kernel: 00007b95e2033232 000000000003271a ffff8101b48b3a08 00000001891cbea8
    Apr 28 11:55:29 servername kernel: Call Trace:
    Apr 28 11:55:29 servername kernel: [<ffffffff8000d044>] do_lookup+0x65/0x1e6
    Apr 28 11:55:29 servername kernel: [<ffffffff8000a831>] __link_path_walk+0xf90/0xfb9
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ec9e>] :jbd2:start_this_handle+0x2e9/0x3b3
    Apr 28 11:55:29 servername kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
    Apr 28 11:55:29 servername kernel: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
    Apr 28 11:55:29 servername kernel: [<ffffffff8837ee09>] :jbd2:jbd2_journal_start+0xa1/0xd8
    Apr 28 11:55:29 servername kernel: [<ffffffff883a0f3a>] :ext4:ext4_setattr+0x1b5/0x339
    Apr 28 11:55:29 servername kernel: [<ffffffff8002ca25>] notify_change+0x145/0x2f3
    Apr 28 11:55:29 servername kernel: [<ffffffff800e2020>] do_truncate+0x67/0x82
    Apr 28 11:55:29 servername kernel: [<ffffffff800b9646>] audit_syscall_entry+0x1a4/0x1cf
    Apr 28 11:55:29 servername kernel: [<ffffffff8004cd3c>] sys_ftruncate+0xe4/0x101
    Apr 28 11:55:29 servername kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
    Apr 28 11:55:29 servername kernel:
    .
    .
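
To see at a glance which tasks the kernel reported as blocked, the "INFO: task ... blocked for more than" lines can be pulled out of the log. A rough sketch, assuming the message format shown above; the function name `blocked_tasks` is hypothetical, and /var/log/messages is assumed to be where the kernel messages are written:

```shell
#!/bin/bash
# blocked_tasks (hypothetical helper): read syslog lines on stdin and print
# "NAME PID" for each "INFO: task NAME:PID blocked for more than N seconds."
blocked_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds.*/\1 \2/p'
}

# Typical use, counting how often each task was reported:
#   blocked_tasks < /var/log/messages | sort | uniq -c | sort -rn
```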
     
  2. cPanelTristan

    cPanelTristan Quality Assurance Analyst
    Staff Member

    Joined:
    Oct 2, 2010
    Messages:
    7,623
    Likes Received:
    21
    Trophy Points:
    38
    Location:
    somewhere over the rainbow
    cPanel Access Level:
    Root Administrator
    Does the "df -k" command hang as well? If so, can you provide the output of these commands and show where "df -k" hangs (provided it returns any details before hanging)?

    Code:
    cat /etc/fstab
    mount
     
  3. fugtruck

    Commands only hang if the problem is occurring, which rebooting clears. Here is the output from the two commands you requested.

    # cat /etc/fstab
    LABEL=/ / ext3 defaults,usrquota 1 1
    LABEL=/usr /usr ext4 defaults,usrquota 1 2
    LABEL=/var /var ext4 defaults,usrquota 1 2
    LABEL=/tmp /tmp ext4 defaults,nosuid,noexec,nodev 1 2
    LABEL=/home /home ext4 defaults,usrquota 0 0
    LABEL=/home2 /home2 ext4 defaults,usrquota 0 0
    LABEL=/home3 /home3 ext4 defaults,usrquota 0 0
    LABEL=/boot /boot ext3 defaults 1 2
    tmpfs /dev/shm tmpfs defaults,nosuid,noexec 0 0
    devpts /dev/pts devpts gid=5,mode=620 0 0
    sysfs /sys sysfs defaults 0 0
    proc /proc proc defaults 0 0
    LABEL=SWAP-sdb1 swap swap defaults 0 0

    # mount
    /dev/sda1 on / type ext3 (rw,usrquota)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    /dev/sdc1 on /usr type ext4 (rw,usrquota)
    /dev/sdd1 on /var type ext4 (rw,usrquota)
    /dev/sde1 on /tmp type ext4 (rw,noexec,nosuid,nodev)
    /dev/sdg1 on /home type ext4 (rw,usrquota)
    /dev/sdh1 on /home2 type ext4 (rw,usrquota)
    /dev/sdi1 on /home3 type ext4 (rw,usrquota)
    /dev/sdf1 on /boot type ext3 (rw)
    tmpfs on /dev/shm type tmpfs (rw,noexec,nosuid)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    /tmp on /var/tmp type none (rw,noexec,nosuid,bind)
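
The journal thread named in the earlier trace, jbd2/sdg1-8, appears to be named after its block device, and the mount output above places /dev/sdg1 on /home. A small sketch of that lookup, assuming the jbd2/<device>-<n> naming seen in the trace; the function name `mount_for_jbd2` is hypothetical:

```shell
#!/bin/bash
# mount_for_jbd2 (hypothetical helper): given a jbd2 thread name such as
# "jbd2/sdg1-8", read `mount` output on stdin and print the mount point
# of the matching /dev device.
mount_for_jbd2() {
    dev=${1#jbd2/}   # strip the "jbd2/" prefix   -> "sdg1-8"
    dev=${dev%-*}    # strip the trailing "-<n>"  -> "sdg1"
    awk -v d="/dev/$dev" '$1 == d && $2 == "on" { print $3 }'
}

# Typical use:
#   mount | mount_for_jbd2 jbd2/sdg1-8
```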
     
  4. cPanelTristan

    Every instance of the "blocked for more than 120 seconds" message (which process was blocked isn't important; what matters is that it was blocked) indicates some type of hardware or OS-level issue on the machine. In my online research, I found discussions about a RAID bug for RedHat as one possible cause:

    https://bugzilla.redhat.com/show_bug.cgi?id=576749

    As such, do you happen to run Intel BIOS RAID 1 on the machine? If not, it still appears to be a hardware or OS-level issue. The best course of action would be to contact your datacenter, NOC, or provider to have them investigate this further.
     
  5. fugtruck

    I would be more inclined to think it's an OS issue rather than hardware. I have other cPanel servers running on the same physical hardware (these are all virtual machines) and none are experiencing the problem except this one.
     
  6. cPanelTristan

    To clarify, is it just one VPS node having the issue while all the others are on the exact same physical machine? Do you control the main node or only some of the VPS instances?
     
  7. fugtruck

    Yes, same physical machine (most of the time...we migrate VMs around as needed), but they reside on the same SAN at all times. And I am the system admin for the entire infrastructure.
     
