The Community Forums

Fix DNS Cluster Queue Bugs

Discussion in 'Bind / DNS / Nameserver Issues' started by VeZoZ, Jan 11, 2010.

  1. VeZoZ

    VeZoZ Well-Known Member

    Joined:
    Dec 14, 2002
    Messages:
    248
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider
    I've tried addressing these issues in two separate tickets. In the first, I believe I was told there was a ticket open for resolution. When this hadn't been resolved 6+ months later, I asked in another ticket and was told it was fixed. When I explained the issue again, I was essentially told not to produce the problem. At that point I voiced my opinion about the support person's lack of understanding and closed the ticket. I'd rather not argue back and forth with someone.

    The DNS cluster queue system does not differentiate between communication issues and other kinds of failures. The system seems to be designed for communication issues, not other types of errors. For example, a big problem is that it treats a failed zone removal as a failure to retry. This is especially confusing when the error occurs because the zone does not exist, which is exactly the state the removal is trying to reach. This should be revised: if a zone being removed is already gone, the request should simply be removed from the queue.
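    A minimal Python sketch of the idempotent removal described above. This is not cPanel's code: `remove_zone_idempotent` and the in-memory `zones` set are hypothetical stand-ins for a nameserver's zone list.

```python
def remove_zone_idempotent(zones: set, zone: str) -> bool:
    """Remove `zone` from `zones`; return True if it is absent afterwards.

    A zone that is already missing counts as success, because the goal of
    the request (the zone not existing) has already been met.
    """
    zones.discard(zone)  # discard() does not raise if the zone is missing
    return zone not in zones
```

    Calling it twice for the same zone reports success both times, so the second request would never need to stay in a queue.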

    This issue is not supposed to happen, but as I stressed to support, it does happen. The same account can be on two servers in a cluster. One gets removed, and then 6 months later, during a cleanup, the other is removed. Now you have 14 days of a remove zone request being run through the queue every 15 minutes. The queue system should be smart enough to treat issues like this as nothing more than an error in the log and remove them from the queue.

    The second issue is that, from my testing, queue runs are not per domain. They cover all the requests sent through together. This is also why the remove issue is such a big deal. A single nonexistent zone is enough to cause an entire queue run to be run over and over again. For example, say I terminate an account with 50 zones and 1 zone was already removed. The queue will keep calling remove zone for all 50 zones even though only one failed. And as I said, a remove that fails because the zone does not exist really should not be treated the same as a communication issue.
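    The per-request behavior asked for here could look roughly like the Python sketch below. It assumes a hypothetical `CommunicationError` exception and a caller-supplied `send` function; neither is part of cPanel, and the real queue may be structured quite differently.

```python
class CommunicationError(Exception):
    """Hypothetical: the peer could not be reached or timed out."""


def run_queue(requests, send):
    """Send each queued request once; return only the ones worth retrying."""
    retry = []
    for request in requests:
        try:
            send(request)
        except CommunicationError:
            retry.append(request)  # peer unreachable: try again next run
        except Exception as err:
            # Any other failure is logged and dropped, not retried.
            print(f"dropping {request!r}: {err}")
    return retry
```

    For scale, if the figures in the post hold: one run every 15 minutes is 96 runs a day, so 14 days is 14 × 96 = 1,344 runs, and a 50-zone batch retried as a whole would issue 1,344 × 50 = 67,200 remove calls.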


    I am pretty disappointed that the support team does not consider these issues bugs. The solution of "don't produce them" is ridiculous, and if they had any experience with tens of thousands of cPanel accounts on a large number of machines, they would know that the zones on the cluster do not always match the web servers 100%. There are cPanel bugs in the cluster system from years ago, possibly other bugs in cPanel, and human error from technicians, all of which can produce this problem.
     
  2. cPanelKenneth

    cPanelKenneth cPanel Development
    Staff Member

    Joined:
    Apr 7, 2006
    Messages:
    4,458
    Likes Received:
    22
    Trophy Points:
    38
    cPanel Access Level:
    Root Administrator
    Please PM me your ticket numbers, for internal review.

    Your observation is correct. The queueing mechanism is a simple agent, only aware of communication failures.

    The reason for the dnsadmin command failure needs to propagate up to the queue processor in order for it to make a more intelligent decision as to failure reason. This will require some redesign of the agent.
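    As a rough illustration only, propagating the failure reason could mean returning a structured result instead of a bare pass/fail. The Python below is a sketch under that assumption; `Reason`, `CommandResult`, and `should_requeue` are hypothetical names, not part of dnsadmin or the existing agent.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Reason(Enum):
    OK = auto()
    UNREACHABLE = auto()   # communication failure with the peer
    ZONE_MISSING = auto()  # the zone was not there to begin with
    OTHER = auto()         # any other command error


@dataclass
class CommandResult:
    command: str
    zone: str
    reason: Reason


def should_requeue(result: CommandResult) -> bool:
    """Only communication failures are worth another queue run."""
    return result.reason is Reason.UNREACHABLE
```

    With the reason attached, the queue processor can log a `ZONE_MISSING` removal once and discard it, while still retrying requests that never reached the peer.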
     
  3. VeZoZ

    VeZoZ Well-Known Member

    Joined:
    Dec 14, 2002
    Messages:
    248
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider

    Well, I hope it's something that's going to be changed. It's causing unavailability at random times due to the unnecessary queue runs. I'm not sure why queue runs cause it when adding zones and such does not, but they do. The solution we're using right now is to wipe out the queues when we see them, but it would obviously be far better if the system were smart enough to re-run only on communication failures.
     
  4. TechBrein

    TechBrein Member

    Joined:
    Jul 31, 2006
    Messages:
    17
    Likes Received:
    0
    Trophy Points:
    1
    Increasing the value of remotewhmtimeout to about 2 minutes helped us immensely in fixing the cPanel DNS cluster queue issue. We also took various steps so that the DNS update would never fail, and that fixed the issue for us. :)
     
  5. cPanelDon

    cPanelDon cPanel Quality Assurance Analyst
    Staff Member

    Joined:
    Nov 5, 2008
    Messages:
    2,557
    Likes Received:
    7
    Trophy Points:
    38
    Location:
    Houston, Texas, U.S.A.
    cPanel Access Level:
    DataCenter Provider
    Other than the timeout setting, did any of the other steps involve changes or tweaks to the cPanel/WHM configuration?

    For reference, the timeout may also be adjusted with root access to WHM using the following menu path:
    WHM: Main >> Server Configuration >> Tweak Settings >> System
    * Specify the timeout in seconds for connections between this server and other remote WHM servers. Values less than 35 cannot be specified.
     
  6. cPanelNick

    cPanelNick Administrator
    Staff Member

    Joined:
    Mar 9, 2015
    Messages:
    3,426
    Likes Received:
    2
    Trophy Points:
    38
    cPanel Access Level:
    DataCenter Provider
    VeZoZ,

    What versions of cPanel are you running on all the machines in the cluster?
    11.25 has some major improvements in this area. However, all machines in the cluster must be running 11.25 to take full advantage of these improvements.
     
  7. VeZoZ

    VeZoZ Well-Known Member

    Joined:
    Dec 14, 2002
    Messages:
    248
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider
    Running stable so no 11.25 as of yet.

    Although reading the release notes I don't see much that would solve this issue. I read this:

    "Each cPanel™ 11.25 cluster member may now configure a peer failure threshold. This option is found in the Configure
    Cluster interface in WHM. The threshold specifies how many dnsadmin commands a peer may fail to respond to before
    that peer is automatically disabled. The threshold is local to the server where it is stipulated."


    In our current situation we'd have to disable this entirely; otherwise we'd have peers thrown out whenever a bunch of zone removals fail because the zones don't exist.
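    The distinction that matters for such a threshold can be sketched in Python as below. The quoted release notes do not say how 11.25 actually counts failures, so `PeerHealth` is a purely hypothetical model, and disabling exactly when the count reaches the threshold is an assumption.

```python
class PeerHealth:
    """Tracks unanswered commands for one cluster peer (hypothetical model)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.unanswered = 0
        self.enabled = True

    def record(self, responded: bool) -> None:
        # A reply of "zone does not exist" is still a response, so callers
        # pass responded=True for it and it never counts toward the limit.
        if responded:
            return
        self.unanswered += 1
        if self.unanswered >= self.threshold:  # assumption: disable at the limit
            self.enabled = False
```

    Under this model, 50 failed removals of missing zones leave the peer enabled, and only commands the peer never answered move it toward being disabled.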
     
  8. TechBrein

    TechBrein Member

    Joined:
    Jul 31, 2006
    Messages:
    17
    Likes Received:
    0
    Trophy Points:
    1
    Nope :) Only the aforementioned modification involved WHM configuration. The rest were done to modify the then-existing DNS cluster setup of our customer.

     
  9. VeZoZ

    VeZoZ Well-Known Member

    Joined:
    Dec 14, 2002
    Messages:
    248
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider

    I wish ours was that easy :-(

    We have customers who split add-on domains into their own cPanel accounts, which is where the problem lies. Say an account on Server1 has domain.com as an add-on. The customer then orders another hosting account specifically for domain.com, and it's put on another server. If both accounts are later terminated, say for non-payment, we get our queue issues. It also seems to be produced by resellers with accounts across numerous servers.

    There's just no avoiding some weird zone issues from time to time when dealing with those sorts of scenarios.
     
  10. VeZoZ

    VeZoZ Well-Known Member

    Joined:
    Dec 14, 2002
    Messages:
    248
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider
    Well, I did some further testing and noticed something else. The behavior is not even consistent when deleting zones that do not exist.

    This is what happens when you use delete zone on the machine that has the account:


    Unable to remove zone domain.com from the Bind configuration (named.conf) on ns2.
    The zone was possibly removed earlier on ns2.
    Zones Removed: domain.com => deleted from ns2.
    Unable to remove zone domain.com from the Bind configuration (named.conf) on ns1.
    The zone was possibly removed earlier on ns1.
    Zones Removed: domain.com => deleted from ns1.
    Zones Removed: domain.com => deleted from web.


    Now it gets odd, because guess what: nothing in the clusterqueue. However, if I were to terminate the account for this same domain.com, I'd get the constant clusterqueue runs.

    I'm confused now. Do the add, edit and delete zone functions not run through the clusterqueue system? Or do they for some reason have logic, which apparently none of the other portions of cPanel have, to deal with the issue described in this thread?
     