DNS Lookup Delay With CentOS & Postfix

I’m a bit baffled by this and I’m looking for ideas…

background:
two DNS servers (ns1 & ns2)(64bit CentOS 5.8)
one email server (64bit CentOS 5.8 & postfix 2.3.3)
one nagios server (64bit CentOS 5.8 & nagios 3.3.1)

situation:
– all servers configured to use both DNS servers for lookups
– ns1 server down for hardware problem
– nagios alerts that SMTP on email server taking longer than 2 seconds to respond
– nagios alert for SMTP on email server clears when ns1 returns to service

– when I use dig from the email server command line there is no problem or delay when ns1 is offline. It worked without a hitch using ns2.

Does anyone have any idea why nagios would have trouble testing SMTP on the email server when the primary DNS server goes offline? I’m not even sure where to look, or who else it would make sense to ask about this. I’d appreciate any insight anyone out there has.

12 thoughts on - DNS Lookup Delay With CentOS & Postfix

  • Does dig use libresolv or read directly from resolv.conf? Also, do you have a timeout configured in resolv.conf, or are you relying on the OS default?

  • The default timeout for a DNS lookup is usually 5 seconds so the system will try ns1, time out after 5 seconds and then use ns2.

    Regards,
    Dennis
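
To make the arithmetic concrete, here is a minimal Python sketch (not from the thread) that reads resolv.conf-style text and reports how long the resolver would wait on an unresponsive first nameserver before trying the next one. It assumes the defaults documented in resolv.conf(5), a 5 second timeout and 2 attempts, and it ignores how later retry rounds are timed. The function names and the 192.0.2.x addresses are placeholders for illustration only.

```python
# Sketch: how long does the resolver wait on a dead first nameserver?
DEFAULT_TIMEOUT = 5   # seconds, default per resolv.conf(5)
DEFAULT_ATTEMPTS = 2  # default per resolv.conf(5)


def parse_resolv_conf(text):
    """Return (nameservers, timeout, attempts) from resolv.conf content."""
    nameservers = []
    timeout = DEFAULT_TIMEOUT
    attempts = DEFAULT_ATTEMPTS
    for raw in text.splitlines():
        # Drop comments introduced by '#' or ';'.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name == "timeout" and value.isdigit():
                    timeout = int(value)
                elif name == "attempts" and value.isdigit():
                    attempts = int(value)
    return nameservers, timeout, attempts


def delay_before_fallback(text):
    """Seconds spent waiting on the first server if it never answers."""
    nameservers, timeout, _ = parse_resolv_conf(text)
    # With only one server listed there is nothing to fall back to.
    return timeout if len(nameservers) > 1 else None


# Placeholder addresses standing in for ns1 and ns2.
sample = "nameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(delay_before_fallback(sample))  # 5
```

With no options line, the sample reports 5 seconds, which is longer than the 2 second threshold in the nagios check described in the original post.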

  • Yes, a delay is normal when the 1st dns server is down. You might want to run a caching nameserver on your email server (and perhaps others) so you don’t wait for cached addresses. The caching servers can use the main ones as forwarders if necessary.

  • dig uses resolv.conf and no timeouts are configured there. I don’t know where the OS would have a default configured or what it is. Another reply indicated there would be a 5 second delay. That seems a bit high to me.

    I used dig from the email svr command line with the primary DNS svr up and (naturally) it pulled from there as normal. Then I downed the primary DNS svr, saw the nagios check fail and tried again. The same dig lookup was actually faster and pulled from the secondary DNS svr just fine. And, again, the nagios alert cleared as soon as the primary DNS svr was back online.

    For both tests I used: dig mx google.com
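
dig sends its own queries rather than going through the C library lookup calls that most applications use, so timing dig may not show the delay a service sees. As a rough cross-check, the following Python 3 sketch times a lookup made through `socket.getaddrinfo`, which does go through the system resolver. The `time_lookup` helper and its `resolve` parameter are hypothetical names introduced here, not something used in the thread.

```python
import socket
import time


def time_lookup(host, resolve=socket.getaddrinfo):
    """Return the seconds taken to resolve host via the given resolver call."""
    start = time.monotonic()
    try:
        resolve(host, None)
    except socket.gaierror:
        # A failed lookup still shows how long the resolver waited.
        pass
    return time.monotonic() - start
```

Calling `time_lookup("google.com")` on the email server once with ns1 up and once with ns1 down would show whether the system resolver, as opposed to dig, stalls on the first nameserver.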

  • DNS lookups default to using 53/udp, and only use 53/tcp for zone transfers. Could it be that 53/udp is being lost or blocked between this host and your ns1?

  • I would always set a timeout in your resolv.conf rather than relying on the OS default.

    Set that to 1 second and test again to see if there is any difference.
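
For reference, a resolv.conf along the lines suggested above might look like the fragment below. This is a sketch only: the 192.0.2.x addresses are placeholders standing in for ns1 and ns2, and `timeout` and `attempts` are the option names documented in resolv.conf(5).

```
# /etc/resolv.conf
# Placeholder addresses standing in for ns1 and ns2.
nameserver 192.0.2.1
nameserver 192.0.2.2
# Wait 1 second per query instead of the default 5 before trying the next server.
options timeout:1 attempts:2
```

The resolver still tries the first server before the second on each lookup, so this shortens the delay when ns1 is down rather than removing it.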

  • and that sounds like the best solution so far. I hadn’t considered that… haven’t looked at that file in ages.

    I do like knowing why something doesn’t work, but I’m good with just getting it to work too. I’ll give this a try, thanks!

  • Unfortunately that is a common misconception.

    TCP is used far more often than “only” for zone transfers, for example when a response exceeds the UDP size limit.

    Bottom line is both ports are needed, not just for zone xfers.

    jlc

  • Except that the malware guys have figured out how to abuse port 53. Security recommendation is to block TCP unless you’re running a DNS server. And also block oversize port 53 UDP packets.

    Dave M

  • Blocking oversize UDP packets is a very bad idea. EDNS is used for a lot of lookups these days due to DNSSEC, so blocking oversize UDP packets will force you to use TCP for many of your DNS requests.

    Tris

  • On Wednesday 25 July 2012 17:47, the following was written:

    I believe the reason you noticed a faster response is that the second query used the cached information from the first lookup, not that the second server was faster.

    To verify this, look at the TTL values in the response.
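
The TTL check can be sketched in Python. A cache hands out a record with its remaining lifetime, so if the TTL for the same record is lower on the second run of `dig mx google.com` than on the first, the second answer very likely came from a cache. The sketch below parses the record lines of saved dig output; `answer_ttls` and `looks_cached` are hypothetical helper names, and it assumes dig's usual "name TTL class type data" line layout.

```python
def answer_ttls(dig_output):
    """Map (name, type, data) to TTL for each record line in dig output."""
    ttls = {}
    for line in dig_output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's ';' comment/header lines.
        if not line or line.startswith(";"):
            continue
        fields = line.split(None, 4)
        # Record lines look like: name TTL class type data
        if len(fields) == 5 and fields[1].isdigit():
            name, ttl, _cls, rtype, rdata = fields
            ttls[(name, rtype, rdata)] = int(ttl)
    return ttls


def looks_cached(first_output, second_output):
    """True if any record's TTL dropped between two runs of the same query."""
    first = answer_ttls(first_output)
    second = answer_ttls(second_output)
    return any(key in first and second[key] < first[key] for key in second)
```

Feeding it the saved output of two consecutive dig runs returns True when at least one record's TTL has counted down between them.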
