How To Increase DNS Reliability?


Hi,

how can DNS reliability, as experienced by clients on the LAN who are sending queries, be increased?

Would I have to set up some sort of cluster consisting of several servers all providing DNS services which is reachable under a single IP address known to the clients?

Just setting up several name servers and making them known to the clients, so the clients switch over automatically, isn’t a good solution because the clients sit through their timeouts, and users lacking even the most basic knowledge inevitably panic when the first name server does not answer queries.


  • On 2019-07-25 14:51, hw wrote:

    Run a local cache (unbound) and enter all your local resolvers as upstreams.
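
    A minimal unbound.conf sketch of such a cache might look like this (the upstream addresses are placeholders for your own resolvers):

    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow

    forward-zone:
        name: "."
        # unbound tries the other upstream when one stops answering
        forward-addr: 10.0.0.11
        forward-addr: 10.0.0.12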

  • If you don’t want multiple DNS server entries on the client, then a master and (possibly multiple) slave server configuration can be set up (I’m assuming ISC DNS; their solution to redundancy/failover is master and slave servers, and this may be the way it is with all DNS). keepalived can be used for failover and will present a single IP address (which the clients would use) shared among the servers. haproxy or alternatives might be another failover option. Each technology has its own learning curve (and doing this will require at least two) and caveats. In particular, systemd doesn’t appear to play well with technologies creating IP addresses it doesn’t manage. The version of keepalived we’re using also has a nasty quirk of its own: it comes up assuming it is master until it discovers otherwise, even if it is configured as backup. In most cases this is probably either a non-issue (no scripts being used) or a minor annoyance. But if you’re using scripts triggered by keepalived which make significant (and possibly conflicting) changes to the environment, then you’ll need to embed “intelligence” in them to wait until the final state is reached, or to test state before acting, or some other option.
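
    As a sketch of the keepalived side (interface name, addresses and password here are made up), a minimal VRRP block on each DNS server could look roughly like this, with the clients pointed only at the virtual address:

    vrrp_instance DNS_VIP {
        state BACKUP
        interface eth0
        virtual_router_id 53
        priority 100            # give the preferred server a higher priority
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass dnsvip
        }
        virtual_ipaddress {
            192.168.1.53/24     # the single address the clients use
        }
    }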

  • That can fail just as well, or be even worse when the clients can’t switch over anymore. I have that and am avoiding using it for some clients because it takes a while for the cache to get updated when I make changes.

    However, if that cache fails, chances are that the internet connection is also down in which case it can be troublesome to even get local host names resolved. When that happens, trouble is to be expected.

  • On 2019-07-25 15:41, hw wrote:

    Anything else is – IMHO – much more work, much more complicated and much more likely to fail, in a more spectacular way. Especially all those keepalive “solutions”.

    I have found that I need to restart unbound if all upstreams have failed.

  • Sounds like you’re performing maintenance on your servers

    (a) too often
    (b) during office / peak hours

    You could load balance multiple servers (using any of a number of available load-balancing technologies) to allow you to perform maintenance at certain times, but that has its own issues.

    I’ve recently been looking at PowerDNS, which separates the recursor and the authoritative server into two distinct packages. I’m just running the authoritative server as a master, and keeping my old bind/named servers as recursors / slaves. It’s a home office network, but I only have issues when I’m tinkering, and if I were to be doing this kind of work in a larger commercial environment, then I would not be doing DNS
    server maintenance while others were relying on them.

    For much of the back end infrastructure I use IP addresses rather than DNS names in their configuration, just to take DNS issues out of the equation completely.

  • That’s what I was thinking. Perhaps it is better to live with a main server and one or two slaves so the clients can keep their alternatives.

    But still … There’s got to be a better way …

    You mean like probing if the DNS server is still responsive and somehow switching over when it’s not? I never tried, though it is evident that more complicated things may tend to be less reliable.

    Yet it reminds me that I could actually check the name servers and dispatch a message when one fails, as I’m already doing for a couple of other things. That would suffice and doesn’t introduce more possibilities of failure into name resolution.

  • I’m about to do an overhaul of the DNS service at work and my plan is to use powerdns recursor + dnsdist + keepalived.
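
    In case it is useful to anyone planning something similar, a minimal dnsdist.conf sketch (Lua syntax; the addresses are made up) fronting two recursors could be:

    -- address the clients (or a keepalived VIP) will query
    setLocal('192.0.2.53:53')
    -- backend recursors; dnsdist health-checks them and skips dead ones
    newServer({address='10.0.0.11:53'})
    newServer({address='10.0.0.12:53'})
    -- always prefer the first healthy backend
    setServerPolicy(firstAvailable)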

  • Configure all dns servers as primary slaves (plus 1 primary master) for your own domains.  I have never seen problems with resolution of local dns domains when the Internet was down.

    Depending on the size of your network, you can run a caching server on each host (configured as a primary slave for your own domains) and  then configure that local server to use forwarders.  When you use multiple forwarders the local server does not have to wait for timeouts before querying another server.  Then you just run 2 or more servers to use for forwarding.  Use forward-only to force all local servers to use only forwarding (for security and caching reasons).  Much simpler than using keepalived.  In recent years I *have not had any* problems with bind9 or powerdns crashing.
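
    In named.conf terms, the forwarding part of such a local caching server would look something like this (the forwarder addresses are just examples):

    options {
        // never recurse to the Internet directly; only ask the forwarders
        forward only;
        forwarders { 10.0.0.11; 10.0.0.12; };
    };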

    As far as using the ISC server vs powerdns, you may want to check on people’s recent experiences.  There was a time when many thought powerdns had much better performance and fewer security issues.  For various reasons I’ve seen some people, including myself, switch back to ISC bind9.  I switched about 1.5 years ago because I was getting better performance from bind9.  You may want to check out other people’s experiences before switching to powerdns.

    Nataraj

  • I meant to say:

    Configure all dns servers as secondary/slaves (one should be the primary master) for your own domains. This means that all of your servers are authoritative for your own domains, so they cannot fail on local dns lookups due to Internet problems.
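
    As a rough named.conf sketch (zone name, file paths and addresses made up), the primary master would carry something like

    zone "example.lan" {
        type master;
        file "/var/named/example.lan.zone";
        allow-transfer { 10.0.0.12; };   // the slave(s)
        also-notify { 10.0.0.12; };
    };

    and each slave something like

    zone "example.lan" {
        type slave;
        file "slaves/example.lan.zone";
        masters { 10.0.0.11; };          // the primary master
    };

    so every server answers authoritatively for the local zones regardless of what the Internet is doing.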

  • I’m ok with them, only the problem is that the clients sit through their timeouts when a server is unreachable, and users panic.

    Yes, bind9, and I’ve set up a master and a slave. The router forwards requests to them on behalf of those clients that use the router as a name server, while other clients know the master and the slave, but not the router, as name servers.

    There was a failure a while ago (IIRC because of a UPS causing a server to shut down when the battery failed the self test), and things didn’t quite work anymore with the master server being unreachable.

    This is why I have a problem with the clients knowing multiple servers: the very setup is intended to keep things working during an outage, and yet it doesn’t help.

    Thanks, I’ll look into that! I’ve been searching for “dns proxy” and no useful results came up …

    I consider myself warned :)

  • I’ve more or less done the overhaul, only some sort of failover thing is missing … I’ll check those out, thanks!

  • I can’t help it when the primary name server goes down because the UPS fails the self test and tells the server it has 2 minutes or so left, in which case the server figures it needs to shut down. I wanted better UPSs …

    Load balancing or clustering? At least clustering seems not entirely trivial to do.

    This can be done with bind; why would it require something called PowerDNS?

    The maintenance didn’t cause any problems. You can edit the configuration just fine and restart the server when done … :)

    I think this is a very bad idea because it causes lots of work and is likely to cause issues. What if you, for example, migrate remote logging to another server?
    You have to document every place where you put an IP address, keep the documentation up to date at all times, and then change the address in every place when you make a change. Forget one place, and things break.

    But when you use names instead of addresses, like ‘log.example.com’, you only need to make a single change in a single place, namely altering the address in your name server config.

    DNS can be difficult to get right, though it’s not all that difficult, and once it’s working, there aren’t really any issues other than that a server can become unreachable.

  • On Linux systems, you can set the timeout in /etc/resolv.conf, e.g.,

    # I think the default nameserver timeout is 5; use rotate
    # option if you prefer round-robin queries rather than
    # always using the first-listed first
    nameserver 10.11.12.13 timeout:2 rotate
    nameserver 10.11.12.14 timeout:2 rotate

    I’ll admit that I’m not sure if those options are configurable on Mac and/or Windows workstations.

  • Windows will ‘rotate’ the list of NS servers if the top one times out, so next time it will use the first alternate…. and if that times out, it will start using the next alternate, etc.

  • IMO such entries are done via “options” …

    yum install man-pages ; man resolv.conf
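
    For reference, the options form described there would look like this (same made-up addresses as above):

    options timeout:2 rotate
    nameserver 10.11.12.13
    nameserver 10.11.12.14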

  • hw wrote:


    Change that. Are you using apcupsd? You can set the config from SHUTDOWN=/sbin/shutdown to /bin/false. Then, the next time you see the UPS, change the battery. If it’s just started to complain, it’s not dead yet!

    Works for me with all of our mostly APC SmartUPS 3000 rackmounts.

    mark

  • Critical infrastructure servers should have redundant PSUs, on separate UPSs.

    In my last rack builds, I had 2 Eaton PowerWare 7kVA 4U UPSs in the bottom of each rack. One fed the left side PDUs, the other fed the right side PDUs, and each server had redundant PSUs, one plugged into each PDU.

    Those Eaton UPSs had hot-swappable batteries, too.

  • It seemed to have to do with the TTL for the local names being too short and DNS being designed to generally query root servers rather than sticking to their local information.

    Hm. I thought about something like that, but without the separation into local slaves using forwarders and the forwarders. I will probably do that; it seems like the most reasonable solution, and I should have at least one forwarder anyway so as not to leak information to the internet-only VLANs. It would be an improvement in several ways and give better reliability.

    It doesn’t really help those clients I cannot run name servers on, though.

    > In recent years I *have not had any* problems with bind9 or powerdns crashing.

    Bind has been around for ages, and I’ve never had any issues with it for the last 25 years or so. Just set it up and let it do its thing, and it does.

    If there were performance problems, I imagine they would be more likely due to insufficient internet bandwidth. Perhaps it would take all the DNS queries that come up during a week or even a month to arrive within a second before any performance considerations become relevant …

  • John Pierce wrote:

    *shrug* All UPSes have hot-swappable batteries. Mine beeps while you disconnect the batteries, pull out the sled, replace all 8, shove it back in, and reconnect, and then it shuts up.

    For those that haven’t done it, though: DO NOT BELIEVE WHAT ANYONE SAYS, DO NOT USE *ANYTHING* BUT HR (high rate) batteries in a UPS (maybe in a home one, but…). With anything else, an APC, for example, simply stays red and insists that you still need to change them. *Good* battery vendors know this.

  • It was those showing problems.

    A timeout of only 5 seconds isn’t long enough that I would expect any problems. What do I need to put into the ifcfg files, or tell nmcli, to set these options?

  • I don’t remember which UPS it was, either the crappy one for which a replacement battery was already waiting to be put in, or the normal one that already had a new battery in it which is either broken or doesn’t get charged …

    That’s why I’d rather not have everything go dark when Murphy comes along. I have generally deprecated all non-rackmount UPSs, and being able to change batteries without an outage has become a requirement.

  • If you’re using dhclient to manage addresses, then you can add the RES_OPTIONS variable to /etc/sysconfig/network:

    # /etc/sysconfig/network
    RES_OPTIONS="timeout:2 rotate"

    Or, with even less patience:

    RES_OPTIONS="timeout:1 attempts:1 rotate"

    Grep for RES_OPTIONS in /sbin/dhclient-script for the gory details.

  • Ah!?

    When I had it happen a couple years ago and wondered why even local names couldn’t be resolved (which didn’t make sense to me because the server would always know about them from the zone files), I was told that nothing could be done about it because DNS is designed to do lookups no matter what.

    However, that was a server acting as both a local master and as a forwarder. If what you say is true, I would now understand this much better — and I’d need to change my setup.

  • Separate DNS servers must be on a different subnet according to RFC2182
    (https://tools.ietf.org/html/rfc2182):

    Secondary servers must be placed at both topologically and
    geographically dispersed locations on the Internet, to minimise the
    likelihood of a single failure disabling all of them.

    I know that UPSs are physical and subnets are logical, but the reasoning behind the requirement is that the servers should not share the same infrastructure.

  • Shock horror, replying to my own post, but in cloud cluster environments, you might consider anti-affinity rules to prevent multiple name servers going down at the same time due to a cluster node failure
    (i.e. rules to ensure that hypervisors keep different name servers on different hosts).

    I know it doesn’t help OP, who was looking for cluster based solutions, but the same applies if using load balancing virtual appliances, hosting IPs as name servers.

    It has nothing to do with the TTL. The TTL does not cause expiration in an authoritative server; TTLs only affect caching servers. The primary master gets changed when you edit the local zone database. The secondary slave gets updated when the serial number in the SOA record on the primary master gets bumped. You must either do that manually or use a zone database management tool that does it for you.

    If a dns server is configured as a primary master or a secondary slave for a domain, then it is authoritative for that domain and does not require queries to any other server on your network or on the Internet.
    The difference between a primary master and a secondary slave is that the primary master is where you edit the zone records, and the secondary slave replicates the zone database from the primary master. Even if the primary master goes down, the secondary slave still has a copy of the zone files in its disk files (or other database format that you configure) and will serve them flawlessly.

    One way to see if a server is properly configured as authoritative for a domain is:

    nataraj@pygeum:~$ dig mydomain.com. soa @127.0.0.1

    ; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> mydomain.com. soa @127.0.0.1
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52104
    ;; flags: qr *aa* rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ; COOKIE: 64f402c0c22d57aa2bbb10fc5d3a340d8c19377b924d01c2 (good)

    ;; QUESTION SECTION:
    ;mydomain.com.            IN    SOA

    ;; ANSWER SECTION:
    Mydomain.Com.        14400    IN    SOA    ns1.mydomain.com. postmaster.Mydomain.COM. 2019072505 1200 600 15552000 14400

    ;; AUTHORITY SECTION:
    Mydomain.Com.        14400    IN    NS    ns1.Mydomain.Com.
    Mydomain.Com.        14400    IN    NS    ns2.Mydomain.Com.
    Mydomain.Com.        14400    IN    NS    ns3.Mydomain.com.

    ;; ADDITIONAL SECTION:
    ns1.mydomain.com.        14400    IN    A    8.8.8.8

    ;; Query time: 1 msec
    ;; SERVER: 127.0.0.1#53(127.0.0.1)
    ;; WHEN: Thu Jul 25 15:58:21 PDT 2019
    ;; MSG SIZE  rcvd: 243

    The AA flag in the flags section tells you that you have queried a DNS server that is authoritative for the domain you queried. If it doesn't have the AA flag, then you have not properly set up the primary master or secondary slave for that domain.

    If your masters and slaves are all configured correctly for a domain, then they will all have the same serial number in the SOA record (and the same results for any query in that domain). If they don't, then something is wrong and your zone transfers are not occurring properly.

    The local server can have forward-only either on or off. If off, it will go out directly to the Internet if it does not receive a response from a forwarder. Using forward only and putting your forwarders on a separate network away from your inside network means that if there is a security hole in the nameserver, your inside hosts are less likely to be compromised. You could also configure your ISP's, Google's or other public recursive name servers as forwarders if you don't want to run your own.

    Exactly, a simple bind9 configuration is adequate unless you run an application with huge numbers of DNS queries.

  • Another alternative is to look at the multicast dns (mdns) protocol. I have no experience with it, so I can’t say very much, but I know it exists. I’m pretty sure it’s implemented in the avahi daemon, so if your client supports it, it may just be a matter of enabling it on the client.
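
    Assuming a typical Linux client with the nss-mdns package installed, enabling it is roughly a matter of having mdns in the hosts line of /etc/nsswitch.conf, e.g.:

    hosts: files mdns4_minimal [NOTFOUND=return] dns

    and it can be tested with something like avahi-resolve -n somehost.local (from avahi-utils).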

    Nataraj

  • This brings up one of the caveats for (at least ISC) DNS: if the master goes down, the slaves will take over for a time but will eventually stop serving the domains of the master if it remains down too long. If my (sometimes faulty) memory serves me well, it is in the three-day range (but configurable), which is ample time unless the problem occurs early in a holiday weekend and the notification/escalation process isn’t what it should be (Murphy’s Law)…

  • If you administer the secondary slave servers, there is no reason not to use a very large number, 30 days or more, for the SOA expiration. The only reason to use a lower number would be if you don’t have control over the slave servers and don’t want old zone files that you can’t update hanging around out there.
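
    As a zone-file sketch (the names and the other timers are made up), the expire value is the fifth number in the SOA record:

    @   IN  SOA  ns1.example.lan. hostmaster.example.lan. (
            2019072501  ; serial
            14400       ; refresh
            3600        ; retry
            2592000     ; expire - 30 days; how long slaves keep serving the zone without reaching the master
            14400 )     ; negative-caching TTL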

    Another alternative, which many people used for years in the early days when zone transfers were unreliable, is to use a script which replicates the entire DNS configuration to the secondaries and then to run all the servers as primary masters. If the script is written cleanly, you can then edit the zone on any server and rsync it to the other servers.
    The main thing is to prevent multiple people from applying updates simultaneously.
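
    As a sketch of that approach (paths and host name are hypothetical), the replication step can be as simple as:

    rsync -av /var/named/zones/ ns2.example.lan:/var/named/zones/
    ssh ns2.example.lan rndc reload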

    Nataraj

  • That’s for allowing a device to self-advertise its own name, along with other things, like available services. If you have such devices, then configuring the other machines on the network to pay attention to such advertisements allows them to see the new names and services when they appear.

    …And much more importantly, when they *disappear*, since many ZeroConf/Bonjour/Avahi/mDNS speaking devices are mobile and aren’t always available.

    This protocol is one common way for network printers to advertise their services, for example. (The other common way is SMB/CIFS.)

    Yes, that’s an implementation of mDNS for POSIX type systems.

    I’m not sure how this is relevant here. For mDNS to be the solution to the OP’s problems, he’d have to also have mDNS multicasts going out advertising services, so the Avahi daemon would have something to offer when a compatible program comes along looking for services to connect to.

    I suppose you could use mDNS in datacenter type environments, but it’s a long way away from the protocol’s original intent.

    You could imagine a load balancer that paid attention to mDNS advertisements to decide who’s available at the moment. But I don’t know of any such implementation.

  • PowerDNS supports MySQL backends for the zone files, so one way that they can work is in Native mode, as an alternative to Master / Slave, in which the replication and information resilience is handled by the backend (e.g. a MySQL cluster), and the servers just read the zone from the database, with no need to perform zone transfers at all. The expire timer in the SOA record then becomes pretty defunct, although if you export your zones to non-PowerDNS servers, e.g. bind, then they take effect.
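
    As a rough sketch of that setup (the credentials and database name here are hypothetical), the authoritative server is pointed at MySQL in pdns.conf, and native replication is chosen per zone in the backend’s domains table:

    # /etc/pdns/pdns.conf
    launch=gmysql
    gmysql-host=127.0.0.1
    gmysql-dbname=pdns
    gmysql-user=pdns
    gmysql-password=secret

    -- mark a zone as NATIVE so replication is left entirely to the database
    UPDATE domains SET type = 'NATIVE' WHERE name = 'example.lan';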