resolve: do not derive query timeout from RTT

System Internals / systemd - Matthias-Christian Ott [mirix.org] - 12 June 2018 21:21 EDT

DNS queries need timeout values to detect whether a DNS server is unresponsive or, if the query is sent over UDP, whether a DNS message was lost and has to be resent. The total time that it takes to answer a query to arrive is t + RTT, where t is the maximum time that the DNS server that is being queried needs to answer the query.

An authoritative server stores a copy of the zone that it serves in main memory or secondary storage, so t is very small and therefore the time that it takes to answer a query is almost entirely determined by the RTT. Modern authoritative server software keeps its zones in main memory and, for example, Knot DNS and NSD are able to answer in less than 100 µs [1]. So iterative resolvers continuously measure the RTT to optimize their query timeouts and to resend queries more quickly if they are lost.

systemd-resolved is a stub resolver: it forwards DNS queries to an upstream resolver and waits for an answer. So the time that it takes for systemd-resolved to answer a query is determined by the RTT and the time that it takes the upstream resolver to answer the query.

It seems common for iterative resolver software to set a total timeout for the query. Such total timeout subsumes the timeout of all queries that the iterative has to make to answer a query. For example, BIND seems to use a default timeout of 10 s.

At the moment systemd-resolved derives its query timeout entirely from the RTT and does not consider the query timeout of the upstream resolver. Therefore it often mistakenly degrades the feature set of its upstream resolvers if it takes them longer than usual to answer a query. It has been reported to be a considerable problem in practice, in particular if DNSSEC=yes. So the query timeout systemd-resolved should be derived from the timeout of the upstream resolved and the RTT to the upstream resolver.

At the moment systemd-resolved measures the RTT as the time that it takes the upstream resolver to answer a query. This clearly leads to incorrect measurements. In order to correctly measure the RTT systemd-resolved would have to measure RTT separately and continuously, for example with a query with an empty question section or a query for the SOA RR of the root zone so that the upstream resolver would be able to answer to query without querying another server. However, this requires significant changes to systemd-resolved. So it seems best to postpone them until other issues have been addressed and to set the resend timeout to a fixed value for now.

As mentioned, BIND seems to use a timeout of 10 s, so perhaps 12 s is a reasonable value that also accounts for common RTT values. If we assume that the we are going to retry, it could be less. So it should be enough to set the resend timeout to DNS_TIMEOUT_MAX_USEC as DNS_SERVER_FEATURE_RETRY_ATTEMPTS * DNS_TIMEOUT_MAX_USEC = 15 s. However, this will not solve the incorrect feature set degradation and should be seen as a temporary change until systemd-resolved does probe the feature set of an upstream resolver independently from the actual queries.

[1] https://www.knot-dns.cz/benchmark/

dbc4661a2 resolve: do not derive query timeout from RTT
src/resolve/resolved-dns-server.c | 22 +---------------------
src/resolve/resolved-dns-server.h | 5 +----
src/resolve/resolved-dns-transaction.c | 8 +++++---
3 files changed, 7 insertions(+), 28 deletions(-)

Upstream: github.com


  • Share