[nsd-users] Frequent RRL false negatives when using multiple server processes on Linux

Ville Mattila vmattila at csc.fi
Wed Nov 6 13:26:15 UTC 2013


Hi,

Please advise how to use Response Rate Limiting on a server which has
multiple NSD server processes (nsd.conf server section has server-count
> 1).

We have a problem with NSD v3.2.16: it repeatedly unblocks and re-blocks
a single source that is flooding positive queries at a roughly steady
700 qps.  The rrl-ratelimit setting is the default, 200 qps.  The
unblock/block cycle happens several times a minute and causes false
negatives, because NSD bursts out 200 responses on every unblock:

Nov  6 10:11:18 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:19 dnstest1 nsd[6876]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:20 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:21 dnstest1 nsd[6875]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:23 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:25 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:27 dnstest1 nsd[6879]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:28 dnstest1 nsd[6877]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:29 dnstest1 nsd[6879]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:29 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:30 dnstest1 nsd[6880]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:42 dnstest1 nsd[6878]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:11:42 dnstest1 nsd[6881]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:30 dnstest1 nsd[6877]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:31 dnstest1 nsd[6880]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:12:31 dnstest1 nsd[6882]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:13:30 dnstest1 nsd[6881]: ratelimit unblock demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:13:31 dnstest1 nsd[6876]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS
Nov  6 10:14:31 dnstest1 nsd[6878]: ratelimit block demo.funet.fi. type positive target 193.166.5.0/24 query 193.166.5.1 NS

Noting how the PIDs change between log lines, my guess is that the
operating system's process scheduler (RHEL 6; Linux kernel v2.6.32)
switches every now and then to a different NSD server process for
handling the incoming data on the socket / NIC receive queue.  The newly
chosen process has an empty rrl hash bucket for the flooding source/type,
so it starts blocking only after sending 200 replies.  (NB: How often the
switch to a different process happens may depend on the NIC, Linux kernel
version, cpu scheduler, irq & cpu affinity settings etc. in use, and of
course cannot be controlled by NSD.  In this example the query flood
source is our lab nameserver 193.166.5.1 itself, but I'm afraid we can
expect our production Linux server to behave ~similarly with external
flood sources.)
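
To make the guess above concrete, here is a toy model in C.  It is not
NSD's algorithm: each worker simply answers the first `limit` queries it
sees from the source and then drops the rest, and its private bucket is
emptied once the worker has not seen the source for more than `idle_reset`
seconds.  The kernel's choice of process is replaced by a deterministic
rotation every `switch_every` seconds.  All names and parameters are
hypothetical and chosen only for illustration.

```c
#include <stddef.h>

#define MAX_WORKERS 64

/* Hypothetical per-process state: one private bucket per worker. */
struct worker_bucket {
    long seen;      /* queries counted since the bucket was last emptied */
    long last_seen; /* second in which this worker last saw the source, -1 = never */
};

/*
 * Toy simulation of a steady flood hitting per-process rate limiters.
 * The worker handling the flood changes every `switch_every` seconds
 * (a stand-in for whatever the scheduler really does).
 * Returns the total number of answered queries, or -1 on bad arguments.
 */
long simulate_flood(int workers, int switch_every, int seconds,
                    int qps, int limit, int idle_reset)
{
    struct worker_bucket b[MAX_WORKERS];
    long answered = 0;

    if (workers < 1 || workers > MAX_WORKERS || switch_every < 1)
        return -1;

    for (int i = 0; i < workers; i++) {
        b[i].seen = 0;
        b[i].last_seen = -1;
    }

    for (int t = 0; t < seconds; t++) {
        int w = (t / switch_every) % workers;

        /* A worker that was idle for too long has forgotten the source. */
        if (b[w].last_seen >= 0 && t - b[w].last_seen > idle_reset)
            b[w].seen = 0;

        for (int q = 0; q < qps; q++) {
            b[w].seen++;
            if (b[w].seen <= limit)
                answered++; /* below the limit: reply is sent */
        }
        b[w].last_seen = t;
    }
    return answered;
}
```

With one worker, `simulate_flood(1, 1, 60, 700, 200, 1)` answers 200
queries in a minute.  With eight workers (the log above shows eight
distinct PIDs) rotating every two seconds, `simulate_flood(8, 2, 60, 700,
200, 1)` answers 6000, because every rotation lands on a worker whose
bucket is empty again.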

If my guess is correct, I think the options would be:
1. Do nothing and use RRL even though it's per-process.  Even if the
flood gets unblocked multiple times a minute, RRL may still make the
attack ineffective enough.
2. Use the multiple receive queues / irq affinity of the server's
network interface card so that queries from a specific source IP
always end up being processed by the same CPU, and configure process
scheduling to tie a single NSD server process to each of those CPUs.
(Too complex for us!  And of course this has its drawbacks, too, wrt
load distribution at least.  And unfortunately our Intel igb NICs can
choose the receive queue only based on IPv4 srcip,dstip tuples, while
all IPv6 packets always end up in the same queue.)
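
The property that option 2 relies on can be sketched in a few lines of C:
a deterministic mapping from source address to worker, so that one source
always meets the same rate-limit bucket.  This is only an illustration of
the idea.  Real NICs use their own receive-queue hash, and the FNV-1a hash
and every function name below are assumptions made for the example, not
anything NSD or the igb driver provides.

```c
#include <stdint.h>

/* Reduce an IPv4 address (host byte order) to its /24 prefix, the same
 * granularity as the "target 193.166.5.0/24" field in the log above. */
uint32_t prefix24(uint32_t ipv4)
{
    return ipv4 & 0xFFFFFF00u;
}

/* 32-bit FNV-1a over the four bytes of the value (stand-in hash). */
uint32_t fnv1a32(uint32_t value)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++) {
        h ^= (value >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical steering rule: every address in the same /24 is mapped
 * to the same worker index in the range 0..workers-1. */
unsigned worker_for_source(uint32_t ipv4, unsigned workers)
{
    if (workers == 0)
        return 0;
    return fnv1a32(prefix24(ipv4)) % workers;
}
```

For example, 193.166.5.1 (0xC1A60501) and 193.166.5.200 (0xC1A605C8) map
to the same worker for any worker count.  The load-distribution drawback
mentioned above is visible here too: a single heavy source cannot be
spread over several workers.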

FWIW, the unblocking seems to be triggered every time by this, around
line 425 of rrl.c from nsd-3.2.16:
-----
        } else if(now - b->stamp > 0) {
                /* older bucket */
                int olderblock = used_to_block(b->rate, b->counter, lm);
                rrl_attenuate_bucket(b, now - b->stamp);
                if(olderblock && b->rate < lm)
                        rrl_msg(query, "unblock");
                b->counter = 1;
                b->stamp = now;
        }
-----
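
To see how this branch can log "unblock" in a process that has simply
been idle, here is a simplified, self-contained model of it.  The control
flow mirrors the excerpt, but the excerpt does not show used_to_block()
or rrl_attenuate_bucket(), so the two stand-in helpers below are
assumptions (rate halved once per elapsed second), not NSD's actual code.

```c
#include <stdint.h>

/* Minimal stand-in for the bucket fields used in the excerpt. */
struct model_bucket {
    uint32_t rate;    /* smoothed queries per second */
    uint32_t counter; /* queries counted in the second `stamp` */
    int32_t  stamp;   /* second of the last update */
};

/* ASSUMPTION: the bucket was blocking if either the smoothed rate or the
 * projected rate had reached the limit. The real helper may differ. */
static int model_used_to_block(uint32_t rate, uint32_t counter, uint32_t lm)
{
    return rate >= lm || counter + rate / 2 >= lm;
}

/* ASSUMPTION: the rate is halved for every elapsed second. */
static void model_attenuate(struct model_bucket *b, int32_t elapsed)
{
    if (elapsed >= 32)
        b->rate = 0; /* avoid an out-of-range shift */
    else
        b->rate >>= elapsed;
}

/* Same shape as the quoted branch. Returns 1 where the excerpt would
 * log "unblock", 0 otherwise. Unlike the real function, this model
 * treats every positive elapsed time the same way. */
int model_older_bucket(struct model_bucket *b, int32_t now, uint32_t lm)
{
    int unblocked = 0;

    if (now - b->stamp > 0) {
        int olderblock = model_used_to_block(b->rate, b->counter, lm);
        model_attenuate(b, now - b->stamp);
        if (olderblock && b->rate < lm)
            unblocked = 1;
        b->counter = 1;
        b->stamp = now;
    }
    return unblocked;
}
```

Under these assumptions, a bucket holding rate 700 that is next touched
two seconds later decays to 700 >> 2 = 175, which is below the limit of
200, so the model reports an unblock even though the flood never stopped.
It was merely being handled by other processes in the meantime.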

Thanks,
-- 
Ville Mattila, CSC