[nsd-users] failed writing to tcp: Broken pipe

Michael Tokarev mjt at tls.msk.ru
Thu Oct 6 12:24:00 UTC 2011


On 06.10.2011 15:49, Michael Tokarev wrote:
> Hello.
> 
> I'm trying to diagnose a problem in our network which - apparently -
> started after I enabled DNSSEC on our local zones (island of security),
> but may be unrelated as well.
> 
> The problem is that there's frequent messages in the logs on the mahcine
> running nsd:
> 
> Oct  6 14:32:32 mother nsd[1363]: failed writing to tcp: Broken pipe

Here's one strace of the issue:

1363  pselect6(9, [3 4 8], [], [], NULL, {NULL, 8}) = 1 (in [4])
1363  accept(4, {sa_family=AF_INET, sin_port=htons(58332), sin_addr=inet_addr("192.168.177.5")}, [16]) = 7
1363  fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
1363  brk(0xf8fd9000)                   = 0xf8fd9000
1363  gettimeofday({1317902272, 86998}, NULL) = 0
1363  gettimeofday({1317902272, 87060}, NULL) = 0
1363  pselect6(9, [3 4 7 8], [], [], {119, 999938000}, {NULL, 8}) = 1 (in [7], left {119, 999934221})
1363  read(7, "\0\34", 2)               = 2
1363  read(7, "\32\333\0\0\0\1\0\0\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1", 28) = 28

As far as I see, this is a plain AXFR IN request for tls.msk.ru domain,
made by 192.168.177.5, which runs both unbound and nsd.  I can guess it
was nsd asking for AXFR.

1363  gettimeofday({1317902272, 87498}, NULL) = 0
1363  gettimeofday({1317902272, 87557}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999941000}, {NULL, 8}) = 1 (out [7], left {119, 999938160})
1363  write(7, "?\325", 2)              = 2
1363  write(7, "\32\333\200\0\0\1\0\242\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1\300\f\0\6"..., 16341) = 11584
1363  gettimeofday({1317902272, 88002}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999496000}, {NULL, 8}) = 1 (out [7], left {119, 999493205})
1363  write(7, "\237\v\270U\235$\350\302\376\16\274\201\5\204\336\237\230;\16\330#\237\367\327\3330\203\310DM\33H"..., 4757) = 4757
1363  gettimeofday({1317902272, 88362}, NULL) = 0
1363  gettimeofday({1317902272, 88424}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999938000}, {NULL, 8}) = 1 (out [7], left {119, 999930669})
1363  write(7, "?\372", 2)              = 2
1363  write(7, "\32\333\200\0\0\0\0\241\0\0\0\0\5f0009\3tls\3msk\2ru\0\0."..., 16378) = 16378
1363  gettimeofday({1317902272, 88932}, NULL) = 0
1363  gettimeofday({1317902272, 89000}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999932000}, {NULL, 8}) = 1 (out [7], left {119, 999145123})
1363  write(7, "?r", 2)                 = -1 EPIPE (Broken pipe)
1363  --- SIGPIPE (Broken pipe) @ 0 (0) ---
1363  time(NULL)                        = 1317902272
1363  send(5, "<27>Oct  6 15:57:52 nsd[1363]: f"..., 65, MSG_NOSIGNAL) = 65

But after sending 2 + 11584 + 4757 + 2 + 16378 bytes, the remote
gave up and closed the connection.  Again, I can only guess it
was nsd who decided that it does not want the rest of data, most
likely due to the fact that it have that SOA already.

The SOA record for this zone is:

  tls.msk.ru. SOA mother.tls.msk.ru. mjt.tls.msk.ru. 2011100501 21600 600 1209600 86400

which gives 6h refresh and 10m retry interval, and the latter matches
with the logged pattern:

> Oct  6 14:41:58 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 14:41:58 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 14:51:02 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 15:00:07 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 15:00:07 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 15:09:37 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 15:09:37 mother last message repeated 2 times
> Oct  6 15:19:35 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct  6 15:19:35 mother nsd[1363]: failed writing to tcp: Broken pipe

14:41, 14:51, 15:00, 15:09, 15:19, ... - about every 10 minutes.

> There are just handful other machines accessing this service (it is a
> hidden primary), either other instances of nsd or unbound servers.
> As far as I can tell, the "bad" connections logged by nsd comes from
> unbound servers - running tcpdump right now.

There are 3 other machines accessing this server "by" nsd (and one of
them is 192.168.177.5).

> 
> But the thing is: the above error message is mostly useless, it is
> just a "random noize" which - IMHO anyway - should either tell more
> (at least, where the connection comes from) or should be dropped
> completely.
> 
> In the code, there's a check for ECONNRESET - should it be expanded
> to EPIPE as well?

Note that in the strace there are other cases like this, but with ECONNRESET
error instead of EPIPE:

1363  accept(4, {sa_family=AF_INET, sin_port=htons(43260), sin_addr=inet_addr("192.168.19.2")}, [16]) = 7
1363  fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
1363  brk(0xf8fd9000)                   = 0xf8fd9000
1363  gettimeofday({1317902225, 62655}, NULL) = 0
1363  gettimeofday({1317902225, 62736}, NULL) = 0
1363  pselect6(9, [3 4 7 8], [], [], {119, 999919000}, {NULL, 8}) = 1 (in [7], left {119, 999910531})
1363  read(7, "\0\34", 2)               = 2
1363  read(7, 0xf8f98118, 28)           = -1 EAGAIN (Resource temporarily unavailable)
1363  gettimeofday({1317902225, 63112}, NULL) = 0
1363  pselect6(9, [3 4 7 8], [], [], {119, 999543000}, {NULL, 8}) = 1 (in [7], left {119, 982674025})
1363  read(7, "%\272\0\0\0\1\0\0\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1", 28) = 28
1363  gettimeofday({1317902225, 80761}, NULL) = 0
1363  gettimeofday({1317902225, 80823}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999938000}, {NULL, 8}) = 1 (out [7], left {119, 999934106})
1363  write(7, "?\325", 2)              = 2
1363  write(7, "%\272\200\0\0\1\0\242\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1\300\f\0\6"..., 16341) = 10792
1363  gettimeofday({1317902225, 81210}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999551000}, {NULL, 8}) = 1 (out [7], left {119, 983838077})
1363  write(7, "\310\211\17\251\355\17\366\257\275Tm\27\350\312\0/\0\1\0\1Q\200\0\35\3NET\4deo"..., 5549) = 5549
1363  gettimeofday({1317902225, 97713}, NULL) = 0
1363  gettimeofday({1317902225, 97776}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999937000}, {NULL, 8}) = 1 (out [7], left {119, 999933714})
1363  write(7, "?\372", 2)              = 2
1363  write(7, "%\272\200\0\0\0\0\241\0\0\0\0\5f0009\3tls\3msk\2ru\0\0."..., 16378) = 11986
1363  gettimeofday({1317902225, 98069}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999644000}, {NULL, 8}) = 1 (out [7], left {119, 962512462})
1363  write(7, "\241\320[*\272\201\20\247\331\351\277\334\221\34=\200\313\226\330\203\215\236R7\333\312\f\340{\376R\360"..., 4392) = 4392
1363  gettimeofday({1317902225, 136042}, NULL) = 0
1363  gettimeofday({1317902225, 136101}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999941000}, {NULL, 8}) = 1 (out [7], left {119, 999936949})
1363  write(7, "?r", 2)                 = 2
1363  write(7, "%\272\200\0\0\0\0\242\0\0\0\0\5f0055\3tls\3msk\2ru\0\0."..., 16242) = 16242
1363  gettimeofday({1317902225, 136481}, NULL) = 0
1363  gettimeofday({1317902225, 136537}, NULL) = 0
1363  pselect6(9, [3 4 8], [7], [], {119, 999944000}, {NULL, 8}) = 1 (out [7], left {119, 972729681})
1363  write(7, "?\366", 2)              = -1 ECONNRESET (Connection reset by peer)

So the two at least _may_ be treated the same.

But should the "client" NSD be doing like that?  Maybe it is better
to query for SOA first, instead of asking AXFR right away, and just
closing the connection when already up to date?

> And, where - in server.c:handle_tcp_writing() - one can get the peer
> address for handler->fd, short of calling getpeername()?  I mean, is
> the peer address available somewhere in the connection structures in
> a ready to use form?

Thanks,

/mjt



More information about the nsd-users mailing list