Discussion:
NFS Mount Hangs
Rick Macklem
2021-03-18 20:55:05 UTC
Michael Tuexen wrote:
>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard <***@netapp.com> wrote:
>>
>>>> Output from the NFS Client when the issue occurs # netstat -an | grep
>>>> NFS.Server.IP.X
>>>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
>>> I'm no TCP guy. Hopefully others might know why the client would be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could be wrong?)
>>
>> When the client is in Fin-Wait2 this is the state you end up when the Client side actively close() the tcp session, and then the server also ACKed the FIN.
Jason noted:

>When the issue occurs, this is what I see on the NFS Server.
>tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.51550 CLOSE_WAIT
>
>which corresponds to the state on the client side. The server received the FIN
>from the client and acked it.
>The server is waiting for a close call to happen.
>So the question is: Is the server also closing the connection?
Did you mean to say "client closing the connection here?"

The server should call soclose() { it never calls soshutdown() } when
soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
the socket is broken.
--> The soreceive() call is triggered by an upcall for the rcv side of the socket.
So, are you saying the FreeBSD NFS server did not call soclose() for this case?
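
To make the pattern concrete, here is a rough sketch of what I mean (illustrative
only, not the actual krpc code; the helper name nfsd_rcv_path() and the simplified
end-of-file test are placeholders, and in the real server the soclose() happens
indirectly via the xprt reference count rather than directly in the upcall):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/mbuf.h>
#include <sys/uio.h>

/* Hypothetical helper, run when the receive-side socket upcall fires. */
static void
nfsd_rcv_path(struct socket *so)
{
        struct uio auio;
        struct mbuf *m = NULL;
        int rcvflag = MSG_DONTWAIT;
        int error;

        bzero(&auio, sizeof(auio));
        auio.uio_resid = 1000000000;            /* "read whatever is queued" */
        error = soreceive(so, NULL, &auio, &m, NULL, &rcvflag);
        if (error == EWOULDBLOCK)
                return;                         /* nothing to read yet */
        if (error != 0 || m == NULL) {
                /* 0 bytes (EOF) or a broken socket: tear down the xprt,
                 * which is what ends in soclose(). */
                soclose(so);
                return;
        }
        /* ...otherwise parse the RPC record(s) in "m"... */
}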

rick

Best regards
Michael
> This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple can not be reused during this time.
>
> With other words, from the socket / TCP, a properly executed active close() will end up in this state. (If the other side initiated the close, a passive close, will not end in this state)
>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Scheffenegger, Richard
2021-03-19 16:14:01 UTC
Hi Rick,

I did some reshuffling of socket-upcalls recently in the TCP stack, to prevent some race conditions with our $work in-kernel NFS server implementation.

Just mentioning this, as this may slightly change the timing (mostly delaying the upcall until TCP processing is all done, whereas before an in-kernel consumer could register for a socket upcall, do some fancy stuff with the data sitting in the socket buffers, and then return to the TCP processing).
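
For reference, registering such an upcall from an in-kernel consumer looks roughly
like this (an illustrative sketch only; the example_* names are placeholders and I
have not checked this against what the krpc actually does):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

/* Placeholder receive upcall: just wake whichever thread will do the soreceive(). */
static int
example_rcv_upcall(struct socket *so, void *arg, int waitflag)
{
        wakeup(arg);
        return (SU_OK);
}

static void
example_register_upcall(struct socket *so, void *svc_arg)
{
        SOCKBUF_LOCK(&so->so_rcv);
        soupcall_set(so, SO_RCV, example_rcv_upcall, svc_arg);
        SOCKBUF_UNLOCK(&so->so_rcv);
}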

But I think there is no socket data handling being done in the upstream in-kernel NFS server (and I have not even checked whether it actually registers a socket-upcall handler).

https://reviews.freebsd.org/R10:4d0770f1725f84e8bcd059e6094b6bd29bed6cc3

If you can reproduce this easily, perhaps back out this change and see if that has an impact...

NFS server is to my knowledge the only upstream in-kernel TCP consumer which may be impacted by this.

Richard Scheffenegger


-----Original Message-----
From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> On Behalf Of Rick Macklem
Sent: Friday, March 19, 2021 16:58
To: ***@freebsd.org
Cc: Scheffenegger, Richard <***@netapp.com>; freebsd-***@freebsd.org; Alexander Motin <***@FreeBSD.org>
Subject: Re: NFS Mount Hangs


Michael Tuexen wrote:
>> On 18. Mar 2021, at 21:55, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Michael Tuexen wrote:
>>>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard <***@netapp.com> wrote:
>>>>
>>>>>> Output from the NFS Client when the issue occurs # netstat -an |
>>>>>> grep NFS.Server.IP.X
>>>>>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>> I'm no TCP guy. Hopefully others might know why the client would
>>>>> be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting
>>>>> for a fin/ack, but could be wrong?)
>>>>
>>>> When the client is in Fin-Wait2 this is the state you end up when the Client side actively close() the tcp session, and then the server also ACKed the FIN.
>> Jason noted:
>>
>>> When the issue occurs, this is what I see on the NFS Server.
>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.51550 CLOSE_WAIT
>>>
>>> which corresponds to the state on the client side. The server
>>> received the FIN from the client and acked it.
>>> The server is waiting for a close call to happen.
>>> So the question is: Is the server also closing the connection?
>> Did you mean to say "client closing the connection here?"
>Yes.
>>
>> The server should call soclose() { it never calls soshutdown() } when
>> soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
>> the socket is broken.
Btw, I looked and the soreceive() is done with MSG_DONTWAIT, but the EWOULDBLOCK is handled appropriately.

>> --> The soreceive() call is triggered by an upcall for the rcv side of the socket.
>> So, are you saying the FreeBSD NFS server did not call soclose() for this case?
>Yes. If the state at the server side is CLOSE_WAIT, no close call has happened yet.
>The FIN from the client was received, it was ACKED, but no close() call
>(or shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR)) was issued.
>Therefore, no FIN was sent and the client should be in the FINWAIT-2
>state. This was also reported. So the reported states are consistent.
For a test, I commented out the soclose() call in the server side krpc and, when I dismounted, it did leave the server socket in CLOSE_WAIT.
For the FreeBSD client, it did the dismount and the socket was in FIN_WAIT2 for a little while and then disappeared (someone mentioned a short timeout and that seems to be the case).
I might argue that the Linux client should not get hung when this occurs, but there does appear to be an issue on the FreeBSD end.

So it does appear you have a case where the soclose() call is not happening on the FreeBSD NFS server. I am a little surprised since I don't think I've heard of this before and the code is at least 10 years old (at least the parts related to this).

For the soclose() to not happen, the reference count on the socket structure cannot have gone to zero (i.e., a SVC_RELEASE() was missed). Upon code inspection, I was not able to spot a reference counting bug.
(Not too surprising, since a reference counting bug should have shown up long ago.)

The only thing I spotted that could conceivably explain this is that the function svc_vc_stat(), which returns the indication that the socket has been closed at the other end, did not bother to do any locking when it checked the status. (I am not yet sure if this could result in the status of XPRT_DIED being missed by the call, but if so, that would result in the soclose() call not happening.)

I have attached a small patch, which I think is safe, that adds locking to svc_vc_stat(), which I am hoping you can try at some point.
(I realize this is difficult for a production server, but...) I have tested it a little and will test it some more, to try and ensure it does not break anything.
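
To give an idea of the shape of the change (a sketch from memory, not the attached
patch itself; the cf_conn field names and the use of the xprt's sx lock are
approximations of sys/rpc/svc_vc.c):

/* Sketch in the context of sys/rpc/svc_vc.c; not the committed patch. */
static enum xprt_stat
svc_vc_stat(SVCXPRT *xprt)
{
        struct cf_conn *cd = (struct cf_conn *)xprt->xp_p1;
        enum xprt_stat stat = XPRT_IDLE;

        /* Hold the transport lock so a concurrent receive-side update
         * cannot cause an XPRT_DIED indication to be missed. */
        sx_xlock(&xprt->xp_lock);
        if (cd->strm_stat == XPRT_DIED)
                stat = XPRT_DIED;
        else if (cd->mreq != NULL)
                stat = XPRT_MOREREQS;
        else if (soreadable(xprt->xp_socket))
                stat = XPRT_MOREREQS;
        sx_xunlock(&xprt->xp_lock);
        return (stat);
}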

I have also cc'd mav@, since he's the guy who last worked on this code, in case he has any insight w.r.t. how the soclose() might get missed (or any other way the server socket gets stuck in CLOSE_WAIT).

rick
ps: I'll create a PR for this, so that it doesn't get forgotten.

Best regards
Michael

>
> rick
>
> Best regards
> Michael
>> This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple can not be reused during this time.
>>
>> With other words, from the socket / TCP, a properly executed active
>> close() will end up in this state. (If the other side initiated the
>> close, a passive close, will not end in this state)
>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"

_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Rick Macklem
2021-03-19 21:25:35 UTC
Scheffenegger, Richard <***@netapp.com> wrote:
>Sorry, I though this was a problem on stable/13.
>
>This is only in HEAD, stable/13 and 13.0 - never MFC'd to stable/12 or backported to >12.1
>
>> I did some reshuffling of socket-upcalls recently in the TCP stack, to prevent some race conditions with our $work in-kernel NFS server implementation.
The FreeBSD krpc/nfs definitely uses upcalls. On the server side the upcall just
"activates" a thread to service the socket (soreceive() etc).
The client side upcall does quite a bit more, including soreceive().
Timing shouldn't be a problem, so long as upcalls happen when there is data
to be received or the socket has been closed at the other end.
I test with pretty current sources and haven't seen issues.
If I do see problems, I'll be sure to let you know.;-)

rick


Are these changes in 12.1p5? This is the OS version used by the reporter of the bug.

Best regards
Michael
Rodney W. Grimes
2021-03-18 14:31:54 UTC
Permalink
> >>Output from the NFS Client when the issue occurs # netstat -an | grep
> >>NFS.Server.IP.X
> >>tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
> >I'm no TCP guy. Hopefully others might know why the client would be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could be wrong?)
>
> When the client is in Fin-Wait2 this is the state you end up when the Client side actively close() the tcp session, and then the server also ACKed the FIN.
> This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple can not be reused during this time.

I do not think this is the full story. If in fact the client did call close(), you are correct: you would end up in FIN_WAIT2 for up to whatever the timeout is set to (IIRC the default is 60 seconds), and in that case the socket is disconnected from the application and only the kernel has knowledge of it. However, the socket can stay in FIN_WAIT2 indefinitely if the application
calls shutdown(2) on it and never follows up with a close(2).

Which situation exists can be determined by looking at fstat to see if the socket has an associated PID or not;
I am not sure if that works on Linux, though.

>
> With other words, from the socket / TCP, a properly executed active close() will end up in this state. (If the other side initiated the close, a passive close, will not end in this state)

But only for a brief period. Stuck in this state indicates close(2) was not called, but shutdown(2) was.
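
A trivial way to see that (a toy example of my own, with a placeholder address and
port, nothing NFS specific): connect, shutdown(2) the write side, and just sit
there. netstat then shows this end in FIN_WAIT_2 and the peer in CLOSE_WAIT for as
long as the descriptor stays open.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct sockaddr_in sin;
        int s = socket(AF_INET, SOCK_STREAM, 0);

        if (s < 0) {
                perror("socket");
                return (1);
        }
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(2049);                     /* placeholder port */
        inet_pton(AF_INET, "192.0.2.1", &sin.sin_addr); /* placeholder address */
        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                perror("connect");
                return (1);
        }
        shutdown(s, SHUT_WR);   /* our FIN goes out; once ACKed we sit in FIN_WAIT_2 */
        sleep(600);             /* no close(2), so the state persists until we exit */
        close(s);
        return (0);
}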

I believe it is also possible to get sockets stuck in this state when a client side port number reuse collides with a server side socket that is still in a CLOSE_WAIT state, oh wait, that ends up in no response to the SYN, hummmm...

--
Rod Grimes ***@freebsd.org
Jason Breitman
2021-03-19 16:53:28 UTC
Thank you for your focus on the issue I am having and I look forward to seeing your patch ported to FreeBSD 12.X.
I also appreciate that you understand the difficulties in testing changes on a core piece of infrastructure.

I will let the group know if the issue occurs following the change that disabled TSO and LRO on my FreeBSD NFS Server NICs.
Those types of settings are less disruptive and pose less risk to apply in production, so I am happy to do so. That change was made on 3/17/2021.
My hope is that this issue is a race condition specific to my NIC handling certain TCP operations.

I am happy that you are approaching the issue from multiple angles.
Thanks.

Jason Breitman


On Mar 19, 2021, at 11:58 AM, Rick Macklem <***@uoguelph.ca> wrote:

Michael Tuexen wrote:
>> On 18. Mar 2021, at 21:55, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Michael Tuexen wrote:
>>>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard <***@netapp.com> wrote:
>>>>
>>>>>> Output from the NFS Client when the issue occurs # netstat -an | grep
>>>>>> NFS.Server.IP.X
>>>>>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>> I'm no TCP guy. Hopefully others might know why the client would be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could be wrong?)
>>>>
>>>> When the client is in Fin-Wait2 this is the state you end up when the Client side actively close() the tcp session, and then the server also ACKed the FIN.
>> Jason noted:
>>
>>> When the issue occurs, this is what I see on the NFS Server.
>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.51550 CLOSE_WAIT
>>>
>>> which corresponds to the state on the client side. The server received the FIN
>>> from the client and acked it.
>>> The server is waiting for a close call to happen.
>>> So the question is: Is the server also closing the connection?
>> Did you mean to say "client closing the connection here?"
>Yes.
>>
>> The server should call soclose() { it never calls soshutdown() } when
>> soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
>> the socket is broken.
Btw, I looked and the soreceive() is done with MSG_DONTWAIT, but the
EWOULDBLOCK is handled appropriately.

>> --> The soreceive() call is triggered by an upcall for the rcv side of the socket.
>> So, are you saying the FreeBSD NFS server did not call soclose() for this case?
>Yes. If the state at the server side is CLOSE_WAIT, no close call has happened yet.
>The FIN from the client was received, it was ACKED, but no close() call
>(or shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR)) was issued. Therefore,
>no FIN was sent and the client should be in the FINWAIT-2 state. This was also
>reported. So the reported states are consistent.
For a test, I commented out the soclose() call in the server side krpc and, when I
dismounted, it did leave the server socket in CLOSE_WAIT.
For the FreeBSD client, it did the dismount and the socket was in FIN_WAIT2
for a little while and then disappeared (someone mentioned a short timeout
and that seems to be the case).
I might argue that the Linux client should not get hung when this occurs,
but there does appear to be an issue on the FreeBSD end.

So it does appear you have a case where the soclose() call is not happening
on the FreeBSD NFS server. I am a little surprised since I don't think I've
heard of this before and the code is at least 10years old (at least the parts
related to this).

For the soclose() to not happen, the reference count on the socket
structure cannot have gone to zero. (ie a SVC_RELEASE() was missed)
Upon code inspection, I was not able to spot a reference counting bug.
(Not too surprising, since a reference counting bug should have shown
up long ago.)

The only thing I spotted that could conceivably explain this is that the
function svc_vc_stat() which returns the indication that the socket has
been closed at the other end did not bother to do any locking when
it checked the status. (I am not yet sure if this could result in the
status of XPRT_DIED being missed by the call, but if so, that would
result in the soclose() call not happening.)

I have attached a small patch, which I think is safe, that adds locking
to svc_vc_stat(),which I am hoping you can try at some point.
(I realize this is difficult for a production server, but...)
I have tested it a little and will test it some more, to try and ensure
it does not break anything.

I have also cc'd mav@, since he's the guy who last worked on this
code, in case he has any insight w.r.t. how the soclose() might get
missed (or any other way the server socket gets stuck in CLOSE_WAIT).

rick
ps: I'll create a PR for this, so that it doesn't get forgotten.

Best regards
Michael

>
> rick
>
> Best regards
> Michael
>> This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple can not be reused during this time.
>>
>> With other words, from the socket / TCP, a properly executed active close() will end up in this state. (If the other side initiated the close, a passive close, will not end in this state)
>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"

_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"

<xprtdied.patch>_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Rick Macklem
2021-03-19 21:17:35 UTC
Jason Breitman wrote:
>Thank you for your focus on the issue I am having and I look forward to seeing your >patch ported to FreeBSD 12.X.
I'll only be committing the patch if I am convinced it actually fixes something.
I'll be looking more closely at it and seeing what mav@ thinks about it.

>I also appreciate that you understand the difficulties in testing changes on a core >piece of infrastructure.
>
>I will let the group know if the issue occurs following the change that disabled TSO >and LRO on my FreeBSD NFS Server NICs.
>Those types of settings are less disruptive and pose less risk to apply in production, >so I am happy to do so. That change was made on 3/17/2021.
>My hope is that this issue is a race condition specific to my NIC handling certain TCP >operations.
>
>I am happy that you are approaching the issue from multiple angles.
Well, another angle would be an nfsd thread that is "stuck" while doing
an RPC. If that happens, it won't call svc_freereq(), which derefs the
structure.
If it happens again and you are able to, please go onto the FreeBSD server
and perform the following commands a few times, a few minutes apart,
capturing the output:
# ps axHl
# procstat -kk <pid of nfsd (server) as listed by "ps ax">
If the output doesn't have proprietary or security stuff, email me the results.
(Or if you can sanitize it.)
--> I'll be looking for an nfsd thread that is sleeping on the same thing
without its CPU time increasing (except for rpcsvc, which is just a thread
waiting for an RPC to work on).
--> The procstat -kk gives more information w.r.t what the thread is up to.

Good luck with it, rick

Thanks.

Jason Breitman


On Mar 19, 2021, at 11:58 AM, Rick Macklem <***@uoguelph.ca> wrote:

Michael Tuexen wrote:
>> On 18. Mar 2021, at 21:55, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Michael Tuexen wrote:
>>>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard <***@netapp.com> wrote:
>>>>
>>>>>> Output from the NFS Client when the issue occurs # netstat -an | grep
>>>>>> NFS.Server.IP.X
>>>>>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>> I'm no TCP guy. Hopefully others might know why the client would be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could be wrong?)
>>>>
>>>> When the client is in Fin-Wait2 this is the state you end up when the Client side actively close() the tcp session, and then the server also ACKed the FIN.
>> Jason noted:
>>
>>> When the issue occurs, this is what I see on the NFS Server.
>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.51550 CLOSE_WAIT
>>>
>>> which corresponds to the state on the client side. The server received the FIN
>>> from the client and acked it.
>>> The server is waiting for a close call to happen.
>>> So the question is: Is the server also closing the connection?
>> Did you mean to say "client closing the connection here?"
>Yes.
>>
>> The server should call soclose() { it never calls soshutdown() } when
>> soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
>> the socket is broken.
Btw, I looked and the soreceive() is done with MSG_DONTWAIT, but the
EWOULDBLOCK is handled appropriately.

>> --> The soreceive() call is triggered by an upcall for the rcv side of the socket.
>> So, are you saying the FreeBSD NFS server did not call soclose() for this case?
>Yes. If the state at the server side is CLOSE_WAIT, no close call has happened yet.
>The FIN from the client was received, it was ACKED, but no close() call
>(or shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR)) was issued. Therefore,
>no FIN was sent and the client should be in the FINWAIT-2 state. This was also
>reported. So the reported states are consistent.
For a test, I commented out the soclose() call in the server side krpc and, when I
dismounted, it did leave the server socket in CLOSE_WAIT.
For the FreeBSD client, it did the dismount and the socket was in FIN_WAIT2
for a little while and then disappeared (someone mentioned a short timeout
and that seems to be the case).
I might argue that the Linux client should not get hung when this occurs,
but there does appear to be an issue on the FreeBSD end.

So it does appear you have a case where the soclose() call is not happening
on the FreeBSD NFS server. I am a little surprised since I don't think I've
heard of this before and the code is at least 10years old (at least the parts
related to this).

For the soclose() to not happen, the reference count on the socket
structure cannot have gone to zero. (ie a SVC_RELEASE() was missed)
Upon code inspection, I was not able to spot a reference counting bug.
(Not too surprising, since a reference counting bug should have shown
up long ago.)

The only thing I spotted that could conceivably explain this is that the
function svc_vc_stat() which returns the indication that the socket has
been closed at the other end did not bother to do any locking when
it checked the status. (I am not yet sure if this could result in the
status of XPRT_DIED being missed by the call, but if so, that would
result in the soclose() call not happening.)

I have attached a small patch, which I think is safe, that adds locking
to svc_vc_stat(),which I am hoping you can try at some point.
(I realize this is difficult for a production server, but...)
I have tested it a little and will test it some more, to try and ensure
it does not break anything.

I have also cc'd mav@, since he's the guy who last worked on this
code, in case he has any insight w.r.t. how the soclose() might get
missed (or any other way the server socket gets stuck in CLOSE_WAIT).

rick
ps: I'll create a PR for this, so that it doesn't get forgotten.

Best regards
Michael

>
> rick
>
> Best regards
> Michael
>> This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple can not be reused during this time.
>>
>> With other words, from the socket / TCP, a properly executed active close() will end up in this state. (If the other side initiated the close, a passive close, will not end in this state)
>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"

_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"

<xprtdied.patch>_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Jason Breitman
2021-03-21 13:26:33 UTC
The issue did trigger again.
I ran the script below for ~15 minutes and hope this gets you what you need.
Let me know if you require the full output without grepping nfsd.


#!/bin/sh
# Log nfsd thread states and kernel stacks once a minute
# (2947 and 2944 are the nfsd process IDs on this server).

while true
do
    /bin/date >> /tmp/nfs-hang.log
    /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
    /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
    /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
    /bin/sleep 60
done


On the NFS Server
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT

On the NFS Client
tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
Alan Somers
2021-03-17 21:45:47 UTC
On Wed, Mar 17, 2021 at 3:37 PM Rick Macklem <***@uoguelph.ca> wrote:

> Jason Breitman wrote:
> >Please review the details below and let me know if there is a setting
> that I should >apply to my FreeBSD NFS Server or if there is a bug fix that
> I can apply to resolve my >issue.
> >I shared this information with the linux-nfs mailing list and they
> believe the issue is >on the server side.
> I actually lurk there and saw your post. I'll admit I smiled when Trond
> argued
> that a hung Linux system is the result of a server failing to send a
> fin/ack for
> a closing TCP connection. But, here's a few comments..
>
> >Issue
> >NFSv4 mounts periodically hang on the NFS Client.
> >
> >During this time, it is possible to manually mount from another NFS
> Server on the >NFS Client having issues.
> >Also, other NFS Clients are successfully mounting from the NFS Server in
> question.
> >Rebooting the NFS Client appears to be the only solution.
> >
> >Environment
> >NFS Server
> >OS: FreeBSD 12.1-RELEASE-p5
> >
> >NFS Client
> >OS: Debian Buster 10.8
> >Kernel: 4.19.171-2
> >Protocol: NFSv4 with Kerberos Security
> >Mount Options: nfs-server.domain.com:/data /mnt/data nfs4
> >lookupcache=pos,noresvport,sec=krb5,hard,rsize=1048576,wsize=1048576 00
> The maximum I/O size supported by FreeBSD is 128K.
>

Is the 128K limit related to MAXPHYS? If so, it should be greater in 13.0.


> The client should acquire the attributes that indicate that and set
> rsize/wsize
> to that. "# nfsstat -m" on the client should show you what the client
> is actually using. If it is larger than 128K, set both rsize and wsize to
> 128K.
>
> >Output from the NFS Client when the issue occurs
> ># netstat -an | grep NFS.Server.IP.X
> >tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049
> FIN_WAIT2
> I'm no TCP guy. Hopefully others might know why the client would be
> stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a
> fin/ack,
> but could be wrong?)
>
> ># cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
> >netid: tcp
> >addr: NFS.Server.IP.X
> >port: 2049
> >state: 0x51
> >
> >syslog
> >Mar 4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status
> -client- --rqstp- ->timeout ---ops--
> >Mar 4 10:29:27 hostname kernel: [437414.133158] 57419 40a1 0
> 9b723c73 >143cfadf 30000 4ca953b5 nfsv4 OPEN_NOATTR
> a:call_connect_status [sunrpc] >q:xprt_pending
> I don't know what OPEN_NOATTR means, but I assume it is some variant
> of NFSv4 Open operation.
> [stuff snipped]
> >Mar 4 10:29:30 hostname kernel: [437417.110517] RPC: 57419
> xprt_connect_status: >connect attempt timed out
> >Mar 4 10:29:30 hostname kernel: [437417.112172] RPC: 57419
> call_connect_status
> >(status -110)
> I have no idea what status -110 means?
> >Mar 4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout
> (major)
> >Mar 4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind
> (status 0)
> >Mar 4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect
> xprt >00000000e061831b is not connected
> >Mar 4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect
> xprt >00000000e061831b is not connected
> >Mar 4 10:30:31 hostname kernel: [437478.551090] RPC: 57419
> xprt_connect_status: >connect attempt timed out
> >Mar 4 10:30:31 hostname kernel: [437478.552396] RPC: 57419
> call_connect_status >(status -110)
> >Mar 4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout
> (minor)
> >Mar 4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind
> (status 0)
> >Mar 4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect
> xprt >00000000e061831b is not connected
> >Mar 4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect
> xprt >00000000e061831b is not connected
> Is it possible that the client is trying to (re)connect using the same
> client port#?
> I would normally expect the client to create a new TCP connection using a
> different client port# and then retry the outstanding RPCs.
> --> Capturing packets when this happens would show us what is going on.
>
> If there is a problem on the FreeBSD end, it is most likely a broken
> network device driver.
> --> Try disabling TSO , LRO.
> --> Try a different driver for the net hardware on the server.
> --> Try a different net chip on the server.
> If you can capture packets when (not after) the hang
> occurs, then you can look at them in wireshark and see
> what is actually happening. (Ideally on both client and
> server, to check that your network hasn't dropped anything.)
> --> I know, if the hangs aren't easily reproducible, this isn't
> easily done.
> --> Try a newer Linux kernel and see if the problem persists.
> The Linux folk will get more interested if you can reproduce
> the problem on 5.12. (Recent bakeathon testing of the 5.12
> kernel against the FreeBSD server did not find any issues.)
>
> Hopefully the network folk have some insight w.r.t. why
> the TCP connection is sitting in FIN_WAIT2.
>
> rick
>
>
>
> Jason Breitman
>
>
>
>
>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
Rick Macklem
2021-03-17 21:58:25 UTC
Alan Somers wrote:
[stuff snipped]
>Is the 128K limit related to MAXPHYS? If so, it should be greater in 13.0.
For the client, yes. For the server, no.
For the server, it is just a compile time constant NFS_SRVMAXIO.

It's mainly related to the fact that I haven't gotten around to testing larger
sizes yet.
- kern.ipc.maxsockbuf needs to be several times the limit, which means it would
have to increase for 1Mbyte.
- The session code must negotiate a maximum RPC size > 1 Mbyte.
(I think the server code does do this, but it needs to be tested.)
And, yes, the client is limited to MAXPHYS.

Doing this is on my todo list, rick

The client should acquire the attributes that indicate that and set rsize/wsize
to that. "# nfsstat -m" on the client should show you what the client
is actually using. If it is larger than 128K, set both rsize and wsize to 128K.

>Output from the NFS Client when the issue occurs
># netstat -an | grep NFS.Server.IP.X
>tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
I'm no TCP guy. Hopefully others might know why the client would be
stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
but could be wrong?)

># cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
>netid: tcp
>addr: NFS.Server.IP.X
>port: 2049
>state: 0x51
>
>syslog
>Mar 4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- --rqstp- ->timeout ---ops--
>Mar 4 10:29:27 hostname kernel: [437414.133158] 57419 40a1 0 9b723c73 >143cfadf 30000 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status [sunrpc] >q:xprt_pending
I don't know what OPEN_NOATTR means, but I assume it is some variant
of NFSv4 Open operation.
[stuff snipped]
>Mar 4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 xprt_connect_status: >connect attempt timed out
>Mar 4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 call_connect_status
>(status -110)
I have no idea what status -110 means?
>Mar 4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout (major)
>Mar 4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind (status 0)
>Mar 4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>Mar 4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
>Mar 4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 xprt_connect_status: >connect attempt timed out
>Mar 4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 call_connect_status >(status -110)
>Mar 4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout (minor)
>Mar 4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind (status 0)
>Mar 4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>Mar 4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
Is it possible that the client is trying to (re)connect using the same client port#?
I would normally expect the client to create a new TCP connection using a
different client port# and then retry the outstanding RPCs.
--> Capturing packets when this happens would show us what is going on.

If there is a problem on the FreeBSD end, it is most likely a broken
network device driver.
--> Try disabling TSO , LRO.
--> Try a different driver for the net hardware on the server.
--> Try a different net chip on the server.
If you can capture packets when (not after) the hang
occurs, then you can look at them in wireshark and see
what is actually happening. (Ideally on both client and
server, to check that your network hasn't dropped anything.)
--> I know, if the hangs aren't easily reproducible, this isn't
easily done.
--> Try a newer Linux kernel and see if the problem persists.
The Linux folk will get more interested if you can reproduce
the problem on 5.12. (Recent bakeathon testing of the 5.12
kernel against the FreeBSD server did not find any issues.)

Hopefully the network folk have some insight w.r.t. why
the TCP connection is sitting in FIN_WAIT2.

rick



Jason Breitman






_______________________________________________
freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"

_______________________________________________
freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
Jason Breitman
2021-03-17 23:15:58 UTC
We are using the Intel Ethernet Network Adapter X722.

Jason Breitman


On Mar 17, 2021, at 6:48 PM, Peter Eriksson <***@lysator.liu.se> wrote:

CLOSE_WAIT on the server side usually indicates that the kernel has sent the ACK to the client’s FIN (start of a shutdown) packet but hasn’t sent its own FIN packet - something that usually happens when the server has read all data queued up from the client and taken whatever actions it needs to shut down its service…

Here’s a fine ASCII art. Probably needs to be viewed using a monospaced font :-)

Client
 ESTABLISHED --> FIN-WAIT-1 +-----> FIN-WAIT-2 +-----> TIME-WAIT ---> CLOSED
                   :        ^                  ^       :
               FIN :        : ACK          FIN :   ACK :
                   v        :                  :       v
 ESTABLISHED       +--> CLOSE-WAIT --....---> LAST-ACK +--------> CLOSED
Server


TSO/LRO and/or “intelligence” in some smart network cards can cause all kinds of interesting bugs. What ethernet cards are you using?
(TSO/LRO seems to be working better these days for our Intel X710 cards, but a couple of years ago they would freeze up on us so we had to disable it)

Hmm.. Perhaps the NFS server is waiting for some locks to be released before it can close down its end of the TCP link? Reservations?

But I’d suspect something else since we’ve been running NFSv4.1/Kerberos on our FreeBSD 11.3/12.2 servers for a long time with many Linux clients and most issues (the last couple of years) we’ve seen have been on the Linux end of things… Like the bugs in the Linux gss daemons or their single-threaded mount() sys call, or automounter freezing up... and other fun bugs.

- Peter

> On 17 Mar 2021, at 23:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>
> Thank you for the responses.
> The NFS Client does properly negotiate down to 128K for the rsize and wsize.
>
> The client port should be changing as we are using the noresvport option.
>
> On the NFS Client
> cat /proc/mounts
> nfs-server.domain.com:/data /mnt/data nfs4 rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,noresvport,proto=tcp,timeo=600,retrans=2,sec=krb5,clientaddr=NFS.Client.IP.X,lookupcache=pos,local_lock=none,addr=NFS.Server.IP.X 0 0
>
> When the issue occurs, this is what I see on the NFS Server.
> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.51550 CLOSE_WAIT
>
> Capturing packets right before the issue is a great idea, but I am concerned about running tcpdump for such an extended period of time on an active server.
> I have gone 9 days with no issue which would be a lot of data and overhead.
>
> I will look into disabling the TSO and LRO options and let the group know how it goes.
> Below are the current options on the NFS Server.
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
>
> Please share other ideas if you have them.
>
> Jason Breitman
>
>
> On Mar 17, 2021, at 5:58 PM, Rick Macklem <***@uoguelph.ca> wrote:
>
> Alan Somers wrote:
> [stuff snipped]
>> Is the 128K limit related to MAXPHYS? If so, it should be greater in 13.0.
> For the client, yes. For the server, no.
> For the server, it is just a compile time constant NFS_SRVMAXIO.
>
> It's mainly related to the fact that I haven't gotten around to testing larger
> sizes yet.
> - kern.ipc.maxsockbuf needs to be several times the limit, which means it would
> have to increase for 1Mbyte.
> - The session code must negotiate a maximum RPC size > 1 Mbyte.
> (I think the server code does do this, but it needs to be tested.)
> And, yes, the client is limited to MAXPHYS.
>
> Doing this is on my todo list, rick
>
> The client should acquire the attributes that indicate that and set rsize/wsize
> to that. "# nfsstat -m" on the client should show you what the client
> is actually using. If it is larger than 128K, set both rsize and wsize to 128K.
>
>> Output from the NFS Client when the issue occurs
>> # netstat -an | grep NFS.Server.IP.X
>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
> I'm no TCP guy. Hopefully others might know why the client would be
> stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
> but could be wrong?)
>
>> # cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
>> netid: tcp
>> addr: NFS.Server.IP.X
>> port: 2049
>> state: 0x51
>>
>> syslog
>> Mar 4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- --rqstp- ->timeout ---ops--
>> Mar 4 10:29:27 hostname kernel: [437414.133158] 57419 40a1 0 9b723c73 >143cfadf 30000 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status [sunrpc] >q:xprt_pending
> I don't know what OPEN_NOATTR means, but I assume it is some variant
> of NFSv4 Open operation.
> [stuff snipped]
>> Mar 4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 xprt_connect_status: >connect attempt timed out
>> Mar 4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 call_connect_status
>> (status -110)
> I have no idea what status -110 means?
>> Mar 4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout (major)
>> Mar 4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind (status 0)
>> Mar 4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>> Mar 4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
>> Mar 4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 xprt_connect_status: >connect attempt timed out
>> Mar 4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 call_connect_status >(status -110)
>> Mar 4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout (minor)
>> Mar 4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind (status 0)
>> Mar 4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>> Mar 4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
> Is it possible that the client is trying to (re)connect using the same client port#?
> I would normally expect the client to create a new TCP connection using a
> different client port# and then retry the outstanding RPCs.
> --> Capturing packets when this happens would show us what is going on.
>
> If there is a problem on the FreeBSD end, it is most likely a broken
> network device driver.
> --> Try disabling TSO , LRO.
> --> Try a different driver for the net hardware on the server.
> --> Try a different net chip on the server.
> If you can capture packets when (not after) the hang
> occurs, then you can look at them in wireshark and see
> what is actually happening. (Ideally on both client and
> server, to check that your network hasn't dropped anything.)
> --> I know, if the hangs aren't easily reproducible, this isn't
> easily done.
> --> Try a newer Linux kernel and see if the problem persists.
> The Linux folk will get more interested if you can reproduce
> the problem on 5.12. (Recent bakeathon testing of the 5.12
> kernel against the FreeBSD server did not find any issues.)
>
> Hopefully the network folk have some insight w.r.t. why
> the TCP connection is sitting in FIN_WAIT2.
>
> rick
>
>
>
> Jason Breitman
>
>
>
>
>
>
> _______________________________________________
> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>
> _______________________________________________
> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Gerrit Kuehn
2021-03-18 08:06:45 UTC
On Wed, 17 Mar 2021 18:17:14 -0400
Jason Breitman <***@tildenparkcapital.com> wrote:

> I will look into disabling the TSO and LRO options and let the group
> know how it goes. Below are the current options on the NFS Server.
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>
> metric 0 mtu 1500
> options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>

What laggproto are you using, and what kind of switch is connected on
the other end?


cu
Gerrit
Jason Breitman
2021-03-18 14:14:10 UTC
The laggproto is lacp and the switch is made by Extreme Networks.

Jason Breitman


On Mar 18, 2021, at 4:06 AM, Gerrit Kuehn <***@aei.mpg.de> wrote:


On Wed, 17 Mar 2021 18:17:14 -0400
Jason Breitman <***@tildenparkcapital.com> wrote:

> I will look into disabling the TSO and LRO options and let the group
> know how it goes. Below are the current options on the NFS Server.
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>
> metric 0 mtu 1500
> options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>

What laggproto are you using, and what kind of switch is connected on
the other end?


cu
Gerrit
_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Youssef GHORBAL
2021-03-20 00:40:12 UTC
Hi Jason,

> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>
> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>
> Issue
> NFSv4 mounts periodically hang on the NFS Client.
>
> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
> Also, other NFS Clients are successfully mounting from the NFS Server in question.
> Rebooting the NFS Client appears to be the only solution.

I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:

- Data flows correctly between SERVER and the CLIENT
- At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
- The client (eager to send data) can only ack data sent by SERVER.
- When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
- SERVER responds with a TCP Zero Window to those probes.
- After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection, sending a FIN Packet (and still a TCP Window at 0)
- CLIENT ACKs that FIN.
- SERVER goes into the FIN_WAIT_2 state
- CLIENT closes its half of the socket and goes into the LAST_ACK state.
- FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
- SERVER keeps responding with a TCP Zero Window to those probes.
=> The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.

If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we’d end up in the same state as you, I think.

We’ve never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding)

To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, forces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)

The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.

Youssef
Jason Breitman
2021-03-21 13:41:15 UTC
Thanks for sharing as this sounds exactly like my issue.

I had implemented the change below on 3/8/2021 and have experienced the NFS hang after that.
Do I need to reboot or umount / mount all of the clients and then I will be ok?

I had not rebooted the clients, but would to get out of this situation.
It is logical that a new TCP session over 2049 needs to be reestablished for the changes to take effect.

net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.finwait2_timeout=1000

I can also confirm that the iptables solution that you use on the client to get out of the hung mount without a reboot works for me.
#!/bin/sh

progName="nfsClientFix"
delay=15
nfs_ip=NFS.Server.IP.X

nfs_fin_wait2_state() {
    /usr/bin/netstat -an | /usr/bin/grep ${nfs_ip}:2049 | /usr/bin/grep FIN_WAIT2 > /dev/null 2>&1
    return $?
}


nfs_fin_wait2_state
result=$?
if [ ${result} -eq 0 ] ; then
    /usr/bin/logger -s -i -p local7.error -t ${progName} "NFS Connection is in FIN_WAIT2!"
    /usr/bin/logger -s -i -p local7.error -t ${progName} "Enabling firewall to block ${nfs_ip}!"
    /usr/sbin/iptables -A INPUT -s ${nfs_ip} -j DROP

    while true
    do
        /usr/bin/sleep ${delay}
        nfs_fin_wait2_state
        result=$?
        if [ ${result} -ne 0 ] ; then
            /usr/bin/logger -s -i -p local7.notice -t ${progName} "NFS Connection is OK."
            /usr/bin/logger -s -i -p local7.error -t ${progName} "Disabling firewall to allow access to ${nfs_ip}!"
            /usr/sbin/iptables -D INPUT -s ${nfs_ip} -j DROP
            break
        fi
    done
fi


Jason Breitman


On Mar 19, 2021, at 8:40 PM, Youssef GHORBAL <***@pasteur.fr> wrote:

Hi Jason,

> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>
> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>
> Issue
> NFSv4 mounts periodically hang on the NFS Client.
>
> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
> Also, other NFS Clients are successfully mounting from the NFS Server in question.
> Rebooting the NFS Client appears to be the only solution.

I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there own nfsd)
We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes like follows:

- Data flows correctly between SERVER and the CLIENT
- At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
- The client (eager to send data) can only ack data sent by SERVER.
- When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so he can flush its buffers.
- SERVER responds with a TCP Zero Window to those probes.
- After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the TCP connection sending a FIN Packet (and still a TCP Window at 0)
- CLIENT ACK that FIN.
- SERVER goes in FIN_WAIT_2 state
- CLIENT closes its half part part of the socket and goes in LAST_ACK state.
- FIN is never sent by the client since there still data in its SendQ and receiver TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again hoping that the server opens its TCP Window so it can flush it's buffers and terminate its side of the socket.
- SERVER keeps responding with a TCP Zero Window to those probes.
=> The last two steps goes on and on for hours/days freezing the NFS mount bound to that TCP session.

If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and server wanting to send data we’ll end up in the same state as you I think.

We’ve never had the root cause of why the SERVER decided to close the TCP Window and no more acccept data, the fix on the Isilon part was to recycle more aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handchake, etc and traffic flows again (NFS starts responding)

To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)

The bottom line would be to have a packet capture during the outage (client and/or server side), it will show you at least the shape of the TCP exchange when NFS is stuck.

Youssef
Rick Macklem
2021-03-21 22:21:33 UTC
Youssef GHORBAL <***@pasteur.fr> wrote:
>Hi Jason,
>
>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>>
>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>
>> Issue
>> NFSv4 mounts periodically hang on the NFS Client.
>>
>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>> Rebooting the NFS Client appears to be the only solution.
>
>I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there >own nfsd)
Yes, my understanding is that Isilon uses a proprietary user space nfsd and
not the kernel based RPC and nfsd in FreeBSD.

>We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>
>- Data flows correctly between SERVER and the CLIENT
>- At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>- The client (eager to send data) can only ack data sent by SERVER.
>- When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers.
>- SERVER responds with a TCP Zero Window to those probes.
Having the window size drop to zero is not necessarily incorrect.
If the server is overloaded (has a backlog of NFS requests), it can stop doing
soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
closes). This results in "backpressure" to stop the NFS client from flooding the
NFS server with requests.
--> However, once the backlog is handled, the nfsd should start to soreceive()
again and this should cause the window to open back up.
--> Maybe this is broken in the socket/TCP code. I quickly got lost in
tcp_output() when it decides what to do about the rcvwin.

>- After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
This probably does not happen for Jason's case, since the 6 minute timeout
is disabled when the TCP connection is assigned as a backchannel (most likely
the case for NFSv4.1).

>- CLIENT ACK that FIN.
>- SERVER goes in FIN_WAIT_2 state
>- CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>- FIN is never sent by the client since there still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush it's buffers >and terminate its side of the socket.
>- SERVER keeps responding with a TCP Zero Window to those probes.
>=> The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session.
>
>If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we’ll end up in the same >state as you I think.
>
>We’ve never had the root cause of why the SERVER decided to close the TCP >Window and no more acccept data, the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next >occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handchake, etc and traffic >flows again (NFS starts responding)
>
>To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we added a check script on the client that detects LAST_ACK sockets and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
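For anyone who wants to try that workaround, here is a rough sketch of such a check script (untested; assumes Linux netstat/iptables and IPv4; nfs_server_addr and the polling intervals are placeholders, not the actual Isilon-era script):

#!/bin/sh
# detect a LAST_ACK socket to the NFS server, answer further packets on it
# with a TCP RST via iptables, and drop the rule once the socket is gone
nfs_server_addr=NFS.Server.IP.X
while true; do
    port=$(netstat -tn | awk -v peer="$nfs_server_addr:2049" \
        '$6 == "LAST_ACK" && $5 == peer { split($4, a, ":"); print a[2]; exit }')
    if [ -n "$port" ]; then
        iptables -A OUTPUT -p tcp -d "$nfs_server_addr" --sport "$port" \
            -j REJECT --reject-with tcp-reset
        # wait for the stuck socket to disappear, then remove the rule
        while netstat -tn | grep -q "$nfs_server_addr:2049.*LAST_ACK"; do
            sleep 5
        done
        iptables -D OUTPUT -p tcp -d "$nfs_server_addr" --sport "$port" \
            -j REJECT --reject-with tcp-reset
    fi
    sleep 30
done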
>
>The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
Interesting story and good work w.r.t. sleuthing, Youssef, thanks.

I looked at Jason's log and it shows everything is OK w.r.t. the nfsd threads.
(They're just waiting for RPC requests.)
However, I do now think I know why the soclose() does not happen.
When the TCP connection is assigned as a backchannel, that takes a reference
cnt on the structure. This refcnt won't be released until the connection is
replaced by a BindConnectionToSession operation from the client. But that won't
happen until the client creates a new TCP connection.
--> No refcnt release-->no refcnt of 0-->no soclose().

I've created the attached patch (completely different from the previous one)
that adds soshutdown(SHUT_WR) calls in the three places where the TCP
connection is going away. This seems to get it past CLOSE_WAIT without a
soclose().
--> I know you are not comfortable with patching your server, but I do think
this change will get the socket shutdown to complete.

There are a couple more things you can check on the server...
# nfsstat -E -s
--> Look for the count under "BindConnToSes".
--> If non-zero, backchannels have been assigned
# sysctl -a | fgrep request_space_throttle_count
--> If non-zero, the server has been overloaded at some point.

I think the attached patch might work around the problem.
The code that should open up the receive window needs to be checked.
I am also looking at enabling the 6-minute timeout when a backchannel is
assigned.

rick

Youssef
Jason Breitman
2021-03-27 12:20:07 UTC
Permalink
The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
# ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
# ifconfig lagg0
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>

We can also say that the sysctl settings did not resolve this issue.

# sysctl net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.fast_finwait2_recycle: 0 -> 1

# sysctl net.inet.tcp.finwait2_timeout=1000
net.inet.tcp.finwait2_timeout: 60000 -> 1000

* I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.

The issue occurred after 5 days following a reboot of the client machines.
I ran the capture information again to make use of the situation.

#!/bin/sh

while true
do
/bin/date >> /tmp/nfs-hang.log
/bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
/usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
/usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
/bin/sleep 60
done


On the NFS Server
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT

On the NFS Client
tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
Youssef GHORBAL
2021-03-27 22:57:01 UTC
Permalink
On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:

The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
# ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
# ifconfig lagg0
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>

We can also say that the sysctl settings did not resolve this issue.

# sysctl net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.fast_finwait2_recycle: 0 -> 1

# sysctl net.inet.tcp.finwait2_timeout=1000
net.inet.tcp.finwait2_timeout: 60000 -> 1000

I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)

tcp_fin_timeout (integer; default: 60; since Linux 2.2)
This specifies how many seconds to wait for a final FIN
packet before the socket is forcibly closed. This is
strictly a violation of the TCP specification, but
required to prevent denial-of-service attacks. In Linux
2.2, the default value was 180.
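(For reference, the Linux-side knob can be checked, and lowered for an experiment, like this; 30 is only an example value:)

# on the Linux client
sysctl net.ipv4.tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30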

So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.

You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
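Something along these lines is enough (run on the client, and ideally the same on the server; the interface name and server address are placeholders, and you can stop it with ^C after ~10 minutes):

# full packets, written to a file for later analysis in wireshark/tshark
tcpdump -i eth0 -s 0 -w /var/tmp/nfs-hang.pcap host NFS.Server.IP.X and port 2049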

* I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.

The issue occurred after 5 days following a reboot of the client machines.
I ran the capture information again to make use of the situation.

#!/bin/sh

while true
do
/bin/date >> /tmp/nfs-hang.log
/bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
/usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
/usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
/bin/sleep 60
done


On the NFS Server
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT

On the NFS Client
tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2



You had also asked for the output below.

# nfsstat -E -s
BackChannelCt BindConnToSes
0 0

# sysctl vfs.nfsd.request_space_throttle_count
vfs.nfsd.request_space_throttle_count: 0

I see that you are testing a patch and I look forward to seeing the results.


Jason Breitman


Rick Macklem
2021-04-02 00:07:48 UTC
Permalink
I hope you don't mind a top post...
I've been testing network partitioning between the only Linux client
I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
(does soshutdown(..SHUT_WR) when it knows the socket is broken)
applied to it.

I'm not enough of a TCP guy to know if this is useful, but here's what
I see...

While partitioned:
On the FreeBSD server end, the socket either goes to CLOSED during
the network partition or stays ESTABLISHED.
On the Linux end, the socket seems to remain ESTABLISHED for a
little while, and then disappears.

After unpartitioning:
On the FreeBSD server end, you get another socket showing up at
the same port#
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED

The Linux client shows the same connection ESTABLISHED.
(The mount sometimes reports an error. I haven't looked at packet
traces to see if it retries RPCs or why the errors occur.)
--> However I never get hangs.
Sometimes it goes to SYN_SENT for a while and the FreeBSD server
shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
mount starts working again.

The most obvious thing is that the Linux client always keeps using
the same port#. (The FreeBSD client will use a different port# when
it does a TCP reconnect after no response from the NFS server for
a little while.)
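An easy way to watch this from the server side (interface name is a placeholder) is to log the SYNs and compare the client's source port# across reconnects:

# each reconnect shows up as a SYN; the source port is visible in the
# usual address.port form in tcpdump's output
tcpdump -ni em0 'port 2049 and tcp[tcpflags] & tcp-syn != 0'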

What do those TCP conversants think?

rick
ps: I can capture packets while doing this, if anyone has a use
for them.






t***@freebsd.org
2021-04-02 16:15:16 UTC
Permalink
> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>
> I hope you don't mind a top post...
> I've been testing network partitioning between the only Linux client
> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
> applied to it.
>
> I'm not enough of a TCP guy to know if this is useful, but here's what
> I see...
>
> While partitioned:
> On the FreeBSD server end, the socket either goes to CLOSED during
> the network partition or stays ESTABLISHED.
If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
sent a FIN, but you never called close() on the socket.
If the socket stays in ESTABLISHED, there is no communication ongoing,
I guess, and therefore the server does not even detect that the peer
is not reachable.
> On the Linux end, the socket seems to remain ESTABLISHED for a
> little while, and then disappears.
So how does Linux detect the peer is not reachable?
>
> After unpartitioning:
> On the FreeBSD server end, you get another socket showing up at
> the same port#
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>
> The Linux client shows the same connection ESTABLISHED.
> (The mount sometimes reports an error. I haven't looked at packet
> traces to see if it retries RPCs or why the errors occur.)
> --> However I never get hangs.
> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
> mount starts working again.
>
> The most obvious thing is that the Linux client always keeps using
> the same port#. (The FreeBSD client will use a different port# when
> it does a TCP reconnect after no response from the NFS server for
> a little while.)
>
> What do those TCP conversants think?
I guess you are never calling close() on the socket for which
the connection state is CLOSED.

Best regards
Michael
Rick Macklem
2021-04-02 21:31:01 UTC
Permalink
***@freebsd.org wrote:
>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> I hope you don't mind a top post...
>> I've been testing network partitioning between the only Linux client
>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>> applied to it.
>>
>> I'm not enough of a TCP guy to know if this is useful, but here's what
>> I see...
>>
>> While partitioned:
>> On the FreeBSD server end, the socket either goes to CLOSED during
>> the network partition or stays ESTABLISHED.
>If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>sent a FIN, but you never called close() on the socket.
>If the socket stays in ESTABLISHED, there is no communication ongoing,
>I guess, and therefore the server does not even detect that the peer
>is not reachable.
>> On the Linux end, the socket seems to remain ESTABLISHED for a
>> little while, and then disappears.
>So how does Linux detect the peer is not reachable?
Well, here's what I see in a packet capture in the Linux client once
I partition it (just unplug the net cable):
- lots of retransmits of the same segment (with ACK) for 54sec
- then only ARP queries

Once I plug the net cable back in:
- ARP works
- one more retransmit of the same segment
- receives RST from FreeBSD
** So, is this now a "new" TCP connection, despite
using the same port#.
--> It matters for NFS, since "new connection"
implies "must retry all outstanding RPCs".
- sends SYN
- receives SYN, ACK from FreeBSD
--> connection starts working again
Always uses same port#.

On the FreeBSD server end:
- receives the last retransmit of the segment (with ACK)
- sends RST
- receives SYN
- sends SYN, ACK

I thought that there was no RST in the capture I looked at
yesterday, so I'm not sure if FreeBSD always sends an RST,
but the Linux client behaviour was the same. (Sent a SYN, etc).
The socket disappears from the Linux "netstat -a" and I
suspect that happens after about 54sec, but I am not sure
about the timing.
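If anyone cares about the exact timing, a crude way to nail it down on the Linux client is something like this (server address is a placeholder):

# log the socket state once a second; the last line before the entry
# disappears tells you when Linux gave up on it
while true; do date; netstat -tn | grep 'NFS.Server.IP.X:2049'; sleep 1; done >> /tmp/nfs-sock.log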

>>
>> After unpartitioning:
>> On the FreeBSD server end, you get another socket showing up at
>> the same port#
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>
>> The Linux client shows the same connection ESTABLISHED.
But disappears from "netstat -a" for a while during the partitioning.

>> (The mount sometimes reports an error. I haven't looked at packet
>> traces to see if it retries RPCs or why the errors occur.)
I have now done so, as above.

>> --> However I never get hangs.
>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>> mount starts working again.
>>
>> The most obvious thing is that the Linux client always keeps using
>> the same port#. (The FreeBSD client will use a different port# when
>> it does a TCP reconnect after no response from the NFS server for
>> a little while.)
>>
>> What do those TCP conversants think?
>I guess you are never calling close() on the socket for which
>the connection state is CLOSED.
Ok, that makes sense. For this case the Linux client has not done a
BindConnectionToSession to re-assign the back channel.
I'll have to bug them about this. However, I'll bet they'll answer
that I have to tell them the back channel needs re-assignment
or something like that.

I am pretty certain they are broken, in that the client needs to
retry all outstanding RPCs.

For others, here's the long winded version of this that I just
put on the phabricator review:
In the server side kernel RPC, the socket (struct socket *) is in a
structure called SVCXPRT (normally pointed to by "xprt").
These structures are ref counted and the soclose() is done
when the ref. cnt goes to zero. My understanding is that
"struct socket *" is free'd by soclose() so this cannot be done
before the xprt ref. cnt goes to zero.

For NFSv4.1/4.2 there is something called a back channel
which means that a "xprt" is used for server->client RPCs,
although the TCP connection is established by the client
to the server.
--> This back channel holds a ref cnt on "xprt" until the
client re-assigns it to a different TCP connection
via an operation called BindConnectionToSession,
and the Linux client is not doing this soon enough,
it appears.

So, the soclose() is delayed, which is why I think the
TCP connection gets stuck in CLOSE_WAIT and that is
why I've added the soshutdown(..SHUT_WR) calls,
which can happen before the client gets around to
re-assigning the back channel.
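A quick way to check whether the shutdown now completes (on the FreeBSD server; assumes nfsd on port 2049) is to look for lingering CLOSE_WAIT entries:

# with the soshutdown(..SHUT_WR) patch these should drain instead of
# sitting in CLOSE_WAIT until the backchannel refcnt is released
netstat -an -p tcp | grep '\.2049 ' | grep CLOSE_WAIT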

Thanks for your help with this Michael, rick

Rick Macklem
2021-04-02 22:24:10 UTC
Permalink
Updating my post slightly...

***@freebsd.org wrote:
>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> I hope you don't mind a top post...
>> I've been testing network partitioning between the only Linux client
>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>> applied to it.
>>
>> I'm not enough of a TCP guy to know if this is useful, but here's what
>> I see...
>>
>> While partitioned:
>> On the FreeBSD server end, the socket either goes to CLOSED during
>> the network partition or stays ESTABLISHED.
>If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>sent a FIN, but you never called close() on the socket.
>If the socket stays in ESTABLISHED, there is no communication ongoing,
>I guess, and therefore the server does not even detect that the peer
>is not reachable.
>> On the Linux end, the socket seems to remain ESTABLISHED for a
>> little while, and then disappears.
>So how does Linux detect the peer is not reachable?
Well, here's what I see in a packet capture in the Linux client once
I partition it (just unplug the net cable):
- lots of retransmits of the same segment (with ACK) for 54sec
- then only ARP queries

Once I plug the net cable back in:
- ARP works
- one more retransmit of the same segment
- receives RST from FreeBSD
--> I just did a retest and the above two packets
did not happen. The Linux client went straight
to SYN and there was no RST from FreeBSD.
Otherwise, everything was the same.
** So, is this now a "new" TCP connection, despite
using the same port#.
--> It matters for NFS, since "new connection"
implies "must retry all outstanding RPCs".
- sends SYN
- receives SYN, ACK from FreeBSD
--> connection starts working again
Always uses same port#.

On the FreeBSD server end:
- receives the last retransmit of the segment (with ACK)
- sends RST
--> As above, these two packets did not happen during a
retest.
- receives SYN
- sends SYN, ACK

I thought that there was no RST in the capture I looked at
yesterday, so I'm not sure if FreeBSD always sends an RST,
but the Linux client behaviour was the same. (Sent a SYN, etc).
--> Just got the non-RST case. The Linux client did not do any
retransmit of the segment after unpartitioning and went
straight to SYN.
The socket disappears from the Linux "netstat -a" and I
suspect that happens after about 54sec, but I am not sure
about the timing.

>>
>> After unpartitioning:
>> On the FreeBSD server end, you get another socket showing up at
>> the same port#
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>
>> The Linux client shows the same connection ESTABLISHED.
But disappears from "netstat -a" for a while during the partitioning.

>> (The mount sometimes reports an error. I haven't looked at packet
>> traces to see if it retries RPCs or why the errors occur.)
I have now done so, as above.

>> --> However I never get hangs.
>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>> mount starts working again.
>>
>> The most obvious thing is that the Linux client always keeps using
>> the same port#. (The FreeBSD client will use a different port# when
>> it does a TCP reconnect after no response from the NFS server for
>> a little while.)
>>
>> What do those TCP conversants think?
>I guess you are never calling close() on the socket for which
>the connection state is CLOSED.
Ok, that makes sense. For this case the Linux client has not done a
BindConnectionToSession to re-assign the back channel.
I'll have to bug them about this. However, I'll bet they'll answer
that I have to tell them the back channel needs re-assignment
or something like that.

I am pretty certain they are broken, in that the client needs to
retry all outstanding RPCs.

For others, here's the long winded version of this that I just
put on the phabricator review:
In the server side kernel RPC, the socket (struct socket *) is in a
structure called SVCXPRT (normally pointed to by "xprt").
These structures are ref counted and the soclose() is done
when the ref. cnt goes to zero. My understanding is that
"struct socket *" is free'd by soclose() so this cannot be done
before the xprt ref. cnt goes to zero.

For NFSv4.1/4.2 there is something called a back channel
which means that a "xprt" is used for server->client RPCs,
although the TCP connection is established by the client
to the server.
--> This back channel holds a ref cnt on "xprt" until the
client re-assigns it to a different TCP connection
via an operation called BindConnectionToSession,
and the Linux client is not doing this soon enough,
it appears.

So, the soclose() is delayed, which is why I think the
TCP connection gets stuck in CLOSE_WAIT and that is
why I've added the soshutdown(..SHUT_WR) calls,
which can happen before the client gets around to
re-assigning the back channel.
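
For anyone who wants to check whether a server is already sitting in
this state, something like the following (just a sketch, using the
commands already mentioned in this thread) shows the symptom on the
FreeBSD server: a connection stuck in CLOSE_WAIT while the backchannel
counters show it was assigned but never re-bound.

# NFS connections stuck in CLOSE_WAIT
netstat -an | grep 2049 | grep CLOSE_WAIT
# backchannel assignment vs. BindConnectionToSession counts
nfsstat -E -s | grep -A 1 BindConnToSes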

Thanks for your help with this Michael, rick

Best regards
Michael
>
> rick
> ps: I can capture packets while doing this, if anyone has a use
> for them.
>
>
>
>
>
>
> ________________________________________
> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
> Sent: Saturday, March 27, 2021 6:57 PM
> To: Jason Breitman
> Cc: Rick Macklem; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
>
>
> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>
> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
> # ifconfig lagg0
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>
> We can also say that the sysctl settings did not resolve this issue.
>
> # sysctl net.inet.tcp.fast_finwait2_recycle=1
> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>
> # sysctl net.inet.tcp.finwait2_timeout=1000
> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>
> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>
> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
> This specifies how many seconds to wait for a final FIN
> packet before the socket is forcibly closed. This is
> strictly a violation of the TCP specification, but
> required to prevent denial-of-service attacks. In Linux
> 2.2, the default value was 180.
>
> So I don’t get why it stucks in the FIN_WAIT2 state anyway.
>
> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>
> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>
> The issue occurred after 5 days following a reboot of the client machines.
> I ran the capture information again to make use of the situation.
>
> #!/bin/sh
>
> while true
> do
> /bin/date >> /tmp/nfs-hang.log
> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
> /bin/sleep 60
> done
>
>
> On the NFS Server
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>
> On the NFS Client
> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>
>
>
> You had also asked for the output below.
>
> # nfsstat -E -s
> BackChannelCtBindConnToSes
> 0 0
>
> # sysctl vfs.nfsd.request_space_throttle_count
> vfs.nfsd.request_space_throttle_count: 0
>
> I see that you are testing a patch and I look forward to seeing the results.
>
>
> Jason Breitman
>
>
> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>
> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>> Hi Jason,
>>
>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>
>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>
>>> Issue
>>> NFSv4 mounts periodically hang on the NFS Client.
>>>
>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>> Rebooting the NFS Client appears to be the only solution.
>>
>> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there >own nfsd)
> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
> not the kernel based RPC and nfsd in FreeBSD.
>
>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>
>> - Data flows correctly between SERVER and the CLIENT
>> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>> - The client (eager to send data) can only ack data sent by SERVER.
>> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers.
>> - SERVER responds with a TCP Zero Window to those probes.
> Having the window size drop to zero is not necessarily incorrect.
> If the server is overloaded (has a backlog of NFS requests), it can stop doing
> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
> closes). This results in "backpressure" to stop the NFS client from flooding the
> NFS server with requests.
> --> However, once the backlog is handled, the nfsd should start to soreceive()
> again and this shouls cause the window to open back up.
> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
> tcp_output() when it decides what to do about the rcvwin.
>
>> - After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
> This probably does not happen for Jason's case, since the 6minute timeout
> is disabled when the TCP connection is assigned as a backchannel (most likely
> the case for NFSv4.1).
>
>> - CLIENT ACK that FIN.
>> - SERVER goes in FIN_WAIT_2 state
>> - CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>> - FIN is never sent by the client since there still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush it's buffers >and terminate its side of the socket.
>> - SERVER keeps responding with a TCP Zero Window to those probes.
>> => The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session.
>>
>> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we’ll end up in the same >state as you I think.
>>
>> We’ve never had the root cause of why the SERVER decided to close the TCP >Window and no more acccept data, the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next >occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handchake, etc and traffic >flows again (NFS starts responding)
>>
>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we’ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
>>
>> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
> Interesting story and good work w.r.t. sluething, Youssef, thanks.
>
> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
> (They're just waiting for RPC requests.)
> However, I do now think I know why the soclose() does not happen.
> When the TCP connection is assigned as a backchannel, that takes a reference
> cnt on the structure. This refcnt won't be released until the connection is
> replaced by a BindConnectiotoSession operation from the client. But that won't
> happen until the client creates a new TCP connection.
> --> No refcnt release-->no refcnt of 0-->no soclose().
>
> I've created the attached patch (completely different from the previous one)
> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
> connection is going away. This seems to get it past CLOSE_WAIT without a
> soclose().
> --> I know you are not comfortable with patching your server, but I do think
> this change will get the socket shutdown to complete.
>
> There are a couple more things you can check on the server...
> # nfsstat -E -s
> --> Look for the count under "BindConnToSes".
> --> If non-zero, backchannels have been assigned
> # sysctl -a | fgrep request_space_throttle_count
> --> If non-zero, the server has been overloaded at some point.
>
> I think the attached patch might work around the problem.
> The code that should open up the receive window needs to be checked.
> I am also looking at enabling the 6minute timeout when a backchannel is
> assigned.
>
> rick
>
> Youssef
>
> <xprtdied.patch>
>
> <nfs-hang.log.gz>
>

Scheffenegger, Richard
2021-04-04 11:50:47 UTC
Permalink
For what it's worth, SUSE found two bugs, in the Linux nf_conntrack (stateful firewall) and in the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.

One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them).

The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
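
If someone wants to rule the scheduler issue out on a Linux client,
something like this shows which qdisc is active and swaps it for a
different one (just a sketch; eth0 is a placeholder and fq_codel is
only one possible alternative):

# show the queueing discipline on the client interface
tc qdisc show dev eth0
# temporarily switch to a different qdisc and see if the stall goes away
tc qdisc replace dev eth0 root fq_codel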

I can try getting the relevant bug info next week...

________________________________
Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
Gesendet: Friday, April 2, 2021 11:31:01 PM
An: ***@freebsd.org <***@freebsd.org>
Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
Betreff: Re: NFS Mount Hangs





***@freebsd.org wrote:
>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> I hope you don't mind a top post...
>> I've been testing network partitioning between the only Linux client
>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>> applied to it.
>>
>> I'm not enough of a TCP guy to know if this is useful, but here's what
>> I see...
>>
>> While partitioned:
>> On the FreeBSD server end, the socket either goes to CLOSED during
>> the network partition or stays ESTABLISHED.
>If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>sent a FIN, but you never called close() on the socket.
>If the socket stays in ESTABLISHED, there is no communication ongoing,
>I guess, and therefore the server does not even detect that the peer
>is not reachable.
>> On the Linux end, the socket seems to remain ESTABLISHED for a
>> little while, and then disappears.
>So how does Linux detect the peer is not reachable?
Well, here's what I see in a packet capture in the Linux client once
I partition it (just unplug the net cable):
- lots of retransmits of the same segment (with ACK) for 54sec
- then only ARP queries

Once I plug the net cable back in:
- ARP works
- one more retransmit of the same segement
- receives RST from FreeBSD
** So, is this now a "new" TCP connection, despite
using the same port#.
--> It matters for NFS, since "new connection"
implies "must retry all outstanding RPCs".
- sends SYN
- receives SYN, ACK from FreeBSD
--> connection starts working again
Always uses same port#.

On the FreeBSD server end:
- receives the last retransmit of the segment (with ACK)
- sends RST
- receives SYN
- sends SYN, ACK

I thought that there was no RST in the capture I looked at
yesterday, so I'm not sure if FreeBSD always sends an RST,
but the Linux client behaviour was the same. (Sent a SYN, etc).
The socket disappears from the Linux "netstat -a" and I
suspect that happens after about 54sec, but I am not sure
about the timing.

>>
>> After unpartitioning:
>> On the FreeBSD server end, you get another socket showing up at
>> the same port#
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>
>> The Linux client shows the same connection ESTABLISHED.
But disappears from "netstat -a" for a while during the partitioning.

>> (The mount sometimes reports an error. I haven't looked at packet
>> traces to see if it retries RPCs or why the errors occur.)
I have now done so, as above.

>> --> However I never get hangs.
>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>> mount starts working again.
>>
>> The most obvious thing is that the Linux client always keeps using
>> the same port#. (The FreeBSD client will use a different port# when
>> it does a TCP reconnect after no response from the NFS server for
>> a little while.)
>>
>> What do those TCP conversant think?
>I guess you are you are never calling close() on the socket, for with
>the connection state is CLOSED.
Ok, that makes sense. For this case the Linux client has not done a
BindConnectionToSession to re-assign the back channel.
I'll have to bug them about this. However, I'll bet they'll answer
that I have to tell them the back channel needs re-assignment
or something like that.

I am pretty certain they are broken, in that the client needs to
retry all outstanding RPCs.

For others, here's the long winded version of this that I just
put on the phabricator review:
In the server side kernel RPC, the socket (struct socket *) is in a
structure called SVCXPRT (normally pointed to by "xprt").
These structures a ref counted and the soclose() is done
when the ref. cnt goes to zero. My understanding is that
"struct socket *" is free'd by soclose() so this cannot be done
before the xprt ref. cnt goes to zero.

For NFSv4.1/4.2 there is something called a back channel
which means that a "xprt" is used for server->client RPCs,
although the TCP connection is established by the client
to the server.
--> This back channel holds a ref cnt on "xprt" until the

client re-assigns it to a different TCP connection
via an operation called BindConnectionToSession
and the Linux client is not doing this soon enough,
it appears.

So, the soclose() is delayed, which is why I think the
TCP connection gets stuck in CLOSE_WAIT and that is
why I've added the soshutdown(..SHUT_WR) calls,
which can happen before the client gets around to
re-assigning the back channel.

Thanks for your help with this Michael, rick

Best regards
Michael
>
> rick
> ps: I can capture packets while doing this, if anyone has a use
> for them.
>
>
>
>
>
>
> ________________________________________
> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
> Sent: Saturday, March 27, 2021 6:57 PM
> To: Jason Breitman
> Cc: Rick Macklem; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
>
>
> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>
> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
> # ifconfig lagg0
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>
> We can also say that the sysctl settings did not resolve this issue.
>
> # sysctl net.inet.tcp.fast_finwait2_recycle=1
> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>
> # sysctl net.inet.tcp.finwait2_timeout=1000
> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>
> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>
> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
> This specifies how many seconds to wait for a final FIN
> packet before the socket is forcibly closed. This is
> strictly a violation of the TCP specification, but
> required to prevent denial-of-service attacks. In Linux
> 2.2, the default value was 180.
>
> So I don’t get why it stucks in the FIN_WAIT2 state anyway.
>
> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>
> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>
> The issue occurred after 5 days following a reboot of the client machines.
> I ran the capture information again to make use of the situation.
>
> #!/bin/sh
>
> while true
> do
> /bin/date >> /tmp/nfs-hang.log
> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
> /bin/sleep 60
> done
>
>
> On the NFS Server
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>
> On the NFS Client
> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>
>
>
> You had also asked for the output below.
>
> # nfsstat -E -s
> BackChannelCtBindConnToSes
> 0 0
>
> # sysctl vfs.nfsd.request_space_throttle_count
> vfs.nfsd.request_space_throttle_count: 0
>
> I see that you are testing a patch and I look forward to seeing the results.
>
>
> Jason Breitman
>
>
> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>
> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>> Hi Jason,
>>
>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>
>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>
>>> Issue
>>> NFSv4 mounts periodically hang on the NFS Client.
>>>
>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>> Rebooting the NFS Client appears to be the only solution.
>>
>> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there >own nfsd)
> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
> not the kernel based RPC and nfsd in FreeBSD.
>
>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>
>> - Data flows correctly between SERVER and the CLIENT
>> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>> - The client (eager to send data) can only ack data sent by SERVER.
>> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers.
>> - SERVER responds with a TCP Zero Window to those probes.
> Having the window size drop to zero is not necessarily incorrect.
> If the server is overloaded (has a backlog of NFS requests), it can stop doing
> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
> closes). This results in "backpressure" to stop the NFS client from flooding the
> NFS server with requests.
> --> However, once the backlog is handled, the nfsd should start to soreceive()
> again and this shouls cause the window to open back up.
> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
> tcp_output() when it decides what to do about the rcvwin.
>
>> - After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
> This probably does not happen for Jason's case, since the 6minute timeout
> is disabled when the TCP connection is assigned as a backchannel (most likely
> the case for NFSv4.1).
>
>> - CLIENT ACK that FIN.
>> - SERVER goes in FIN_WAIT_2 state
>> - CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>> - FIN is never sent by the client since there still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush it's buffers >and terminate its side of the socket.
>> - SERVER keeps responding with a TCP Zero Window to those probes.
>> => The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session.
>>
>> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we’ll end up in the same >state as you I think.
>>
>> We’ve never had the root cause of why the SERVER decided to close the TCP >Window and no more acccept data, the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next >occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handchake, etc and traffic >flows again (NFS starts responding)
>>
>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we’ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
>>
>> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
> Interesting story and good work w.r.t. sluething, Youssef, thanks.
>
> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
> (They're just waiting for RPC requests.)
> However, I do now think I know why the soclose() does not happen.
> When the TCP connection is assigned as a backchannel, that takes a reference
> cnt on the structure. This refcnt won't be released until the connection is
> replaced by a BindConnectiotoSession operation from the client. But that won't
> happen until the client creates a new TCP connection.
> --> No refcnt release-->no refcnt of 0-->no soclose().
>
> I've created the attached patch (completely different from the previous one)
> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
> connection is going away. This seems to get it past CLOSE_WAIT without a
> soclose().
> --> I know you are not comfortable with patching your server, but I do think
> this change will get the socket shutdown to complete.
>
> There are a couple more things you can check on the server...
> # nfsstat -E -s
> --> Look for the count under "BindConnToSes".
> --> If non-zero, backchannels have been assigned
> # sysctl -a | fgrep request_space_throttle_count
> --> If non-zero, the server has been overloaded at some point.
>
> I think the attached patch might work around the problem.
> The code that should open up the receive window needs to be checked.
> I am also looking at enabling the 6minute timeout when a backchannel is
> assigned.
>
> rick
>
> Youssef
>
> <xprtdied.patch>
>
> <nfs-hang.log.gz>
>

Rick Macklem
2021-04-04 15:27:15 UTC
Permalink
Well, I'm going to cheat and top post, since this is related info and
not really part of the discussion...

I've been testing network partitioning between a Linux client (5.2 kernel)
and a FreeBSD-current NFS server. I have not gotten a solid hang, but
I have had the Linux client doing "battle" with the FreeBSD server for
several minutes after un-partitioning the connection.

The battle basically consists of the Linux client sending an RST, followed
by a SYN.
The FreeBSD server ignores the RST and just replies with the same old ack.
--> This varies from "just a SYN" that succeeds to 100+ cycles of the above
over several minutes.

I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
pretty good at ignoring it.

A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
in case anyone wants to look at it.
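
For anyone looking at it, the RST/SYN "battle" can be pulled out of
that capture with a standard read filter, e.g. (just a sketch, using
the path above):

tcpdump -n -r /home/rmacklem/linuxtofreenfs.pcap \
    'tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'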

Here's a tcpdump snippet of the interesting part (see the *** comments):
19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
*** Network is now partitioned...

19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
*** Lots of lines snipped.


19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
*** Network is now unpartitioned...

19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
*** This "battle" goes on for 223sec...
I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
"FreeBSD replies with same old ACK". In another test run I saw this
cycle continue non-stop for several minutes. This time, the Linux
client paused for a while (see ARPs below).

19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
*** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.

19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
*** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
of 13 (100+ for another test run).

19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
*** Now back in business...

19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063

This error 10063 after the partition heals is also "bad news". It indicates the Session
(which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
suspect a Linux client bug, but will be investigating further.

So, hopefully TCP-conversant folks can confirm whether the above is correct
behaviour or whether the RST should be ack'd sooner.

I could also see this becoming a "forever" TCP battle for other versions of the Linux client.

rick


________________________________________
From: Scheffenegger, Richard <***@netapp.com>
Sent: Sunday, April 4, 2021 7:50 AM
To: Rick Macklem; ***@freebsd.org
Cc: Youssef GHORBAL; freebsd-***@freebsd.org
Subject: Re: NFS Mount Hangs



For what it‘s worth, suse found two bugs in the linux nfconntrack (stateful firewall), and pfifo-fast scheduler, which could conspire to make tcp sessions hang forever.

One is a missed updaten when the cöient is not using the noresvport moint option, which makes tje firewall think rsts are illegal (and drop them);

The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and lower cpu cost that pfq (default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...

I can try getting the relevant bug info next week...

________________________________
Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
Gesendet: Friday, April 2, 2021 11:31:01 PM
An: ***@freebsd.org <***@freebsd.org>
Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
Betreff: Re: NFS Mount Hangs





***@freebsd.org wrote:
>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> I hope you don't mind a top post...
>> I've been testing network partitioning between the only Linux client
>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>> applied to it.
>>
>> I'm not enough of a TCP guy to know if this is useful, but here's what
>> I see...
>>
>> While partitioned:
>> On the FreeBSD server end, the socket either goes to CLOSED during
>> the network partition or stays ESTABLISHED.
>If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>sent a FIN, but you never called close() on the socket.
>If the socket stays in ESTABLISHED, there is no communication ongoing,
>I guess, and therefore the server does not even detect that the peer
>is not reachable.
>> On the Linux end, the socket seems to remain ESTABLISHED for a
>> little while, and then disappears.
>So how does Linux detect the peer is not reachable?
Well, here's what I see in a packet capture in the Linux client once
I partition it (just unplug the net cable):
- lots of retransmits of the same segment (with ACK) for 54sec
- then only ARP queries

Once I plug the net cable back in:
- ARP works
- one more retransmit of the same segement
- receives RST from FreeBSD
** So, is this now a "new" TCP connection, despite
using the same port#.
--> It matters for NFS, since "new connection"
implies "must retry all outstanding RPCs".
- sends SYN
- receives SYN, ACK from FreeBSD
--> connection starts working again
Always uses same port#.

On the FreeBSD server end:
- receives the last retransmit of the segment (with ACK)
- sends RST
- receives SYN
- sends SYN, ACK

I thought that there was no RST in the capture I looked at
yesterday, so I'm not sure if FreeBSD always sends an RST,
but the Linux client behaviour was the same. (Sent a SYN, etc).
The socket disappears from the Linux "netstat -a" and I
suspect that happens after about 54sec, but I am not sure
about the timing.

>>
>> After unpartitioning:
>> On the FreeBSD server end, you get another socket showing up at
>> the same port#
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>
>> The Linux client shows the same connection ESTABLISHED.
But disappears from "netstat -a" for a while during the partitioning.

>> (The mount sometimes reports an error. I haven't looked at packet
>> traces to see if it retries RPCs or why the errors occur.)
I have now done so, as above.

>> --> However I never get hangs.
>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>> mount starts working again.
>>
>> The most obvious thing is that the Linux client always keeps using
>> the same port#. (The FreeBSD client will use a different port# when
>> it does a TCP reconnect after no response from the NFS server for
>> a little while.)
>>
>> What do those TCP conversant think?
>I guess you are you are never calling close() on the socket, for with
>the connection state is CLOSED.
Ok, that makes sense. For this case the Linux client has not done a
BindConnectionToSession to re-assign the back channel.
I'll have to bug them about this. However, I'll bet they'll answer
that I have to tell them the back channel needs re-assignment
or something like that.

I am pretty certain they are broken, in that the client needs to
retry all outstanding RPCs.

For others, here's the long winded version of this that I just
put on the phabricator review:
In the server side kernel RPC, the socket (struct socket *) is in a
structure called SVCXPRT (normally pointed to by "xprt").
These structures a ref counted and the soclose() is done
when the ref. cnt goes to zero. My understanding is that
"struct socket *" is free'd by soclose() so this cannot be done
before the xprt ref. cnt goes to zero.

For NFSv4.1/4.2 there is something called a back channel
which means that a "xprt" is used for server->client RPCs,
although the TCP connection is established by the client
to the server.
--> This back channel holds a ref cnt on "xprt" until the

client re-assigns it to a different TCP connection
via an operation called BindConnectionToSession
and the Linux client is not doing this soon enough,
it appears.

So, the soclose() is delayed, which is why I think the
TCP connection gets stuck in CLOSE_WAIT and that is
why I've added the soshutdown(..SHUT_WR) calls,
which can happen before the client gets around to
re-assigning the back channel.

Thanks for your help with this Michael, rick

Best regards
Michael
>
> rick
> ps: I can capture packets while doing this, if anyone has a use
> for them.
>
>
>
>
>
>
> ________________________________________
> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
> Sent: Saturday, March 27, 2021 6:57 PM
> To: Jason Breitman
> Cc: Rick Macklem; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
>
>
> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>
> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
> # ifconfig lagg0
> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>
> We can also say that the sysctl settings did not resolve this issue.
>
> # sysctl net.inet.tcp.fast_finwait2_recycle=1
> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>
> # sysctl net.inet.tcp.finwait2_timeout=1000
> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>
> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>
> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
> This specifies how many seconds to wait for a final FIN
> packet before the socket is forcibly closed. This is
> strictly a violation of the TCP specification, but
> required to prevent denial-of-service attacks. In Linux
> 2.2, the default value was 180.
>
> So I don’t get why it stucks in the FIN_WAIT2 state anyway.
>
> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>
> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>
> The issue occurred after 5 days following a reboot of the client machines.
> I ran the capture information again to make use of the situation.
>
> #!/bin/sh
>
> while true
> do
> /bin/date >> /tmp/nfs-hang.log
> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
> /bin/sleep 60
> done
>
>
> On the NFS Server
> Active Internet connections (including servers)
> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>
> On the NFS Client
> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>
>
>
> You had also asked for the output below.
>
> # nfsstat -E -s
> BackChannelCtBindConnToSes
> 0 0
>
> # sysctl vfs.nfsd.request_space_throttle_count
> vfs.nfsd.request_space_throttle_count: 0
>
> I see that you are testing a patch and I look forward to seeing the results.
>
>
> Jason Breitman
>
>
> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>
> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>> Hi Jason,
>>
>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>
>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>
>>> Issue
>>> NFSv4 mounts periodically hang on the NFS Client.
>>>
>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>> Rebooting the NFS Client appears to be the only solution.
>>
>> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there >own nfsd)
> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
> not the kernel based RPC and nfsd in FreeBSD.
>
>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>
>> - Data flows correctly between SERVER and the CLIENT
>> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>> - The client (eager to send data) can only ack data sent by SERVER.
>> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers.
>> - SERVER responds with a TCP Zero Window to those probes.
> Having the window size drop to zero is not necessarily incorrect.
> If the server is overloaded (has a backlog of NFS requests), it can stop doing
> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
> closes). This results in "backpressure" to stop the NFS client from flooding the
> NFS server with requests.
> --> However, once the backlog is handled, the nfsd should start to soreceive()
> again and this shouls cause the window to open back up.
> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
> tcp_output() when it decides what to do about the rcvwin.
>
>> - After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
> This probably does not happen for Jason's case, since the 6minute timeout
> is disabled when the TCP connection is assigned as a backchannel (most likely
> the case for NFSv4.1).
>
>> - CLIENT ACK that FIN.
>> - SERVER goes in FIN_WAIT_2 state
>> - CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>> - FIN is never sent by the client since there still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush it's buffers >and terminate its side of the socket.
>> - SERVER keeps responding with a TCP Zero Window to those probes.
>> => The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session.
>>
>> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we’ll end up in the same >state as you I think.
>>
>> We’ve never had the root cause of why the SERVER decided to close the TCP >Window and no more acccept data, the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next >occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handchake, etc and traffic >flows again (NFS starts responding)
>>
>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we’ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
>>
>> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
> Interesting story and good work w.r.t. sluething, Youssef, thanks.
>
> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
> (They're just waiting for RPC requests.)
> However, I do now think I know why the soclose() does not happen.
> When the TCP connection is assigned as a backchannel, that takes a reference
> cnt on the structure. This refcnt won't be released until the connection is
> replaced by a BindConnectiotoSession operation from the client. But that won't
> happen until the client creates a new TCP connection.
> --> No refcnt release-->no refcnt of 0-->no soclose().
>
> I've created the attached patch (completely different from the previous one)
> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
> connection is going away. This seems to get it past CLOSE_WAIT without a
> soclose().
> --> I know you are not comfortable with patching your server, but I do think
> this change will get the socket shutdown to complete.
>
> There are a couple more things you can check on the server...
> # nfsstat -E -s
> --> Look for the count under "BindConnToSes".
> --> If non-zero, backchannels have been assigned
> # sysctl -a | fgrep request_space_throttle_count
> --> If non-zero, the server has been overloaded at some point.
>
> I think the attached patch might work around the problem.
> The code that should open up the receive window needs to be checked.
> I am also looking at enabling the 6minute timeout when a backchannel is
> assigned.
>
> rick
>
> Youssef
>
> <xprtdied.patch>
>
> <nfs-hang.log.gz>
>

t***@freebsd.org
2021-04-04 16:41:46 UTC
Permalink
> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>
> Well, I'm going to cheat and top post, since this is elated info. and
> not really part of the discussion...
>
> I've been testing network partitioning between a Linux client (5.2 kernel)
> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
> I have had the Linux client doing "battle" with the FreeBSD server for
> several minutes after un-partitioning the connection.
>
> The battle basically consists of the Linux client sending an RST, followed
> by a SYN.
> The FreeBSD server ignores the RST and just replies with the same old ack.
> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
> over several minutes.
>
> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
> pretty good at ignoring it.
>
> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
> in case anyone wants to look at it.
On freefall? I would like to take a look at it...

Best regards
Michael
>
> Here's a tcpdump snippet of the interesting part (see the *** comments):
> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
> *** Network is now partitioned...
>
> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> *** Lots of lines snipped.
>
>
> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> *** Network is now unpartitioned...
>
> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
> *** This "battle" goes on for 223sec...
> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
> "FreeBSD replies with same old ACK". In another test run I saw this
> cycle continue non-stop for several minutes. This time, the Linux
> client paused for a while (see ARPs below).
>
> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>
> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
> of 13 (100+ for another test run).
>
> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
> *** Now back in business...
>
> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>
> This error 10063 (NFS4ERR_SEQ_MISORDERED) after the partition heals is also "bad news". It indicates
> the Session (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
> suspect a Linux client bug, but will be investigating further.
>
> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
> or if the RST should be ack'd sooner?
>
> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>
> rick
>
>
> ________________________________________
> From: Scheffenegger, Richard <***@netapp.com>
> Sent: Sunday, April 4, 2021 7:50 AM
> To: Rick Macklem; ***@freebsd.org
> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>
> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>
> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
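>
> (To check which qdisc an interface is actually using, and whether a last packet is sitting in its backlog, something like this works; eth0 is just a placeholder, and -s adds the backlog/drop counters:)
>
> # tc qdisc show dev eth0
> # tc -s qdisc show dev eth0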
>
> I can try getting the relevant bug info next week...
>
> ________________________________
> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
> Gesendet: Friday, April 2, 2021 11:31:01 PM
> An: ***@freebsd.org <***@freebsd.org>
> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
> Betreff: Re: NFS Mount Hangs
>
>
>
>
>
> ***@freebsd.org wrote:
>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>
>>> I hope you don't mind a top post...
>>> I've been testing network partitioning between the only Linux client
>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>> applied to it.
>>>
>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>> I see...
>>>
>>> While partitioned:
>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>> the network partition or stays ESTABLISHED.
>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>> sent a FIN, but you never called close() on the socket.
>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>> I guess, and therefore the server does not even detect that the peer
>> is not reachable.
>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>> little while, and then disappears.
>> So how does Linux detect the peer is not reachable?
> Well, here's what I see in a packet capture in the Linux client once
> I partition it (just unplug the net cable):
> - lots of retransmits of the same segment (with ACK) for 54sec
> - then only ARP queries
>
> Once I plug the net cable back in:
> - ARP works
> - one more retransmit of the same segment
> - receives RST from FreeBSD
> ** So, is this now a "new" TCP connection, despite
> using the same port#.
> --> It matters for NFS, since "new connection"
> implies "must retry all outstanding RPCs".
> - sends SYN
> - receives SYN, ACK from FreeBSD
> --> connection starts working again
> Always uses same port#.
>
> On the FreeBSD server end:
> - receives the last retransmit of the segment (with ACK)
> - sends RST
> - receives SYN
> - sends SYN, ACK
>
> I thought that there was no RST in the capture I looked at
> yesterday, so I'm not sure if FreeBSD always sends an RST,
> but the Linux client behaviour was the same. (Sent a SYN, etc).
> The socket disappears from the Linux "netstat -a" and I
> suspect that happens after about 54sec, but I am not sure
> about the timing.
>
>>>
>>> After unpartitioning:
>>> On the FreeBSD server end, you get another socket showing up at
>>> the same port#
>>> Active Internet connections (including servers)
>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>
>>> The Linux client shows the same connection ESTABLISHED.
> But disappears from "netstat -a" for a while during the partitioning.
>
>>> (The mount sometimes reports an error. I haven't looked at packet
>>> traces to see if it retries RPCs or why the errors occur.)
> I have now done so, as above.
>
>>> --> However I never get hangs.
>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>> mount starts working again.
>>>
>>> The most obvious thing is that the Linux client always keeps using
>>> the same port#. (The FreeBSD client will use a different port# when
>>> it does a TCP reconnect after no response from the NFS server for
>>> a little while.)
>>>
>>> What do those TCP conversant think?
>> I guess you are never calling close() on the socket for which
>> the connection state is CLOSED.
> Ok, that makes sense. For this case the Linux client has not done a
> BindConnectionToSession to re-assign the back channel.
> I'll have to bug them about this. However, I'll bet they'll answer
> that I have to tell them the back channel needs re-assignment
> or something like that.
>
> I am pretty certain they are broken, in that the client needs to
> retry all outstanding RPCs.
>
> For others, here's the long winded version of this that I just
> put on the phabricator review:
> In the server side kernel RPC, the socket (struct socket *) is in a
> structure called SVCXPRT (normally pointed to by "xprt").
> These structures are ref counted and the soclose() is done
> when the ref. cnt goes to zero. My understanding is that
> "struct socket *" is free'd by soclose() so this cannot be done
> before the xprt ref. cnt goes to zero.
>
> For NFSv4.1/4.2 there is something called a back channel
> which means that a "xprt" is used for server->client RPCs,
> although the TCP connection is established by the client
> to the server.
> --> This back channel holds a ref cnt on "xprt" until the
> client re-assigns it to a different TCP connection
> via an operation called BindConnectionToSession
> and the Linux client is not doing this soon enough,
> it appears.
>
> So, the soclose() is delayed, which is why I think the
> TCP connection gets stuck in CLOSE_WAIT and that is
> why I've added the soshutdown(..SHUT_WR) calls,
> which can happen before the client gets around to
> re-assigning the back channel.
>
> Thanks for your help with this Michael, rick
>
> Best regards
> Michael
>>
>> rick
>> ps: I can capture packets while doing this, if anyone has a use
>> for them.
>>
>>
>>
>>
>>
>>
>> ________________________________________
>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>> Sent: Saturday, March 27, 2021 6:57 PM
>> To: Jason Breitman
>> Cc: Rick Macklem; freebsd-***@freebsd.org
>> Subject: Re: NFS Mount Hangs
>>
>>
>>
>>
>>
>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com> wrote:
>>
>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>> # ifconfig lagg0
>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>
>> We can also say that the sysctl settings did not resolve this issue.
>>
>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>
>> # sysctl net.inet.tcp.finwait2_timeout=1000
>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>
>> I don’t think those will do anything in your case since the FIN_WAIT2 sockets are on the client side and those sysctls are for BSD.
>> By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>
>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>> This specifies how many seconds to wait for a final FIN
>> packet before the socket is forcibly closed. This is
>> strictly a violation of the TCP specification, but
>> required to prevent denial-of-service attacks. In Linux
>> 2.2, the default value was 180.
>>
>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>
>> You really need to have a packet capture during the outage (client and server side) so you’ll get the over-the-wire chat and can start speculating from there.
>> No need to capture the beginning of the outage for now. All you have to do is run a tcpdump for 10 minutes or so when you notice a client stuck.
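>>
>> Roughly something like this on the stuck client would do (just a sketch; eth0 and NFS_SERVER are placeholders for the client interface and the server address):
>>
>> #!/bin/sh
>> # Sketch only: confirm the stuck FIN_WAIT2 socket, then capture ~10 minutes
>> # of NFS traffic. Adjust NFS_SERVER and eth0 for your setup.
>> NFS_SERVER=NFS.Server.IP.X
>>
>> # Linux normally recycles FIN_WAIT2 after net.ipv4.tcp_fin_timeout seconds.
>> sysctl net.ipv4.tcp_fin_timeout
>>
>> # List client sockets stuck in FIN_WAIT2 towards the NFS server.
>> ss -tan state fin-wait-2 "dst $NFS_SERVER"
>>
>> # Capture the NFS connection (full packets) for 10 minutes.
>> timeout 600 tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap \
>>     host "$NFS_SERVER" and port 2049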
>>
>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>
>> The issue occurred after 5 days following a reboot of the client machines.
>> I ran the capture information again to make use of the situation.
>>
>> #!/bin/sh
>> # Once a minute, log the nfsd process/thread states and the kernel stack
>> # traces of the nfsd processes (PIDs 2947 and 2944 on this server).
>>
>> while true
>> do
>> /bin/date >> /tmp/nfs-hang.log
>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>> /bin/sleep 60
>> done
>>
>>
>> On the NFS Server
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>
>> On the NFS Client
>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>
>>
>>
>> You had also asked for the output below.
>>
>> # nfsstat -E -s
>> BackChannelCt BindConnToSes
>> 0 0
>>
>> # sysctl vfs.nfsd.request_space_throttle_count
>> vfs.nfsd.request_space_throttle_count: 0
>>
>> I see that you are testing a patch and I look forward to seeing the results.
>>
>>
>> Jason Breitman
>>
>>
>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Youssef GHORBAL <***@pasteur.fr> wrote:
>>> Hi Jason,
>>>
>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>>>>
>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>
>>>> Issue
>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>
>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>> Rebooting the NFS Client appears to be the only solution.
>>>
>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>> not the kernel based RPC and nfsd in FreeBSD.
>>
>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>
>>> - Data flows correctly between SERVER and the CLIENT
>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>> - The client (eager to send data) can only ack data sent by SERVER.
>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>> - SERVER responds with a TCP Zero Window to those probes.
>> Having the window size drop to zero is not necessarily incorrect.
>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>> closes). This results in "backpressure" to stop the NFS client from flooding the
>> NFS server with requests.
>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>> again and this should cause the window to open back up.
>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>> tcp_output() when it decides what to do about the rcvwin.
>>
>>> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (and still a TCP Window of 0)
>> This probably does not happen for Jason's case, since the 6minute timeout
>> is disabled when the TCP connection is assigned as a backchannel (most likely
>> the case for NFSv4.1).
>>
>>> - CLIENT ACK that FIN.
>>> - SERVER goes in FIN_WAIT_2 state
>>> - CLIENT closes its half of the socket and goes into the LAST_ACK state.
>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>
>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we’d end up in the same state as you, I think.
>>>
>>> We’ve never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc. and traffic flows again (NFS starts responding)
>>>
>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
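>>>
>>> A rough sketch of that check script (not the exact one we ran; $nfs_server_addr is a placeholder and the LAST_ACK detection via ss is simplified):
>>>
>>> #!/bin/sh
>>> # Sketch: when a socket towards the NFS server is stuck in LAST_ACK, install
>>> # the iptables rule described above to force a TCP RST, then remove it again
>>> # once the LAST_ACK socket has disappeared.
>>> nfs_server_addr=NFS.Server.IP.X
>>>
>>> while sleep 10; do
>>>     local_port=$(ss -tan | awk -v srv="$nfs_server_addr" \
>>>         '$1 == "LAST-ACK" && index($5, srv) { split($4, a, ":"); print a[length(a)]; exit }')
>>>     [ -z "$local_port" ] && continue
>>>     rule="-p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset"
>>>     iptables -A OUTPUT $rule
>>>     # Wait until the LAST_ACK socket is gone, then clean up the rule.
>>>     while ss -tan | grep LAST-ACK | grep -q ":$local_port "; do sleep 5; done
>>>     iptables -D OUTPUT $rule
>>> done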
>>>
>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>
>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>> (They're just waiting for RPC requests.)
>> However, I do now think I know why the soclose() does not happen.
>> When the TCP connection is assigned as a backchannel, that takes a reference
>> cnt on the structure. This refcnt won't be released until the connection is
>> replaced by a BindConnectionToSession operation from the client. But that won't
>> happen until the client creates a new TCP connection.
>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>
>> I've created the attached patch (completely different from the previous one)
>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>> connection is going away. This seems to get it past CLOSE_WAIT without a
>> soclose().
>> --> I know you are not comfortable with patching your server, but I do think
>> this change will get the socket shutdown to complete.
>>
>> There are a couple more things you can check on the server...
>> # nfsstat -E -s
>> --> Look for the count under "BindConnToSes".
>> --> If non-zero, backchannels have been assigned
>> # sysctl -a | fgrep request_space_throttle_count
>> --> If non-zero, the server has been overloaded at some point.
>>
>> I think the attached patch might work around the problem.
>> The code that should open up the receive window needs to be checked.
>> I am also looking at enabling the 6minute timeout when a backchannel is
>> assigned.
>>
>> rick
>>
>> Youssef
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>> <xprtdied.patch>
>>
>> <nfs-hang.log.gz>
>>
Rick Macklem
2021-04-04 20:28:59 UTC
Permalink
Oops, yes the packet capture is on freefall (forgot to mention that;-).
You should be able to:
% fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap

Some useful packet #s are:
1949 - partitioning starts
2005 - partition healed
2060 - last RST
2067 - SYN -> gets going again

This was taken at the Linux end. I have FreeBSD end too, although I
don't think it tells you anything more.

Have fun with it, rick
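
If you just want a quick look without firing up wireshark, something like this should work (assuming your tcpdump is new enough to have the -# option, which prints each packet's number; since nothing is filtered out, those numbers match the ones above):

% fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
% tcpdump -# -n -tttt -r linuxtofreenfs.pcap > linuxtofreenfs.txt

Then search the text dump for the packet numbers listed above.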


________________________________________
From: ***@freebsd.org <***@freebsd.org>
Sent: Sunday, April 4, 2021 12:41 PM
To: Rick Macklem
Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
Subject: Re: NFS Mount Hangs



On freefall? I would like to take a look at it...

Best regards
Michael
t***@freebsd.org
2021-04-04 22:08:40 UTC
Permalink
> On 4. Apr 2021, at 22:28, Rick Macklem <***@uoguelph.ca> wrote:
>
> Oops, yes the packet capture is on freefall (forgot to mention that;-).
> You should be able to:
> % fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
>
> Some useful packet #s are:
> 1949 - partitioning starts
> 2005 - partition healed
> 2060 - last RST
> 2067 - SYN -> gets going again
>
> This was taken at the Linux end. I have FreeBSD end too, although I
> don't think it tells you anything more.
Hi Rick,

I would like to look at the FreeBSD side, too. Do you also know what
the state of the TCP connection was when the SYN / ACK / RST game was
going on?
I would like to understand why the reestablishment of the connection
did not work...
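
If you can reproduce it, even a crude loop like the following on the FreeBSD server while the battle is going on would show that (just a sketch, assuming the standard nfsd port 2049):

#!/bin/sh
# Sketch: log the server side TCP state of the NFS connection once a second.
while true; do
    date
    netstat -an -p tcp | grep '\.2049 '
    sleep 1
done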

Best regards
Michael
>
> Have fun with it, rick
>
>
> ________________________________________
> From: ***@freebsd.org <***@freebsd.org>
> Sent: Sunday, April 4, 2021 12:41 PM
> To: Rick Macklem
> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Well, I'm going to cheat and top post, since this is related info. and
>> not really part of the discussion...
>>
>> I've been testing network partitioning between a Linux client (5.2 kernel)
>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>> I have had the Linux client doing "battle" with the FreeBSD server for
>> several minutes after un-partitioning the connection.
>>
>> The battle basically consists of the Linux client sending an RST, followed
>> by a SYN.
>> The FreeBSD server ignores the RST and just replies with the same old ack.
>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>> over several minutes.
>>
>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>> pretty good at ignoring it.
>>
>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>> in case anyone wants to look at it.
> On freefall? I would like to take a look at it...
>
> Best regards
> Michael
>>
>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>> *** Network is now partitioned...
>>
>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> *** Lots of lines snipped.
>>
>>
>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> *** Network is now unpartitioned...
>>
>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>> *** This "battle" goes on for 223sec...
>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>> "FreeBSD replies with same old ACK". In another test run I saw this
>> cycle continue non-stop for several minutes. This time, the Linux
>> client paused for a while (see ARPs below).
>>
>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>
>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>> of 13 (100+ for another test run).
>>
>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>> *** Now back in business...
>>
>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>
>> This error 10063 after the partition heals is also "bad news". It indicates that the Session
>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>> suspect a Linux client bug, but will be investigating further.
>>
>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>> or if the RST should be ack'd sooner?
>>
>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>
>> rick
>>
>>
>> ________________________________________
>> From: Scheffenegger, Richard <***@netapp.com>
>> Sent: Sunday, April 4, 2021 7:50 AM
>> To: Rick Macklem; ***@freebsd.org
>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>> Subject: Re: NFS Mount Hangs
>>
>>
>>
>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>
>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>
>> The pfifo-fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>
>> I can try getting the relevant bug info next week...
>>
>> ________________________________
>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>> An: ***@freebsd.org <***@freebsd.org>
>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>> Betreff: Re: NFS Mount Hangs
>>
>>
>>
>>
>>
>> ***@freebsd.org wrote:
>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> I hope you don't mind a top post...
>>>> I've been testing network partitioning between the only Linux client
>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>> applied to it.
>>>>
>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>> I see...
>>>>
>>>> While partitioned:
>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>> the network partition or stays ESTABLISHED.
>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>> sent a FIN, but you never called close() on the socket.
>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>> I guess, and therefore the server does not even detect that the peer
>>> is not reachable.
>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>> little while, and then disappears.
>>> So how does Linux detect the peer is not reachable?
>> Well, here's what I see in a packet capture in the Linux client once
>> I partition it (just unplug the net cable):
>> - lots of retransmits of the same segment (with ACK) for 54sec
>> - then only ARP queries
>>
>> Once I plug the net cable back in:
>> - ARP works
>> - one more retransmit of the same segment
>> - receives RST from FreeBSD
>> ** So, is this now a "new" TCP connection, despite
>> using the same port#.
>> --> It matters for NFS, since "new connection"
>> implies "must retry all outstanding RPCs".
>> - sends SYN
>> - receives SYN, ACK from FreeBSD
>> --> connection starts working again
>> Always uses same port#.
>>
>> On the FreeBSD server end:
>> - receives the last retransmit of the segment (with ACK)
>> - sends RST
>> - receives SYN
>> - sends SYN, ACK
>>
>> I thought that there was no RST in the capture I looked at
>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>> The socket disappears from the Linux "netstat -a" and I
>> suspect that happens after about 54sec, but I am not sure
>> about the timing.
>>
>>>>
>>>> After unpartitioning:
>>>> On the FreeBSD server end, you get another socket showing up at
>>>> the same port#
>>>> Active Internet connections (including servers)
>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>
>>>> The Linux client shows the same connection ESTABLISHED.
>> But disappears from "netstat -a" for a while during the partitioning.
>>
>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>> traces to see if it retries RPCs or why the errors occur.)
>> I have now done so, as above.
>>
>>>> --> However I never get hangs.
>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>> mount starts working again.
>>>>
>>>> The most obvious thing is that the Linux client always keeps using
>>>> the same port#. (The FreeBSD client will use a different port# when
>>>> it does a TCP reconnect after no response from the NFS server for
>>>> a little while.)
>>>>
>>>> What do those TCP conversant think?
>>> I guess you are never calling close() on the socket for which
>>> the connection state is CLOSED.
>> Ok, that makes sense. For this case the Linux client has not done a
>> BindConnectionToSession to re-assign the back channel.
>> I'll have to bug them about this. However, I'll bet they'll answer
>> that I have to tell them the back channel needs re-assignment
>> or something like that.
>>
>> I am pretty certain they are broken, in that the client needs to
>> retry all outstanding RPCs.
>>
>> For others, here's the long winded version of this that I just
>> put on the phabricator review:
>> In the server side kernel RPC, the socket (struct socket *) is in a
>> structure called SVCXPRT (normally pointed to by "xprt").
>> These structures are ref counted and the soclose() is done
>> when the ref. cnt goes to zero. My understanding is that
>> "struct socket *" is free'd by soclose() so this cannot be done
>> before the xprt ref. cnt goes to zero.
>>
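To make the reference counting concrete, here is a rough user-space analogue (the struct and field names are invented for illustration, with close(2) standing in for the kernel's soclose(); this is not the actual krpc code):

#include <stdlib.h>
#include <unistd.h>

/* Illustrative stand-in for the kernel's SVCXPRT; field names are made up. */
struct xprt {
	int refcnt;	/* how many holders still reference this transport */
	int sockfd;	/* user-space analogue of the "struct socket *" */
};

/* Dropping a reference: the close (kernel: soclose()) can only happen
 * once the count reaches zero, so an extra reference held by the back
 * channel delays the close indefinitely. */
static void
xprt_release(struct xprt *xp)
{
	if (--xp->refcnt == 0) {
		close(xp->sockfd);	/* kernel: soclose() frees the socket */
		free(xp);
	}
}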
>> For NFSv4.1/4.2 there is something called a back channel
>> which means that a "xprt" is used for server->client RPCs,
>> although the TCP connection is established by the client
>> to the server.
>> --> This back channel holds a ref cnt on "xprt" until the
>> client re-assigns it to a different TCP connection
>> via an operation called BindConnectionToSession
>> and the Linux client is not doing this soon enough,
>> it appears.
>>
>> So, the soclose() is delayed, which is why I think the
>> TCP connection gets stuck in CLOSE_WAIT and that is
>> why I've added the soshutdown(..SHUT_WR) calls,
>> which can happen before the client gets around to
>> re-assigning the back channel.
>>
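For reference, shutdown(2) is the user-level counterpart of the kernel's soshutdown(), so the idea behind those calls can be sketched like this (an illustration only, not the patch itself):

#include <sys/socket.h>
#include <stdio.h>

/* When the connection is known to be going away but the final close
 * must wait for the back channel to drop its reference, shutting down
 * the send side at least gets our FIN on the wire so the connection
 * can progress out of CLOSE_WAIT instead of sitting there forever. */
static void
give_up_sending(int sockfd)
{
	if (shutdown(sockfd, SHUT_WR) == -1)
		perror("shutdown(SHUT_WR)");
	/* The descriptor (kernel: the xprt reference) is still held;
	 * the real close() happens later, when the last reference drops. */
}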
>> Thanks for your help with this Michael, rick
>>
>> Best regards
>> Michael
>>>
>>> rick
>>> ps: I can capture packets while doing this, if anyone has a use
>>> for them.
>>>
>>>
>>>
>>>
>>>
>>>
>>> ________________________________________
>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>> Sent: Saturday, March 27, 2021 6:57 PM
>>> To: Jason Breitman
>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>>
>>>
>>>
>>>
>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>
>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>> # ifconfig lagg0
>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>
>>> We can also say that the sysctl settings did not resolve this issue.
>>>
>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>
>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>
>>> I don't think those will do anything in your case since the FIN_WAIT2 sockets are on the client side and those sysctls are for FreeBSD.
>>> By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout).
>>>
>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>> This specifies how many seconds to wait for a final FIN
>>> packet before the socket is forcibly closed. This is
>>> strictly a violation of the TCP specification, but
>>> required to prevent denial-of-service attacks. In Linux
>>> 2.2, the default value was 180.
>>>
>>> So I don't get why it gets stuck in the FIN_WAIT2 state anyway.
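As a quick sanity check on the Linux client, that setting can also be read programmatically; a minimal example (it just reads the same value as "sysctl net.ipv4.tcp_fin_timeout"):

#include <stdio.h>

int
main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_fin_timeout", "r");
	int secs;

	if (f == NULL || fscanf(f, "%d", &secs) != 1) {
		perror("tcp_fin_timeout");
		return 1;
	}
	fclose(f);
	printf("FIN_WAIT2 sockets should be recycled after %d seconds\n", secs);
	return 0;
}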
>>>
>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>
>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>
>>> The issue occurred after 5 days following a reboot of the client machines.
>>> I ran the capture information again to make use of the situation.
>>>
>>> #!/bin/sh
>>>
>>> while true
>>> do
>>> /bin/date >> /tmp/nfs-hang.log
>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>> /bin/sleep 60
>>> done
>>>
>>>
>>> On the NFS Server
>>> Active Internet connections (including servers)
>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>
>>> On the NFS Client
>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>
>>>
>>>
>>> You had also asked for the output below.
>>>
>>> # nfsstat -E -s
>>> BackChannelCt BindConnToSes
>>> 0 0
>>>
>>> # sysctl vfs.nfsd.request_space_throttle_count
>>> vfs.nfsd.request_space_throttle_count: 0
>>>
>>> I see that you are testing a patch and I look forward to seeing the results.
>>>
>>>
>>> Jason Breitman
>>>
>>>
>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>
>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>> Hi Jason,
>>>>
>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>
>>>>> Issue
>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>
>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>
>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd).
>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>> not the kernel based RPC and nfsd in FreeBSD.
>>>
>>>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>>>
>>>> - Data flows correctly between SERVER and the CLIENT
>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>> - SERVER responds with a TCP Zero Window to those probes.
>>> Having the window size drop to zero is not necessarily incorrect.
>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>> NFS server with requests.
>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>> again and this should cause the window to open back up.
>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>> tcp_output() when it decides what to do about the rcvwin.
>>>
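A user-space analogue of that backpressure mechanism, just to illustrate it (recv(2) standing in for soreceive(); none of this is the actual nfsd code):

#include <sys/types.h>
#include <sys/socket.h>
#include <stdbool.h>
#include <unistd.h>

/* While the server is overloaded it simply stops reading.  The socket
 * receive buffer then fills and TCP advertises a shrinking window,
 * eventually zero.  Once the backlog clears, reading resumes and the
 * advertised window should open back up. */
static void
service_loop(int sockfd, bool (*overloaded)(void))
{
	char buf[8192];
	ssize_t n;

	for (;;) {
		if (overloaded()) {
			sleep(1);		/* no recv(): window stays closed */
			continue;
		}
		n = recv(sockfd, buf, sizeof(buf), 0);	/* kernel: soreceive() */
		if (n <= 0)
			break;			/* peer closed or error */
		/* ... hand the request bytes to an nfsd worker ... */
	}
}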
>>>> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (and still a TCP Window of 0)
>>> This probably does not happen for Jason's case, since the 6minute timeout
>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>> the case for NFSv4.1).
>>>
>>>> - CLIENT ACKs that FIN.
>>>> - SERVER goes into the FIN_WAIT_2 state.
>>>> - CLIENT closes its half of the socket and goes into the LAST_ACK state.
>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>
>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we'd end up in the same state as you, I think.
>>>>
>>>> We've never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data. The fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding).
>>>>
>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we've added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears).
>>>>
>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>
>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>> (They're just waiting for RPC requests.)
>>> However, I do now think I know why the soclose() does not happen.
>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>> cnt on the structure. This refcnt won't be released until the connection is
>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>> happen until the client creates a new TCP connection.
>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>
>>> I've created the attached patch (completely different from the previous one)
>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>> soclose().
>>> --> I know you are not comfortable with patching your server, but I do think
>>> this change will get the socket shutdown to complete.
>>>
>>> There are a couple more things you can check on the server...
>>> # nfsstat -E -s
>>> --> Look for the count under "BindConnToSes".
>>> --> If non-zero, backchannels have been assigned
>>> # sysctl -a | fgrep request_space_throttle_count
>>> --> If non-zero, the server has been overloaded at some point.
>>>
>>> I think the attached patch might work around the problem.
>>> The code that should open up the receive window needs to be checked.
>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>> assigned.
>>>
>>> rick
>>>
>>> Youssef
>>>
>>> <xprtdied.patch>
>>>
>>> <nfs-hang.log.gz>
>>>
>>
>
Rick Macklem
2021-04-04 23:12:49 UTC
Permalink
***@freebsd.org wrote:
>> On 4. Apr 2021, at 22:28, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Oops, yes the packet capture is on freefall (forgot to mention that;-).
>> You should be able to:
>> % fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
>>
>> Some useful packet #s are:
>> 1949 - partitioning starts
>> 2005 - partition healed
>> 2060 - last RST
>> 2067 - SYN -> gets going again
>>
>> This was taken at the Linux end. I have FreeBSD end too, although I
>> don't think it tells you anything more.
>Hi Rick,
>
>I would like to look at the FreeBSD side, too.
fetch https://people.freebsd.org/~rmacklem/freetolinuxnfs.pcap

>Do you also know, what
>the state of the TCP connection was when the SYN / ACK / RST game was
>going on?
Just ESTABLISHED when the battle goes on.
And it happens when the Send-Q is 0.
(If the Send-Q is not empty, it finds its way to CLOSED.)

If I wait long enough before healing the partition, it will
go to FIN_WAIT_1, and then if I plug it back in, it does not
do battle (at least not for long).

Btw, I have one running now that seems stuck really good.
It has been 20minutes since I plugged the net cable back in.
(Unfortunately, I didn't have tcpdump running until after
I saw it was not progressing after healing.)
--> There is one difference. There was a 6minute timeout
enabled on the server krpc for "no activity", which is
now disabled like it is for NFSv4.1 in freebsd-current.
I had forgotten to re-disable it.
So, when it does battle, it might have been the 6minute
timeout, which would then do the soshutdown(..SHUT_WR)
which kept it from getting "stuck" forever.
-->This time I had to reboot the FreeBSD NFS server to
get the Linux client unstuck, so this one looked a lot
like what has been reported.
The pcap for this one, started after the network was plugged
back in and I noticed it was stuck for quite a while is here:
fetch https://people.freebsd.org/~rmacklem/stuck.pcap

In it, there is just a bunch of RST followed by SYN sent
from client->FreeBSD and FreeBSD just keeps sending
acks for the old segment back.
--> It looks like FreeBSD did the "RST, ACK" after the
krpc did a soshutdown(..SHUT_WR) on the socket,
for the one you've been looking at.
I'll test some more...

>I would like to understand why the reestablishment of the connection
>did not work...
It is looking like it takes either a non-empty send-q or a
soshutdown(..SHUT_WR) to get the FreeBSD socket
out of established, where it just ignores the RSTs and
SYN packets.

Thanks for looking at it, rick

Best regards
Michael
>
> Have fun with it, rick
>
>
> ________________________________________
> From: ***@freebsd.org <***@freebsd.org>
> Sent: Sunday, April 4, 2021 12:41 PM
> To: Rick Macklem
> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
>
>
>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> Well, I'm going to cheat and top post, since this is related info. and
>> not really part of the discussion...
>>
>> I've been testing network partitioning between a Linux client (5.2 kernel)
>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>> I have had the Linux client doing "battle" with the FreeBSD server for
>> several minutes after un-partitioning the connection.
>>
>> The battle basically consists of the Linux client sending an RST, followed
>> by a SYN.
>> The FreeBSD server ignores the RST and just replies with the same old ack.
>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>> over several minutes.
>>
>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>> pretty good at ignoring it.
>>
>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>> in case anyone wants to look at it.
> On freefall? I would like to take a look at it...
>
> Best regards
> Michael
>>
>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>> *** Network is now partitioned...
>>
>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> *** Lots of lines snipped.
>>
>>
>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> *** Network is now unpartitioned...
>>
>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>> *** This "battle" goes on for 223sec...
>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>> "FreeBSD replies with same old ACK". In another test run I saw this
>> cycle continue non-stop for several minutes. This time, the Linux
>> client paused for a while (see ARPs below).
>>
>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>
>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>> of 13 (100+ for another test run).
>>
>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>> *** Now back in business...
>>
>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>
>> This error 10063 after the partition heals is also "bad news". It indicates that the Session
>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>> suspect a Linux client bug, but will be investigating further.
>>
>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>> or if the RST should be ack'd sooner?
>>
>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>
>> rick
>>
>>
>> ________________________________________
>> From: Scheffenegger, Richard <***@netapp.com>
>> Sent: Sunday, April 4, 2021 7:50 AM
>> To: Rick Macklem; ***@freebsd.org
>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>> Subject: Re: NFS Mount Hangs
>>
>>
>>
>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>
>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>
>> The pfifo-fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>
>> I can try getting the relevant bug info next week...
>>
>> ________________________________
>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>> An: ***@freebsd.org <***@freebsd.org>
>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>> Betreff: Re: NFS Mount Hangs
>>
>>
>>
>>
>>
>> ***@freebsd.org wrote:
>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> I hope you don't mind a top post...
>>>> I've been testing network partitioning between the only Linux client
>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>> applied to it.
>>>>
>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>> I see...
>>>>
>>>> While partitioned:
>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>> the network partition or stays ESTABLISHED.
>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>> sent a FIN, but you never called close() on the socket.
>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>> I guess, and therefore the server does not even detect that the peer
>>> is not reachable.
>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>> little while, and then disappears.
>>> So how does Linux detect the peer is not reachable?
>> Well, here's what I see in a packet capture in the Linux client once
>> I partition it (just unplug the net cable):
>> - lots of retransmits of the same segment (with ACK) for 54sec
>> - then only ARP queries
>>
>> Once I plug the net cable back in:
>> - ARP works
>> - one more retransmit of the same segment
>> - receives RST from FreeBSD
>> ** So, is this now a "new" TCP connection, despite
>> using the same port#.
>> --> It matters for NFS, since "new connection"
>> implies "must retry all outstanding RPCs".
>> - sends SYN
>> - receives SYN, ACK from FreeBSD
>> --> connection starts working again
>> Always uses same port#.
>>
>> On the FreeBSD server end:
>> - receives the last retransmit of the segment (with ACK)
>> - sends RST
>> - receives SYN
>> - sends SYN, ACK
>>
>> I thought that there was no RST in the capture I looked at
>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>> The socket disappears from the Linux "netstat -a" and I
>> suspect that happens after about 54sec, but I am not sure
>> about the timing.
>>
>>>>
>>>> After unpartitioning:
>>>> On the FreeBSD server end, you get another socket showing up at
>>>> the same port#
>>>> Active Internet connections (including servers)
>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>
>>>> The Linux client shows the same connection ESTABLISHED.
>> But disappears from "netstat -a" for a while during the partitioning.
>>
>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>> traces to see if it retries RPCs or why the errors occur.)
>> I have now done so, as above.
>>
>>>> --> However I never get hangs.
>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>> mount starts working again.
>>>>
>>>> The most obvious thing is that the Linux client always keeps using
>>>> the same port#. (The FreeBSD client will use a different port# when
>>>> it does a TCP reconnect after no response from the NFS server for
>>>> a little while.)
>>>>
>>>> What do those TCP conversant think?
>>> I guess you are never calling close() on the socket for which
>>> the connection state is CLOSED.
>> Ok, that makes sense. For this case the Linux client has not done a
>> BindConnectionToSession to re-assign the back channel.
>> I'll have to bug them about this. However, I'll bet they'll answer
>> that I have to tell them the back channel needs re-assignment
>> or something like that.
>>
>> I am pretty certain they are broken, in that the client needs to
>> retry all outstanding RPCs.
>>
>> For others, here's the long winded version of this that I just
>> put on the phabricator review:
>> In the server side kernel RPC, the socket (struct socket *) is in a
>> structure called SVCXPRT (normally pointed to by "xprt").
>> These structures are ref counted and the soclose() is done
>> when the ref. cnt goes to zero. My understanding is that
>> "struct socket *" is free'd by soclose() so this cannot be done
>> before the xprt ref. cnt goes to zero.
>>
>> For NFSv4.1/4.2 there is something called a back channel
>> which means that a "xprt" is used for server->client RPCs,
>> although the TCP connection is established by the client
>> to the server.
>> --> This back channel holds a ref cnt on "xprt" until the
>> client re-assigns it to a different TCP connection
>> via an operation called BindConnectionToSession
>> and the Linux client is not doing this soon enough,
>> it appears.
>>
>> So, the soclose() is delayed, which is why I think the
>> TCP connection gets stuck in CLOSE_WAIT and that is
>> why I've added the soshutdown(..SHUT_WR) calls,
>> which can happen before the client gets around to
>> re-assigning the back channel.
>>
>> Thanks for your help with this Michael, rick
>>
>> Best regards
>> Michael
>>>
>>> rick
>>> ps: I can capture packets while doing this, if anyone has a use
>>> for them.
>>>
>>>
>>>
>>>
>>>
>>>
>>> ________________________________________
>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>> Sent: Saturday, March 27, 2021 6:57 PM
>>> To: Jason Breitman
>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>>
>>>
>>>
>>>
>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>
>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>> # ifconfig lagg0
>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>
>>> We can also say that the sysctl settings did not resolve this issue.
>>>
>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>
>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>
>>> I don't think those will do anything in your case since the FIN_WAIT2 sockets are on the client side and those sysctls are for FreeBSD.
>>> By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout).
>>>
>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>> This specifies how many seconds to wait for a final FIN
>>> packet before the socket is forcibly closed. This is
>>> strictly a violation of the TCP specification, but
>>> required to prevent denial-of-service attacks. In Linux
>>> 2.2, the default value was 180.
>>>
>>> So I don't get why it gets stuck in the FIN_WAIT2 state anyway.
>>>
>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>
>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>
>>> The issue occurred after 5 days following a reboot of the client machines.
>>> I ran the capture information again to make use of the situation.
>>>
>>> #!/bin/sh
>>>
>>> while true
>>> do
>>> /bin/date >> /tmp/nfs-hang.log
>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>> /bin/sleep 60
>>> done
>>>
>>>
>>> On the NFS Server
>>> Active Internet connections (including servers)
>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>
>>> On the NFS Client
>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>
>>>
>>>
>>> You had also asked for the output below.
>>>
>>> # nfsstat -E -s
>>> BackChannelCt BindConnToSes
>>> 0 0
>>>
>>> # sysctl vfs.nfsd.request_space_throttle_count
>>> vfs.nfsd.request_space_throttle_count: 0
>>>
>>> I see that you are testing a patch and I look forward to seeing the results.
>>>
>>>
>>> Jason Breitman
>>>
>>>
>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>
>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>> Hi Jason,
>>>>
>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>
>>>>> Issue
>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>
>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>
>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd).
>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>> not the kernel based RPC and nfsd in FreeBSD.
>>>
>>>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>>>
>>>> - Data flows correctly between SERVER and the CLIENT
>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>> - SERVER responds with a TCP Zero Window to those probes.
>>> Having the window size drop to zero is not necessarily incorrect.
>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>> NFS server with requests.
>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>> again and this should cause the window to open back up.
>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>> tcp_output() when it decides what to do about the rcvwin.
>>>
>>>> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (and still a TCP Window of 0)
>>> This probably does not happen for Jason's case, since the 6minute timeout
>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>> the case for NFSv4.1).
>>>
>>>> - CLIENT ACKs that FIN.
>>>> - SERVER goes into the FIN_WAIT_2 state.
>>>> - CLIENT closes its half of the socket and goes into the LAST_ACK state.
>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>
>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we'd end up in the same state as you, I think.
>>>>
>>>> We've never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data. The fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding).
>>>>
>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we've added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears).
>>>>
>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>
>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>> (They're just waiting for RPC requests.)
>>> However, I do now think I know why the soclose() does not happen.
>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>> cnt on the structure. This refcnt won't be released until the connection is
>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>> happen until the client creates a new TCP connection.
>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>
>>> I've created the attached patch (completely different from the previous one)
>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>> soclose().
>>> --> I know you are not comfortable with patching your server, but I do think
>>> this change will get the socket shutdown to complete.
>>>
>>> There are a couple more things you can check on the server...
>>> # nfsstat -E -s
>>> --> Look for the count under "BindConnToSes".
>>> --> If non-zero, backchannels have been assigned
>>> # sysctl -a | fgrep request_space_throttle_count
>>> --> If non-zero, the server has been overloaded at some point.
>>>
>>> I think the attached patch might work around the problem.
>>> The code that should open up the receive window needs to be checked.
>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>> assigned.
>>>
>>> rick
>>>
>>> Youssef
>>>
>>> <xprtdied.patch>
>>>
>>> <nfs-hang.log.gz>
>>>
>>
>
t***@freebsd.org
2021-04-05 14:46:18 UTC
Permalink
> On 5. Apr 2021, at 01:12, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>>> On 4. Apr 2021, at 22:28, Rick Macklem <***@uoguelph.ca> wrote:
>>>
>>> Oops, yes the packet capture is on freefall (forgot to mention that;-).
>>> You should be able to:
>>> % fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
>>>
>>> Some useful packet #s are:
>>> 1949 - partitioning starts
>>> 2005 - partition healed
>>> 2060 - last RST
>>> 2067 - SYN -> gets going again
>>>
>>> This was taken at the Linux end. I have FreeBSD end too, although I
>>> don't think it tells you anything more.
>> Hi Rick,
>>
>> I would like to look at the FreeBSD side, too.
> fetch https://people.freebsd.org/~rmacklem/freetolinuxnfs.pcap
>
>> Do you also know, what
>> the state of the TCP connection was when the SYN / ACK / RST game was
>> going on?
> Just ESTABLISHED when the battle goes on.
> And it happens when the Send-Q is 0.
> (If the Send-Q is not empty, it finds its way to CLOSED.)
OK. What is the FreeBSD version you are using?

It seems that the TCP connection on the FreeBSD side is still alive,
but Linux has decided to start a new TCP connection using the old
port numbers. So it sends a SYN. The response is a challenge ACK
and Linux responds with a RST. This looks good so far. However,
FreeBSD should accept the RST and kill the TCP connection. The
next SYN from the Linux side would establish a new TCP connection.

So I'm wondering why the RST is not accepted. I made the timestamp
checking stricter but introduced a bug where RST segments without
timestamps were ignored. This was fixed.

Introduced in main on 2020/11/09:
https://svnweb.freebsd.org/changeset/base/367530
Introduced in stable/12 on 2020/11/30:
https://svnweb.freebsd.org/changeset/base/36818
Fix in main on 2021/01/13:
https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
Fix in stable/12 on 2021/01/24:
https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
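To spell out the kind of check involved (a simplified sketch of the behaviour being described, not the actual FreeBSD tcp_input() code):

#include <stdbool.h>
#include <stdint.h>

#define TH_RST 0x04

/* Stricter timestamp checking can treat a segment without a timestamp
 * option (on a connection that negotiated timestamps) as invalid.  The
 * bug was applying that to RST segments too; many stacks send bare RSTs
 * without timestamps, so those must not be ignored. */
static bool
drop_segment(bool conn_uses_timestamps, bool has_tsopt, uint8_t th_flags)
{
	if (conn_uses_timestamps && !has_tsopt) {
		if (th_flags & TH_RST)
			return false;	/* accept the RST; this is what the fix restores */
		return true;		/* other segments missing timestamps get dropped */
	}
	return false;			/* timestamp present (or not in use): no drop here */
}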

Are you using a version which is affected by this bug?

Best regards
Michael
>
> If I wait long enough before healing the partition, it will
> go to FIN_WAIT_1, and then if I plug it back in, it does not
> do battle (at least not for long).
>
> Btw, I have one running now that seems stuck really good.
> It has been 20minutes since I plugged the net cable back in.
> (Unfortunately, I didn't have tcpdump running until after
> I saw it was not progressing after healing.)
> --> There is one difference. There was a 6minute timeout
> enabled on the server krpc for "no activity", which is
> now disabled like it is for NFSv4.1 in freebsd-current.
> I had forgotten to re-disable it.
> So, when it does battle, it might have been the 6minute
> timeout, which would then do the soshutdown(..SHUT_WR)
> which kept it from getting "stuck" forever.
> -->This time I had to reboot the FreeBSD NFS server to
> get the Linux client unstuck, so this one looked a lot
> like what has been reported.
> The pcap for this one, started after the network was plugged
> back in and I noticed it was stuck for quite a while is here:
> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>
> In it, there is just a bunch of RST followed by SYN sent
> from client->FreeBSD and FreeBSD just keeps sending
> acks for the old segment back.
> --> It looks like FreeBSD did the "RST, ACK" after the
> krpc did a soshutdown(..SHUT_WR) on the socket,
> for the one you've been looking at.
> I'll test some more...
>
>> I would like to understand why the reestablishment of the connection
>> did not work...
> It is looking like it takes either a non-empty send-q or a
> soshutdown(..SHUT_WR) to get the FreeBSD socket
> out of established, where it just ignores the RSTs and
> SYN packets.
>
> Thanks for looking at it, rick
>
> Best regards
> Michael
>>
>> Have fun with it, rick
>>
>>
>> ________________________________________
>> From: ***@freebsd.org <***@freebsd.org>
>> Sent: Sunday, April 4, 2021 12:41 PM
>> To: Rick Macklem
>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>> Subject: Re: NFS Mount Hangs
>>
>>
>>
>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>
>>> Well, I'm going to cheat and top post, since this is elated info. and
>>> not really part of the discussion...
>>>
>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>> several minutes after un-partitioning the connection.
>>>
>>> The battle basically consists of the Linux client sending an RST, followed
>>> by a SYN.
>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>> over several minutes.
>>>
>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>> pretty good at ignoring it.
>>>
>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>> in case anyone wants to look at it.
>> On freefall? I would like to take a look at it...
>>
>> Best regards
>> Michael
>>>
>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>> *** Network is now partitioned...
>>>
>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>> *** Lots of lines snipped.
>>>
>>>
>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> *** Network is now unpartitioned...
>>>
>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>> *** This "battle" goes on for 223sec...
>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>> cycle continue non-stop for several minutes. This time, the Linux
>>> client paused for a while (see ARPs below).
>>>
>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>
>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>> of 13 (100+ for another test run).
>>>
>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>> *** Now back in business...
>>>
>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>
>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>> suspect a Linux client bug, but will be investigating further.
>>>
>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>> or if the RST should be ack'd sooner?
>>>
>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>
>>> rick
>>>
>>>
>>> ________________________________________
>>> From: Scheffenegger, Richard <***@netapp.com>
>>> Sent: Sunday, April 4, 2021 7:50 AM
>>> To: Rick Macklem; ***@freebsd.org
>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>>
>>>
>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>
>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>
>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at a lower cpu cost than pfq (the default)). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>
>>> I can try getting the relevant bug info next week...
>>>
>>> ________________________________
>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>> An: ***@freebsd.org <***@freebsd.org>
>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>> Betreff: Re: NFS Mount Hangs
>>>
>>>
>>>
>>>
>>>
>>> ***@freebsd.org wrote:
>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>
>>>>> I hope you don't mind a top post...
>>>>> I've been testing network partitioning between the only Linux client
>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>> applied to it.
>>>>>
>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>> I see...
>>>>>
>>>>> While partitioned:
>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>> the network partition or stays ESTABLISHED.
>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>> sent a FIN, but you never called close() on the socket.
>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>> I guess, and therefore the server does not even detect that the peer
>>>> is not reachable.
>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>> little while, and then disappears.
>>>> So how does Linux detect the peer is not reachable?
>>> Well, here's what I see in a packet capture in the Linux client once
>>> I partition it (just unplug the net cable):
>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>> - then only ARP queries
>>>
>>> Once I plug the net cable back in:
>>> - ARP works
>>> - one more retransmit of the same segment
>>> - receives RST from FreeBSD
>>> ** So, is this now a "new" TCP connection, despite
>>> using the same port#.
>>> --> It matters for NFS, since "new connection"
>>> implies "must retry all outstanding RPCs".
>>> - sends SYN
>>> - receives SYN, ACK from FreeBSD
>>> --> connection starts working again
>>> Always uses same port#.
>>>
>>> On the FreeBSD server end:
>>> - receives the last retransmit of the segment (with ACK)
>>> - sends RST
>>> - receives SYN
>>> - sends SYN, ACK
>>>
>>> I thought that there was no RST in the capture I looked at
>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>> The socket disappears from the Linux "netstat -a" and I
>>> suspect that happens after about 54sec, but I am not sure
>>> about the timing.
>>>
>>>>>
>>>>> After unpartitioning:
>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>> the same port#
>>>>> Active Internet connections (including servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>
>>>>> The Linux client shows the same connection ESTABLISHED.
>>> But disappears from "netstat -a" for a while during the partitioning.
>>>
>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>> traces to see if it retries RPCs or why the errors occur.)
>>> I have now done so, as above.
>>>
>>>>> --> However I never get hangs.
>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>> mount starts working again.
>>>>>
>>>>> The most obvious thing is that the Linux client always keeps using
>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>> a little while.)
>>>>>
>>>>> What do those TCP conversant think?
>>>> I guess you are never calling close() on the socket for which
>>>> the connection state is CLOSED.
>>> Ok, that makes sense. For this case the Linux client has not done a
>>> BindConnectionToSession to re-assign the back channel.
>>> I'll have to bug them about this. However, I'll bet they'll answer
>>> that I have to tell them the back channel needs re-assignment
>>> or something like that.
>>>
>>> I am pretty certain they are broken, in that the client needs to
>>> retry all outstanding RPCs.
>>>
>>> For others, here's the long winded version of this that I just
>>> put on the phabricator review:
>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>> structure called SVCXPRT (normally pointed to by "xprt").
>>> These structures are ref counted and the soclose() is done
>>> when the ref. cnt goes to zero. My understanding is that
>>> "struct socket *" is free'd by soclose() so this cannot be done
>>> before the xprt ref. cnt goes to zero.
>>>
>>> For NFSv4.1/4.2 there is something called a back channel
>>> which means that a "xprt" is used for server->client RPCs,
>>> although the TCP connection is established by the client
>>> to the server.
>>> --> This back channel holds a ref cnt on "xprt" until the
>>> client re-assigns it to a different TCP connection
>>> via an operation called BindConnectionToSession
>>> and the Linux client is not doing this soon enough,
>>> it appears.
>>>
>>> So, the soclose() is delayed, which is why I think the
>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>> why I've added the soshutdown(..SHUT_WR) calls,
>>> which can happen before the client gets around to
>>> re-assigning the back channel.
>>>
>>> Thanks for your help with this Michael, rick
>>>
>>> Best regards
>>> Michael
>>>>
>>>> rick
>>>> ps: I can capture packets while doing this, if anyone has a use
>>>> for them.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________
>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>> To: Jason Breitman
>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>
>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>> # ifconfig lagg0
>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>
>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>
>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>
>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>
>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>> By the way it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>
>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>> This specifies how many seconds to wait for a final FIN
>>>> packet before the socket is forcibly closed. This is
>>>> strictly a violation of the TCP specification, but
>>>> required to prevent denial-of-service attacks. In Linux
>>>> 2.2, the default value was 180.
>>>>
>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>
>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
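(For example, something along these lines on either end; the interface name em0 and the port 2049 filter below are assumptions, adjust them to match your setup:)

# capture full packets to a file while the mount is stuck (sketch)
tcpdump -i em0 -s 0 -w /tmp/nfs-stuck.pcap host NFS.Server.IP.X and port 2049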
>>>>
>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>
>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>> I ran the capture information again to make use of the situation.
>>>>
>>>> #!/bin/sh
>>>>
>>>> while true
>>>> do
>>>> /bin/date >> /tmp/nfs-hang.log
>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>> /bin/sleep 60
>>>> done
>>>>
>>>>
>>>> On the NFS Server
>>>> Active Internet connections (including servers)
>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>
>>>> On the NFS Client
>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>
>>>>
>>>>
>>>> You had also asked for the output below.
>>>>
>>>> # nfsstat -E -s
>>>> BackChannelCtBindConnToSes
>>>> 0 0
>>>>
>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>
>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>
>>>>
>>>> Jason Breitman
>>>>
>>>>
>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>
>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>> Hi Jason,
>>>>>
>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>
>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>
>>>>>> Issue
>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>
>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>
>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>
>>>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>
>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>> Having the window size drop to zero is not necessarily incorrect.
>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>> NFS server with requests.
>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>> again and this should cause the window to open back up.
>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>
>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection sending a FIN Packet (and still a TCP Window 0)
>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>> the case for NFSv4.1).
>>>>
>>>>> - CLIENT ACK that FIN.
>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>> - CLIENT closes its half of the socket and goes in LAST_ACK state.
>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>
>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we’d end up in the same state as you, I think.
>>>>>
>>>>> We’ve never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc. and traffic flows again (NFS starts responding)
>>>>>
>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and through an iptables rule enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
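(A rough sketch of such a check script, based on the rule quoted above; detecting LAST_ACK with netstat/awk and the 60 second poll interval are assumptions, not the actual Isilon-side script:)

#!/bin/sh
# Sketch: when a LAST_ACK socket towards the NFS server is found, add an
# iptables rule that answers further probes from that local port with a TCP RST.
# (Removing the rule again once LAST_ACK disappears is left out for brevity.)
nfs_server_addr=NFS.Server.IP.X
while true
do
        for local_port in `netstat -tn | awk -v srv="$nfs_server_addr" \
            '$6 == "LAST_ACK" && $5 ~ srv {split($4, a, ":"); print a[2]}'`
        do
                iptables -C OUTPUT -p tcp -d $nfs_server_addr --sport $local_port \
                    -j REJECT --reject-with tcp-reset 2>/dev/null ||
                iptables -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port \
                    -j REJECT --reject-with tcp-reset
        done
        sleep 60
done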
>>>>>
>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>
>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>> (They're just waiting for RPC requests.)
>>>> However, I do now think I know why the soclose() does not happen.
>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>> happen until the client creates a new TCP connection.
>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>
>>>> I've created the attached patch (completely different from the previous one)
>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>> soclose().
>>>> --> I know you are not comfortable with patching your server, but I do think
>>>> this change will get the socket shutdown to complete.
>>>>
>>>> There are a couple more things you can check on the server...
>>>> # nfsstat -E -s
>>>> --> Look for the count under "BindConnToSes".
>>>> --> If non-zero, backchannels have been assigned
>>>> # sysctl -a | fgrep request_space_throttle_count
>>>> --> If non-zero, the server has been overloaded at some point.
>>>>
>>>> I think the attached patch might work around the problem.
>>>> The code that should open up the receive window needs to be checked.
>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>> assigned.
>>>>
>>>> rick
>>>>
>>>> Youssef
>>>>
>>>> <xprtdied.patch>
>>>>
>>>> <nfs-hang.log.gz>
>>>>
>>>
>>
>
Rick Macklem
2021-04-05 23:24:04 UTC
Permalink
***@freebsd.org wrote:
[stuff snipped]
>OK. What is the FreeBSD version you are using?
main Dec. 23, 2020.

>
>It seems that the TCP connection on the FreeBSD is still alive,
>Linux has decided to start a new TCP connection using the old
>port numbers. So it sends a SYN. The response is a challenge ACK
>and Linux responds with a RST. This looks good so far. However,
>FreeBSD should accept the RST and kill the TCP connection. The
>next SYN from the Linux side would establish a new TCP connection.
>
>So I'm wondering why the RST is not accepted. I made the timestamp
>checking stricter but introduced a bug where RST segments without
>timestamps were ignored. This was fixed.
>
>Introduced in main on 2020/11/09:
> https://svnweb.freebsd.org/changeset/base/367530
>Introduced in stable/12 on 2020/11/30:
> https://svnweb.freebsd.org/changeset/base/36818
>Fix in main on 2021/01/13:
> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>Fix in stable/12 on 2021/01/24:
> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>
>Are you using a version which is affected by this bug?
I was. Now I've applied the patch.
Bad News. It did not fix the problem.
It still gets into an endless "ignore RST" and stays established when
the Send-Q is empty.

If the Send-Q is non-empty when I partition, it recovers fine,
sometimes not even needing to see an RST.
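(In case someone else wants to reproduce this, a quick way to watch the server-side socket state and Send-Q while the partition is in place is a small loop like the one below; matching on port 2049 is an assumption based on the netstat output earlier in this thread.)

#!/bin/sh
# poll the NFS connection state and Send-Q on the FreeBSD server (sketch)
while true
do
        /bin/date
        netstat -an -p tcp | grep '\.2049 '
        sleep 5
done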

rick
ps: If you think there might be other recent changes that matter,
just say the word and I'll upgrade to bits de jur.

rick

> Best regards
> Michael
>
> If I wait long enough before healing the partition, it will
> go to FIN_WAIT_1, and then if I plug it back in, it does not
> do battle (at least not for long).
>
> Btw, I have one running now that seems stuck really good.
> It has been 20minutes since I plugged the net cable back in.
> (Unfortunately, I didn't have tcpdump running until after
> I saw it was not progressing after healing.
> --> There is one difference. There was a 6minute timeout
> enabled on the server krpc for "no activity", which is
> now disabled like it is for NFSv4.1 in freebsd-current.
> I had forgotten to re-disable it.
> So, when it does battle, it might have been the 6minute
> timeout, which would then do the soshutdown(..SHUT_WR)
> which kept it from getting "stuck" forever.
> -->This time I had to reboot the FreeBSD NFS server to
> get the Linux client unstuck, so this one looked a lot
> like what has been reported.
> The pcap for this one, started after the network was plugged
> back in and I noticed it was stuck for quite a while is here:
> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>
> In it, there is just a bunch of RST followed by SYN sent
> from client->FreeBSD and FreeBSD just keeps sending
> acks for the old segment back.
> --> It looks like FreeBSD did the "RST, ACK" after the
> krpc did a soshutdown(..SHUT_WR) on the socket,
> for the one you've been looking at.
> I'll test some more...
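(To pick the RST/SYN battle out of a capture like that, a read filter along these lines should work; the filename matches the fetch above and the filter expression is just a sketch:)

# show only RST and SYN segments from the saved capture (sketch)
tcpdump -n -r stuck.pcap 'tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'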
>
>> I would like to understand why the reestablishment of the connection
>> did not work...
> It is looking like it takes either a non-empty send-q or a
> soshutdown(..SHUT_WR) to get the FreeBSD socket
> out of established, where it just ignores the RSTs and
> SYN packets.
>
> Thanks for looking at it, rick
>
> Best regards
> Michael
>> [stuff snipped]

t***@freebsd.org
2021-04-06 07:12:40 UTC
Permalink
> On 6. Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
> [stuff snipped]
>> OK. What is the FreeBSD version you are using?
> main Dec. 23, 2020.
>
>>
>> It seems that the TCP connection on the FreeBSD is still alive,
>> Linux has decided to start a new TCP connection using the old
>> port numbers. So it sends a SYN. The response is a challenge ACK
>> and Linux responds with a RST. This looks good so far. However,
>> FreeBSD should accept the RST and kill the TCP connection. The
>> next SYN from the Linux side would establish a new TCP connection.
>>
>> So I'm wondering why the RST is not accepted. I made the timestamp
>> checking stricter but introduced a bug where RST segments without
>> timestamps were ignored. This was fixed.
>>
>> Introduced in main on 2020/11/09:
>> https://svnweb.freebsd.org/changeset/base/367530
>> Introduced in stable/12 on 2020/11/30:
>> https://svnweb.freebsd.org/changeset/base/36818
>> Fix in main on 2021/01/13:
>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>> Fix in stable/12 on 2021/01/24:
>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>
>> Are you using a version which is affected by this bug?
> I was. Now I've applied the patch.
> Bad News. It did not fix the problem.
> It still gets into an endless "ignore RST" and stay established when
> the Send-Q is empty.
OK. Let us focus on this case.

Could you:
1. sudo sysctl net.inet.tcp.log_debug=1
2. repeat the situation where RSTs are ignored.
3. check if there is some output on the console (/var/log/messages).
4. Either provide the output or let me know that there is none.
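A minimal way to run steps 1 and 3 together while reproducing it, assuming the debug output is written to /var/log/messages by syslogd (adjust the path if your syslog configuration differs):

# enable TCP debug logging, then watch for new kernel TCP messages (sketch)
sudo sysctl net.inet.tcp.log_debug=1
tail -f /var/log/messages | grep -i tcp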

Best regards
Michael
>
> If the Send-Q is non-empty when I partition, it recovers fine,
> sometimes not even needing to see an RST.
>
> rick
> ps: If you think there might be other recent changes that matter,
> just say the word and I'll upgrade to bits de jur.
>
> rick
>
> Best regards
> Michael
>>
>> If I wait long enough before healing the partition, it will
>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>> do battle (at least not for long).
>>
>> Btw, I have one running now that seems stuck really good.
>> It has been 20minutes since I plugged the net cable back in.
>> (Unfortunately, I didn't have tcpdump running until after
>> I saw it was not progressing after healing.)
>> --> There is one difference. There was a 6minute timeout
>> enabled on the server krpc for "no activity", which is
>> now disabled like it is for NFSv4.1 in freebsd-current.
>> I had forgotten to re-disable it.
>> So, when it does battle, it might have been the 6minute
>> timeout, which would then do the soshutdown(..SHUT_WR)
>> which kept it from getting "stuck" forever.
>> -->This time I had to reboot the FreeBSD NFS server to
>> get the Linux client unstuck, so this one looked a lot
>> like what has been reported.
>> The pcap for this one, started after the network was plugged
>> back in and I noticed it was stuck for quite a while is here:
>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>
>> In it, there is just a bunch of RST followed by SYN sent
>> from client->FreeBSD and FreeBSD just keeps sending
>> acks for the old segment back.
>> --> It looks like FreeBSD did the "RST, ACK" after the
>> krpc did a soshutdown(..SHUT_WR) on the socket,
>> for the one you've been looking at.
>> I'll test some more...
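(A small sketch of one way to pull just the RST/SYN exchange out of that capture, assuming a recent tshark is installed locally:)

  fetch https://people.freebsd.org/~rmacklem/stuck.pcap
  tshark -r stuck.pcap -Y 'tcp.flags.reset == 1 || tcp.flags.syn == 1'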
>>
>>> I would like to understand why the reestablishment of the connection
>>> did not work...
>> It is looking like it takes either a non-empty send-q or a
>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>> out of established, where it just ignores the RSTs and
>> SYN packets.
>>
>> Thanks for looking at it, rick
>>
>> Best regards
>> Michael
>>>
>>> Have fun with it, rick
>>>
>>>
>>> ________________________________________
>>> From: ***@freebsd.org <***@freebsd.org>
>>> Sent: Sunday, April 4, 2021 12:41 PM
>>> To: Rick Macklem
>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>
>>>
>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> Well, I'm going to cheat and top post, since this is related info. and
>>>> not really part of the discussion...
>>>>
>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>> several minutes after un-partitioning the connection.
>>>>
>>>> The battle basically consists of the Linux client sending an RST, followed
>>>> by a SYN.
>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>> over several minutes.
>>>>
>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>> pretty good at ignoring it.
>>>>
>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>> in case anyone wants to look at it.
>>> On freefall? I would like to take a look at it...
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>> *** Network is now partitioned...
>>>>
>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> *** Lots of lines snipped.
>>>>
>>>>
>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> *** Network is now unpartitioned...
>>>>
>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>> *** This "battle" goes on for 223sec...
>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>> client paused for a while (see ARPs below).
>>>>
>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>
>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>> of 13 (100+ for another test run).
>>>>
>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>> *** Now back in business...
>>>>
>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>
>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>> suspect a Linux client bug, but will be investigating further.
>>>>
>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>> or if the RST should be ack'd sooner?
>>>>
>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>
>>>> rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>> To: Rick Macklem; ***@freebsd.org
>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>>
>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>
>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>
>>>> I can try getting the relevant bug info next week...
>>>>
>>>> ________________________________
>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>> An: ***@freebsd.org <***@freebsd.org>
>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>> Betreff: Re: NFS Mount Hangs
>>>>
>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>
>>>>
>>>>
>>>>
>>>> ***@freebsd.org wrote:
>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> I hope you don't mind a top post...
>>>>>> I've been testing network partitioning between the only Linux client
>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>> applied to it.
>>>>>>
>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>> I see...
>>>>>>
>>>>>> While partitioned:
>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>> the network partition or stays ESTABLISHED.
>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>> sent a FIN, but you never called close() on the socket.
>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>> I guess, and therefore the server does not even detect that the peer
>>>>> is not reachable.
>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>> little while, and then disappears.
>>>>> So how does Linux detect the peer is not reachable?
>>>> Well, here's what I see in a packet capture in the Linux client once
>>>> I partition it (just unplug the net cable):
>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>> - then only ARP queries
>>>>
>>>> Once I plug the net cable back in:
>>>> - ARP works
>>>> - one more retransmit of the same segment
>>>> - receives RST from FreeBSD
>>>> ** So, is this now a "new" TCP connection, despite
>>>> using the same port#.
>>>> --> It matters for NFS, since "new connection"
>>>> implies "must retry all outstanding RPCs".
>>>> - sends SYN
>>>> - receives SYN, ACK from FreeBSD
>>>> --> connection starts working again
>>>> Always uses same port#.
>>>>
>>>> On the FreeBSD server end:
>>>> - receives the last retransmit of the segment (with ACK)
>>>> - sends RST
>>>> - receives SYN
>>>> - sends SYN, ACK
>>>>
>>>> I thought that there was no RST in the capture I looked at
>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>> The socket disappears from the Linux "netstat -a" and I
>>>> suspect that happens after about 54sec, but I am not sure
>>>> about the timing.
>>>>
>>>>>>
>>>>>> After unpartitioning:
>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>> the same port#
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>
>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>
>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>> I have now done so, as above.
>>>>
>>>>>> --> However I never get hangs.
>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>> mount starts working again.
>>>>>>
>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>> a little while.)
>>>>>>
>>>>>> What do those TCP conversant think?
>>>>> I guess you are never calling close() on the socket, for which
>>>>> the connection state is CLOSED.
>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>> BindConnectionToSession to re-assign the back channel.
>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>> that I have to tell them the back channel needs re-assignment
>>>> or something like that.
>>>>
>>>> I am pretty certain they are broken, in that the client needs to
>>>> retry all outstanding RPCs.
>>>>
>>>> For others, here's the long winded version of this that I just
>>>> put on the phabricator review:
>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>> These structures are ref counted and the soclose() is done
>>>> when the ref. cnt goes to zero. My understanding is that
>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>> before the xprt ref. cnt goes to zero.
>>>>
>>>> For NFSv4.1/4.2 there is something called a back channel
>>>> which means that a "xprt" is used for server->client RPCs,
>>>> although the TCP connection is established by the client
>>>> to the server.
>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>> client re-assigns it to a different TCP connection
>>>> via an operation called BindConnectionToSession
>>>> and the Linux client is not doing this soon enough,
>>>> it appears.
>>>>
>>>> So, the soclose() is delayed, which is why I think the
>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>> which can happen before the client gets around to
>>>> re-assigning the back channel.
>>>>
>>>> Thanks for your help with this Michael, rick
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> rick
>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>> for them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>> To: Jason Breitman
>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>> # ifconfig lagg0
>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>
>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>
>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>
>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>
>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>
>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>> This specifies how many seconds to wait for a final FIN
>>>>> packet before the socket is forcibly closed. This is
>>>>> strictly a violation of the TCP specification, but
>>>>> required to prevent denial-of-service attacks. In Linux
>>>>> 2.2, the default value was 180.
>>>>>
>>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>
>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
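(A minimal sketch of such a capture, run on the stuck client; the interface name, output file and duration are placeholders:)

  tcpdump -i eth0 -s 0 -w /tmp/nfs-stuck.pcap host NFS.Server.IP.X and port 2049 &
  sleep 600        # ~10 minutes while the mount is stuck
  kill $!          # then look at /tmp/nfs-stuck.pcap with wireshark/tshark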
>>>>>
>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>
>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>> I ran the capture information again to make use of the situation.
>>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> while true
>>>>> do
>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>> /bin/sleep 60
>>>>> done
>>>>>
>>>>>
>>>>> On the NFS Server
>>>>> Active Internet connections (including servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>
>>>>> On the NFS Client
>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>
>>>>>
>>>>>
>>>>> You had also asked for the output below.
>>>>>
>>>>> # nfsstat -E -s
>>>>> BackChannelCt BindConnToSes
>>>>> 0 0
>>>>>
>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>
>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>
>>>>>
>>>>> Jason Breitman
>>>>>
>>>>>
>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>
>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>
>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>
>>>>>>> Issue
>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>
>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>
>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>
>>>>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>>
>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes, hoping that the TCP Window opens again so it can flush its buffers.
>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>> NFS server with requests.
>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>> again and this should cause the window to open back up.
>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>
>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection, sending a FIN Packet (and still a TCP Window 0)
>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>> the case for NFSv4.1).
>>>>>
>>>>>> - CLIENT ACK that FIN.
>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>> - CLIENT closes its half of the socket and goes in LAST_ACK state.
>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>>
>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we'd end up in the same state as you, I think.
>>>>>>
>>>>>> We’ve never had the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc. and traffic flows again (NFS starts responding)
>>>>>>
>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and through an iptables rule enforces a TCP RST, something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
>>>>>>
>>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
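(A hypothetical reconstruction of the check script described above, as an sh sketch; the server address is a placeholder and the iptables rule is the one quoted in the message:)

  #!/bin/sh
  nfs_server_addr=192.0.2.10    # placeholder for the real NFS server address

  while true; do
      # Local port of a socket to the server stuck in LAST_ACK (empty if none).
      local_port=$(netstat -tn | awk -v srv="$nfs_server_addr" \
          '$6 == "LAST_ACK" && index($5, srv":") == 1 { split($4, a, ":"); print a[2]; exit }')
      if [ -n "$local_port" ]; then
          # Force a TCP RST for that connection, as described above.
          iptables -A OUTPUT -p tcp -d "$nfs_server_addr" --sport "$local_port" \
              -j REJECT --reject-with tcp-reset
          # Remove the rule again as soon as the LAST_ACK socket disappears.
          while netstat -tn | grep "$nfs_server_addr" | grep -q LAST_ACK; do
              sleep 5
          done
          iptables -D OUTPUT -p tcp -d "$nfs_server_addr" --sport "$local_port" \
              -j REJECT --reject-with tcp-reset
      fi
      sleep 60
  done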
>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>
>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>> (They're just waiting for RPC requests.)
>>>>> However, I do now think I know why the soclose() does not happen.
>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>> happen until the client creates a new TCP connection.
>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>
>>>>> I've created the attached patch (completely different from the previous one)
>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>> soclose().
>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>> this change will get the socket shutdown to complete.
>>>>>
>>>>> There are a couple more things you can check on the server...
>>>>> # nfsstat -E -s
>>>>> --> Look for the count under "BindConnToSes".
>>>>> --> If non-zero, backchannels have been assigned
>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>
>>>>> I think the attached patch might work around the problem.
>>>>> The code that should open up the receive window needs to be checked.
>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>> assigned.
>>>>>
>>>>> rick
>>>>>
>>>>> Youssef
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>>>>> https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>>>> <xprtdied.patch>
>>>>>
>>>>> <nfs-hang.log.gz>
>>>>>
>>>>
>>>
>>
>
Rick Macklem
2021-04-10 00:44:06 UTC
Permalink
***@freebsd.org wrote:
>> On 6. Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> ***@freebsd.org wrote:
>> [stuff snipped]
>>> OK. What is the FreeBSD version you are using?
>> main Dec. 23, 2020.
>>
>>>
>>> It seems that the TCP connection on the FreeBSD is still alive,
>>> Linux has decided to start a new TCP connection using the old
>>> port numbers. So it sends a SYN. The response is a challenge ACK
>>> and Linux responds with a RST. This looks good so far. However,
>>> FreeBSD should accept the RST and kill the TCP connection. The
>>> next SYN from the Linux side would establish a new TCP connection.
>>>
>>> So I'm wondering why the RST is not accepted. I made the timestamp
>>> checking stricter but introduced a bug where RST segments without
>>> timestamps were ignored. This was fixed.
>>>
>>> Introduced in main on 2020/11/09:
>>> https://svnweb.freebsd.org/changeset/base/367530
>>> Introduced in stable/12 on 2020/11/30:
>>> https://svnweb.freebsd.org/changeset/base/36818
>>> Fix in main on 2021/01/13:
>>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>>> Fix in stable/12 on 2021/01/24:
>>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>>
>>> Are you using a version which is affected by this bug?
>> I was. Now I've applied the patch.
>> Bad News. It did not fix the problem.
>> It still gets into an endless "ignore RST" and stay established when
>> the Send-Q is empty.
>OK. Let us focus on this case.
>
>Could you:
>1. sudo sysctl net.inet.tcp.log_debug=1
>2. repeat the situation where RSTs are ignored.
>3. check if there is some output on the console (/var/log/messages).
>4. Either provide the output or let me know that there is none.
Well, I have some good news and some bad news (the bad is mostly for Richard).
The only message logged is:
tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally

But...the RST battle no longer occurs. Just one RST that works and then
the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...

So, what is different?

r367492 is reverted from the FreeBSD server.
I did the revert because I think it might be what the otis@ hang is being
caused by. (In his case, the Recv-Q grows on the socket for the
stuck Linux client, while others work.)

Why does reverting fix this?
My only guess is that the krpc gets the upcall right away and sees
an EPIPE when it does soreceive(), which results in soshutdown(SHUT_WR).
I know from a printf that this happened, but whether that is why the
RST battle no longer happens, I don't know.

I can put r367492 back in and do more testing if you'd like, but
I think it probably needs to be reverted?

This does not explain the original hung Linux client problem,
but does shed light on the RST war I could create by doing a
network partitioning.

rick

Best regards
Michael
>
> If the Send-Q is non-empty when I partition, it recovers fine,
> sometimes not even needing to see an RST.
>
> rick
> ps: If you think there might be other recent changes that matter,
> just say the word and I'll upgrade to bits de jur.
>
> rick
>
> Best regards
> Michael
>>
>> If I wait long enough before healing the partition, it will
>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>> do battle (at least not for long).
>>
>> Btw, I have one running now that seems stuck really good.
>> It has been 20minutes since I plugged the net cable back in.
>> (Unfortunately, I didn't have tcpdump running until after
>> I saw it was not progressing after healing.)
>> --> There is one difference. There was a 6minute timeout
>> enabled on the server krpc for "no activity", which is
>> now disabled like it is for NFSv4.1 in freebsd-current.
>> I had forgotten to re-disable it.
>> So, when it does battle, it might have been the 6minute
>> timeout, which would then do the soshutdown(..SHUT_WR)
>> which kept it from getting "stuck" forever.
>> -->This time I had to reboot the FreeBSD NFS server to
>> get the Linux client unstuck, so this one looked a lot
>> like what has been reported.
>> The pcap for this one, started after the network was plugged
>> back in and I noticed it was stuck for quite a while is here:
>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>
>> In it, there is just a bunch of RST followed by SYN sent
>> from client->FreeBSD and FreeBSD just keeps sending
>> acks for the old segment back.
>> --> It looks like FreeBSD did the "RST, ACK" after the
>> krpc did a soshutdown(..SHUT_WR) on the socket,
>> for the one you've been looking at.
>> I'll test some more...
>>
>>> I would like to understand why the reestablishment of the connection
>>> did not work...
>> It is looking like it takes either a non-empty send-q or a
>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>> out of established, where it just ignores the RSTs and
>> SYN packets.
>>
>> Thanks for looking at it, rick
>>
>> Best regards
>> Michael
>>>
>>> Have fun with it, rick
>>>
>>>
>>> ________________________________________
>>> From: ***@freebsd.org <***@freebsd.org>
>>> Sent: Sunday, April 4, 2021 12:41 PM
>>> To: Rick Macklem
>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>
>>>
>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> Well, I'm going to cheat and top post, since this is related info. and
>>>> not really part of the discussion...
>>>>
>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>> several minutes after un-partitioning the connection.
>>>>
>>>> The battle basically consists of the Linux client sending an RST, followed
>>>> by a SYN.
>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>> over several minutes.
>>>>
>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>> pretty good at ignoring it.
>>>>
>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>> in case anyone wants to look at it.
>>> On freefall? I would like to take a look at it...
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>> *** Network is now partitioned...
>>>>
>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> *** Lots of lines snipped.
>>>>
>>>>
>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> *** Network is now unpartitioned...
>>>>
>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>> *** This "battle" goes on for 223sec...
>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>> client paused for a while (see ARPs below).
>>>>
>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>
>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>> of 13 (100+ for another test run).
>>>>
>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>> *** Now back in business...
>>>>
>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>
>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>> suspect a Linux client bug, but will be investigating further.
>>>>
>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>> or if the RST should be ack'd sooner?
>>>>
>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>
>>>> rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>> To: Rick Macklem; ***@freebsd.org
>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>>
>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>
>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>
>>>> I can try getting the relevant bug info next week...
>>>>
>>>> ________________________________
>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>> An: ***@freebsd.org <***@freebsd.org>
>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>> Betreff: Re: NFS Mount Hangs
>>>>
>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>
>>>>
>>>>
>>>>
>>>> ***@freebsd.org wrote:
>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> I hope you don't mind a top post...
>>>>>> I've been testing network partitioning between the only Linux client
>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>> applied to it.
>>>>>>
>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>> I see...
>>>>>>
>>>>>> While partitioned:
>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>> the network partition or stays ESTABLISHED.
>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>> sent a FIN, but you never called close() on the socket.
>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>> I guess, and therefore the server does not even detect that the peer
>>>>> is not reachable.
>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>> little while, and then disappears.
>>>>> So how does Linux detect the peer is not reachable?
>>>> Well, here's what I see in a packet capture in the Linux client once
>>>> I partition it (just unplug the net cable):
>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>> - then only ARP queries
>>>>
>>>> Once I plug the net cable back in:
>>>> - ARP works
>>>> - one more retransmit of the same segment
>>>> - receives RST from FreeBSD
>>>> ** So, is this now a "new" TCP connection, despite
>>>> using the same port#.
>>>> --> It matters for NFS, since "new connection"
>>>> implies "must retry all outstanding RPCs".
>>>> - sends SYN
>>>> - receives SYN, ACK from FreeBSD
>>>> --> connection starts working again
>>>> Always uses same port#.
>>>>
>>>> On the FreeBSD server end:
>>>> - receives the last retransmit of the segment (with ACK)
>>>> - sends RST
>>>> - receives SYN
>>>> - sends SYN, ACK
>>>>
>>>> I thought that there was no RST in the capture I looked at
>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>> The socket disappears from the Linux "netstat -a" and I
>>>> suspect that happens after about 54sec, but I am not sure
>>>> about the timing.
>>>>
>>>>>>
>>>>>> After unpartitioning:
>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>> the same port#
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>
>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>
>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>> I have now done so, as above.
>>>>
>>>>>> --> However I never get hangs.
>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>> mount starts working again.
>>>>>>
>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>> a little while.)
>>>>>>
>>>>>> What do those TCP conversant think?
>>>>> I guess you are never calling close() on the socket, for which
>>>>> the connection state is CLOSED.
>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>> BindConnectionToSession to re-assign the back channel.
>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>> that I have to tell them the back channel needs re-assignment
>>>> or something like that.
>>>>
>>>> I am pretty certain they are broken, in that the client needs to
>>>> retry all outstanding RPCs.
>>>>
>>>> For others, here's the long winded version of this that I just
>>>> put on the phabricator review:
>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>> These structures are ref counted and the soclose() is done
>>>> when the ref. cnt goes to zero. My understanding is that
>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>> before the xprt ref. cnt goes to zero.
>>>>
>>>> For NFSv4.1/4.2 there is something called a back channel
>>>> which means that a "xprt" is used for server->client RPCs,
>>>> although the TCP connection is established by the client
>>>> to the server.
>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>> client re-assigns it to a different TCP connection
>>>> via an operation called BindConnectionToSession
>>>> and the Linux client is not doing this soon enough,
>>>> it appears.
>>>>
>>>> So, the soclose() is delayed, which is why I think the
>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>> which can happen before the client gets around to
>>>> re-assigning the back channel.
>>>>
>>>> Thanks for your help with this Michael, rick
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> rick
>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>> for them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>> To: Jason Breitman
>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>> # ifconfig lagg0
>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>
>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>
>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>
>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>
>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>
>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>> This specifies how many seconds to wait for a final FIN
>>>>> packet before the socket is forcibly closed. This is
>>>>> strictly a violation of the TCP specification, but
>>>>> required to prevent denial-of-service attacks. In Linux
>>>>> 2.2, the default value was 180.
>>>>>
>>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>
>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>
>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>
>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>> I ran the capture information again to make use of the situation.
>>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> while true
>>>>> do
>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>> /bin/sleep 60
>>>>> done
>>>>>
>>>>>
>>>>> On the NFS Server
>>>>> Active Internet connections (including servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>
>>>>> On the NFS Client
>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>
>>>>>
>>>>>
>>>>> You had also asked for the output below.
>>>>>
>>>>> # nfsstat -E -s
>>>>> BackChannelCt BindConnToSes
>>>>> 0 0
>>>>>
>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>
>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>
>>>>>
>>>>> Jason Breitman
>>>>>
>>>>>
>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>
>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>
>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>
>>>>>>> Issue
>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>
>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>
>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>
>>>>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>>
>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes, hoping that the TCP Window opens again so it can flush its buffers.
>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>> NFS server with requests.
>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>> again and this should cause the window to open back up.
>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>> tcp_output() when it decides what to do about the rcvwin.
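A rough way to watch for that from userland, as a sketch only (the interface name is a placeholder; 2049 is the NFS port from the netstat output above, and tcp[14:2] is the raw 16-bit window field of the TCP header):

# tcpdump -n -i em0 'src port 2049 and tcp[14:2] = 0'

If zero-window segments from the server keep appearing long after the request backlog should have drained, the window is not being opened back up as described.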
>>>>>
>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection sending a FIN Packet (and still a TCP Window 0)
>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>> the case for NFSv4.1).
>>>>>
>>>>>> - CLIENT ACK that FIN.
>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>> - CLIENT closes its half of the socket and goes into LAST_ACK state.
>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>>
>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we’d end up in the same state as you, I think.
>>>>>>
>>>>>> We never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe the SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding).
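On a stock FreeBSD server the equivalent mitigation would be roughly (the values are the ones quoted above; finwait2_timeout is in milliseconds):

# sysctl net.inet.tcp.fast_finwait2_recycle=1
# sysctl net.inet.tcp.finwait2_timeout=5000

plus the corresponding entries in /etc/sysctl.conf to survive a reboot. This only recycles the stuck FIN_WAIT_2 sockets faster; it does not explain why the window closed in the first place.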
>>>>>>
>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we added a check script on the client that detects LAST_ACK sockets on the client and enforces a TCP RST through an iptables rule, something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears).
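A rough sketch of such a watchdog (Linux client side; the server address and port are placeholders, it only handles a single stuck IPv4 connection at a time, and the intervals are arbitrary):

#!/bin/sh
# Watch for a connection to the NFS server stuck in LAST_ACK and, while it
# exists, keep an iptables rule in place that answers it with a TCP RST.
nfs_server_addr="NFS.Server.IP.X"
while true
do
    local_port=$(netstat -tan | awk -v peer="$nfs_server_addr:2049" \
        '$5 == peer && $6 == "LAST_ACK" { split($4, a, ":"); print a[2]; exit }')
    if [ -n "$local_port" ]; then
        iptables -A OUTPUT -p tcp -d "$nfs_server_addr" --sport "$local_port" \
            -j REJECT --reject-with tcp-reset
        # keep the rule until the LAST_ACK entry disappears, then remove it
        while netstat -tan | grep "$nfs_server_addr:2049" | grep -q LAST_ACK
        do
            sleep 5
        done
        iptables -D OUTPUT -p tcp -d "$nfs_server_addr" --sport "$local_port" \
            -j REJECT --reject-with tcp-reset
    fi
    sleep 30
done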
>>>>>>
>>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>
>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>> (They're just waiting for RPC requests.)
>>>>> However, I do now think I know why the soclose() does not happen.
>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>> happen until the client creates a new TCP connection.
>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>
>>>>> I've created the attached patch (completely different from the previous one)
>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>> soclose().
>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>> this change will get the socket shutdown to complete.
>>>>>
>>>>> There are a couple more things you can check on the server...
>>>>> # nfsstat -E -s
>>>>> --> Look for the count under "BindConnToSes".
>>>>> --> If non-zero, backchannels have been assigned
>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>
>>>>> I think the attached patch might work around the problem.
>>>>> The code that should open up the receive window needs to be checked.
>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>> assigned.
>>>>>
>>>>> rick
>>>>>
>>>>> Youssef
>>>>>
>>>>> <xprtdied.patch>
>>>>>
>>>>> <nfs-hang.log.gz>
>>>>>
>>>>
>>>
>>
>
Scheffenegger, Richard
2021-04-10 09:19:30 UTC
Permalink
Hi Rick,

> Well, I have some good news and some bad news (the bad is mostly for Richard).
>
> The only message logged is:
> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>
> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>
> So, what is different?
>
> r367492 is reverted from the FreeBSD server.
> I did the revert because I think it might be what the otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.)
>
> Why does reverting fix this?
> My only guess is that the krpc gets the upcall right away and sees an EPIPE when it does soreceive(), which results in soshutdown(SHUT_WR).

With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?

From what you describe, this is on writes, isn't it? (I'm asking because the original problem that was fixed with r367492 occurs in the read path: the draining of the so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack.)

I only added the so_snd buffer part after some discussion about whether WAKESOR shouldn't have a symmetric equivalent in WAKESOW....

Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem with erroneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possibly address this issue...

Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
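For reference, applying that review to a main checkout could look something like this (a sketch; it assumes arc(1) is configured for reviews.freebsd.org and the -j value is arbitrary):

# cd /usr/src && git checkout main
# arc patch D29690
# make -j8 buildkernel && make installkernel && shutdown -r now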

If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled from transmitting DSACKs...


> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>
> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?

Please, I don't quite understand why the exact timing of the upcall would be that critical here...

A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.

> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>
> rick
t***@freebsd.org
2021-04-10 12:19:05 UTC
Permalink
> On 10. Apr 2021, at 11:19, Scheffenegger, Richard <***@netapp.com> wrote:
>
> Hi Rick,
>
>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>
>> The only message logged is:
>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>
>> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>
>> So, what is different?
>>
>> r367492 is reverted from the FreeBSD server.
>> I did the revert because I think it might be what otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.
>>
>> Why does reverting fix this?
>> My only guess is that the krpc gets the upcall right away and sees a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
>
> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
My understanding is that he needs this error indication when calling shutdown().
>
> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>
> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>
> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>
> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
>
> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
My understanding is that the problem is related to getting a local error indication after
receiving a RST segment too late or not at all.

Best regards
Michael
>
>
>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>
>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>
> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>
> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>
>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>
>> rick
>
Scheffenegger, Richard
2021-04-10 14:40:24 UTC
Permalink
________________________________
Von: ***@freebsd.org <***@freebsd.org>
Gesendet: Samstag, April 10, 2021 2:19 PM
An: Scheffenegger, Richard
Cc: Rick Macklem; Youssef GHORBAL; freebsd-***@freebsd.org
Betreff: Re: NFS Mount Hangs


> On 10. Apr 2021, at 11:19, Scheffenegger, Richard <***@netapp.com> wrote:
>
> Hi Rick,
>
>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>
>> The only message logged is:
>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>
>> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>
>> So, what is different?
>>
>> r367492 is reverted from the FreeBSD server.
>> I did the revert because I think it might be what otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.
>>
>> Why does reverting fix this?
>> My only guess is that the krpc gets the upcall right away and sees a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
>
> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?

My understanding is that he needs this error indication when calling shutdown().

>
> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>
> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>
> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>
> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?

Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?

Rs: agree, a good understanding of where the interaction between the stack, the socket and the in-kernel TCP user breaks is needed;

>
> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...

My understanding is that the problem is related to getting a local error indication after
receiving a RST segment too late or not at all.

Rs: but the move of the upcall should not materially change that; I don't have a PC here to see if any upcall actually happens on RST...

Best regards
Michael
>
>
>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>
>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>
> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>
> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>
>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>
>> rick
>
Rick Macklem
2021-04-10 15:56:33 UTC
Permalink
Scheffenegger, Richard <***@netapp.com> wrote:
>>Rick wrote:
>> Hi Rick,
>>
>>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>>
>>> The only message logged is:
>>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>>
Btw, I did get one additional message during further testing (with r367492 reverted):
tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted
by remote endpoint

This only happened once in several test cycles.

>>> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>>
>>> So, what is different?
>>>
>>> r367492 is reverted from the FreeBSD server.
>>> I did the revert because I think it might be what otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.
>>>
>>> Why does reverting fix this?
>>> My only guess is that the krpc gets the upcall right away and sees a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
This was bogus and incorrect. The diagnostic printf() I saw was generated for the
back channel, and that would have occurred after the socket was shut down.

>>
>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
the socket (until it acquires/processes an RPC it will not do a sosend()).
Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
waited more than 30minutes, which is close enough to "forever" for me).
--> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
causes a soshutdown(..SHUT_WR) on the socket.
(Since the soshutdown() patch is not yet in "main" (I got comments, but no "reviewed"
on it), the 6minute timer won't help if enabled in main. The soclose() won't happen
for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)

If Send-Q is non-empty when the network is partitioned, the battle will not happen.

>
>My understanding is that he needs this error indication when calling shutdown().
There are several ways the krpc notices that a TCP connection is no longer functional.
- An error return like EPIPE from either sosend() or soreceive().
- A return of 0 from soreceive() with no data (normal EOF from other end).
- A 6minute timeout on the server end, when no activity has occurred on the
connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
but I enabled it for this testing, to stop the "RST battle goes on forever"
during testing. I am thinking of enabling it on "main", but this crude bandaid
shouldn't be thought of as a "fix for the RST battle".

>>
>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>
>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>
>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>
>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
any effect?

>Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
I think it has already shipped broken.
I don't know if an errata is possible, or if it will be broken until 13.1.

--> I am much more concerned with the otis@ stuck client problem than this RST battle that only
occurs after a network partitioning, especially if it is 13.0 specific.
I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
problem, which I failed to reproduce.

rick

Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;

>
> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...

My understanding is that the problem is related to getting a local error indication after
receiving a RST segment too late or not at all.

Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...

Best regards
Michael
>
>
>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>
>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>
> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>
> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>
>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>
>> rick
>
t***@freebsd.org
2021-04-10 16:12:40 UTC
Permalink
> On 10. Apr 2021, at 17:56, Rick Macklem <***@uoguelph.ca> wrote:
>
> Scheffenegger, Richard <***@netapp.com> wrote:
>>> Rick wrote:
>>> Hi Rick,
>>>
>>>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>>>
>>>> The only message logged is:
>>>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>>>
> Btw, I did get one additional message during further testing (with r367492 reverted):
> tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted
> by remote endpoint
>
> This only happened once of several test cycles.
That is OK.
>
>>>> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>>>
>>>> So, what is different?
>>>>
>>>> r367492 is reverted from the FreeBSD server.
>>>> I did the revert because I think it might be what otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.
>>>>
>>>> Why does reverting fix this?
>>>> My only guess is that the krpc gets the upcall right away and sees a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
> This was bogus and incorrect. The diagnostic printf() I saw was generated for the
> back channel, and that would have occurred after the socket was shut down.
>
>>>
>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
> the socket (until it acquires/processes an RPC it will not do a sosend()).
> Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
> waited more than 30minutes, which is close enough to "forever" for me).
> --> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
> causes a soshutdown(..SHUT_WR) on the socket.
> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
> for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)
I'm confused. So you are saying that if the Send-Q is empty when you partition the
network, and the peer starts to send SYNs after the healing, FreeBSD responds
with a challenge ACK which triggers the sending of a RST by Linux. This RST is
ignored multiple times.
Is that true? Even with my patch for the bug I introduced?
What version of the kernel are you using?

Best regards
Michael
>
> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>
>>
>> My understanding is that he needs this error indication when calling shutdown().
> There are several ways the krpc notices that a TCP connection is no longer functional.
> - An error return like EPIPE from either sosend() or soreceive().
> - A return of 0 from soreceive() with no data (normal EOF from other end).
> - A 6minute timeout on the server end, when no activity has occurred on the
> connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
> but I enabled it for this testing, to stop the "RST battle goes on forever"
> during testing. I am thinking of enabling it on "main", but this crude bandaid
> shouldn't be thought of as a "fix for the RST battle".
>
>>>
>>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>>
>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>
>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>>
>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
> Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
> any effect?
>
>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
> I think it has already shipped broken.
> I don't know if an errata is possible, or if it will be broken until 13.1.
>
> --> I am much more concerned with the otis@ stuck client problem than this RST battle that only
> occurs after a network partitioning, especially if it is 13.0 specific.
> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
> problem, which I failed to reproduce.
>
> rick
>
> Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;
>
>>
>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>
> My understanding is that the problem is related to getting a local error indication after
> receiving a RST segment too late or not at all.
>
> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>
> Best regards
> Michael
>>
>>
>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>
>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>
>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>
>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>
>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>
>>> rick
>>
Scheffenegger, Richard
2021-04-10 18:15:02 UTC
Permalink
I went through all the instances where an immediate soupcall would be triggered (before r367492).

If the problem is related to a race condition where the socket is unlocked before the upcall, I can change the patch to retain the lock on the socket all through TCP processing.

Both sorwakeup calls are made with a locked socket (which is the critical part, I understand), while for the write upcall one call site is unlocked and one is locked....


Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
***@netapp.com

https://ts.la/richard49892


-----Ursprüngliche Nachricht-----
Von: ***@freebsd.org <***@freebsd.org>
Gesendet: Samstag, 10. April 2021 18:13
An: Rick Macklem <***@uoguelph.ca>
Cc: Scheffenegger, Richard <***@netapp.com>; Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org
Betreff: Re: NFS Mount Hangs


> On 10. Apr 2021, at 17:56, Rick Macklem <***@uoguelph.ca> wrote:
>
> Scheffenegger, Richard <***@netapp.com> wrote:
>>> Rick wrote:
>>> Hi Rick,
>>>
>>>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>>>
>>>> The only message logged is:
>>>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment
>>>> processed normally
>>>>
> Btw, I did get one additional message during further testing (with r367492 reverted):
> tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted
> by remote endpoint
>
> This only happened once of several test cycles.
That is OK.
>
>>>> But...the RST battle no longer occurs. Just one RST that works and then the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>>>
>>>> So, what is different?
>>>>
>>>> r367492 is reverted from the FreeBSD server.
>>>> I did the revert because I think it might be what otis@ hang is being caused by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, while others work.
>>>>
>>>> Why does reverting fix this?
>>>> My only guess is that the krpc gets the upcall right away and sees a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
> This was bogus and incorrect. The diagnostic printf() I saw was
> generated for the back channel, and that would have occurred after the socket was shut down.
>
>>>
>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
> If Send-Q is 0 when the network is partitioned, after healing, the
> krpc sees no activity on the socket (until it acquires/processes an RPC it will not do a sosend()).
> Without the 6minute timeout, the RST battle goes on "forever" (I've
> never actually waited more than 30minutes, which is close enough to "forever" for me).
> --> With the 6minute timeout, the "battle" stops after 6minutes, when
> --> the timeout
> causes a soshutdown(..SHUT_WR) on the socket.
> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
> for TCP connections with the back channel enabled, such as Linux
> 4.1/4.2 ones.)
I'm confused. So you are saying that if the Send-Q is empty when you partition the network, and the peer starts to send SYNs after the healing, FreeBSD responds with a challenge ACK which triggers the sending of a RST by Linux. This RST is ignored multiple times.
Is that true? Even with my patch for the the bug I introduced?
What version of the kernel are you using?

Best regards
Michael
>
> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>
>>
>> My understanding is that he needs this error indication when calling shutdown().
> There are several ways the krpc notices that a TCP connection is no longer functional.
> - An error return like EPIPE from either sosend() or soreceive().
> - A return of 0 from soreceive() with no data (normal EOF from other end).
> - A 6minute timeout on the server end, when no activity has occurred
> on the connection. This timer is currently disabled for NFSv4.1/4.2
> mounts in "main", but I enabled it for this testing, to stop the "RST battle goes on forever"
> during testing. I am thinking of enabling it on "main", but this
> crude bandaid shouldn't be thought of as a "fix for the RST battle".
>
>>>
>>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>>
>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>
>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>>
>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
> Since the krpc only uses receive upcalls, I don't see how reverting
> the send side would have any effect?
>
>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
> I think it has already shipped broken.
> I don't know if an errata is possible, or if it will be broken until 13.1.
>
> --> I am much more concerned with the otis@ stuck client problem than
> --> this RST battle that only
> occurs after a network partitioning, especially if it is 13.0 specific.
> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
> problem, which I failed to reproduce.
>
> rick
>
> Rs: agree, a good understanding where the interaction btwn stack,
> socket and in kernel tcp user breaks is needed;
>
>>
>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>
> My understanding is that the problem is related to getting a local
> error indication after receiving a RST segment too late or not at all.
>
> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>
> Best regards
> Michael
>>
>>
>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>
>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>
>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>
>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>
>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>
>>> rick
>>
Rick Macklem
2021-04-10 21:59:51 UTC
Permalink
***@freebsd.org wrote:
>Rick wrote:
[stuff snipped]
>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
> the socket (until it acquires/processes an RPC it will not do a sosend()).
> Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
> waited more than 30minutes, which is close enough to "forever" for me).
> --> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
> causes a soshutdown(..SHUT_WR) on the socket.
> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
> for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)
>I'm confused. So you are saying that if the Send-Q is empty when you partition the
>network, and the peer starts to send SYNs after the healing, FreeBSD responds
>with a challenge ACK which triggers the sending of a RST by Linux. This RST is
>ignored multiple times.
>Is that true? Even with my patch for the the bug I introduced?
Yes and yes.
Go take another look at linuxtofreenfs.pcap
("fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap" if you don't
already have it.)
Look at packet #1949->2069. I use wireshark, but you'll have your favourite.
You'll see the "RST battle" that ends after
6minutes at packet#2069. If there is no 6minute timeout enabled in the
server side krpc, then the battle just continues (I once let it run for about
30minutes before giving up). The 6minute timeout is not currently enabled
in main, etc.
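For anyone else looking at that capture, one way to pull out just that exchange (a sketch; it needs tshark from the wireshark package):

# fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap
# tshark -r linuxtofreenfs.pcap -Y 'frame.number >= 1949 && frame.number <= 2069 && (tcp.flags.syn == 1 || tcp.flags.reset == 1)'

which lists only the SYN and RST segments of the battle.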

>What version of the kernel are you using?
"main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
are not relevant + 2 small krpc related patches.
--> The two small krpc related patches enable the 6minute timeout and
add a soshutdown(..SHUT_WR) call when the 6minute timeout is
triggered. These have no effect until the 6minutes is up and, without
them the "RTS battle" goes on forever.

Add to the above a revert of r367492 and the RST battle goes away and things
behave as expected. The recovery happens quickly after the network is
unpartitioned, with either 0 or 1 RSTs.

rick
ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
main bits-du-jour for testing.

Best regards
Michael
>
> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>
>>
>> My understanding is that he needs this error indication when calling shutdown().
> There are several ways the krpc notices that a TCP connection is no longer functional.
> - An error return like EPIPE from either sosend() or soreceive().
> - A return of 0 from soreceive() with no data (normal EOF from other end).
> - A 6minute timeout on the server end, when no activity has occurred on the
> connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
> but I enabled it for this testing, to stop the "RST battle goes on forever"
> during testing. I am thinking of enabling it on "main", but this crude bandaid
> shouldn't be thought of as a "fix for the RST battle".
>
>>>
>>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>>
>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>
>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>>
>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
> Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
> any effect?
>
>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
> I think it has already shipped broken.
> I don't know if an errata is possible, or if it will be broken until 13.1.
>
> --> I am much more concerned with the otis@ stuck client problem than this RST battle that only
> occurs after a network partitioning, especially if it is 13.0 specific.
> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
> problem, which I failed to reproduce.
>
> rick
>
> Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;
>
>>
>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>
> My understanding is that the problem is related to getting a local error indication after
> receiving a RST segment too late or not at all.
>
> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>
> Best regards
> Michael
>>
>>
>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>
>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>
>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>
>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>
>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>
>>> rick
>>
t***@freebsd.org
2021-04-11 12:30:09 UTC
Permalink
> On 10. Apr 2021, at 23:59, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
>> the socket (until it acquires/processes an RPC it will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
>> waited more than 30minutes, which is close enough to "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
>> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
>> for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)
>> I'm confused. So you are saying that if the Send-Q is empty when you partition the
>> network, and the peer starts to send SYNs after the healing, FreeBSD responds
>> with a challenge ACK which triggers the sending of a RST by Linux. This RST is
>> ignored multiple times.
>> Is that true? Even with my patch for the the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap
> ("fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap" if you don't
> already have it.)
> Look at packet #1949->2069. I use wireshark, but you'll have your favourite.
> You'll see the "RST battle" that ends after
> 6minutes at packet#2069. If there is no 6minute timeout enabled in the
> server side krpc, then the battle just continues (I once let it run for about
> 30minutes before giving up). The 6minute timeout is not currently enabled
> in main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which
basically destroys the TCP connection.

Richard: Can you explain that?

Best regards
Michael
>
>> What version of the kernel are you using?
> "main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
> are not relevant + 2 small krpc related patches.
> --> The two small krpc related patches enable the 6minute timeout and
> add a soshutdown(..SHUT_WR) call when the 6minute timeout is
> triggered. These have no effect until the 6minutes is up and, without
> them the "RTS battle" goes on forever.
>
> Add to the above a revert of r367492 and the RST battle goes away and things
> behave as expected. The recovery happens quickly after the network is
> unpartitioned, with either 0 or 1 RSTs.
>
> rick
> ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
> main bits-de-jur for testing.
>
> Best regards
> Michael
>>
>> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>>
>>>
>>> My understanding is that he needs this error indication when calling shutdown().
>> There are several ways the krpc notices that a TCP connection is no longer functional.
>> - An error return like EPIPE from either sosend() or soreceive().
>> - A return of 0 from soreceive() with no data (normal EOF from other end).
>> - A 6minute timeout on the server end, when no activity has occurred on the
>> connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
>> but I enabled it for this testing, to stop the "RST battle goes on forever"
>> during testing. I am thinking of enabling it on "main", but this crude bandaid
>> shouldn't be thought of as a "fix for the RST battle".
>>
>>>>
>>>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>>>
>>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>>
>>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>>>
>>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
>> Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
>> any effect?
>>
>>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
>> I think it has already shipped broken.
>> I don't know if an errata is possible, or if it will be broken until 13.1.
>>
>> --> I am much more concerned with the otis@ stuck client problem than this RST battle that only
>> occurs after a network partitioning, especially if it is 13.0 specific.
>> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
>> problem, which I failed to reproduce.
>>
>> rick
>>
>> Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;
>>
>>>
>>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>>
>> My understanding is that the problem is related to getting a local error indication after
>> receiving a RST segment too late or not at all.
>>
>> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>>
>> Best regards
>> Michael
>>>
>>>
>>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>>
>>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>>
>>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>>
>>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>>
>>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>>
>>>> rick
>>>
Scheffenegger, Richard
2021-04-11 16:54:37 UTC
Permalink
From what I understand Rick to be stating about the socket state changing before the upcall, I can only speculate that the RST fight is over the new sessions the client tries with the same 5-tuple, while on the server side the old original session persists, as the NFS server never closes / shuts down the session.

But a debug-logged version of the socket upcall used by the NFS server should reveal any differences in socket state at the time of the upcall.
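A quick way to get that without rebuilding the kernel might be dtrace on the upcall entry point. This is only a sketch; it assumes the server-side receive upcall is svc_vc_soupcall() from sys/rpc/svc_vc.c and that the function has not been inlined:

# dtrace -n 'fbt::svc_vc_soupcall:entry { printf("so_state 0x%x so_error %d", args[0]->so_state, args[0]->so_error); }'

That would print the socket state and any pending error each time the krpc is woken up, which could then be compared between the good and the bad kernels.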

I would very much like to know whether D29690 addresses that problem (if it was due to releasing the lock before the upcall), or whether there are still differences between the behaviour prior to my central upcall change, after that change, and with D29690 ...

________________________________
Von: ***@freebsd.org <***@freebsd.org>
Gesendet: Sunday, April 11, 2021 2:30:09 PM
An: Rick Macklem <***@uoguelph.ca>
Cc: Scheffenegger, Richard <***@netapp.com>; Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
Betreff: Re: NFS Mount Hangs


> On 10. Apr 2021, at 23:59, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
>> the socket (until it acquires/processes an RPC it will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
>> waited more than 30minutes, which is close enough to "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
>> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
>> for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)
>> I'm confused. So you are saying that if the Send-Q is empty when you partition the
>> network, and the peer starts to send SYNs after the healing, FreeBSD responds
>> with a challenge ACK which triggers the sending of a RST by Linux. This RST is
>> ignored multiple times.
>> Is that true? Even with my patch for the the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap
> ("fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap" if you don't
> already have it.)
> Look at packet #1949->2069. I use wireshark, but you'll have your favourite.
> You'll see the "RST battle" that ends after
> 6minutes at packet#2069. If there is no 6minute timeout enabled in the
> server side krpc, then the battle just continues (I once let it run for about
> 30minutes before giving up). The 6minute timeout is not currently enabled
> in main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which
basically destroys the TCP connection.

Richard: Can you explain that?

Best regards
Michael
>
>> What version of the kernel are you using?
> "main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
> are not relevant + 2 small krpc related patches.
> --> The two small krpc related patches enable the 6minute timeout and
> add a soshutdown(..SHUT_WR) call when the 6minute timeout is
> triggered. These have no effect until the 6minutes is up and, without
> them the "RTS battle" goes on forever.
>
> Add to the above a revert of r367492 and the RST battle goes away and things
> behave as expected. The recovery happens quickly after the network is
> unpartitioned, with either 0 or 1 RSTs.
>
> rick
> ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
> main bits-de-jur for testing.
>
> Best regards
> Michael
>>
>> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>>
>>>
>>> My understanding is that he needs this error indication when calling shutdown().
>> There are several ways the krpc notices that a TCP connection is no longer functional.
>> - An error return like EPIPE from either sosend() or soreceive().
>> - A return of 0 from soreceive() with no data (normal EOF from other end).
>> - A 6minute timeout on the server end, when no activity has occurred on the
>> connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
>> but I enabled it for this testing, to stop the "RST battle goes on forever"
>> during testing. I am thinking of enabling it on "main", but this crude bandaid
>> shouldn't be thought of as a "fix for the RST battle".
>>
>>>>
>>>> From what you describe, this is on writes, isn't it? (I'm asking, at the original problem that was fixed with r367492, occurs in the read path (draining of ths so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack).
>>>>
>>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>>
>>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erraneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possible address this issue...
>>>>
>>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
>> Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
>> any effect?
>>
>>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
>> I think it has already shipped broken.
>> I don't know if an errata is possible, or if it will be broken until 13.1.
>>
>> --> I am much more concerned with the otis@ stuck client problem than this RST battle that only
>> occurs after a network partitioning, especially if it is 13.0 specific.
>> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
>> problem, which I failed to reproduce.
>>
>> rick
>>
>> Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;
>>
>>>
>>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>>
>> My understanding is that the problem is related to getting a local error indication after
>> receiving a RST segment too late or not at all.
>>
>> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>>
>> Best regards
>> Michael
>>>
>>>
>>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>>
>>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>>
>>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>>
>>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>>
>>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>>
>>>> rick
>>>
Rick Macklem
2021-04-11 22:49:49 UTC
Permalink
I should be able to test D29690 in about a week.
Note that I will not be able to tell if it fixes otis@'s
hung Linux client problem.

rick

________________________________________
From: Scheffenegger, Richard <***@netapp.com>
Sent: Sunday, April 11, 2021 12:54 PM
To: ***@freebsd.org; Rick Macklem
Cc: Youssef GHORBAL; freebsd-***@freebsd.org
Subject: Re: NFS Mount Hangs

CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca


From what I understand of Rick's point about the socket state changing before the upcall, I can only speculate that the RST fight is over the new session the client tries to set up with the same 5-tuple, while on the server side the old session persists, since the NFS server never closes / shuts down the socket.

But a debug-logged version of the socket upcall used by the NFS server should reveal any differences in socket state at the time of the upcall.

I would very much like to know if D29690 addresses that problem (if it was due to releasing the lock before the upcall), or if there are still differences between the behaviour prior to my central upcall change, after that change, and with D29690 applied ...
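
Something along these lines is what I have in mind (illustrative only, not a
finished patch; svc_vc_soupcall() is the existing receive upcall handler in
sys/rpc/svc_vc.c, and the registration normally happens when the connection
is set up):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/syslog.h>
#include <rpc/rpc.h>

/*
 * Wrapper that logs the socket state seen at upcall time and then calls
 * the normal handler.
 */
static int
svc_vc_soupcall_debug(struct socket *so, void *arg, int waitflag)
{

        log(LOG_DEBUG, "nfsd rcv upcall: so_state=0x%x so_error=%d avail=%u\n",
            so->so_state, so->so_error, sbavail(&so->so_rcv));
        return (svc_vc_soupcall(so, arg, waitflag));
}

/* Registered in place of the normal handler, roughly: */
static void
svc_vc_debug_hookup(struct socket *so, SVCXPRT *xprt)
{

        SOCKBUF_LOCK(&so->so_rcv);
        soupcall_set(so, SO_RCV, svc_vc_soupcall_debug, xprt);
        SOCKBUF_UNLOCK(&so->so_rcv);
}

Diffing that output between a working client and a stuck one should show
whether the socket state (or the upcall itself) differs when the RST arrives.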

________________________________
Von: ***@freebsd.org <***@freebsd.org>
Gesendet: Sunday, April 11, 2021 2:30:09 PM
An: Rick Macklem <***@uoguelph.ca>
Cc: Scheffenegger, Richard <***@netapp.com>; Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
Betreff: Re: NFS Mount Hangs

NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.




> On 10. Apr 2021, at 23:59, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees no activity on
>> the socket (until it acquires/processes an RPC it will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've never actually
>> waited more than 30minutes, which is close enough to "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when the timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
>> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
>> for TCP connections with the back channel enabled, such as Linux 4.1/4.2 ones.)
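>>
>> The timeout action amounts to roughly the following (a sketch of the idea
>> only, not the actual patch, and I'm quoting the field names from memory):
>>
>> static void
>> svc_idle_check_sketch(SVCXPRT *xprt)
>> {
>>         /* Six minutes with no RPC activity on this connection? */
>>         if (time_uptime - xprt->xp_lastactive < 6 * 60)
>>                 return;
>>         /*
>>          * soclose() cannot happen while the back channel still holds a
>>          * reference on the xprt, but shutting down the send side lets TCP
>>          * tear the connection down instead of ignoring the peer's RSTs
>>          * forever.
>>          */
>>         soshutdown(xprt->xp_socket, SHUT_WR);
>> }
>>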
>> I'm confused. So you are saying that if the Send-Q is empty when you partition the
>> network, and the peer starts to send SYNs after the healing, FreeBSD responds
>> with a challenge ACK which triggers the sending of a RST by Linux. This RST is
>> ignored multiple times.
>> Is that true? Even with my patch for the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap
> ("fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap" if you don't
> already have it.)
> Look at packet #1949->2069. I use wireshark, but you'll have your favourite.
> You'll see the "RST battle" that ends after
> 6minutes at packet#2069. If there is no 6minute timeout enabled in the
> server side krpc, then the battle just continues (I once let it run for about
> 30minutes before giving up). The 6minute timeout is not currently enabled
> in main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which
basically destroys the TCP connection.

Richard: Can you explain that?

Best regards
Michael
>
>> What version of the kernel are you using?
> "main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
> are not relevant + 2 small krpc related patches.
> --> The two small krpc related patches enable the 6minute timeout and
> add a soshutdown(..SHUT_WR) call when the 6minute timeout is
> triggered. These have no effect until the 6minutes is up and, without
> them the "RTS battle" goes on forever.
>
> Add to the above a revert of r367492 and the RST battle goes away and things
> behave as expected. The recovery happens quickly after the network is
> unpartitioned, with either 0 or 1 RSTs.
>
> rick
> ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
> main bits-de-jur for testing.
>
> Best regards
> Michael
>>
>> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>>
>>>
>>> My understanding is that he needs this error indication when calling shutdown().
>> There are several ways the krpc notices that a TCP connection is no longer functional.
>> - An error return like EPIPE from either sosend() or soreceive().
>> - A return of 0 from soreceive() with no data (normal EOF from other end).
>> - A 6minute timeout on the server end, when no activity has occurred on the
>> connection. This timer is currently disabled for NFSv4.1/4.2 mounts in "main",
>> but I enabled it for this testing, to stop the "RST battle goes on forever"
>> during testing. I am thinking of enabling it on "main", but this crude bandaid
>> shouldn't be thought of as a "fix for the RST battle".
>>
>>>>
>>>> From what you describe, this is on writes, isn't it? (I'm asking, as the original problem that was fixed with r367492 occurs in the read path: draining of the so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack.)
>>>>
>>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>>
>>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erroneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possibly address this issue...
>>>>
>>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
>> Since the krpc only uses receive upcalls, I don't see how reverting the send side would have
>> any effect?
>>
>>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
>> I think it has already shipped broken.
>> I don't know if an errata is possible, or if it will be broken until 13.1.
>>
>> --> I am much more concerned with the otis@ stuck client problem than this RST battle that only
>> occurs after a network partitioning, especially if it is 13.0 specific.
>> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
>> problem, which I failed to reproduce.
>>
>> rick
>>
>> Rs: agree, a good understanding where the interaction btwn stack, socket and in kernel tcp user breaks is needed;
>>
>>>
>>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>>
>> My understanding is that the problem is related to getting a local error indication after
>> receiving a RST segment too late or not at all.
>>
>> Rs: but the move of the upcall should not materially change that; i don’t have a pc here to see if any upcall actually happens on rst...
>>
>> Best regards
>> Michael
>>>
>>>
>>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>>
>>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>>
>>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>>
>>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>>
>>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>>
>>>> rick
>>>
>>> _______________________________________________
>>> freebsd-***@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Scheffenegger, Richard
2021-04-12 07:49:44 UTC
Permalink
I was trying to do some simple tests yesterday - but don't know if these are representative:

Using an old Debian 3.16.3 linux box as nfs client, and simulating the disconnect with an ipfw rule, while introducing some packet drops using dummynet (I really should be adding a simple markov-chain state machine for burst losses), to utilize some of the socket upcalls in the tcp_input code flow. But it got too late before I arrived at any relevant results...
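
Roughly the setup I was using (rule and pipe numbers, the server address and
the loss rate are made-up examples; the dummynet module has to be loaded first):

# kldload dummynet

"Partition" the client from the server, and later heal it again:
# ipfw add 100 deny tcp from me to NFS.SERVER.IP 2049
# ipfw delete 100

Random loss and a bit of delay on the NFS traffic, to exercise the upcall
paths in tcp_input():
# ipfw pipe 1 config plr 0.05 delay 10ms
# ipfw add 200 pipe 1 tcp from any to any 2049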


Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
***@netapp.com

https://ts.la/richard49892


-----Ursprüngliche Nachricht-----
Von: Rick Macklem <***@uoguelph.ca>
Gesendet: Montag, 12. April 2021 00:50
An: Scheffenegger, Richard <***@netapp.com>; ***@freebsd.org
Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org
Betreff: Re: NFS Mount Hangs

NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.




I should be able to test D29690 in about a week.
Note that I will not be able to tell if it fixes otis@'s hung Linux client problem.

rick

________________________________________
From: Scheffenegger, Richard <***@netapp.com>
Sent: Sunday, April 11, 2021 12:54 PM
To: ***@freebsd.org; Rick Macklem
Cc: Youssef GHORBAL; freebsd-***@freebsd.org
Subject: Re: NFS Mount Hangs

CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca


From what I understand of Rick's point about the socket state changing before the upcall, I can only speculate that the RST fight is over the new session the client tries to set up with the same 5-tuple, while on the server side the old session persists, since the NFS server never closes / shuts down the socket.

But a debug-logged version of the socket upcall used by the NFS server should reveal any differences in socket state at the time of the upcall.

I would very much like to know if D29690 addresses that problem (if it was due to releasing the lock before the upcall), or if there are still differences between the behaviour prior to my central upcall change, after that change, and with D29690 applied ...

________________________________
Von: ***@freebsd.org <***@freebsd.org>
Gesendet: Sunday, April 11, 2021 2:30:09 PM
An: Rick Macklem <***@uoguelph.ca>
Cc: Scheffenegger, Richard <***@netapp.com>; Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
Betreff: Re: NFS Mount Hangs

NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.




> On 10. Apr 2021, at 23:59, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the
>> krpc sees no activity on the socket (until it acquires/processes an RPC it will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've
>> never actually waited more than 30minutes, which is close enough to "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when
>> --> the timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but no "reviewed"
>> on it, the 6minute timer won't help if enabled in main. The soclose() won't happen
>> for TCP connections with the back channel enabled, such as Linux
>> 4.1/4.2 ones.) I'm confused. So you are saying that if the Send-Q is
>> empty when you partition the network, and the peer starts to send
>> SYNs after the healing, FreeBSD responds with a challenge ACK which
>> triggers the sending of a RST by Linux. This RST is ignored multiple times.
>> Is that true? Even with my patch for the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap ("fetch
> https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap" if you don't
> already have it.) Look at packet #1949->2069. I use wireshark, but
> you'll have your favourite.
> You'll see the "RST battle" that ends after 6minutes at packet#2069.
> If there is no 6minute timeout enabled in the server side krpc, then
> the battle just continues (I once let it run for about 30minutes
> before giving up). The 6minute timeout is not currently enabled in
> main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which basically destroys the TCP connection.

Richard: Can you explain that?

Best regards
Michael
>
>> What version of the kernel are you using?
> "main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
> are not relevant + 2 small krpc related patches.
> --> The two small krpc related patches enable the 6minute timeout and
> add a soshutdown(..SHUT_WR) call when the 6minute timeout is
> triggered. These have no effect until the 6minutes is up and, without
> them the "RTS battle" goes on forever.
>
> Add to the above a revert of r367492 and the RST battle goes away and
> things behave as expected. The recovery happens quickly after the
> network is unpartitioned, with either 0 or 1 RSTs.
>
> rick
> ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
> main bits-de-jur for testing.
>
> Best regards
> Michael
>>
>> If Send-Q is non-empty when the network is partitioned, the battle will not happen.
>>
>>>
>>> My understanding is that he needs this error indication when calling shutdown().
>> There are several ways the krpc notices that a TCP connection is no longer functional.
>> - An error return like EPIPE from either sosend() or soreceive().
>> - A return of 0 from soreceive() with no data (normal EOF from other end).
>> - A 6minute timeout on the server end, when no activity has occurred
>> on the connection. This timer is currently disabled for NFSv4.1/4.2
>> mounts in "main", but I enabled it for this testing, to stop the "RST battle goes on forever"
>> during testing. I am thinking of enabling it on "main", but this
>> crude bandaid shouldn't be thought of as a "fix for the RST battle".
>>
>>>>
>>>> From what you describe, this is on writes, isn't it? (I'm asking, as the original problem that was fixed with r367492 occurs in the read path: draining of the so_rcv buffer in the upcall right away, which subsequently influences the ACK sent by the stack.)
>>>>
>>>> I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't have a symmetric equivalent on WAKESOW....
>>>>
>>>> Thus a partial backout (leaving the WAKESOR part inside, but reverting the WAKESOW part) would still fix my initial problem about erroneous DSACKs (which can also lead to extremely poor performance with Linux clients), but possibly address this issue...
>>>>
>>>> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the revert only on the so_snd upcall?
>> Since the krpc only uses receive upcalls, I don't see how reverting
>> the send side would have any effect?
>>
>>> Since the release of 13.0 is almost done, can we try to fix the issue instead of reverting the commit?
>> I think it has already shipped broken.
>> I don't know if an errata is possible, or if it will be broken until 13.1.
>>
>> --> I am much more concerned with the otis@ stuck client problem than
>> --> this RST battle that only
>> occurs after a network partitioning, especially if it is 13.0 specific.
>> I did this testing to try to reproduce Jason's stuck client (with connection in CLOSE_WAIT)
>> problem, which I failed to reproduce.
>>
>> rick
>>
>> Rs: agree, a good understanding where the interaction btwn stack,
>> socket and in kernel tcp user breaks is needed;
>>
>>>
>>> If this doesn't help, some major surgery will be necessary to prevent NFS sessions with SACK enabled, to transmit DSACKs...
>>
>> My understanding is that the problem is related to getting a local
>> error indication after receiving a RST segment too late or not at all.
>>
>> Rs: but the move of the upcall should not materially change that; i don't have a pc here to see if any upcall actually happens on rst...
>>
>> Best regards
>> Michael
>>>
>>>
>>>> I know from a printf that this happened, but whether it caused the RST battle to not happen, I don't know.
>>>>
>>>> I can put r367492 back in and do more testing if you'd like, but I think it probably needs to be reverted?
>>>
>>> Please, I don't quite understand why the exact timing of the upcall would be that critical here...
>>>
>>> A comparison of the soxxx calls and errors between the "good" and the "bad" would be perfect. I don't know if this is easy to do though, as these calls appear to be scattered all around the RPC / NFS source paths.
>>>
>>>> This does not explain the original hung Linux client problem, but does shed light on the RST war I could create by doing a network partitioning.
>>>>
>>>> rick
>>>
>>> _______________________________________________
>>> freebsd-***@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
t***@freebsd.org
2021-04-10 12:13:18 UTC
Permalink
> On 10. Apr 2021, at 02:44, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>>> On 6. Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>>>
>>> ***@freebsd.org wrote:
>>> [stuff snipped]
>>>> OK. What is the FreeBSD version you are using?
>>> main Dec. 23, 2020.
>>>
>>>>
>>>> It seems that the TCP connection on the FreeBSD is still alive,
>>>> Linux has decided to start a new TCP connection using the old
>>>> port numbers. So it sends a SYN. The response is a challenge ACK
>>>> and Linux responds with a RST. This looks good so far. However,
>>>> FreeBSD should accept the RST and kill the TCP connection. The
>>>> next SYN from the Linux side would establish a new TCP connection.
>>>>
>>>> So I'm wondering why the RST is not accepted. I made the timestamp
>>>> checking stricter but introduced a bug where RST segments without
>>>> timestamps were ignored. This was fixed.
>>>>
>>>> Introduced in main on 2020/11/09:
>>>> https://svnweb.freebsd.org/changeset/base/367530
>>>> Introduced in stable/12 on 2020/11/30:
>>>> https://svnweb.freebsd.org/changeset/base/36818
>>>>> Fix in main on 2021/01/13:
>>>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>>>> Fix in stable/12 on 2021/01/24:
>>>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>>>
>>>> Are you using a version which is affected by this bug?
>>> I was. Now I've applied the patch.
>>> Bad News. It did not fix the problem.
>>> It still gets into an endless "ignore RST" and stay established when
>>> the Send-Q is empty.
>> OK. Let us focus on this case.
>>
>> Could you:
>> 1. sudo sysctl net.inet.tcp.log_debug=1
>> 2. repeat the situation where RSTs are ignored.
>> 3. check if there is some output on the console (/var/log/messages).
>> 4. Either provide the output or let me know that there is none.
> Well, I have some good news and some bad news (the bad is mostly for Richard).
> The only message logged is:
> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>
> But...the RST battle no longer occurs. Just one RST that works and then
> the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
The above is what I would expect if you integrated cc3c34859eab1b317d0f38731355b53f7d978c97
or reverted r367530. Did you do that?
>
>
> So, what is different?
>
> r367492 is reverted from the FreeBSD server.
Only that? So you still have the bug I introduced in tree, but the RST segment is accepted?

Best regards
Michael
> I did the revert because I think it might be what otis@'s hang is being
> caused by. (In his case, the Recv-Q grows on the socket for the
> stuck Linux client, while others work.)
>
> Why does reverting fix this?
> My only guess is that the krpc gets the upcall right away and sees
> an EPIPE when it does soreceive(), which results in soshutdown(SHUT_WR).
> I know from a printf that this happened, but whether it caused the
> RST battle to not happen, I don't know.
>
> I can put r367492 back in and do more testing if you'd like, but
> I think it probably needs to be reverted?
>
> This does not explain the original hung Linux client problem,
> but does shed light on the RST war I could create by doing a
> network partitioning.
>
> rick
>
> Best regards
> Michael
>>
>> If the Send-Q is non-empty when I partition, it recovers fine,
>> sometimes not even needing to see an RST.
>>
>> rick
>> ps: If you think there might be other recent changes that matter,
>> just say the word and I'll upgrade to bits de jur.
>>
>> rick
>>
>> Best regards
>> Michael
>>>
>>> If I wait long enough before healing the partition, it will
>>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>>> do battle (at least not for long).
>>>
>>> Btw, I have one running now that seems stuck really good.
>>> It has been 20minutes since I plugged the net cable back in.
>>> (Unfortunately, I didn't have tcpdump running until after
>>> I saw it was not progressing after healing.)
>>> --> There is one difference. There was a 6minute timeout
>>> enabled on the server krpc for "no activity", which is
>>> now disabled like it is for NFSv4.1 in freebsd-current.
>>> I had forgotten to re-disable it.
>>> So, when it does battle, it might have been the 6minute
>>> timeout, which would then do the soshutdown(..SHUT_WR)
>>> which kept it from getting "stuck" forever.
>>> -->This time I had to reboot the FreeBSD NFS server to
>>> get the Linux client unstuck, so this one looked a lot
>>> like what has been reported.
>>> The pcap for this one, started after the network was plugged
>>> back in and I noticed it was stuck for quite a while is here:
>>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>>
>>> In it, there is just a bunch of RST followed by SYN sent
>>> from client->FreeBSD and FreeBSD just keeps sending
>>> acks for the old segment back.
>>> --> It looks like FreeBSD did the "RST, ACK" after the
>>> krpc did a soshutdown(..SHUT_WR) on the socket,
>>> for the one you've been looking at.
>>> I'll test some more...
>>>
>>>> I would like to understand why the reestablishment of the connection
>>>> did not work...
>>> It is looking like it takes either a non-empty send-q or a
>>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>>> out of established, where it just ignores the RSTs and
>>> SYN packets.
>>>
>>> Thanks for looking at it, rick
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Have fun with it, rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: ***@freebsd.org <***@freebsd.org>
>>>> Sent: Sunday, April 4, 2021 12:41 PM
>>>> To: Rick Macklem
>>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>
>>>>> Well, I'm going to cheat and top post, since this is related info. and
>>>>> not really part of the discussion...
>>>>>
>>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>>> several minutes after un-partitioning the connection.
>>>>>
>>>>> The battle basically consists of the Linux client sending an RST, followed
>>>>> by a SYN.
>>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>>> over several minutes.
>>>>>
>>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>>> pretty good at ignoring it.
>>>>>
>>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>>> in case anyone wants to look at it.
>>>> On freefall? I would like to take a look at it...
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>>> *** Network is now partitioned...
>>>>>
>>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> *** Lots of lines snipped.
>>>>>
>>>>>
>>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> *** Network is now unpartitioned...
>>>>>
>>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>>> *** This "battle" goes on for 223sec...
>>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>>> client paused for a while (see ARPs below).
>>>>>
>>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>>
>>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>>> of 13 (100+ for another test run).
>>>>>
>>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>>> *** Now back in business...
>>>>>
>>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>>
>>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>>> suspect a Linux client bug, but will be investigating further.
>>>>>
>>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>>> or if the RST should be ack'd sooner?
>>>>>
>>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>>
>>>>> rick
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>>> To: Rick Macklem; ***@freebsd.org
>>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler which could conspire to make TCP sessions hang forever.
>>>>>
>>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>>
>>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for performance, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>>
>>>>> I can try getting the relevant bug info next week...
>>>>>
>>>>> ________________________________
>>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>>> An: ***@freebsd.org <***@freebsd.org>
>>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>>> Betreff: Re: NFS Mount Hangs
>>>>>
>>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ***@freebsd.org wrote:
>>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>>
>>>>>>> I hope you don't mind a top post...
>>>>>>> I've been testing network partitioning between the only Linux client
>>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>>> applied to it.
>>>>>>>
>>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>>> I see...
>>>>>>>
>>>>>>> While partitioned:
>>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>>> the network partition or stays ESTABLISHED.
>>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>>> sent a FIN, but you never called close() on the socket.
>>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>>> I guess, and therefore the server does not even detect that the peer
>>>>>> is not reachable.
>>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>>> little while, and then disappears.
>>>>>> So how does Linux detect the peer is not reachable?
>>>>> Well, here's what I see in a packet capture in the Linux client once
>>>>> I partition it (just unplug the net cable):
>>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>>> - then only ARP queries
>>>>>
>>>>> Once I plug the net cable back in:
>>>>> - ARP works
>>>>> - one more retransmit of the same segment
>>>>> - receives RST from FreeBSD
>>>>> ** So, is this now a "new" TCP connection, despite
>>>>> using the same port#.
>>>>> --> It matters for NFS, since "new connection"
>>>>> implies "must retry all outstanding RPCs".
>>>>> - sends SYN
>>>>> - receives SYN, ACK from FreeBSD
>>>>> --> connection starts working again
>>>>> Always uses same port#.
>>>>>
>>>>> On the FreeBSD server end:
>>>>> - receives the last retransmit of the segment (with ACK)
>>>>> - sends RST
>>>>> - receives SYN
>>>>> - sends SYN, ACK
>>>>>
>>>>> I thought that there was no RST in the capture I looked at
>>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>>> The socket disappears from the Linux "netstat -a" and I
>>>>> suspect that happens after about 54sec, but I am not sure
>>>>> about the timing.
>>>>>
>>>>>>>
>>>>>>> After unpartitioning:
>>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>>> the same port#
>>>>>>> Active Internet connections (including servers)
>>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>>
>>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>>
>>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>>> I have now done so, as above.
>>>>>
>>>>>>> --> However I never get hangs.
>>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>>> mount starts working again.
>>>>>>>
>>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>>> a little while.)
>>>>>>>
>>>>>>> What do those TCP conversant think?
>>>>>> I guess you are never calling close() on the socket for which
>>>>>> the connection state is CLOSED.
>>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>>> BindConnectionToSession to re-assign the back channel.
>>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>>> that I have to tell them the back channel needs re-assignment
>>>>> or something like that.
>>>>>
>>>>> I am pretty certain they are broken, in that the client needs to
>>>>> retry all outstanding RPCs.
>>>>>
>>>>> For others, here's the long winded version of this that I just
>>>>> put on the phabricator review:
>>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>>> These structures are ref counted and the soclose() is done
>>>>> when the ref. cnt goes to zero. My understanding is that
>>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>>> before the xprt ref. cnt goes to zero.
>>>>>
>>>>> For NFSv4.1/4.2 there is something called a back channel
>>>>> which means that a "xprt" is used for server->client RPCs,
>>>>> although the TCP connection is established by the client
>>>>> to the server.
>>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>>> client re-assigns it to a different TCP connection
>>>>> via an operation called BindConnectionToSession
>>>>> and the Linux client is not doing this soon enough,
>>>>> it appears.
>>>>>
>>>>> So, the soclose() is delayed, which is why I think the
>>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>>> which can happen before the client gets around to
>>>>> re-assigning the back channel.
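>>>>>
>>>>> The release side of that ref counting looks roughly like this
>>>>> (simplified sketch; the real code uses the SVC_RELEASE()/SVC_DESTROY()
>>>>> macros in sys/rpc/svc.h):
>>>>>
>>>>> static void
>>>>> svc_release_sketch(SVCXPRT *xprt)
>>>>> {
>>>>>         if (refcount_release(&xprt->xp_refs) == 0)
>>>>>                 return; /* back channel (or another user) still holds a ref */
>>>>>         /* Last ref gone: only now does the destroy method do the soclose(). */
>>>>>         SVC_DESTROY(xprt);
>>>>> }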
>>>>>
>>>>> Thanks for your help with this Michael, rick
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>>
>>>>>> rick
>>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>>> for them.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>>> To: Jason Breitman
>>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>>> Subject: Re: NFS Mount Hangs
>>>>>>
>>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com> wrote:
>>>>>>
>>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>>> # ifconfig lagg0
>>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>>
>>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>>
>>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>>
>>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>>
>>>>>> I don't think those will do anything in your case, since the FIN_WAIT2 sockets are on the client side and those sysctls are for BSD.
>>>>>> By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout).
>>>>>>
>>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>>> This specifies how many seconds to wait for a final FIN
>>>>>> packet before the socket is forcibly closed. This is
>>>>>> strictly a violation of the TCP specification, but
>>>>>> required to prevent denial-of-service attacks. In Linux
>>>>>> 2.2, the default value was 180.
>>>>>>
>>>>>> So I don't get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>>
>>>>>> You really need to have a packet capture during the outage (client and server side) so you'll get the over-the-wire conversation and can start speculating from there.
>>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>>
>>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>>
>>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>>> I ran the capture information again to make use of the situation.
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> while true
>>>>>> do
>>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>>> /bin/sleep 60
>>>>>> done
>>>>>>
>>>>>>
>>>>>> On the NFS Server
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>>
>>>>>> On the NFS Client
>>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>>
>>>>>>
>>>>>>
>>>>>> You had also asked for the output below.
>>>>>>
>>>>>> # nfsstat -E -s
>>>>>> BackChannelCt BindConnToSes
>>>>>> 0 0
>>>>>>
>>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>>
>>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>>
>>>>>>
>>>>>> Jason Breitman
>>>>>>
>>>>>>
>>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> Youssef GHORBAL <***@pasteur.fr> wrote:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com> wrote:
>>>>>>>>
>>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>>
>>>>>>>> Issue
>>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>>
>>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>>
>>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd).
>>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>>
>>>>>>> We've had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>>>
>>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>>> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>>> NFS server with requests.
>>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>>> again and this should cause the window to open back up.
>>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>>
>>>>>>> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (and still a TCP Window of 0).
>>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>>> the case for NFSv4.1).
>>>>>>
>>>>>>> - CLIENT ACK that FIN.
>>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>>> - CLIENT closes its half of the socket and goes into LAST_ACK state.
>>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>>>
>>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we'd end up in the same state as you, I think.
>>>>>>>
>>>>>>> We've never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data. The fix on the Isilon side was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding).
>>>>>>>
>>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we've added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears).
>>>>>>>
>>>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>>
>>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>>> (They're just waiting for RPC requests.)
>>>>>> However, I do now think I know why the soclose() does not happen.
>>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>>> happen until the client creates a new TCP connection.
>>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>>
>>>>>> I've created the attached patch (completely different from the previous one)
>>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>>> soclose().
>>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>>> this change will get the socket shutdown to complete.
>>>>>>
>>>>>> There are a couple more things you can check on the server...
>>>>>> # nfsstat -E -s
>>>>>> --> Look for the count under "BindConnToSes".
>>>>>> --> If non-zero, backchannels have been assigned
>>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>>
>>>>>> I think the attached patch might work around the problem.
>>>>>> The code that should open up the receive window needs to be checked.
>>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>>> assigned.
>>>>>>
>>>>>> rick
>>>>>>
>>>>>> Youssef
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>> <xprtdied.patch>
>>>>>>
>>>>>> <nfs-hang.log.gz>
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>
>>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Rick Macklem
2021-04-10 15:04:16 UTC
Permalink
***@freebsd.org wrote:
>> On 10. Apr 2021, at 02:44, Rick Macklem <***@uoguelph.ca> wrote:
>>
>> ***@freebsd.org wrote:
>>>> On 6. Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> ***@freebsd.org wrote:
>>>> [stuff snipped]
>>>>> OK. What is the FreeBSD version you are using?
>>>> main Dec. 23, 2020.
>>>>
>>>>>
>>>>> It seems that the TCP connection on the FreeBSD is still alive,
>>>>> Linux has decided to start a new TCP connection using the old
>>>>> port numbers. So it sends a SYN. The response is a challenge ACK
>>>>> and Linux responds with a RST. This looks good so far. However,
>>>>> FreeBSD should accept the RST and kill the TCP connection. The
>>>>> next SYN from the Linux side would establish a new TCP connection.
>>>>>
>>>>> So I'm wondering why the RST is not accepted. I made the timestamp
>>>>> checking stricter but introduced a bug where RST segments without
>>>>> timestamps were ignored. This was fixed.
>>>>>
>>>>> Introduced in main on 2020/11/09:
>>>>> https://svnweb.freebsd.org/changeset/base/367530
>>>>> Introduced in stable/12 on 2020/11/30:
>>>>> https://svnweb.freebsd.org/changeset/base/36818
>>>>>> Fix in main on 2021/01/13:
>>>>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>>>>> Fix in stable/12 on 2021/01/24:
>>>>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>>>>
>>>>> Are you using a version which is affected by this bug?
>>>> I was. Now I've applied the patch.
>>>> Bad News. It did not fix the problem.
>>>> It still gets into an endless "ignore RST" and stay established when
>>>> the Send-Q is empty.
>>> OK. Let us focus on this case.
>>>
>>> Could you:
>>> 1. sudo sysctl net.inet.tcp.log_debug=1
>>> 2. repeat the situation where RSTs are ignored.
>>> 3. check if there is some output on the console (/var/log/messages).
>>> 4. Either provide the output or let me know that there is none.
>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>> The only message logged is:
>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>
>> But...the RST battle no longer occurs. Just one RST that works and then
>> the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>The above is what I would expect if you integrated cc3c34859eab1b317d0f38731355b53f7d978c97
>or reverted r367530. Did you do that?
r367530 is in the kernel that does not cause the "RST battle".

>
>
> So, what is different?
>
> r367492 is reverted from the FreeBSD server.
Only that? So you still have the bug I introduced in tree, but the RST segment is accepted?
No. The kernel being tested includes the fix (you committed mid-January) for the bug
that went in in Nov.
However, adding the mid-January patch did not fix the problem.
Then reverting r367492 (and only r367492) made the problem go away.

I did not expect reverting r367492 to affect this.
I reverted r367492 because otis@ gets Linux client mounts "stuck" against a FreeBSD13
NFS server, where the Recv-Q size grows and the client gets no RPC replies. Other
clients are still working fine. I can only think of one explanation for this:
- An upcall gets missed or occurs at the wrong time.
--> Since what this patch does is move where the upcalls is done, it is the logical
culprit.
Hopefully otis@ will be able to determine if reverting r367492 fixes the problem.
This will take weeks, since the problem recently took two weeks to recur.
--> This would be the receive path, so reverting the send path would not be
relevant.
*** I'd like to hear from otis@ before testing a "send path only" revert.
--> Also, it has been a long time since I worked on the socket upcall code, but I
vaguely remember that the upcalls needed to be done before SOCKBUF_LOCK()
is dropped to ensure that the socket buffer is in the expected state.
r367492 drops SOCKBUF_LOCK() and then picks it up again for the upcalls.
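Roughly, the ordering difference looks like this (a simplified sketch only, not the
actual r367492 diff; SOCKBUF_LOCK()/SOCKBUF_UNLOCK(), sbappendstream_locked() and
sorwakeup_locked() are the real KPI names, the rest is illustrative):

    /* Old ordering: the upcall runs while the receive buffer lock is still
     * held, so the consumer sees the sockbuf exactly as it was just left. */
    SOCKBUF_LOCK(&so->so_rcv);
    sbappendstream_locked(&so->so_rcv, m, 0);
    sorwakeup_locked(so);           /* does the upcall, then drops the lock */

    /* r367492-style ordering (as I understand it): the lock is dropped,
     * segment processing continues, and the lock is re-taken later just to
     * do the wakeup/upcall. */
    SOCKBUF_LOCK(&so->so_rcv);
    sbappendstream_locked(&so->so_rcv, m, 0);
    SOCKBUF_UNLOCK(&so->so_rcv);
    /* ... rest of the TCP input processing for this segment ... */
    SOCKBUF_LOCK(&so->so_rcv);
    sorwakeup_locked(so);           /* upcall now happens at the end */

If a consumer such as the krpc depends on the socket buffer still being in the
state it was left in when the upcall fires, the second ordering opens a window
where that state can change before the upcall runs.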

I'll send you guys the otis@ problem email. (I don't think that one is cc'd to a list.)

rick

Best regards
Michael
> I did the revert because I think it might be what otis@ hang is being
> caused by. (In his case, the Recv-Q grows on the socket for the
> stuck Linux client, while others work.)
>
> Why does reverting fix this?
> My only guess is that the krpc gets the upcall right away and sees
> a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
> I know from a printf that this happened, but whether it caused the
> RST battle to not happen, I don't know.
>
> I can put r367492 back in and do more testing if you'd like, but
> I think it probably needs to be reverted?
>
> This does not explain the original hung Linux client problem,
> but does shed light on the RST war I could create by doing a
> network partitioning.
>
> rick
>
> Best regards
> Michael
>>
>> If the Send-Q is non-empty when I partition, it recovers fine,
>> sometimes not even needing to see an RST.
>>
>> rick
>> ps: If you think there might be other recent changes that matter,
>> just say the word and I'll upgrade to bits de jur.
>>
>> rick
>>
>> Best regards
>> Michael
>>>
>>> If I wait long enough before healing the partition, it will
>>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>>> do battle (at least not for long).
>>>
>>> Btw, I have one running now that seems stuck really good.
>>> It has been 20minutes since I plugged the net cable back in.
>>> (Unfortunately, I didn't have tcpdump running until after
>>> I saw it was not progressing after healing.
>>> --> There is one difference. There was a 6minute timeout
>>> enabled on the server krpc for "no activity", which is
>>> now disabled like it is for NFSv4.1 in freebsd-current.
>>> I had forgotten to re-disable it.
>>> So, when it does battle, it might have been the 6minute
>>> timeout, which would then do the soshutdown(..SHUT_WR)
>>> which kept it from getting "stuck" forever.
>>> -->This time I had to reboot the FreeBSD NFS server to
>>> get the Linux client unstuck, so this one looked a lot
>>> like what has been reported.
>>> The pcap for this one, started after the network was plugged
>>> back in and I noticed it was stuck for quite a while is here:
>>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>>
>>> In it, there is just a bunch of RST followed by SYN sent
>>> from client->FreeBSD and FreeBSD just keeps sending
>>> acks for the old segment back.
>>> --> It looks like FreeBSD did the "RST, ACK" after the
>>> krpc did a soshutdown(..SHUT_WR) on the socket,
>>> for the one you've been looking at.
>>> I'll test some more...
>>>
>>>> I would like to understand why the reestablishment of the connection
>>>> did not work...
>>> It is looking like it takes either a non-empty send-q or a
>>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>>> out of established, where it just ignores the RSTs and
>>> SYN packets.
>>>
>>> Thanks for looking at it, rick
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Have fun with it, rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: ***@freebsd.org <***@freebsd.org>
>>>> Sent: Sunday, April 4, 2021 12:41 PM
>>>> To: Rick Macklem
>>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>
>>>>> Well, I'm going to cheat and top post, since this is related info. and
>>>>> not really part of the discussion...
>>>>>
>>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>>> several minutes after un-partitioning the connection.
>>>>>
>>>>> The battle basically consists of the Linux client sending an RST, followed
>>>>> by a SYN.
>>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>>> over several minutes.
>>>>>
>>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>>> pretty good at ignoring it.
>>>>>
>>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>>> in case anyone wants to look at it.
>>>> On freefall? I would like to take a look at it...
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>>> *** Network is now partitioned...
>>>>>
>>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> *** Lots of lines snipped.
>>>>>
>>>>>
>>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> *** Network is now unpartitioned...
>>>>>
>>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>>> *** This "battle" goes on for 223sec...
>>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>>> client paused for a while (see ARPs below).
>>>>>
>>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>>
>>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>>> of 13 (100+ for another test run).
>>>>>
>>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>>> *** Now back in business...
>>>>>
>>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>>
>>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>>> suspect a Linux client bug, but will be investigating further.
>>>>>
>>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>>> or if the RST should be ack'd sooner?
>>>>>
>>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>>
>>>>> rick
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>>> To: Rick Macklem; ***@freebsd.org
>>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>>>
>>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>>
>>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower cpu cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>>
>>>>> I can try getting the relevant bug info next week...
>>>>>
>>>>> ________________________________
>>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>>> An: ***@freebsd.org <***@freebsd.org>
>>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>>> Betreff: Re: NFS Mount Hangs
>>>>>
>>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ***@freebsd.org wrote:
>>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>>
>>>>>>> I hope you don't mind a top post...
>>>>>>> I've been testing network partitioning between the only Linux client
>>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>>> applied to it.
>>>>>>>
>>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>>> I see...
>>>>>>>
>>>>>>> While partitioned:
>>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>>> the network partition or stays ESTABLISHED.
>>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>>> sent a FIN, but you never called close() on the socket.
>>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>>> I guess, and therefore the server does not even detect that the peer
>>>>>> is not reachable.
>>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>>> little while, and then disappears.
>>>>>> So how does Linux detect the peer is not reachable?
>>>>> Well, here's what I see in a packet capture in the Linux client once
>>>>> I partition it (just unplug the net cable):
>>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>>> - then only ARP queries
>>>>>
>>>>> Once I plug the net cable back in:
>>>>> - ARP works
>>>>> - one more retransmit of the same segment
>>>>> - receives RST from FreeBSD
>>>>> ** So, is this now a "new" TCP connection, despite
>>>>> using the same port#.
>>>>> --> It matters for NFS, since "new connection"
>>>>> implies "must retry all outstanding RPCs".
>>>>> - sends SYN
>>>>> - receives SYN, ACK from FreeBSD
>>>>> --> connection starts working again
>>>>> Always uses same port#.
>>>>>
>>>>> On the FreeBSD server end:
>>>>> - receives the last retransmit of the segment (with ACK)
>>>>> - sends RST
>>>>> - receives SYN
>>>>> - sends SYN, ACK
>>>>>
>>>>> I thought that there was no RST in the capture I looked at
>>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>>> The socket disappears from the Linux "netstat -a" and I
>>>>> suspect that happens after about 54sec, but I am not sure
>>>>> about the timing.
>>>>>
>>>>>>>
>>>>>>> After unpartitioning:
>>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>>> the same port#
>>>>>>> Active Internet connections (including servers)
>>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>>
>>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>>
>>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>>> I have now done so, as above.
>>>>>
>>>>>>> --> However I never get hangs.
>>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>>> mount starts working again.
>>>>>>>
>>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>>> a little while.)
>>>>>>>
>>>>>>> What do those TCP conversant think?
>>>>>> I guess you are never calling close() on the socket, for which
>>>>>> the connection state is CLOSED.
>>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>>> BindConnectionToSession to re-assign the back channel.
>>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>>> that I have to tell them the back channel needs re-assignment
>>>>> or something like that.
>>>>>
>>>>> I am pretty certain they are broken, in that the client needs to
>>>>> retry all outstanding RPCs.
>>>>>
>>>>> For others, here's the long winded version of this that I just
>>>>> put on the phabricator review:
>>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>>> These structures are ref counted and the soclose() is done
>>>>> when the ref. cnt goes to zero. My understanding is that
>>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>>> before the xprt ref. cnt goes to zero.
>>>>>
>>>>> For NFSv4.1/4.2 there is something called a back channel
>>>>> which means that a "xprt" is used for server->client RPCs,
>>>>> although the TCP connection is established by the client
>>>>> to the server.
>>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>>>
>>>>> client re-assigns it to a different TCP connection
>>>>> via an operation called BindConnectionToSession
>>>>> and the Linux client is not doing this soon enough,
>>>>> it appears.
>>>>>
>>>>> So, the soclose() is delayed, which is why I think the
>>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>>> which can happen before the client gets around to
>>>>> re-assigning the back channel.
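In code terms, the life cycle being described is roughly the following (a sketch
only; SVC_ACQUIRE()/SVC_RELEASE() and xp_refs are, if I recall them correctly, the
names the server krpc uses, and the rest is illustrative):

    SVC_ACQUIRE(xprt);   /* back channel takes its own reference on the xprt */
    ...
    SVC_RELEASE(xprt);   /* roughly: if (refcount_release(&xprt->xp_refs))
                          *              SVC_DESTROY(xprt);                  */
    /* SVC_DESTROY() on the TCP transport is where soclose() finally happens.
     * While the back channel still holds its reference, xp_refs never reaches
     * zero, so soclose() never runs and the connection sits in CLOSE_WAIT
     * until the client does a BindConnectionToSession. */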
>>>>>
>>>>> Thanks for your help with this Michael, rick
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>>
>>>>>> rick
>>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>>> for them.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>>> To: Jason Breitman
>>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>>> Subject: Re: NFS Mount Hangs
>>>>>>
>>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>
>>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>>> # ifconfig lagg0
>>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>>
>>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>>
>>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>>
>>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>>
>>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>>
>>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>>> This specifies how many seconds to wait for a final FIN
>>>>>> packet before the socket is forcibly closed. This is
>>>>>> strictly a violation of the TCP specification, but
>>>>>> required to prevent denial-of-service attacks. In Linux
>>>>>> 2.2, the default value was 180.
>>>>>>
>>>>>> So I don’t get why it stucks in the FIN_WAIT2 state anyway.
>>>>>>
>>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>>
>>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>>
>>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>>> I ran the capture information again to make use of the situation.
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> while true
>>>>>> do
>>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>>> /bin/sleep 60
>>>>>> done
>>>>>>
>>>>>>
>>>>>> On the NFS Server
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>>
>>>>>> On the NFS Client
>>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>>
>>>>>>
>>>>>>
>>>>>> You had also asked for the output below.
>>>>>>
>>>>>> # nfsstat -E -s
>>>>>> BackChannelCtBindConnToSes
>>>>>> 0 0
>>>>>>
>>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>>
>>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>>
>>>>>>
>>>>>> Jason Breitman
>>>>>>
>>>>>>
>>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>>
>>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>>
>>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>>
>>>>>>>> Issue
>>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>>
>>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>>
>>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their >own nfsd)
>>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>>
>>>>>>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>>>>>>
>>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>>> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0
>>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>>> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers.
>>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>>> NFS server with requests.
>>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>>> again and this should cause the window to open back up.
>>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>>
>>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
>>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>>> the case for NFSv4.1).
>>>>>>
>>>>>>> - CLIENT ACK that FIN.
>>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>>> - CLIENT closes its half of the socket and goes in LAST_ACK state.
>>>>>>> - FIN is never sent by the client since there is still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush its buffers >and terminate its side of the socket.
>>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>>> => The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session.
>>>>>>>
>>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we’ll end up in the same >state as you I think.
>>>>>>>
>>>>>>> We've never found the root cause of why the SERVER decided to close the TCP >Window and no longer accept data; the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next >occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handshake, etc. and traffic >flows again (NFS starts responding)
>>>>>>>
>>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we’ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
>>>>>>>
>>>>>>> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
>>>>>> Interesting story and good work w.r.t. sluething, Youssef, thanks.
>>>>>>
>>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>>> (They're just waiting for RPC requests.)
>>>>>> However, I do now think I know why the soclose() does not happen.
>>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>>> happen until the client creates a new TCP connection.
>>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>>
>>>>>> I've created the attached patch (completely different from the previous one)
>>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>>> soclose().
>>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>>> this change will get the socket shutdown to complete.
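The essential call the patch adds in those places is just the following (a sketch;
the exact "connection is going away" tests differ per call site, and xp_socket is
the socket hung off the SVCXPRT):

    /* Send our FIN right away, even though soclose() is still pending on
     * the back-channel reference, so the peer is not left waiting forever. */
    soshutdown(xprt->xp_socket, SHUT_WR);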
>>>>>>
>>>>>> There are a couple more things you can check on the server...
>>>>>> # nfsstat -E -s
>>>>>> --> Look for the count under "BindConnToSes".
>>>>>> --> If non-zero, backchannels have been assigned
>>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>>
>>>>>> I think the attached patch might work around the problem.
>>>>>> The code that should open up the receive window needs to be checked.
>>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>>> assigned.
>>>>>>
>>>>>> rick
>>>>>>
>>>>>> Youssef
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>>>>>> https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>>>>> <xprtdied.patch>
>>>>>>
>>>>>> <nfs-hang.log.gz>
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>
>>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
t***@freebsd.org
2021-04-10 15:19:33 UTC
Permalink
> On 10. Apr 2021, at 17:04, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org wrote:
>>> On 10. Apr 2021, at 02:44, Rick Macklem <***@uoguelph.ca> wrote:
>>>
>>> ***@freebsd.org wrote:
>>>>> On 6. Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>
>>>>> ***@freebsd.org wrote:
>>>>> [stuff snipped]
>>>>>> OK. What is the FreeBSD version you are using?
>>>>> main Dec. 23, 2020.
>>>>>
>>>>>>
>>>>>> It seems that the TCP connection on the FreeBSD is still alive,
>>>>>> Linux has decided to start a new TCP connection using the old
>>>>>> port numbers. So it sends a SYN. The response is a challenge ACK
>>>>>> and Linux responds with a RST. This looks good so far. However,
>>>>>> FreeBSD should accept the RST and kill the TCP connection. The
>>>>>> next SYN from the Linux side would establish a new TCP connection.
>>>>>>
>>>>>> So I'm wondering why the RST is not accepted. I made the timestamp
>>>>>> checking stricter but introduced a bug where RST segments without
>>>>>> timestamps were ignored. This was fixed.
>>>>>>
>>>>>> Introduced in main on 2020/11/09:
>>>>>> https://svnweb.freebsd.org/changeset/base/367530
>>>>>> Introduced in stable/12 on 2020/11/30:
>>>>>> https://svnweb.freebsd.org/changeset/base/36818
>>>>>>> Fix in main on 2021/01/13:
>>>>>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>>>>>> Fix in stable/12 on 2021/01/24:
>>>>>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>>>>>
>>>>>> Are you using a version which is affected by this bug?
>>>>> I was. Now I've applied the patch.
>>>>> Bad News. It did not fix the problem.
>>>>> It still gets into an endless "ignore RST" and stays established when
>>>>> the Send-Q is empty.
>>>> OK. Let us focus on this case.
>>>>
>>>> Could you:
>>>> 1. sudo sysctl net.inet.tcp.log_debug=1
>>>> 2. repeat the situation where RSTs are ignored.
>>>> 3. check if there is some output on the console (/var/log/messages).
>>>> 4. Either provide the output or let me know that there is none.
>>> Well, I have some good news and some bad news (the bad is mostly for Richard).
>>> The only message logged is:
>>> tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, segment processed normally
>>>
>>> But...the RST battle no longer occurs. Just one RST that works and then
>>> the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>> The above is what I would expect if you integrated cc3c34859eab1b317d0f38731355b53f7d978c97
>> or reverted r367530. Did you do that?
> r367530 is in the kernel that does not cause the "RST battle".
>
>>
>>
>> So, what is different?
>>
>> r367492 is reverted from the FreeBSD server.
> Only that? So you still have the bug I introduced in tree, but the RST segment is accepted?
> No. The kernel being tested includes the fix (you committed mid-January) for the bug
> that went in in Nov.
> However, adding the mid-January patch did not fix the problem.
OK. I was focussing on the behaviour that FreeBSD does ignore the received RST.
That is fixed. Good.

I understand that this does not solve your issue.
> Then reverting r367492 (and only r367492) made the problem go away.
>
> I did not expect reverting r367492 to affect this.
> I reverted r367492 because otis@ gets Linux client mounts "stuck" against a FreeBSD13
> NFS server, where the Recv-Q size grows and the client gets no RPC replies. Other
> clients are still working fine. I can only think of one explanation for this:
> - An upcall gets missed or occurs at the wrong time.
My understanding of the patch is that it "delays" the upcall to the end of the
packet processing. So the amount of time is at most a packet processing time,
which is short in my view.
Richard: Correct me if I'm wrong.

Best regards
Michael
> --> Since what this patch does is move where the upcalls are done, it is the logical
> culprit.
> Hopefully otis@ will be able to determine if reverting r367492 fixes the problem.
> This will take weeks, since the problem recently took two weeks to recur.
> --> This would be the receive path, so reverting the send path would not be
> relevant.
> *** I'd like to hear from otis@ before testing a "send path only" revert.
> --> Also, it has been a long time since I worked on the socket upcall code, but I
> vaguely remember that the upcalls needed to be done before SOCKBUF_LOCK()
> is dropped to ensure that the socket buffer is in the expected state.
> r367492 drops SOCKBUF_LOCK() and then picks it up again for the upcalls.
>
> I'll send you guys the otis@ problem email. (I don't think that one is cc'd to a list.)
>
> rick
>
> Best regards
> Michael
>> I did the revert because I think it might be what otis@ hang is being
>> caused by. (In his case, the Recv-Q grows on the socket for the
>> stuck Linux client, while others work.)
>>
>> Why does reverting fix this?
>> My only guess is that the krpc gets the upcall right away and sees
>> a EPIPE when it does soreceive()->results in soshutdown(SHUT_WR).
>> I know from a printf that this happened, but whether it caused the
>> RST battle to not happen, I don't know.
>>
>> I can put r367492 back in and do more testing if you'd like, but
>> I think it probably needs to be reverted?
>>
>> This does not explain the original hung Linux client problem,
>> but does shed light on the RST war I could create by doing a
>> network partitioning.
>>
>> rick
>>
>> Best regards
>> Michael
>>>
>>> If the Send-Q is non-empty when I partition, it recovers fine,
>>> sometimes not even needing to see an RST.
>>>
>>> rick
>>> ps: If you think there might be other recent changes that matter,
>>> just say the word and I'll upgrade to bits de jur.
>>>
>>> rick
>>>
>>> Best regards
>>> Michael
>>>>
>>>> If I wait long enough before healing the partition, it will
>>>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>>>> do battle (at least not for long).
>>>>
>>>> Btw, I have one running now that seems stuck really good.
>>>> It has been 20minutes since I plugged the net cable back in.
>>>> (Unfortunately, I didn't have tcpdump running until after
>>>> I saw it was not progressing after healing.
>>>> --> There is one difference. There was a 6minute timeout
>>>> enabled on the server krpc for "no activity", which is
>>>> now disabled like it is for NFSv4.1 in freebsd-current.
>>>> I had forgotten to re-disable it.
>>>> So, when it does battle, it might have been the 6minute
>>>> timeout, which would then do the soshutdown(..SHUT_WR)
>>>> which kept it from getting "stuck" forever.
>>>> -->This time I had to reboot the FreeBSD NFS server to
>>>> get the Linux client unstuck, so this one looked a lot
>>>> like what has been reported.
>>>> The pcap for this one, started after the network was plugged
>>>> back in and I noticed it was stuck for quite a while is here:
>>>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>>>
>>>> In it, there is just a bunch of RST followed by SYN sent
>>>> from client->FreeBSD and FreeBSD just keeps sending
>>>> acks for the old segment back.
>>>> --> It looks like FreeBSD did the "RST, ACK" after the
>>>> krpc did a soshutdown(..SHUT_WR) on the socket,
>>>> for the one you've been looking at.
>>>> I'll test some more...
>>>>
>>>>> I would like to understand why the reestablishment of the connection
>>>>> did not work...
>>>> It is looking like it takes either a non-empty send-q or a
>>>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>>>> out of established, where it just ignores the RSTs and
>>>> SYN packets.
>>>>
>>>> Thanks for looking at it, rick
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> Have fun with it, rick
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: ***@freebsd.org <***@freebsd.org>
>>>>> Sent: Sunday, April 4, 2021 12:41 PM
>>>>> To: Rick Macklem
>>>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> Well, I'm going to cheat and top post, since this is related info. and
>>>>>> not really part of the discussion...
>>>>>>
>>>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>>>> several minutes after un-partitioning the connection.
>>>>>>
>>>>>> The battle basically consists of the Linux client sending an RST, followed
>>>>>> by a SYN.
>>>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>>>> over several minutes.
>>>>>>
>>>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>>>> pretty good at ignoring it.
>>>>>>
>>>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>>>> in case anyone wants to look at it.
>>>>> On freefall? I would like to take a look at it...
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>>
>>>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>>>> *** Network is now partitioned...
>>>>>>
>>>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>>> *** Lots of lines snipped.
>>>>>>
>>>>>>
>>>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> *** Network is now unpartitioned...
>>>>>>
>>>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>>>> *** This "battle" goes on for 223sec...
>>>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>>>> client paused for a while (see ARPs below).
>>>>>>
>>>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>>>
>>>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>>>> of 13 (100+ for another test run).
>>>>>>
>>>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>>>> *** Now back in business...
>>>>>>
>>>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>>>
>>>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>>>> suspect a Linux client bug, but will be investigating further.
>>>>>>
>>>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>>>> or if the RST should be ack'd sooner?
>>>>>>
>>>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>>>
>>>>>> rick
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>>>> To: Rick Macklem; ***@freebsd.org
>>>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>>>> Subject: Re: NFS Mount Hangs
>>>>>>
>>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>>
>>>>>>
>>>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>>>>
>>>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>>>
>>>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but it is often recommended for performance, as it runs lockless and at lower cpu cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>>>
>>>>>> I can try getting the relevant bug info next week...
>>>>>>
>>>>>> ________________________________
>>>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>>>> An: ***@freebsd.org <***@freebsd.org>
>>>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>>>> Betreff: Re: NFS Mount Hangs
>>>>>>
>>>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ***@freebsd.org wrote:
>>>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>>>
>>>>>>>> I hope you don't mind a top post...
>>>>>>>> I've been testing network partitioning between the only Linux client
>>>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>>>> applied to it.
>>>>>>>>
>>>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>>>> I see...
>>>>>>>>
>>>>>>>> While partitioned:
>>>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>>>> the network partition or stays ESTABLISHED.
>>>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>>>> sent a FIN, but you never called close() on the socket.
>>>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>>>> I guess, and therefore the server does not even detect that the peer
>>>>>>> is not reachable.
>>>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>>>> little while, and then disappears.
>>>>>>> So how does Linux detect the peer is not reachable?
>>>>>> Well, here's what I see in a packet capture in the Linux client once
>>>>>> I partition it (just unplug the net cable):
>>>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>>>> - then only ARP queries
>>>>>>
>>>>>> Once I plug the net cable back in:
>>>>>> - ARP works
>>>>>> - one more retransmit of the same segment
>>>>>> - receives RST from FreeBSD
>>>>>> ** So, is this now a "new" TCP connection, despite
>>>>>> using the same port#.
>>>>>> --> It matters for NFS, since "new connection"
>>>>>> implies "must retry all outstanding RPCs".
>>>>>> - sends SYN
>>>>>> - receives SYN, ACK from FreeBSD
>>>>>> --> connection starts working again
>>>>>> Always uses same port#.
>>>>>>
>>>>>> On the FreeBSD server end:
>>>>>> - receives the last retransmit of the segment (with ACK)
>>>>>> - sends RST
>>>>>> - receives SYN
>>>>>> - sends SYN, ACK
>>>>>>
>>>>>> I thought that there was no RST in the capture I looked at
>>>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>>>> The socket disappears from the Linux "netstat -a" and I
>>>>>> suspect that happens after about 54sec, but I am not sure
>>>>>> about the timing.
>>>>>>
>>>>>>>>
>>>>>>>> After unpartitioning:
>>>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>>>> the same port#
>>>>>>>> Active Internet connections (including servers)
>>>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>>>
>>>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>>>
>>>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>>>> I have now done so, as above.
>>>>>>
>>>>>>>> --> However I never get hangs.
>>>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>>>> mount starts working again.
>>>>>>>>
>>>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>>>> a little while.)
>>>>>>>>
>>>>>>>> What do those TCP conversant think?
>>>>>>> I guess you are never calling close() on the socket, for which
>>>>>>> the connection state is CLOSED.
>>>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>>>> BindConnectionToSession to re-assign the back channel.
>>>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>>>> that I have to tell them the back channel needs re-assignment
>>>>>> or something like that.
>>>>>>
>>>>>> I am pretty certain they are broken, in that the client needs to
>>>>>> retry all outstanding RPCs.
>>>>>>
>>>>>> For others, here's the long winded version of this that I just
>>>>>> put on the phabricator review:
>>>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>>>> These structures are ref counted and the soclose() is done
>>>>>> when the ref. cnt goes to zero. My understanding is that
>>>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>>>> before the xprt ref. cnt goes to zero.
>>>>>>
>>>>>> For NFSv4.1/4.2 there is something called a back channel
>>>>>> which means that a "xprt" is used for server->client RPCs,
>>>>>> although the TCP connection is established by the client
>>>>>> to the server.
>>>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>>>>
>>>>>> client re-assigns it to a different TCP connection
>>>>>> via an operation called BindConnectionToSession
>>>>>> and the Linux client is not doing this soon enough,
>>>>>> it appears.
>>>>>>
>>>>>> So, the soclose() is delayed, which is why I think the
>>>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>>>> which can happen before the client gets around to
>>>>>> re-assigning the back channel.
>>>>>>
>>>>>> Thanks for your help with this Michael, rick
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>>
>>>>>>> rick
>>>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>>>> for them.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>>>> To: Jason Breitman
>>>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>>>> Subject: Re: NFS Mount Hangs
>>>>>>>
>>>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>
>>>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>>>> # ifconfig lagg0
>>>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>>>
>>>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>>>
>>>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>>>
>>>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>>>
>>>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>>>
>>>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>>>> This specifies how many seconds to wait for a final FIN
>>>>>>> packet before the socket is forcibly closed. This is
>>>>>>> strictly a violation of the TCP specification, but
>>>>>>> required to prevent denial-of-service attacks. In Linux
>>>>>>> 2.2, the default value was 180.
>>>>>>>
>>>>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>>>
>>>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>>>
>>>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>>>
>>>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>>>> I ran the capture information again to make use of the situation.
>>>>>>>
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> while true
>>>>>>> do
>>>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>>>> /bin/sleep 60
>>>>>>> done
>>>>>>>
>>>>>>>
>>>>>>> On the NFS Server
>>>>>>> Active Internet connections (including servers)
>>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>>>
>>>>>>> On the NFS Client
>>>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You had also asked for the output below.
>>>>>>>
>>>>>>> # nfsstat -E -s
>>>>>>> BackChannelCtBindConnToSes
>>>>>>> 0 0
>>>>>>>
>>>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>>>
>>>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>>>
>>>>>>>
>>>>>>> Jason Breitman
>>>>>>>
>>>>>>>
>>>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>>>
>>>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>>>> Hi Jason,
>>>>>>>>
>>>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>>>
>>>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>>>
>>>>>>>>> Issue
>>>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>>>
>>>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>>>
>>>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>>>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>>>
>>>>>>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>>>>
>>>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>>>> NFS server with requests.
>>>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>>>> again and this should cause the window to open back up.
>>>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>>>
>>>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection, sending a FIN Packet (and still a TCP Window 0)
>>>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>>>> the case for NFSv4.1).
>>>>>>>
>>>>>>>> - CLIENT ACK that FIN.
>>>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>>>> - CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>>>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>>>>
>>>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we’d end up in the same state as you, I think.
>>>>>>>>
>>>>>>>> We’ve never had the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc. and traffic flows again (NFS starts responding)
>>>>>>>>
>>>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and through an iptables rule enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
>>>>>>>>
>>>>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>>>
>>>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>>>> (They're just waiting for RPC requests.)
>>>>>>> However, I do now think I know why the soclose() does not happen.
>>>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>>>> happen until the client creates a new TCP connection.
>>>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>>>
>>>>>>> I've created the attached patch (completely different from the previous one)
>>>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>>>> soclose().
>>>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>>>> this change will get the socket shutdown to complete.
>>>>>>>
>>>>>>> There are a couple more things you can check on the server...
>>>>>>> # nfsstat -E -s
>>>>>>> --> Look for the count under "BindConnToSes".
>>>>>>> --> If non-zero, backchannels have been assigned
>>>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>>>
>>>>>>> I think the attached patch might work around the problem.
>>>>>>> The code that should open up the receive window needs to be checked.
>>>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>>>> assigned.
>>>>>>>
>>>>>>> rick
>>>>>>>
>>>>>>> Youssef
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>>>>>>> https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$
>>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>>>>>> <xprtdied.patch>
>>>>>>>
>>>>>>> <nfs-hang.log.gz>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> freebsd-***@freebsd.org mailing list
>>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>>> _______________________________________________
>>>>>>> freebsd-***@freebsd.org mailing list
>>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>> _______________________________________________
>>>>>> freebsd-***@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> freebsd-***@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
Peter Eriksson
2021-04-08 22:47:32 UTC
Permalink
Hmmm..

We might have run into the same situation here between some Linux (CentOS 7.9.2009) clients and our FreeBSD 12.2 nfs servers this evening when a network switch in between the nfs server and the clients had to be rebooted causing a network partitioning for a number of minutes.

Not every NFS mount froze on the Linux clients but a number of them did. New connections/new NFS mounts worked fine, but the frozen ones stayed frozen. They unstuck themselves after some time (more than 6 minutes - more like an hour or two).

Unfortunately no logs (no netstat output, no tcpdump) are available from the clients at this time. We are setting up some monitoring scripts so if this happens again then I hope we’ll be able to capture some things…
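A minimal sketch of what such a client-side script could look like (the server address, log path and capture length are assumptions to adapt for the real environment):

#!/bin/sh
# Hypothetical client-side monitor: log the state of the NFS TCP connections
# once a minute, and grab a ~10 minute packet capture whenever a connection
# to the server is sitting in FIN_WAIT2 or LAST_ACK.
NFS_SERVER=NFS.Server.IP.X
LOG=/tmp/nfs-hang-client.log
while true
do
/bin/date >> $LOG
netstat -tn | grep "$NFS_SERVER:2049" >> $LOG
if netstat -tn | grep "$NFS_SERVER:2049" | egrep -q 'FIN_WAIT2|LAST_ACK'
then
# one capture per detection pass; files are timestamped so they do not overwrite
timeout 600 tcpdump -s 0 -w /tmp/nfs-hang-`date +%s`.pcap host $NFS_SERVER and port 2049
fi
sleep 60
done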

I tried to check the NFS server side logs on some of our NFS servers from around the time of the partition but I’ve yet to find anything of interest unfortunately...

Not really helpful, but the fact that it self-healed after a (long) while is interesting, I think…

- Peter

> On 6 Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org <mailto:***@freebsd.org> wrote:
> [stuff snipped]
>> OK. What is the FreeBSD version you are using?
> main Dec. 23, 2020.
>
>>
>> It seems that the TCP connection on the FreeBSD is still alive,
>> Linux has decided to start a new TCP connection using the old
>> port numbers. So it sends a SYN. The response is a challenge ACK
>> and Linux responds with a RST. This looks good so far. However,
>> FreeBSD should accept the RST and kill the TCP connection. The
>> next SYN from the Linux side would establish a new TCP connection.
>>
>> So I'm wondering why the RST is not accepted. I made the timestamp
>> checking stricter but introduced a bug where RST segments without
>> timestamps were ignored. This was fixed.
>>
>> Introduced in main on 2020/11/09:
>> https://svnweb.freebsd.org/changeset/base/367530
>> Introduced in stable/12 on 2020/11/30:
>> https://svnweb.freebsd.org/changeset/base/36818
>> Fix in main on 2021/01/13:
>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>> Fix in stable/12 on 2021/01/24:
>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>
>> Are you using a version which is affected by this bug?
> I was. Now I've applied the patch.
> Bad News. It did not fix the problem.
> It still gets into an endless "ignore RST" and stay established when
> the Send-Q is empty.
>
> If the Send-Q is non-empty when I partition, it recovers fine,
> sometimes not even needing to see an RST.
>
> rick
> ps: If you think there might be other recent changes that matter,
> just say the word and I'll upgrade to bits de jur.
>
> rick
>
> Best regards
> Michael
>>
>> If I wait long enough before healing the partition, it will
>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>> do battle (at least not for long).
>>
>> Btw, I have one running now that seems stuck really good.
>> It has been 20minutes since I plugged the net cable back in.
>> (Unfortunately, I didn't have tcpdump running until after
>> I saw it was not progressing after healing.
>> --> There is one difference. There was a 6minute timeout
>> enabled on the server krpc for "no activity", which is
>> now disabled like it is for NFSv4.1 in freebsd-current.
>> I had forgotten to re-disable it.
>> So, when it does battle, it might have been the 6minute
>> timeout, which would then do the soshutdown(..SHUT_WR)
>> which kept it from getting "stuck" forever.
>> -->This time I had to reboot the FreeBSD NFS server to
>> get the Linux client unstuck, so this one looked a lot
>> like what has been reported.
>> The pcap for this one, started after the network was plugged
>> back in and I noticed it was stuck for quite a while is here:
>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>
>> In it, there is just a bunch of RST followed by SYN sent
>> from client->FreeBSD and FreeBSD just keeps sending
>> acks for the old segment back.
>> --> It looks like FreeBSD did the "RST, ACK" after the
>> krpc did a soshutdown(..SHUT_WR) on the socket,
>> for the one you've been looking at.
>> I'll test some more...
>>
>>> I would like to understand why the reestablishment of the connection
>>> did not work...
>> It is looking like it takes either a non-empty send-q or a
>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>> out of established, where it just ignores the RSTs and
>> SYN packets.
>>
>> Thanks for looking at it, rick
>>
>> Best regards
>> Michael
>>>
>>> Have fun with it, rick
>>>
>>>
>>> ________________________________________
>>> From: ***@freebsd.org <***@freebsd.org>
>>> Sent: Sunday, April 4, 2021 12:41 PM
>>> To: Rick Macklem
>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>
>>>
>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> Well, I'm going to cheat and top post, since this is related info and
>>>> not really part of the discussion...
>>>>
>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>> several minutes after un-partitioning the connection.
>>>>
>>>> The battle basically consists of the Linux client sending an RST, followed
>>>> by a SYN.
>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>> over several minutes.
>>>>
>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>> pretty good at ignoring it.
>>>>
>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>> in case anyone wants to look at it.
>>> On freefall? I would like to take a look at it...
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>> *** Network is now partitioned...
>>>>
>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> *** Lots of lines snipped.
>>>>
>>>>
>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> *** Network is now unpartitioned...
>>>>
>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>> *** This "battle" goes on for 223sec...
>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>> client paused for a while (see ARPs below).
>>>>
>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>
>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>> of 13 (100+ for another test run).
>>>>
>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>> *** Now back in business...
>>>>
>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>
>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>> suspect a Linux client bug, but will be investigating further.
>>>>
>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>> or if the RST should be ack'd sooner?
>>>>
>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>
>>>> rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>> To: Rick Macklem; ***@freebsd.org
>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall), and the pfifo-fast scheduler, which could conspire to make tcp sessions hang forever.
>>>>
>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>
>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at lower cpu cost than pfq (default)). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>
>>>> I can try getting the relevant bug info next week...
>>>>
>>>> ________________________________
>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>> An: ***@freebsd.org <***@freebsd.org>
>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>> Betreff: Re: NFS Mount Hangs
>>>>
>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>
>>>>
>>>>
>>>>
>>>> ***@freebsd.org wrote:
>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> I hope you don't mind a top post...
>>>>>> I've been testing network partitioning between the only Linux client
>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>> applied to it.
>>>>>>
>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>> I see...
>>>>>>
>>>>>> While partitioned:
>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>> the network partition or stays ESTABLISHED.
>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>> sent a FIN, but you never called close() on the socket.
>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>> I guess, and therefore the server does not even detect that the peer
>>>>> is not reachable.
>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>> little while, and then disappears.
>>>>> So how does Linux detect the peer is not reachable?
>>>> Well, here's what I see in a packet capture in the Linux client once
>>>> I partition it (just unplug the net cable):
>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>> - then only ARP queries
>>>>
>>>> Once I plug the net cable back in:
>>>> - ARP works
>>>> - one more retransmit of the same segment
>>>> - receives RST from FreeBSD
>>>> ** So, is this now a "new" TCP connection, despite
>>>> using the same port#.
>>>> --> It matters for NFS, since "new connection"
>>>> implies "must retry all outstanding RPCs".
>>>> - sends SYN
>>>> - receives SYN, ACK from FreeBSD
>>>> --> connection starts working again
>>>> Always uses same port#.
>>>>
>>>> On the FreeBSD server end:
>>>> - receives the last retransmit of the segment (with ACK)
>>>> - sends RST
>>>> - receives SYN
>>>> - sends SYN, ACK
>>>>
>>>> I thought that there was no RST in the capture I looked at
>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>> The socket disappears from the Linux "netstat -a" and I
>>>> suspect that happens after about 54sec, but I am not sure
>>>> about the timing.
>>>>
>>>>>>
>>>>>> After unpartitioning:
>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>> the same port#
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>
>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>
>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>> I have now done so, as above.
>>>>
>>>>>> --> However I never get hangs.
>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>> mount starts working again.
>>>>>>
>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>> a little while.)
>>>>>>
>>>>>> What do those TCP conversant think?
>>>>> I guess you are never calling close() on the socket, for which
>>>>> the connection state is CLOSED.
>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>> BindConnectionToSession to re-assign the back channel.
>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>> that I have to tell them the back channel needs re-assignment
>>>> or something like that.
>>>>
>>>> I am pretty certain they are broken, in that the client needs to
>>>> retry all outstanding RPCs.
>>>>
>>>> For others, here's the long winded version of this that I just
>>>> put on the phabricator review:
>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>> These structures are ref counted and the soclose() is done
>>>> when the ref. cnt goes to zero. My understanding is that
>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>> before the xprt ref. cnt goes to zero.
>>>>
>>>> For NFSv4.1/4.2 there is something called a back channel
>>>> which means that a "xprt" is used for server->client RPCs,
>>>> although the TCP connection is established by the client
>>>> to the server.
>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>>
>>>> client re-assigns it to a different TCP connection
>>>> via an operation called BindConnectionToSession
>>>> and the Linux client is not doing this soon enough,
>>>> it appears.
>>>>
>>>> So, the soclose() is delayed, which is why I think the
>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>> which can happen before the client gets around to
>>>> re-assigning the back channel.
>>>>
>>>> Thanks for your help with this Michael, rick
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> rick
>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>> for them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>> To: Jason Breitman
>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>> # ifconfig lagg0
>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>
>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>
>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>
>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>
>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>
>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>> This specifies how many seconds to wait for a final FIN
>>>>> packet before the socket is forcibly closed. This is
>>>>> strictly a violation of the TCP specification, but
>>>>> required to prevent denial-of-service attacks. In Linux
>>>>> 2.2, the default value was 180.
>>>>>
>>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>
>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>
>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>
>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>> I ran the capture information again to make use of the situation.
>>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> while true
>>>>> do
>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>> /bin/sleep 60
>>>>> done
>>>>>
>>>>>
>>>>> On the NFS Server
>>>>> Active Internet connections (including servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>
>>>>> On the NFS Client
>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>
>>>>>
>>>>>
>>>>> You had also asked for the output below.
>>>>>
>>>>> # nfsstat -E -s
>>>>> BackChannelCtBindConnToSes
>>>>> 0 0
>>>>>
>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>
>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>
>>>>>
>>>>> Jason Breitman
>>>>>
>>>>>
>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>
>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>
>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>
>>>>>>> Issue
>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>
>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>
>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>
>>>>>> We’ve had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
>>>>>>
>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>> NFS server with requests.
>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>> again and this should cause the window to open back up.
>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>
>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the TCP connection, sending a FIN Packet (and still a TCP Window 0)
>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>> the case for NFSv4.1).
>>>>>
>>>>>> - CLIENT ACK that FIN.
>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>> - CLIENT closes its half part part of the socket and goes in LAST_ACK state.
>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>>>>>>
>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we’d end up in the same state as you, I think.
>>>>>>
>>>>>> We’ve never had the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc. and traffic flows again (NFS starts responding)
>>>>>>
>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we’ve added a check script on the client that detects LAST_ACK sockets on the client and through an iptables rule enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
>>>>>>
>>>>>> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>
>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>> (They're just waiting for RPC requests.)
>>>>> However, I do now think I know why the soclose() does not happen.
>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>> happen until the client creates a new TCP connection.
>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>
>>>>> I've created the attached patch (completely different from the previous one)
>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>> soclose().
>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>> this change will get the socket shutdown to complete.
>>>>>
>>>>> There are a couple more things you can check on the server...
>>>>> # nfsstat -E -s
>>>>> --> Look for the count under "BindConnToSes".
>>>>> --> If non-zero, backchannels have been assigned
>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>
>>>>> I think the attached patch might work around the problem.
>>>>> The code that should open up the receive window needs to be checked.
>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>> assigned.
>>>>>
>>>>> rick
>>>>>
>>>>> Youssef
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>>>>> https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>>>> <xprtdied.patch>
>>>>>
>>>>> <nfs-hang.log.gz>
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>
>>>> _______________________________________________
>>>> freebsd-***@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>> _______________________________________________
>>>> freebsd-***@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>
>>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
> _______________________________________________
> freebsd-***@freebsd.org <mailto:freebsd-***@freebsd.org> mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net <https://lists.freebsd.org/mailman/listinfo/freebsd-net>
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org <mailto:freebsd-net-***@freebsd.org>"
Rick Macklem
2021-04-08 23:37:09 UTC
Permalink
Peter Eriksson wrote:
>Hmmm..
>
>We might have run into the same situation here between some Linux (CentOS 7.9.2009) clients and our FreeBSD 12.2 nfs servers this evening when a network switch in between the nfs server and the clients had to be rebooted causing a network partitioning for a number of minutes.
>
>Not every NFS mount froze on the Linux clients but a number of them did. New connections/new NFS mounts worked fine, but the frozen ones stayed frozen. They unstuck themselves after some time (more than 6 minutes - more like an hour or two).
The 6minute timeout is disabled for 4.1/4.2 mounts.
I think I did that over concerns w.r.t. maintaining the back channel.
I am planning on enabling it soon.
The 2nd/3rd attachments in PR#254590 do this (same patch, but for FreeBSD13
vs FreeBSD12).
--> I also have patches in PR#254816 to fix recovery issues that can occur
after the partition heals.

I plan on making a post to freebsd-stable@ soon that summarizes the
patches. Doing the network partitioning testing has resulted in a bunch
of them, although most only matter if you have delegations enabled.
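If anyone wants to try one of those patch attachments before it is committed, a rough sketch of the usual source-tree workflow follows (the patch file name is hypothetical, and the -p strip level may need adjusting to the paths in the attachment):

# apply the attachment to the installed source tree and rebuild the kernel
cd /usr/src
patch < /path/to/nfs-krpc.patch
make -j4 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now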

>Unfortunately no logs (no netstat output, no tcpdump) are available from the clients at this time. We are setting up some monitoring scripts so if this happens again then I hope we’ll be able to capture some things…
>
>I tried to check the NFS server side logs on some of our NFS servers from around the time of the partition but I’ve yet to find anything of interest unfortunately...
There is nothing sent to the console/syslog when a client creates a new TCP
connection. Every normal mount does them and the reconnects do not look
any different to the kernel RPC.

If it happens again, it would be nice to at least monitor "netstat -a" on the
servers, to see what state the connections are in.
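A minimal sketch of such a server-side monitor (log path and interval are assumptions), just recording the per-connection TCP state for the nfsd port once a minute:

#!/bin/sh
# Hypothetical FreeBSD server-side monitor: record the TCP state of every
# connection involving port 2049 so the states around a partition
# (CLOSE_WAIT, FIN_WAIT_2, ...) can be seen afterwards.
while true
do
/bin/date >> /tmp/nfs-conn-states.log
netstat -an -p tcp | grep '\.2049 ' >> /tmp/nfs-conn-states.log
/bin/sleep 60
done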

>Not really helpful, but the fact that it self-healed after a (long) while is interesting, I think…
What's that saying, "patience is a virtue"? I have no idea what could take hours
to get resolved.

rick

- Peter

> On 6 Apr 2021, at 01:24, Rick Macklem <***@uoguelph.ca> wrote:
>
> ***@freebsd.org <mailto:***@freebsd.org> wrote:
> [stuff snipped]
>> OK. What is the FreeBSD version you are using?
> main Dec. 23, 2020.
>
>>
>> It seems that the TCP connection on the FreeBSD is still alive,
>> Linux has decided to start a new TCP connection using the old
>> port numbers. So it sends a SYN. The response is a challenge ACK
>> and Linux responds with a RST. This looks good so far. However,
>> FreeBSD should accept the RST and kill the TCP connection. The
>> next SYN from the Linux side would establish a new TCP connection.
>>
>> So I'm wondering why the RST is not accepted. I made the timestamp
>> checking stricter but introduced a bug where RST segments without
>> timestamps were ignored. This was fixed.
>>
>> Introduced in main on 2020/11/09:
>> https://svnweb.freebsd.org/changeset/base/367530
>> Introduced in stable/12 on 2020/11/30:
>> https://svnweb.freebsd.org/changeset/base/36818
>> Fix in main on 2021/01/13:
>> https://cgit.FreeBSD.org/src/commit/?id=cc3c34859eab1b317d0f38731355b53f7d978c97
>> Fix in stable/12 on 2021/01/24:
>> https://cgit.FreeBSD.org/src/commit/?id=d05d908d6d3c85479c84c707f931148439ae826b
>>
>> Are you using a version which is affected by this bug?
> I was. Now I've applied the patch.
> Bad News. It did not fix the problem.
> It still gets into an endless "ignore RST" and stay established when
> the Send-Q is empty.
>
> If the Send-Q is non-empty when I partition, it recovers fine,
> sometimes not even needing to see an RST.
>
> rick
> ps: If you think there might be other recent changes that matter,
> just say the word and I'll upgrade to bits de jur.
>
> rick
>
> Best regards
> Michael
>>
>> If I wait long enough before healing the partition, it will
>> go to FIN_WAIT_1, and then if I plug it back in, it does not
>> do battle (at least not for long).
>>
>> Btw, I have one running now that seems stuck really good.
>> It has been 20minutes since I plugged the net cable back in.
>> (Unfortunately, I didn't have tcpdump running until after
>> I saw it was not progressing after healing.
>> --> There is one difference. There was a 6minute timeout
>> enabled on the server krpc for "no activity", which is
>> now disabled like it is for NFSv4.1 in freebsd-current.
>> I had forgotten to re-disable it.
>> So, when it does battle, it might have been the 6minute
>> timeout, which would then do the soshutdown(..SHUT_WR)
>> which kept it from getting "stuck" forever.
>> -->This time I had to reboot the FreeBSD NFS server to
>> get the Linux client unstuck, so this one looked a lot
>> like what has been reported.
>> The pcap for this one, started after the network was plugged
>> back in and I noticed it was stuck for quite a while is here:
>> fetch https://people.freebsd.org/~rmacklem/stuck.pcap
>>
>> In it, there is just a bunch of RST followed by SYN sent
>> from client->FreeBSD and FreeBSD just keeps sending
>> acks for the old segment back.
>> --> It looks like FreeBSD did the "RST, ACK" after the
>> krpc did a soshutdown(..SHUT_WR) on the socket,
>> for the one you've been looking at.
>> I'll test some more...
>>
>>> I would like to understand why the reestablishment of the connection
>>> did not work...
>> It is looking like it takes either a non-empty send-q or a
>> soshutdown(..SHUT_WR) to get the FreeBSD socket
>> out of established, where it just ignores the RSTs and
>> SYN packets.
>>
>> Thanks for looking at it, rick
>>
>> Best regards
>> Michael
>>>
>>> Have fun with it, rick
>>>
>>>
>>> ________________________________________
>>> From: ***@freebsd.org <***@freebsd.org>
>>> Sent: Sunday, April 4, 2021 12:41 PM
>>> To: Rick Macklem
>>> Cc: Scheffenegger, Richard; Youssef GHORBAL; freebsd-***@freebsd.org
>>> Subject: Re: NFS Mount Hangs
>>>
>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>
>>>
>>>> On 4. Apr 2021, at 17:27, Rick Macklem <***@uoguelph.ca> wrote:
>>>>
>>>> Well, I'm going to cheat and top post, since this is related info and
>>>> not really part of the discussion...
>>>>
>>>> I've been testing network partitioning between a Linux client (5.2 kernel)
>>>> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
>>>> I have had the Linux client doing "battle" with the FreeBSD server for
>>>> several minutes after un-partitioning the connection.
>>>>
>>>> The battle basically consists of the Linux client sending an RST, followed
>>>> by a SYN.
>>>> The FreeBSD server ignores the RST and just replies with the same old ack.
>>>> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
>>>> over several minutes.
>>>>
>>>> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
>>>> pretty good at ignoring it.
>>>>
>>>> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
>>>> in case anyone wants to look at it.
>>> On freefall? I would like to take a look at it...
>>>
>>> Best regards
>>> Michael
>>>>
>>>> Here's a tcpdump snippet of the interesting part (see the *** comments):
>>>> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
>>>> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
>>>> *** Network is now partitioned...
>>>>
>>>> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> *** Lots of lines snipped.
>>>>
>>>>
>>>> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> *** Network is now unpartitioned...
>>>>
>>>> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
>>>> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
>>>> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
>>>> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
>>>> *** This "battle" goes on for 223sec...
>>>> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
>>>> "FreeBSD replies with same old ACK". In another test run I saw this
>>>> cycle continue non-stop for several minutes. This time, the Linux
>>>> client paused for a while (see ARPs below).
>>>>
>>>> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
>>>> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
>>>> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
>>>> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>>>>
>>>> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
>>>> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
>>>> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
>>>> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
>>>> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
>>>> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
>>>> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
>>>> of 13 (100+ for another test run).
>>>>
>>>> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
>>>> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
>>>> *** Now back in business...
>>>>
>>>> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
>>>> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
>>>> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
>>>> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
>>>> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
>>>> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>>>>
>>>> This error 10063 after the partition heals is also "bad news". It indicates the Session
>>>> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
>>>> suspect a Linux client bug, but will be investigating further.
>>>>
>>>> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
>>>> or if the RST should be ack'd sooner?
>>>>
>>>> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>>>>
>>>> rick
>>>>
>>>>
>>>> ________________________________________
>>>> From: Scheffenegger, Richard <***@netapp.com>
>>>> Sent: Sunday, April 4, 2021 7:50 AM
>>>> To: Rick Macklem; ***@freebsd.org
>>>> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
>>>> Subject: Re: NFS Mount Hangs
>>>>
>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>
>>>>
>>>> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>>>>
>>>> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>>>>
>>>> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>>>>
>>>> I can try getting the relevant bug info next week...
>>>>
>>>> ________________________________
>>>> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
>>>> Gesendet: Friday, April 2, 2021 11:31:01 PM
>>>> An: ***@freebsd.org <***@freebsd.org>
>>>> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
>>>> Betreff: Re: NFS Mount Hangs
>>>>
>>>> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>>>>
>>>>
>>>>
>>>>
>>>> ***@freebsd.org wrote:
>>>>>> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
>>>>>>
>>>>>> I hope you don't mind a top post...
>>>>>> I've been testing network partitioning between the only Linux client
>>>>>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>>>>>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>>>>>> applied to it.
>>>>>>
>>>>>> I'm not enough of a TCP guy to know if this is useful, but here's what
>>>>>> I see...
>>>>>>
>>>>>> While partitioned:
>>>>>> On the FreeBSD server end, the socket either goes to CLOSED during
>>>>>> the network partition or stays ESTABLISHED.
>>>>> If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>>>>> sent a FIN, but you never called close() on the socket.
>>>>> If the socket stays in ESTABLISHED, there is no communication ongoing,
>>>>> I guess, and therefore the server does not even detect that the peer
>>>>> is not reachable.
>>>>>> On the Linux end, the socket seems to remain ESTABLISHED for a
>>>>>> little while, and then disappears.
>>>>> So how does Linux detect the peer is not reachable?
>>>> Well, here's what I see in a packet capture in the Linux client once
>>>> I partition it (just unplug the net cable):
>>>> - lots of retransmits of the same segment (with ACK) for 54sec
>>>> - then only ARP queries
>>>>
>>>> Once I plug the net cable back in:
>>>> - ARP works
>>>> - one more retransmit of the same segment
>>>> - receives RST from FreeBSD
>>>> ** So, is this now a "new" TCP connection, despite
>>>> using the same port#.
>>>> --> It matters for NFS, since "new connection"
>>>> implies "must retry all outstanding RPCs".
>>>> - sends SYN
>>>> - receives SYN, ACK from FreeBSD
>>>> --> connection starts working again
>>>> Always uses same port#.
>>>>
>>>> On the FreeBSD server end:
>>>> - receives the last retransmit of the segment (with ACK)
>>>> - sends RST
>>>> - receives SYN
>>>> - sends SYN, ACK
>>>>
>>>> I thought that there was no RST in the capture I looked at
>>>> yesterday, so I'm not sure if FreeBSD always sends an RST,
>>>> but the Linux client behaviour was the same. (Sent a SYN, etc).
>>>> The socket disappears from the Linux "netstat -a" and I
>>>> suspect that happens after about 54sec, but I am not sure
>>>> about the timing.
>>>>
>>>>>>
>>>>>> After unpartitioning:
>>>>>> On the FreeBSD server end, you get another socket showing up at
>>>>>> the same port#
>>>>>> Active Internet connections (including servers)
>>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
>>>>>> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
>>>>>>
>>>>>> The Linux client shows the same connection ESTABLISHED.
>>>> But disappears from "netstat -a" for a while during the partitioning.
>>>>
>>>>>> (The mount sometimes reports an error. I haven't looked at packet
>>>>>> traces to see if it retries RPCs or why the errors occur.)
>>>> I have now done so, as above.
>>>>
>>>>>> --> However I never get hangs.
>>>>>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>>>>>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>>>>>> mount starts working again.
>>>>>>
>>>>>> The most obvious thing is that the Linux client always keeps using
>>>>>> the same port#. (The FreeBSD client will use a different port# when
>>>>>> it does a TCP reconnect after no response from the NFS server for
>>>>>> a little while.)
>>>>>>
>>>>>> What do those TCP conversant think?
>>>>> I guess you are never calling close() on the socket for which
>>>>> the connection state is CLOSED.
>>>> Ok, that makes sense. For this case the Linux client has not done a
>>>> BindConnectionToSession to re-assign the back channel.
>>>> I'll have to bug them about this. However, I'll bet they'll answer
>>>> that I have to tell them the back channel needs re-assignment
>>>> or something like that.
>>>>
>>>> I am pretty certain they are broken, in that the client needs to
>>>> retry all outstanding RPCs.
>>>>
>>>> For others, here's the long winded version of this that I just
>>>> put on the phabricator review:
>>>> In the server side kernel RPC, the socket (struct socket *) is in a
>>>> structure called SVCXPRT (normally pointed to by "xprt").
>>>> These structures are ref counted and the soclose() is done
>>>> when the ref. cnt goes to zero. My understanding is that
>>>> "struct socket *" is free'd by soclose() so this cannot be done
>>>> before the xprt ref. cnt goes to zero.
>>>>
>>>> For NFSv4.1/4.2 there is something called a back channel
>>>> which means that a "xprt" is used for server->client RPCs,
>>>> although the TCP connection is established by the client
>>>> to the server.
>>>> --> This back channel holds a ref cnt on "xprt" until the
>>>>
>>>> client re-assigns it to a different TCP connection
>>>> via an operation called BindConnectionToSession
>>>> and the Linux client is not doing this soon enough,
>>>> it appears.
>>>>
>>>> So, the soclose() is delayed, which is why I think the
>>>> TCP connection gets stuck in CLOSE_WAIT and that is
>>>> why I've added the soshutdown(..SHUT_WR) calls,
>>>> which can happen before the client gets around to
>>>> re-assigning the back channel.
>>>>
>>>> Thanks for your help with this Michael, rick
>>>>
>>>> Best regards
>>>> Michael
>>>>>
>>>>> rick
>>>>> ps: I can capture packets while doing this, if anyone has a use
>>>>> for them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
>>>>> Sent: Saturday, March 27, 2021 6:57 PM
>>>>> To: Jason Breitman
>>>>> Cc: Rick Macklem; freebsd-***@freebsd.org
>>>>> Subject: Re: NFS Mount Hangs
>>>>>
>>>>> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>
>>>>> The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
>>>>> # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
>>>>> # ifconfig lagg0
>>>>> lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>> options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
>>>>>
>>>>> We can also say that the sysctl settings did not resolve this issue.
>>>>>
>>>>> # sysctl net.inet.tcp.fast_finwait2_recycle=1
>>>>> net.inet.tcp.fast_finwait2_recycle: 0 -> 1
>>>>>
>>>>> # sysctl net.inet.tcp.finwait2_timeout=1000
>>>>> net.inet.tcp.finwait2_timeout: 60000 -> 1000
>>>>>
>>>>> I don’t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
>>>>> By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
>>>>>
>>>>> tcp_fin_timeout (integer; default: 60; since Linux 2.2)
>>>>> This specifies how many seconds to wait for a final FIN
>>>>> packet before the socket is forcibly closed. This is
>>>>> strictly a violation of the TCP specification, but
>>>>> required to prevent denial-of-service attacks. In Linux
>>>>> 2.2, the default value was 180.
>>>>>
>>>>> So I don’t get why it gets stuck in the FIN_WAIT2 state anyway.
>>>>>
>>>>> You really need to have a packet capture during the outage (client and server side) so you’ll get over the wire chat and start speculating from there.
>>>>> No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
>>>>>
>>>>> * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
>>>>>
>>>>> The issue occurred after 5 days following a reboot of the client machines.
>>>>> I ran the capture information again to make use of the situation.
>>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> while true
>>>>> do
>>>>> /bin/date >> /tmp/nfs-hang.log
>>>>> /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
>>>>> /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
>>>>> /bin/sleep 60
>>>>> done
>>>>>
>>>>>
>>>>> On the NFS Server
>>>>> Active Internet connections (including servers)
>>>>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>>>>> tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
>>>>>
>>>>> On the NFS Client
>>>>> tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
>>>>>
>>>>>
>>>>>
>>>>> You had also asked for the output below.
>>>>>
>>>>> # nfsstat -E -s
>>>>> BackChannelCtBindConnToSes
>>>>> 0 0
>>>>>
>>>>> # sysctl vfs.nfsd.request_space_throttle_count
>>>>> vfs.nfsd.request_space_throttle_count: 0
>>>>>
>>>>> I see that you are testing a patch and I look forward to seeing the results.
>>>>>
>>>>>
>>>>> Jason Breitman
>>>>>
>>>>>
>>>>> On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
>>>>>
>>>>> Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
>>>>>>>
>>>>>>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>>>>>>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>>>>>>
>>>>>>> Issue
>>>>>>> NFSv4 mounts periodically hang on the NFS Client.
>>>>>>>
>>>>>>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>>>>>>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>>>>>>> Rebooting the NFS Client appears to be the only solution.
>>>>>>
>>>>>> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their >own nfsd)
>>>>> Yes, my understanding is that Isilon uses a proprietary user space nfsd and
>>>>> not the kernel based RPC and nfsd in FreeBSD.
>>>>>
>>>>>> We’ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
>>>>>>
>>>>>> - Data flows correctly between SERVER and the CLIENT
>>>>>> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
>>>>>> - The client (eager to send data) can only ack data sent by SERVER.
>>>>>> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so it can flush its buffers.
>>>>>> - SERVER responds with a TCP Zero Window to those probes.
>>>>> Having the window size drop to zero is not necessarily incorrect.
>>>>> If the server is overloaded (has a backlog of NFS requests), it can stop doing
>>>>> soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
>>>>> closes). This results in "backpressure" to stop the NFS client from flooding the
>>>>> NFS server with requests.
>>>>> --> However, once the backlog is handled, the nfsd should start to soreceive()
>>>>> again and this should cause the window to open back up.
>>>>> --> Maybe this is broken in the socket/TCP code. I quickly got lost in
>>>>> tcp_output() when it decides what to do about the rcvwin.
>>>>>
>>>>>> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
>>>>> This probably does not happen for Jason's case, since the 6minute timeout
>>>>> is disabled when the TCP connection is assigned as a backchannel (most likely
>>>>> the case for NFSv4.1).
>>>>>
>>>>>> - CLIENT ACK that FIN.
>>>>>> - SERVER goes in FIN_WAIT_2 state
>>>>>> - CLIENT closes its half of the socket and goes into LAST_ACK state.
>>>>>> - FIN is never sent by the client since there is still data in its SendQ and the receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush its buffers >and terminate its side of the socket.
>>>>>> - SERVER keeps responding with a TCP Zero Window to those probes.
>>>>>> => The last two steps go on and on for hours/days freezing the NFS mount bound >to that TCP session.
>>>>>>
>>>>>> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and the server wanting to send data, we’d end up in the same >state as you, I think.
>>>>>>
>>>>>> We’ve never had the root cause of why the SERVER decided to close the TCP >Window and no longer accept data; the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled and at the next >occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handshake, etc. and traffic >flows again (NFS starts responding)
>>>>>>
>>>>>> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we’ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
>>>>>>
>>>>>> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
>>>>> Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
>>>>>
>>>>> I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
>>>>> (They're just waiting for RPC requests.)
>>>>> However, I do now think I know why the soclose() does not happen.
>>>>> When the TCP connection is assigned as a backchannel, that takes a reference
>>>>> cnt on the structure. This refcnt won't be released until the connection is
>>>>> replaced by a BindConnectionToSession operation from the client. But that won't
>>>>> happen until the client creates a new TCP connection.
>>>>> --> No refcnt release-->no refcnt of 0-->no soclose().
>>>>>
>>>>> I've created the attached patch (completely different from the previous one)
>>>>> that adds soshutdown(SHUT_WR) calls in the three places where the TCP
>>>>> connection is going away. This seems to get it past CLOSE_WAIT without a
>>>>> soclose().
>>>>> --> I know you are not comfortable with patching your server, but I do think
>>>>> this change will get the socket shutdown to complete.
>>>>>
>>>>> There are a couple more things you can check on the server...
>>>>> # nfsstat -E -s
>>>>> --> Look for the count under "BindConnToSes".
>>>>> --> If non-zero, backchannels have been assigned
>>>>> # sysctl -a | fgrep request_space_throttle_count
>>>>> --> If non-zero, the server has been overloaded at some point.
>>>>>
>>>>> I think the attached patch might work around the problem.
>>>>> The code that should open up the receive window needs to be checked.
>>>>> I am also looking at enabling the 6minute timeout when a backchannel is
>>>>> assigned.
>>>>>
>>>>> rick
>>>>>
>>>>> Youssef
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>>>> <xprtdied.patch>
>>>>>
>>>>> <nfs-hang.log.gz>
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>> _______________________________________________
>>>>> freebsd-***@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>>
>>>> _______________________________________________
>>>> freebsd-***@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>> _______________________________________________
>>>> freebsd-***@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>>
>>
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
> _______________________________________________
> freebsd-***@freebsd.org <mailto:freebsd-***@freebsd.org> mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net <https://lists.freebsd.org/mailman/listinfo/freebsd-net>
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org <mailto:freebsd-net-***@freebsd.org>"

_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
t***@freebsd.org
2021-03-18 12:58:30 UTC
Permalink
> On 18. Mar 2021, at 13:53, Rodney W. Grimes <freebsd-***@gndrsh.dnsmgr.net> wrote:
>
> Note I am NOT a TCP expert, but know enough about it to add a comment...
>
>> Alan Somers wrote:
>> [stuff snipped]
>>> Is the 128K limit related to MAXPHYS? If so, it should be greater in 13.0.
>> For the client, yes. For the server, no.
>> For the server, it is just a compile time constant NFS_SRVMAXIO.
>>
>> It's mainly related to the fact that I haven't gotten around to testing larger
>> sizes yet.
>> - kern.ipc.maxsockbuf needs to be several times the limit, which means it would
>> have to increase for 1Mbyte.
>> - The session code must negotiate a maximum RPC size > 1 Mbyte.
>> (I think the server code does do this, but it needs to be tested.)
>> And, yes, the client is limited to MAXPHYS.
>>
>> Doing this is on my todo list, rick
>>
>> The client should acquire the attributes that indicate that and set rsize/wsize
>> to that. "# nfsstat -m" on the client should show you what the client
>> is actually using. If it is larger than 128K, set both rsize and wsize to 128K.
>>
>>> Output from the NFS Client when the issue occurs
>>> # netstat -an | grep NFS.Server.IP.X
>>> tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
>> I'm no TCP guy. Hopefully others might know why the client would be
>> stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
>> but could be wrong?)
>
> The most common way to get stuck in FIN_WAIT2 is to call
> shutdown(2) on a socket, but never following up with a
> close(2) after some timeout period. The "client" is still
> connected to the socket and can stay in this shutdown state
> forever; the kernel will not reap the socket as it is
> associated with a process, aka not orphaned. I suspect
> that the Linux client has a corner condition that is leading
> to this socket leak.
>
> If on the Linux client you can look at the sockets to see
> if these are still associated with a process, a la fstat or
> whatever Linux tool does this; that would be helpful.
> If they are infact connected to a process it is that
> process that must call close(2) to clean these up.
>
> IIRC the server side socket would be gone at this point
> and there is nothing the server can do that would allow
> a FIN_WAIT2 to close down.
Jason reported that the server is in CLOSE-WAIT. This would
mean that the server received the FIN, ACKed it, but has not
initiated the teardown of the Server->Client direction.
So the server side socket is still there and close has not
been called yet.
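
As a quick way to check Rod's point on the Linux client side, something
like the following sketch should do (assuming the mount goes to the
standard nfsd port 2049 and that ss/lsof are available; it only lists
state and changes nothing):

#!/bin/sh
# List client sockets toward the NFS server that are sitting in
# FIN-WAIT-2, including the owning process (if any).  A socket with
# no owning process would point at the kernel RPC layer rather than
# a user process that forgot to close().
ss -tnp state fin-wait-2 '( dport = :2049 )'

# Cross-check with lsof: which processes (if any) still hold an fd
# on a connection to the nfsd port?
lsof -nP -i TCP:2049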
>
> The real TCP experts can now correct my 30 year old TCP
> stack understanding...
I wouldn't count myself as a real TCP expert, but the behaviour
hasn't changed in the last 30 years, I think...

Best regards
Michael
>
>>
>>> # cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
>>> netid: tcp
>>> addr: NFS.Server.IP.X
>>> port: 2049
>>> state: 0x51
>>>
>>> syslog
>>> Mar 4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- --rqstp- ->timeout ---ops--
>>> Mar 4 10:29:27 hostname kernel: [437414.133158] 57419 40a1 0 9b723c73 >143cfadf 30000 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status [sunrpc] >q:xprt_pending
>> I don't know what OPEN_NOATTR means, but I assume it is some variant
>> of NFSv4 Open operation.
>> [stuff snipped]
>>> Mar 4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 xprt_connect_status: >connect attempt timed out
>>> Mar 4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 call_connect_status
>>> (status -110)
>> I have no idea what status -110 means?
>>> Mar 4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout (major)
>>> Mar 4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind (status 0)
>>> Mar 4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>>> Mar 4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
>>> Mar 4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 xprt_connect_status: >connect attempt timed out
>>> Mar 4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 call_connect_status >(status -110)
>>> Mar 4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout (minor)
>>> Mar 4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind (status 0)
>>> Mar 4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect xprt >00000000e061831b is not connected
>>> Mar 4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect xprt >00000000e061831b is not connected
>> Is it possible that the client is trying to (re)connect using the same client port#?
>> I would normally expect the client to create a new TCP connection using a
>> different client port# and then retry the outstanding RPCs.
>> --> Capturing packets when this happens would show us what is going on.
>>
>> If there is a problem on the FreeBSD end, it is most likely a broken
>> network device driver.
>> --> Try disabling TSO , LRO.
>> --> Try a different driver for the net hardware on the server.
>> --> Try a different net chip on the server.
>> If you can capture packets when (not after) the hang
>> occurs, then you can look at them in wireshark and see
>> what is actually happening. (Ideally on both client and
>> server, to check that your network hasn't dropped anything.)
>> --> I know, if the hangs aren't easily reproducible, this isn't
>> easily done.
>> --> Try a newer Linux kernel and see if the problem persists.
>> The Linux folk will get more interested if you can reproduce
>> the problem on 5.12. (Recent bakeathon testing of the 5.12
>> kernel against the FreeBSD server did not find any issues.)
>>
>> Hopefully the network folk have some insight w.r.t. why
>> the TCP connection is sitting in FIN_WAIT2.
>>
>> rick
>>
>>
>>
>> Jason Breitman
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>>
>> _______________________________________________
>> freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
>> _______________________________________________
>> freebsd-***@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>>
>
> --
> Rod Grimes ***@freebsd.org
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
Rodney W. Grimes
2021-04-04 17:27:28 UTC
Permalink
And I'll follow the lead to top post, as I have been quietly following
this thread, trying to only add when I think I have relevant input, and
I think I do on a small point...

Rick,
Your "unplugging" a cable to simulate network partitioning,
in my experience this is a bad way to do that, as the host gets to
see the link layer go down and knows it can not send. I am actually
puzzled that you see arp packets, but I guess those are getting
picked off before the interface layer silently tosses them on
the ground. IIRC due to this loss of link layer you may be
masking some things that would occur in other situations as often
an error is returned to the application layer. IE the ONLY packet
you're likely to see into an unplugged cable is "arp".

I can suggest other means to partition, such as configuring a switch
port in and out of the correct LAN/VLAN, a physical switch in the TX
pair to open it, but leave RX pair intact so carrier is not lost.
Both of these simulate partitioning that is more realistic, AND does
not have the side effect of allowing upper layers to eat the packets
before bpf can grab them, or be told that partitioning has occurred.
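
For reference, a software-only variant of the same idea can be sketched
like this (assuming the test is driven from the Linux client and
NFS.Server.IP.X stands in for the server's address): silently drop all
traffic to and from the server, so the link never goes down and nothing
is reported up the stack.

#!/bin/sh
# Simulate a network partition on the Linux client without touching
# the cable: silently drop everything to and from the NFS server.
# The link stays up, no error is returned to upper layers, and
# tcpdump still sees inbound packets before they are dropped.
SERVER=NFS.Server.IP.X          # placeholder for the real address

iptables -I INPUT  -s "$SERVER" -j DROP
iptables -I OUTPUT -d "$SERVER" -j DROP

# ... run the test while "partitioned" ...

# Heal the partition by deleting the same rules.
iptables -D INPUT  -s "$SERVER" -j DROP
iptables -D OUTPUT -d "$SERVER" -j DROP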

Another side effect of unplugging a cable is that a host should
immediately invalidate all ARP entries on that interface... hence
you're getting into an "arp who-has" situation that should not even
start for 5 minutes in the other failure modes.
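
Whether that invalidation actually happened can be checked directly
from the neighbour caches on both ends; a small sketch (eth0/em0 are
placeholders for the real interface names):

#!/bin/sh
# On the Linux client: dump the neighbour (ARP) cache for the test
# interface; the entry for the server should go stale/incomplete
# while the partition is in place and come back once it heals.
ip neigh show dev eth0

# The FreeBSD server side equivalent would be:
#   arp -an -i em0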

Regards,
Rod

> Well, I'm going to cheat and top post, since this is related info. and
> not really part of the discussion...
>
> I've been testing network partitioning between a Linux client (5.2 kernel)
> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
> I have had the Linux client doing "battle" with the FreeBSD server for
> several minutes after un-partitioning the connection.
>
> The battle basically consists of the Linux client sending an RST, followed
> by a SYN.
> The FreeBSD server ignores the RST and just replies with the same old ack.
> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
> over several minutes.
>
> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
> pretty good at ignoring it.
>
> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
> in case anyone wants to look at it.
>
> Here's a tcpdump snippet of the interesting part (see the *** comments):
> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
> *** Network is now partitioned...
>
> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> *** Lots of lines snipped.
>
>
> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> *** Network is now unpartitioned...
>
> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
> *** This "battle" goes on for 223sec...
> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
> "FreeBSD replies with same old ACK". In another test run I saw this
> cycle continue non-stop for several minutes. This time, the Linux
> client paused for a while (see ARPs below).
>
> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>
> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
> of 13 (100+ for another test run).
>
> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
> *** Now back in business...
>
> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>
> This error 10063 after the partition heals is also "bad news". It indicates the Session
> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
> suspect a Linux client bug, but will be investigating further.
>
> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
> or if the RST should be ack'd sooner?
>
> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>
> rick
>
>
> ________________________________________
> From: Scheffenegger, Richard <***@netapp.com>
> Sent: Sunday, April 4, 2021 7:50 AM
> To: Rick Macklem; ***@freebsd.org
> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>
>
> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>
> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>
> The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>
> I can try getting the relevant bug info next week...
>
> ________________________________
> Von: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> im Auftrag von Rick Macklem <***@uoguelph.ca>
> Gesendet: Friday, April 2, 2021 11:31:01 PM
> An: ***@freebsd.org <***@freebsd.org>
> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
> Betreff: Re: NFS Mount Hangs
>
> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
>
>
> ***@freebsd.org wrote:
> >> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
> >>
> >> I hope you don't mind a top post...
> >> I've been testing network partitioning between the only Linux client
> >> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
> >> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
> >> applied to it.
> >>
> >> I'm not enough of a TCP guy to know if this is useful, but here's what
> >> I see...
> >>
> >> While partitioned:
> >> On the FreeBSD server end, the socket either goes to CLOSED during
> >> the network partition or stays ESTABLISHED.
> >If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
> >sent a FIN, but you never called close() on the socket.
> >If the socket stays in ESTABLISHED, there is no communication ongoing,
> >I guess, and therefore the server does not even detect that the peer
> >is not reachable.
> >> On the Linux end, the socket seems to remain ESTABLISHED for a
> >> little while, and then disappears.
> >So how does Linux detect the peer is not reachable?
> Well, here's what I see in a packet capture in the Linux client once
> I partition it (just unplug the net cable):
> - lots of retransmits of the same segment (with ACK) for 54sec
> - then only ARP queries
>
> Once I plug the net cable back in:
> - ARP works
> - one more retransmit of the same segment
> - receives RST from FreeBSD
> ** So, is this now a "new" TCP connection, despite
> using the same port#.
> --> It matters for NFS, since "new connection"
> implies "must retry all outstanding RPCs".
> - sends SYN
> - receives SYN, ACK from FreeBSD
> --> connection starts working again
> Always uses same port#.
>
> On the FreeBSD server end:
> - receives the last retransmit of the segment (with ACK)
> - sends RST
> - receives SYN
> - sends SYN, ACK
>
> I thought that there was no RST in the capture I looked at
> yesterday, so I'm not sure if FreeBSD always sends an RST,
> but the Linux client behaviour was the same. (Sent a SYN, etc).
> The socket disappears from the Linux "netstat -a" and I
> suspect that happens after about 54sec, but I am not sure
> about the timing.
>
> >>
> >> After unpartitioning:
> >> On the FreeBSD server end, you get another socket showing up at
> >> the same port#
> >> Active Internet connections (including servers)
> >> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
> >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
> >>
> >> The Linux client shows the same connection ESTABLISHED.
> But disappears from "netstat -a" for a while during the partitioning.
>
> >> (The mount sometimes reports an error. I haven't looked at packet
> >> traces to see if it retries RPCs or why the errors occur.)
> I have now done so, as above.
>
> >> --> However I never get hangs.
> >> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
> >> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
> >> mount starts working again.
> >>
> >> The most obvious thing is that the Linux client always keeps using
> >> the same port#. (The FreeBSD client will use a different port# when
> >> it does a TCP reconnect after no response from the NFS server for
> >> a little while.)
> >>
> >> What do those TCP conversant think?
> >I guess you are never calling close() on the socket for which
> >the connection state is CLOSED.
> Ok, that makes sense. For this case the Linux client has not done a
> BindConnectionToSession to re-assign the back channel.
> I'll have to bug them about this. However, I'll bet they'll answer
> that I have to tell them the back channel needs re-assignment
> or something like that.
>
> I am pretty certain they are broken, in that the client needs to
> retry all outstanding RPCs.
>
> For others, here's the long winded version of this that I just
> put on the phabricator review:
> In the server side kernel RPC, the socket (struct socket *) is in a
> structure called SVCXPRT (normally pointed to by "xprt").
> These structures are ref counted and the soclose() is done
> when the ref. cnt goes to zero. My understanding is that
> "struct socket *" is free'd by soclose() so this cannot be done
> before the xprt ref. cnt goes to zero.
>
> For NFSv4.1/4.2 there is something called a back channel
> which means that a "xprt" is used for server->client RPCs,
> although the TCP connection is established by the client
> to the server.
> --> This back channel holds a ref cnt on "xprt" until the
>
> client re-assigns it to a different TCP connection
> via an operation called BindConnectionToSession
> and the Linux client is not doing this soon enough,
> it appears.
>
> So, the soclose() is delayed, which is why I think the
> TCP connection gets stuck in CLOSE_WAIT and that is
> why I've added the soshutdown(..SHUT_WR) calls,
> which can happen before the client gets around to
> re-assigning the back channel.
>
> Thanks for your help with this Michael, rick
>
> Best regards
> Michael
> >
> > rick
> > ps: I can capture packets while doing this, if anyone has a use
> > for them.
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
> > Sent: Saturday, March 27, 2021 6:57 PM
> > To: Jason Breitman
> > Cc: Rick Macklem; freebsd-***@freebsd.org
> > Subject: Re: NFS Mount Hangs
> >
> > CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
> >
> >
> >
> >
> > On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
> >
> > The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
> > # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
> > # ifconfig lagg0
> > lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> > options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
> >
> > We can also say that the sysctl settings did not resolve this issue.
> >
> > # sysctl net.inet.tcp.fast_finwait2_recycle=1
> > net.inet.tcp.fast_finwait2_recycle: 0 -> 1
> >
> > # sysctl net.inet.tcp.finwait2_timeout=1000
> > net.inet.tcp.finwait2_timeout: 60000 -> 1000
> >
> > I don't think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD.
> > By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
> >
> > tcp_fin_timeout (integer; default: 60; since Linux 2.2)
> > This specifies how many seconds to wait for a final FIN
> > packet before the socket is forcibly closed. This is
> > strictly a violation of the TCP specification, but
> > required to prevent denial-of-service attacks. In Linux
> > 2.2, the default value was 180.
> >
> > So I don't get why it gets stuck in the FIN_WAIT2 state anyway.
> >
> > You really need to have a packet capture during the outage (client and server side) so you'll get over the wire chat and start speculating from there.
> > No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
> >
> > * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
> >
> > The issue occurred after 5 days following a reboot of the client machines.
> > I ran the capture information again to make use of the situation.
> >
> > #!/bin/sh
> >
> > while true
> > do
> > /bin/date >> /tmp/nfs-hang.log
> > /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
> > /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
> > /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
> > /bin/sleep 60
> > done
> >
> >
> > On the NFS Server
> > Active Internet connections (including servers)
> > Proto Recv-Q Send-Q Local Address Foreign Address (state)
> > tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
> >
> > On the NFS Client
> > tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
> >
> >
> >
> > You had also asked for the output below.
> >
> > # nfsstat -E -s
> > BackChannelCtBindConnToSes
> > 0 0
> >
> > # sysctl vfs.nfsd.request_space_throttle_count
> > vfs.nfsd.request_space_throttle_count: 0
> >
> > I see that you are testing a patch and I look forward to seeing the results.
> >
> >
> > Jason Breitman
> >
> >
> > On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
> >
> > Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
> >> Hi Jason,
> >>
> >>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
> >>>
> >>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
> >>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
> >>>
> >>> Issue
> >>> NFSv4 mounts periodically hang on the NFS Client.
> >>>
> >>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
> >>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
> >>> Rebooting the NFS Client appears to be the only solution.
> >>
> >> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their >own nfsd)
> > Yes, my understanding is that Isilon uses a proprietary user space nfsd and
> > not the kernel based RPC and nfsd in FreeBSD.
> >
> >> We've had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows:
> >>
> >> - Data flows correctly between SERVER and the CLIENT
> >> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
> >> - The client (eager to send data) can only ack data sent by SERVER.
> >> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so it can flush its buffers.
> >> - SERVER responds with a TCP Zero Window to those probes.
> > Having the window size drop to zero is not necessarily incorrect.
> > If the server is overloaded (has a backlog of NFS requests), it can stop doing
> > soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
> > closes). This results in "backpressure" to stop the NFS client from flooding the
> > NFS server with requests.
> > --> However, once the backlog is handled, the nfsd should start to soreceive()
> > again and this should cause the window to open back up.
> > --> Maybe this is broken in the socket/TCP code. I quickly got lost in
> > tcp_output() when it decides what to do about the rcvwin.
> >
> >> - After 6 minutes (the NFS server default Idle timeout) SERVER gracefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0)
> > This probably does not happen for Jason's case, since the 6minute timeout
> > is disabled when the TCP connection is assigned as a backchannel (most likely
> > the case for NFSv4.1).
> >
> >> - CLIENT ACK that FIN.
> >> - SERVER goes in FIN_WAIT_2 state
> >> - CLIENT closes its half of the socket and goes into LAST_ACK state.
> >> - FIN is never sent by the client since there is still data in its SendQ and the receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush its buffers >and terminate its side of the socket.
> >> - SERVER keeps responding with a TCP Zero Window to those probes.
> >> => The last two steps go on and on for hours/days freezing the NFS mount bound >to that TCP session.
> >>
> >> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and the server wanting to send data, we'd end up in the same >state as you, I think.
> >>
> >> We've never had the root cause of why the SERVER decided to close the TCP >Window and no longer accept data; the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled and at the next >occurrence of a CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handshake, etc. and traffic >flows again (NFS starts responding)
> >>
> >> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we've added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears)
> >>
> >> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck.
> > Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
> >
> > I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
> > (They're just waiting for RPC requests.)
> > However, I do now think I know why the soclose() does not happen.
> > When the TCP connection is assigned as a backchannel, that takes a reference
> > cnt on the structure. This refcnt won't be released until the connection is
> > replaced by a BindConnectionToSession operation from the client. But that won't
> > happen until the client creates a new TCP connection.
> > --> No refcnt release-->no refcnt of 0-->no soclose().
> >
> > I've created the attached patch (completely different from the previous one)
> > that adds soshutdown(SHUT_WR) calls in the three places where the TCP
> > connection is going away. This seems to get it past CLOSE_WAIT without a
> > soclose().
> > --> I know you are not comfortable with patching your server, but I do think
> > this change will get the socket shutdown to complete.
> >
> > There are a couple more things you can check on the server...
> > # nfsstat -E -s
> > --> Look for the count under "BindConnToSes".
> > --> If non-zero, backchannels have been assigned
> > # sysctl -a | fgrep request_space_throttle_count
> > --> If non-zero, the server has been overloaded at some point.
> >
> > I think the attached patch might work around the problem.
> > The code that should open up the receive window needs to be checked.
> > I am also looking at enabling the 6minute timeout when a backchannel is
> > assigned.
> >
> > rick
> >
> > Youssef
> >
> > _______________________________________________
> > freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
> > <xprtdied.patch>
> >
> > <nfs-hang.log.gz>
> >
> > _______________________________________________
> > freebsd-***@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
> > _______________________________________________
> > freebsd-***@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
> _______________________________________________
> freebsd-***@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"
>
>

--
Rod Grimes ***@freebsd.org
Rick Macklem
2021-04-04 20:47:16 UTC
Permalink
Rodney W. Grimes wrote:
>And I'll follow the lead to top post, as I have been quietly following
>this thread, trying to only add when I think I have relevant input, and
>I think I do on a small point...
>
>Rick,
> You're "unplugging" a cable to simulate network partitioning;
>in my experience this is a bad way to do that, as the host gets to
>see the link layer go down and knows it cannot send.
I unplug the server end and normally capture packets at the
Linux client end.

> I am actually
>puzzled that you see arp packets, but I guess those are getting
>picked off before the interface layer silently tosses them on
>the ground. IIRC due to this loss of link layer you may be
>masking some things that would occur in other situations as often
>an error is returned to the application layer. I.e., the ONLY packet
>you're likely to see into an unplugged cable is "arp".
The FreeBSD server end, where I unplug, does not seem to notice
at the link level (thanks to the intel net driver).
I do not even see a "loss of carrier" type message when I do it.

>I can suggest other means to partition, such as configuring a switch
>port in and out of the correct LAN/VLAN, a physical switch in the TX
>pair to open it, but leave RX pair intact so carrier is not lost.
My switch is just the nat gateway the phone company provides.
I can log into it with a web browser, but have only done so once
in 2.5years since I got it.
It is also doing very important stuff during the testing, like streaming
the Mandalorian.

>Both of these simulate partitioning that is more realistic, AND does
>not have the side effect of allowing upper layers to eat the packets
>before bpf can grab them, or be told that partitioning has occurred.
Well, if others feel that something like the above will be useful, I might try.

>Another side effect of unplugging a cable is that a host should
>immediately invalidate all ARP entries on that interface... hence
>you're getting into an "arp who-has" situation that should not even
>start for 5 minutes in the other failure modes.
The Linux client keeps spitting out arp queries, so that gets fixed
almost instantly when the cable gets plugged back in.

In general, the kernel RPC sees very little of what is going on.
Sometimes an EPIPE when it tries to use a socket after it has
closed down.
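
(Purely illustrative sketch, not the actual kernel RPC code; xprt_mark_dead() is a made-up helper. It shows roughly how such an EPIPE surfaces to a kernel sender once the peer side of the socket is gone:)

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/proc.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

static void xprt_mark_dead(struct socket *so);	/* hypothetical helper */

static int
send_rpc_record(struct socket *so, struct mbuf *record)
{
	int error;

	/* sosend() consumes 'record' whether or not it succeeds. */
	error = sosend(so, NULL, NULL, record, NULL, 0, curthread);
	if (error == EPIPE)
		xprt_mark_dead(so);	/* let the transport reconnect/close */
	return (error);
}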

rick

Regards,
Rod

> Well, I'm going to cheat and top post, since this is related info. and
> not really part of the discussion...
>
> I've been testing network partitioning between a Linux client (5.2 kernel)
> and a FreeBSD-current NFS server. I have not gotten a solid hang, but
> I have had the Linux client doing "battle" with the FreeBSD server for
> several minutes after un-partitioning the connection.
>
> The battle basically consists of the Linux client sending an RST, followed
> by a SYN.
> The FreeBSD server ignores the RST and just replies with the same old ack.
> --> This varies from "just a SYN" that succeeds to 100+ cycles of the above
> over several minutes.
>
> I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems
> pretty good at ignoring it.
>
> A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap
> in case anyone wants to look at it.
>
> Here's a tcpdump snippet of the interesting part (see the *** comments):
> 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0
> 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0
> *** Network is now partitioned...
>
> 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> *** Lots of lines snipped.
>
>
> 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> *** Network is now unpartitioned...
>
> 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0
> 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0
> 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0
> 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0
> *** This "battle" goes on for 223sec...
> I snipped out 13 cycles of this "Linux sends an RST, followed by SYN"
> "FreeBSD replies with same old ACK". In another test run I saw this
> cycle continue non-stop for several minutes. This time, the Linux
> client paused for a while (see ARPs below).
>
> 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0
> 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0
> 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0
> *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST.
>
> 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28
> 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46
> 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46
> 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0
> 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0
> 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0
> *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last
> of 13 (100+ for another test run).
>
> 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0
> 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0
> *** Now back in business...
>
> 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0
> 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53
> 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53
> 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0
> 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53
> 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063
>
> This error 10063 after the partition heals is also "bad news". It indicates the Session
> (which is supposed to maintain "exactly once" RPC semantics) is broken. I'll admit I
> suspect a Linux client bug, but will be investigating further.
>
> So, hopefully TCP conversant folk can confirm if the above is correct behaviour
> or if the RST should be ack'd sooner?
>
> I could also see this becoming a "forever" TCP battle for other versions of Linux client.
>
> rick
>
>
> ________________________________________
> From: Scheffenegger, Richard <***@netapp.com>
> Sent: Sunday, April 4, 2021 7:50 AM
> To: Rick Macklem; ***@freebsd.org
> Cc: Youssef GHORBAL; freebsd-***@freebsd.org
> Subject: Re: NFS Mount Hangs
>
> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
>
>
> For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions hang forever.
>
> One is a missed update when the client is not using the noresvport mount option, which makes the firewall think RSTs are illegal (and drop them);
>
> The pfifo-fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and at lower CPU cost than pfq, the default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever...
>
> I can try getting the relevant bug info next week...
>
> ________________________________
> From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Rick Macklem <***@uoguelph.ca>
> Sent: Friday, April 2, 2021 11:31:01 PM
> To: ***@freebsd.org <***@freebsd.org>
> Cc: Youssef GHORBAL <***@pasteur.fr>; freebsd-***@freebsd.org <freebsd-***@freebsd.org>
> Subject: Re: NFS Mount Hangs
>
> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
>
>
> ***@freebsd.org wrote:
> >> On 2. Apr 2021, at 02:07, Rick Macklem <***@uoguelph.ca> wrote:
> >>
> >> I hope you don't mind a top post...
> >> I've been testing network partitioning between the only Linux client
> >> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
> >> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
> >> applied to it.
> >>
> >> I'm not enough of a TCP guy to know if this is useful, but here's what
> >> I see...
> >>
> >> While partitioned:
> >> On the FreeBSD server end, the socket either goes to CLOSED during
> >> the network partition or stays ESTABLISHED.
> >If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
> >sent a FIN, but you never called close() on the socket.
> >If the socket stays in ESTABLISHED, there is no communication ongoing,
> >I guess, and therefore the server does not even detect that the peer
> >is not reachable.
> >> On the Linux end, the socket seems to remain ESTABLISHED for a
> >> little while, and then disappears.
> >So how does Linux detect the peer is not reachable?
> Well, here's what I see in a packet capture in the Linux client once
> I partition it (just unplug the net cable):
> - lots of retransmits of the same segment (with ACK) for 54sec
> - then only ARP queries
>
> Once I plug the net cable back in:
> - ARP works
> - one more retransmit of the same segment
> - receives RST from FreeBSD
> ** So, is this now a "new" TCP connection, despite
> using the same port#.
> --> It matters for NFS, since "new connection"
> implies "must retry all outstanding RPCs".
> - sends SYN
> - receives SYN, ACK from FreeBSD
> --> connection starts working again
> Always uses same port#.
>
> On the FreeBSD server end:
> - receives the last retransmit of the segment (with ACK)
> - sends RST
> - receives SYN
> - sends SYN, ACK
>
> I thought that there was no RST in the capture I looked at
> yesterday, so I'm not sure if FreeBSD always sends an RST,
> but the Linux client behaviour was the same. (Sent a SYN, etc).
> The socket disappears from the Linux "netstat -a" and I
> suspect that happens after about 54sec, but I am not sure
> about the timing.
>
> >>
> >> After unpartitioning:
> >> On the FreeBSD server end, you get another socket showing up at
> >> the same port#
> >> Active Internet connections (including servers)
> >> Proto Recv-Q Send-Q Local Address Foreign Address (state)
> >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED
> >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED
> >>
> >> The Linux client shows the same connection ESTABLISHED.
> But disappears from "netstat -a" for a while during the partitioning.
>
> >> (The mount sometimes reports an error. I haven't looked at packet
> >> traces to see if it retries RPCs or why the errors occur.)
> I have now done so, as above.
>
> >> --> However I never get hangs.
> >> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
> >> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
> >> mount starts working again.
> >>
> >> The most obvious thing is that the Linux client always keeps using
> >> the same port#. (The FreeBSD client will use a different port# when
> >> it does a TCP reconnect after no response from the NFS server for
> >> a little while.)
> >>
> >> What do those TCP conversant think?
> >I guess you are never calling close() on the socket for which
> >the connection state is CLOSED.
> Ok, that makes sense. For this case the Linux client has not done a
> BindConnectionToSession to re-assign the back channel.
> I'll have to bug them about this. However, I'll bet they'll answer
> that I have to tell them the back channel needs re-assignment
> or something like that.
>
> I am pretty certain they are broken, in that the client needs to
> retry all outstanding RPCs.
>
> For others, here's the long winded version of this that I just
> put on the phabricator review:
> In the server side kernel RPC, the socket (struct socket *) is in a
> structure called SVCXPRT (normally pointed to by "xprt").
> > These structures are ref counted and the soclose() is done
> when the ref. cnt goes to zero. My understanding is that
> "struct socket *" is free'd by soclose() so this cannot be done
> before the xprt ref. cnt goes to zero.
>
> For NFSv4.1/4.2 there is something called a back channel
> which means that a "xprt" is used for server->client RPCs,
> although the TCP connection is established by the client
> to the server.
> > --> This back channel holds a ref cnt on "xprt" until the
> >       client re-assigns it to a different TCP connection
> >       via an operation called BindConnectionToSession,
> >       and the Linux client is not doing this soon enough,
> >       it appears.
>
> So, the soclose() is delayed, which is why I think the
> TCP connection gets stuck in CLOSE_WAIT and that is
> why I've added the soshutdown(..SHUT_WR) calls,
> which can happen before the client gets around to
> re-assigning the back channel.
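
(A minimal C sketch of the ordering described above. This is NOT the real sys/rpc code; the struct and function names are invented for illustration. The point is that soclose() must wait for the last reference, e.g. the backchannel's, to go away, while soshutdown(..SHUT_WR) can be issued as soon as the connection is known to be dead, so the TCP close handshake is not stuck behind the refcount:)

#include <sys/param.h>
#include <sys/refcount.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

struct my_xprt {			/* stand-in for SVCXPRT */
	struct socket	*xp_socket;
	u_int		 xp_refs;	/* the backchannel holds one of these */
};

/* Called at the places where the TCP connection is going away. */
static void
my_xprt_conn_gone(struct my_xprt *xprt)
{
	/* The patch adds this: send our FIN now, don't wait for soclose(). */
	(void)soshutdown(xprt->xp_socket, SHUT_WR);
}

/* Called whenever a holder (including the backchannel) drops its reference. */
static void
my_xprt_release(struct my_xprt *xprt)
{
	if (refcount_release(&xprt->xp_refs))
		(void)soclose(xprt->xp_socket);	/* also frees the socket */
}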
>
> Thanks for your help with this Michael, rick
>
> Best regards
> Michael
> >
> > rick
> > ps: I can capture packets while doing this, if anyone has a use
> > for them.
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Youssef GHORBAL <***@pasteur.fr>
> > Sent: Saturday, March 27, 2021 6:57 PM
> > To: Jason Breitman
> > Cc: Rick Macklem; freebsd-***@freebsd.org
> > Subject: Re: NFS Mount Hangs
> >
> > CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca
> >
> >
> >
> >
> > On 27 Mar 2021, at 13:20, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
> >
> > The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue.
> > # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso
> > # ifconfig lagg0
> > lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
> > options=8100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER>
> >
> > We can also say that the sysctl settings did not resolve this issue.
> >
> > # sysctl net.inet.tcp.fast_finwait2_recycle=1
> > net.inet.tcp.fast_finwait2_recycle: 0 -> 1
> >
> > # sysctl net.inet.tcp.finwait2_timeout=1000
> > net.inet.tcp.finwait2_timeout: 60000 -> 1000
> >
> > I don't think those will do anything in your case since the FIN_WAIT2 sockets are on the client side and those sysctls are for FreeBSD.
> > By the way, it seems that Linux automatically recycles TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout)
> >
> > tcp_fin_timeout (integer; default: 60; since Linux 2.2)
> > This specifies how many seconds to wait for a final FIN
> > packet before the socket is forcibly closed. This is
> > strictly a violation of the TCP specification, but
> > required to prevent denial-of-service attacks. In Linux
> > 2.2, the default value was 180.
> >
> > So I don't get why it gets stuck in the FIN_WAIT2 state anyway.
> >
> > You really need to have a packet capture during the outage (client and server side) so you'll get the over-the-wire chat and can start speculating from there.
> > No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck.
> >
> > * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings.
> >
> > The issue occurred after 5 days following a reboot of the client machines.
> > I ran the capture information again to make use of the situation.
> >
> > #!/bin/sh
> >
> > while true
> > do
> > /bin/date >> /tmp/nfs-hang.log
> > /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log
> > /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log
> > /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log
> > /bin/sleep 60
> > done
> >
> >
> > On the NFS Server
> > Active Internet connections (including servers)
> > Proto Recv-Q Send-Q Local Address Foreign Address (state)
> > tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT
> >
> > On the NFS Client
> > tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2
> >
> >
> >
> > You had also asked for the output below.
> >
> > # nfsstat -E -s
> > BackChannelCt BindConnToSes
> > 0 0
> >
> > # sysctl vfs.nfsd.request_space_throttle_count
> > vfs.nfsd.request_space_throttle_count: 0
> >
> > I see that you are testing a patch and I look forward to seeing the results.
> >
> >
> > Jason Breitman
> >
> >
> > On Mar 21, 2021, at 6:21 PM, Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
> >
> > Youssef GHORBAL <***@pasteur.fr<mailto:***@pasteur.fr>> wrote:
> >> Hi Jason,
> >>
> >>> On 17 Mar 2021, at 18:17, Jason Breitman <***@tildenparkcapital.com<mailto:***@tildenparkcapital.com>> wrote:
> >>>
> >>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
> >>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
> >>>
> >>> Issue
> >>> NFSv4 mounts periodically hang on the NFS Client.
> >>>
> >>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
> >>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
> >>> Rebooting the NFS Client appears to be the only solution.
> >>
> >> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd)
> > Yes, my understanding is that Isilon uses a proprietary user space nfsd and
> > not the kernel based RPC and nfsd in FreeBSD.
> >
> >> We've had better luck and we did manage to have packet captures on both sides during the issue. The gist of it goes as follows:
> >>
> >> - Data flows correctly between SERVER and the CLIENT
> >> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0
> >> - The client (eager to send data) can only ack data sent by SERVER.
> >> - When SERVER was done sending data, the client starts sending TCP Window Probes hoping that the TCP Window opens again so it can flush its buffers.
> >> - SERVER responds with a TCP Zero Window to those probes.
> > Having the window size drop to zero is not necessarily incorrect.
> > If the server is overloaded (has a backlog of NFS requests), it can stop doing
> > soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window
> > closes). This results in "backpressure" to stop the NFS client from flooding the
> > NFS server with requests.
> > --> However, once the backlog is handled, the nfsd should start to soreceive()
> > again and this should cause the window to open back up.
> > --> Maybe this is broken in the socket/TCP code. I quickly got lost in
> > tcp_output() when it decides what to do about the rcvwin.
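
(A rough sketch of the mechanism, not the real svc_vc receive code: each soreceive() drains the socket receive buffer, which is what lets TCP advertise a re-opened receive window; if the server stops calling soreceive() because of a backlog, the window stays closed. The 1000000 byte count is just an arbitrary "take whatever is there" value:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/uio.h>

static int
drain_some_data(struct socket *so, struct mbuf **mp)
{
	struct uio auio;
	int error, flags = MSG_DONTWAIT;

	bzero(&auio, sizeof(auio));
	auio.uio_resid = 1000000;	/* drain up to this much */
	*mp = NULL;

	/* Removing data from so_rcv is what re-opens the TCP window. */
	error = soreceive(so, NULL, &auio, mp, NULL, &flags);
	if (error == 0 && *mp == NULL)
		error = ECONNRESET;	/* 0 bytes: peer closed, time to shut down */
	return (error);
}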
> >
> >> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (and still a TCP Window 0)
> > This probably does not happen for Jason's case, since the 6minute timeout
> > is disabled when the TCP connection is assigned as a backchannel (most likely
> > the case for NFSv4.1).
> >
> >> - CLIENT ACKs that FIN.
> >> - SERVER goes into FIN_WAIT_2 state
> >> - CLIENT closes its half of the socket and goes into LAST_ACK state.
> >> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
> >> - SERVER keeps responding with a TCP Zero Window to those probes.
> >> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
> >>
> >> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanting to send data, we'll end up in the same state as you, I think.
> >>
> >> We've never had the root cause of why the SERVER decided to close the TCP Window and no longer accept data; the fix on the Isilon part was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe, SERVER sends an RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding)
> >>
> >> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we've added a check script on the client that detects LAST_ACK sockets on the client and through an iptables rule enforces a TCP RST, something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears)
> >>
> >> The bottom line would be to have a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
> > Interesting story and good work w.r.t. sleuthing, Youssef, thanks.
> >
> > I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads.
> > (They're just waiting for RPC requests.)
> > However, I do now think I know why the soclose() does not happen.
> > When the TCP connection is assigned as a backchannel, that takes a reference
> > cnt on the structure. This refcnt won't be released until the connection is
> > replaced by a BindConnectionToSession operation from the client. But that won't
> > happen until the client creates a new TCP connection.
> > --> No refcnt release-->no refcnt of 0-->no soclose().
> >
> > I've created the attached patch (completely different from the previous one)
> > that adds soshutdown(SHUT_WR) calls in the three places where the TCP
> > connection is going away. This seems to get it past CLOSE_WAIT without a
> > soclose().
> > --> I know you are not comfortable with patching your server, but I do think
> > this change will get the socket shutdown to complete.
> >
> > There are a couple more things you can check on the server...
> > # nfsstat -E -s
> > --> Look for the count under "BindConnToSes".
> > --> If non-zero, backchannels have been assigned
> > # sysctl -a | fgrep request_space_throttle_count
> > --> If non-zero, the server has been overloaded at some point.
> >
> > I think the attached patch might work around the problem.
> > The code that should open up the receive window needs to be checked.
> > I am also looking at enabling the 6minute timeout when a backchannel is
> > assigned.
> >
> > rick
> >
> > Youssef
> >
> > _______________________________________________
> > freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org> mailing list
> > https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$
> > To unsubscribe, send any mail to "freebsd-net-***@freebsd.org<mailto:freebsd-net-***@freebsd.org>"
> > <xprtdied.patch>
> >
> > <nfs-hang.log.gz>
> >
>
>

--
Rod Grimes ***@freebsd.org
_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-***@freebsd.org"