Discussion: panic with tcp timers
Gleb Smirnoff
2016-06-17 04:53:19 UTC
Permalink
Hi!

At Netflix we are observing a race in TCP timers with head.
The problem is a regression that doesn't happen on stable/10.
The panic usually happens after several hours at 55 Gbit/s of
traffic.

What happens is that tcp_timer_keep finds t_tcpcb to be
NULL. Some coredumps have the tcpcb already initialized,
with a non-NULL t_tcpcb and in the TCPS_ESTABLISHED state, which
means that another CPU was working on the tcpcb while
the faulted one was handling the panic. So this all looks
like a use-after-free that races with a new allocation.

Comparing stable/10 and head, I see two changes that could
affect that:

- callout_async_drain
- switch to READ lock for inp info in tcp timers

That's why you are in To, Julien and Hans :)

We continue investigating, and I will keep you updated.
However, any help is welcome. I can share cores.
--
Totus tuus, Glebius.
Hans Petter Selasky
2016-06-17 11:10:57 UTC
Permalink
Post by Gleb Smirnoff
Hi!
At Netflix we are observing a race in TCP timers with head.
The problem is a regression, that doesn't happen on stable/10.
The panic usually happens after several hours at 55 Gbit/s of
traffic.
What happens is that tcp_timer_keep finds t_tcpcb being
NULL. Some coredumps have tcpcb already initialized,
with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which
means that other CPU was working on the tcpcb while
the faulted one was working on the panic. So, this all looks
like a use after free, which conflicts with new allocation.
Comparing stable/10 and head, I see two changes that could
- callout_async_drain
- switch to READ lock for inp info in tcp timers
That's why you are in To, Julien and Hans :)
We continue investigating, and I will keep you updated.
However, any help is welcome. I can share cores.
Hi,

I do have projects/hps_head around, which is not that much behind
11-current and has a completely different callout implementation. If
you can reproduce the issue separately, it might be worth a try to rule
out the callout stack.

--HPS
Bjoern A. Zeeb
2016-06-17 14:41:02 UTC
Permalink
Post by Gleb Smirnoff
Hi!
At Netflix we are observing a race in TCP timers with head.
The problem is a regression, that doesn't happen on stable/10.
The panic usually happens after several hours at 55 Gbit/s of
traffic.
What happens is that tcp_timer_keep finds t_tcpcb being
NULL. Some coredumps have tcpcb already initialized,
with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which
means that other CPU was working on the tcpcb while
the faulted one was working on the panic. So, this all looks
like a use after free, which conflicts with new allocation.
Comparing stable/10 and head, I see two changes that could
- callout_async_drain
- switch to READ lock for inp info in tcp timers
That's why you are in To, Julien and Hans :)
We continue investigating, and I will keep you updated.
However, any help is welcome. I can share cores.
There’s also the change to no longer mark the zones NO_FREE.
At the time I was convinced that, in theory, it should not be an issue
anymore.

If I overlooked something, or follow-up timer changes invalidated those
assumptions, then that could also be trouble.

That said, I was not able to get any related panics or log entries
anymore lately (but I am currently slightly behind head with my branch).

We should get the problem fixed however and not try to “paint over”
again.

/bz
Julien Charbon
2016-06-17 09:27:39 UTC
Permalink
Hi Gleb,
Post by Gleb Smirnoff
At Netflix we are observing a race in TCP timers with head.
The problem is a regression, that doesn't happen on stable/10.
The panic usually happens after several hours at 55 Gbit/s of
traffic.
What happens is that tcp_timer_keep finds t_tcpcb being
NULL. Some coredumps have tcpcb already initialized,
with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which
means that other CPU was working on the tcpcb while
the faulted one was working on the panic. So, this all looks
like a use after free, which conflicts with new allocation.
Comparing stable/10 and head, I see two changes that could
- callout_async_drain
- switch to READ lock for inp info in tcp timers
That's why you are in To, Julien and Hans :)
We continue investigating, and I will keep you updated.
However, any help is welcome. I can share cores.
Thanks for sharing. Let me run our TCP tests on a recent version of
HEAD to see if by chance I can reproduce it. If I am not able to
reproduce it I will ask for a debug kernel and cores and see if I can help.

A few notes here:

- Around 2 months ago I did test HEAD with callout_async_drain() in
TCP timers with our TCP QA test suite and saw no kernel panic. That said,
I did not let our tests run for several hours.

- At Verisign we run 10 with the "READ lock for inp info in tcp timers"
change. Again, that does not mean this change has no impact here.

My 2 cents.

--
Julien
Gleb Smirnoff
2016-06-20 07:39:17 UTC
Permalink
Hi!

On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
J> > Comparing stable/10 and head, I see two changes that could
J> > affect that:
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.

Now, having spent some time with cores and added a bunch of
extra CTRs, I have the sequence of events that leads to the
panic. In short, the bug is in the callout system. It does not
seem to be related to callout_async_drain, at least for
now. The transition to the READ lock unmasked the problem; that's
why NetflixBSD 10 doesn't panic.

The panic requires heavy contention on the TCP info lock.

[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic

When the lock was a WLOCK, all contenders were resumed in the
order in which they came to the lock. Now that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to run before the old
running callout, and this unmasks the panic.
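
To make the window concrete, here is a rough sketch of the shape of the
timer entry on head (simplified, not the literal tcp_timer.c code; the
point is only that the tcpcb argument is dereferenced after blocking on
the pcbinfo read lock):

	/*
	 * Simplified sketch of tcp_timer_keep() on head, not the literal code.
	 * If another CPU cancels the callout and frees the tcpcb while this
	 * CPU sits in INP_INFO_RLOCK(), everything below runs on freed (and
	 * possibly reused) memory.
	 */
	static void
	tcp_timer_keep(void *xtp)
	{
		struct tcpcb *tp = xtp;
		struct inpcb *inp;

		INP_INFO_RLOCK(&V_tcbinfo);	/* <- CPU 1 blocks here under contention */
		inp = tp->t_inpcb;		/* <- tp may already have been freed by CPU 2 */
		INP_WLOCK(inp);
		if (callout_pending(&tp->t_timers->tt_keep) ||
		    !callout_active(&tp->t_timers->tt_keep)) {
			INP_WUNLOCK(inp);
			INP_INFO_RUNLOCK(&V_tcbinfo);
			return;
		}
		/* ... keepalive work ... */
		INP_WUNLOCK(inp);
		INP_INFO_RUNLOCK(&V_tcbinfo);
	}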

Exact quote from ktrdump:

0xfffff82089867418 <-- faulty tcpcb
0xfffff82089867728 <-- its tt_keep

***@piston:~/cores:|>grep 0xfffff82089867418 ktr
19012192 11 scheduled 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418 in 20553.1e3ff819
19012190 11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.04a6193d
19009551 11 rescheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20613.042306cf
19009549 11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20563.0423409f
19009545 11 tcp_newtcpcb: 0xfffff82089867418
18962842 11 tcp_discardcb: free 0xfffff82089867418
18962830 11 tcp_discardcb: 0xfffff82089867418 draincnt 0
18962826 11 failed to stop 0xfffff820898677a8 func 0xffffffff80628740 arg 0xfffff82089867418
18962822 11 cancelled3 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418
18962808 11 cancelled3 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418
18962804 11 failed to stop 0xfffff820898676e8 func 0xffffffff80628fa0 arg 0xfffff82089867418
18962792 11 failed to stop 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418
18962755 11 tcp_discardcb: 0xfffff82089867418
18962742 11 scheduled 0xfffff82089867768 func 0xffffffff806288e0 arg 0xfffff82089867418 in 20632.ffc8d308
18962703 11 cancelled3 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418
18962695 11 scheduled 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418 in 20612.ffc8ea28
18923275 6 tcp_timer_keep: 0xfffff82089867418
18923274 6 callout 0xfffff82089867728 func 0xffffffff80628bb0 arg 0xfffff82089867418
17850361 9 scheduled 0xfffff820898676a8 func 0xffffffff806291e0 arg 0xfffff82089867418 in 20553.5aec0004
--
Totus tuus, Glebius.
Julien Charbon
2016-06-20 09:55:55 UTC
Permalink
Hi,
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).

Fix proposal: if callout_async_drain() returned 0 (fail) here, instead
of 1 (success), when the callout cancellation is a success _but_ the
callout is currently running, that should fix it.
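
Concretely, the teardown side would then rely on a contract roughly like
this (a sketch only, not the actual tcp_discardcb() code; I am assuming
the tcp_timer_discard()-style drain callback here):

	/*
	 * Sketch of the intended caller contract: with the proposed
	 * semantics, a return of 0 also covers "cancelled, but the callback
	 * is already executing", so the tcpcb must not be freed yet and the
	 * drain callback keeps ownership of the final release.
	 */
	if (callout_async_drain(&tp->t_timers->tt_keep, tcp_timer_discard) == 0) {
		/* Pending or still running: tcp_timer_discard() finishes up. */
		return;
	}
	/* Fully stopped and not running: safe to keep tearing down the tcpcb. */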

For the history: it comes back to my old callout question:

Is _callout_stop_safe() allowed to return 1 (success) even if the
callout is still currently running? In other words, successfully
cancelling a callout does not mean that the callout is not currently running.

We did propose a patch to make _callout_stop_safe() return 0 (fail)
when the callout is currently running:

callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable
https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most

But this change impacted too many old code paths and was interesting
only for TCP timers and thus was abandoned.

My 2 cents.

--
Julien
Gleb Smirnoff
2016-06-20 09:58:22 UTC
Permalink
On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote:
J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
J> > J> > Comparing stable/10 and head, I see two changes that could
J> > J> > affect that:
J> > J> >
J> > J> > - callout_async_drain
J> > J> > - switch to READ lock for inp info in tcp timers
J> > J> >
J> > J> > That's why you are in To, Julien and Hans :)
J> > J> >
J> > J> > We continue investigating, and I will keep you updated.
J> > J> > However, any help is welcome. I can share cores.
J> >
J> > Now, spending some time with cores and adding a bunch of
J> > extra CTRs, I have a sequence of events that lead to the
J> > panic. In short, the bug is in the callout system. It seems
J> > to be not relevant to the callout_async_drain, at least for
J> > now. The transition to READ lock unmasked the problem, that's
J> > why NetflixBSD 10 doesn't panic.
J> >
J> > The panic requires heavy contention on the TCP info lock.
J> >
J> > [CPU 1] the callout fires, tcp_timer_keep entered
J> > [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
J> > [CPU 2] schedules the callout
J> > [CPU 2] tcp_discardcb called
J> > [CPU 2] callout successfully canceled
J> > [CPU 2] tcpcb freed
J> > [CPU 1] unblocks... panic
J> >
J> > When the lock was WLOCK, all contenders were resumed in a
J> > sequence they came to the lock. Now, that they are readers,
J> > once the lock is released, readers are resumed in a "random"
J> > order, and this allows tcp_discardcb to go before the old
J> > running callout, and this unmasks the panic.
J>
J> Highly interesting. I should be able to reproduce that (will be useful
J> for testing the corresponding fix).
J>
J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1
J> (success) here) when the callout cancellation is a success _but_ the
J> callout is current running, that should fix it.
J>
J> For the history: It comes back to my old callout question:
J>
J> Does _callout_stop_safe() is allowed to return 1 (success) even if the
J> callout is still currently running; a.k.a. it is not because you
J> successfully cancelled a callout that the callout is not currently running.
J>
J> We did propose a patch to make _callout_stop_safe() returns 0 (fail)
J> when the callout is currently running:
J>
J> callout_stop() should return 0 when the callout is currently being
J> serviced and indeed unstoppable
J> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
J>
J> But this change impacted too many old code paths and was interesting
J> only for TCP timers and thus was abandoned.

The fix I am working on now is doing exactly that. callout_reset must
return 0 if the callout is currently running.

What are the old paths impacted?
--
Totus tuus, Glebius.
Hans Petter Selasky
2016-06-20 10:14:18 UTC
Permalink
Post by Gleb Smirnoff
The fix I am working on now is doing exactly that. callout_reset must
return 0 if the callout is currently running.
What are the old paths impacted?
Hi,

I'll dig into the matter as well and give some comments. Thanks for the
analysis, Gleb.

FYI: This class of problems wouldn't exist if the callout could be
associated with a mutex!

--HPS
Gleb Smirnoff
2016-06-20 10:30:15 UTC
Permalink
On Mon, Jun 20, 2016 at 12:14:18PM +0200, Hans Petter Selasky wrote:
H> On 06/20/16 11:58, Gleb Smirnoff wrote:
H> > The fix I am working on now is doing exactly that. callout_reset must
H> > return 0 if the callout is currently running.
H> >
H> > What are the old paths impacted?
H>
H> Hi,
H>
H> I'll dig into the matter aswell and give some comments. Thanks for the
H> analysis, Gleb.
H>
H> FYI: This class of problems wouldn't exist if the callout could be
H> associated with a mutex!

Exactly! I am convinced that all callouts should be locked, and the
non-locked ones should simply go away, as well as async drain.

What prevents us from converting TCP timeouts to locked callouts? To my
understanding it is the lock order of taking the pcbinfo lock after the pcb lock.

I'm now trying to understand Julien's conversion from pcbinfo lock
to pcbinfo + pcblist locks. It seems to me that we actually can convert
TCP timers to locked callouts.
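
For reference, a locked TCP timer would look roughly like this (a sketch
only, assuming the inpcb rwlock inp_lock can serve as the callout's
associated lock; whether the handler can live without the pcbinfo lock is
exactly the question below):

	/* At timer setup time (e.g. where the timers are initialized): */
	callout_init_rw(&tp->t_timers->tt_keep, &inp->inp_lock, 0);

	/* The handler then runs with the inpcb lock already held, and a
	 * cancellation done under that lock is final: */
	static void
	tcp_timer_keep_locked(void *xtp)	/* hypothetical locked variant */
	{
		struct tcpcb *tp = xtp;

		INP_WLOCK_ASSERT(tp->t_inpcb);	/* taken by softclock for us */
		/* ... keepalive work, no pending/active dance needed ... */
	}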

Why do we need the pcbinfo read lock in a tcp timer? The timer works
only on the pcb and doesn't modify global lists, does it?
--
Totus tuus, Glebius.
Hans Petter Selasky
2016-06-20 10:57:35 UTC
Permalink
Post by Gleb Smirnoff
Exactly! I am convinced that all callouts should be locked, and non-locked
one should simply go away, as well as async drain.
I agree with that, except you still need the async drain, because
it prevents freeing the lock protecting the callout, which may still
be in use after callout_stop().

--HPS
Hans Petter Selasky
2016-06-20 11:00:16 UTC
Permalink
Post by Gleb Smirnoff
What does prevent us from converting TCP timeouts to locked? To my
understanding it is the lock order of taking pcbinfo after pcb lock.
I started this work:

https://reviews.freebsd.org/D1563

--HPS
Adrian Chadd
2016-06-20 14:18:01 UTC
Permalink
There are implications for RSS in how the callout system currently works.

If you don't know the RSS bucket ID of a connection in advance, you'll
create callouts on the wrong CPUs and then they're not migrated.

The initial work here did convert things over, but didn't place the
callouts in the right CPU/RSS bucket and there wasn't a mechanism to
fix this. :(
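
The kind of thing needed is roughly this (a sketch; I am assuming an
rss_hash2cpuid()-style helper and a valid flowid/flowtype in the inpcb):

	/*
	 * Sketch: arm the timer on the CPU that owns the connection's RSS
	 * bucket instead of whatever CPU happens to run this code.
	 */
	int cpu = rss_hash2cpuid(inp->inp_flowid, inp->inp_flowtype);

	callout_reset_on(&tp->t_timers->tt_keep, TP_KEEPIDLE(tp),
	    tcp_timer_keep, tp, cpu);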

(But I'm also a firm believer in that. I'd also finally just like
the callout system to not be tied only to per-CPU queues, but to also
support per-RSS-bucket callout wheels, so we could assign a CPU mask to a
given callout wheel and then migrating things around is just "change the
cpu mask", not "change the callout cpu id.")


-adrian
Post by Hans Petter Selasky
Post by Gleb Smirnoff
What does prevent us from converting TCP timeouts to locked? To my
understanding it is the lock order of taking pcbinfo after pcb lock.
https://reviews.freebsd.org/D1563
--HPS
Julien Charbon
2016-06-20 16:48:27 UTC
Permalink
Hi,
Post by Gleb Smirnoff
H> > The fix I am working on now is doing exactly that. callout_reset must
H> > return 0 if the callout is currently running.
H> >
H> > What are the old paths impacted?
H>
H> I'll dig into the matter aswell and give some comments. Thanks for the
H> analysis, Gleb.
H>
H> FYI: This class of problems wouldn't exist if the callout could be
H> associated with a mutex!
Exactly! I am convinced that all callouts should be locked, and non-locked
one should simply go away, as well as async drain.
What does prevent us from converting TCP timeouts to locked? To my
understanding it is the lock order of taking pcbinfo after pcb lock.
I'm now trying to understand Julien's conversion from pcbinfo lock
to pcbinfo + pcblist locks. It seems to me that we actually can convert
TCP timers to locked callouts.
What for do we need pcbinfo read lock in a tcp timer? The timer works
only on the pcb and doesn't modify global lists, does it?
tcp_timer_keep(), for example, can modify the global pcb list; see the call
stack below:

tcp_timer_keep()
tcp_drop()
tcp_close()
sofree()
tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree())
tcp_detach()
in_pcbfree()
in_pcbremlists()


Anyway, a bit of history here:

o In stable/10 the TCP locking order is:

ipi_lock (before) inpcb lock

and in stable/10 ipi_lock protects the global pcb list. Thus,
only in the cases where you were absolutely sure you were _not_ going to
modify the global pcb list were you allowed to take the inpcb lock alone.
In all the other cases, you had to take the write lock on ipi_lock
first due to the lock order.

And in the context of a short-lived TCP connection, out of the 5 packets
received for a TCP connection, 4 require the ipi_lock write lock, and that
does not scale well.

Lesson learned: from a scaling perspective, in lock ordering always put
the most global lock last.


It was improved in a lean way in 11:

o In 11 the TCP lock order became:

ipi_lock (before) inpcb lock (before) ipi_list

And in 11 ipi_list protects the global pcb list, and only the code actually
modifying the global list takes a global write lock, e.g.:

https://github.com/freebsd/freebsd/blob/master/sys/netinet/in_pcb.c#L1285
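
In other words, the per-connection path and the list-modifying path split
roughly like this (a simplified sketch of the pattern, not the literal
in_pcb.c code):

	/*
	 * Sketch of the 11 split: per-connection work runs under the pcbinfo
	 * read lock plus the inpcb lock; only the short section that actually
	 * edits the global list takes the ipi_list write lock.
	 */
	INP_INFO_RLOCK(&V_tcbinfo);		/* readers of the pcb database */
	INP_WLOCK(inp);				/* this connection only */
	/* ... per-connection work ... */

	INP_LIST_WLOCK(&V_tcbinfo);		/* global list edit, in_pcbremlists()-style */
	LIST_REMOVE(inp, inp_list);
	INP_LIST_WUNLOCK(&V_tcbinfo);

	INP_WUNLOCK(inp);
	INP_INFO_RUNLOCK(&V_tcbinfo);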

Then why keep the ipi_lock at all now?

It is solely for one case: When you need the complete stability of the
full pcb list while traversing it. These full pcb list traversals are
done in functions like: in_pcbnotifyall(), in_pcbpurgeif0(),
inp_apply_all(), tcp_ccalgounload(), tcp_drain(), etc.

Thus in 11 the ipi_lock write lock is required only when you do this full
traversal with the list stability requirement, and the ipi_lock read lock is
used in all other cases, like tcp_input(), which then scales better.

Sadly, in 11 you cannot use the inpcb lock as-is for the TCP timer
callouts, because, like tcp_timer_keep(), most TCP timer callouts can
modify the global pcb list, and then you need the ipi_lock read lock on
top of the inpcb lock...


o Future/12:

The next natural step is to remove the ipi_lock from the picture to get:

inpcb lock (before) ipi_list

It /just/ requires a global pcb list that allows addition/deletion
while a full traversal is done concurrently. A classical way to achieve
that is to use an RCU-based list. As soon as RCU (or anything
equivalent, like lwref) is supported in the kernel, we will implement this
change.

Just history here, as I already did a presentation on this exact topic
at BSDCan 2015:
https://wiki.freebsd.org/201506DevSummit#line-83

It was recorded but I never saw the footage/presentation actually
published. :)

--
Julien
Hans Petter Selasky
2016-06-20 10:49:17 UTC
Permalink
Post by Gleb Smirnoff
J> callout_stop() should return 0 when the callout is currently being
J> serviced and indeed unstoppable
J> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
What are the old paths impacted?
Hi Gleb,

Digging through my e-mail archive, rephrasing myself and my comments about:

https://reviews.freebsd.org/D3078

There are two kinds of callouts: so-called MPSAFE callouts, which don't
have a mutex associated with them, and non-MPSAFE ones, which do.

When you associate a mutex with a callout, callout_stop() will
_always_ cancel the callback even if it is pending, and this should be
reflected by the return value. Your proposed changes will break that ???

The changes in D3078 didn't take "use_lock" into account, at least, and
so the return values for the non-MPSAFE case become incorrect.
Post by Gleb Smirnoff
Hi,
not_running = !(cc_exec_curr(cc, direct) == c && use_lock == 0);
Because if the callback process waits for the lock "c_lock" inside the callback, then the condition "cc_exec_curr(cc, direct) == c" above is satisfied too, and non-MPSAFE callouts are always cancelable, via cc_exec_cancel(cc, direct) = true;
--HPS
Julien Charbon
2016-06-20 17:45:53 UTC
Permalink
Hi,
Post by Gleb Smirnoff
J> > J> > Comparing stable/10 and head, I see two changes that could
J> > J> >
J> > J> > - callout_async_drain
J> > J> > - switch to READ lock for inp info in tcp timers
J> > J> >
J> > J> > That's why you are in To, Julien and Hans :)
J> > J> >
J> > J> > We continue investigating, and I will keep you updated.
J> > J> > However, any help is welcome. I can share cores.
J> >
J> > Now, spending some time with cores and adding a bunch of
J> > extra CTRs, I have a sequence of events that lead to the
J> > panic. In short, the bug is in the callout system. It seems
J> > to be not relevant to the callout_async_drain, at least for
J> > now. The transition to READ lock unmasked the problem, that's
J> > why NetflixBSD 10 doesn't panic.
J> >
J> > The panic requires heavy contention on the TCP info lock.
J> >
J> > [CPU 1] the callout fires, tcp_timer_keep entered
J> > [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
J> > [CPU 2] schedules the callout
J> > [CPU 2] tcp_discardcb called
J> > [CPU 2] callout successfully canceled
J> > [CPU 2] tcpcb freed
J> > [CPU 1] unblocks... panic
J> >
J> > When the lock was WLOCK, all contenders were resumed in a
J> > sequence they came to the lock. Now, that they are readers,
J> > once the lock is released, readers are resumed in a "random"
J> > order, and this allows tcp_discardcb to go before the old
J> > running callout, and this unmasks the panic.
J>
J> Highly interesting. I should be able to reproduce that (will be useful
J> for testing the corresponding fix).
J>
J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1
J> (success) here) when the callout cancellation is a success _but_ the
J> callout is current running, that should fix it.
J>
J>
J> Does _callout_stop_safe() is allowed to return 1 (success) even if the
J> callout is still currently running; a.k.a. it is not because you
J> successfully cancelled a callout that the callout is not currently running.
J>
J> We did propose a patch to make _callout_stop_safe() returns 0 (fail)
J>
J> callout_stop() should return 0 when the callout is currently being
J> serviced and indeed unstoppable
J> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
J>
J> But this change impacted too many old code paths and was interesting
J> only for TCP timers and thus was abandoned.
The fix I am working on now is doing exactly that. callout_reset must
return 0 if the callout is currently running.
What are the old paths impacted?
Actually, all the paths that check the callout_stop() return value AND
call both callout_reset() and callout_stop() AND use an MPSAFE callout.
And for each path, we would have to check whether our patch was OK (or not).

Thus, if you make the change only in the callout_async_drain() context, you
don't impact the "old" callout_stop() behavior and still get the desired
behavior for the TCP timers. Might be a good trade-off...

My 2 cents.

--
Julien
Randall Stewart via freebsd-net
2016-06-25 08:48:51 UTC
Permalink
So

All of our timers in TCP do something like
---------------------
INFO-LOCK
INP_WLOCK

if (inp needs to be dropped) {
	drop-it
}
do other work

UNLOCK-INP
UNLOCK-INFO
------------------

And generally the path “inp needs to be dropped” is rarely taken.

So why don’t we change the procedure to something like:

INP_WLOCK
if (inp needs to be dropped) {
	inp_ref()
	INP_WUNLOCK()
	INFO_LOCK()
	INP_WLOCK()
	if (inp_release_ref) {
		/* victory, it's dead */
		INFO_UNLOCK
		return
	}
	do-the-drop
}

This way we would only grab the INFO lock in those rarer cases
when we *do* actually want to kill the tcb, and at the same time
it would make the current callout system work correctly.. which, as
many have pointed out, would be much better if we could just let the
lock be taken by the callout system. Hmm, maybe with this scheme we
could even do that... A rough sketch of the locking dance is below.
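
(Sketch only, not a tested patch; the "inp_needs_to_be_dropped" test is the
same placeholder as above, I am using the existing in_pcbref() /
in_pcbrele_wlocked() refcount helpers and the usual macro names, and the
error handling is elided:)

	INP_WLOCK(inp);
	if (inp_needs_to_be_dropped) {
		in_pcbref(inp);			/* keep inp alive across the unlock */
		INP_WUNLOCK(inp);
		INP_INFO_RLOCK(&V_tcbinfo);	/* now safe: info before inp */
		INP_WLOCK(inp);
		if (in_pcbrele_wlocked(inp)) {
			/* victory, it's dead: somebody else finished it */
			INP_INFO_RUNLOCK(&V_tcbinfo);
			return;
		}
		tp = tcp_drop(tp, ETIMEDOUT);	/* do-the-drop */
		INP_INFO_RUNLOCK(&V_tcbinfo);
		if (tp != NULL)
			INP_WUNLOCK(inp);
		return;
	}
	/* ... common-path timer work, inpcb lock only ... */
	INP_WUNLOCK(inp);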

R
Post by Julien Charbon
Hi,
Post by Gleb Smirnoff
J> > J> > Comparing stable/10 and head, I see two changes that could
J> > J> >
J> > J> > - callout_async_drain
J> > J> > - switch to READ lock for inp info in tcp timers
J> > J> >
J> > J> > That's why you are in To, Julien and Hans :)
J> > J> >
J> > J> > We continue investigating, and I will keep you updated.
J> > J> > However, any help is welcome. I can share cores.
J> >
J> > Now, spending some time with cores and adding a bunch of
J> > extra CTRs, I have a sequence of events that lead to the
J> > panic. In short, the bug is in the callout system. It seems
J> > to be not relevant to the callout_async_drain, at least for
J> > now. The transition to READ lock unmasked the problem, that's
J> > why NetflixBSD 10 doesn't panic.
J> >
J> > The panic requires heavy contention on the TCP info lock.
J> >
J> > [CPU 1] the callout fires, tcp_timer_keep entered
J> > [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
J> > [CPU 2] schedules the callout
J> > [CPU 2] tcp_discardcb called
J> > [CPU 2] callout successfully canceled
J> > [CPU 2] tcpcb freed
J> > [CPU 1] unblocks... panic
J> >
J> > When the lock was WLOCK, all contenders were resumed in a
J> > sequence they came to the lock. Now, that they are readers,
J> > once the lock is released, readers are resumed in a "random"
J> > order, and this allows tcp_discardcb to go before the old
J> > running callout, and this unmasks the panic.
J>
J> Highly interesting. I should be able to reproduce that (will be useful
J> for testing the corresponding fix).
J>
J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1
J> (success) here) when the callout cancellation is a success _but_ the
J> callout is current running, that should fix it.
J>
J>
J> Does _callout_stop_safe() is allowed to return 1 (success) even if the
J> callout is still currently running; a.k.a. it is not because you
J> successfully cancelled a callout that the callout is not currently running.
J>
J> We did propose a patch to make _callout_stop_safe() returns 0 (fail)
J>
J> callout_stop() should return 0 when the callout is currently being
J> serviced and indeed unstoppable
J> https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
J>
J> But this change impacted too many old code paths and was interesting
J> only for TCP timers and thus was abandoned.
The fix I am working on now is doing exactly that. callout_reset must
return 0 if the callout is currently running.
What are the old paths impacted?
Actually all the paths that check the callout_stop() return value AND
call both callout_reset() and callout_stop() AND use mpsafe callout().
And for each path, we would have to check our patch was ok (or not).
Thus, if you only do the change in callout_async_drain() context, you
don't impact the "old" callout_stop() behavior and get the desired
behavior for the TCP timers. Might be a good trade-off...
My 2 cents.
--
Julien
--------
Randall Stewart
***@netflix.com
803-317-4952
Randall Stewart via freebsd-net
2016-06-25 14:41:24 UTC
Permalink
Ok

Let's try this again with my source changed to my @freebsd.net :-)

Now I am also attaching a patch for you, Gleb; this will take some poking to
get into your NF-head since it incorporates some changes we made earlier.

I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..

Note we probably can move the code to use the callout lock init now.. but let's see if this works
on your setup on c096, and if so we can think about doing that.

R
Julien Charbon
2016-06-28 09:58:56 UTC
Permalink
Hi Randall,
Post by Randall Stewart via freebsd-net
Ok
Now I am also attaching a patch for you Gleb, this will take some poking to
get in to your NF-head since it incorporates some changes we made earlier.
I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..
Note we probably can move the code to use the callout lock init now.. but lets see if this works
on your setup on c096 and if so we can think about doing that.
Thanks for proposing a patch. I believe your patch will work with
callout lock init, but not without: You still have a use-after-free
issue on the tcpcb without callout lock init.

The case being subtle as usual, let me try to describe what could happen:

With your patch we have:

void
tcp_timer_keep(void *xtp)
{
struct tcpcb *tp = xtp;
struct tcptemp *t_template;
struct inpcb *inp;
CURVNET_SET(tp->t_vnet);
#ifdef TCPDEBUG
int ostate;

ostate = tp->t_state;
#endif
inp = tp->t_inpcb;
KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__,
tp));
INP_WLOCK(inp);
if (callout_pending(&tp->t_timers->tt_keep) ||   ### Use after free of tp here
    !callout_active(&tp->t_timers->tt_keep)) {
INP_WUNLOCK(inp);
CURVNET_RESTORE();
return;
}
...

The use-after-free scenario:

[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_WLOCK(inp);
[CPU 2] schedules tcp_timer_keep with callout_reset()
[CPU 2] tcp_discardcb called
[CPU 2] tcp_timer_keep callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks, the tcpcb is used

Then the tcpcb will be used just after being freed... It might crash or
not, depending on the case.

Extra notes:

o The invariant I see here is: The "callout successfully canceled"
step should never happen when "the callout is currently being executed".

o Solutions I see to enforce this invariant:

- First solution: Use callout lock init with inp lock, your patch
seems to permit that now.

- Second solution: Change callout_async_drain() behavior: It can
return 0 (fail) when the callout is currently being executed (no matter
what).

- Third solution: Don't trust callout_async_drain(callout) return
value of 1 (success) if the previous call of callout_reset(callout)
returned 0 (fail). That was the exact purpose of the r284261 change, but
this solution is also a step backward in the modernization of the TCP
timers/callouts...

https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261&r2=284260&pathrev=284261

Hopefully my description is clear enough...

--
Julien
Matthew Macy
2016-06-28 17:51:39 UTC
Permalink
You guys should really look at Samy Bahra's epoch based reclamation. I solved a similar problem in drm/linuxkpi using it.


-M
Post by Julien Charbon
Hi Randall,
Post by Randall Stewart via freebsd-net
Ok
Now I am also attaching a patch for you Gleb, this will take some poking to
get in to your NF-head since it incorporates some changes we made earlier.
I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..
Note we probably can move the code to use the callout lock init now.. but lets see if this works
on your setup on c096 and if so we can think about doing that.
Thanks for proposing a patch. I believe your patch will work with
callout lock init, but not without: You still have a use-after-free
issue on the tcpcb without callout lock init.
void
tcp_timer_keep(void *xtp)
{
struct tcpcb *tp = xtp;
struct tcptemp *t_template;
struct inpcb *inp;
CURVNET_SET(tp->t_vnet);
#ifdef TCPDEBUG
int ostate;
ostate = tp->t_state;
#endif
inp = tp->t_inpcb;
KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__,
tp));
INP_WLOCK(inp);
if (callout_pending(&tp->t_timers->tt_keep) ### Use after free
of tp here
!callout_active(&tp->t_timers->tt_keep)) {
INP_WUNLOCK(inp);
CURVNET_RESTORE();
return;
}
...
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_WLOCK(inp);
[CPU 2] schedules tcp_timer_keep with callout_reset()
[CPU 2] tcp_discardcb called
[CPU 2] tcp_timer_keep callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks, the tcpcb is used
Then the tcpcb will used just after being freed... Might also crash or
not depending in the case.
o The invariant I see here is: The "callout successfully canceled"
step should never happen when "the callout is currently being executed".
- First solution: Use callout lock init with inp lock, your patch
seems to permit that now.
- Second solution: Change callout_async_drain() behavior: It can
return 0 (fail) when the callout is currently being executed (no matter
what).
- Third solution: Don't trust callout_async_drain(callout) return
value of 1 (success) if the previous call of callout_reset(callout)
returned 0 (fail). That was the exact purpose of r284261 change, but
this solution is also step backward in modernization of TCP
timers/callout...
https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261&r2=284260&pathrev=284261
Hopefully my description is clear enough...
--
Julien
K. Macy
2016-06-28 22:51:57 UTC
Permalink
Post by Matthew Macy
You guys should really look at Samy Bahra's epoch based reclamation. I solved a similar problem in drm/linuxkpi using it.
The point being that this is a bug in the TCP life-cycle handling,
_not_ in callouts. Churning the callout interface is not the best /
only solution.
-M
Post by Matthew Macy
Post by Julien Charbon
Hi Randall,
Post by Randall Stewart via freebsd-net
Ok
Now I am also attaching a patch for you Gleb, this will take some poking to
get in to your NF-head since it incorporates some changes we made earlier.
I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..
Note we probably can move the code to use the callout lock init now.. but lets see if this works
on your setup on c096 and if so we can think about doing that.
Thanks for proposing a patch. I believe your patch will work with
callout lock init, but not without: You still have a use-after-free
issue on the tcpcb without callout lock init.
void
tcp_timer_keep(void *xtp)
{
struct tcpcb *tp = xtp;
struct tcptemp *t_template;
struct inpcb *inp;
CURVNET_SET(tp->t_vnet);
#ifdef TCPDEBUG
int ostate;
ostate = tp->t_state;
#endif
inp = tp->t_inpcb;
KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__,
tp));
INP_WLOCK(inp);
if (callout_pending(&tp->t_timers->tt_keep) ### Use after free
of tp here
!callout_active(&tp->t_timers->tt_keep)) {
INP_WUNLOCK(inp);
CURVNET_RESTORE();
return;
}
...
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_WLOCK(inp);
[CPU 2] schedules tcp_timer_keep with callout_reset()
[CPU 2] tcp_discardcb called
[CPU 2] tcp_timer_keep callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks, the tcpcb is used
Then the tcpcb will used just after being freed... Might also crash or
not depending in the case.
o The invariant I see here is: The "callout successfully canceled"
step should never happen when "the callout is currently being executed".
- First solution: Use callout lock init with inp lock, your patch
seems to permit that now.
- Second solution: Change callout_async_drain() behavior: It can
return 0 (fail) when the callout is currently being executed (no matter
what).
- Third solution: Don't trust callout_async_drain(callout) return
value of 1 (success) if the previous call of callout_reset(callout)
returned 0 (fail). That was the exact purpose of r284261 change, but
this solution is also step backward in modernization of TCP
timers/callout...
https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261&r2=284260&pathrev=284261
Hopefully my description is clear enough...
--
Julien
Matthew Macy
2016-06-29 06:19:45 UTC
Permalink
Post by K. Macy
Post by Matthew Macy
You guys should really look at Samy Bahra's epoch based reclamation. I solved a similar problem in drm/linuxkpi using it.
The point being that this is a bug in the TCP life cycle handling
_not_ in callouts. Churning the callout interface is not the best /
only solution.
-M
Please see D7017/D7018 as an example of an ultimately more robust solution than continued fiddling with the callout interface.

https://reviews.freebsd.org/D7018



Cheers.

-M
Post by K. Macy
Post by Matthew Macy
Post by Julien Charbon
Hi Randall,
Post by Randall Stewart via freebsd-net
Ok
Now I am also attaching a patch for you Gleb, this will take some poking to
get in to your NF-head since it incorporates some changes we made earlier.
I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..
Note we probably can move the code to use the callout lock init now.. but lets see if this works
on your setup on c096 and if so we can think about doing that.
Thanks for proposing a patch. I believe your patch will work with
callout lock init, but not without: You still have a use-after-free
issue on the tcpcb without callout lock init.
void
tcp_timer_keep(void *xtp)
{
struct tcpcb *tp = xtp;
struct tcptemp *t_template;
struct inpcb *inp;
CURVNET_SET(tp->t_vnet);
#ifdef TCPDEBUG
int ostate;
ostate = tp->t_state;
#endif
inp = tp->t_inpcb;
KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__,
tp));
INP_WLOCK(inp);
if (callout_pending(&tp->t_timers->tt_keep) ### Use after free
of tp here
!callout_active(&tp->t_timers->tt_keep)) {
INP_WUNLOCK(inp);
CURVNET_RESTORE();
return;
}
...
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_WLOCK(inp);
[CPU 2] schedules tcp_timer_keep with callout_reset()
[CPU 2] tcp_discardcb called
[CPU 2] tcp_timer_keep callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks, the tcpcb is used
Then the tcpcb will used just after being freed... Might also crash or
not depending in the case.
o The invariant I see here is: The "callout successfully canceled"
step should never happen when "the callout is currently being executed".
- First solution: Use callout lock init with inp lock, your patch
seems to permit that now.
- Second solution: Change callout_async_drain() behavior: It can
return 0 (fail) when the callout is currently being executed (no matter
what).
- Third solution: Don't trust callout_async_drain(callout) return
value of 1 (success) if the previous call of callout_reset(callout)
returned 0 (fail). That was the exact purpose of r284261 change, but
this solution is also step backward in modernization of TCP
timers/callout...
https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261&r2=284260&pathrev=284261
Hopefully my description is clear enough...
--
Julien
Matthew Macy
2016-06-30 08:41:34 UTC
Permalink
Post by Matthew Macy
Post by K. Macy
Post by Matthew Macy
You guys should really look at Samy Bahra's epoch based reclamation. I solved a similar problem in drm/linuxkpi using it.
The point being that this is a bug in the TCP life cycle handling
_not_ in callouts. Churning the callout interface is not the best /
only solution.
-M
Please see see D7017/D7018 as an example for an ultimately more robust solution than continued fiddling with the callout interface.
https://reviews.freebsd.org/D7018
I realized that this shortens the race but still leaves one open from the time the callout lock is dropped to the time the epoch begins. I have a proposed fix to make read locks never block and to essentially close the race window.

The next issue that comes up is that synchronize is called too often. I'll talk to Samy about it in a few hours and come up with a better design.

-M
Post by Matthew Macy
Post by K. Macy
Post by Matthew Macy
Post by Julien Charbon
Hi Randall,
Post by Randall Stewart via freebsd-net
Ok
Now I am also attaching a patch for you Gleb, this will take some poking to
get in to your NF-head since it incorporates some changes we made earlier.
I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was
never meant to have done)..
Note we probably can move the code to use the callout lock init now.. but lets see if this works
on your setup on c096 and if so we can think about doing that.
Thanks for proposing a patch. I believe your patch will work with
callout lock init, but not without: You still have a use-after-free
issue on the tcpcb without callout lock init.
void
tcp_timer_keep(void *xtp)
{
struct tcpcb *tp = xtp;
struct tcptemp *t_template;
struct inpcb *inp;
CURVNET_SET(tp->t_vnet);
#ifdef TCPDEBUG
int ostate;
ostate = tp->t_state;
#endif
inp = tp->t_inpcb;
KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__,
tp));
INP_WLOCK(inp);
if (callout_pending(&tp->t_timers->tt_keep) ### Use after free
of tp here
!callout_active(&tp->t_timers->tt_keep)) {
INP_WUNLOCK(inp);
CURVNET_RESTORE();
return;
}
...
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_WLOCK(inp);
[CPU 2] schedules tcp_timer_keep with callout_reset()
[CPU 2] tcp_discardcb called
[CPU 2] tcp_timer_keep callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks, the tcpcb is used
Then the tcpcb will used just after being freed... Might also crash or
not depending in the case.
o The invariant I see here is: The "callout successfully canceled"
step should never happen when "the callout is currently being executed".
- First solution: Use callout lock init with inp lock, your patch
seems to permit that now.
- Second solution: Change callout_async_drain() behavior: It can
return 0 (fail) when the callout is currently being executed (no matter
what).
- Third solution: Don't trust callout_async_drain(callout) return
value of 1 (success) if the previous call of callout_reset(callout)
returned 0 (fail). That was the exact purpose of r284261 change, but
this solution is also step backward in modernization of TCP
timers/callout...
https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261&r2=284260&pathrev=284261
Hopefully my description is clear enough...
--
Julien
Konstantin Belousov
2016-06-20 10:48:19 UTC
Permalink
Post by Julien Charbon
Hi,
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1
(success) here) when the callout cancellation is a success _but_ the
callout is current running, that should fix it.
Does _callout_stop_safe() is allowed to return 1 (success) even if the
callout is still currently running; a.k.a. it is not because you
successfully cancelled a callout that the callout is not currently running.
We did propose a patch to make _callout_stop_safe() returns 0 (fail)
callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable
https://reviews.freebsd.org/differential/changeset/?ref=62513&whitespace=ignore-most
But this change impacted too many old code paths and was interesting
only for TCP timers and thus was abandoned.
Look at the callout_stop() CS_MIGRBLOCK flag and the fix in sleepq_check_timeout().
Or, at least, do not allow this use of callout_stop() to rot again, after
the previous dozen regressions and fixes there.
Julien Charbon
2016-07-14 17:01:11 UTC
Permalink
Hi,
Post by Julien Charbon
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Finally, I was able to reproduce it (without glebius' fix). The trick
was to really lower the TCP keep timer expirations:

$ sysctl -a | grep tcp.keep
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
$ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
Password:
net.inet.tcp.keepidle: 7200000 -> 10
net.inet.tcp.keepintvl: 75000 -> 50
net.inet.tcp.keepinit: 75000 -> 10

Note: It will certainly close all your ssh connections to the tested
server.

Now I will test in order:

#1. glebius fix
https://svnweb.freebsd.org/base?view=revision&revision=302350

#2. rss extra fix
https://reviews.freebsd.org/D7135

#3. rrs TCP Timer cleanup
https://reviews.freebsd.org/D7136

My panic for reference:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 10; apic id = 28
instruction pointer = 0x20:0xffffffff80c346f1
stack pointer = 0x28:0xfffffe1f29b848b0
frame pointer = 0x28:0xfffffe1f29b848e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (4))
trap number = 9
panic: general protection fault
cpuid = 10
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
0xfffffe1f29b844a0
vpanic() at vpanic+0x182/frame 0xfffffe1f29b84520
panic() at panic+0x43/frame 0xfffffe1f29b84580
trap_fatal() at trap_fatal+0x351/frame 0xfffffe1f29b845e0
trap() at trap+0x820/frame 0xfffffe1f29b847f0
calltrap() at calltrap+0x8/frame 0xfffffe1f29b847f0
--- trap 0x9, rip = 0xffffffff80c346f1, rsp = 0xfffffe1f29b848c0, rbp =
0xfffffe1f29b848e0 ---
tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfffffe1f29b848e0
softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfffffe1f29b849c0
softclock() at softclock+0x47/frame 0xfffffe1f29b849e0
intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame
0xfffffe1f29b84a20
ithread_loop() at ithread_loop+0xa6/frame 0xfffffe1f29b84a70
fork_exit() at fork_exit+0x84/frame 0xfffffe1f29b84ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1f29b84ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

--
Julien
Larry Rosenman
2016-07-14 21:02:25 UTC
Permalink
Post by Julien Charbon
Hi,
Post by Julien Charbon
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Finally, I was able to reproduce it (without glebius fix). The trick
$ sysctl -a | grep tcp.keep
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
$ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
net.inet.tcp.keepidle: 7200000 -> 10
net.inet.tcp.keepintvl: 75000 -> 50
net.inet.tcp.keepinit: 75000 -> 10
Note: It will certainly close all your ssh connections to the tested
server.
#1. glebius fix
https://svnweb.freebsd.org/base?view=revision&revision=302350
#2. rss extra fix
https://reviews.freebsd.org/D7135
#3. rrs TCP Timer cleanup
https://reviews.freebsd.org/D7136
Fatal trap 9: general protection fault while in kernel mode
cpuid = 10; apic id = 28
0x20:0xffffffff80c346f1
stack pointer = 0x28:0xfffffe1f29b848b0
frame pointer = 0x28:0xfffffe1f29b848e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (4))
trap number = 9
panic: general protection fault
cpuid = 10
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
0xfffffe1f29b844a0
vpanic() at vpanic+0x182/frame 0xfffffe1f29b84520
panic() at panic+0x43/frame 0xfffffe1f29b84580
trap_fatal() at trap_fatal+0x351/frame 0xfffffe1f29b845e0
trap() at trap+0x820/frame 0xfffffe1f29b847f0
calltrap() at calltrap+0x8/frame 0xfffffe1f29b847f0
--- trap 0x9, rip = 0xffffffff80c346f1, rsp = 0xfffffe1f29b848c0, rbp =
0xfffffe1f29b848e0 ---
tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfffffe1f29b848e0
softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfffffe1f29b849c0
softclock() at softclock+0x47/frame 0xfffffe1f29b849e0
intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame
0xfffffe1f29b84a20
ithread_loop() at ithread_loop+0xa6/frame 0xfffffe1f29b84a70
fork_exit() at fork_exit+0x84/frame 0xfffffe1f29b84ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1f29b84ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
--
Julien
please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 17716 Limpia Crk, Round Rock, TX 78664-7281
Julien Charbon
2016-07-21 07:54:20 UTC
Permalink
Hi,
Post by Larry Rosenman
Post by Julien Charbon
Post by Julien Charbon
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Finally, I was able to reproduce it (without glebius fix). The trick
$ sysctl -a | grep tcp.keep
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
$ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
net.inet.tcp.keepidle: 7200000 -> 10
net.inet.tcp.keepintvl: 75000 -> 50
net.inet.tcp.keepinit: 75000 -> 10
Note: It will certainly close all your ssh connections to the tested
server.
#1. glebius fix
https://svnweb.freebsd.org/base?view=revision&revision=302350
#2. rss extra fix
https://reviews.freebsd.org/D7135
#3. rrs TCP Timer cleanup
https://reviews.freebsd.org/D7136
please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
My test results so far:

#1. r302350: First glebius TCP timer fix: no more TCP timer kernel
panics during 48h under a 200k TCP queries per second load.

Sadly I was unable to reproduce the issue described here:

panic: bogus refcnt 0 on lle 0xfffff80004608c00
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884

#2. r303098: Got all kernel callout changes since r302350 (updates to the
callout code are indeed always full of surprises):
https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098

No kernel panic either.

Still to test:

#3. rss extra fix (if still relevant now)
https://reviews.freebsd.org/D7135

#4. rrs TCP Timer cleanup:
https://reviews.freebsd.org/D7136

My 2 cents.

--
Julien
Hans Petter Selasky
2016-07-21 08:05:48 UTC
Permalink
Post by Julien Charbon
Hi,
Post by Larry Rosenman
Post by Julien Charbon
Post by Julien Charbon
Post by Gleb Smirnoff
J> > Comparing stable/10 and head, I see two changes that could
J> >
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> >
J> > That's why you are in To, Julien and Hans :)
J> >
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that lead to the
panic. In short, the bug is in the callout system. It seems
to be not relevant to the callout_async_drain, at least for
now. The transition to READ lock unmasked the problem, that's
why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was WLOCK, all contenders were resumed in a
sequence they came to the lock. Now, that they are readers,
once the lock is released, readers are resumed in a "random"
order, and this allows tcp_discardcb to go before the old
running callout, and this unmasks the panic.
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Finally, I was able to reproduce it (without glebius fix). The trick
$ sysctl -a | grep tcp.keep
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
$ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl
net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
net.inet.tcp.keepidle: 7200000 -> 10
net.inet.tcp.keepintvl: 75000 -> 50
net.inet.tcp.keepinit: 75000 -> 10
Note: It will certainly close all your ssh connections to the tested
server.
#1. glebius fix
https://svnweb.freebsd.org/base?view=revision&revision=302350
#2. rss extra fix
https://reviews.freebsd.org/D7135
#3. rrs TCP Timer cleanup
https://reviews.freebsd.org/D7136
please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
#1. r302350: First glebius TCP timer fix: No more TCP timer kernel
panic during 48h under 200k TCP query per second load.
panic: bogus refcnt 0 on lle 0xfffff80004608c00
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
#2. r303098: Got all kernel callout changes since r302350, (updates on
https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log&pathrev=303098
No kernel panic either.
#3. rss extra fix (if still relevant now)
https://reviews.freebsd.org/D7135
https://reviews.freebsd.org/D7136
My 2 cents.
Hi,

You should also check for memory leaks using "vmstat -m".

--HPS
