Discussion:
[Bug 254303] Fatal trap 12: page fault while in kernel mode ((frr 7.5_1 + Freebsd 13 Beta3) zebra crashes server when routes are populated)
b***@freebsd.org
2021-03-17 12:01:05 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254303

--- Comment #5 from Aleks <***@veesp.com> ---
(In reply to Marek Zarychta from comment #4)
So, I took the kernel source from
https://download.freebsd.org/ftp/releases/amd64/13.0-RC2/src.txz
and built it with "options FIB_ALGO":
FreeBSD 13.0-RC2 FreeBSD 13.0-RC2 #0: Wed Mar 17 13:23:47 EET 2021
:/usr/obj/usr/src/amd64.amd64/sys/CUSTOM amd64
I disabled FRR autostart and rebooted the server.

After the reboot I set multipath=1, loaded dpdk_lpm4/6, and then
started FRR.
[fib_algo] inet.0 (bsearch4#13) rebuild_fd: switching algo to radix4_lockless
[fib_algo] fib_module_register: attaching dpdk_lpm4 to inet
[fib_algo] fib_module_register: attaching dpdk_lpm6 to inet6
[fib_algo] inet.0 (radix4_lockless#114) rebuild_fd: switching algo to dpdk_lpm4

After bringing up the second BGP full-view session, the server still crashed.
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freebsd.org
2021-03-18 03:35:14 UTC

--- Comment #6 from Zhenlei Huang <***@gmail.com> ---
CC Alexander V. Chernikov
b***@freebsd.org
2021-03-17 08:20:58 UTC

--- Comment #4 from Marek Zarychta <***@plan-b.pwste.edu.pl> ---
(In reply to Aleks from comment #3)
If you can build a custom kernel with "options FIB_ALGO", install it and after
reboot load the module dpdk_lpm4 (and dpdk_lpm6 if appropriate), then please
give it a try.
b***@freebsd.org
2021-03-20 10:28:07 UTC

Alexander V. Chernikov <***@FreeBSD.org> changed:

What    |Removed                |Added
----------------------------------------------------------------------------
Assignee|***@FreeBSD.org        |***@FreeBSD.org
b***@freebsd.org
2021-03-20 14:10:27 UTC

Rodney W. Grimes <***@FreeBSD.org> changed:

What    |Removed                |Added
----------------------------------------------------------------------------
CC      |                       |***@FreeBSD.org,
        |                       |***@FreeBSD.org
--
You are receiving this mail because:
You are on the CC list for the bug.
b***@freebsd.org
2021-03-21 18:46:54 UTC

--- Comment #7 from Alexander V. Chernikov <***@FreeBSD.org> ---
(In reply to Aleks from comment #5)
Is there any chance you could share kernel&core?
b***@freebsd.org
2021-03-21 21:30:53 UTC

--- Comment #8 from Aleks <***@veesp.com> ---
(In reply to Alexander V. Chernikov from comment #7)
Sure, just tell me what you mean by "share kernel&core"
b***@freebsd.org
2021-03-26 23:46:29 UTC

--- Comment #16 from Alexander V. Chernikov <***@FreeBSD.org> ---
(In reply to Aleks from comment #15)
Thank you!

Short summary:

From the private core.5 you sent me:
* rtentry looks perfectly fine, but the nexthop pointer is (mostly) zeroed

* from the core2: failure to resolve nh_priv pointer
* from the original kgdb_backtrace: nhg has zero pointer to nh_ctl

So far it looks like we're removing the additional reference from the nexthop
group in some corner case scenario, which results in the group being freed,
with the rtentry still pointing to this group.
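
To make the suspected failure mode concrete, here is a minimal C sketch of a
reference-counted group that loses one reference too many. All names
(grp_model, route_model, grp_ref, grp_unref, route_dangles) are hypothetical
and are not the kernel's structures or functions. The "free" is modelled with
a flag so the example stays well-defined.

```c
#include <assert.h>

/* Hypothetical stand-in for a nexthop group: a counter plus a "freed" flag. */
struct grp_model {
	int refcnt;
	int freed;	/* set instead of calling free(), to keep the demo defined */
};

/* Hypothetical stand-in for a route entry that points at a group. */
struct route_model {
	struct grp_model *grp;
};

static void
grp_ref(struct grp_model *g)
{
	g->refcnt++;
}

static void
grp_unref(struct grp_model *g)
{
	assert(g->refcnt > 0);
	if (--g->refcnt == 0)
		g->freed = 1;	/* last reference gone: the group is released */
}

/*
 * Returns 1 if the route still points at a group that has been released.
 * extra_unref != 0 models the suspected corner case.
 */
static int
route_dangles(int extra_unref)
{
	struct grp_model g = { 1, 0 };	/* creator holds one reference */
	struct route_model rt = { &g };

	grp_ref(&g);		/* the route takes its own reference: 2 */
	grp_unref(&g);		/* creator drops its reference: 1 */
	if (extra_unref)
		grp_unref(&g);	/* one release too many: 0, group released */
	return (rt.grp->freed);
}
```

With balanced calls, route_dangles(0) reports that the group is still alive.
With the extra release, route_dangles(1) reports a route that points at a
released group, which is the shape of the symptom described above.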

Re reproduction: I don't have 2 full-view peers, so I ended up duplicating the
feed from a single peer & introducing some delay, to mimic propagation delays.
So far I wasn't able to reproduce any panic.
Are there any additional specifics (e.g. links flapping) in the setup?


Is there any chance you could run

`stdbuf -o0 route -n monitor > zebra_log.txt`

at startup (or, actually, at the point in time when all peers are down) and
then try to bring up the first and then the second peer?
If you could also run something like
`while true; do date >> nhg.log ; netstat -4OnW >> nhg.log ; sleep 5; done`

and share both files along with the core backtrace, that would be awesome.

If there is a possibility of getting access to the server, that would really
speed things up.
b***@freebsd.org
2021-03-29 23:23:04 UTC

--- Comment #19 from Alexander V. Chernikov <***@FreeBSD.org> ---
So, it looks like it is a combination of 3 bugs:

The actual thing corrupting memory is
https://cgit.freebsd.org/src/commit/?id=42f997d9b721ce5b64c37958f21fa81630f5a224
(in 13.0-RC4).

We get to this codepath by having 127 nexthop groups (the number at which we
trigger an array resize). This is addressed in
https://cgit.freebsd.org/src/commit/?id=9095dc7da4cf0c484fb1160b2180b7329b09b107
(only in HEAD atm).

We get that many nexthop groups (there should be only one) because not all of
the memory in the comparison part of the nexthop group is zeroed. This is
addressed in
https://cgit.freebsd.org/src/commit/?id=823a80f4f9037b6b9611aaceb21f53115d1e64f1
(in 13-S, not sure if it lands in 13.0-R).
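
The third point is easier to see in a small example. The following C sketch
is an assumption-laden illustration, not the code from the commit: grp_key,
key_init, key_equal and same_group_matches are hypothetical names, and the
stale allocator contents are simulated with an explicit fill byte so the
behaviour is deterministic. It shows how a whole-buffer comparison can treat
two logically identical groups as different when the unused part of the
buffer is not zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_NH 4	/* hypothetical capacity of the comparison key */

/* Hypothetical comparison key: only the first `count` slots are meaningful. */
struct grp_key {
	uint32_t count;
	uint32_t nh_idx[MAX_NH];
};

/*
 * Build a key in caller-provided storage. `stale` models whatever bytes the
 * allocator left behind. With zero_first == 0 the unused slots keep that
 * stale value; with zero_first != 0 the whole key is cleared first.
 */
static void
key_init(struct grp_key *k, const uint32_t *idx, uint32_t count,
    unsigned char stale, int zero_first)
{
	assert(count <= MAX_NH);
	memset(k, stale, sizeof(*k));	/* simulate non-zeroed memory */
	if (zero_first)
		memset(k, 0, sizeof(*k));
	k->count = count;
	memcpy(k->nh_idx, idx, count * sizeof(idx[0]));
}

/* Whole-structure comparison, as a lookup by key might do. */
static int
key_equal(const struct grp_key *a, const struct grp_key *b)
{
	return (memcmp(a, b, sizeof(*a)) == 0);
}

/* Returns 1 if two logically identical groups compare equal. */
static int
same_group_matches(int zero_first)
{
	const uint32_t idx[2] = { 7, 9 };
	struct grp_key a, b;

	key_init(&a, idx, 2, 0xAA, zero_first);
	key_init(&b, idx, 2, 0x55, zero_first);
	return (key_equal(&a, &b));
}
```

Without zeroing, same_group_matches(0) fails to find the match, so every
lookup would create another copy of what should be a single group. With
zeroing, same_group_matches(1) finds it.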
b***@freebsd.org
2021-03-30 08:01:49 UTC

--- Comment #20 from commit-***@FreeBSD.org ---
A commit in branch stable/13 references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=923e7f7e12670e97b097a195e69c848a6e8773a2

commit 923e7f7e12670e97b097a195e69c848a6e8773a2
Author: Alexander V. Chernikov <***@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <***@FreeBSD.org>
CommitDate: 2021-03-30 07:34:31 +0000

Fix nexhtop group index array scaling.

The current code has the limit of 127 nexthop groups due to the
wrongly-checked bitmask_copy() return value.

PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>

(cherry picked from commit 9095dc7da4cf0c484fb1160b2180b7329b09b107)

sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
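
For readers unfamiliar with this class of bug, here is a minimal C sketch of
how a misread return value can pin an index at its initial size. It is not
the code from sys/net/route/nhgrp.c: idx_head, idx_copy, idx_grow and
capacity_after_grow are hypothetical names, the 128-slot starting size is an
arbitrary choice for the demo, and the "0 means success" convention is an
assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable index: one byte per slot. */
struct idx_head {
	uint8_t  *slots;
	uint32_t  capacity;
};

/*
 * Copy the old slots into a larger buffer.
 * Returns 0 on success, 1 if the new buffer is too small (assumed convention).
 */
static int
idx_copy(const struct idx_head *h, uint8_t *new_slots, uint32_t new_cap)
{
	if (new_cap < h->capacity)
		return (1);
	memset(new_slots, 0, new_cap);
	memcpy(new_slots, h->slots, h->capacity);
	return (0);
}

/*
 * Try to double the index. inverted_check != 0 reproduces the class of bug:
 * success (0) is mistaken for failure, so the new buffer is thrown away.
 * Returns the capacity after the attempt.
 */
static uint32_t
idx_grow(struct idx_head *h, int inverted_check)
{
	uint32_t new_cap = h->capacity * 2;
	uint8_t *new_slots = malloc(new_cap);
	int error;

	if (new_slots == NULL)
		return (h->capacity);
	error = idx_copy(h, new_slots, new_cap);
	if (inverted_check ? (error == 0) : (error != 0)) {
		free(new_slots);	/* treated as failure: keep the old index */
		return (h->capacity);
	}
	free(h->slots);
	h->slots = new_slots;
	h->capacity = new_cap;
	return (h->capacity);
}

/* Convenience wrapper: start at 128 slots and attempt one resize. */
static uint32_t
capacity_after_grow(int inverted_check)
{
	struct idx_head h = { calloc(128, 1), 128 };
	uint32_t cap;

	assert(h.slots != NULL);
	cap = idx_grow(&h, inverted_check);
	free(h.slots);
	return (cap);
}
```

With the inverted check, capacity_after_grow(1) stays at 128 no matter how
often a resize is requested. With the correct check, capacity_after_grow(0)
doubles to 256.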
b***@freebsd.org
2021-03-29 23:07:41 UTC

--- Comment #18 from commit-***@FreeBSD.org ---
A commit in branch main references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=9095dc7da4cf0c484fb1160b2180b7329b09b107

commit 9095dc7da4cf0c484fb1160b2180b7329b09b107
Author: Alexander V. Chernikov <***@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <***@FreeBSD.org>
CommitDate: 2021-03-29 23:00:17 +0000

Fix nexhtop group index array scaling.

The current code has the limit of 127 nexthop groups due to the
wrongly-checked bitmask_copy() return value.

PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
MFC after: 1 day

sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
b***@freebsd.org
2021-03-28 16:23:54 UTC

--- Comment #17 from Aleks <***@veesp.com> ---
(In reply to Alexander V. Chernikov from comment #16)
I'll give you both files and server access via email.
b***@freebsd.org
2021-03-31 20:09:41 UTC

--- Comment #21 from commit-***@FreeBSD.org ---
A commit in branch releng/13.0 references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=b7fbdb5042c619221ee0b97573affcb8bcb59458

commit b7fbdb5042c619221ee0b97573affcb8bcb59458
Author: Alexander V. Chernikov <***@FreeBSD.org>
AuthorDate: 2021-03-29 23:00:17 +0000
Commit: Alexander V. Chernikov <***@FreeBSD.org>
CommitDate: 2021-03-31 20:00:10 +0000

Fix nexhtop group index array scaling.

The current code has the limit of 127 nexthop groups due to the
wrongly-checked bitmask_copy() return value.

PR: 254303
Reported by: Aleks <a.ivanov at veesp.com>
Approved by: re (gjb)

(cherry picked from commit 923e7f7e12670e97b097a195e69c848a6e8773a2)

sys/net/route/nhgrp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
b***@freebsd.org
2021-04-04 08:55:43 UTC

--- Comment #22 from Alexander V. Chernikov <***@FreeBSD.org> ---
All relevant patches are in 13-R.
Does it fix the issue for you?
b***@freebsd.org
2021-04-04 12:57:35 UTC

--- Comment #23 from Aleks <***@veesp.com> ---
(In reply to Alexander V. Chernikov from comment #22)
For me - yes. Thank you very much!
b***@freebsd.org
2021-04-04 13:00:11 UTC

Alexander V. Chernikov <***@FreeBSD.org> changed:

What      |Removed                |Added
----------------------------------------------------------------------------
Status    |In Progress            |Closed
Resolution|---                    |FIXED