Discussion:
page fault while in kernel mode - after upgrade from 12.2 to 13.0
Michael Schmiedgen
2021-05-03 18:04:30 UTC
Hi List,

If I start a Samba jail, the system crashes after a few seconds. This is very reproducible.

The system runs ~10 jails and 3 bhyve VMs. It is a Dell server with a Xeon E3-1240, 64 GB RAM, and a 3-way ZFS mirror.

The crash also occurs a few seconds after I start a phone call through the SIP VM on that machine,
which is very strange.

I got some log messages suggesting that I raise somaxconn, so I set

kern.ipc.somaxconn=4096

in sysctl.conf.


Below is some debug information; please let me know if I should provide more.

Should I open a bug report?

Thank you very much!
Michael



Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ca52c0
stack pointer = 0x28:0xfffffe019d039650
frame pointer = 0x28:0xfffffe019d039690
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 649 (devd)
trap number = 12
panic: page fault
cpuid = 0
time = 1620061253
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80ca51c3 at sbappendaddr_locked+0x93
#8 0xffffffff80cb437a at uipc_send+0x73a
#9 0xffffffff80ca9053 at sosend_generic+0x633
#10 0xffffffff80ca94e0 at sosend+0x50
#11 0xffffffff80caff2e at kern_sendit+0x20e
#12 0xffffffff80cb032b at sendit+0x1db
#13 0xffffffff80cb013d at sys_sendto+0x4d
#14 0xffffffff8108ba8c at amd64_syscall+0x10c
#15 0xffffffff810620ce at fast_syscall_common+0xf8
Uptime: 2m2s
Dumping 2373 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) list *0xffffffff80ca52c0
0xffffffff80ca52c0 is in sbappendaddr_locked_internal (/usr/src/sys/kern/uipc_sockbuf.c:1169).
1164 if (ctrl_last)
1165 ctrl_last->m_next = m0; /* concatenate data to control */
1166 else
1167 control = m0;
1168 m->m_next = control;
1169 for (n = m; n->m_next != NULL; n = n->m_next)
1170 sballoc(sb, n);
1171 sballoc(sb, n);
1172 nlast = n;
1173 SBLINKRECORD(sb, m);
(kgdb) backtrace
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108b187 in trap_fatal (frame=0xfffffe019d039590, eva=0) at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108b1df in trap_pfault (frame=frame@entry=0xfffffe019d039590, usermode=false, signo=<optimized out>, signo@entry=0x0,
ucode=<optimized out>, ucode@entry=0x0)
at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108a83d in trap (frame=0xfffffe019d039590) at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 sbappendaddr_locked_internal (sb=sb@entry=0xfffff800447ef4f8, asa=asa@entry=0xffffffff815cde60 <sun_noname>, m0=<optimized out>,
m0@entry=0xfffff8008b186500, control=0xfffff8008b186500,
control@entry=0x0, ctrl_last=<optimized out>) at /usr/src/sys/kern/uipc_sockbuf.c:1169
#10 0xffffffff80ca51c3 in sbappendaddr_locked (sb=sb@entry=0xfffff800447ef4f8, asa=asa@entry=0xffffffff815cde60 <sun_noname>,
m0=m0@entry=0xfffff8008b186500, control=0x0)
at /usr/src/sys/kern/uipc_sockbuf.c:1205
#11 0xffffffff80cb437a in uipc_send (so=<optimized out>, flags=0, m=0xfffff8008b186500, nam=<optimized out>, control=0x10, td=<optimized out>) at
/usr/src/sys/kern/uipc_usrreq.c:1056
#12 0xffffffff80ca9053 in sosend_generic (so=0xfffff800444abb10, addr=0x0, uio=<optimized out>, top=0xfffff8008b186500, control=0x0, flags=0,
td=0xfffffe0165ddc500)
at /usr/src/sys/kern/uipc_socket.c:1755
#13 0xffffffff80ca94e0 in sosend (so=0x100, so@entry=0xfffff800444abb10, addr=0xb5ea5000, uio=0xfffff8008b186500, uio@entry=0xfffffe019d039898,
top=0x10, top@entry=0x0,
control=control@entry=0x0, flags=272, flags@entry=0, td=0xfffffe0165ddc500) at /usr/src/sys/kern/uipc_socket.c:1810
#14 0xffffffff80caff2e in kern_sendit (td=<optimized out>, td@entry=0xfffffe0165ddc500, s=8, mp=<optimized out>, mp@entry=0xfffffe019d039980, flags=0,
control=0x0,
segflg=segflg@entry=UIO_USERSPACE) at /usr/src/sys/kern/uipc_syscalls.c:798
#15 0xffffffff80cb032b in sendit (td=0xfffffe0165ddc500, s=-1242935296, mp=mp@entry=0xfffffe019d039980, flags=16) at /usr/src/sys/kern/uipc_syscalls.c:723
#16 0xffffffff80cb013d in sys_sendto (td=0x100, uap=<optimized out>) at /usr/src/sys/kern/uipc_syscalls.c:841
#17 0xffffffff8108ba8c in syscallenter (td=0xfffffe0165ddc500) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#18 amd64_syscall (td=0xfffffe0165ddc500, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1156
#19 <signal handler called>
#20 0x00000000002858ea in ?? ()
Mark Johnston
2021-05-03 19:45:53 UTC
Post by Michael Schmiedgen
(kgdb) list *0xffffffff80ca52c0
0xffffffff80ca52c0 is in sbappendaddr_locked_internal (/usr/src/sys/kern/uipc_sockbuf.c:1169).
1164 if (ctrl_last)
1165 ctrl_last->m_next = m0; /* concatenate data to control */
1166 else
1167 control = m0;
1168 m->m_next = control;
1169 for (n = m; n->m_next != NULL; n = n->m_next)
1170 sballoc(sb, n);
1171 sballoc(sb, n);
1172 nlast = n;
1173 SBLINKRECORD(sb, m);
So we are crashing because "n" is somehow NULL? That seems difficult to
explain. Can you show the local variables in this frame?

Does the panic always have the same stack trace?
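
A minimal user-space sketch of the loop at uipc_sockbuf.c:1169-1171 may help show why a NULL "n" is hard to explain. The names struct mbuf_sketch and walk_chain are hypothetical stand-ins, not kernel code, and total_len only imitates the accounting that sballoc() performs. The loop advances only when n->m_next is non-NULL, and "m" was already dereferenced on line 1168, so on an intact chain "n" can never become NULL. With m_next at offset 0, as in the sketch, a read of n->m_next through a NULL "n" would touch address 0x0, which matches the reported fault address.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mbuf; only the link matters. */
struct mbuf_sketch {
    struct mbuf_sketch *m_next; /* next buffer in the same record */
    int m_len;                  /* payload bytes in this buffer */
};

/*
 * Same loop shape as the quoted lines 1169-1171: visit every buffer in
 * the chain and return the last one. total_len stands in for sballoc().
 */
static struct mbuf_sketch *
walk_chain(struct mbuf_sketch *m, int *total_len)
{
    struct mbuf_sketch *n;

    assert(m != NULL);          /* the loop dereferences m on entry */
    *total_len = 0;
    for (n = m; n->m_next != NULL; n = n->m_next)
        *total_len += n->m_len;
    *total_len += n->m_len;     /* account for the final buffer */
    return (n);
}
```

Called on a three-buffer chain, walk_chain returns a pointer to the third buffer, never NULL. A fault inside this loop therefore points at the chain being changed or overwritten by something else while it is walked.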
Michael Schmiedgen
2021-05-04 18:38:39 UTC
Hi Mark,

Sorry for the delay; I can only test after work. I triggered two more panics, this time
with different results (see below). Can I provide any more information?

Thank you!
Michael



--- #1


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x388
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80d3fa67
stack pointer = 0x28:0xfffffe0115bea9c0
frame pointer = 0x28:0xfffffe0115beaa20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi1: netisr 0)
trap number = 12
panic: page fault
cpuid = 1
time = 1620144777
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80bcae5d at ithread_loop+0x24d
#8 0xffffffff80bc7c5e at fork_exit+0x7e
#9 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 3m51s
Dumping 2617 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) list *0xffffffff80d3fa67
0xffffffff80d3fa67 is in swi_net (/usr/src/sys/net/netisr.c:918).
913 if (local_npw.nw_head == NULL)
914 local_npw.nw_tail = NULL;
915 local_npw.nw_len--;
916 VNET_ASSERT(m->m_pkthdr.rcvif != NULL,
917 ("%s:%d rcvif == NULL: m=%p", __func__, __LINE__, m));
918 CURVNET_SET(m->m_pkthdr.rcvif->if_vnet);
919 netisr_proto[proto].np_handler(m);
920 CURVNET_RESTORE();
921 }
922 KASSERT(local_npw.nw_len == 0,
(kgdb) backtrace
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108b187 in trap_fatal (frame=0xfffffe0115bea900, eva=904) at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108b1df in trap_pfault (frame=frame@entry=0xfffffe0115bea900, usermode=false, signo=<optimized out>, signo@entry=0x0,
ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108a83d in trap (frame=0xfffffe0115bea900) at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 0xffffffff80d3fa67 in netisr_process_workstream_proto (nwsp=<optimized out>, proto=1) at /usr/src/sys/net/netisr.c:918
#10 swi_net (arg=<optimized out>) at /usr/src/sys/net/netisr.c:966
#11 0xffffffff80bcae5d in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff80003dbb600) at /usr/src/sys/kern/kern_intr.c:1168
#12 ithread_execute_handlers (p=<optimized out>, ie=0xfffff80003dbb600) at /usr/src/sys/kern/kern_intr.c:1181
#13 ithread_loop (arg=arg@entry=0xfffff80003dced40) at /usr/src/sys/kern/kern_intr.c:1269
#14 0xffffffff80bc7c5e in fork_exit (callout=0xffffffff80bcac10 <ithread_loop>, arg=0xfffff80003dced40, frame=0xfffffe0115beab00) at
/usr/src/sys/kern/kern_fork.c:1069
#15 <signal handler called>


--- #2


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ca599c
stack pointer = 0x28:0xfffffe0115bea6c0
frame pointer = 0x28:0xfffffe0115bea700
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi1: netisr 0)
trap number = 12
panic: page fault
cpuid = 1
time = 1620152374
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80dbf0ae at tcp_do_segment+0x10ce
#8 0xffffffff80dbd21e at tcp_input+0xabe
#9 0xffffffff80dafc15 at ip_input+0x125
#10 0xffffffff80d3fa7b at swi_net+0x12b
#11 0xffffffff80bcae5d at ithread_loop+0x24d
#12 0xffffffff80bc7c5e at fork_exit+0x7e
#13 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 2h3m59s
Dumping 2666 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108b187 in trap_fatal (frame=0xfffffe0115bea600, eva=8)
at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108b1df in trap_pfault (frame=frame@entry=0xfffffe0115bea600,
usermode=false, signo=<optimized out>, signo@entry=0x0,
ucode=<optimized out>, ucode@entry=0x0)
at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108a83d in trap (frame=0xfffffe0115bea600)
at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 sbcut_internal (sb=0xfffff80522aa09c0, len=203, len@entry=476)
at /usr/src/sys/kern/uipc_sockbuf.c:1491
#10 0xffffffff80ca5b8a in sbcut_locked (sb=0xfffff80522aa09c0,
len=-743943424, len@entry=476) at /usr/src/sys/kern/uipc_sockbuf.c:1591
#11 0xffffffff80dbf0ae in tcp_do_segment (m=0xfffff8004c2aae00,
th=<optimized out>, so=<optimized out>, tp=<optimized out>,
drop_hdrlen=52, tlen=<optimized out>, iptos=0 '\000')
at /usr/src/sys/netinet/tcp_input.c:2918
#12 0xffffffff80dbd21e in tcp_input (mp=<optimized out>,
offp=<optimized out>, proto=<optimized out>)
at /usr/src/sys/netinet/tcp_input.c:1382
#13 0xffffffff80dafc15 in ip_input (m=0x0)
at /usr/src/sys/netinet/ip_input.c:829
#14 0xffffffff80d3fa7b in netisr_process_workstream_proto (
nwsp=<optimized out>, proto=1) at /usr/src/sys/net/netisr.c:919
#15 swi_net (arg=<optimized out>) at /usr/src/sys/net/netisr.c:966
#16 0xffffffff80bcae5d in intr_event_execute_handlers (p=<optimized out>,
ie=0xfffff80003bbe500) at /usr/src/sys/kern/kern_intr.c:1168
#17 ithread_execute_handlers (p=<optimized out>, ie=0xfffff80003bbe500)
at /usr/src/sys/kern/kern_intr.c:1181
#18 ithread_loop (arg=arg@entry=0xfffff80003cb6d40)
at /usr/src/sys/kern/kern_intr.c:1269
#19 0xffffffff80bc7c5e in fork_exit (
callout=0xffffffff80bcac10 <ithread_loop>, arg=0xfffff80003cb6d40,
frame=0xfffffe0115beab00) at /usr/src/sys/kern/kern_fork.c:1069
#20 <signal handler called>


Mark Johnston
2021-05-04 19:02:42 UTC
Post by Michael Schmiedgen
Hi Mark,
sorry for the delay, I only can test after work. I triggered another 2 panics, this time
with a different result (see below). Can I provide some more information?
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
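
To illustrate why an INVARIANTS kernel helps here, the following is a rough user-space imitation of a compile-time-gated consistency check. KASSERT_SKETCH, INVARIANTS_SKETCH, struct buf_sketch and buf_release are all hypothetical names for this sketch; the kernel's real KASSERT has a different argument form. The point is only that the extra check exists in the binary when the option is set, so corruption is reported where it is first detectable instead of as a later, unrelated-looking page fault.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Imitation of an INVARIANTS-only assertion: the check is compiled in
 * only when INVARIANTS_SKETCH is defined, and costs nothing otherwise.
 */
#ifdef INVARIANTS_SKETCH
#define KASSERT_SKETCH(cond, msg) do {                  \
    if (!(cond)) {                                      \
        fprintf(stderr, "panic: %s\n", (msg));          \
        abort();                                        \
    }                                                   \
} while (0)
#else
#define KASSERT_SKETCH(cond, msg) do { } while (0)
#endif

/* Hypothetical reference-counted buffer. */
struct buf_sketch {
    int refs;
};

/* Drop one reference and return the remaining count. */
static int
buf_release(struct buf_sketch *b)
{
    assert(b != NULL);  /* always-on check, for contrast */
    KASSERT_SKETCH(b->refs > 0, "release of buffer with no references");
    return (--b->refs);
}
```

Built without -DINVARIANTS_SKETCH, one release too many silently drives the count negative; built with it, the same call stops immediately with a message naming the broken invariant.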
Michael Schmiedgen
2021-05-05 09:08:49 UTC
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
I will try INVARIANTS after work, but in the meantime I got two more panics
overnight.



--- #1


Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ca599c
stack pointer = 0x28:0xfffffe0115bc46c0
frame pointer = 0x28:0xfffffe0115bc4700
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi1: netisr 0)
trap number = 12
panic: page fault
cpuid = 7
time = 1620172732
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108b187 at trap_fatal+0x387
#4 0xffffffff8108b1df at trap_pfault+0x4f
#5 0xffffffff8108a83d at trap+0x27d
#6 0xffffffff810617a8 at calltrap+0x8
#7 0xffffffff80dbf0ae at tcp_do_segment+0x10ce
#8 0xffffffff80dbd21e at tcp_input+0xabe
#9 0xffffffff80dafc15 at ip_input+0x125
#10 0xffffffff80d3fa7b at swi_net+0x12b
#11 0xffffffff80bcae5d at ithread_loop+0x24d
#12 0xffffffff80bc7c5e at fork_exit+0x7e
#13 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 5h36m39s
Dumping 7281 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108b187 in trap_fatal (frame=0xfffffe0115bc4600, eva=8)
at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108b1df in trap_pfault (frame=frame@entry=0xfffffe0115bc4600,
usermode=false, signo=<optimized out>, signo@entry=0x0,
ucode=<optimized out>, ucode@entry=0x0)
at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108a83d in trap (frame=0xfffffe0115bc4600)
at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 sbcut_internal (sb=0xfffff8043bc00610, len=57, len@entry=304)
at /usr/src/sys/kern/uipc_sockbuf.c:1491
#10 0xffffffff80ca5b8a in sbcut_locked (sb=0xfffff8043bc00610,
len=-1796951296, len@entry=304) at /usr/src/sys/kern/uipc_sockbuf.c:1591
#11 0xffffffff80dbf0ae in tcp_do_segment (m=0xfffff8024b9a6900,
th=<optimized out>, so=<optimized out>, tp=<optimized out>,
drop_hdrlen=52, tlen=<optimized out>, iptos=0 '\000')
at /usr/src/sys/netinet/tcp_input.c:2918
#12 0xffffffff80dbd21e in tcp_input (mp=<optimized out>,
offp=<optimized out>, proto=<optimized out>)
at /usr/src/sys/netinet/tcp_input.c:1382
#13 0xffffffff80dafc15 in ip_input (m=0x0)
at /usr/src/sys/netinet/ip_input.c:829
#14 0xffffffff80d3fa7b in netisr_process_workstream_proto (
nwsp=<optimized out>, proto=1) at /usr/src/sys/net/netisr.c:919
#15 swi_net (arg=<optimized out>) at /usr/src/sys/net/netisr.c:966
#16 0xffffffff80bcae5d in intr_event_execute_handlers (p=<optimized out>,
ie=0xfffff80003b88c00) at /usr/src/sys/kern/kern_intr.c:1168
#17 ithread_execute_handlers (p=<optimized out>, ie=0xfffff80003b88c00)
at /usr/src/sys/kern/kern_intr.c:1181
#18 ithread_loop (arg=arg@entry=0xfffff80003b95d20)
at /usr/src/sys/kern/kern_intr.c:1269
#19 0xffffffff80bc7c5e in fork_exit (
callout=0xffffffff80bcac10 <ithread_loop>, arg=0xfffff80003b95d20,
frame=0xfffffe0115bc4b00) at /usr/src/sys/kern/kern_fork.c:1069


--- #2


Unread portion of the kernel message buffer:
panic: sbappendaddr_locked
cpuid = 2
time = 1620181490
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff80ca51e0 at sbappendaddr_locked_internal+0
#4 0xffffffff82c4efd0 at divert_packet+0x1a0
#5 0xffffffff82c2bc81 at ipfw_check_packet+0x2c1
#6 0xffffffff80d41f87 at pfil_run_hooks+0x97
#7 0xffffffff80dafeb5 at ip_input+0x3c5
#8 0xffffffff80d3f2da at netisr_dispatch_src+0xca
#9 0xffffffff80d23a68 at ether_demux+0x148
#10 0xffffffff80d24dec at ether_nh_input+0x34c
#11 0xffffffff80d3f2da at netisr_dispatch_src+0xca
#12 0xffffffff80d23eb9 at ether_input+0x69
#13 0xffffffff80d2074a at if_input+0xa
#14 0xffffffff8060a98e at bge_rxeof+0x49e
#15 0xffffffff80607f27 at bge_intr_task+0x1a7
#16 0xffffffff80c6afe1 at taskqueue_run_locked+0x181
#17 0xffffffff80c6c2fc at taskqueue_thread_loop+0xac
Uptime: 2h21m11s
Dumping 8148 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff80ca51e0 in sbappendaddr_locked (sb=0xfffff8002829a8a8,
asa=0xfffffe0115ebc5a0, m0=0xfffff804a977b700, control=0x0)
at /usr/src/sys/kern/uipc_sockbuf.c:1198
#6 0xffffffff82c4efd0 in divert_packet (m=0xfffff804a977b700,
incoming=<optimized out>) at /usr/src/sys/netinet/ip_divert.c:285
#7 0xffffffff82c2bc81 in ipfw_divert (m0=0xfffffe0115ebc760,
args=0xfffffe0115ebc610, tee=<optimized out>)
at /usr/src/sys/netpfil/ipfw/ip_fw_pfil.c:525
#8 ipfw_check_packet (m0=0xfffffe0115ebc760, ifp=0xfffff8000506f000,
flags=65536, ruleset=<optimized out>, inp=0x0)
at /usr/src/sys/netpfil/ipfw/ip_fw_pfil.c:283
#9 0xffffffff80d41f87 in pfil_run_hooks (head=<optimized out>, p=...,
ifp=ifp@entry=0xfffff8000506f000, flags=flags@entry=65536,
inp=inp@entry=0x0) at /usr/src/sys/net/pfil.c:187
#10 0xffffffff80dafeb5 in ip_input (m=0x0)
at /usr/src/sys/netinet/ip_input.c:610
#11 0xffffffff80d3f2da in netisr_dispatch_src (proto=1,
source=<optimized out>, source@entry=0, m=<unavailable>)
at /usr/src/sys/net/netisr.c:1143
#12 0xffffffff80d3f5cf in netisr_dispatch (proto=<unavailable>,
m=<unavailable>) at /usr/src/sys/net/netisr.c:1234
#13 0xffffffff80d23a68 in ether_demux (ifp=ifp@entry=0xfffff8000506f000,
m=<unavailable>) at /usr/src/sys/net/if_ethersubr.c:923
#14 0xffffffff80d24dec in ether_input_internal (ifp=0xfffff8000506f000,
m=<unavailable>) at /usr/src/sys/net/if_ethersubr.c:709
#15 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:739
#16 0xffffffff80d3f2da in netisr_dispatch_src (proto=proto@entry=5,
source=<optimized out>, source@entry=0, m=<unavailable>,
m@entry=0xfffff804a977b700) at /usr/src/sys/net/netisr.c:1143
#17 0xffffffff80d3f5cf in netisr_dispatch (proto=<unavailable>,
proto@entry=5, m=<unavailable>, m@entry=0xfffff804a977b700)
at /usr/src/sys/net/netisr.c:1234
#18 0xffffffff80d23eb9 in ether_input (ifp=<optimized out>,
ifp@entry=<error reading variable: value is not available>,
m=0xfffff804a977b700,
m@entry=<error reading variable: value is not available>)
at /usr/src/sys/net/if_ethersubr.c:830
#19 0xffffffff80d2074a in if_input (ifp=<unavailable>,
ifp@entry=0xfffff8000506f000, sendmp=<unavailable>,
sendmp@entry=0xfffff804a977b700) at /usr/src/sys/net/if.c:4391
#20 0xffffffff8060a98e in bge_rxeof (sc=sc@entry=0xfffffe0115cd4000,
rx_prod=rx_prod@entry=448, holdlck=holdlck@entry=0)
at /usr/src/sys/dev/bge/if_bge.c:4412
#21 0xffffffff80607f27 in bge_intr_task (arg=0xfffffe0115cd4000,
pending=<optimized out>) at /usr/src/sys/dev/bge/if_bge.c:4642
#22 0xffffffff80c6afe1 in taskqueue_run_locked (
queue=queue@entry=0xfffff80005051d00)
at /usr/src/sys/kern/subr_taskqueue.c:476
#23 0xffffffff80c6c2fc in taskqueue_thread_loop (arg=<optimized out>,
arg@entry=0xfffffe0115cdb568) at /usr/src/sys/kern/subr_taskqueue.c:793
#24 0xffffffff80bc7c5e in fork_exit (
callout=0xffffffff80c6c250 <taskqueue_thread_loop>,
arg=0xfffffe0115cdb568, frame=0xfffffe0115ebcb00)
at /usr/src/sys/kern/kern_fork.c:1069
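
Panic #2 above is an explicit panic("sbappendaddr_locked") rather than a page fault. The sketch below assumes, without having checked line 1198 of the 13.0 sources, that this panic guards against a record whose first mbuf lacks the packet-header flag. struct pkt_sketch, M_PKTHDR_SKETCH and record_head_ok are hypothetical names, and the flag value is only illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define M_PKTHDR_SKETCH 0x2 /* stand-in for the packet-header flag bit */

/* Simplified buffer header carrying only the flags word. */
struct pkt_sketch {
    int m_flags;
};

/*
 * Sanity check assumed to sit in front of the append: a record is
 * acceptable if there is no data buffer at all, or if its first
 * buffer is marked as carrying a packet header.
 */
static bool
record_head_ok(const struct pkt_sketch *m0)
{
    return (m0 == NULL || (m0->m_flags & M_PKTHDR_SKETCH) != 0);
}
```

If that assumption holds, a packet arriving from divert_packet() with the flag cleared would suggest the mbuf's header had been overwritten or reused, which fits the memory-corruption theory.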
Michael Schmiedgen
2021-05-05 16:35:32 UTC
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
Below is some information from an INVARIANTS kernel. Please let me know if I can provide
further information. Thank you!


--- kgdb backtrace


(kgdb) backtrace
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80bf580b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80bf5c50 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80bf59b3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff80f1ae71 in uma_dbg_free (zone=0xfffffe006e3e3c00, slab=0xfffff8053b159fd8, item=0xfffff8053b159300) at /usr/src/sys/vm/uma_core.c:5437
#6 0xffffffff80f13a64 in item_dtor (zone=0xfffffe006e3e3c00, item=0xfffff8053b159300, size=256, udata=0x0, skip=SKIP_NONE) at
/usr/src/sys/vm/uma_core.c:3220
#7 uma_zfree_arg (zone=0xfffffe006e3e3c00, item=item@entry=0xfffff8053b159300, udata=udata@entry=0x0) at /usr/src/sys/vm/uma_core.c:4165
#8 0xffffffff80bcefcf in mb_free_ext (m=m@entry=0xfffff8053b159300) at /usr/src/sys/kern/kern_mbuf.c:1200
#9 0xffffffff80bcda68 in m_free (m=m@entry=0xfffff8053b159300) at /usr/src/sys/sys/mbuf.h:1441
#10 0xffffffff80bceda8 in m_freem (mb=mb@entry=0xfffff8053b159300) at /usr/src/sys/kern/kern_mbuf.c:1525
#11 0xffffffff82c4d79a in div_output (so=<optimized out>, m=0xfffff8053b159300, sin=<optimized out>, control=<optimized out>) at
/usr/src/sys/netinet/ip_divert.c:396
#12 div_send (so=<optimized out>, so@entry=<error reading variable: value is not available>,
flags=<optimized out>, flags@entry=<error reading variable: value is not available>,
m=0xfffff8053b159300, m@entry=<error reading variable: value is not available>,
nam=<optimized out>, nam@entry=<error reading variable: value is not available>,
control=<optimized out>, control@entry=<error reading variable: value is not available>,
td=<optimized out>, td@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet/ip_divert.c:659
#13 0xffffffff80c92f97 in sosend_generic (so=0xfffff800468d5760, so@entry=<error reading variable: value is not available>,
addr=0xfffff800120c72e0, addr@entry=<error reading variable: value is not available>,
uio=<optimized out>, uio@entry=<error reading variable: value is not available>,
top=0xfffff8053b159300, top@entry=<error reading variable: value is not available>,
control=<optimized out>, control@entry=<error reading variable: value is not available>,
flags=0, flags@entry=<error reading variable: value is not available>,
td=0xfffffe019cdc2300, td@entry=<error reading variable: value is not available>)
at /usr/src/sys/kern/uipc_socket.c:1755
#14 0xffffffff80c93286 in sosend (so=<unavailable>, so@entry=0xfffff800468d5760, addr=<unavailable>, uio=<unavailable>, uio@entry=0xfffffe0199b338a8,
top=<unavailable>, top@entry=0x0, control=control@entry=0x0, flags=<unavailable>, flags@entry=0, td=0xfffffe019cdc2300) at
/usr/src/sys/kern/uipc_socket.c:1810
#15 0xffffffff80c99ffc in kern_sendit (td=<optimized out>, td@entry=0xfffffe019cdc2300, s=3, mp=<optimized out>, mp@entry=0xfffffe0199b33980, flags=0,
control=0x0, segflg=segflg@entry=UIO_USERSPACE) at /usr/src/sys/kern/uipc_syscalls.c:798
#16 0xffffffff80c9a39b in sendit (td=0xfffffe019cdc2300, td@entry=<unavailable>, s=<unavailable>, mp=mp@entry=0xfffffe0199b33980, flags=<unavailable>)
at /usr/src/sys/kern/uipc_syscalls.c:723
#17 0xffffffff80c9a1ad in sys_sendto (td=<unavailable>, td@entry=<error reading variable: value is not available>,
uap=<unavailable>, uap@entry=<error reading variable: value is not available>) at /usr/src/sys/kern/uipc_syscalls.c:841
#18 0xffffffff8108824e in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#19 amd64_syscall (td=0xfffffe019cdc2300, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1156
#20 <signal handler called>


--- core.txt


panic: Duplicate free of 0xfffff8053b159300 from zone 0xfffffe006e3e3c00(mbuf_packet) slab 0xfffff8053b159fd8(3)

Unread portion of the kernel message buffer:
<110>ipfw: 4500 Deny UDP 192.168.10.100:137 192.168.10.255:137 out via bge0
panic: Duplicate free of 0xfffff8053b159300 from zone 0xfffffe006e3e3c00(mbuf_packet) slab 0xfffff8053b159fd8(3)
cpuid = 6
time = 1620231385
KDB: stack backtrace:
#0 0xffffffff80c400e5 at kdb_backtrace+0x65
#1 0xffffffff80bf5be1 at vpanic+0x181
#2 0xffffffff80bf59b3 at panic+0x43
#3 0xffffffff80f1ae71 at uma_dbg_free+0x1e1
#4 0xffffffff80f13a64 at uma_zfree_arg+0x144
#5 0xffffffff80bcefcf at mb_free_ext+0x11f
#6 0xffffffff80bcda68 at m_free+0xd8
#7 0xffffffff80bceda8 at m_freem+0x28
#8 0xffffffff82c4d79a at div_send+0x43a
#9 0xffffffff80c92f97 at sosend_generic+0x5f7
#10 0xffffffff80c93286 at sosend+0x66
#11 0xffffffff80c99ffc at kern_sendit+0x1ec
#12 0xffffffff80c9a39b at sendit+0x1db
#13 0xffffffff80c9a1ad at sys_sendto+0x4d
#14 0xffffffff8108824e at amd64_syscall+0x12e
#15 0xffffffff8105bf4e at fast_syscall_common+0xf8
Uptime: 5m17s
Dumping 2609 out of 65454 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80bf580b in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80bf5c50 in vpanic (fmt=<optimized out>, ap=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80bf59b3 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff80f1ae71 in uma_dbg_free (zone=0xfffffe006e3e3c00,
slab=0xfffff8053b159fd8, item=0xfffff8053b159300)
at /usr/src/sys/vm/uma_core.c:5437
#6 0xffffffff80f13a64 in item_dtor (zone=0xfffffe006e3e3c00,
item=0xfffff8053b159300, size=256, udata=0x0, skip=SKIP_NONE)
at /usr/src/sys/vm/uma_core.c:3220
#7 uma_zfree_arg (zone=0xfffffe006e3e3c00,
item=item@entry=0xfffff8053b159300, udata=udata@entry=0x0)
at /usr/src/sys/vm/uma_core.c:4165
#8 0xffffffff80bcefcf in mb_free_ext (m=m@entry=0xfffff8053b159300)
at /usr/src/sys/kern/kern_mbuf.c:1200
#9 0xffffffff80bcda68 in m_free (m=m@entry=0xfffff8053b159300)
at /usr/src/sys/sys/mbuf.h:1441
#10 0xffffffff80bceda8 in m_freem (mb=mb@entry=0xfffff8053b159300)
at /usr/src/sys/kern/kern_mbuf.c:1525
#11 0xffffffff82c4d79a in div_output (so=<optimized out>,
m=0xfffff8053b159300, sin=<optimized out>, control=<optimized out>)
at /usr/src/sys/netinet/ip_divert.c:396
#12 div_send (so=<optimized out>,
so@entry=<error reading variable: value is not available>,
flags=<optimized out>,
flags@entry=<error reading variable: value is not available>,
m=0xfffff8053b159300,
m@entry=<error reading variable: value is not available>,
nam=<optimized out>,
nam@entry=<error reading variable: value is not available>,
control=<optimized out>,
control@entry=<error reading variable: value is not available>,
td=<optimized out>,
td@entry=<error reading variable: value is not available>)
at /usr/src/sys/netinet/ip_divert.c:659
#13 0xffffffff80c92f97 in sosend_generic (so=0xfffff800468d5760,
so@entry=<error reading variable: value is not available>,
addr=0xfffff800120c72e0,
addr@entry=<error reading variable: value is not available>,
uio=<optimized out>,
uio@entry=<error reading variable: value is not available>,
top=0xfffff8053b159300,
top@entry=<error reading variable: value is not available>,
control=<optimized out>,
control@entry=<error reading variable: value is not available>, flags=0,
flags@entry=<error reading variable: value is not available>,
td=0xfffffe019cdc2300,
td@entry=<error reading variable: value is not available>)
at /usr/src/sys/kern/uipc_socket.c:1755
#14 0xffffffff80c93286 in sosend (so=<unavailable>,
so@entry=0xfffff800468d5760, addr=<unavailable>, uio=<unavailable>,
uio@entry=0xfffffe0199b338a8, top=<unavailable>, top@entry=0x0,
control=control@entry=0x0, flags=<unavailable>, flags@entry=0,
td=0xfffffe019cdc2300) at /usr/src/sys/kern/uipc_socket.c:1810
#15 0xffffffff80c99ffc in kern_sendit (td=<optimized out>,
td@entry=0xfffffe019cdc2300, s=3, mp=<optimized out>,
mp@entry=0xfffffe0199b33980, flags=0, control=0x0,
segflg=segflg@entry=UIO_USERSPACE)
at /usr/src/sys/kern/uipc_syscalls.c:798
#16 0xffffffff80c9a39b in sendit (td=0xfffffe019cdc2300,
td@entry=<unavailable>, s=<unavailable>, mp=mp@entry=0xfffffe0199b33980,
flags=<unavailable>) at /usr/src/sys/kern/uipc_syscalls.c:723
#17 0xffffffff80c9a1ad in sys_sendto (td=<unavailable>,
td@entry=<error reading variable: value is not available>,
uap=<unavailable>,
uap@entry=<error reading variable: value is not available>)
at /usr/src/sys/kern/uipc_syscalls.c:841
#18 0xffffffff8108824e in syscallenter (td=<optimized out>)
at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#19 amd64_syscall (td=0xfffffe019cdc2300, traced=0)
at /usr/src/sys/amd64/amd64/trap.c:1156
#20 <signal handler called>
Michael Schmiedgen
2021-05-05 16:51:30 UTC
Post by Michael Schmiedgen
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
Below some info with an INVARIANTS kernel. Please let me know if I can provide
further information. Thank you!
--- kgdb backtrace
(kgdb) backtrace
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
[...]


FYI, I just triggered another panic with INVARIANTS and the backtrace looks like the one before.

Michael
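
For reference, the INVARIANTS build suggested above can be sketched roughly as follows. Rather than editing GENERIC in place as described in the thread, this variant writes a separate config that includes GENERIC. The config name GENERIC-INVARIANTS and the CONF_DIR variable are assumptions made for illustration and do not come from the thread.

```shell
#!/bin/sh
# Sketch: write a kernel config that layers INVARIANTS on top of GENERIC.
# KERNCONF and CONF_DIR are illustrative names, not taken from the thread.
KERNCONF="${KERNCONF:-GENERIC-INVARIANTS}"
CONF_DIR="${CONF_DIR:-.}"   # on a real system: /usr/src/sys/amd64/conf

# "include GENERIC" pulls in the stock config; the two options are the
# ones named in the thread.
cat > "${CONF_DIR}/${KERNCONF}" <<EOF
include GENERIC
ident   ${KERNCONF}
options INVARIANT_SUPPORT
options INVARIANTS
EOF

# The build itself is only printed here, since it must be run from the
# top of the source tree (typically /usr/src) as root.
echo "Wrote ${CONF_DIR}/${KERNCONF}; next, from the source tree:"
echo "  make buildkernel installkernel KERNCONF=${KERNCONF}"
```

On a real system one would presumably set CONF_DIR=/usr/src/sys/amd64/conf before running this and then run the printed make command from /usr/src. Keeping the debug options in a separate file leaves GENERIC untouched, which makes it easy to switch back later.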
Mark Johnston
2021-05-05 18:38:23 UTC
Post by Michael Schmiedgen
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
Below some info with an INVARIANTS kernel. Please let me know if I can provide
further information. Thank you!
Thanks, this helped a lot. I believe https://reviews.freebsd.org/D30129
will fix the problem. That patch is against the main branch but applies
cleanly to 13.0.
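
Applying a review patch such as D30129 to a local source tree might look like the sketch below. It assumes the diff has already been saved to a local file (the path in the usage comment is hypothetical) and that it uses git-style a/ and b/ path prefixes, hence -p1. Neither detail is stated in the thread, and the apply_checked helper is made up for this example.

```shell
#!/bin/sh
# Sketch: verify that a saved diff applies cleanly before modifying the tree.
# apply_checked is a hypothetical helper. patch_file must be an absolute
# path, because the function changes into src_dir before reading it.
apply_checked() {
    src_dir=$1
    patch_file=$2
    strip=${3:-1}   # -p1 assumes a/ and b/ prefixes in the diff

    # Dry run first: nothing is written if any hunk would fail.
    (cd "$src_dir" && patch --dry-run -p"$strip" < "$patch_file") || return 1

    # The dry run succeeded, so apply for real.
    (cd "$src_dir" && patch -p"$strip" < "$patch_file")
}

# Example usage (paths are assumptions):
#   apply_checked /usr/src /tmp/D30129.diff
```

After a successful apply, the kernel would be rebuilt and reinstalled in the usual way before rebooting into it.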
Michael Schmiedgen
2021-05-06 16:00:05 UTC
Post by Mark Johnston
Post by Michael Schmiedgen
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
Below some info with an INVARIANTS kernel. Please let me know if I can provide
further information. Thank you!
Thanks, this helped a lot. I believe https://reviews.freebsd.org/D30129
will fix the problem. That patch is against the main branch but applies
cleanly to 13.0.
I applied the patch and the server is running fine now for 8 hours with the
INVARIANTS kernel, including the Samba jail and SIP VM. I just compiled my
custom kernel and it looks like it is working too. Are there plans to get
this MFCed or even as Errata?

BTW, we got 2 other systems, also with userland NAT but different workload.
After an uncertain amount of time, mostly weeks, the natd starts to spin 100%
CPU on these systems. Quick noobish workaround was restarting natd every night.
I saw your recent commits that applied some more safety in that area, do you
plan to MFC these as well? I can imagine that could help with my NAT problems.

Anyway, many thanks for your investigation and your fix, much appreciated!

Michael
Mark Johnston
2021-05-06 16:02:56 UTC
Post by Michael Schmiedgen
Post by Mark Johnston
Post by Michael Schmiedgen
Post by Mark Johnston
This looks like fairly random kernel memory corruption. Are you able to
build an INVARIANTS kernel and test that? Assuming you're using 13.0,
you'd grab the 13.0 sources, add "options INVARIANT_SUPPORT" and
"options INVARIANTS" to the GENERIC kernel configuration in
sys/amd64/conf, and do a "make buildkernel installkernel".
Below some info with an INVARIANTS kernel. Please let me know if I can provide
further information. Thank you!
Thanks, this helped a lot. I believe https://reviews.freebsd.org/D30129
will fix the problem. That patch is against the main branch but applies
cleanly to 13.0.
I applied the patch and the server is running fine now for 8 hours with the
INVARIANTS kernel, including the Samba jail and SIP VM. I just compiled my
custom kernel and it looks like it is working too. Are there plans to get
this MFCed or even as Errata?
Great, thanks. Yes I think we will do an EN for this.
Post by Michael Schmiedgen
BTW, we got 2 other systems, also with userland NAT but different workload.
After an uncertain amount of time, mostly weeks, the natd starts to spin 100%
CPU on these systems. Quick noobish workaround was restarting natd every night.
I saw your recent commits that applied some more safety in that area, do you
plan to MFC these as well? I can imagine that could help with my NAT problems.
I am skeptical that anything I did recently would fix this. Did you try
attaching a debugger to natd to see where it's getting stuck? Is it
also a regression from upgrading to 13.0?
Post by Michael Schmiedgen
Anyway, many thanks for your investigation and your fix, much appreciated!
Michael
Michael Schmiedgen
2021-05-06 16:22:12 UTC
Post by Mark Johnston
Post by Michael Schmiedgen
BTW, we got 2 other systems, also with userland NAT but different workload.
After an uncertain amount of time, mostly weeks, the natd starts to spin 100%
CPU on these systems. Quick noobish workaround was restarting natd every night.
I saw your recent commits that applied some more safety in that area, do you
plan to MFC these as well? I can imagine that could help with my NAT problems.
I am skeptical that anything I did recently would fix this. Did you try
attaching a debugger to natd to see where it's getting stuck? Is it
also a regression from upgrading to 13.0?
Unfortunately this was difficult to trigger, and when it did happen, networking
stopped working, and these are remote machines. This was/is on 12; I plan to
upgrade these in the near future and keep an eye on it.
