Discussion:
reading routing table
Debarshi Ray
2008-09-01 12:07:07 UTC
Permalink
I am implementing a library/utility which basically encompasses the
features of the traditional route utilities and those of newer tools
(like ip from iproute2), which are mostly specific to a particular
kernel. The overriding objective is to make the library/utility work
uniformly across all different kernels, so that programs like
NetworkManager have a portable library/utility to use instead of the
Linux-kernel specific ip which is now being used.

I was going through the FreeBSD and NetBSD documentation and the
FreeBSD sources of netstat and route. I was surprised to see that while
NetBSD's route implementation has a 'show' command, FreeBSD does not
offer any such thing. Moreover, it seems that one cannot read the
entire routing table using PF_ROUTE sockets, and RTM_GET returns
information pertaining to only one destination. This surprised me
because one can do such a thing with the Linux kernel's RTNETLINK.

Is there a reason why this is so? Or is reading from /dev/kmem the
only way to get a dump of the routing tables?

Thanks,
Debarshi
Bruce M. Simpson
2008-09-01 12:54:10 UTC
Permalink
Post by Debarshi Ray
I am implementing a library/utility which basically encompasses the
features of the traditional route utilities and those of newer tools
(like ip from iproute2), which are mostly specific to a particular
kernel. The overriding objective is to make the library/utility work
uniformly across all different kernels, so that programs like
NetworkManager have a portable library/utility to use instead of the
Linux-kernel specific ip which is now being used.
Why don't you just use XORP's FEA code?
It already does all this under a BSD-type license.

cheers
BMS
Debarshi Ray
2008-09-01 13:23:49 UTC
Permalink
Post by Bruce M. Simpson
Why don't you just use XORP's FEA code?
It already does all this under a BSD-type license.
I was not aware of it. What does it do? Is it portable across other
OSes or is it *BSD specific?

Thanks,
Debarshi
Bruce M. Simpson
2008-09-01 13:34:26 UTC
Permalink
Post by Debarshi Ray
Post by Bruce M. Simpson
Why don't you just use XORP's FEA code?
It already does all this under a BSD-type license.
I was not aware of it. What does it do? Is it portable across other
OSes or is it *BSD specific?
XORP's FEA process is responsible for talking to the underlying
forwarding plane. It supports *BSD, Linux, MacOS X, and Microsoft Windows.

Over the last year there was a refactoring in which the forwarding table
management was split into plugin-like modules. It is written in C++,
although this split will likely make integration into other
projects easier.

Normally all that support goes into a single process, rather than being
linked into many.

cheers
BMS
Debarshi Ray
2008-09-01 15:15:23 UTC
Permalink
Post by Bruce M. Simpson
Why don't you just use XORP's FEA code?
It already does all this under a BSD-type license.
Nice stuff. However, it looks like a full-blown routing platform. In
that case it would be easier to rewrite those portions using the
relevant set of APIs.

Happy hacking,
Debarshi
Bruce M. Simpson
2008-09-01 13:01:38 UTC
Permalink
Post by Debarshi Ray
...
I was going through the FreeBSD and NetBSD documentation and the
FreeBSD sources of netstat and route. I was surprised to see that while
NetBSD's route implementation has a 'show' command, FreeBSD does not
offer any such thing. Moreover, it seems that one cannot read the
entire routing table using PF_ROUTE sockets, and RTM_GET returns
information pertaining to only one destination. This surprised me
because one can do such a thing with the Linux kernel's RTNETLINK.
Is there a reason why this is so? Or is reading from /dev/kmem the
only way to get a dump of the routing tables?
You want 'netstat -rn' to dump them; this is a very common command which
should be present in a number of online resources on using and
administering FreeBSD, so I am somewhat surprised that you didn't find it.

P.S. Look in the sysctl tree if you need to snapshot the kernel IP
forwarding tables. You can use kmem, but it is generally frowned upon
unless you're working from core dumps -- kernels can be built without
kmem support, or kmem locked down, etc.
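
For illustration, this is roughly how a routing table snapshot is pulled
out of the sysctl tree on FreeBSD -- the NET_RT_DUMP request of the
PF_ROUTE branch, which is what netstat-style tools use. This is
FreeBSD-specific and will not build on other systems:

```c
/* Dump the kernel's IPv4 routing table via sysctl(3) on FreeBSD. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <net/if.h>
#include <net/route.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int mib[6] = { CTL_NET, PF_ROUTE, 0, AF_INET, NET_RT_DUMP, 0 };
    size_t needed;

    /* First call: ask how much buffer space the dump needs. */
    if (sysctl(mib, 6, NULL, &needed, NULL, 0) < 0) {
        perror("sysctl estimate");
        return 1;
    }
    char *buf = malloc(needed);
    if (buf == NULL || sysctl(mib, 6, buf, &needed, NULL, 0) < 0) {
        perror("sysctl dump");
        return 1;
    }

    /* The buffer is a sequence of variable-length rt_msghdr records,
     * each followed by the sockaddrs named in rtm_addrs. */
    for (char *next = buf; next < buf + needed; ) {
        struct rt_msghdr *rtm = (struct rt_msghdr *)next;
        printf("route message: type %d, len %d\n",
               rtm->rtm_type, rtm->rtm_msglen);
        next += rtm->rtm_msglen;
    }
    free(buf);
    return 0;
}
```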

cheers
BMS
Debarshi Ray
2008-09-01 13:19:02 UTC
Permalink
Post by Bruce M. Simpson
You want 'netstat -rn' to dump them; this is a very common command which
should be present in a number of online resources on using and administering
FreeBSD, so I am somewhat surprised that you didn't find it.
I know about netstat. I did mention having gone through its
implementation. :-) What I am asking about is related to the
implementation of the 'netstat -rn' functionality, which on some
non-FreeBSD systems is also provided by 'route [show] -n'.
Post by Bruce M. Simpson
P.S. Look in the sysctl tree if you need to snapshot the kernel IP
forwarding tables. You can use kmem, but it is generally frowned upon unless
you're working from core dumps -- kernels can be built without kmem support,
or kmem locked down, etc.
Ok, I will have a look. Thanks.

Happy hacking,
Debarshi
Julian Elischer
2008-09-02 07:00:58 UTC
Permalink
Post by Bruce M. Simpson
Post by Debarshi Ray
...
I was going through the FreeBSD and NetBSD documentation and the
FreeBSD sources of netstat and route. I was surprised to see that while
NetBSD's route implementation has a 'show' command, FreeBSD does not
offer any such thing. Moreover, it seems that one cannot read the
entire routing table using PF_ROUTE sockets, and RTM_GET returns
information pertaining to only one destination. This surprised me
because one can do such a thing with the Linux kernel's RTNETLINK.
Is there a reason why this is so? Or is reading from /dev/kmem the
only way to get a dump of the routing tables?
You want 'netstat -rn' to dump them; this is a very common command which
should be present in a number of online resources on using and
administering FreeBSD, so I am somewhat surprised that you didn't find it.
P.S. Look in the sysctl tree if you need to snapshot the kernel IP
forwarding tables. You can use kmem, but it is generally frowned upon
unless you're working from core dumps -- kernels can be built without
kmem support, or kmem locked down, etc.
unfortunately netstat -rn uses /dev/kmem

we've just never got around to implementing a better interface..
Post by Bruce M. Simpson
cheers
BMS
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-net
Debarshi Ray
2008-09-02 07:17:51 UTC
Permalink
Post by Julian Elischer
unfortunately netstat -rn uses /dev/kmem
Yes. I also found that FreeBSD's route(8) implementation does not have
an equivalent of 'netstat -r'. NetBSD and GNU/Linux implementations
have such an option. Any reason for this? Is it because you did not
want to muck with /dev/kmem in route(8) and wanted it to work with
PF_ROUTE only? I have not yet gone through NetBSD's route(8) code
though.

Happy hacking,
Debarshi
Robert Watson
2008-09-02 09:19:55 UTC
Permalink
Post by Julian Elischer
unfortunately netstat -rn uses /dev/kmem
Yes. I also found that FreeBSD's route(8) implementation does not have an
equivalent of 'netstat -r'. NetBSD and GNU/Linux implementations have such
an option. Any reason for this? Is it because you did not want to muck with
/dev/kmem in route(8) and wanted it to work with PF_ROUTE only? I have not
yet gone through NetBSD's route(8) code though.
Usually the "reason" for things like this is that no one has written the code
to do otherwise :-). PF_ROUTE is probably not a good mechanism for any bulk
data transfer due to the constraints of being a datagram socket, although
doing it via an iterated dump rather than a simple dump operation would
probably work. Sysctl is generally a better interface for monitoring for
various reasons, although it also has limitations. Maintaining historic kmem
support is important, since it is also the code used for interpreting core
dumps, and we don't want to lose support for that.

Robert N M Watson
Computer Laboratory
University of Cambridge
Luigi Rizzo
2008-09-02 10:51:24 UTC
Permalink
in the (short so far) thread which i am hijacking, the issue came up
of what is a good mechanism for reading the route table from the
kernel, since FreeBSD currently uses /dev/kmem and this is not always
available/easy to use with dynamically changing data structures.

The routing table is only one instance of potentially many similar
data structures that we might want to fetch - others are the various
firewall tables (the output of 'ipfw show'), possibly bridging
tables, socket lists and so on.

The issue is actually twofold.

The interface problem, or how to pull bits from the kernel, is so
easy as to be almost irrelevant -- getsockopt, sysctl, kmem, or some
special file descriptor does the job as long as the underlying chunk
of data does not change (or can be locked) during the syscall.

The real problem is that these data structures are dynamic and
potentially large, so the following approach (used e.g. in ipfw)

enter kernel;
get shared lock on the structure;
navigate through the structure and make a linearized copy;
unlock;
copyout the linearized copy;

is extremely expensive and has the potential to block other activities
for a long time.

Accessing /dev/kmem and following pointers there probably has the
risk that you cannot lock the kernel data structure while you navigate
it, so you are likely to follow stale pointers.

What we'd need is some internal representation of the data structure
that could give us individual entries of the data structure on each
call, together with extra info (a pointer if we can guarantee that
it doesn't get stale, something more if we cannot make the guarantee)
to allow the navigation to occur.

I believe this is a very old and common problem, so my question is:

do you know if any of the *BSD kernels implements some good mechanism
to access a dynamic kernel data structure (e.g. the routing tree/trie,
or even a list or hash table) without the flaws of the two approaches
i indicate above ?

cheers
luigi

[original thread below just for reference, but i believe i made a
fair summary above]
Post by Robert Watson
Post by Julian Elischer
unfortunately netstat -rn uses /dev/kmem
Yes. I also found that FreeBSD's route(8) implementation does not have an
equivalent of 'netstat -r'. NetBSD and GNU/Linux implementations have such
an option. Any reason for this? Is it because you did not want to muck
with /dev/kmem in route(8) and wanted it to work with PF_ROUTE only? I
have not yet gone through NetBSD's route(8) code though.
Usually the "reason" for things like this is that no one has written the
code to do otherwise :-). PF_ROUTE is probably not a good mechanism for
any bulk data transfer due to the constraints of being a datagram socket,
although doing it via an iterated dump rather than a simple dump operation
would probably work. Sysctl is generally a better interface for monitoring
for various reasons, although it also has limitations. Maintaining
historic kmem support is important, since it is also the code used for
interpreting core dumps, and we don't want to lose support for that.
Robert N M Watson
Computer Laboratory
University of Cambridge
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-net
Bruce M. Simpson
2008-09-02 14:55:53 UTC
Permalink
Post by Luigi Rizzo
do you know if any of the *BSD kernels implements some good mechanism
to access a dynamic kernel data structure (e.g. the routing tree/trie,
or even a list or hash table) without the flaws of the two approaches
i indicate above ?
Hahaha. I ran into an isomorphic problem with Net-SNMP at work last week.

There's a need to export the BGP routing table via SNMP. Of course
doing this in our framework at work requires some IPC calls which always
require a select() (or WaitForMultipleObjects()) based continuation.
Net-SNMP doesn't support continuations at the table iterator level,
so somehow, we need to implement an iterator which can accommodate our
blocking IPC mechanism.

[No, we don't use threads, and that would actually create more
problems than it solves -- running single-threaded with continuations
lets us run lock free, and we rely on the OS's IPC primitives to
serialize our code. works just fine for us so far...]

So we would end up caching the whole primary key range in the SNMP
sub-agent on a table OID access, a technique which would allow us to
defer the IPC calls provided we walk the entire range of the iterator
and cache the keys -- but even THAT is far too much data for the BGP
table, which is a trie with ~250,000 entries. I hate SNMP GETNEXT.

Back to the FreeBSD kernel, though.

If you look at in_mcast.c, particularly in p4 bms_netdev, this is
what happens for the per-socket multicast source filters -- there is the
linearization of an RB-tree for setsourcefilter().
This is fine for something with a limit of ~256 entries per socket
(why RB for something so small? this is for space vs time -- and also it
has to merge into a larger filter list in the IGMPv3 paths.)
And the lock granularity is per-socket. However, it won't do for
something as big as a BGP routing table.

C++ lends itself well to expressing these kinds of smart-pointer
idioms, though.
I'm thinking perhaps we need the notion of a sysctl iterator, which
allocates a token for walking a shared data structure, and is able to
guarantee that the token maps to a valid pointer for the same entry,
until its 'advance pointer' operation is called.

Question is, who's going to pull the trigger?

cheers
BMS

P.S. I'm REALLY getting fed up with the lack of openness and
transparency largely incumbent in doing work in p4.

Come one come all -- we shouldn't need accounts for folk to see and
contribute what's going on, and the stagnation is getting silly. FreeBSD
development should not be a committer or chum-of-committer in-crowd.
Robert Watson
2008-09-02 21:02:10 UTC
Permalink
Post by Luigi Rizzo
The real problem is that these data structures are dynamic and potentially
large, so the following approach (used e.g. in ipfw)
enter kernel;
get shared lock on the structure;
navigate through the structure and make a linearized copy;
unlock;
copyout the linearized copy;
is extremely expensive and has the potential to block other activities for a
long time.
Sockets, sysctl, kmem, etc, are all really just I/O mechanisms, with varying
levels of abstraction, for pushing data, and all fundamentally suffer from the
problem of a lack of general export abstraction.
Post by Luigi Rizzo
What we'd need is some internal representation of the data structure that
could give us individual entries of the data structure on each call,
together with extra info (a pointer if we can guarantee that it doesn't get
stale, something more if we cannot make the guarantee) to allow the
navigation to occur.
I think there's necessarily implementation-specific details to all of these
steps for any given kernel subsystem -- we have data structures,
synchronization models, etc, that are all tuned to their common use
requirements, and monitoring is very much an edge case. I don't think this is
bad: this is an OS kernel, after all, but it does make things a bit more
tricky. Even if we can't share code, sharing approaches across subsystems is
a good idea.

For an example of what you have in mind, take a look at the sysctl monitoring
for UNIX domain sockets. First, we allocate an array of pointers sized to the
number of unpcb's we have. Then we walk the list, bumping the references and
adding pointers to the array. Then we release the global locks, and proceed
lock, externalize, unlock, and copyout each individual entry, using a
generation number to manage staleness. Finally, we walk the list dropping the
refcounts and free the array. This avoids holding global locks for a long
time, as well as the stale data issue. It's not ideal in other ways -- long
lists, reference count complexity, etc, but as I mentioned, it is very much an
edge case, and much of that mechanism (especially refcounts) is something we
need anyway for any moderately complex kernel data structure.
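
The sequence described above can be sketched in user-space terms like
this (illustrative names, not the actual unpcb code; the generation
number stands in for the real staleness check):

```c
/* Sketch: snapshot referenced pointers under the global lock, then
 * copy entries one at a time, skipping ones whose generation moved. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct pcb {
    int data;
    unsigned gen;               /* bumped on every modification */
    int refcount;
    pthread_mutex_t lock;
};

#define NPCB 4
static struct pcb table[NPCB];
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Phase 1: under the global lock, only collect referenced pointers. */
static size_t collect(struct pcb **out, unsigned *gens)
{
    pthread_mutex_lock(&global_lock);
    size_t n = 0;
    for (size_t i = 0; i < NPCB; i++) {
        out[n] = &table[i];
        gens[n] = table[i].gen;     /* generation at snapshot time */
        table[i].refcount++;        /* keep the entry alive */
        n++;
    }
    pthread_mutex_unlock(&global_lock);
    return n;
}

/* Phase 2: per-entry lock, externalize, unlock; entries whose
 * generation changed since the snapshot are treated as stale.
 * The refcount drop of phase 3 is folded into the same loop. */
static size_t externalize(struct pcb **pcbs, unsigned *gens, size_t n,
                          int *out)
{
    size_t copied = 0;
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&pcbs[i]->lock);
        if (pcbs[i]->gen == gens[i])        /* still fresh? */
            out[copied++] = pcbs[i]->data;
        pcbs[i]->refcount--;
        pthread_mutex_unlock(&pcbs[i]->lock);
    }
    return copied;
}
```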

Robert N M Watson
Computer Laboratory
University of Cambridge
Luigi Rizzo
2008-09-02 21:31:18 UTC
Permalink
Post by Robert Watson
The real problem is that these data structures are dynamic and potentially
large, so the following approach (used e.g. in ipfw)
enter kernel;
get shared lock on the structure;
navigate through the structure and make a linearized copy;
unlock;
copyout the linearized copy;
is extremely expensive and has the potential to block other activities for
a long time.
Sockets, sysctl, kmem, etc, are all really just I/O mechanisms, with
varying levels of abstraction, for pushing data, and all fundamentally
suffer from the problem of a lack of general export abstraction.
yes, this is why i said we should not bother about which one is used.
Post by Robert Watson
What we'd need is some internal representation of the data structure that
could give us individual entries of the data structure on each call,
together with extra info (a pointer if we can guarantee that it doesn't
get stale, something more if we cannot make the guarantee) to allow the
navigation to occur.
I think there's necessarily implementation-specific details to all of these
steps for any given kernel subsystem -- we have data structures,
synchronization models, etc, that are all tuned to their common use
requirements, and monitoring is very much an edge case. I don't think this
is bad: this is an OS kernel, after all, but it does make things a bit more
tricky. Even if we can't share code, sharing approaches across subsystems
is a good idea.
For an example of what you have in mind, take a look at the sysctl
monitoring for UNIX domain sockets. First, we allocate an array of
pointers sized to the number of unpcb's we have. Then we walk the list,
bumping the references and adding pointers to the array. Then we release
the global locks, and proceed lock, externalize, unlock, and copyout each
individual entry, using a generation number to manage staleness. Finally,
we walk the list dropping the refcounts and free the array. This avoids
holding global locks for a long time, as well as the stale data issue.
It's not ideal in other ways -- long lists, reference count complexity, etc,
but as I mentioned, it is very much an edge case, and much of that
mechanism (especially refcounts) is something we need anyway for any
moderately complex kernel data structure.
what you describe above is more efficient but not that different
from what i described. The thing is, i always forget that in many
cases an iterator doesn't care about the order in which elements
are generated, so you could use a solution like the one below,
by just doing a tiny little bit of work while modifying the main
data structure
(this might well be a known solution, since it is so trivial...)

[i already emailed the following to BMS, so apologies for the duplicate]

if all you care about is iterating over the whole data structure, without a
particular order, you could manage an additional array of pointers
to all the objects in the data structure (the array should be
implemented as a sparse, resizable array but that's a separate
issue, and probably a relatively trivial one -- i am googling for
it...).

Navigation and iterators are simple:

+ When inserting a new element, append an entry to the array, and make
it point to the newly added record. Each entry gets a fresh sequence
number, and one should make sure that sequence numbers are not recycled
(64 bits should suffice?).

+ when deleting an element, logically remove the entry from the array

+ the iterator returns a copy of the object, and its sequence number;

+ getnext returns the existing element following the 'current' one
in the sparse array.

Complexity for most ops (data insertion, removal, lookup)
would be O(1) plus whatever is needed to do housekeeping on the
sparse array, and this depends on the usage of the main data structure
and how much we worry about expensive 'getnext' ops.
Sure you need a read lock on the main struct while you lookup
the next element on the sparse array, but the nice thing is that
you can release the lock at each step even if you have a poorly
implemented sparse array.

Makes sense ?
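
A minimal sketch of the scheme (user-space, illustrative names; the
'value' field stands in for a pointer to the real object, and the
per-step locking is omitted):

```c
/* Append-only index: every insert gets a fresh 64-bit sequence
 * number, deletes only mark entries dead, and getnext() can resume
 * a walk from any point without holding a lock across iterations. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAXSLOTS 128

struct slot {
    uint64_t seq;       /* never recycled */
    bool live;
    int value;          /* stands in for a pointer to the real object */
};

static struct slot slots[MAXSLOTS];
static size_t nslots;
static uint64_t next_seq = 1;

static uint64_t idx_insert(int value)
{
    struct slot *s = &slots[nslots++];
    s->seq = next_seq++;
    s->live = true;
    s->value = value;
    return s->seq;
}

static void idx_delete(uint64_t seq)
{
    for (size_t i = 0; i < nslots; i++)
        if (slots[i].seq == seq)
            slots[i].live = false;      /* logical removal only */
}

/* Return the first live entry with a sequence number > 'after';
 * pass 0 to start a walk and the last seq seen to continue it. */
static bool idx_getnext(uint64_t after, uint64_t *seq, int *value)
{
    for (size_t i = 0; i < nslots; i++) {
        if (slots[i].live && slots[i].seq > after) {
            *seq = slots[i].seq;
            *value = slots[i].value;
            return true;
        }
    }
    return false;
}
```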

cheers
luigi
Julian Elischer
2008-09-02 22:10:24 UTC
Permalink
Post by Robert Watson
Post by Luigi Rizzo
The real problem is that these data structures are dynamic and
potentially large, so the following approach (used e.g. in ipfw)
enter kernel;
get shared lock on the structure;
navigate through the structure and make a linearized copy;
unlock;
copyout the linearized copy;
is extremely expensive and has the potential to block other activities
for a long time.
Sockets, sysctl, kmem, etc, are all really just I/O mechanisms, with
varying levels of abstraction, for pushing data, and all fundamentally
suffer from the problem of a lack of general export abstraction.
Post by Luigi Rizzo
What we'd need is some internal representation of the data structure
that could give us individual entries of the data structure on each
call, together with extra info (a pointer if we can guarantee that it
doesn't get stale, something more if we cannot make the guarantee) to
allow the navigation to occur.
In some code I have seen (and some I have written) there are always two
levels of storage in some modules: one that contains the majority
of the information and one that contains "changes that occurred since
the main container was locked".

so for example the routing tables might be locked, and if
a routing change is requested thereafter, it gets stored in a
transactional form on the side structure. A routing lookup
during the period that the structure is locked (if a read lock) simply
goes ahead, and at completion checks if there is a better
answer in the waiting list. A write request is stored as a
transaction request on the waiting list. I'm not saying it works for
everything, but if we had a kernel written in a high-enough-level
language, such methods could be broadly used.. oh well.

using reader-writer locking mitigates a lot of this..
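
A toy sketch of the two-level idea (user-space, illustrative names; a
'frozen' flag stands in for the main structure being locked for a dump):

```c
/* While the main table is frozen, writes queue on a side list of
 * pending transactions; lookups check the side list for a newer
 * answer; thaw() replays the queue back into the main table. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NKEYS 16

static int main_table[NKEYS];       /* key -> value; the big structure */
static bool main_frozen;            /* set while a dump walks the table */

struct pending { int key; int value; bool used; };
static struct pending side[NKEYS];  /* changes queued while frozen */

static void update(int key, int value)
{
    if (!main_frozen) {
        main_table[key] = value;
        return;
    }
    /* Main table locked: record the change as a transaction instead. */
    for (size_t i = 0; i < NKEYS; i++) {
        if (!side[i].used) {
            side[i] = (struct pending){ key, value, true };
            return;
        }
    }
}

static int lookup(int key)
{
    int answer = main_table[key];
    /* A later entry on the side list is a "better answer". */
    for (size_t i = 0; i < NKEYS; i++)
        if (side[i].used && side[i].key == key)
            answer = side[i].value;
    return answer;
}

static void thaw(void)
{
    /* Dump finished: replay the queued transactions into the table. */
    for (size_t i = 0; i < NKEYS; i++) {
        if (side[i].used) {
            main_table[side[i].key] = side[i].value;
            side[i].used = false;
        }
    }
    main_frozen = false;
}
```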
Post by Robert Watson
I think there's necessarily implementation-specific details to all of
these steps for any given kernel subsystem -- we have data structures,
synchronization models, etc, that are all tuned to their common use
requirements, and monitoring is very much an edge case. I don't think
this is bad: this is an OS kernel, after all, but it does make things a
bit more tricky. Even if we can't share code, sharing approaches across
subsystems is a good idea.
For an example of what you have in mind, take a look at the sysctl
monitoring for UNIX domain sockets. First, we allocate an array of
pointers sized to the number of unpcb's we have. Then we walk the list,
bumping the references and adding pointers to the array. Then we
release the global locks, and proceed lock, externalize, unlock, and
copyout each individual entry, using a generation number fo manage
staleness. Finally, we walk the list dropping the refcounts and free
the array. This voids holding global locks for a long time, as well as
the stale data issue. It's unideal in other ways -- long lists,
reference count complexity, etc, but as I mentioned, it is very much an
edge case, and much of that mechanism (especially refcounts) is
something we need anyway for any moderately complex kernel data structure.
refcounts are, unfortunately, a really bad thing for multiprocessors.
refcounts, if they are actually incremented now and then, are usually
out of cache for any given CPU, forcing lots of cache flushes and real
reads from memory. There are some elaborate MP-refcount schemes we
really should look at, but most require a lot of memory.
Debarshi Ray
2008-09-18 08:01:45 UTC
Permalink
So I got something working for FreeBSD now:
http://rishi.fedorapeople.org/gnu/inetutils-1.5.tar.gz

I have been using a combination of sysctl and PF_ROUTE to retrieve the
routing table, much like the approach taken by the NetBSD
implementation. Support for modifying the routing table is yet to be
implemented.

Thanks for all your comments.

By the way, would you want someone to implement 'show' support for
FreeBSD's route implementation? I can give it a go now. :-)

Happy hacking,
Debarshi
Bruce M. Simpson
2008-09-18 22:00:06 UTC
Permalink
Post by Debarshi Ray
...
By the way, would you want someone to implement 'show' support for
FreeBSD's route implementation? I can give it a go now. :-)
For sure, we'd be very happy to see a patch like that.

Many thanks
BMS
Julian Elischer
2008-09-18 23:57:32 UTC
Permalink
Post by Bruce M. Simpson
Post by Debarshi Ray
...
By the way, would you want someone to implement 'show' support for
FreeBSD's route implementation? I can give it a go now. :-)
For sure, we'd be very happy to see a patch like that.
Many thanks
BMS
and don't forget the same patch for netstat so that it doesn't need
/dev/kmem for netstat -r :-)

BUT, don't forget about multiple routing tables..
