Hi Laurent,
Post by Jude Nelson
I haven't tried this myself, but it should be doable. Vdev's
event-propagation mechanism is a small program that constructs a
uevent string from environment variables passed to it by vdev and
writes the string to the appropriate place. The vdev daemon isn't
aware of its existence; it simply executes it like it would any
other matching device-event action. Another device manager could
supply the same program with the right environment variables and use
it for the same purposes.
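
(For illustration, a minimal sketch of what such a propagation helper
could look like. Only the standard kernel uevent keys are shown; the
exact variables vdev exports, and where the string actually gets
written, are not shown here.)

/* sketch: rebuild a uevent-style string from the environment.
 * stdout stands in for "the appropriate place". */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    static const char *keys[] = {
        "ACTION", "DEVPATH", "SUBSYSTEM", "MAJOR", "MINOR", "DEVNAME", 0
    };
    for (int i = 0; keys[i]; i++) {
        const char *val = getenv(keys[i]);
        if (val)
            printf("%s=%s\n", keys[i], val);
    }
    return 0;
}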
Indeed. My question then becomes: what are the differences between
the string passed by the kernel (which is more or less a list of
environment variables, too) and the string constructed by vdev ?
In other words, is vdev itself more than a trivial netlink listener,
and if yes, what does it do ? (I'll just take a pointer to the
documentation if that question is answered somewhere.)
For now I'll take a wild guess and say that vdev analyzes the
MODALIAS or something, according to a conf file, in order to know
the correct fan-out to perform and write the event to the correct
subsystems. Am I close ?
(I should really sit down and write documentation sometime :)
I think you're close. The gist of it is that vdev needs to supply a lot
more information than the kernel gives it. In particular, its helper
programs go on to query the properties and status of each device (this
often requires root privileges, i.e. via privileged ioctl()s), and vdev
gathers the information into a (much larger) event packet and stores it in
a directory tree under /dev for subsequent query by less-privileged
programs. It doesn't rely on the MODALIAS per se; instead it matches
fields of the kernel's uevent packet (one of which is the MODALIAS) to the
right helper programs to run.
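
(For contrast, a bare-bones netlink uevent listener -- roughly what the
kernel hands any device manager before enrichment. This is a generic
sketch, not vdev's actual listener code.)

/* sketch: print raw kernel uevents from the NETLINK_KOBJECT_UEVENT
 * multicast group */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void) {
    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;                 /* kernel uevent multicast group */

    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char buf[8192];
    for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);
        if (len <= 0) break;
        buf[len] = '\0';
        /* payload is "action@devpath\0KEY=VALUE\0KEY=VALUE\0..." */
        for (char *p = buf; p < buf + len; p += strlen(p) + 1)
            printf("%s\n", p);
        printf("--\n");
    }
    close(fd);
    return 0;
}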
Here's an example of what vdev gathers for my laptop's SATA disk:
$ cat /dev/metadata/dev/sda/properties
VDEV_ATA=1
VDEV_WWN=0x5000c500299a9a7a
VDEV_BUS=ata
VDEV_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_SERIAL_SHORT=5VJ7A0BM
VDEV_REVISION=0003LVM1
VDEV_TYPE=ata
VDEV_MAJOR=8
VDEV_MINOR=0
VDEV_OS_SUBSYSTEM=block
VDEV_OS_DEVTYPE=disk
VDEV_OS_DEVPATH=/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
VDEV_OS_DEVNAME=sda
VDEV_ATA=1
VDEV_ATA_TYPE=disk
VDEV_ATA_MODEL=ST9500420AS
VDEV_ATA_MODEL_ENC=ST9500420ASx20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20
VDEV_ATA_REVISION=0003LVM1
VDEV_ATA_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_ATA_SERIAL_SHORT=5VJ7A0BM
VDEV_ATA_WRITE_CACHE=1
VDEV_ATA_WRITE_CACHE_ENABLED=1
VDEV_ATA_FEATURE_SET_HPA=1
VDEV_ATA_FEATURE_SET_HPA_ENABLED=1
VDEV_ATA_FEATURE_SET_PM=1
VDEV_ATA_FEATURE_SET_PM_ENABLED=1
VDEV_ATA_FEATURE_SET_SECURITY=1
VDEV_ATA_FEATURE_SET_SECURITY_ENABLED=0
VDEV_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_FROZEN=1
VDEV_ATA_FEATURE_SET_SMART=1
VDEV_ATA_FEATURE_SET_SMART_ENABLED=1
VDEV_ATA_FEATURE_SET_APM=1
VDEV_ATA_FEATURE_SET_APM_ENABLED=1
VDEV_ATA_FEATURE_SET_APM_CURRENT_VALUE=128
VDEV_ATA_DOWNLOAD_MICROCODE=1
VDEV_ATA_SATA=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN2=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN1=1
VDEV_ATA_ROTATION_RATE_RPM=7200
VDEV_ATA_WWN=0x5000c500299a9a7a
VDEV_ATA_WWN_WITH_EXTENSION=0x5000c500299a9a7a
Anything that starts with "VDEV_ATA_", as well as "VDEV_BUS",
"VDEV_SERIAL_*", "VDEV_TYPE", and "VDEV_REVISION" had to be extracted via
an ioctl, by exploring files in sysfs, or by querying a hardware database.
The kernel only supplied a few of these fields.
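
(As a rough illustration of the cheap end of that extraction: some of
it is just sysfs reads, as below. The ATA-specific fields come from an
ATA IDENTIFY ioctl or a hardware database instead, which is where the
privileges come in. The paths here are assumptions for a disk named sda.)

/* sketch: fill in a couple of fields the kernel's uevent doesn't carry */
#include <stdio.h>
#include <string.h>

static int read_sysfs(const char *path, char *out, int outlen) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(out, outlen, f)) { fclose(f); return -1; }
    fclose(f);
    out[strcspn(out, "\n")] = '\0';     /* strip trailing newline */
    return 0;
}

int main(void) {
    char model[128], dev[32];
    if (read_sysfs("/sys/block/sda/device/model", model, sizeof(model)) == 0)
        printf("MODEL=%s\n", model);
    if (read_sysfs("/sys/block/sda/dev", dev, sizeof(dev)) == 0)
        printf("MAJOR:MINOR=%s\n", dev);
    return 0;
}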
Post by Jude Nelson
Tmpfs and devtmpfs are designed for holding ephemeral state already,
so I'm not sure why the fact that they expose data as regular files
is a concern?
Two different meanings of "ephemeral".
tmpfs and devtmpfs are supposed to retain their data until the
end of the system's lifetime. An event is much more ephemeral
than that: it's supposed to be consumed instantly - like the
event from the kernel is consumed instantly by the netlink listener.
Files, even in a tmpfs, remain alive in the absence of a live
process to hold them; but events have no meaning if no process needs
them, which is the reason for the "event leaking" problem.
Ideally, you need a file type with basically the same lifetime
as a process.
Holding event data in a file is perfectly valid as long as you have
a mechanism to reclaim the file as soon as the last reference to it
dies.
Funny you mention this--I also created runfs
(https://github.com/jcnelson/runfs) to do exactly this. In particular,
I use it for PID files. Also, eventfs was actually derived from runfs,
but specialized further for managing event queues.
Post by Jude Nelson
I couldn't think of a simpler way that was also as robust. Unless
I'm misunderstanding something, wrapping an arbitrary program to
clean up the files it created would, in the extreme, require coming
up with a way to do so on SIGKILL. I'd love to know if there is a
simple way to do this, though.
That's where supervisors come into play: the parent of a process
always knows when it dies, even on SIGKILL. Supervised daemons can
have a cleaner script in place.
For the general case, it shouldn't be hard to have a wrapper that
forks an arbitrary program and cleans up /dev/metadata/whatever/*$childpid*
when it dies. The price to pay is an additional process, but that
additional process would be very small.
You can still have a polling "catch-all cleaner" to collect dead events
in case the supervisor/wrapper also died, but since that occurrence will
be rare, the polling period can be pretty long so it's not a problem.
Agreed. I would be happy to keep this approach in mind in the design of
libudev-compat. Eventfs isn't a hard requirement and I don't want it to
be, since there's more than one way to deal with this problem.
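
(A minimal sketch of that wrapper, assuming a hypothetical per-PID
naming scheme under /dev/metadata/whatever -- not vdev's actual layout:)

/* run a program; when it dies (even by SIGKILL), sweep the event files
 * tagged with its pid */
#include <glob.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 2;

    pid_t child = fork();
    if (child < 0) return 1;
    if (child == 0) {
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(child, &status, 0);     /* the parent always learns of the death */

    char pattern[256];
    snprintf(pattern, sizeof(pattern),
             "/dev/metadata/whatever/*.%d.*", (int)child);   /* placeholder */
    glob_t g;
    if (glob(pattern, 0, NULL, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++)
            unlink(g.gl_pathv[i]);
        globfree(&g);
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128;
}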
Post by Jude Nelson
I went with a specialized filesystem for two reasons, both of which
were to fulfill libudev's API contract:
* Efficient, reliable event multicasting. By using hard-links as
described above, the event only needs to be written out once, and the
OS only needs to store one copy.
That's a good mechanism; you're already fulfilling that contract
with the non-eventfs implementation.
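
(The hard-link fan-out described above boils down to roughly this; the
directory layout and function name are hypothetical:)

/* write the event once, then give each subscriber a name for it */
#include <stdio.h>
#include <unistd.h>

int multicast_event(const char *tmppath, const char *const *subscriber_dirs,
                    int nsubs, const char *event_name) {
    for (int i = 0; i < nsubs; i++) {
        char dst[512];
        snprintf(dst, sizeof(dst), "%s/%s", subscriber_dirs[i], event_name);
        if (link(tmppath, dst) < 0)     /* one inode, N names */
            perror(dst);
    }
    return unlink(tmppath);             /* drop the writer's own reference */
}

The inode then lives exactly as long as some subscriber still holds a
link (or an open fd) to it, which is why the cleanup question above
matters.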
Post by Jude Nelson
* Automatic multicast channel cleanup. Eventfs would ensure that no
matter how a process dies, its multicast state would become
inaccessible and be reclaimed once it is dead (i.e. a subsequent
filesystem operation on the orphaned state, no matter how soon after
the process's exit, will fail).
That's where storing events as files is problematic: files survive
processes. But I still don't think a specific fs is necessary: you can
either ensure files do not survive processes (see the supervisor/cleaner
idea above), or you can use another Unix mechanism (see below).
Post by Jude Nelson
Both of the above are implicitly guaranteed by libudev, since it
relies on a netlink multicast group shared with the udevd process
to achieve them.
And honestly, that's not a bad design. If you want to have multicast,
and you happen to have a true multicast IPC mechanism, might as well
use it. It will be hard to be as efficient as that: if you don't have
true multicast, you have to compromise somewhere.
I dare say using a netlink multicast group is lighter than designing
a FUSE filesystem to do the same thing. If you want the same
functionality, why didn't you adopt the same mechanism ?
I agree that netlink is lighter, but I avoided it for two reasons:
* Sometime down the road, I'd like to port vdev to OpenBSD. Not because I
believe that the OpenBSD project is in dire need of a dynamic device
manager, but simply because it's the thing I miss the most when I'm using
OpenBSD (personal preference). Netlink is Linux-specific, whereas FUSE
works on pretty much every Unix these days.
* There is no way to namespace netlink messages that I'm aware of. The
kernel (and udev) sends the same device events to every container on the
system--in fact, this is one of the major reasons cited by the systemd
folks for moving off of netlink for udevd-to-libudev communications. By
using a synthetic filesystem for message transport, I can use bind-mounts
to control which device events get routed to which containers (this is also
the reason why the late kdbus was implemented as a synthetic filesystem).
Using fifodirs has the same benefit :)
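
(Either way, the per-container routing can be as simple as a bind
mount; the paths below are made up, and a fifodir could be bind-mounted
into a container the same way:)

/* expose only one subtree of the host's event tree inside a container */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    if (mount("/dev/metadata/dev/sda",                        /* host source */
              "/containers/c1/rootfs/dev/metadata/dev/sda",   /* container target */
              NULL, MS_BIND, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}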
Post by Jude Nelson
(It can be made modular. You can have a uevent listener that just gets
the event from the kernel and transmits it to the event manager; and
the chosen event manager multicasts it.)
Good point; something I'll keep in mind in the future evolution of
libudev-compat :)
Post by Jude Nelson
It is my understanding (please correct me if I'm wrong) that with
s6-ftrig-*, I would need to write out the event data to each
listener's pipe (i.e. once per struct udev_monitor instance), and I
would still be responsible for cleaning up the fifodir every now and
then if the libudev-compat client failed to do so itself. Is my
understanding correct?
Yes and no. I'm not suggesting you use libftrig for your purpose. :)
* My concern with libftrig was never event storage: it was
many-to-many notification. I didn't design it to transmit arbitrary
amounts of data, but to instantly wake up processes when something
happens; data transmission *is* possible, but the original idea is
to send one byte at a time, for just 256 types of event.
Notification and data transmission are orthogonal concepts. It's
always possible to store data somewhere and notify processes that
data is available; then processes can fetch the data. Data
transmission can be pull, whereas notification has to be push.
libftrig is only about the push.
Leaking space is not a concern with libftrig, because fifodirs
never store data, only pipes; at worst, they leak a few inodes.
That is why a polling cleaner is sufficient: even if multiple
subscribers get SIGKILLed, they will only leave behind a few
fifos, and no data - so sweeping now and then is more than enough.
It's different if you're storing data, because leaks can be much
more problematic.
* Unless you have true multicast, you will have to push a
notification as many times as you have listeners, no matter what.
That's what I'm doing when writing to all the fifos in a fifodir.
That's what you are doing when linking the event into every
subscriber's directory. I guess your subscriber library uses some
kind of inotify to know when a new file has arrived?
Yes, modulo some other mechanisms to ensure that the libudev-compat process
doesn't get back-logged and lose messages. I completely agree with you
about the benefits of separating notification (control-plane) from message
delivery (data-plane).
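
(On the subscriber side, the notification part is roughly an inotify
watch on its queue directory -- a sketch with a placeholder path, not
libudev-compat's actual code:)

/* wake up whenever a new event file shows up in the queue directory */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void) {
    int ifd = inotify_init1(IN_CLOEXEC);
    if (ifd < 0) return 1;
    if (inotify_add_watch(ifd, "/dev/events/my-subscriber",
                          IN_CREATE | IN_MOVED_TO) < 0) return 1;

    char buf[4096];
    for (;;) {
        ssize_t len = read(ifd, buf, sizeof(buf));
        if (len <= 0) break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len > 0)
                printf("new event file: %s\n", ev->name);  /* open, parse, unlink */
            p += sizeof(*ev) + ev->len;
        }
    }
    close(ifd);
    return 0;
}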
Post by Jude Nelson
Again, I would love to know of a simpler approach that is just as
robust.
Whenever you have "pull" data transmission, you necessarily have the
problem of storage lifetime. Here, as often, what you want is
reference counting: when the last handle to the data disappears, the data
is automatically collected.
The problem is that your current handle, an inode, is not tied to the
subscriber's lifetime. You want a type of handle that will die with the
process.
File descriptors fit this.
- Your event manager listens to a Unix domain socket.
- Your subscribers connect to that socket.
+ the event manager stores the event into an anonymous file (e.g. a file
in a tmpfs that is unlinked as soon as it is created) while keeping a
reading fd on it
+ the event manager sends a copy of the reading fd, via fd-passing,
to every subscriber. This counts as a notification, since it will wake up
subscribers.
+ the event manager closes its own fd to the file.
+ subscribers will read the fd when they so choose, and they will
close it afterwards. The kernel will also close it when they die, so you
won't leak any data.
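
(A sketch of the fd-passing step above, using the usual SCM_RIGHTS
dance; the anonymous file itself could be an O_TMPFILE, or a file in a
tmpfs unlinked right after creation:)

/* hand a subscriber a read-only fd to an already-unlinked event file */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_event_fd(int subscriber_sock, int event_fd) {
    char dummy = '!';                   /* must carry at least one data byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &event_fd, sizeof(int));

    return sendmsg(subscriber_sock, &msg, 0) < 0 ? -1 : 0;
}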
Of course, at that point, you may as well give up and just push the
whole event over the Unix socket. It's what udevd does, except it uses a
netlink multicast group instead of a normal socket (so its complexity is
independent from the number of subscribers). Honestly, given that the
number of subscribers will likely be small, and your events probably aren't
too large either, it's the simplest design - it's what I'd go for.
(I even already have the daemon to do it, as a part of skabus. Sending
data to subscribers is exactly what a pubsub does.)
But if you estimate that the amount of data is too large and you don't
want to copy it, then you can just send a fd instead. It's still
manual broadcast, but it's not in O(event length * subscribers), it's in
O(subscribers), i.e. the same complexity as your "hard link the event
file" strategy; and it has the exact storage properties that you want.
What do you think ?
I think both approaches are good ideas and would work just as well. I
really like skabus's approach--I'll take a look at using it as an
additional (preferred?) vdev-to-libudev-compat message delivery
mechanism :) It looks like it offers all the aforementioned benefits
over netlink that I'm looking for.
A question on the implementation--what do you think of having each
subscriber create its own Unix domain socket in a canonical directory, and
having the sender connect as a client to each subscriber? Since each
subscriber needs its own fd to read and close, the directory of subscriber
sockets automatically gives the sender a list of who to communicate with
and a count of how many fds to create. It also makes it easy to detect and
clean up a dead subscriber's socket: the sender can request a struct ucred
from a subscriber to get its PID (and then other details from /proc), and
once that process exits (which the sender can detect on Linux using a
netlink process monitor, like [1]), the sender can unlink its socket.
The sender would rely
on additional process instance-identifying information from /proc (like its
start-time) to avoid PID-reuse races.
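
(A sketch of those two checks: SO_PEERCRED for the peer's PID, plus the
start time from field 22 of /proc/<pid>/stat to guard against PID
reuse. The helper names are made up.)

#define _GNU_SOURCE             /* struct ucred */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

static long long proc_starttime(pid_t pid) {
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, sizeof(buf), f)) { fclose(f); return -1; }
    fclose(f);
    char *p = strrchr(buf, ')');        /* comm may contain spaces */
    if (!p || p[1] != ' ') return -1;
    long long starttime = -1;
    int field = 3;                      /* p+2 points at the "state" field */
    for (p += 2; *p; p++) {
        if (field == 22) { sscanf(p, "%lld", &starttime); break; }
        if (*p == ' ') field++;
    }
    return starttime;
}

int remember_subscriber(int conn_fd) {
    struct ucred cred;
    socklen_t len = sizeof(cred);
    if (getsockopt(conn_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    printf("subscriber pid=%d uid=%d starttime=%lld\n",
           (int)cred.pid, (int)cred.uid, proc_starttime(cred.pid));
    return 0;
}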
Thanks again for all your input!
-Jude
[1] http://bewareofgeek.livejournal.com/2945.html?page=1