Hi Laurent,
Post by Jude Nelson
I haven't tried this myself, but it should be doable. Vdev's
event-propagation mechanism is a small program that constructs a
uevent string from environment variables passed to it by vdev and
writes the string to the appropriate place. The vdev daemon isn't
aware of its existence; it simply executes it like it would any
other matching device-event action. Another device manager could
supply the same program with the right environment variables and use
it for the same purposes.
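
(For illustration, a minimal sketch of what such a propagation helper
could look like. Only the standard kernel uevent keys are shown; the
exact variables vdev exports, and where the string actually gets
written, are not shown here.)

/* sketch: rebuild a uevent-style string from the environment.
 * stdout stands in for "the appropriate place". */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    static const char *keys[] = {
        "ACTION", "DEVPATH", "SUBSYSTEM", "MAJOR", "MINOR", "DEVNAME", 0
    };
    for (int i = 0; keys[i]; i++) {
        const char *val = getenv(keys[i]);
        if (val)
            printf("%s=%s\n", keys[i], val);
    }
    return 0;
}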
Indeed. My question then becomes: what are the differences between
the string passed by the kernel (which is more or less a list of
environment variables, too) and the string constructed by vdev ?
In other words, is vdev itself more than a trivial netlink listener,
and if yes, what does it do ? (I'll just take a pointer to the
documentation if that question is answered somewhere.)
For now I'll take a wild guess and say that vdev analyzes the
MODALIAS or something, according to a conf file, in order to know
the correct fan-out to perform and write the event to the correct
subsystems. Am I close ?
(I should really sit down and write documentation sometime :)
I think you're close. The gist of it is that vdev needs to supply a lot
more information than the kernel gives it. In particular, its helper
programs go on to query the properties and status of each device (this
often requires root privileges, i.e. via privileged ioctl()s), and vdev
gathers the information into a (much larger) event packet and stores it in
a directory tree under /dev for subsequent query by less-privileged
programs. It doesn't rely on the MODALIAS per se; instead it matches
fields of the kernel's uevent packet (one of which is the MODALIAS) to the
right helper programs to run.
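
(For contrast, a bare-bones netlink uevent listener -- roughly what the
kernel hands any device manager before enrichment. This is a generic
sketch, not vdev's actual listener code.)

/* sketch: print raw kernel uevents from the NETLINK_KOBJECT_UEVENT
 * multicast group */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void) {
    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;                 /* kernel uevent multicast group */

    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char buf[8192];
    for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);
        if (len <= 0) break;
        buf[len] = '\0';
        /* payload is "action@devpath\0KEY=VALUE\0KEY=VALUE\0..." */
        for (char *p = buf; p < buf + len; p += strlen(p) + 1)
            printf("%s\n", p);
        printf("--\n");
    }
    close(fd);
    return 0;
}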
Here's an example of what vdev gathers for my laptop's SATA disk:
$ cat /dev/metadata/dev/sda/properties
VDEV_ATA=1
VDEV_WWN=0x5000c500299a9a7a
VDEV_BUS=ata
VDEV_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_SERIAL_SHORT=5VJ7A0BM
VDEV_REVISION=0003LVM1
VDEV_TYPE=ata
VDEV_MAJOR=8
VDEV_MINOR=0
VDEV_OS_SUBSYSTEM=block
VDEV_OS_DEVTYPE=disk
VDEV_OS_DEVPATH=/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
VDEV_OS_DEVNAME=sda
VDEV_ATA=1
VDEV_ATA_TYPE=disk
VDEV_ATA_MODEL=ST9500420AS
VDEV_ATA_MODEL_ENC=ST9500420ASx20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20x20
VDEV_ATA_REVISION=0003LVM1
VDEV_ATA_SERIAL=ST9500420AS_5VJ7A0BM
VDEV_ATA_SERIAL_SHORT=5VJ7A0BM
VDEV_ATA_WRITE_CACHE=1
VDEV_ATA_WRITE_CACHE_ENABLED=1
VDEV_ATA_FEATURE_SET_HPA=1
VDEV_ATA_FEATURE_SET_HPA_ENABLED=1
VDEV_ATA_FEATURE_SET_PM=1
VDEV_ATA_FEATURE_SET_PM_ENABLED=1
VDEV_ATA_FEATURE_SET_SECURITY=1
VDEV_ATA_FEATURE_SET_SECURITY_ENABLED=0
VDEV_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=100
VDEV_ATA_FEATURE_SET_SECURITY_FROZEN=1
VDEV_ATA_FEATURE_SET_SMART=1
VDEV_ATA_FEATURE_SET_SMART_ENABLED=1
VDEV_ATA_FEATURE_SET_APM=1
VDEV_ATA_FEATURE_SET_APM_ENABLED=1
VDEV_ATA_FEATURE_SET_APM_CURRENT_VALUE=128
VDEV_ATA_DOWNLOAD_MICROCODE=1
VDEV_ATA_SATA=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN2=1
VDEV_ATA_SATA_SIGNAL_RATE_GEN1=1
VDEV_ATA_ROTATION_RATE_RPM=7200
VDEV_ATA_WWN=0x5000c500299a9a7a
VDEV_ATA_WWN_WITH_EXTENSION=0x5000c500299a9a7a
Anything that starts with "VDEV_ATA_", as well as "VDEV_BUS",
"VDEV_SERIAL_*", "VDEV_TYPE", and "VDEV_REVISION" had to be extracted via
an ioctl, by exploring files in sysfs, or by querying a hardware database.
The kernel only supplied a few of these fields.
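
(As a rough illustration of the cheap end of that extraction: some of
it is just sysfs reads, as below. The ATA-specific fields come from an
ATA IDENTIFY ioctl or a hardware database instead, which is where the
privileges come in. The paths here are assumptions for a disk named sda.)

/* sketch: fill in a couple of fields the kernel's uevent doesn't carry */
#include <stdio.h>
#include <string.h>

static int read_sysfs(const char *path, char *out, int outlen) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(out, outlen, f)) { fclose(f); return -1; }
    fclose(f);
    out[strcspn(out, "\n")] = '\0';     /* strip trailing newline */
    return 0;
}

int main(void) {
    char model[128], dev[32];
    if (read_sysfs("/sys/block/sda/device/model", model, sizeof(model)) == 0)
        printf("MODEL=%s\n", model);
    if (read_sysfs("/sys/block/sda/dev", dev, sizeof(dev)) == 0)
        printf("MAJOR:MINOR=%s\n", dev);
    return 0;
}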
Post by Jude Nelson
Tmpfs and devtmpfs are designed for holding ephemeral state already,
so I'm not sure why the fact that they expose data as regular files
is a concern?
Two different meanings of "ephemeral".
tmpfs and devtmpfs are supposed to retain their data until the
end of the system's lifetime. An event is much more ephemeral
than that: it's supposed to be consumed instantly - like the
event from the kernel is consumed instantly by the netlink listener.
Files, even in a tmpfs, remain alive in the absence of a live
process to hold them; but events have no meaning if no process needs
them, which is the reason for the "event leaking" problem.
Ideally, you need a file type with basically the same lifetime
as a process.
Holding event data in a file is perfectly valid as long as you have
a mechanism to reclaim the file as soon as the last reference to it
dies.
Funny you mention this--I also created runfs
(https://github.com/jcnelson/runfs) to do exactly this. In particular,
I use it for PID files. Also, eventfs was actually derived from runfs,
but specialized further for managing event queues.
Post by Jude Nelson
I couldn't think of a simpler way that was also as robust. Unless
I'm misunderstanding something, wrapping an arbitrary program to
clean up the files it created would, in the extreme, require coming
up with a way to do so on SIGKILL. I'd love to know if there is a
simple way to do this, though.
That's where supervisors come into play: the parent of a process
always knows when it dies, even on SIGKILL. Supervised daemons can
have a cleaner script in place.
For the general case, it shouldn't be hard to have a wrapper that
forks an arbitrary program and cleans up /dev/metadata/whatever/*$childpid*
when it dies. The price to pay is an additional process, but that
additional process would be very small.
You can still have a polling "catch-all cleaner" to collect dead events
in case the supervisor/wrapper also died, but since that occurrence will
be rare, the polling period can be pretty long so it's not a problem.
Agreed. I would be happy to keep this approach in mind in the design of
libudev-compat. Eventfs isn't a hard requirement and I don't want it to
be, since there's more than one way to deal with this problem.
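
(A minimal sketch of that wrapper, assuming a hypothetical per-PID
naming scheme under /dev/metadata/whatever -- not vdev's actual layout:)

/* run a program; when it dies (even by SIGKILL), sweep the event files
 * tagged with its pid */
#include <glob.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 2;

    pid_t child = fork();
    if (child < 0) return 1;
    if (child == 0) {
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(child, &status, 0);     /* the parent always learns of the death */

    char pattern[256];
    snprintf(pattern, sizeof(pattern),
             "/dev/metadata/whatever/*.%d.*", (int)child);   /* placeholder */
    glob_t g;
    if (glob(pattern, 0, NULL, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++)
            unlink(g.gl_pathv[i]);
        globfree(&g);
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : 128;
}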
Post by Jude Nelson
I went with a specialized filesystem for two reasons, both of which
were to fulfill libudev's API contract:
* Efficient, reliable event multicasting. By using hard-links as
described above, the event only needs to be written out once, and the
OS only needs to store one copy.
That's a good mechanism; you're already fulfilling that contract
with the non-eventfs implementation.
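
(The hard-link fan-out described above boils down to roughly this; the
directory layout and function name are hypothetical:)

/* write the event once, then give each subscriber a name for it */
#include <stdio.h>
#include <unistd.h>

int multicast_event(const char *tmppath, const char *const *subscriber_dirs,
                    int nsubs, const char *event_name) {
    for (int i = 0; i < nsubs; i++) {
        char dst[512];
        snprintf(dst, sizeof(dst), "%s/%s", subscriber_dirs[i], event_name);
        if (link(tmppath, dst) < 0)     /* one inode, N names */
            perror(dst);
    }
    return unlink(tmppath);             /* drop the writer's own reference */
}

The inode then lives exactly as long as some subscriber still holds a
link (or an open fd) to it, which is why the cleanup question above
matters.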
Post by Jude Nelson
* Automatic multicast channel cleanup. Eventfs would ensure that no
matter how a process dies, its multicast state would become
inaccessible and be reclaimed once it is dead (i.e. a subsequent
filesystem operation on the orphaned state, no matter how soon after
the process's exit, will fail).
That's where storing events as files is problematic: files survive
processes. But I still don't think a specific fs is necessary: you can
either ensure files do not survive processes (see the supervisor/cleaner
idea above), or you can use another Unix mechanism (see below).
Post by Jude Nelson
Both of the above are implicitly guaranteed by libudev, since it
relies on a netlink multicast group shared with the udevd process
to achieve them.
And honestly, that's not a bad design. If you want to have multicast,
and you happen to have a true multicast IPC mechanism, might as well
use it. It will be hard to be as efficient as that: if you don't have
true multicast, you have to compromise somewhere.
I dare say using a netlink multicast group is lighter than designing
a FUSE filesystem to do the same thing. If you want the same
functionality, why didn't you adopt the same mechanism ?
I agree that netlink is lighter, but I avoided it for two reasons:
* Sometime down the road, I'd like to port vdev to OpenBSD. Not because I
believe that the OpenBSD project is in dire need of a dynamic device
manager, but simply because it's the thing I miss the most when I'm using
OpenBSD (personal preference). Netlink is Linux-specific, whereas FUSE
works on pretty much every Unix these days.
* There is no way to namespace netlink messages that I'm aware of. The
kernel (and udev) sends the same device events to every container on the
system--in fact, this is one of the major reasons cited by the systemd
folks for moving off of netlink for udevd-to-libudev communications. By
using a synthetic filesystem for message transport, I can use bind-mounts
to control which device events get routed to which containers (this is also
the reason why the late kdbus was implemented as a synthetic filesystem).
Using fifodirs has the same benefit :)
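
(Either way, the per-container routing can be as simple as a bind
mount; the paths below are made up, and a fifodir could be bind-mounted
into a container the same way:)

/* expose only one subtree of the host's event tree inside a container */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    if (mount("/dev/metadata/dev/sda",                        /* host source */
              "/containers/c1/rootfs/dev/metadata/dev/sda",   /* container target */
              NULL, MS_BIND, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}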
Post by Jude Nelson
(It can be made modular. You can have a uevent listener that just gets
the event from the kernel and transmits it to the event manager; and
the chosen event manager multicasts it.)
Good point; something I'll keep in mind in the future evolution of
libudev-compat :)
Post by Jude Nelson
It is my understanding (please correct me if I'm wrong) that with
s6-ftrig-*, I would need to write out the event data to each
listener's pipe (i.e. once per struct udev_monitor instance), and I
would still be responsible for cleaning up the fifodir every now and
then if the libudev-compat client failed to do so itself. Is my
understanding correct?
Yes and no. I'm not suggesting you use libftrig for your purpose. :)
* My concern with libftrig was never event storage: it was
many-to-many notification. I didn't design it to transmit arbitrary
amounts of data, but to instantly wake up processes when something
happens; data transmission *is* possible, but the original idea is
to send one byte at a time, for just 256 types of event.
Notification and data transmission are orthogonal concepts. It's
always possible to store data somewhere and notify processes that
data is available; then processes can fetch the data. Data
transmission can be pull, whereas notification has to be push.
libftrig is only about the push.
Leaking space is not a concern with libftrig, because fifodirs
never store data, only pipes; at worst, they leak a few inodes.
That is why a polling cleaner is sufficient: even if multiple
subscribers get SIGKILLed, they will only leave behind a few
fifos, and no data - so sweeping now and then is more than enough.
It's different if you're storing data, because leaks can be much
more problematic.
* Unless you have true multicast, you will have to push a
notification as many times as you have listeners, no matter what.
That's what I'm doing when writing to all the fifos in a fifodir.
That's what you are doing when linking the event into every
subscriber's directory. I guess your subscriber library uses some
kind of inotify to know when a new file has arrived?
Yes, modulo some other mechanisms to ensure that the libudev-compat process
doesn't get back-logged and lose messages. I completely agree with you
about the benefits of separating notification (control-plane) from message
delivery (data-plane).
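
(On the subscriber side, the notification part is roughly an inotify
watch on its queue directory -- a sketch with a placeholder path, not
libudev-compat's actual code:)

/* wake up whenever a new event file shows up in the queue directory */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void) {
    int ifd = inotify_init1(IN_CLOEXEC);
    if (ifd < 0) return 1;
    if (inotify_add_watch(ifd, "/dev/events/my-subscriber",
                          IN_CREATE | IN_MOVED_TO) < 0) return 1;

    char buf[4096];
    for (;;) {
        ssize_t len = read(ifd, buf, sizeof(buf));
        if (len <= 0) break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len > 0)
                printf("new event file: %s\n", ev->name);  /* open, parse, unlink */
            p += sizeof(*ev) + ev->len;
        }
    }
    close(ifd);
    return 0;
}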
Post by Jude Nelson
Again, I would love to know of a simpler approach that is just as
robust.
Whenever you have "pull" data transmission, you necessarily have the
problem of storage lifetime. Here, as often, what you want is
reference counting: when the last handle to the data disappears, the data
is automatically collected.
The problem is that your current handle, an inode, is not tied to the
subscriber's lifetime. You want a type of handle that will die with the
process.
File descriptors fit this.
- Your event manager listens to a Unix domain socket.
- Your subscribers connect to that socket.
+ the event manager stores the event into an anonymous file (e.g. a file
in a tmpfs that is unlinked as soon as it is created) while keeping a
reading fd on it
+ the event manager sends a copy of the reading fd, via fd-passing,
to every subscriber. This counts as a notification, since it will wake up
subscribers.
+ the event manager closes its own fd to the file.
+ subscribers will read the fd when they so choose, and they will
close it afterwards. The kernel will also close it when they die, so you
won't leak any data.
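
(A sketch of the fd-passing step above, using the usual SCM_RIGHTS
dance; the anonymous file itself could be an O_TMPFILE, or a file in a
tmpfs unlinked right after creation:)

/* hand a subscriber a read-only fd to an already-unlinked event file */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_event_fd(int subscriber_sock, int event_fd) {
    char dummy = '!';                   /* must carry at least one data byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &event_fd, sizeof(int));

    return sendmsg(subscriber_sock, &msg, 0) < 0 ? -1 : 0;
}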
Of course, at that point, you may as well give up and just push the
whole event over the Unix socket. It's what udevd does, except it uses a
netlink multicast group instead of a normal socket (so its complexity is
independent from the number of subscribers). Honestly, given that the
number of subscribers will likely be small, and your events probably aren't
too large either, it's the simplest design - it's what I'd go for.
(I even already have the daemon to do it, as a part of skabus. Sending
data to subscribers is exactly what a pubsub does.)
But if you estimate that the amount of data is too large and you don't
want to copy it, then you can just send a fd instead. It's still
manual broadcast, but it's not in O(event length * subscribers), it's in
O(subscribers), i.e. the same complexity as your "hard link the event
file" strategy; and it has the exact storage properties that you want.
What do you think ?
I think both approaches are good ideas and would work just as well. I
really like skabus's approach--I'll take a look at using it as an
additional (preferred?) vdev-to-libudev-compat message delivery
mechanism :) It looks like it offers all the aforementioned benefits
over netlink that I'm looking for.
A question on the implementation--what do you think of having each
subscriber create its own Unix domain socket in a canonical directory, and
having the sender connect as a client to each subscriber? Since each
subscriber needs its own fd to read and close, the directory of subscriber
sockets automatically gives the sender a list of who to communicate with
and a count of how many fds to create. It also makes it easy to detect and
clean up a dead subscriber's socket: the sender can request a struct ucred
from a subscriber to get its PID (and then other details from /proc), and
once that process exits (which the sender can detect on Linux using a
netlink process monitor, like [1]), the sender can unlink its socket.
The sender would rely
on additional process instance-identifying information from /proc (like its
start-time) to avoid PID-reuse races.
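
(A sketch of those two checks: SO_PEERCRED for the peer's PID, plus the
start time from field 22 of /proc/<pid>/stat to guard against PID
reuse. The helper names are made up.)

#define _GNU_SOURCE             /* struct ucred */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

static long long proc_starttime(pid_t pid) {
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, sizeof(buf), f)) { fclose(f); return -1; }
    fclose(f);
    char *p = strrchr(buf, ')');        /* comm may contain spaces */
    if (!p || p[1] != ' ') return -1;
    long long starttime = -1;
    int field = 3;                      /* p+2 points at the "state" field */
    for (p += 2; *p; p++) {
        if (field == 22) { sscanf(p, "%lld", &starttime); break; }
        if (*p == ' ') field++;
    }
    return starttime;
}

int remember_subscriber(int conn_fd) {
    struct ucred cred;
    socklen_t len = sizeof(cred);
    if (getsockopt(conn_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    printf("subscriber pid=%d uid=%d starttime=%lld\n",
           (int)cred.pid, (int)cred.uid, proc_starttime(cred.pid));
    return 0;
}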
Thanks again for all your input!
-Jude
[1] http://bewareofgeek.livejournal.com/2945.html?page=1