Anyone who has been paying attention to Linux kernel development in recent years would be aware that IPC — interprocess communication — is not a solved problem. There are certainly many partial solutions, from pipes and signals, through sockets and shared memory, to more special-purpose solutions like Cross Memory Attach and Android's binder. But it seems there are still some use cases that aren't fully addressed by current solutions, leading to new solutions being occasionally proposed to try to meet those needs. The latest proposal is called "bus1". While that isn't a particularly interesting name, it could be much worse: it could have been named after a town in Massachusetts like Wayland, Dracut, Plymouth, and others.
The focus for bus1 is much the same as that for recent "kdbus" proposals — to provide kernel support for D-Bus — and the implementation has strong similarities with binder, which occupies a similar place in the IPC space. The primary concerns seem to be low-overhead message passing between established peers and multicast, which are useful for remote procedure calls and distributing status updates, respectively.
David Herrmann announced bus1 on the kernel-summit mailing list with the goal of having a session for discussing the new functionality at the upcoming summit in November. The announcement was accompanied by a link to the current code and documentation, which is pleasingly well organized and thorough. Any imperfections are easily explained by the fact that bus1 is under active development, and they did not interfere with my ability to form the following picture of the structure, strengths, and, occasionally, weaknesses of bus1.
Peers, nodes, and handles
All communication mediated by bus1 travels between "peers", where a peer is a kernel abstraction somewhat like a socket. A process accesses a peer through a file descriptor which, in the current implementation, is obtained by opening the character special device /dev/bus1. Tom Gundersen, one of the authors, acknowledged that having a dedicated system call to create a peer, and to perform the various other operations that currently use ioctl(), might make more sense for eventual upstream submission. Devices and ioctl() have been used so far because they make out-of-tree development easier.
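For concreteness, here is a minimal sketch of what creating a peer looks like with the current character-device interface. Only the open() of /dev/bus1 comes from the material above; the comment about what close() does is an assumption about ordinary kernel resource semantics, not a documented detail.

    /* Minimal sketch: creating a bus1 peer via the character device.
     * Only the open() of /dev/bus1 is documented above; everything else
     * an application does with the descriptor goes through ioctl(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Each open() of /dev/bus1 yields a new, independent peer. */
            int peer = open("/dev/bus1", O_RDWR | O_CLOEXEC);
            if (peer < 0) {
                    perror("open /dev/bus1");
                    return 1;
            }

            /* ... BUS1_CMD_* ioctl() calls on "peer" go here ... */

            close(peer);    /* presumably releases the peer and its nodes */
            return 0;
    }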
A peer is not directly addressable, but it can hold an arbitrary number of addressable objects known as "nodes". Bus1 maintains minimal state about a node beyond a reference counter and linkage into other data structures; it serves only as a rendezvous point. When a message is sent to a node, it is delivered to the peer that owns the node, and the peer is told "this message was sent to that node". The application managing the node can interpret that as a particular object, or as a particular service, or whatever is appropriate.
Nodes themselves do not have externally visible names, but are identified by "handles" that are a little bit like file descriptors, a little bit like the "watch descriptors" used by inotify, and a lot like the object descriptors used by binder. As such, a handle acts like a "capability". If a peer holds a handle that identifies a particular node, then it has implicit permission to send messages to that node, and hence to whatever object some application associates with that node. If, instead, it doesn't have a handle for a particular node, the only way to get one is to ask another node to provide it, and that is where permission checking will happen.
A handle is a 64-bit number (allocated by bus1) that only has meaning in the context of a particular peer, in the same way that a file descriptor only has meaning in the context of the process that owns it. A handle refers to a node that belongs to some peer; it might be the same peer that owns the handle, or it might be a different peer. In the latter case, the handle effectively acts as a link between two peers, though neither can directly determine any details of the other.
To create a new node, an application performs any of the various ioctl() operations that accept a handle, passing the special reserved handle number of "3". This causes a new node to be allocated; the handle number for that node will replace the reserved number in the argument to ioctl().
When a peer is created by opening /dev/bus1, it has no nodes and no handles. Nodes local to the peer can be created on demand as described above, but an extra step is required to get a handle for a node in another peer. The BUS1_CMD_HANDLE_TRANSFER ioctl() can be used for this. This ioctl() is called on the file descriptor for one peer and is given the file descriptor for the destination peer along with the handle to transfer. A new handle will be created for the second peer referring to the node pointed to by the original handle.
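A sketch of how this might look from user space, combining the node-creation trick from the previous paragraph with the handle transfer, follows. The ioctl() name is the one given above, but the argument layout and field names are reconstructed from the description in the text; they are not the actual bus1 UAPI, and the real header would be needed to compile this.

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical argument layout for BUS1_CMD_HANDLE_TRANSFER,
     * reconstructed from the description above; the real definition
     * lives in the bus1 UAPI header. */
    struct handle_transfer {
            uint64_t flags;
            uint64_t src_handle;    /* handle valid in the context of src_peer */
            uint64_t dst_fd;        /* file descriptor of the destination peer */
            uint64_t dst_handle;    /* filled in by the kernel */
    };

    /* Create a node owned by src_peer and install a handle to it in
     * dst_peer's handle space; return that new handle via *out. */
    static int give_handle(int src_peer, int dst_peer, uint64_t *out)
    {
            struct handle_transfer xfer = {
                    .src_handle = 3,        /* reserved value: allocate a new node */
                    .dst_fd = dst_peer,
            };

            if (ioctl(src_peer, BUS1_CMD_HANDLE_TRANSFER, &xfer) < 0)
                    return -1;

            *out = xfer.dst_handle;         /* meaningful only to dst_peer */
            return 0;
    }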
The question of how to arrange for a process to have two different peer-descriptors so it can call BUS1_CMD_HANDLE_TRANSFER is not addressed by bus1. One option would be to have a daemon listening on a Unix-domain socket. An application that wants to communicate on that daemon's bus would open /dev/bus1 and send the resulting file descriptor over a socket to the daemon. The daemon would then perform the handle transfer and tell the application what the new handle is.
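That handshake is ordinary Unix-domain socket programming rather than anything bus1-specific. A sketch of the client side, passing its peer descriptor to such a daemon with the standard SCM_RIGHTS mechanism, might look like this; the function name and the surrounding protocol are purely illustrative.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send the bus1 peer descriptor to the daemon over an already
     * connected Unix-domain socket, using SCM_RIGHTS ancillary data. */
    static int send_peer_fd(int sock, int peer_fd)
    {
            char byte = 0;                                  /* dummy payload */
            struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
            union {
                    struct cmsghdr hdr;
                    char buf[CMSG_SPACE(sizeof(int))];
            } ctrl;
            struct msghdr msg = { 0 };
            struct cmsghdr *cmsg;

            memset(&ctrl, 0, sizeof(ctrl));
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = ctrl.buf;
            msg.msg_controllen = sizeof(ctrl.buf);

            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &peer_fd, sizeof(int));

            /* The daemon recvmsg()s the descriptor, performs
             * BUS1_CMD_HANDLE_TRANSFER on it, and replies over the same
             * socket with the handle value the client should use. */
            return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }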
This mechanism could be repeated to give the first peer a handle that can be used to reach the second, but once there is this one point of communication, it is possibly easier to use the more general approach of message passing.
Messages, queues, and pools
The core functionality of any IPC mechanism is to pass messages. Bus1 allows a message to be sent via a peer to a list of nodes by specifying a list of destination handles. This message has three application-controlled segments and a fourth segment that is imposed by bus1.
The three application-controlled segments all contain resources to be passed to the message recipient(s). They are: a block of uninterpreted data that can be assembled from multiple locations as is done by writev(2), a list of handles, and a list of file descriptors. The handles and file descriptors are mapped to references to internal data structures when a message is sent, and mapped back to handles and descriptors relevant to the receiver when it is received. In order to give an application control over its open files, the receiving application can request that files not be mapped to local file descriptors. This is particularly useful when combined with the "peek" version of message reception which reports the content of the message without removing it from the incoming queue.
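For a sense of what this looks like at the system-call boundary, here is a speculative send operation. BUS1_CMD_SEND and the field names below are reconstructed from the description above rather than taken from the real UAPI, and the peer descriptor is the one obtained in the earlier examples.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/uio.h>

    /* Hypothetical send command carrying the three caller-controlled
     * segments: data vectors, handles, and file descriptors. */
    struct bus1_send {
            uint64_t ptr_destinations;      /* array of destination handles */
            uint64_t n_destinations;
            uint64_t ptr_vecs;              /* iovec array, as for writev(2) */
            uint64_t n_vecs;
            uint64_t ptr_handles;           /* handles to pass to the recipients */
            uint64_t n_handles;
            uint64_t ptr_fds;               /* file descriptors to pass along */
            uint64_t n_fds;
    };

    /* Send a plain data message to a single destination handle. */
    static int send_status(int peer, uint64_t dst, const void *text, size_t len)
    {
            struct iovec vec = { .iov_base = (void *)text, .iov_len = len };
            struct bus1_send cmd = {
                    .ptr_destinations = (uintptr_t)&dst,
                    .n_destinations = 1,
                    .ptr_vecs = (uintptr_t)&vec,
                    .n_vecs = 1,
                    /* no handles and no file descriptors in this message */
            };

            return ioctl(peer, BUS1_CMD_SEND, &cmd);        /* hypothetical name */
    }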
If a peer is sent a handle for a node that it already has a handle for, then the handle it is given will be exactly the same as the handle it already has. This means that handles can be compared for equality by just comparing the 64-bit values. This contrasts with file descriptors in that when a file descriptor is received, a new local descriptor is allocated even if the process already has the same file open on a different descriptor.
The fourth segment of a message identifies the sender of the message; it contains the sender's process, thread, user, and group IDs. Each of these numbers is mapped appropriately if the message travels between namespaces. These details are similar to those contained in the SCM_CREDENTIALS message that can be passed over a Unix-domain socket, but with an important difference: when using SCM_CREDENTIALS, the sending process provides the credentials and the kernel validates them; with bus1, instead, the sending process has no control at all. This means that if the sending process is running a setuid program and has different real and effective user IDs, it cannot choose which one to send. Bus1 currently insists on always sending the real user and group IDs.
Note that the identification of the sender does not include a handle by which a reply might be sent. If the sender wants a reply, it must explicitly pass a handle as part of the message; there is no implicit return path.
As mentioned, a list of recipient handles can be given, and the message will be delivered to all of the recipients, if possible. This provides a form of multicast, though there is no native support for a publish/subscribe multicast arrangement where multicast messages are sent to a well-known address and recipients can indicate which addresses they are listening on. If publish/subscribe is needed, it would have to be layered on top with some application maintaining subscription lists and sending messages to those lists as required.
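Such a layer could be as simple as a broker that keeps a list of subscriber handles per topic and hands the whole list to a single send operation. The sketch below reuses the hypothetical bus1_send structure from the earlier example and is only meant to show the shape of the idea.

    #define MAX_SUBSCRIBERS 64

    /* Broker-side state for one topic: nothing more than the handles
     * of the peers that asked to be told about it. */
    struct topic {
            uint64_t subscribers[MAX_SUBSCRIBERS];
            uint64_t n_subscribers;
    };

    /* Multicast one data message to every subscriber in a single
     * ioctl(); reuses the hypothetical struct bus1_send from above. */
    static int publish(int peer, struct topic *t, const void *data, size_t len)
    {
            struct iovec vec = { .iov_base = (void *)data, .iov_len = len };
            struct bus1_send cmd = {
                    .ptr_destinations = (uintptr_t)t->subscribers,
                    .n_destinations = t->n_subscribers,
                    .ptr_vecs = (uintptr_t)&vec,
                    .n_vecs = 1,
            };

            return ioctl(peer, BUS1_CMD_SEND, &cmd);
    }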
Each peer has a "pool" of memory, currently limited to 256MB, in which incoming messages are placed. When a message is sent to a particular peer, space is allocated in that peer's pool and the data segment of the message is copied directly from the sender's memory into the recipient's pool. The recipient can map that pool into virtual memory and will get read-only access to the data. When it has finished with the data it can tell bus1 that section (or "slice") of the pool is free for reuse.
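The pool, as described, lends itself to something like the following. Mapping it through the peer's own descriptor is an assumption about how the behaviour is exposed, and the 256MB figure is simply the limit quoted above.

    #include <sys/mman.h>

    #define POOL_SIZE (256UL * 1024 * 1024)         /* current upper bound */

    /* Map the peer's receive pool read-only; the kernel copies incoming
     * message data into this region and userspace only ever reads it. */
    static void *map_pool(int peer)
    {
            void *pool = mmap(NULL, POOL_SIZE, PROT_READ, MAP_SHARED, peer, 0);

            return pool == MAP_FAILED ? NULL : pool;
    }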
The message-transfer process copies the data to the pool, reserves a little extra space in the pool to store the translated handles and file descriptors, and adds a fixed-sized structure with other message details to a per-peer queue. Once there is a message on the queue, the peer's file descriptor will report to poll() or select() that it is readable; an ioctl() request can then be made to find out where in the pool the message is and to collect other details like the sender's credentials. It is only when this request is made that the handles and file descriptors are translated and their details added to the pool.
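Putting the pieces together, a receive loop might look roughly like this. BUS1_CMD_RECV, the slice-release ioctl(), and the structure fields are again reconstructed from the text rather than copied from the UAPI.

    #include <poll.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical receive command: bus1 fills in where the next message
     * sits in the pool and which local node it was addressed to. */
    struct bus1_recv {
            uint64_t flags;                 /* e.g. a "peek" flag */
            uint64_t offset;                /* slice offset within the pool */
            uint64_t n_bytes;               /* length of the data segment */
            uint64_t destination;           /* handle the message was sent to */
    };

    static int receive_one(int peer, const char *pool)
    {
            struct pollfd pfd = { .fd = peer, .events = POLLIN };
            struct bus1_recv msg = { 0 };
            const void *data;

            if (poll(&pfd, 1, -1) < 0)
                    return -1;

            /* Handles and file descriptors carried by the message are
             * only translated at this point, not at send time. */
            if (ioctl(peer, BUS1_CMD_RECV, &msg) < 0)
                    return -1;

            data = pool + msg.offset;
            /* ... interpret msg.n_bytes bytes at "data" for the node
             *     identified by msg.destination ... */
            (void)data;

            /* Tell bus1 that the slice may be reused (this ioctl name,
             * again, is a guess). */
            return ioctl(peer, BUS1_CMD_SLICE_RELEASE, &msg.offset);
    }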
In order to avoid denial-of-service attacks, each user has quotas limiting the amount of space they can consume in pools by sending messages and the number of handles and file descriptors they can have sent that haven't been processed yet. The default quota on space is 256MB in a total of at most 16,383 messages. As soon as the receiving application accepts the message, whether it releases the pool space or not, the charge against the sender is removed.
Total message ordering
An important property that the bus1 developers put some effort into providing is a global ordering of messages. This doesn't mean that every message has a unique sequence number, but instead provides semantics that are just as good for practical purposes, without needing any global synchronization.
The particular properties that are ensured are "consistency" (if two peers receive the same two messages, they will both see them in the same order), and "causality" (if there is any chance of causality between two messages, then the message relating to the cause will be certain to arrive before the message relating to the result). This is achieved using local clocks at each peer that are synchronized in a manner similar to that used for Lamport clocks. For the fine details it is best to read the documentation section on message ordering.
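To make the causality rule concrete, here are the update rules of a classic Lamport clock. Bus1's actual scheme differs in its details, as its documentation explains, but the core idea is the same: every peer keeps a counter, stamps outgoing messages with it, and fast-forwards it past any timestamp it receives.

    #include <stdint.h>

    struct peer_clock {
            uint64_t now;
    };

    /* Stamp an outgoing message: tick the local clock and attach its
     * value as the message's timestamp. */
    static uint64_t clock_send(struct peer_clock *c)
    {
            return ++c->now;
    }

    /* Merge an incoming timestamp: the local clock jumps past anything
     * it has seen, so any message this peer sends afterwards (a
     * potential "effect") carries a larger timestamp than the message
     * that triggered it (the "cause"). */
    static void clock_recv(struct peer_clock *c, uint64_t msg_ts)
    {
            if (msg_ts > c->now)
                    c->now = msg_ts;
            ++c->now;
    }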
Like binder, only better?
As I absorbed all the details about bus1, I was struck by its similarities to Android's binder. The message structure is similar, the handles are similar, and the use of a pool to receive messages is similar. Though I haven't covered them here, bus1 delivers node destruction messages when a node is destroyed in a similar manner to binder. Given this, a useful perspective might be provided by looking for differences.
The most obvious difference is that binder strongly unifies the concept of a peer with that of a process. In binder, the message-receipt pool is reserved in one process's address space, and each object ID, the equivalent of a bus1 handle, is a per-process identifier. Having all these concepts connected with a file descriptor in bus1, instead of with a process, is a clear improvement in flexibility and simplicity.
Binder has an internal distinction between requests and replies, and a dedicated send/wait/receive operation so that a complete remote procedure call can be effected in a single system call. Bus1 doesn't have this and so would require separate send, wait, and receive steps. This may seem like a minor optimization, but there is an important underlying benefit that this brings to binder.
Unifying all the steps into a single operation provides binder with the concept of a transaction; the message and reply can be closely associated into a single abstraction. If the recipient of the message needs to perform other IPC calls as part of handling the message, binder can see those as part of the same transaction, which will continue until the reply comes back to the originating process. Since all of this can be identified as a single transaction, binder is able to temporarily elevate the priority of every process involved to match the priority of the calling process, effectively allowing priority inheritance across IPC calls. The introduction to bus1 that was posted to the kernel-summit list identified priority inheritance as an important requirement for an IPC system, but the current code and documentation don't give any hint about how that will be implemented. Until we can see the design for priority inheritance, we cannot know if discarding the transaction concept of binder is a good simplification, or a bad loss of functionality.
Any IPC mechanism requires some sort of shared namespace for actors to find each other. Both binder and bus1 largely leave this to other layers, though in slightly different ways. In binder there is a single well-known object ID — the number "0". Only a single privileged process can receive messages sent to "0". An obvious use of this would be to support registration and lookup in a global namespace. The process listening on ID 0 would be a location broker that checked the privileges of any processes registering a name, and then would direct any request for that name to the associated object. Bus1 doesn't even have this "ID 0". Whatever handshake is used to allow BUS1_CMD_HANDLE_TRANSFER to be called must also make sure that each new client knows how to contact any broker that it might need.
Finally, binder has nothing like the global message-ordering guarantees that bus1 provides, but bus1 has nothing like the thread-pool management that is built into binder. The importance of either of these cannot be known without considerable experience working in this space, so it might be a worthy topic to explore at the Kernel Summit. These differences, together with the lack of a request/reply distinction in bus1, are probably enough that it would be unwise to hope that bus1 might eventually replace binder in the kernel.
Summary
It is early days yet for bus1. Though it has been under development for at least eight months (based on Git history) and is based on even older ideas, there has been little public discussion. The follow-up comments on the kernel-summit email thread primarily involved people indicating their interest rather than commenting on the design. From my limited perspective, though, it is looking positive. The quality of the code and documentation is excellent. The design takes the best of binder, which is a practical success as a core part of the Android platform, and improves on it. And the development team appears to be motivated toward healthy, informed community discussion prior to any acceptance. The tea-leaves tell me there are good things in store for bus1.