A peek at the DragonFly Virtual Kernel (part 2)
[ Editor's note: this article is the second and final part of Aggelos Economopoulos's look at the DragonFly BSD virtual kernel. For those who questioned why a BSD development appears on this page, the answer is simple: there is value in seeing how others have solved common problems. ]
Userspace I/O
Our previous article gave an overview of the DragonFly virtual kernel and the kernel virtual memory subsystem. In this article, we can finally cover the complications that present themselves in implementing such a virtualized execution environment. If you haven't read the previous article, it would be a good idea to do so before continuing.
Now that we know how the virtual kernel regains control when its processes request/need servicing, let us turn to how it goes about satisfying those requests. Signal transmission and most of the filesystem I/O (read, write, ...), process control (kill, signal, ...) and net I/O system calls are easy; the vkernel takes the same code paths that a real kernel would. The only difference is in the implementation of the copyin()/copyout() family of routines for performing I/O to and from userspace.
When the real kernel needs to access user memory locations, it must first make sure that the page in question is resident and will remain in memory for the duration of a copy. In addition, because it acts on behalf of a user process, it should adhere to the permissions associated with that process. Now, on top of that, the vkernel has to work around the fact that the process address space is not mapped while it is running. Of course, the vkernel knows which pages it needs to access and can therefore perform the copy by creating a temporary kernel mapping for the pages in question. This operation is reasonably fast; nevertheless, it does incur measurable overhead compared to the host kernel.
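As an illustration of the extra step involved, here is a rough sketch of what a vkernel copyin() has to do. The helper routines used below (vproc_lookup_page(), vkernel_map_page(), vkernel_unmap_page()) are hypothetical placeholders, not the actual DragonFly functions.

```c
/*
 * Hypothetical sketch of a vkernel copyin().  Because the vproc address
 * space is not mapped while the vkernel itself runs, every user page has
 * to be looked up in the vkernel's VM structures and temporarily mapped
 * before its bytes can be copied.
 */
int
vkernel_copyin(const void *uaddr, void *kaddr, size_t len)
{
        size_t done = 0;

        while (done < len) {
                uintptr_t va = (uintptr_t)uaddr + done;
                size_t off = va & (PAGE_SIZE - 1);
                size_t chunk = PAGE_SIZE - off;
                void *page, *tmp;

                if (chunk > len - done)
                        chunk = len - done;

                /* Fault the page in if necessary and keep it resident. */
                page = vproc_lookup_page(curproc, va, VM_PROT_READ);
                if (page == NULL)
                        return EFAULT;

                /* Temporary mapping in the vkernel's own address space. */
                tmp = vkernel_map_page(page);
                memcpy((char *)kaddr + done, (char *)tmp + off, chunk);
                vkernel_unmap_page(tmp);

                done += chunk;
        }
        return 0;
}
```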
Page Faults
The interesting part is dealing with page faults (this includes lazily servicing mmap()/madvise()/... operations). When a process mmap()s a file (or anonymous memory) in its address space, the kernel (real or virtual) does not immediately allocate pages to read in the file data (or locate the pages in the cache, if applicable), nor does it set up the pagetable entries to fulfill the request. Instead, it merely notes in its data structures that it has promised that the specified data will be there when read, that writes to the corresponding memory locations will not fail (for a writable mapping) and that they will be reflected on disk (if they correspond to a file area). Later, if the process tries to access these addresses (which still do not have valid pagetable entries (PTEs), if they ever did, because new mappings invalidate old ones), the CPU raises a page fault and the fault handling code has to deliver as promised; it obtains the necessary data pages and updates the PTEs. Following that, the faulting instruction is restarted.
Consider what happens when a process running on an alternate vmspace of a vkernel process generates a page fault trying to access the memory region it has just mmap()ed. The real kernel knows nothing about this and through a mechanism that will be described later, passes the information about the fault on to the vkernel. So, how does the vkernel deal with it? The case when the faulting address is invalid is trivially handled by delivering a signal (SIGBUS or SIGSEGV) to the faulting vproc. But in the case of a reference to a valid address, how can the vkernel ensure that the current and succeeding accesses will complete? Existing system facilities are not appropriate for this task; clearly, a new mechanism is called for.
What we need is a way for the vkernel to execute mmap-like operations on its alternate vmspaces. With this functionality available as a set of system calls, say vmspace_mmap()/vmspace_munmap()/etc., the vkernel code servicing an mmap()/munmap()/mprotect()/etc. vproc call would, after doing some sanity checks, just execute the corresponding new system call, specifying the vmspace to operate on. This way, the real kernel would be made aware of the required mapping and its VM system would do our work for us.
The DragonFly kernel provides a vmspace_mmap() and a vmspace_munmap() like the ones we described above, but none of the other calls we thought we would need. The reason is that DragonFly takes a different, non-obvious approach that is probably the most intriguing aspect of the vkernel work. The kernel's generic mmap code now recognizes a new flag, MAP_VPAGETABLE. This flag specifies that the created mapping is governed by a userspace virtual pagetable structure (a vpagetable), the address of which can be set using the new vmspace_mcontrol() system call (an extension of madvise() that accepts an extra pointer parameter) with an argument of MADV_SETMAP. This software pagetable structure is similar to most architecture-defined pagetables. The complementary vmspace_munmap(), not surprisingly, removes mappings in alternate address spaces. These are the primitives on which the memory management of the virtual kernel is built.
Table 1. New vkernel-related system calls
int vmspace_create(void *id, int type, void *data);
int vmspace_destroy(void *id);
int vmspace_ctl(void *id, int cmd, struct trapframe *tf, struct vextframe *vf);
int vmspace_mmap(void *id, void *start, size_t len, int prot, int flags, int fd, off_t offset);
int vmspace_munmap(void *id, void *start, size_t len);
int mcontrol(void *start, size_t len, int adv, void *val);
int vmspace_mcontrol(void *id, void *start, size_t len, int adv, void *val);
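To make the interplay of these calls concrete, here is a sketch of how a virtual kernel might prepare an alternate vmspace for a new vproc. The start address of 0, the use of the vkernel's RAM file descriptor (ram_fd) as backing store, the type argument of 0 and the helper's name are assumptions made for illustration; the real vkernel code is more involved.

```c
/*
 * Sketch: setting up an alternate vmspace for a new vproc using the
 * system calls from Table 1.  Argument conventions are simplified.
 */
static void
vproc_vmspace_setup(void *vproc_id, int ram_fd, void *vpagetable_root)
{
        /* Create the empty alternate address space, identified by vproc_id. */
        vmspace_create(vproc_id, 0, NULL);

        /*
         * Cover the whole user address range with a single mapping that is
         * governed by a software pagetable and backed by the vkernel's
         * "RAM" file.
         */
        vmspace_mmap(vproc_id, (void *)0, VM_MAX_USER_ADDRESS,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_VPAGETABLE | MAP_SHARED, ram_fd, 0);

        /* Tell the host kernel where this vproc's vpagetable lives. */
        vmspace_mcontrol(vproc_id, (void *)0, VM_MAX_USER_ADDRESS,
                         MADV_SETMAP, vpagetable_root);
}
```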
At this point, an overview of the virtual memory map of each vmspace associated with the vkernel process is in order. When the virtual kernel starts up, there is just one vmspace for the process and it is similar to that of any other process that has just begun executing (mainly consisting of mappings for the heap, stack, program text and libc). During its initialization, the vkernel mmap()s a disk file that serves the role of physical memory (RAM). The real kernel is instructed (via madvise(MADV_NOSYNC)) not to bother synchronizing this memory region with the disk file unless it has to, which is typically when the host kernel is trying to reclaim RAM pages in a low memory situation. This is imperative; otherwise all the vkernel "RAM" data would be treated as valuable by the host kernel and would periodically be flushed to disk. Using MADV_NOSYNC, the vkernel data will be lost if the system crashes, just like actual RAM, which is exactly what we want: it is up to the vkernel to sync user data back to its own filesystem. The memory file is mmap()ed specifying MAP_VPAGETABLE. It is in this region that all memory allocations (both for the virtual kernel and its processes) take place. The pmap module, whose role is to manage the vpagetables according to instructions from higher-level VM code, also uses this space to create the vpagetables for user processes.
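A minimal sketch of the "RAM" file setup just described, assuming placeholder names (memimage_path, ram_size) and omitting all error handling, might look like this:

```c
/*
 * Sketch of the vkernel "RAM" setup: a disk file is mapped with
 * MAP_VPAGETABLE and marked MADV_NOSYNC so that the host only writes it
 * back when it needs to reclaim memory.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

void *
vkernel_ram_setup(const char *memimage_path, size_t ram_size)
{
        int fd = open(memimage_path, O_RDWR | O_CREAT, 0600);

        ftruncate(fd, ram_size);        /* size the image like physical RAM */

        void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_VPAGETABLE, fd, 0);

        /* The contents are not precious: flush only under memory pressure. */
        madvise(ram, ram_size, MADV_NOSYNC);
        return ram;
}
```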
On the real kernel side, new vmspaces that are created for these user processes are very simple in structure. They consist of a single vm_map_entry that covers the 0 - VM_MAX_USER_ADDRESS address range. This entry is of type MAPTYPE_VPAGETABLE and the address for its vpagetable has been set (by means of vmspace_mcontrol()) to point to the vkernel's RAM, wherever the pagetable for the process has been allocated.
The true vm_map_entry structures are managed by the vkernel's VM subsystem. For every one of its processes, the virtual kernel maintains the whole set of vmspace/vm_map, vm_map_entry and vm_object objects that we described earlier. Additionally, the pmap module needs to keep its own (not to be described here) data structures. All of the above objects reside in the vkernel's "physical" memory. Here we see the primary benefit of the DragonFly approach: no matter how fragmented an alternate vmspace's virtual memory map is, and independently of the amount of sharing of a given page by processes of the virtual kernel, the host kernel expends a fixed (and reasonably sized) amount of memory for each vmspace. Also, after the initial vmspace creation, the host kernel's VM system is taken out of the equation (except for page fault handling), so that when vkernel processes require VM services, they only compete among themselves for CPU time and not with the host processes. Compared to the "obvious" solution, this approach saves large amounts of host kernel memory and achieves a higher degree of isolation.
Now that we have grasped the larger picture, we can finally examine our "interesting" case: a page fault occurs while the vkernel process is using one of its alternate vmspaces. In that case, the vm_fault() code will notice it is dealing with a mapping governed by a virtual pagetable and proceed to walk the vpagetable much like the hardware would. Suppose there is a valid entry in the vpagetable for the faulting address; then the host kernel simply updates its own pagetable and returns to userspace. If, on the other hand, the search fails, the pagefault is passed on to the vkernel which has the necessary information to update the vpagetable or deliver a signal to the faulting vproc if the access was invalid. Assuming the vpagetable was updated, the next time the vkernel process runs on the vmspace that caused the fault, the host kernel will be able to correct its own pagetable after searching the vpagetable as described above.
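The walk itself is conceptually simple. The sketch below shows a two-level lookup in the spirit of what vm_fault() does for vpagetable-governed mappings; the vpte_t layout, the VPTE_V bit as used here and the vpte_frame_to_kva() helper are assumptions of this sketch, and the real code must additionally cope with vpagetable pages that are themselves not resident (a complication discussed below).

```c
#include <stdint.h>

typedef uint64_t vpte_t;
#define VPTE_V          0x01    /* "entry valid" bit (assumed layout) */
#define PAGE_SHIFT      12
#define PT_ENTRIES      1024    /* entries per pagetable level (assumed) */

/*
 * Two-level software-pagetable lookup.  On success the host kernel can
 * instantiate a matching entry in its own, hardware pagetable; on failure
 * the fault is punted to the vkernel.
 */
static int
vpagetable_lookup(const vpte_t *pagedir, uintptr_t va, vpte_t *result)
{
        vpte_t pde, pte;
        const vpte_t *ptbl;

        pde = pagedir[(va >> (PAGE_SHIFT + 10)) % PT_ENTRIES];
        if ((pde & VPTE_V) == 0)
                return -1;                      /* punt to the vkernel */

        ptbl = vpte_frame_to_kva(pde);          /* hypothetical helper */
        pte = ptbl[(va >> PAGE_SHIFT) % PT_ENTRIES];
        if ((pte & VPTE_V) == 0)
                return -1;                      /* punt to the vkernel */

        *result = pte;
        return 0;
}
```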
There are a few complications to take into account, however. First of all, any level of the vpagetable might be paged out. This is straightforward to deal with; the code that walks the vpagetable must make sure that a page is resident before it tries to access it. Secondly, the real and virtual kernels must work together to update the accessed and modified bits in the virtual pagetable entries (VPTEs). Traditionally, in architecture-defined pagetables, the hardware conveniently sets those bits for us. The hardware knows nothing about vpagetables, though. Ignoring the problem altogether is not a viable solution; the availability of these two bits is necessary for the VM subsystem algorithms to decide whether a page is heavily used and whether it can be easily reclaimed (see [AST06]). Note that the different semantics of the modified and accessed bits mean that we are dealing with two separate problems.
Keeping track of the accessed bit turns out to require a minimal amount of work. To explain this, we need to give a short, incomplete description of how the VM subsystem uses the accessed bit to keep memory reference statistics for every physical page it manages. When the DragonFly pageout daemon is awakened and begins scanning pages, it first instructs the pmap subsystem to free whatever memory it can that is consumed by process pagetables, updating the physical page reference and modification statistics from the PTEs it throws away. Until the next scan, any pages that are referenced will cause a pagefault and the fault code will have to set the accessed bit on the corresponding pte (or vpte). As a result, the hardware is not involved[4]. The behavior of the virtual kernel is identical to that just sketched above, except that in this case page faults are more expensive, since they must always go through the real kernel.
While the advisory nature of the accessed bit gives us the flexibility to trade a little accuracy in the statistics for a considerable gain in performance, this is not an option when emulating the modified bit. If the data has been altered via some mapping, the (now "dirty") page cannot be reused at will; it is imperative that the data be written to the backing object first. The software is not notified when the hardware sets the modified bit in a pagetable entry. To work around this, when a vproc requests a writable mapping for a page, the host kernel will disallow writes in the pagetable entry that it instantiates. This way, when the vproc tries to modify the page data, a fault will occur and the relevant code will set the modified bit in the vpte. Only after that are writes to the page enabled. Naturally, when the vkernel clears the modified bit in the vpagetable it must force the real kernel to invalidate the hardware pte, so that it can detect further writes to the page and again set the bit in the vpte, if necessary.
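The logic can be summarized in a few lines. The sketch below is only an approximation of the real fault-handling code: VPTE_M and VPTE_RW stand for the modified and writable bits of a vpte, and the returned protection is what the host would use for the hardware pte it instantiates.

```c
/*
 * Approximate modified-bit emulation.  A writable vpte whose VPTE_M bit
 * is clear is mapped read-only in hardware, so the first write faults
 * and gives the kernel a chance to set VPTE_M before enabling writes.
 */
static int
emulate_modified_bit(vpte_t *vpte, int fault_type)
{
        int hw_prot;

        if (fault_type & VM_PROT_WRITE) {
                *vpte |= VPTE_M;        /* record the write in software */
                hw_prot = VM_PROT_READ | VM_PROT_WRITE;
        } else if ((*vpte & VPTE_RW) && (*vpte & VPTE_M) == 0) {
                /* No write seen yet: keep the hardware pte read-only. */
                hw_prot = VM_PROT_READ;
        } else {
                hw_prot = (*vpte & VPTE_RW) ?
                    (VM_PROT_READ | VM_PROT_WRITE) : VM_PROT_READ;
        }
        return hw_prot;         /* protection for the real pte */
}
```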
Floating Point Context
Another issue that requires special treatment is saving and restoring of the state of the processor's Floating Point Unit (FPU) when switching vprocs. To the real kernel, the FPU context is a per-thread entity. On a thread switch, it is always saved[5] and machine-dependent arrangements are made that will force an exception ("device not available" or DNA) the first time that the new thread (or any thread that gets scheduled later) tries to access the FPU[6]. This gives the kernel the opportunity to restore the proper FPU context so that floating point computations can proceed as normal.
Now, the vkernel needs to perform similar tasks if one of its vprocs throws an exception because of missing FPU context. The only difficulty is that it is the host kernel that initially receives the exception. When such a condition occurs, the host kernel must first restore the vkernel thread's FPU state, if another host thread was given ownership of the FPU in the meantime. The virtual kernel, on the other hand, is only interested in the exception if it has some saved context to restore. The correct behavior is obtained by having the vkernel inform the real kernel whether it also needs to handle the DNA exception. This is done by setting a new flag (PGEX_FPFAULT) in the trapframe argument of vmspace_ctl(). Of course, the flag need not be set if the to-be-run virtualized thread is the owner of the currently loaded FPU state. The existence of PGEX_FPFAULT causes the vkernel host thread to be tagged with FP_VIRTFP. If the host kernel notices said tag when handed a "device not available" condition, it will restore the context that was saved for the vkernel thread, if any, before passing the exception on to the vkernel.
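A sketch of the switch-to-vproc path follows, with the caveat that the structure layout, the tf_xflags field and the VMSPACE_CTL_RUN command name are assumptions made for illustration only.

```c
/*
 * Before running a vproc, the vkernel asks to be notified of DNA
 * exceptions only if the thread it is about to run does not own the
 * FPU state that is currently loaded.  Names are illustrative.
 */
struct trapframe tf = vp->vp_trapframe;
struct vextframe vf = vp->vp_extframe;

if (vp != fpu_owner)
        tf.tf_xflags |= PGEX_FPFAULT;   /* host will tag us with FP_VIRTFP */

vmspace_ctl(vp->vp_vmspace_id, VMSPACE_CTL_RUN, &tf, &vf);
```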
Platform drivers
Just as with ports to new hardware platforms, the changes made for the vkernel are confined to a few parts of the source tree, and most of the kernel code is not aware that it is in fact running as a user process. This applies to filesystems, the VFS, the network stack and core kernel code. Hardware device drivers are not needed or wanted; instead, special drivers have been developed to allow the vkernel to communicate with the outside world. In this subsection, we will briefly mention a couple of places in the platform code where the virtual kernel needs to differentiate itself from the host kernel. These examples should make clear how much easier it is to emulate platform devices using the high-level primitives provided by the host kernel than to deal directly with the hardware.
Timer. The DragonFly kernel works with two timer types. The first provides an abstraction for a per-CPU timer (called a systimer) implemented on top of a cputimer. The latter is just an interface to a platform-specific timer. The vkernel implements one cputimer using kqueue's EVFILT_TIMER. kqueue is the BSD high-performance event notification and filtering facility described in some detail in [Lemon00]. The EVFILT_TIMER filter provides access to a periodic or one-shot timer. In DragonFly, kqueue has been extended with signal-driven I/O support (see [Stevens99]) which, coupled with a signal mailbox delivery mechanism, allows for fast, very low overhead signal reception. The vkernel makes full use of both extensions.
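For readers unfamiliar with EVFILT_TIMER, the small standalone program below shows the basic facility the vkernel's cputimer builds on; the signal-driven delivery that the vkernel actually uses is omitted here.

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>

int
main(void)
{
        struct kevent kev, ev;
        int kq = kqueue();

        /* Register a periodic timer (ident 1) firing every 10 ms. */
        EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 10, NULL);
        kevent(kq, &kev, 1, NULL, 0, NULL);

        for (int i = 0; i < 5; i++) {
                kevent(kq, NULL, 0, &ev, 1, NULL);      /* wait for a tick */
                printf("tick, %lld expiration(s)\n", (long long)ev.data);
        }
        return 0;
}
```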
Console. The system console is simply the terminal from which the vkernel was executed. It should be mentioned that the vkernel applies special treatment to some of the signals that might be generated by this terminal; for instance, SIGINT will drop the user to the in-kernel debugger.
Virtual Device Drivers
The virtual kernel disk driver exports a standard disk driver interface and provides access to an externally specified file. This file is treated as a disk image and is accessed with a combination of the read(), write() and lseek() system calls. It is probably the simplest driver in the kernel tree, even with the memio driver for /dev/zero included in the comparison.
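The core of such a driver reduces to a few system calls per request. The sketch below is illustrative only; the request structure and the biodone() completion hook are stand-ins, not the actual driver interfaces.

```c
/*
 * Sketch of a file-backed disk "strategy" routine: each block request
 * becomes an lseek() plus a read() or write() on the image file.
 */
static void
vdisk_strategy(int imgfd, struct bio_req *req)
{
        ssize_t n;

        lseek(imgfd, req->offset, SEEK_SET);
        n = req->is_write ? write(imgfd, req->data, req->length)
                          : read(imgfd, req->data, req->length);

        req->error = (n == (ssize_t)req->length) ? 0 : EIO;
        biodone(req);                   /* complete the request upstream */
}
```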
The vke driver implements an ethernet interface (in the vkernel) that tunnels all the packets it gets to the corresponding tap interface in the host kernel. It is a typical example of a network interface driver, with the exception that its interrupt routine runs in response to an event notification from kqueue. A properly configured vke interface is the vkernel's window to the outside world.
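The receive side of such an interface might look roughly like the sketch below, where a kqueue read event on the host tap file descriptor plays the role of the receive interrupt; ifinput_guest() is a placeholder for handing the packet to the vkernel's network stack.

```c
/*
 * "Interrupt" handler for a vke-style interface: drain every packet the
 * host has queued on the tap device and feed it to the guest stack.
 */
static void
vke_rx_intr(int tap_fd)
{
        unsigned char pkt[2048];
        ssize_t n;

        while ((n = read(tap_fd, pkt, sizeof(pkt))) > 0)
                ifinput_guest(pkt, (size_t)n);  /* placeholder hand-off */
}
```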
Bibliography
[McKusick04] The Design and Implementation of the FreeBSD Operating System, Kirk McKusick and George Neville-Neil.
[Dillon00] Design elements of the FreeBSD VM system, Matthew Dillon.
[Lemon00] Kqueue: A generic and scalable event notification facility, Jonathan Lemon.
[AST06] Operating Systems Design and Implementation, Andrew Tanenbaum and Albert Woodhull.
[Provos03] Improving Host Security with System Call Policies, Niels Provos.
[Stevens99] UNIX Network Programming, Volume 1: Sockets and XTI, Richard Stevens.
Notes
[4] Well not really, but a thorough VM walkthrough is out of scope here.

[5] This is not optimal; x86 hardware supports fully lazy FPU save, but the current implementation does not take advantage of that yet.

[6] The kernel will occasionally make use of the FPU itself, but this does not directly affect the vkernel related code paths.

[7] Or any alternative stack the user has designated for signal delivery.
A peek at the DragonFly Virtual Kernel (part 2)
Posted Apr 19, 2007 4:44 UTC (Thu) by rsidd (subscriber, #2582) [Link] (4 responses)

According to your own FAQ:
LWN, initially, was "Linux Weekly News." That name has been deemphasized over time as we have moved beyond just the weekly coverage, and as we have looked at the free software community as a whole.
Yet now you write:
For those who questioned why a BSD development appears on this page, the answer is simple: there is value in seeing how others have solved common problems.
Nice to know that the BSD community is now regarded as "others". In that case why not Windows articles too?
A peek at the DragonFly Virtual Kernel (part 2)
Posted Apr 19, 2007 10:32 UTC (Thu) by pointwood (guest, #2814) [Link] (2 responses)

When the Windows source code is open and freely available, sure ;)
A peek at the DragonFly Virtual Kernel (part 2)
Posted Apr 20, 2007 6:47 UTC (Fri) by rsidd (subscriber, #2582) [Link] (1 responses)

My point is, LWN now claims to be about the whole free software community. In that case no justification should be necessary for this article. If, on the other hand, it is "useful to see how others do things", we can equally learn from closed-source systems.
A peek at the DragonFly Virtual Kernel (part 2)
Posted Apr 20, 2007 7:06 UTC (Fri) by pointwood (guest, #2814) [Link]

True, no justification should be necessary, but some people apparently thought it was outside the scope of lwn.net.
In regards to closed source systems, the fact that the source (among other things) isn't available makes that quite a bit more difficult.
BSD on LWN
Posted Apr 20, 2007 20:21 UTC (Fri) by giraffedata (guest, #1954) [Link]

The issue is putting it on this page, not in LWN. Though the title doesn't convey it, the "Kernel" page is specifically about the Linux kernel. So Dragonfly does have to be relevant to the Linux kernel for this article to fit on this page.
Other pages have contained BSD news without justification.
non-Linux
Posted Apr 19, 2007 12:14 UTC (Thu) by ldo (guest, #40946) [Link] (3 responses)
I have absolutely no problem seeing articles about development for other operating systems in these pages. Just as long as they're open-source.
non-Linux
Posted Apr 19, 2007 16:12 UTC (Thu) by MisterIO (guest, #36192) [Link] (2 responses)

ReactOS for example?
non-Linux
Posted Apr 20, 2007 6:47 UTC (Fri) by ldo (guest, #40946) [Link]

Yeah, why not? ReactOS, Haiku, FreeDOS, the BSDs of course, whatever. Let a thousand flowers bloom. :)
non-Linux
Posted Apr 26, 2007 9:47 UTC (Thu) by farnz (subscriber, #17727) [Link]

ReactOS could make for really interesting articles; the NT kernel (until they imported the video drivers into it) is an interesting example of a microkernel architecture that actually worked in real life. If ReactOS really are doing a compatible implementation of the NT kernel, I'd be interested to hear about the problems they've solved.
Cool, how about FreeBSD kernel content
Posted Apr 20, 2007 7:38 UTC (Fri) by dion (guest, #2764) [Link]

I'd really like to see some coverage of the FreeBSD kernel; there are plenty of details in "their" kernel that are very well thought out and that Linux people could learn from.

It would also be nice to have some context for Linux kernel developments. Things like traffic shaping would be nice to see compared, for example, as I'm vaguely aware of "their" pipes metaphor, which is clearly much more usable than the horrid and cryptic Linux shaping.