Updates to LKD3

From Crashcourse Wiki

Jump to: navigation, search

Contents

What's this page all about?

This is an unofficial page documenting various errors in, omissions from or changes to "Linux Kernel Development" (3rd ed) by Robert Love, or what I call "LKD3" as opposed to the earlier "LKD2" edition. I'm throwing this open as a crowdsourcing project so if you want to play along, keep reading.

Not surprisingly, the instant any Linux book is published (especially any Linux kernel book), it slowly but inexorably starts to get out of date and, in the event there is ever an "LKD4", it would be nice to have a comprehensive list of necessary corrections and updates, as well as just being nice to have a list of what's changed in the book, anyway. And the things I'd like to document here fall into a few different categories.

First, there are things that should have been updated since LKD2 but were simply and mistakenly overlooked and, as I was the technical editor, I'd have to take the blame for that. As technical editor, it was my job to dig into the source and make sure it matched the manuscript, but without question, I missed some things so those should be pointed out.

The second category to document are changes that have occurred since my editing and the book publication and, as time goes on, there will certainly be more and more of that. As a simple example, consider this note: Updates_to_LKD3#The_atomic_t_type_loses_its_volatile. It seems like a simple enough change, and it's worth documenting, but let's use that as a demonstration of how you can dig into these changes.

Once you notice a discrepancy between the book and the latest kernel source, you might be able to use the git blame feature to see which git commit was responsible for that change:

$ git blame include/linux/types.h
... snip ...
81880d60 (Anton Blanchard           2010-05-17 14:34:57 +1000 213)      int counter;
... snip ...

And once you have the commit, you can list it in its entirety with:

$ git show 81880d60
It may not be essential to document every change to this extent, but perhaps it doesn't hurt to understand the rationale behind such things.

As another example, on p. 192 of LKD3, there's a reference to the alleged mutex macro DECLARE_MUTEX, which doesn't exist anymore, so a note should be (and will be) made of that.

Finally, there are things that can be documented that aren't so much errors or updates so much as features that deserve some (or more) mention. These could be features that are either new since LKD3 came out, or are features that have been there for a while but weren't mentioned in the book at all, but perhaps deserve to be.

In any event, if you're feeling ambitious, you can get your own copy of LKD3 and start perusing it, and if you find anything that you think deserves mention, drop me an e-mail at rpjday@crashcourse.ca with as much info as you can pull together, and I'll arrange that it gets added in the appropriate place with suitable credit given to the reporter.

Proposed/possible changes for LKD4

Completely new features that definitely should be added

threaded interrupts

TASK_KILLABLE

Features that have been extended/enhanced since LKD3

Features that entirely obsolete/replace old features

Stuff that's been removed

  • io_remap_page_range (commit 03ff3efb64c8a64cb8cdf35e36bead5c78eb3024)
  • simple_vma_open() (not sure when)

Stuff we're not sure about but we'll just keep track of

Tickless OS in more detail

Scheduler domains

Huge pages

Simple errors/typoes

Figure 3.2 (p. 26)

"struct thread_struct" should read "struct thread_info"

(Reported by Julie Sullivan, kernelnewbies list)

Page 94, reference to inode_find_handle()

Apparently, no such routine exists anymore, and hasn't for quite some time, so pick another example.

STUFF TO THINK ABOUT

  • strace
  • How about appendices for stuff like common kernel-space programming macros?
  • Show readers how to generate kernel documentation from kernel-doc?
  • CONFIG_IRQ_DOMAIN_DEBUG
  • More on gcc extensions used in kernel.
  • What about an entire chapter on SMP issues?
  • Early on, discuss common kernel macros you'll see everywhere (PAGE_SIZE, THREAD_SIZE)

New kernel features/updates since book publication, by chapter

Introduction to the Linux Kernel

More architectures (p. 3)

List additional architectures (blackfin, microblaze, etc, etc.), definitely mention MIPS explicitly given its current popularity in embedded devices/routers.

Getting Started with the Kernel

Kernel configuration with Qt-based make xconfig (p. 14)

In addition to mentioning the Gtk-based make gconfig command, it's worth mentioning the Qt-based make xconfig.

Installing the New Kernel (p. 16)

A couple observations here. First, with GRUB version 2, the configuration file is now /boot/grub/grub.cfg and you wouldn't edit it manually, you would instead install the new kernel and modules and run update-grub, as you would do these days on newer versions of Ubuntu.

Also, rather than just install the new modules with:

# make modules_install

it's worth knowing that you can save piles of disk space by stripping those module files during installation with:

# make INSTALL_MOD_STRIP=1 modules_install

Usage of inline (pp. 18-19)

This is currently a bit of a mess, with usage of all of inline, __inline and __inline__, as well as the usage of __always_inline and now, in the drivers/staging directory, defines of the form:

#define INLINE     __inline

as if things weren't confusing enough. If this is ever clarified, it really needs to be explained.

Process Management

Process Descriptor and the Task Structure

I think it would be useful to discuss get_current(), current_stack_pointer() and current_thread_info() for multiple architectures. In general, any process will have:

  • beginning of stack
  • current stack pointer
  • thread_info structure
  • task_struct (from thread_info)

Header files:

  • sched.h
  • current.h
  • thread_info.h

As an example, show how both x86 and ARM handle the above.

This is going to be lengthy since there have been some fundamental changes and I'm still working my way through the current code. First, here's the beginning of the task_struct structure from sched.h:

struct task_struct {
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
        void *stack;      <-- hello?
        ... snip ...

Yeah, it now contains the actual stack pointer for that task -- that change was made way back in 2007:

commit f7e4217b007d1f73e7e3cf10ba4fea4a608c603f
Author: Roman Zippel <zippel@linux-m68k.org>
Date:   Wed May 9 02:35:17 2007 -0700

    rename thread_info to stack
    
    This finally renames the thread_info field in task structure to stack, so that
    the assumptions about this field are gone and archs have more freedom about
    placing the thread_info structure.
    
    Nonbroken archs which have a proper thread pointer can do the access to both
    current thread and task structure via a single pointer.
... snip ...
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 28000b1..17b72d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,7 +817,7 @@ struct prio_array;
 
 struct task_struct {
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
-       struct thread_info *thread_info;
+       void *stack;

I won't go too much further with this right now, but it seems clear that every architecture needs to define a current_thread_info() routine, quite likely architecture-dependent and in the header file arch/?/include/asm/thread_info.h. Here's for Alpha:

/* How to get the thread information struct from C.  */
register struct thread_info *__current_thread_info __asm__("$8");
#define current_thread_info()  __current_thread_info

Here's for ARM:

/*
 * how to get the thread information struct from C
 */
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
        register unsigned long sp asm ("sp");
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}

And here's for x86:

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

In some cases (as in x86 just above), current_thread_info is defined in terms of current_stack_pointer, so some architectures need to define that:

/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __used;

Just to finish things off for now, every architecture needs to define current and get_current(), in their version of current.h.

Process Scheduling

ENQUEUE_MIGRATE is now ENQUEUE_WAKING (p. 54)

Change in kernel/sched_fair.c based on:

commit 371fd7e7a56a5c136d31aa980011bd2f131c3ef5
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Wed Mar 24 16:38:48 2010 +0100

    sched: Add enqueue/dequeue flags

    In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

      ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
      ENQUEUE_WAKING - the enqueue has relative vruntime due to
                       having sched_class::task_waking() called,
      ENQUEUE_HEAD - the waking task should be places on the head
                     of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <new-submission>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

This section probably deserves some expansion.

(Reported by Julie Sullivan, kernelnewbies list.)

System Calls

Kernel Data Structures

Linked lists

The current explanation in the book is wrong in places.

What makes the kernel linked list unique?

Personally, I think there's a different way to explain what makes the kernel linked list implementation so novel. The standard implementation defines the linked list node with one or two pointers, and typically a void * pointer to the payload -- that is, the actual data to be tracked. This has a couple fairly obvious consequences.

First, and trivially, the nodes of the linked list structure live outside of the payload; the payload data itself for each list node exists elsewhere and the list nodes are simply keeping track of that data. And given that the payload data is exactly that -- pure data -- it's easy to see that any single payload is capable of being on an unlimited number of different lists. All of that should be obvious.

The kernel implementation changes this drastically by adding the pointer/link structure:

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

into each object you want to track, as in:

struct fox {
    ... all my data member fields ...
    struct list_head animals;
};
and this has two obvious consequences.

The first is that any data object you want to store on a list has to carry with in, in its own structure definition, a structure that links it to another payload object on that list. No longer do lists keep track of payload objects elsewhere; instead, payload objects embed their links inside their definition.

There is another consequence that isn't quite so obvious, until it's pointed out. Unlike the original data-only payloads, since each data object now needs to embed its link information, those payload objects can no longer participate on an unlimited number of lists. Instead, each payload has to explicitly define a struct list_head member field for each list of which it wants to be a member. In short, by simply looking at a structure definition, you can tell what lists that structure is allowed to be on.

A perfect example is from <linux/sched.h>, in the task_struct structure, where you can see just a subset of the number of different lists that this structure is supposed to be on:

struct task_struct {
    ... snip ...
    struct list_head rcu_node_entry;
    struct list_head tasks;
    struct list_head children;
    struct list_head sibling;
    ... etc etc ...
};

As you can see, given that a potential payload data object has to embed its own sibling pointer(s), a structure will have to explicitly define a member field for each list on which it wants to participate.

What's so special about a linked list "head" node?

It's important to understand how the head node of a kernel linked list works. Every linked list is defined by a head node that, while it is part of the normal pointer structure -- that is, it's on the list -- it does not represent any data; that is, it is not enclosed by any payload structure. This means that, when you're iterating through a linked list, jumping from one struct list_head to the next, you must explicitly know which of those nodes represents the head of the list so that you can avoid processing it.

That this is true can be seen from the definition of the list_for_each() routine in list.h:

#define list_for_each(pos, head) \
        for (pos = (head)->next; prefetch(pos->next), pos != (head); \
                pos = pos->next)

Note how, given the head of a list and a variable to be used for the iterator, that iteration clearly starts processing list elements after the head node, and terminates at the final list element before the head node, thereby avoiding processing that head node.

Another example that drives this home can be seen in the source file net/sctp/associola.c, where an infinite list iteration loop clearly has to skip the head node each time it runs across it:

while (1) {
        /* Skip the head. */
        if (pos->next == head)
                pos = head->next;
        else
                pos = pos->next;
        ... snip ...

An empty list is defined by a head node whose next pointer simply points back to itself, as you can see from list.h:

/**
 * list_empty - tests whether a list is empty
 * @head: the list to test.
 */
static inline int list_empty(const struct list_head *head)
{
        return head->next == head;
}

One last obvious observation: you must always keep track of the head node of a list. If you forget which node in a list is the head node, there's no way to recover that information and bad things will happen.

Incorrect description of head nodes on pp. 90-91

If you accept all of the above, it seems that the way head nodes are described in LKD3 is inaccurate. On p. 90, we read that a head node is "a special pointer that refers to your linked list, without being a list node itself." But as we've already seen, a head node is a member of the list; it just needs to be recognized as the head node so that it is never dereferenced to return its (non-existent) data container.

Page 91 also seems wrong, stating that, "because the lists are circular, you can generally pass any element for head." That can't be correct -- it's critically important to never lose track of where the head node of a list is, and to treat it differently from the rest of the nodes.

Page 93 could also use some tweaking since it claims that traversing a linked list iterates until each entry has been visited," whereas I'd add the qualifier that that doesn't include the head node.

The list_is_last() macro

Dave Hylands points out another useful macro defined in list.h that tells you whether you're at the last element in a list just before the head node:

static inline int list_is_last(const struct list_head *list,
                                const struct list_head *head)
{
        return list->next == head;
}

The definition of that macro should be obvious, but one wonders how useful it is since the standard list iteration routines will stop at that point, anyway. Obviously, it would make sense if there's something you need to do immediately upon recognizing that you're at the end of a list that can't wait for after the body of the loop.

(As an aside, Boolean tests like that really should be defined to return Boolean values, not ints.)

Sorting loop entries

Another feature I'd never noticed before is the ability to "sort" the entries in a loop, as defined in list_sort.h:

#include <linux/types.h>

struct list_head;

void list_sort(void *priv, struct list_head *head,
               int (*cmp)(void *priv, struct list_head *a,
                          struct list_head *b));

There doesn't appear to be a lot of use of that facility, so one wonders how much kernel code is sorting list entries manually, unaware that that feature exists.

As one possible example, consider this snippet from drivers/base/bus.c:

static void 
device_insertion_sort_klist(
    struct device *a,
    struct list_head *list,
    int (*compare)(const struct device *a,
                   const struct device *b))
{
        struct list_head *pos;
        struct klist_node *n;
        struct device_private *dev_prv;
        struct device *b;

        list_for_each(pos, list) {
                n = container_of(pos, struct klist_node, n_node);
                dev_prv = to_device_private_bus(n);
                b = dev_prv->device;
                if (compare(a, b) <= 0) {
                        list_move_tail(&a->p->knode_bus.n_node,
                                       &b->p->knode_bus.n_node);
                        return;
                }
        }
        list_move_tail(&a->p->knode_bus.n_node, list);
}

I'm wondering if that's not somehow rewritable using the list sort library routine.

Debugging list manipulation with CONFIG_DEBUG_LIST

There is a kernel CONFIG variable CONFIG_DEBUG_LIST that adds a certain level of debugging to kernel linked lists, but I have to admit I'm a bit confused by it. Here's a snippet from list.h:

#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}
#else
extern void list_del(struct list_head *entry);
#endif

Obviously, if you don't select that option, the implementation of list_del() is fairly simple, whereas if you do select it, you get this implementation from lib/list_debug.c:

void list_del(struct list_head *entry)
{
        WARN(entry->next == LIST_POISON1,
                "list_del corruption, next is LIST_POISON1 (%p)\n",
                LIST_POISON1);
        WARN(entry->next != LIST_POISON1 && entry->prev == LIST_POISON2,
                "list_del corruption, prev is LIST_POISON2 (%p)\n",
                LIST_POISON2);
        WARN(entry->prev->next != entry,
                "list_del corruption. prev->next should be %p, "
                "but was %p\n", entry, entry->prev->next);
        WARN(entry->next->prev != entry,
                "list_del corruption. next->prev should be %p, "
                "but was %p\n", entry, entry->next->prev);
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}

What's confusing is that, even without that kernel parameter, the regular implementation of that routine poisons the deleted node's link pointers. Why?

Queues/kfifo (MORE TO COME)

I may have more to say here, but it's worth pointing out that the kernel source tree now has a samples/kfifo directory.

Yes, there's a fair bit more to say about queues that isn't in LKD3. As a single example, there's a macro DEFINE_KFIFO() that combines the functionality of DECLARE_KFIFO() and INIT_KFIFO(), as well as more routines, so this section can be expanded considerably.

Maps (TO DO)

Binary trees (TO DO)

Interrupts and Interrupt Handlers

Bottom Halves and Deferring Work

An Introduction to Kernel Synchronization

Since this chapter is simply an introduction to the principles of kernel synchronization, I'm not sure there's much here that will need updating. The actual kernel mechanisms for concurrency and synchronization are in the next chapter.

Kernel Synchronization Methods

The atomic_t type loses its volatile

On p. 176, the typedef for atomic_t is shown as:

typedef struct {
        volatile int counter;
} atomic_t;

However, these days, that same typedef is in the header file <linux/types.h> as simply:

typedef struct {
        int counter;
} atomic_t;

If you want the rationale behind the change:

$ git show 81880d60
commit 81880d603d00c645e0890d0a44d50711c503b72b
Author: Anton Blanchard <anton@samba.org>
Date:   Mon May 17 14:34:57 2010 +1000

    atomic_t: Remove volatile from atomic_t definition
    ... snip ...

DECLARE_MUTEX and init_MUTEX don't exist anymore

On pp. 192-3 in the section on semaphores, there are references to DECLARE_MUTEX, init_MUTEX and init_MUTEX_LOCKED that no longer exist. These have been replaced by DEFINE_MUTEX.

It might be worth breaking out a section on mutexes all on its own, separate from general semaphores, rather than having it a bit scattered as it is now.

Timers and Timer Management

"A Tickless OS"

On p. 212, the side note refers to the option to configure a tickless kernel, CONFIG_HZ. That should probably read CONFIG_NO_HZ.

On a more general note, given that tickless behaviour appears to be the standard these days, I suspect a good chunk of this chapter could be rewritten to reflect that.

Memory Management

It looks like there can be a fair bit added here so this is just a first pass; it will almost certainly be reorganized as more content is added.

Pages (pp. 231-2)

I have to admit, I'm still a little puzzled as to how the count fields work in a struct page, especially with the difference between head and tail, and compound pages defined in <linux/page-flags.h>. This will require more reading.

Zones (p. 233)

That section in the book describes four "primary" memory zones defined in <linux/mmzone.h> -- ZONE_DMA, ZONE_DMA32, ZONE_NORMAL and ZONE_HIGHMEM -- then refers to "two other, less notable ones." But, AFAICT, the only other zone defined in that header file is ZONE_MOVABLE; as in, just the additional one.

Is there really just that single additional zone? And might it be worth explaining, if only briefly, even if there is no further reference to it?

The Virtual Filesystem

The Block I/O Layer

Anticipatory scheduler (pp. 302-3)

Now gone:

commit 492af6350a5ccf087e4964104a276ed358811458
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Sat Oct 3 09:37:51 2009 +0200

    block: remove the anticipatory IO scheduler

    AS is mostly a subset of CFQ, so there's little point in still
    providing this separate IO scheduler. Hopefully at some point we
    can get down to one single IO scheduler again, at least this brings
    us closer by having only one intelligent IO scheduler.

    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

(Reported by Mulyadi Santosa, kernelnewbies list.)

The Process Address Space

do_mmap() is gone as of May 30, 2012:

commit dc982501d9643ab0c117e7d87562857ce234652d
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed May 30 20:11:57 2012 -0400

    kill do_mmap() completely
    
    just pull into vm_mmap()
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

The Page Cache and Page Writeback

Devices and Modules

Debugging

kdb

Yes, we probably should have a short discussion of kdb here. Coming soon ...

Portability

Patches, Hacking and the Community

Things that deserve more emphasis?

This section is just my personal perspective on what beginning kernel programmers should know more about; not directly related to the book.

Useful macros in <linux/kernel.h> and other header files

Way too many programmers re-invent the wheel in terms of simple macros for alignment, integer values, determining the size of an array and so on. Take some time and peruse that header file to see what's already defined for you.

Other header files in that directory that might be useful:

  • log2.h, for power-of-2 macros
  • ... more as I think of them ...

getconf

I might be tempted to demonstrate the use of the getconf command fairly early. It can then be used later on to do things like display the machine page size with:

$ getconf PAGESIZE

or everything with:

$ getconf -a

Personal tools