Friday 15 July 2016

how does per cpu variable works in linux ?

Per-CPU variables

Per-CPU variables are one of the kernel features. You can understand the meaning of this feature by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.
The kernel provides an API for creating per-cpu variables - the DEFINE_PER_CPU macro:
#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")
This macro defined in the include/linux/percpu-defs.h as many other macros for work with per-cpu variables. Now we will see how this feature is implemented.
Take a look at the DECLARE_PER_CPU definition. We see that it takes 2 parameters: type and name, so we can use it to create per-cpu variables, for example like this:
DEFINE_PER_CPU(int, per_cpu_n)
We pass the type and the name of our variable. DEFINE_PER_CPU calls the DEFINE_PER_CPU_SECTION macro and passes the same two parameters and empty string to it. Let's look at the definition of the DEFINE_PER_CPU_SECTION:
#define DEFINE_PER_CPU_SECTION(type, name, sec)    \
         __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES  \
         __typeof__(type) name
#define __PCPU_ATTRS(sec)                                                \
         __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))     \
         PER_CPU_ATTRIBUTES
where section is:
#define PER_CPU_BASE_SECTION ".data..percpu"
After all macros are expanded we will get a global per-cpu variable:
__attribute__((section(".data..percpu"))) int per_cpu_n
It means that we will have a per_cpu_n variable in the .data..percpu section. We can find this section in the vmlinux:
.data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
              CONTENTS, ALLOC, LOAD, DATA
Ok, now we know that when we use the DEFINE_PER_CPU macro, a per-cpu variable in the .data..percpu section will be created. When the kernel initializes it calls the setup_per_cpu_areas function which loads the .data..percpu section multiple times, one section per CPU.
Let's look at the per-CPU areas initialization process. It starts in the init/main.c from the call of the setup_per_cpu_areas function which is defined in the arch/x86/kernel/setup_percpu.c.
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
        NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
The setup_per_cpu_areas starts from the output information about the maximum number of CPUs set during kernel configuration with the CONFIG_NR_CPUS configuration option, actual number of CPUs, nr_cpumask_bits is the same that NR_CPUS bit for the new cpumask operators and number of NUMA nodes.
We can see this output in the dmesg:
$ dmesg | grep percpu
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
In the next step we check the percpu first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has percpu_alloc command line parameters which provides the type of the first chunk allocator. We can read about it in the kernel documentation:
percpu_alloc=    Select which percpu first chunk allocator to use.
        Currently supported values are "embed" and "page".
        Archs may support subset or none of the    selections.
        See comments in mm/percpu.c for details on each
        allocator.  This parameter is primarily    for debugging
        and performance comparison.
The mm/percpu.c contains the handler of this command line option:
early_param("percpu_alloc", percpu_alloc_setup);
Where the percpu_alloc_setup function sets the pcpu_chosen_fc variable depends on the percpu_alloc parameter value. By default the first chunk allocator is auto:
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
If the percpu_alloc parameter is not given to the kernel command line, the embed allocator will be used which embeds the first percpu chunk into bootmem with the memblock. The last allocator is the first chunk page allocator which maps the first chunk with PAGE_SIZE pages.
As I wrote above, first of all we make a check of the first chunk allocator type in the setup_per_cpu_areas. We check that first chunk allocator is not page:
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    ...
    ...
    ...
}
If it is not PCPU_FC_PAGE, we will use the embed allocator and allocate space for the first chunk with the pcpu_embed_first_chunk function:
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
                        dyn_size, atom_size,
                        pcpu_cpu_distance,
                        pcpu_fc_alloc, pcpu_fc_free);
As shown above, the pcpu_embed_first_chunk function embeds the first percpu chunk into bootmem then we pass a couple of parameters to the pcup_embed_first_chunk. They are as follows:
  • PERCPU_FIRST_CHUNK_RESERVE - the size of the reserved space for the static percpu variables;
  • dyn_size - minimum free size for dynamic allocation in bytes;
  • atom_size - all allocations are whole multiples of this and aligned to this parameter;
  • pcpu_cpu_distance - callback to determine distance between cpus;
  • pcpu_fc_alloc - function to allocate percpu page;
  • pcpu_fc_free - function to release percpu page.
We calculate all of these parameters before the call of the pcpu_embed_first_chunk:
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
        atom_size = PMD_SIZE;
#else
        atom_size = PAGE_SIZE;
#endif
If the first chunk allocator is PCPU_FC_PAGE, we will use the pcpu_page_first_chunk instead of the pcpu_embed_first_chunk. After that percpu areas up, we setup percpu offset and its segment for every CPU with the setup_percpu_segment function (only for x86 systems) and move some early data from the arrays to the percpu variables (x86_cpu_to_apicid, irq_stack_ptr and etc...). After the kernel finishes the initialization process, we will have loaded N .data..percpu sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain an uninitialized variable created with the DEFINE_PER_CPU macro.
The kernel provides an API for per-cpu variables manipulating:
  • get_cpu_var(var)
  • put_cpu_var(var)
Let's look at the get_cpu_var implementation:
#define get_cpu_var(var)     \
(*({                         \
         preempt_disable();  \
         this_cpu_ptr(&var); \
}))
The Linux kernel is preemptible and accessing a per-cpu variable requires us to know which processor the kernel is running on. So, current code must not be preempted and moved to the another CPU while accessing a per-cpu variable. That's why, first of all we can see a call of the preempt_disable function then a call of the this_cpu_ptr macro, which looks like:
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
and
#define raw_cpu_ptr(ptr)        per_cpu_ptr(ptr, 0)
where per_cpu_ptr returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've created a per-cpu variable and made modifications to it, we must call the put_cpu_var macro which enables preemption with a call of preempt_enable function. So the typical usage of a per-cpu variable is as follows:
get_cpu_var(var);
...
//Do something with the 'var'
...
put_cpu_var(var);
Let's look at the per_cpu_ptr macro:
#define per_cpu_ptr(ptr, cpu)                             \
({                                                        \
        __verify_pcpu_ptr(ptr);                           \
         SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));  \
})
As I wrote above, this macro returns a per-cpu variable for the given cpu. First of all it calls __verify_pcpu_ptr:
#define __verify_pcpu_ptr(ptr)
do {
    const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;
    (void)__vpp_verify;
} while (0)
which makes the given ptr type of const void __percpu *,
After this we can see the call of the SHIFT_PERCPU_PTR macro with two parameters. As first parameter we pass our ptr and for second parameter we pass the cpu number to the per_cpu_offset macro:
#define per_cpu_offset(x) (__per_cpu_offset[x])
which expands to getting the x element from the __per_cpu_offset array:
extern unsigned long __per_cpu_offset[NR_CPUS];
where NR_CPUS is the number of CPUs. The __per_cpu_offset array is filled with the distances between cpu-variable copies. For example all per-cpu data is X bytes in size, so if we access __per_cpu_offset[Y], X*Y will be accessed. Let's look at the SHIFT_PERCPU_PTR implementation:
#define SHIFT_PERCPU_PTR(__p, __offset)                                 \
         RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
RELOC_HIDE just returns offset (typeof(ptr)) (__ptr + (off)) and it will return a pointer to the variable.
That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the include/linux/percpu-defs.h magic.
Let's again look at the algorithm of getting a pointer to a per-cpu variable:
  • The kernel creates multiple .data..percpu sections (one per-cpu) during initialization process;
  • All variables created with the DEFINE_PER_CPU macro will be relocated to the first section or for CPU0;
  • __per_cpu_offset array filled with the distance (BOOT_PERCPU_OFFSET) between .data..percpu sections;
  • When the per_cpu_ptr is called, for example for getting a pointer on a certain per-cpu variable for the third CPU, the __per_cpu_offset array will be accessed, where every index points to the required CPU.

Tuesday 12 July 2016

How softirq implementation and works in linux ?

Softirqs are rarely used; tasklets are a much more common form of bottom half. Nonetheless, because tasklets are built on softirqs, So we'll cover them first. The softirq code lives in kernel/softirq.c.


Implementation of Softirqs

Softirqs are statically allocated at compile-time. Unlike tasklets, you cannot dynamically register and destroy softirqs. Softirqs are represented by the softirq_action structure, which is defined in <linux/interrupt.h>:
/*
 * structure representing a single softirq entry
 */
struct softirq_action {
        void (*action)(struct softirq_action *); /* function to run */
        void *data;                              /* data to pass to function */
};

A 32-entry array of this structure is declared in kernel/softirq.c:
static struct softirq_action softirq_vec[32];
 
Each registered softirq consumes one entry in the array. Consequently, 
there can be a maximum of 32 registered softirqs. Note that this cap is 
fixed the maximum number of registered softirqs cannot be dynamically 
changed.
 
The Softirq Handler
The prototype of a softirq handler, action, looks like:
void softirq_handler(struct softirq_action *)
When the kernel runs a softirq handler, it executes this action function with a pointer to the corresponding softirq_action structure as its lone argument. For example, if my_softirq pointed to an entry in the softirq_vec array, the kernel would invoke the softirq handler function as
my_softirq->action(my_softirq)


It seems a bit odd that the kernel passes the entire structure, and not just the data value, to the softirq handler. This trick allows future additions to the structure without requiring a change in every softirq handler. Softirq handlers can retrieve the data value, if they need to, simply by dereferencing their argument and reading the data member.
A softirq never preempts another softirq. In fact, the only event that can preempt a softirq is an interrupt handler. Another softirqeven the same onecan run on another processor, however.

Executing Softirqs:
A registered softirq must be marked before it will execute. This is called raising the softirq. Usually, an interrupt handler marks its softirq for execution before returning. Then, at a suitable time, the softirq runs. Pending softirqs are checked for and executed in the following places:
  • In the return from hardware interrupt code
  • In the ksoftirqd kernel thread
  • In any code that explicitly checks for and executes pending softirqs, such as the networking subsystem
Regardless of the method of invocation, softirq execution occurs in do_softirq(). The function is really quite simple. If there are pending softirqs, do_softirq() loops over each one, invoking its handler. Let's look at a simplified variant of the important part of do_softirq():

u32 pending = softirq_pending(cpu);

if (pending) {
    struct softirq_action *h = softirq_vec;

    softirq_pending(cpu) = 0;

    do {
        if (pending & 1)
            h->action(h);
        h++;
        pending >>= 1;
    } while (pending);
} 
 
This snippet is the heart of softirq processing. It checks for, and executes, any pending softirqs. Specifically,

  1. It sets the pending local variable to the value returned by the softirq_pending() macro. This is a 32-bit mask of pending softirqsif bit n is set, the nth softirq is pending.
  2. Now that the pending bitmask of softirqs is saved, it clears the actual bitmask.
    This actually occurs with local interrupts disabled, but that is omitted in this simplified example. If interrupts were not disabled, a softirq could have been raised (and thus be pending) in the intervening time between saving the mask and clearing it. This would result in incorrectly clearing a pending bit.
  3. The pointer h is set to the first entry in the softirq_vec.
  4. If the first bit in pending is set, h->action(h) is called.
  5. The pointer h is incremented by one so that it now points to the second entry in the softirq_vec array.
  6. The bitmask pending is right-shifted by one. This tosses the first bit away, and moves all other bits one place to the right. Consequently, the second bit is now the first (and so on).
  7. The pointer h now points to the second entry in the array and the pending bitmask now has the second bit as the first. Repeat the previous steps.
  8. Continue repeating until pending is zero, at which point there are no more pending softirqs and the work is done. Note, this check is sufficient to ensure h always points to a valid entry in softirq_vec because pending has at most 32 set bits and thus this loop executes at most 32 times.
    Code flow return from hardware interrupt code:

    asm_do_IRQ(irq, struct pt_regs *regs) //arch/arm/kernel/irq.c
       | 
    irq_exit()      //defined in kernel/softirq.c:
        |
    invoke_softirq();
         |
    do_softirq()
         |
    __do_softirq()
       {
            pending = local_softirq_pending();
            struct softirq_action *h = softirq_vec;
            do {
                     if(pending & 1)
                      {
                         h ->action(h);    // the softirq handler gets called
                         ----
                      }
                     h++;
                    pending >> = 1;
                  } while(pending);
        pending = local_softirq_pending();
        if(pending)
              wakeup_softirqd();
    }