Friday 4 March 2022

Linux system call flow in ARM64

ARMv8 has four exception levels.

EL0 --  user applications

EL1 --  OS kernel 

EL2 --  Hypervisor for virtualization platforms

EL3 --  Secure Monitor firmware

Execution moves between exception levels, from EL0 up to EL3, by means of exceptions. An exception is raised at one level and handled at a higher level.

The synchronous exception from user space (EL0) to the kernel (EL1) is raised with the svc (supervisor call) instruction. An application running on Linux therefore issues svc with registers set to appropriate values. To learn what those values are, let's see how the kernel handles svc.



Kernel :

Note : https://elixir.bootlin.com/linux/v5.16.10/source/arch/arm64/kernel/entry.S

Vector table :

Applications [EL0] can raise several kinds of exceptions, which are taken by the kernel [EL1]. The handlers for these exceptions are stored in a vector table. In ARMv8, the register that holds the base address of that vector table is VBAR_EL1 [Vector Base Address Register for EL1].

When an exception occurs, the processor must execute handler code which corresponds to the exception. The location in memory where the handler is stored is called the exception vector. In the ARM architecture, exception vectors are stored in a table, called the exception vector table. Each Exception level has its own vector table, that is, there is one for each of EL3, EL2 and EL1. The table contains instructions to be executed, rather than a set of addresses. Vectors for individual exceptions are located at fixed offsets from the beginning of the table. 
The virtual address of each table base is set by the Vector Base Address Registers VBAR_EL3, VBAR_EL2 and VBAR_EL1.

ARM infocenter.

The exception handlers reside in contiguous memory, and each vector can hold up to 32 instructions. Depending on the type of exception, execution starts at a particular offset from the base address in VBAR_EL1. Below is the ARM64 vector table. For example, when a synchronous exception is raised from EL0 in AArch64 state, the handler at VBAR_EL1 + 0x400 executes to handle it.


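Offset from VBAR_EL1   Exception type   Exception set level
+0x000                 Synchronous      Current EL with SP0
+0x080                 IRQ/vIRQ         Current EL with SP0
+0x100                 FIQ/vFIQ         Current EL with SP0
+0x180                 SError/vSError   Current EL with SP0
+0x200                 Synchronous      Current EL with SPx
+0x280                 IRQ/vIRQ         Current EL with SPx
+0x300                 FIQ/vFIQ         Current EL with SPx
+0x380                 SError/vSError   Current EL with SPx
+0x400                 Synchronous      Lower EL using AArch64
+0x480                 IRQ/vIRQ         Lower EL using AArch64
+0x500                 FIQ/vFIQ         Lower EL using AArch64
+0x580                 SError/vSError   Lower EL using AArch64
+0x600                 Synchronous      Lower EL using AArch32
+0x680                 IRQ/vIRQ         Lower EL using AArch32
+0x700                 FIQ/vFIQ         Lower EL using AArch32
+0x780                 SError/vSError   Lower EL using AArch32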
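
The offsets in this table follow a simple pattern: four groups of four entries, each entry 0x80 bytes apart. Here is a minimal C sketch of that arithmetic. The enum and function names are made up for illustration and do not exist in the kernel.

```c
/* Illustrative only: these names are hypothetical, not kernel symbols. */

/* Where the exception came from (one group of four vectors each). */
enum vec_group {
    VEC_CUR_EL_SP0       = 0,
    VEC_CUR_EL_SPX       = 1,
    VEC_LOWER_EL_AARCH64 = 2,
    VEC_LOWER_EL_AARCH32 = 3
};

/* Kind of exception inside a group. */
enum vec_type {
    VEC_SYNC   = 0,
    VEC_IRQ    = 1,
    VEC_FIQ    = 2,
    VEC_SERROR = 3
};

/* Each vector holds up to 32 instructions of 4 bytes: 0x80 bytes. */
#define VEC_ENTRY_SIZE (32u * 4u)

/* Offset of a vector from the table base held in VBAR_EL1. */
static unsigned int vec_offset(enum vec_group group, enum vec_type type)
{
    return ((unsigned int)group * 4u + (unsigned int)type) * VEC_ENTRY_SIZE;
}
```

For instance, vec_offset(VEC_LOWER_EL_AARCH64, VEC_SYNC) gives the +0x400 entry that a 64-bit application's svc ends up in.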


Linux defines the vector table at arch/arm64/kernel/entry.S +493. Each kernel_ventry slot is 32 instructions long. As an ARMv8 instruction is 4 bytes long, the next kernel_ventry starts at +0x80 from the current one.

ENTRY(vectors)
	kernel_ventry	1, t, 64, sync		// Synchronous EL1t
	kernel_ventry	1, t, 64, irq		// IRQ EL1t
	kernel_ventry	1, t, 64, fiq		// FIQ EL1t
	kernel_ventry	1, t, 64, error		// Error EL1t

	kernel_ventry	1, h, 64, sync		// Synchronous EL1h
	kernel_ventry	1, h, 64, irq		// IRQ EL1h
	kernel_ventry	1, h, 64, fiq		// FIQ EL1h
	kernel_ventry	1, h, 64, error		// Error EL1h

	kernel_ventry	0, t, 64, sync		// Synchronous 64-bit EL0
	kernel_ventry	0, t, 64, irq		// IRQ 64-bit EL0
	kernel_ventry	0, t, 64, fiq		// FIQ 64-bit EL0
	kernel_ventry	0, t, 64, error		// Error 64-bit EL0

	kernel_ventry	0, t, 32, sync		// Synchronous 32-bit EL0
	kernel_ventry	0, t, 32, irq		// IRQ 32-bit EL0
	kernel_ventry	0, t, 32, fiq		// FIQ 32-bit EL0
	kernel_ventry	0, t, 32, error		// Error 32-bit EL0
END(vectors)

The kernel loads the address of the vector table into VBAR_EL1 at arch/arm64/kernel/head.S +429:


adr_l   x8, vectors    // load VBAR_EL1 with virtual
msr     vbar_el1, x8   // vector table address
isb                    // instruction synchronization barrier

VBAR_EL1 is a system register, so it cannot be accessed like a general-purpose register. The dedicated instructions msr and mrs must be used to write and read system registers.

Instruction            Description
adr_l x8, vectors      loads the address of the vector table into general-purpose register X8
msr vbar_el1, x8       moves the value in X8 to the system register VBAR_EL1
isb                    instruction synchronization barrier

System call flow in Kernel

Let's see what happens when an application issues the svc instruction. From the table, we can see that for an AArch64 synchronous exception from a lower level the offset is +0x400. In the Linux vector definition, the entry at VBAR_EL1 + 0x400 is el0t_64_sync. It calls el0t_64_sync_handler, defined at arch/arm64/kernel/entry-common.c +615:


asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
{
	unsigned long esr = read_sysreg(esr_el1);	// read the syndrome register

	switch (ESR_ELx_EC(esr)) {
	case ESR_ELx_EC_SVC64:
		el0_svc(regs);			// SVC in 64-bit state
		break;
	case ESR_ELx_EC_DABT_LOW:
		el0_da(regs, esr);		// data abort in EL0
		break;
	case ESR_ELx_EC_IABT_LOW:
		el0_ia(regs, esr);		// instruction abort in EL0
		break;
	case ESR_ELx_EC_FP_ASIMD:
		el0_fpsimd_acc(regs, esr);	// FP/ASIMD access
		break;
	case ESR_ELx_EC_SVE:
		el0_sve_acc(regs, esr);		// SVE access in EL0
		break;
	case ESR_ELx_EC_FP_EXC64:
		el0_fpsimd_exc(regs, esr);	// FP exception
		break;
	case ESR_ELx_EC_SYS64:
	case ESR_ELx_EC_WFx:
		el0_sys(regs, esr);		// configurable trap
		break;
	case ESR_ELx_EC_SP_ALIGN:
		el0_sp(regs, esr);		// stack alignment exception
		break;
	case ESR_ELx_EC_PC_ALIGN:
		el0_pc(regs, esr);		// PC alignment exception
		break;
	case ESR_ELx_EC_UNKNOWN:
		el0_undef(regs);		// unknown reason
		break;
	case ESR_ELx_EC_BTI:
		el0_bti(regs);			// Branch Target Identification exception
		break;
	case ESR_ELx_EC_BREAKPT_LOW:
	case ESR_ELx_EC_SOFTSTP_LOW:
	case ESR_ELx_EC_WATCHPT_LOW:
	case ESR_ELx_EC_BRK64:
		el0_dbg(regs, esr);		// debug exception
		break;
	case ESR_ELx_EC_FPAC:
		el0_fpac(regs, esr);
		break;
	default:
		el0_inv(regs, esr);
	}
}

A synchronous exception can have several causes, and the cause is recorded in the syndrome register esr_el1. The handler compares the exception class field of the syndrome register against predefined macros and branches to the corresponding subroutine.
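
To make the decoding step concrete, here is a small user-space sketch of what ESR_ELx_EC() does: the exception class lives in bits [31:26] of the syndrome value. The macro and function names below are my own stand-ins. The two class values (0x15 for SVC from AArch64, 0x24 for a data abort from a lower EL) are the architectural encodings.

```c
#include <stdint.h>

/* Stand-in names for illustration; the kernel uses ESR_ELx_* macros. */
#define ESR_DEMO_EC_SHIFT     26
#define ESR_DEMO_EC_MASK      (UINT64_C(0x3F) << ESR_DEMO_EC_SHIFT)

#define ESR_DEMO_EC_SVC64     0x15   /* SVC executed in AArch64 state */
#define ESR_DEMO_EC_DABT_LOW  0x24   /* data abort from a lower EL */

/* Extract the exception class, bits [31:26] of the syndrome value. */
static unsigned int esr_demo_ec(uint64_t esr)
{
    return (unsigned int)((esr & ESR_DEMO_EC_MASK) >> ESR_DEMO_EC_SHIFT);
}

/* For an SVC, the low 16 bits carry the immediate from "svc #imm". */
static unsigned int esr_demo_svc_imm(uint64_t esr)
{
    return (unsigned int)(esr & 0xFFFFu);
}

/* Returns 1 when the syndrome value describes a 64-bit system call. */
static int esr_demo_is_syscall(uint64_t esr)
{
    return esr_demo_ec(esr) == ESR_DEMO_EC_SVC64;
}
```

A syndrome value of 0x56000000 decodes to class 0x15 with immediate 0, which is what a plain svc #0 from a 64-bit application produces.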

In the system call case, control branches to el0_svc, which calls do_el0_svc. They are defined at arch/arm64/kernel/entry-common.c +599 and arch/arm64/kernel/syscall.c +178 as follows:


/*
 * SVC handler.
 */
static void noinstr el0_svc(struct pt_regs *regs)
{
	enter_from_user_mode(regs);
	cortex_a76_erratum_1463225_svc_handler();
	do_el0_svc(regs);
	exit_to_user_mode(regs);
}

void do_el0_svc(struct pt_regs *regs)
{
	sve_user_discard();
	el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
}

static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
			   const syscall_fn_t syscall_table[])
{
	/* abridged: only the dispatch step is shown */
	invoke_syscall(regs, scno, sc_nr, syscall_table);	// the system call is invoked here
}
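
What invoke_syscall does with those four arguments can be mimicked in user space. The following is only a sketch under my own assumptions: every disp_ name is hypothetical, the table has four made-up entries, and the real kernel function does more work than this. The idea is the same though: check the number against the table size, call through the function pointer, and hand the result back in register 0.

```c
#include <errno.h>

/* Mock of the saved register file; the kernel's struct pt_regs is larger. */
struct disp_pt_regs {
    unsigned long regs[31];
};

typedef long (*disp_syscall_fn_t)(const struct disp_pt_regs *);

/* Fallback for numbers that have no implementation. */
static long disp_ni_syscall(const struct disp_pt_regs *regs)
{
    (void)regs;
    return -ENOSYS;
}

/* A made-up "system call" that adds its first two arguments. */
static long disp_add(const struct disp_pt_regs *regs)
{
    return (long)(regs->regs[0] + regs->regs[1]);
}

#define DISP_NR_SYSCALLS 4u

static const disp_syscall_fn_t disp_table[DISP_NR_SYSCALLS] = {
    disp_ni_syscall, disp_ni_syscall, disp_add, disp_ni_syscall,
};

/* Bounds-check the number, call the handler, store the result in x0. */
static void disp_invoke_syscall(struct disp_pt_regs *regs, unsigned int scno,
                                unsigned int sc_nr,
                                const disp_syscall_fn_t table[])
{
    long ret;

    if (scno < sc_nr)
        ret = table[scno](regs);
    else
        ret = disp_ni_syscall(regs);

    regs->regs[0] = (unsigned long)ret;
}

/* Helper: play the role of do_el0_svc for a two-argument call. */
static long disp_do_svc(unsigned long scno, unsigned long a, unsigned long b)
{
    struct disp_pt_regs regs = { { 0 } };

    regs.regs[0] = a;
    regs.regs[1] = b;
    regs.regs[8] = scno;   /* the system call number travels in x8 */

    disp_invoke_syscall(&regs, (unsigned int)regs.regs[8],
                        DISP_NR_SYSCALLS, disp_table);
    return (long)regs.regs[0];
}
```

Calling disp_do_svc(2, 3, 4) runs the mock handler, while an out-of-range number such as 99 comes back as -ENOSYS.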

sys_call_table

It is simply an array of function pointers indexed by system call number. It has to be placed in 4K-aligned memory. For ARM64, sys_call_table is defined at arch/arm64/kernel/sys.c +58.

#undef __SYSCALL
#define __SYSCALL(nr, sym) [nr] = sym,

/*
* The sys_call_table array must be 4K aligned to be accessible from
* kernel/entry.S.
*/
void * const sys_call_table[__NR_syscalls] __aligned(4096) = {
[0 ... __NR_syscalls - 1] = sys_ni_syscall,
#include <asm/unistd.h>
};
  • __NR_syscalls defines the number of system calls. This varies from architecture to architecture.
  • Initially every entry is set to sys_ni_syscall, the "not implemented" system call. If a system call is removed, its number is not reused. Instead, its entry is assigned the sys_ni_syscall function.
  • The include chain goes like this: arch/arm64/include/asm/unistd.h -> arch/arm64/include/uapi/asm/unistd.h -> include/asm-generic/unistd.h -> include/uapi/asm-generic/unistd.h. The last file has the definitions of all system calls. For example, the write system call is defined there as

#define __NR_write 64
__SYSCALL(__NR_write, sys_write)

  • The sys_call_table is an array of function pointers. As a function pointer is 8 bytes long on ARM64, the address of a system call's table entry is calculated by shifting the system call number scno left by 3 and adding the result to the table's base address.
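
The points above can be tried out in user space. This is a hedged sketch, not kernel code: all tbl_ names are invented, the table has 128 entries instead of __NR_syscalls, and the [first ... last] range initializer is a GNU C extension (the same one the kernel uses). The entry address is computed as base + scno * sizeof(pointer), which is base + (scno << 3) when pointers are 8 bytes.

```c
#include <errno.h>
#include <stdint.h>

typedef long (*tbl_fn_t)(void);

/* Default entry: "not implemented". */
static long tbl_ni_syscall(void)
{
    return -ENOSYS;
}

/* Stand-in for sys_write; it just reports its own number. */
static long tbl_write_stub(void)
{
    return 64;
}

#define TBL_NR_SYSCALLS 128
#define TBL_NR_write    64

/* Same pattern as the kernel table: fill everything, then override. */
#define TBL_SYSCALL(nr, sym) [nr] = sym,

static const tbl_fn_t tbl_sys_call_table[TBL_NR_SYSCALLS] = {
    [0 ... TBL_NR_SYSCALLS - 1] = tbl_ni_syscall,
    TBL_SYSCALL(TBL_NR_write, tbl_write_stub)
};

/* Address of the table slot for scno: base + scno * pointer size. */
static uintptr_t tbl_entry_addr(const tbl_fn_t *table, unsigned int scno)
{
    return (uintptr_t)table + (uintptr_t)scno * sizeof(tbl_fn_t);
}

/* Load the function pointer stored in that slot. */
static tbl_fn_t tbl_lookup(const tbl_fn_t *table, unsigned int scno)
{
    return *(const tbl_fn_t *)tbl_entry_addr(table, scno);
}
```

Compilers may warn that the write entry overrides the range initializer. That override is exactly the point of the pattern.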

System call definition

Each system call is defined with a SYSCALL_DEFINEn macro, where n is the number of arguments the system call accepts. For example, write is implemented at fs/read_write.c +652:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

This macro expands into the sys_write function definition plus several alias and wrapper functions, as described in an LWN article. The expanded function carries the asmlinkage directive. On some architectures (32-bit x86, for example) asmlinkage instructs the compiler to look for the arguments on the CPU stack instead of in registers, which helps keep system call implementations architecture independent. On ARM64 the arguments arrive in registers: the application places them in X0 to X5 (a system call takes at most six arguments), the kernel_entry macro on the EL0 sync path saves all general-purpose registers on the stack as a struct pt_regs, and the arguments are read back from there.
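
To give a feel for the expansion, here is a much simplified, hypothetical version of such a macro. It is loosely modelled on the wrapper idea only: every wrap_ name is my own invention, and the real kernel macros generate more functions and metadata than this. The macro emits a wrapper that takes the saved registers, casts the first three to the declared types, and calls the body you write after the macro.

```c
/* Mock of the saved registers; the kernel's struct pt_regs is larger. */
struct wrap_pt_regs {
    unsigned long regs[31];
};

/*
 * Hypothetical three-argument definition macro:
 *  - declares the typed body function,
 *  - defines a wrapper that unpacks regs[0..2] into typed arguments,
 *  - opens the definition of the body, which the user completes.
 */
#define WRAP_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)             \
    static long wrap_do_sys_##name(t1 a1, t2 a2, t3 a3);               \
    long wrap_sys_##name(const struct wrap_pt_regs *regs)              \
    {                                                                  \
        return wrap_do_sys_##name((t1)regs->regs[0],                   \
                                  (t2)regs->regs[1],                   \
                                  (t3)regs->regs[2]);                  \
    }                                                                  \
    static long wrap_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* A made-up system call that sums its three arguments. */
WRAP_SYSCALL_DEFINE3(sum3, unsigned int, a, long, b, long, c)
{
    return (long)a + b + c;
}
```

After preprocessing, wrap_sys_sum3(regs) is an ordinary function with one pointer argument, which is why every table entry can share a single function pointer type.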

Application Flow
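
From the kernel side we saw that do_el0_svc takes the system call number from regs->regs[8] and that write is number 64. So an application has to put the number in X8, the arguments in X0 to X2, and execute svc. Below is a minimal sketch of that, assuming GCC or Clang inline assembly on AArch64 Linux. The function name app_write is made up, and on any other platform the sketch simply falls back to the C library's write() so that it still builds.

```c
#include <stddef.h>
#include <unistd.h>

/* Issue the write system call directly; app_write is a hypothetical name. */
static long app_write(int fd, const void *buf, size_t count)
{
#if defined(__aarch64__) && defined(__linux__)
    /* Pin each value to the register the kernel expects it in. */
    register long x8 __asm__("x8") = 64;        /* __NR_write */
    register long x0 __asm__("x0") = fd;        /* 1st argument */
    register const void *x1 __asm__("x1") = buf;   /* 2nd argument */
    register size_t x2 __asm__("x2") = count;   /* 3rd argument */

    /* Trap into EL1; the kernel leaves the result in x0. */
    __asm__ volatile("svc #0"
                     : "+r"(x0)
                     : "r"(x8), "r"(x1), "r"(x2)
                     : "memory", "cc");

    return x0;   /* bytes written, or a negative errno value */
#else
    /* Not AArch64 Linux: use the libc wrapper (returns -1 on error). */
    return (long)write(fd, buf, count);
#endif
}
```

Calling app_write(1, "hello\n", 6) should print hello and return 6. Normally you would not write this by hand, since the C library wrappers do the same register setup for you.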





