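1. The shell forks, and the child process execs the new program. The shell waits for the child to complete, for example with wait4(). A daemon is the exception: after it has been exec'd it forks again, the original process exits (so the shell's wait returns), and the child becomes a session leader with setsid() and keeps running.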
2. fork() creates a copy of the shell process itself, and the exec call then replaces this address space with the new program. exec starts by reading the beginning of the executable (its header) from disk. This involves the file system, the block layer, the page cache and the device driver (the disk driver if the file is on a disk).
3. The kernel must be built with handlers for the binary formats it is to run. Handlers in the Linux kernel include ELF, a.out and scripts (files starting with #!), among others. The header of the executable identifies its format, and if the kernel has a matching binary format handler, the data read so far is handed over to that handler.
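4. For example, with the ELF handler, the header gives the location of the text (and the other segments) in the file. The handler maps these segments into memory with mmap instead of reading them in full, together with the program interpreter (the dynamic linker) if the program is dynamically linked. The dynamic linker then maps the directly linked shared libraries from user space, as the open and mmap calls for libc.so.6 in the trace below show. Once the text and the required libraries are mapped, the process can start; the remaining pages are read in by demand paging when they are first accessed.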
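Steps 1 and 2 can be sketched from user space as follows. This is a minimal sketch assuming a Linux system with glibc (wait4() is a BSD/Linux extension, hence the _DEFAULT_SOURCE define); the helper name run_and_wait is hypothetical, and a real shell also handles job control, redirection and signals.

```c
#define _DEFAULT_SOURCE            /* exposes wait4() in glibc's headers */
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv the way a simple shell runs a foreground
 * command. Returns the exit status, 128 + signal number if the child was
 * killed by a signal, or -1 if fork() or wait4() failed. */
int run_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();              /* child starts as a copy of the caller */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);       /* replaces the child's address space */
        _exit(127);                  /* only reached if exec failed */
    }

    int status;
    struct rusage usage;             /* wait4() also reports resource usage */
    if (wait4(pid, &status, 0, &usage) < 0)
        return -1;

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

For example, char *const cmd[] = {"ls", "-l", "/usr/bin/", NULL}; followed by run_and_wait(cmd) lists /usr/bin/ and returns the exit status of ls. Returning 127 when exec fails follows the usual shell convention for "command not found".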
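For step 4, the ELF header points to the program header table. Each PT_LOAD entry describes a segment (such as the text) by file offset, size and virtual address, and a PT_INTERP entry names the dynamic linker. The sketch below reads that table from user space with the structures in glibc's <elf.h>, assuming a 64-bit ELF file. elf_summary and summarize_elf are hypothetical names, and the kernel's ELF handler does considerably more: it maps each PT_LOAD segment instead of just counting it.

```c
#define _DEFAULT_SOURCE            /* exposes pread() under strict C modes */
#include <assert.h>
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

struct elf_summary {
    unsigned long entry;      /* e_entry: where execution starts */
    int load_segments;        /* number of PT_LOAD entries (text, data, ...) */
    int has_interp;           /* 1 if a PT_INTERP (dynamic linker) is named */
};

/* Hypothetical helper: summarise the program headers of a 64-bit ELF file.
 * Returns 0 on success, -1 on I/O error or if the file is not ELF64. */
int summarize_elf(const char *path, struct elf_summary *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = -1;
    Elf64_Ehdr eh;
    if (pread(fd, &eh, sizeof eh, 0) != (ssize_t)sizeof eh ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||   /* "\177ELF" */
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto done;

    memset(out, 0, sizeof *out);
    out->entry = (unsigned long)eh.e_entry;

    /* Walk the program header table that the ELF header points to. */
    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        off_t off = (off_t)eh.e_phoff + (off_t)i * eh.e_phentsize;
        if (pread(fd, &ph, sizeof ph, off) != (ssize_t)sizeof ph)
            goto done;
        if (ph.p_type == PT_LOAD)
            out->load_segments++;
        else if (ph.p_type == PT_INTERP)
            out->has_interp = 1;
    }
    rc = 0;
done:
    close(fd);
    return rc;
}
```

For example, summarize_elf("/proc/self/exe", &s) inspects the running program itself; has_interp is 1 for a dynamically linked executable.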
Example (strace output for a small program):
strace ./a.out
execve("./a.out", ["./a.out"], [/* 45 vars */]) = 0
brk(0) = 0x1056000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4ce4a9000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=88130, ...}) = 0
mmap(NULL, 88130, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff4ce493000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ff4cdec4000
mprotect(0x7ff4ce07e000, 2097152, PROT_NONE) = 0
mmap(0x7ff4ce27e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7ff4ce27e000
mmap(0x7ff4ce284000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff4ce284000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4ce492000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4ce490000
arch_prctl(ARCH_SET_FS, 0x7ff4ce490740) = 0
mprotect(0x7ff4ce27e000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7ff4ce4ab000, 4096, PROT_READ) = 0
munmap(0x7ff4ce493000, 88130) = 0
exit_group(4195565) = ?
+++ exited with 237 +++
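The last two lines of the trace show different numbers because only the low 8 bits of the value passed to exit_group reach the parent: 4195565 is 0x4004ED, and 0x4004ED & 0xFF = 0xED = 237. A minimal sketch that reproduces this truncation; the helper name exit_status_of is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: terminate a child with the given code and return the
 * status the parent observes. Only the low 8 bits survive. */
int exit_status_of(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* glibc's _exit() issues exit_group */

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);   /* equals code & 0xff */
}
```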
Implementation of execve
• The entry point of the system call is the architecture-dependent sys_execve function. This function quickly delegates its work to the system-independent do_execve routine.
• do_execve has the following prototype:
int do_execve(char *filename, char __user *__user *argv, char __user *__user *envp, struct pt_regs *regs)
– filename: path of the file to execute
– argv: the argument vector (for example, ls -l /usr/bin/)
– envp: the environment of the program
• The notation is slightly clumsy because argv and envp are arrays of pointers, and both the pointer to the array itself and all pointers in the array are located in the userspace portion of the virtual address space.
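From user space, the same arguments are passed to execve(2): a path and two NULL-terminated arrays of string pointers. The kernel walks each array up to the terminating NULL pointer to count the entries before copying the strings into the new process. A minimal sketch, assuming /bin/sh exists; count_entries and run_with_env are hypothetical helpers.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: count the pointers before the terminating NULL,
 * as the kernel does for argv and envp before copying the strings. */
int count_entries(char *const v[])
{
    int n = 0;
    while (v != NULL && v[n] != NULL)
        n++;
    return n;
}

/* Hypothetical helper: execve /bin/sh in a child with an explicit argv and
 * envp, and return its exit status (or -1 on error). */
int run_with_env(char *const argv[], char *const envp[])
{
    assert(count_entries(argv) >= 1);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve("/bin/sh", argv, envp);  /* envp becomes the whole environment */
        _exit(127);                     /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, with args = {"sh", "-c", "exit $CODE", NULL} and env = {"CODE=7", NULL}, the shell sees only CODE in its environment and exits with status 7.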
Main steps of do_execve:
• After the file to be executed has been opened, bprm_mm_init handles several administrative tasks:
– mm_alloc creates a new instance of mm_struct to manage the process address space.
– init_new_context is an architecture-specific function that initializes this instance.
– __bprm_mm_init sets up an initial stack.
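• prepare_binprm supplies a number of values derived from the parent process (above all, the effective UID and GID, which the set-user-ID and set-group-ID bits of the file can change) and reads the first bytes of the file into bprm->buf so that the binary format handlers can inspect the header.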
• search_binary_handler is used at the end of do_execve to find a suitable binary format handler for the file.
• The binary format handler performs the following actions:
– It releases all resources used by the old process.
– It maps the application into the virtual address space.
– It sets the instruction pointer and some other architecture-specific registers so that, when the scheduler selects the process, execution begins at the program's entry point (for a dynamically linked program, the entry point of the dynamic linker), which eventually calls main.
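The rule prepare_binprm applies to the effective IDs can be written as a small pure function: a set-user-ID file makes the file's owner the new effective UID, and a set-group-ID file does the same for the group, but only if the file is also group-executable; a filesystem mounted nosuid disables both. This is a simplified user-space model of the check in older kernels (it ignores capabilities, security modules and ptrace restrictions), and exec_creds and compute_exec_creds are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

struct exec_creds {
    uid_t euid;
    gid_t egid;
};

/* Hypothetical model of how the effective IDs for the new program are chosen
 * from the caller's IDs and the executable's mode and ownership. */
struct exec_creds compute_exec_creds(struct exec_creds cur, mode_t mode,
                                     uid_t file_uid, gid_t file_gid,
                                     bool nosuid_mount)
{
    struct exec_creds next = cur;          /* default: keep the caller's IDs */

    if (nosuid_mount)
        return next;                       /* nosuid mount: bits are ignored */
    if (mode & S_ISUID)
        next.euid = file_uid;              /* set-user-ID: run as file owner */
    if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
        next.egid = file_gid;              /* set-group-ID needs group x bit */
    return next;
}
```

For example, a caller with euid 1000 that executes a root-owned file with mode 04755 gets euid 0, which is how programs such as passwd gain privileges.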
Interpreting Binary Formats
Each binary format is represented by an instance of struct linux_binfmt, declared in <linux/binfmts.h>:
struct linux_binfmt {
	struct linux_binfmt *next;
	struct module *module;
	int (*load_binary)(struct linux_binprm *, struct pt_regs *regs);
	int (*load_shlib)(struct file *);
	int (*core_dump)(long signr, struct pt_regs *regs, struct file *file);
	unsigned long min_coredump;	/* minimal dump size */
};
The function pointers are:
1. load_binary to load normal programs.
2. load_shlib to load a shared library, that is, a dynamic library.
3. core_dump to write a core dump if there is a program error.
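search_binary_handler walks the list of registered formats (linked through next) and calls each load_binary until one accepts the file; a handler that does not recognise the format returns -ENOEXEC so that the next one is tried. The user-space model below mimics that protocol with a cut-down struct. The names fake_binfmt, load_elf, load_script and search_handlers are hypothetical, and the real load_binary receives a struct linux_binprm rather than a raw buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down, hypothetical model of struct linux_binfmt. */
struct fake_binfmt {
    struct fake_binfmt *next;
    const char *name;
    int (*load_binary)(const unsigned char *buf, size_t len);
};

static int load_elf(const unsigned char *buf, size_t len)
{
    if (len < 4 || memcmp(buf, "\177ELF", 4) != 0)
        return -ENOEXEC;            /* not mine: let the next handler try */
    return 0;                       /* a real handler would map the file here */
}

static int load_script(const unsigned char *buf, size_t len)
{
    if (len < 2 || buf[0] != '#' || buf[1] != '!')
        return -ENOEXEC;
    return 0;                       /* a real handler would run the interpreter */
}

static struct fake_binfmt script_fmt = { NULL, "script", load_script };
static struct fake_binfmt elf_fmt = { &script_fmt, "elf", load_elf };
static struct fake_binfmt *formats = &elf_fmt;   /* head of the list */

/* Walk the list like search_binary_handler: stop at the first handler that
 * returns anything other than -ENOEXEC. *matched receives its name. */
int search_handlers(const unsigned char *buf, size_t len, const char **matched)
{
    assert(matched != NULL);
    *matched = NULL;
    for (struct fake_binfmt *fmt = formats; fmt != NULL; fmt = fmt->next) {
        int ret = fmt->load_binary(buf, len);
        if (ret != -ENOEXEC) {
            *matched = fmt->name;
            return ret;
        }
    }
    return -ENOEXEC;                /* no handler recognised the format */
}
```

Returning -ENOEXEC when nothing matches corresponds to execve(2) failing with ENOEXEC for a file in an unrecognised format.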
Exiting Processes
• Processes must terminate with the exit system call.
• The entry point for this call is the sys_exit function, which takes the exit code as its parameter.
• Its implementation is not particularly interesting because it immediately delegates its work to do_exit.
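In user space, programs normally call the C library's exit(), which runs atexit() handlers and flushes stdio buffers before making the system call, while _exit() makes the system call directly. On Linux, glibc uses exit_group, which ends every thread of the process; this is the exit_group line in the strace output above. The sketch below observes the difference through a pipe; atexit_handler_ran and note_handler_ran are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int handler_fd = -1;        /* write end of the pipe, set in the child */

static void note_handler_ran(void)
{
    char c = 'x';
    ssize_t n = write(handler_fd, &c, 1);   /* tell the parent we ran */
    (void)n;
}

/* Hypothetical helper: returns 1 if an atexit() handler ran in a child that
 * exited via exit() (use_raw == 0) or _exit() (use_raw != 0), 0 if it did
 * not, and -1 on error. */
int atexit_handler_ran(int use_raw)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    fflush(NULL);                  /* avoid duplicating buffered output in the child */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        handler_fd = fds[1];
        atexit(note_handler_ran);
        if (use_raw)
            _exit(0);              /* system call only: handlers are skipped */
        exit(0);                   /* runs handlers, flushes stdio, then exits */
    }

    close(fds[1]);                 /* so read() sees EOF once the child is gone */
    char c;
    ssize_t n = read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : (n == 0 ? 0 : -1);
}
```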
Call flow of do_execve:
do_execve()                                          // fs/exec.c
 |
 ---> do_execve_common(filename, argv, envp, regs);
       |
       ---> file = open_exec(filename);
       ---> sched_exec();
       ---> bprm_mm_init(bprm);
       |     |
       |     ---> bprm->mm = mm = mm_alloc();
       |     |     |
       |     |     ---> allocate_mm();
       |     |     ---> mm_init(mm, current);
       |     |           |
       |     |           ---> mm_alloc_pgd(mm);
       |     |                 |
       |     |                 ---> pgd_alloc();
       |     ---> init_new_context(current, mm);    // arch-specific, e.g. arch/arm/mm/context.c
       |     ---> __bprm_mm_init(bprm);
       |           |
       |           ---> insert_vm_struct(mm, vma);
       ---> prepare_binprm(bprm);
       ---> search_binary_handler(bprm, regs);      // fs/exec.c
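__bprm_mm_init creates the VMA for the new stack and inserts it with insert_vm_struct, and the argument and environment strings are later copied into the top of that area. Once the program runs, the mapping is visible in /proc/self/maps, labelled [stack]. A small check, assuming a Linux /proc filesystem; find_stack_mapping is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: locate the [stack] mapping of the calling process in
 * /proc/self/maps. Returns 0 and fills *start/*end on success, -1 otherwise. */
int find_stack_mapping(unsigned long *start, unsigned long *end)
{
    assert(start != NULL && end != NULL);

    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    char line[4096];
    int rc = -1;
    while (fgets(line, sizeof line, maps) != NULL) {
        /* Each line starts with "start-end" in hex, followed by permissions. */
        if (strstr(line, "[stack]") != NULL &&
            sscanf(line, "%lx-%lx", start, end) == 2) {
            rc = 0;
            break;
        }
    }
    fclose(maps);
    return rc;
}
```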