Friday 25 November 2016

Building and Running Clang (LLVM)

Release Clang Versions

Clang is released as part of regular LLVM releases. You can download the release versions from http://llvm.org/releases/.
Clang is also provided in all major BSD and GNU/Linux distributions as part of their respective packaging systems. As of Xcode 4.2, Clang is the default compiler for Mac OS X.

Building Clang and Working with the Code

On Unix-like Systems

Note: As an experimental setup, you can use a single checkout containing all the projects and a simpler CMake invocation; see the LLVM doc "For developers to work with a git monorepo".
If you would like to check out and build Clang, the current procedure is as follows:
  1. Get the required tools.
  2. Check out LLVM:
    • Change directory to where you want the llvm directory placed.
    • svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
  3. Check out Clang:
    • cd llvm/tools
    • svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
    • cd ../..
  4. Check out extra Clang tools (optional):
    • cd llvm/tools/clang/tools
    • svn co http://llvm.org/svn/llvm-project/clang-tools-extra/trunk extra
    • cd ../../../..
  5. Check out Compiler-RT (optional):
    • cd llvm/projects
    • svn co http://llvm.org/svn/llvm-project/compiler-rt/trunk compiler-rt
    • cd ../..
  6. Check out libcxx (only required to build and run Compiler-RT tests on OS X; optional otherwise):
    • cd llvm/projects
    • svn co http://llvm.org/svn/llvm-project/libcxx/trunk libcxx
    • cd ../..
  7. Build LLVM and Clang:
    • mkdir build (in-tree build is not supported)
    • cd build
    • cmake -G "Unix Makefiles" ../llvm
    • make
    • This builds both LLVM and Clang in debug mode.
    • Note: For subsequent Clang development, you can just run make clang.
    • CMake can generate project files for several IDEs: Xcode, Eclipse CDT4, CodeBlocks, Qt Creator (use the CodeBlocks generator), and KDevelop3. For more details, see the Building LLVM with CMake page.
  8. If you intend to use Clang's C++ support, you may need to tell it how to find your C++ standard library headers. In general, Clang will detect the best version of libstdc++ headers available and use them, looking both for system installations of libstdc++ and for installations adjacent to Clang itself. If your configuration fits neither of these scenarios, you can use the -DGCC_INSTALL_PREFIX cmake option to tell Clang where the gcc containing the desired libstdc++ is installed.
  9. Try it out (assuming you add llvm/build/bin to your path):
    • clang --help
    • clang file.c -fsyntax-only (check for correctness)
    • clang file.c -S -emit-llvm -o - (print out unoptimized llvm code)
    • clang file.c -S -emit-llvm -o - -O3
    • clang file.c -S -O3 -o - (output native machine code)
  10. Run the testsuite:
    • make check-clang
If you encounter problems while building Clang, make sure that your LLVM checkout is at the same revision as your Clang checkout. LLVM's interfaces change over time, and mismatched revisions are not expected to work together.
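For a quick smoke test of the freshly built compiler, any small translation unit works with the commands above; a minimal, hypothetical file.c might look like this (the function name is arbitrary):

```c
/* file.c - a tiny translation unit for exercising the driver:
 *   clang file.c -fsyntax-only          (type checking only)
 *   clang file.c -S -emit-llvm -o -     (unoptimized LLVM IR)
 *   clang file.c -S -O3 -o -            (native assembly)
 */
int add(int a, int b)
{
        return a + b;
}
```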

Simultaneously Building Clang and LLVM:

Once you have checked out Clang into the llvm source tree, it will build along with the rest of llvm. To build all of LLVM and Clang together in one step, simply run make from the root LLVM directory.
Note that Clang is technically part of a separate Subversion repository. As mentioned above, the latest Clang sources are tied to the latest sources in the LLVM tree. You can update your top-level LLVM project and all (possibly unrelated) projects inside it with make update, which runs svn update on all Subversion-controlled subdirectories.

Using Visual Studio

The following details setting up for and building Clang on Windows using Visual Studio:
  1. Get the required tools:
  2. Check out LLVM:
    • svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
  3. Check out Clang:
    • cd llvm\tools
    • svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
    Note: Some Clang tests are sensitive to the line endings. Ensure that checking out the files does not convert LF line endings to CR+LF. If you use git-svn, make sure your core.autocrlf setting is false.
  4. Run CMake to generate the Visual Studio solution and project files:
    • cd ..\.. (back to where you started)
    • mkdir build (for building without polluting the source dir)
    • cd build
    • If you are using Visual Studio 2013: cmake -G "Visual Studio 12" ..\llvm
    • See the LLVM CMake guide for more information on other configuration options for CMake.
    • The above, if successful, will have created an LLVM.sln file in the build directory.
  5. Build Clang:
    • Open LLVM.sln in Visual Studio.
    • Build the "clang" project for just the compiler driver and front end, or the "ALL_BUILD" project to build everything, including tools.
  6. Try it out (assuming you added llvm/debug/bin to your path). (See the running examples from above.)
  7. See Hacking on clang - Testing using Visual Studio on Windows for information on running regression tests on Windows.
Note that once you have checked out both llvm and clang, to synchronize to the latest code base, use the svn update command in both the llvm and llvm\tools\clang directories, as they are separate repositories.

Clang Compiler Driver (Drop-in Substitute for GCC)

The clang tool is the compiler driver and front-end, which is designed to be a drop-in replacement for the gcc command. Here are some examples of how to use the high-level driver:
$ cat t.c
#include <stdio.h>
int main(int argc, char **argv) { printf("hello world\n"); }
$ clang t.c
$ ./a.out
hello world
The 'clang' driver is designed to work as closely as possible to GCC to maximize portability. The only major difference between the two is that Clang defaults to gnu99 mode while GCC defaults to gnu89 mode. If you see strange link-time errors relating to inline functions, try passing -std=gnu89 to clang.
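The difference bites with plain `inline` functions: under gnu89, an `inline` definition also emits an externally visible symbol, while under gnu99 it does not, so a call the compiler chooses not to inline can fail at link time with "undefined reference". A hedged sketch of the portable workaround (besides passing -std=gnu89) is `static inline`, which behaves identically in both modes:

```c
/* In gnu89 mode, a plain `inline int square(...)` definition would also
 * emit an external definition of `square`; in gnu99 mode it would not,
 * which is the usual source of link-time "undefined reference" errors.
 * `static inline` sidesteps the difference: the definition is local to
 * this translation unit under both standards. */
static inline int square(int x)
{
        return x * x;
}

int use_square(int n)
{
        return square(n) + 1;
}
```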

Examples of using Clang

$ cat ~/t.c
typedef float V __attribute__((vector_size(16)));
V foo(V a, V b) { return a+b*a; }

Preprocessing:

$ clang ~/t.c -E
# 1 "/Users/sabre/t.c" 1

typedef float V __attribute__((vector_size(16)));

V foo(V a, V b) { return a+b*a; }

Type checking:

$ clang -fsyntax-only ~/t.c

GCC options:

$ clang -fsyntax-only ~/t.c -pedantic
/Users/sabre/t.c:2:17: warning: extension used
typedef float V __attribute__((vector_size(16)));
                ^
1 diagnostic generated.

Pretty printing from the AST:

Note: the -cc1 argument indicates that the compiler front end, not the driver, should be run. The compiler front end has several additional Clang-specific features that are not exposed through the GCC-compatible driver interface.
$ clang -cc1 ~/t.c -ast-print
typedef float V __attribute__(( vector_size(16) ));
V foo(V a, V b) {
   return a + b * a;
}

Code generation with LLVM:

$ clang ~/t.c -S -emit-llvm -o -
define <4 x float> @foo(<4 x float> %a, <4 x float> %b) {
entry:
  %mul = fmul <4 x float> %b, %a
  %add = fadd <4 x float> %mul, %a
  ret <4 x float> %add
}
$ clang -fomit-frame-pointer -O3 -S -o - t.c # On x86_64
...
_foo:
Leh_func_begin1:
 mulps %xmm0, %xmm1
 addps %xmm1, %xmm0
 ret
Leh_func_end1:
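The vector extension used in t.c can also be exercised directly from ordinary C code. Both GCC and Clang allow subscripting vector values as an extension, so the lanes of foo()'s result can be inspected individually (the helper name here is illustrative):

```c
/* Same 4 x float vector type as in t.c above. */
typedef float V __attribute__((vector_size(16)));

V foo(V a, V b)
{
        return a + b * a;     /* element-wise multiply-add */
}

/* Extract one lane; vector subscripting is a GCC/Clang extension. */
float lane(V v, int i)
{
        return v[i];
}
```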

Low Level Virtual Machine (LLVM)

The LLVM compiler infrastructure project (formerly Low Level Virtual Machine) is a "collection of modular and reusable compiler and toolchain technologies" used to develop compiler front ends and back ends.

LLVM is written in C++ and is designed for compile-time, link-time, run-time, and "idle-time" optimization of programs written in arbitrary programming languages. Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#,[4][5][6] Common Lisp, Crystal, D, Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode, Julia, Lua, Objective-C, Pony,[7] Python, R, Ruby, Rust, CUDA, Scala,[8] and Swift.

The LLVM project started in 2000 at the University of Illinois at Urbana–Champaign, under the direction of Vikram Adve and Chris Lattner. LLVM was originally developed as a research infrastructure to investigate dynamic compilation techniques for static and dynamic programming languages. LLVM was released under the University of Illinois/NCSA Open Source License,[2] a permissive free software licence. In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM system for various uses within Apple's development systems.[9] LLVM is an integral part of Apple's latest development tools for OS X and iOS.[10] Since 2013, Sony has been using LLVM's primary front end Clang compiler in the software development kit (SDK) of its PS4 console.[11]
The name LLVM was originally an initialism for Low Level Virtual Machine, but this became increasingly less apt as LLVM became an "umbrella project" that included a variety of other compiler and low-level tool technologies, so the project abandoned the initialism.[12] Now, LLVM is a brand that applies to the LLVM umbrella project, the LLVM intermediate representation (IR), the LLVM debugger, the LLVM C++ Standard Library (with full support of C++11 and C++14[13]), etc. LLVM is administered by the LLVM Foundation. Its president is compiler engineer Tanya Lattner.


A Quick Introduction to Classical Compiler Design

The most popular design for a traditional static compiler (like most C compilers) is the three-phase design, whose major components are the front end, the optimizer, and the back end. The front end parses source code, checking it for errors, and builds a language-specific Abstract Syntax Tree (AST) to represent the input code. The AST is optionally converted to a new representation for optimization, and the optimizer and back end are run on the code.
[Three Major Components of a Three-Phase Compiler]
 Three Major Components of a Three-Phase Compiler
The optimizer is responsible for doing a broad variety of transformations to try to improve the code's running time, such as eliminating redundant computations, and is usually more or less independent of language and target. The back end (also known as the code generator) then maps the code onto the target instruction set. In addition to making correct code, it is responsible for generating good code that takes advantage of unusual features of the supported architecture. Common parts of a compiler back end include instruction selection, register allocation, and instruction scheduling.
This model applies equally well to interpreters and JIT compilers. The Java Virtual Machine (JVM) is also an implementation of this model, which uses Java bytecode as the interface between the front end and optimizer.
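As a toy illustration of the separation (all names and the single-digit scope are invented for this sketch), a "compiler" for expressions like "2+3" can be split along the same seams: a front end that parses source into a tiny IR, and an optimizer/back end stage that consumes that IR without ever seeing the source text:

```c
/* Toy three-phase pipeline for expressions of the form "d OP d". */

/* The IR produced by the front end. */
struct toy_ir {
        int  lhs, rhs;
        char op;
};

/* Front end: parse the source text into IR; no evaluation happens here. */
static struct toy_ir toy_parse(const char *src)
{
        struct toy_ir ir = { src[0] - '0', src[2] - '0', src[1] };
        return ir;
}

/* Optimizer + back end collapsed into one step: constant-fold the IR.
 * A real back end would instead select instructions for a target. */
static int toy_codegen(struct toy_ir ir)
{
        return ir.op == '+' ? ir.lhs + ir.rhs : ir.lhs * ir.rhs;
}

int toy_compile(const char *src)
{
        return toy_codegen(toy_parse(src));
}
```

The point of the split is that toy_codegen never touches source syntax: a second front end for a different surface language could reuse it unchanged, which is exactly the retargetability argument made below.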

Implications of this Design

The most important win of this classical design comes when a compiler decides to support multiple source languages or target architectures. If the compiler uses a common code representation in its optimizer, then a front end can be written for any language that can compile to it, and a back end can be written for any target that can compile from it.
[Retargetability]  Retargetability
With this design, porting the compiler to support a new source language (e.g., Algol or BASIC) requires implementing a new front end, but the existing optimizer and back end can be reused. If these parts weren't separated, implementing a new source language would require starting over from scratch, so supporting N targets and M source languages would need N*M compilers.
Another advantage of the three-phase design (which follows directly from retargetability) is that the compiler serves a broader set of programmers than it would if it only supported one source language and one target. For an open source project, this means that there is a larger community of potential contributors to draw from, which naturally leads to more enhancements and improvements to the compiler. This is the reason why open source compilers that serve many communities (like GCC) tend to generate better optimized machine code than narrower compilers like FreePASCAL. This isn't the case for proprietary compilers, whose quality is directly related to the project's budget. For example, the Intel ICC Compiler is widely known for the quality of code it generates, even though it serves a narrow audience.
A final major win of the three-phase design is that the skills required to implement a front end are different than those required for the optimizer and back end. Separating these makes it easier for a "front-end person" to enhance and maintain their part of the compiler. While this is a social issue, not a technical one, it matters a lot in practice, particularly for open source projects that want to reduce the barrier to contributing as much as possible.

When and Where Each Phase Runs

As mentioned earlier, LLVM IR can be efficiently (de)serialized to/from a binary format known as LLVM bitcode. Since LLVM IR is self-contained, and serialization is a lossless process, we can do part of compilation, save our progress to disk, then continue work at some point in the future. This feature provides a number of interesting capabilities including support for link-time and install-time optimization, both of which delay code generation from "compile time".
Link-Time Optimization (LTO) addresses the problem where the compiler traditionally only sees one translation unit (e.g., a .c file with all its headers) at a time and therefore cannot do optimizations (like inlining) across file boundaries. LLVM compilers like Clang support this with the -flto or -O4 command line option. This option instructs the compiler to emit LLVM bitcode to the .o file instead of writing out a native object file, and delays code generation to link time.
[Link-Time Optimization]  Link-Time Optimization
Details differ depending on which operating system you're on, but the important bit is that the linker detects that it has LLVM bitcode in the .o files instead of native object files. When it sees this, it reads all the bitcode files into memory, links them together, then runs the LLVM optimizer over the aggregate. Since the optimizer can now see across a much larger portion of the code, it can inline, propagate constants, do more aggressive dead code elimination, and more across file boundaries. While many modern compilers support LTO, most of them (e.g., GCC, Open64, the Intel compiler, etc.) do so by having an expensive and slow serialization process. In LLVM, LTO falls out naturally from the design of the system, and works across different source languages (unlike many other compilers) because the IR is truly source language neutral.
Install-time optimization is the idea of delaying code generation even later than link time, all the way to install time. Install time is a very interesting time (in cases when software is shipped in a box, downloaded, or uploaded to a mobile device, etc.), because this is when you find out the specifics of the device you're targeting. In the x86 family, for example, there is a broad variety of chips and characteristics. By delaying instruction choice, scheduling, and other aspects of code generation, you can pick the best answers for the specific hardware an application ends up running on.
[Install-Time Optimization]


Monday 17 October 2016

Writing an ALSA driver

Over the past week I've been writing an ALSA driver for an MPEG-4 capture board (4/8/16 channel). What I discovered is that there are not many good documents on the basics of writing a simple ALSA driver, so I wanted to share my experience in the hopes that it will help others.

My driver needed to be pretty simple. The encoder produced 8 kHz mono G.723-24 ADPCM. To save you the Wikipedia trip: that's 3 bits per sample, or 24000 bits per second. The card produced this at a rate of 128 samples per interrupt (48 bytes) for every channel available (you cannot disable individual channels).

The card delivered this data in a 32 kbyte buffer, split into 32 pages. Each page was written as 48 bytes × 20 channels, which took up 960 bytes of the 1024-byte page (the card could handle up to 20 channels, but for my purposes I was only using 4, 8 or 16 channels of encoded data, depending on the capabilities of the card).
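The numbers above hang together; as a sanity check, here is the arithmetic spelled out (constants restated from the text, macro names invented for the sketch):

```c
/* Stream and page arithmetic for the capture board described above. */
#define SAMPLE_RATE       8000   /* Hz, mono */
#define BITS_PER_SAMPLE      3   /* G.723-24 ADPCM */
#define SAMPLES_PER_IRQ    128
#define MAX_CHANNELS        20
#define PAGE_BYTES        1024

/* 8000 samples/s * 3 bits = 24000 bits per second per channel. */
static int bits_per_second(void)
{
        return SAMPLE_RATE * BITS_PER_SAMPLE;
}

/* 128 samples * 3 bits / 8 = 48 bytes per interrupt per channel. */
static int bytes_per_irq(void)
{
        return SAMPLES_PER_IRQ * BITS_PER_SAMPLE / 8;
}

/* 48 bytes * 20 channels = 960 bytes used of each 1024-byte page. */
static int page_bytes_used(void)
{
        return bytes_per_irq() * MAX_CHANNELS;
}
```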

Now, let's set aside the fact that ALSA does not have a format spec for G.723-24, so my usage entails dumping out the 48 bytes to userspace as unsigned 8-bit PCM (and my userspace application handles the G.723-24 decoding, knowing that it is getting this data).

First, where to start in ALSA. I had to decide how to expose these capture interfaces. I could have exposed a capture device for each channel, but instead I chose to expose one capture interface with a subdevice for each channel. This made programming a bit easier, gave a better overview of the devices as perceived by ALSA, and kept /dev/snd/ less cluttered (especially when you had multiple 16-channel cards installed). It also made programming userspace easier since it kept channels hierarchically under the card/device.

I'll dig a little deeper into the base driver. I won't go into the details of the module and PCI initialization that was already present in my driver (I developed the core and v4l2 components first, so all of that is taken care of).

So first off I needed to register with ALSA that we actually have a sound card. This bit is easy, and looks like this:

struct snd_card *card;
ret = snd_card_create(SNDRV_DEFAULT_IDX1, "MySoundCard",
                      THIS_MODULE, 0, &card);
if (ret < 0)
        return ret;


This asks ALSA to allocate a new sound card with the name "MySoundCard". This is also the name that appears in /proc/asound/ as a symlink to the card ID (e.g. "card0"). In my particular instance I actually name the card with an ID number, so it ends up being "MySoundCard0". This is because I can, and typically do, have more than one of this type of device installed at a time. I notice some other sound drivers do not do this, probably because they don't expect more than one to be installed at a time (think HDA, which is usually embedded on the motherboard, and so won't have two or more inserted into a PCIe slot). Next, we set some of the properties of this new card.

strcpy(card->driver, "my_driver");
strcpy(card->shortname, "MySoundCard Audio");
sprintf(card->longname, "%s on %s IRQ %d", card->shortname,
        pci_name(pci_dev), pci_dev->irq);
snd_card_set_dev(card, &pci_dev->dev);


Here, we've assigned the name of the driver that handles this card, which is typically the same as the actual name of your driver. Next is a short description of the hardware, followed by a longer description. Most drivers seem to set the long description to something containing the PCI info. If you have some other bus, then the convention would follow to use information from that particular bus. Finally, set the parent device associated with the card. Again, since this is a PCI device, I set it to that.

Now to set this card up in ALSA along with a decent description of how the hardware works. We add the next bit of code to do this:

static struct snd_device_ops ops = { NULL };
ret = snd_device_new(card, SNDRV_DEV_LOWLEVEL, mydev, &ops);
if (ret < 0)
        return ret;


We're basically telling ALSA to create a new card that is a low level sound driver. The mydev argument is passed as the private data that is associated with this device, for your convenience. We leave the ops structure as a no-op here for now.

Lastly, to complete the registration with ALSA:

if ((ret = snd_card_register(card)) < 0)
        return ret;


ALSA now knows about this card, and lists it in /proc/asound/ among other places such as /sys. We still haven't told ALSA about the interfaces associated with this card (capture/playback). This will be discussed in the next installment. One last thing, when you cleanup your device/driver, you must do so through ALSA as well, like this:

snd_card_free(card);


This will cleanup all items associated with this card, including any devices that we will register later.

Setting up capture:

Now that we have an ALSA card initialized and registered with the middle layer we can move on to describing to ALSA our capture device. Unfortunately for anyone wishing to do playback, I will not be covering that since my device driver only provides for capture. If I end up implementing the playback feature, I will make an additional post.

So let's get started. ALSA provides a PCM API in its middle layer. We will be making use of this to register a single PCM capture device that will have a number of subdevices depending on the low level hardware I have. NOTE: All of the initialization below must be done just before the call to snd_card_register() in the last posting.

struct snd_pcm *pcm;
ret = snd_pcm_new(card, card->driver, 0, 0, nr_subdevs,
                  &pcm);
if (ret < 0)
        return ret;


In the above code we allocate a new PCM structure. We pass the card we allocated beforehand. The second argument is a name for the PCM device, which I have just conveniently set to the same name as the driver. It can be whatever you like. The third argument is the PCM device number. Since I am only allocating one, it's set to 0.

The fourth and fifth arguments are the number of playback and capture streams associated with this device. For my purpose, playback is 0 and capture is the number I have detected that the card supports (4, 8 or 16).

The last argument is where ALSA allocates the PCM device. It will associate any memory for this with the card, so when we later call snd_card_free(), it will cleanup our PCM device(s) as well.

Next we must associate the handlers for capturing sound data from our hardware. We have a struct defined as such:

static struct snd_pcm_ops my_pcm_ops = {
        .open      = my_pcm_open,
        .close     = my_pcm_close,
        .ioctl     = snd_pcm_lib_ioctl,
        .hw_params = my_hw_params,
        .hw_free   = my_hw_free,
        .prepare   = my_pcm_prepare,
        .trigger   = my_pcm_trigger,
        .pointer   = my_pcm_pointer,
        .copy      = my_pcm_copy,
};


I will go into the details of how to define these handlers in the next post, but for now we just want to let the PCM middle layer know to use them:

snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE,
                &my_pcm_ops);
pcm->private_data = mydev;
pcm->info_flags = 0;
strcpy(pcm->name, card->shortname);


Here, we first set the capture handlers for this PCM device to the one we defined above. Afterwards, we also set some basic info for the PCM device such as adding our main device as part of the private data (so that we can retrieve it more easily in the handler callbacks).

Now that we've made the device, we want to initialize the memory management associated with the PCM middle layer. ALSA provides some basic memory handling routines for various functions. We want to make use of it since it allows us to reduce the amount of code we write and makes working with userspace that much easier.

ret = snd_pcm_lib_preallocate_pages_for_all(pcm,
                     SNDRV_DMA_TYPE_CONTINUOUS,
                     snd_dma_continuous_data(GFP_KERNEL),
                     MAX_BUFFER, MAX_BUFFER);
if (ret < 0)
        return ret;


The MAX_BUFFER is something we've defined earlier and will be discussed further in the next post. Simply put, it's the maximum size of the buffer in the hardware (the maximum size of data that userspace can request at one time without waiting on the hardware to produce more data).
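For this hardware, MAX_BUFFER falls straight out of the layout described earlier: 32 pages, each carrying 48 bytes of a given channel's data. A sketch of the sizing (macro names chosen to match the call above; the period/buffer terminology is explained in the next post):

```c
/* One channel's worth of data per interrupt (one "period"). */
#define PERIOD_BYTES  48
/* Number of pages in the card's ring buffer. */
#define NUM_PERIODS   32
/* Largest buffer userspace can request: 32 * 48 = 1536 bytes. */
#define MAX_BUFFER    (NUM_PERIODS * PERIOD_BYTES)
```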

We are using the simple continuous buffer type here. Your hardware may support DMA directly into the buffers, in which case you would use a DMA-capable buffer type (e.g. SNDRV_DMA_TYPE_DEV) along with your PCI device to initialize this. I'm using standard buffers because my hardware requires me to move data around manually.
 
Writing an ALSA driver: PCM Hardware Description
Here we'll dig into the snd_pcm_hardware structure, which will be used in the next post describing the PCM handler callbacks.

Here is a look at the snd_pcm_hardware structure I have for my driver. It's fairly simplistic:

static struct snd_pcm_hardware my_pcm_hw = {
        .info = (SNDRV_PCM_INFO_MMAP |
                 SNDRV_PCM_INFO_INTERLEAVED |
                 SNDRV_PCM_INFO_BLOCK_TRANSFER |
                 SNDRV_PCM_INFO_MMAP_VALID),
        .formats          = SNDRV_PCM_FMTBIT_U8,
        .rates            = SNDRV_PCM_RATE_8000,
        .rate_min         = 8000,
        .rate_max         = 8000,
        .channels_min     = 1,
        .channels_max     = 1,
        .buffer_bytes_max = (32 * 48),
        .period_bytes_min = 48,
        .period_bytes_max = 48,
        .periods_min      = 1,
        .periods_max      = 32,
};


This structure describes how my hardware lays out the PCM data for capturing. As I described before, it writes out 48 bytes at a time for each stream, into 32 pages. A period basically describes an interrupt. It sums up the "chunk" size that the hardware supplies data in.

This hardware only supplies mono data (1 channel) and only an 8000 Hz sample rate. Most hardware seems to work in the range of 8000 to 48000, and there is a define for that: SNDRV_PCM_RATE_8000_48000. This is a bit-masked field, so you can OR in whatever rates your hardware supports.

My hardware driver describes this data as unsigned 8-bit format (it's actually signed 3-bit G.723-24, but ALSA doesn't support that, so I fake it). The most common PCM data is signed 16-bit little-endian (S16_LE). You would use whatever your hardware supplies, which can be more than one type; since the format field is a bit mask, you can define multiple data formats.
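Because .rates and .formats are bit masks, multiple capabilities simply OR together. The constants below are illustrative stand-ins, NOT the kernel's actual SNDRV_PCM_RATE_* values; they only demonstrate the mask mechanics:

```c
/* Illustrative rate bits -- not the real SNDRV_PCM_RATE_* values. */
#define MY_RATE_8000   (1u << 0)
#define MY_RATE_44100  (1u << 1)
#define MY_RATE_48000  (1u << 2)

/* A hardware description supporting 8 kHz and 48 kHz ORs the bits. */
static unsigned int my_rates = MY_RATE_8000 | MY_RATE_48000;

static int rate_supported(unsigned int mask, unsigned int rate_bit)
{
        return (mask & rate_bit) != 0;
}
```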

Lastly, the info field describes some middle layer features that your hardware/driver supports. What I have here is the base for what most drivers will supply. See the ALSA docs for more details. For example, if your hardware has stereo (or multiple channels) but it does not interleave these channels together, you would not have the interleave flag.



 Writing an ALSA driver: PCM handler callbacks
So here we are on the final chapter of the ALSA driver series. We will finally fill in the meat of the driver with some simple handler callbacks for the PCM capture device we've been developing. In the previous post, Writing an ALSA driver: Setting up capture, we defined my_pcm_ops, which was used when calling snd_pcm_set_ops() for our PCM device. Here is that structure again:

static struct snd_pcm_ops my_pcm_ops = {
        .open      = my_pcm_open,
        .close     = my_pcm_close,
        .ioctl     = snd_pcm_lib_ioctl,
        .hw_params = my_hw_params,
        .hw_free   = my_hw_free,
        .prepare   = my_pcm_prepare,
        .trigger   = my_pcm_trigger,
        .pointer   = my_pcm_pointer,
        .copy      = my_pcm_copy,
};


First let's start off with the open and close methods defined in this structure. This is where your driver gets notified that someone has opened the capture device (file open) and subsequently closed it.

static int my_pcm_open(struct snd_pcm_substream *ss)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);

        ss->runtime->hw = my_pcm_hw;
        ss->private_data = my_dev;

        return 0;
}

static int my_pcm_close(struct snd_pcm_substream *ss)
{
        ss->private_data = NULL;

        return 0;
}


This is the minimum you would do for these two functions. If needed, you would allocate private data for this stream and free it on close.

For the ioctl handler, unless you need something special, you can just use the standard snd_pcm_lib_ioctl callback.

The next three callbacks handle hardware setup.

static int my_hw_params(struct snd_pcm_substream *ss,
                        struct snd_pcm_hw_params *hw_params)
{
        return snd_pcm_lib_malloc_pages(ss,
                         params_buffer_bytes(hw_params));
}

static int my_hw_free(struct snd_pcm_substream *ss)
{
        return snd_pcm_lib_free_pages(ss);
}

static int my_pcm_prepare(struct snd_pcm_substream *ss)
{
        return 0;
}


Since we've been using standard memory allocation routines from ALSA, these functions stay fairly simple. If you have some special exceptions between different versions of the hardware supported by your driver, you can make changes to the ss->runtime->hw structure here (e.g. if one version of your card supports 96 kHz, but the rest only support 48 kHz max).

The PCM prepare callback should handle anything your driver needs to do before alsa-lib can ask it to start sending buffers. My driver doesn't do anything special here, so I have an empty callback.

This next handler tells your driver when ALSA is going to start and stop capturing buffers from your device. Most likely you will enable and disable interrupts here.

static int my_pcm_trigger(struct snd_pcm_substream *ss,
                          int cmd)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);
        int ret = 0;

        switch (cmd) {
        case SNDRV_PCM_TRIGGER_START:
                // Start the hardware capture
                break;
        case SNDRV_PCM_TRIGGER_STOP:
                // Stop the hardware capture
                break;
        default:
                ret = -EINVAL;
        }

        return ret;
}


Let's move on to the handlers that are the work horse in my driver. Since the hardware that I'm writing my driver for cannot directly DMA into memory that ALSA has supplied for us to communicate with userspace, I need to make use of the copy handler to perform this operation.

static snd_pcm_uframes_t my_pcm_pointer(struct snd_pcm_substream *ss)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);

        return my_dev->hw_idx;
}

static int my_pcm_copy(struct snd_pcm_substream *ss,
                       int channel, snd_pcm_uframes_t pos,
                       void __user *dst,
                       snd_pcm_uframes_t count)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);

        /* pos and count are in frames; one frame is one byte here
         * (mono, unsigned 8-bit). copy_to_user() returns the number
         * of bytes NOT copied, so map any failure to -EFAULT. */
        if (copy_to_user(dst, my_dev->buffer + pos, count))
                return -EFAULT;

        return 0;
}


So here we've defined a pointer function, which gets called to find out where the hardware is in writing to the buffer.

Next, we have the actual copy function. You should note that count and pos are in frames, not bytes (for this driver one frame happens to be one byte, since the data is 8-bit mono). The buffer shown here is assumed to have been filled during interrupt handling.
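Since pos and count arrive in frames, hardware whose frame is not one byte needs a conversion before the copy. In-kernel, the frames_to_bytes() helper derives this from the runtime state; the underlying arithmetic is just this (helper name hypothetical):

```c
#include <stddef.h>

/* frames -> bytes: one frame holds one sample per channel.
 * For this driver (mono, unsigned 8-bit) a frame is exactly one
 * byte, so the conversion is the identity -- but only by coincidence. */
static size_t my_frames_to_bytes(size_t frames, int channels,
                                 int bytes_per_sample)
{
        return frames * channels * bytes_per_sample;
}
```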

Speaking of interrupts, the ISR is also where you should signal to ALSA that you have more data to consume. In my ISR (interrupt service routine), I have this:

snd_pcm_period_elapsed(my_dev->ss);

ALSA Audio Driver


The figure below shows audio connectivity on a PC-compatible system: the audio controller on the south bridge, together with an external codec, interfaces with the analog audio circuitry.
Figure: Audio in a PC environment
An audio codec converts digital audio data to analog sound signals for playback via speakers, and performs the reverse operation for recording via a microphone. Other common audio inputs and outputs that interface with a codec include headsets, earphones, hands-free sets, line-in and line-out. A codec offers mixer functionality that connects it to a combination of these audio inputs and outputs and controls the volume gain associated with each audio signal.
Digital data is obtained by sampling analog audio signals at specific rates using a technique called pulse code modulation (PCM). CD-quality sound, for example, is sampled at 44.1 kHz, using 16 bits to hold each sample. A codec is responsible for recording audio by sampling at supported PCM rates, and for playing back audio originally sampled at different rates.
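Those sampling figures translate directly into raw data rates; for example, CD quality works out to 44100 × 16 × 2 bits per second, and the 8 kHz 3-bit mono stream from the driver posts above to 24000 bits per second. A small sketch of the arithmetic:

```c
/* Raw PCM data rate in bits per second. */
static long pcm_bits_per_second(long rate_hz, int bits_per_sample,
                                int channels)
{
        return rate_hz * bits_per_sample * channels;
}
```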
Linux Sound Subsystem
Advanced Linux Sound Architecture (ALSA) is the sound subsystem in the 2.6 kernel. Open Sound System (OSS), the sound layer in the 2.4 kernel, is now obsolete and deprecated.
The following figure shows the Linux sound subsystem:
Figure: Linux Sound Sub System (ALSA)
1). The following /dev/snd/* device nodes are created and managed by the ALSA core. /dev/snd/controlC0 is a control node that application uses for controlling volume gain and such. /dev/snd/pcmC0D0P is a play back device. /dev/snd/pcmD0C is a recording device.
2). Audio controller drivers specific to the controller hardware. To drive the audio controller present in the Intel ICH south bridge chipsets, for example, use the snd_intel8x0 driver.
3). Audio codec interfaces that assist communication between controllers and codecs. For AC'97 codecs, ALSA uses the snd_ac97_codec and ac97_bus modules.
4). The nodes /dev/dsp, /dev/adsp, and /dev/mixer allow OSS applications to run unchanged over ALSA. The OSS /dev/dsp node maps to the ALSA nodes /dev/snd/pcmC0D0*, /dev/adsp corresponds to /dev/snd/pcmC0D1*, and /dev/mixer corresponds to /dev/snd/controlC0.
5). Procfs and sysfs interface implementations for accessing information via /proc/asound and /sys/class/sound.
6). The user-space ALSA library, alsa-lib, which provides the libasound.so object. The library eases the job of ALSA application programming by offering several canned routines to access the ALSA drivers.
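The pcmC&lt;card&gt;D&lt;device&gt;&lt;direction&gt; naming convention above can be sketched with a tiny helper (hypothetical code; a static buffer is used purely for brevity):

```c
#include <stdio.h>

/* Build an ALSA PCM node name from the card number, device number,
 * and a direction suffix ('p' = playback, 'c' = capture). */
static const char *pcm_node_name(int card, int device, char direction)
{
    static char buf[32];   /* static for brevity; not thread-safe */
    snprintf(buf, sizeof buf, "/dev/snd/pcmC%dD%d%c",
             card, device, direction);
    return buf;
}

/* pcm_node_name(0, 0, 'p') -> "/dev/snd/pcmC0D0p" */
```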
Tools and Links
Answers to FAQs pertaining to Linux sound.
MadPlay is a software MP3 decoder and player that is both ALSA and OSS aware; it can be used for basic playback and recording.
Data Structure and API
The figures below show the data structures and APIs used in developing an ALSA driver.
Figure: Summary of data structures

Friday 14 October 2016

Embedded System Testing (Software & Hardware)

Embedded systems are gaining importance with the increasing adoption of 16-, 32-, and 64-bit processors across a wide variety of electronic products. As consumer expectations from these systems grow, manufacturers are challenged by the following factors when testing them for market release:

    Real time responses
    Separate host (target) systems from development environments
    Lack of standardization in deployment architectures
    Lack of established interfaces to systems under testing
    Stringent Fail-Safe requirements
    Extremely high cost of isolating and fixing defects


Embedded System Testing Methods

  1.  System Level Testing
  2.  Application Testing
  3.  Middleware Testing
  4.  BSP & Driver testing
  5.  Embedded Hardware Design

Embedded Internet

The embedded Internet is bringing transformative changes to the embedded world. The era of intelligent connectivity is dawning. And the industry is about to hit the fast-forward button.¹

We have watched the Internet grow from its early beginnings to a global human network, breaking down old boundaries, stimulating new usage models, and unleashing opportunities for building businesses and growing revenue on a global scale.

Now the Internet is evolving again, to the embedded space. How big will it become? Intel Vice President Doug Davis cites the IDC prediction of 15 billion intelligent, connected devices by the year 2015¹.

Extending the power of Internet connectivity to a virtually limitless variety of embedded devices, with many communicating machine-to-machine without human intervention, has more far-reaching implications than any single technologist can imagine.

The important question is: how many of these breakthrough solutions will you create?

Even one billion of anything is an astounding number. But when you start to think about 15 billion intelligent devices¹ connected to each other, you realize that our industry is on the threshold of something new.

If Internet-connected PCs and phones were transformative, imagine what happens when the Internet connects cars, home media phones, digital signs and shopping carts, mobile medical diagnostic tools, factory robots and intelligent wind turbines.


At Intel we are working with the embedded computing and communications ecosystem, and with end customers, to envision the innovative possibilities and capture unprecedented opportunities for industry growth.

Driven by breakthroughs in microarchitecture and process technology, the same Intel® architecture that is at the heart of the majority of today’s Internet applications can now deliver scalable intelligence and connectivity to billions of new intelligent, connected devices.

Building on our 30 years of embedded industry experience, Intel is delivering the platforms you need today—based on products whose specifications range from milliwatts of power consumption to petaflops of performance—all based on a single, familiar and proven software architecture.

Newer and even more visionary embedded applications are yet to come. Their implications will be vast—for the industry, and for your future.


¹ Gantz, John. "The Embedded Internet: Methodology and Findings." IDC, January 2009.

Embedded Linux Vendors

Organizations developing embedded/mobile Linux products include:
1. µClinux
2. RTLinux
3. LynuxWorks
4. Wind River
5. MontaVista
6. Android
7. iOS

MontaVista, for example, has released Mobilinux, the first version of its Linux operating system specifically designed and optimized for mobile phones and wireless devices.

Tuesday 13 September 2016

Interesting C Interview Questions and Answers

In this article, we will discuss some interesting problems in the C language that can help students brush up on their C programming skills and prepare their C fundamentals for interviews.

1. gets() function
Question: There is a hidden problem with the following code. Can you detect it?

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buff[10];
    memset(buff,0,sizeof(buff));

    gets(buff);

    printf("\n The buffer entered is [%s]\n",buff);

    return 0;
}

Answer: The hidden problem with the code above is the use of the function gets(). This function accepts a string from stdin without checking the capacity of the buffer into which it copies the value. This may well result in a buffer overflow. The standard function fgets() is advisable to use in such cases.
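For illustration, a minimal fgets()-based replacement might look like this (my own sketch; the name read_line is made up):

```c
#include <stdio.h>
#include <string.h>

/* Safe alternative to gets(): reads at most n-1 characters and strips
 * the trailing newline that fgets() keeps. Returns buf, or NULL on
 * EOF/error. */
static char *read_line(char *buf, size_t n, FILE *fp)
{
    if (fgets(buf, (int)n, fp) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline, if present */
    return buf;
}

/* In the program above: read_line(buff, sizeof(buff), stdin); */
```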
2. strcpy() function
Question: Following is the code for very basic password protection. Can you break it without knowing the password?

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int flag = 0;
    char passwd[10];

    memset(passwd,0,sizeof(passwd));

    strcpy(passwd, argv[1]);

    if(0 == strcmp("LinuxGeek", passwd))
    {
        flag = 1;
    }

    if(flag)
    {
        printf("\n Password cracked \n");
    }
    else
    {
        printf("\n Incorrect passwd \n");

    }
    return 0;
}

Answer: Yes. The authentication logic in the password protector code above can be compromised by exploiting a loophole in the strcpy() function. This function copies the password supplied by the user into the ‘passwd’ buffer without checking whether the supplied password actually fits in that buffer. If a user supplies a password long enough to cause a buffer overflow, it overwrites the memory location containing the default value ‘0’ of the ‘flag’ variable. Then, even though the password-matching condition fails, the check of flag being non-zero becomes true, and the password protection is breached.
For example :

$ ./psswd aaaaaaaaaaaaa

 Password cracked

So you can see that though the password supplied in the above example is not correct, it still breached the password security through a buffer overflow.
To avoid this kind of problem, the function strncpy() should be used.

    Note from author : These days, compilers internally detect the possibility of stack smashing, so they store variables on the stack in such a way that smashing it becomes very difficult. In my case too, gcc does this by default, so I had to use the compile option ‘-fno-stack-protector’ to reproduce the above scenario.
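One subtlety worth showing: strncpy() itself does not NUL-terminate the destination when the source is as long as the limit, so terminate explicitly (hypothetical helper of my own):

```c
#include <string.h>

/* strncpy() leaves dst unterminated when src fills the buffer, so a
 * safe copy always terminates explicitly. */
static void safe_copy(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';   /* guarantee NUL termination */
}
```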

3. Return type of main()
Question: Will the following code compile? If yes, then is there any other problem with this code?

#include <stdio.h>
#include <stdlib.h>

void main(void)
{
    char *ptr = (char*)malloc(10);

    if(NULL == ptr)
    {
        printf("\n Malloc failed \n");
        return;
    }
    else
    {
        // Do some processing

        free(ptr);
    }

    return;
}

Answer: The code will compile error-free, but with a warning (from most compilers) regarding the return type of the main() function. The return type of main() should be ‘int’ rather than ‘void’. This is because the ‘int’ return type lets the program return a status value. This becomes especially important when the program is run as part of a script that relies on the success of the program’s execution.
4. Memory Leak
Question: Will the following code result in memory leak?

#include <stdio.h>
#include <stdlib.h>

void main(void)
{
    char *ptr = (char*)malloc(10);

    if(NULL == ptr)
    {
        printf("\n Malloc failed \n");
        return;
    }
    else
    {
        // Do some processing
    }

    return;
}

Answer: Though the above code does not free the memory allocated to ‘ptr’, this would not cause a memory leak, because the program exits once the processing is done. Since the program terminates, all the memory it allocated is automatically freed as part of cleanup. But if the above code were inside a while loop, it would cause serious memory leaks.
Note : If you want to know more about memory leaks and a tool that can detect them, read our article on Valgrind.
5. The free() function
Question: The following program seg-faults (crashes) when the user supplies the input ‘freeze’, while it works fine with the input ‘zebra’. Why?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char *ptr = (char*)malloc(10);

    if(NULL == ptr)
    {
        printf("\n Malloc failed \n");
        return -1;
    }
    else if(argc == 1)
    {
        printf("\n Usage  \n");
    }
    else
    {
        memset(ptr, 0, 10);

        strncpy(ptr, argv[1], 9);

        while(*ptr != 'z')
        {
            if(*ptr == '\0')
                break;
            else
                ptr++;
        }

        if(*ptr == 'z')
        {
            printf("\n String contains 'z'\n");
            // Do some more processing
        }

       free(ptr);
    }

    return 0;
}

Answer: The problem here is that the code changes the address in ‘ptr’ (by incrementing ‘ptr’) inside the while loop. When ‘zebra’ is supplied as input, the while loop terminates before executing even once, so the argument passed to free() is the same address given by malloc(). But in the case of ‘freeze’, the address held by ptr is updated inside the while loop, and hence an incorrect address is passed to free(), which causes the seg-fault or crash.
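A corrected version of the scanning logic, sketched as a standalone function (contains_z is my own name), keeps the malloc'ed address untouched:

```c
#include <stdlib.h>
#include <string.h>

/* Scan for 'z' using a separate cursor so free() always receives the
 * exact address returned by malloc(). Returns 1 if found, 0 if not,
 * -1 on allocation failure. */
static int contains_z(const char *s)
{
    char *buf = malloc(10);
    if (buf == NULL)
        return -1;
    memset(buf, 0, 10);
    strncpy(buf, s, 9);

    char *p = buf;                 /* cursor; buf itself never moves */
    while (*p != 'z' && *p != '\0')
        p++;
    int found = (*p == 'z');

    free(buf);                     /* the original malloc'ed pointer */
    return found;
}
```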
6. atexit with _exit
Question: In the code below, the atexit() function is not being called. Can you tell why?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void func(void)
{
    printf("\n Cleanup function called \n");
    return;
}

int main(void)
{
    int i = 0;

    atexit(func);

    for(;i<0xffffff;i++);

    _exit(0);
}
Answer: This behavior is due to the use of the function _exit(). This function does not call clean-up handlers registered with atexit(). If the atexit() handlers need to run, then exit() or ‘return’ should be used instead.



7. void* and C structures

Question: Can you design a function that can accept any type of argument and returns an integer? Also, is there a way in which more than one argument can be passed to it?


Answer: A function that can accept any type of argument looks like :


 int func(void *ptr)


If more than one argument needs to be passed to this function, it can be called with a structure object wherein the structure members are populated with the arguments that need to be passed.
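A minimal sketch of that idea (the struct and member names are my own):

```c
/* A payload struct lets one void* parameter carry several arguments. */
struct two_args {
    int a;
    int b;
};

/* Accepts any type of argument through void*; here it expects a
 * struct two_args and returns the sum of its members. */
static int func(void *ptr)
{
    struct two_args *args = ptr;
    return args->a + args->b;
}

/* struct two_args t = { 10, 20 };  func(&t) -> 30 */
```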



8. * and ++ operators

Question: What would be the output of the following code and why?


#include <stdio.h>

int main(void)
{
    char *ptr = "Linux";
    printf("\n [%c] \n",*ptr++);
    printf("\n [%c] \n",*ptr);

    return 0;
}


Answer: The output of the above would be :


[L]

[i]


Since postfix ‘++’ binds more tightly than unary ‘*’, ‘*ptr++’ is parsed as ‘*(ptr++)’. The pointer ptr is incremented, but the value of the expression is the character at ptr’s old address, so the first printf() prints ‘L’. Because the ‘++’ applied to ptr was a postfix increment, the next printf() then prints ‘i’.
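The same behaviour, isolated into a small testable helper (my own sketch):

```c
/* *p++ parses as *(p++): the pointer advances, but the expression's
 * value is the character at the OLD address. */
static char deref_then_advance(const char **pp)
{
    return *(*pp)++;
}

/* const char *p = "Linux";
 * deref_then_advance(&p) -> 'L'; afterwards *p is 'i'. */
```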



9. Making changes in Code(or read-only) segment

Question: The following code seg-faults (crashes). Can you tell the reason why?


#include <stdio.h>

int main(void)
{
    char *ptr = "Linux";
    *ptr = 'T';

    printf("\n [%s] \n", ptr);

    return 0;
}


Answer: This is because, through *ptr = ‘T’, the code is trying to change the first byte of the string ‘Linux’ kept in the code (read-only) segment in memory. This operation is invalid and hence causes a seg-fault or crash.



10. Process that changes its own name

Question: Can you write a program that changes its own name when run?


Answer: The following piece of code tries to do the required :


#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int i = 0;
    char buff[100];

    memset(buff,0,sizeof(buff));

    strncpy(buff, argv[0], sizeof(buff));
    memset(argv[0],0,strlen(buff));

    strncpy(argv[0], "NewName", 7);

    // Simulate a wait. Check the process
    // name at this point.
    for(;i<0xffffffff;i++);

    return 0;
}

11. Returning address of local variable

Question: Is there any problem with the following code? If yes, how can it be rectified?


#include <stdio.h>

int* inc(int val)
{
  int a = val;
  a++;
  return &a;
}

int main(void)
{
    int a = 10;

    int *val = inc(a);

    printf("\n Incremented value is equal to [%d] \n", *val);

    return 0;
}


Answer: Though the above program may run perfectly fine at times, there is a serious loophole in the function ‘inc()’. This function returns the address of a local variable. Since the lifetime of this local variable is that of the function ‘inc()’, using its address after inc() has finished can cause undesired results. This can be avoided by passing the address of the variable ‘a’ from main() and then making the changes to the value kept at that address inside the function.
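One way to rectify it, as suggested above, is to let the caller own the storage (a sketch):

```c
/* Corrected inc(): the caller passes the address of its own variable,
 * so no local-variable address ever escapes the function. */
static void inc(int *val)
{
    (*val)++;
}

/* int a = 10;  inc(&a);  a is now 11. */
```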



12. Processing printf() arguments

Question: What would be the output of the following code?


#include <stdio.h>

int main(void)
{
    int a = 10, b = 20, c = 30;

    printf("\n %d..%d..%d \n", a+b+c, (b = b*2), (c = c*2));

    return 0;
}


Answer: The output of the above code would be :


110..40..60


This is because, with most compilers, the arguments to the function are evaluated from right to left but printed from left to right. (Strictly speaking, the evaluation order of function arguments is unspecified in C, so this output is compiler-dependent.)

What is the I/O Scheduler for a Hard Disk on Linux?


The 2.6 Linux kernel includes selectable I/O schedulers. They control the way the kernel commits reads and writes to disks – the intention of providing different schedulers is to allow better optimisation for different classes of workload.

Why does the kernel need an I/O scheduler?

ANS : Without an I/O scheduler, the kernel would basically just issue each request to the disk in the order it received them. This could result in massive hard-disk thrashing: if one process was reading from one part of the disk and another was writing to a different part, the heads would have to seek back and forth across the disk for every operation. The scheduler’s main goal is to optimise disk access times.

An I/O scheduler can use the following techniques to improve performance:

a) Request merging : The scheduler merges adjacent requests together to reduce disk seeking.
b) Elevator : The scheduler orders requests based on their physical location on the block device, and it basically tries to seek in one direction as much as possible.
c) Prioritisation : The scheduler has complete control over how it prioritises requests, and can do so in a number of ways.

All I/O schedulers should also take into account resource starvation, to ensure requests eventually do get serviced!

How to view the current disk scheduler?

Assuming that we have a disk named /dev/sda, type:

# cat /sys/block/{DEVICE-NAME}/queue/scheduler
# cat /sys/block/sda/queue/scheduler

Sample output:

noop anticipatory deadline [cfq]

The scheduler currently in use is cfq, shown in square brackets.

How to set the I/O scheduler for a hard disk?

To set a specific scheduler, simply type the command as follows:

# echo {SCHEDULER-NAME} > /sys/block/{DEVICE-NAME}/queue/scheduler
For example, to set the noop scheduler, enter:
# echo noop > /sys/block/hda/queue/scheduler

OR

Edit /boot/grub/grub.conf and add "elevator=noop" (or any other available scheduler) to the kernel line.

There are currently 4 available I/O schedulers :

* No-op Scheduler
* Anticipatory IO Scheduler (AS)
* Deadline Scheduler
* Complete Fair Queueing Scheduler (CFQ)

A) No-op Scheduler : This scheduler only implements request merging.

B) Anticipatory IO Scheduler : The anticipatory scheduler is the default scheduler in older 2.6 kernels – if you've not specified one, this is the one that will be loaded. It implements request merging, a one-way elevator, read and write request batching, and attempts some anticipatory reads by holding off a bit after a read batch if it thinks a user is going to ask for more data. It tries to optimise for physical disks by avoiding head movements if possible – one downside is that it probably gives highly erratic performance on database or storage systems.

C) Deadline Scheduler : The deadline scheduler implements request merging, a one-way elevator, and imposes a deadline on all operations to prevent resource starvation. Because writes return instantly within Linux, with the actual data being held in cache, the deadline scheduler will also prefer readers – as long as the deadline for a write request hasn't passed. The kernel docs suggest this is the preferred scheduler for database systems, especially if you have TCQ aware disks, or any system with high disk performance.

D) Complete Fair Queueing Scheduler (CFQ) : The complete fair queueing scheduler implements both request merging and the elevator, and attempts to give all users of a particular device the same number of IO requests over a particular time interval. This should make it more efficient for multiuser systems. Novell SLES sets cfq as the scheduler by default, as does the latest Ubuntu release. As of the 2.6.18 kernel, this is the default scheduler in kernel.org releases. RHEL 6 also uses CFQ as its default scheduler.

Changing Schedulers :

The most reliable way to change schedulers is to set the kernel option “elevator” at boot time. You can set it to one of “as”, “cfq”, “deadline” or “noop” to select the appropriate scheduler, e.g. elevator=cfq.

It seems under more recent 2.6 kernels (2.6.11, possibly earlier), you can change the scheduler at runtime by echoing the name of the scheduler into /sys/block/$devicename/queue/scheduler, where the device name is the basename of the block device, eg “sda” for /dev/sda.

doc : /usr/src/linux/Documentation/block/switching-sched.txt

syslog on linux

Syslog :

Whenever syslogd, the syslog dæmon, receives a log message, it acts based on the message's type (or facility) and its priority. syslog's mapping of actions to facilities and priorities is specified in /etc/syslog.conf. Each line in this file specifies one or more facility/priority selectors followed by an action. A selector consists of a facility or facilities and a (single) priority.

In the following syslog.conf line, mail.notice is the selector and /var/log/mail is the action (i.e., “write messages to /var/log/mail”):

mail.notice /var/log/mail

facility.level_of_priority file_where_msg_will_be_saved

Within the selector, “mail” is the facility (message category) and “notice” is the level of priority.

Facilities :

Facilities are simply categories. Supported facilities in Linux are auth, authpriv, cron, dæmon, kern, lpr, mail, mark, news, syslog, user, UUCP and local0 through local7. Some of these are self-explanatory, but of special note are:

* auth: used for many security events.
* authpriv: used for access-control-related messages.
* dæmon: used by system processes and other dæmons.
* kern: used for kernel messages.
* mark: messages generated by syslogd itself that contain only a timestamp and the string “--MARK--”. To specify how many minutes should transpire between marks, invoke syslogd with the -m [minutes] flag.
* user: the default facility when none is specified by an application or in a selector.
* local7: boot messages.
* *: wildcard signifying “any facility”.
* none: wildcard signifying “no facility”

----

Priorities :

Unlike facilities, which have no relationship to each other, priorities are hierarchical. Possible priorities in Linux are (in increasing order of urgency): debug, info, notice, warning, err, crit, alert and emerg. Note that the urgency of a given message is determined by the programmer who wrote it; facility and priority are set by the programs that generate messages, not by syslog.

If you specify a single priority in a selector (without modifiers), you're actually specifying that priority plus all higher priorities. Thus the selector mail.notice translates to “all mail-related messages having a priority of notice or higher”, i.e., having a priority of notice, warning, err, crit, alert or emerg.

This behaviour can be cancelled by prepending an = to the priority. The selector mail.=notice translates to “all mail-related messages having a priority of notice”. Priorities may also be negated: mail.!notice is equivalent to “all mail messages except those with priority of notice or higher”, and mail.!=notice corresponds to “all mail messages except those with the priority notice”.
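Putting those selector forms side by side, a few illustrative /etc/syslog.conf lines (the log file paths are examples only):

```
# priority notice or higher:
mail.notice                     /var/log/mail
# exactly priority notice:
mail.=notice                    /var/log/mail.notice-only
# everything below notice:
mail.!notice                    /var/log/mail.low
# everything except exactly notice:
mail.!=notice                   /var/log/mail.not-notice
```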

If overall system performance becomes an important factor in regard to logging, you can tell syslogd not to sync the disk each time it writes to a log file. This is done by putting a minus sign (-) in front of the file name, like this:

lpr.info -/var/adm/printer.log

Sending the log messages to another machine is done by using an at-sign (@) in front of the machine name as the action. For example:

*.emerg @logserver

details about rsyslog : http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch05_:_Troubleshooting_Linux_with_syslog


Logrotate :

The Linux utility logrotate renames and reuses system error log files on a periodic basis so that they don't occupy excessive disk space.

The /etc/logrotate.conf File :
This is logrotate's general configuration file in which you can specify the frequency with which the files are reused.

* You can specify either a weekly or daily rotation parameter. In the case below the weekly option is commented out with a #, allowing for daily updates.
* The rotate parameter specifies the number of copies of log files logrotate will maintain. In the case below the 4 copy option is commented out with a #, while allowing 7 copies.
* The create parameter creates a new log file after each rotation

Sample conf file:

# rotate log files weekly
#weekly

# rotate log files daily
daily

# keep 4 weeks worth of backlogs
#rotate 4

# keep 7 days worth of backlogs
rotate 7

# create new (empty) log files after rotating old ones
create
-----

The /etc/logrotate.d Directory :

Most Linux applications that use syslog will put an additional configuration file in this directory to specify the names of the log files to be rotated. It is a good practice to verify that all new applications that you want to use the syslog log have configuration files in this directory. Here are some sample files that define the specific files to be rotated for each application.

Here is an example of a custom file located in this directory that rotates files with the .tgz extension which are located in the /data/backups directory. The parameters in this file will override the global defaults in the /etc/logrotate.conf file. In this case, the rotated files won't be compressed, they'll be held for 30 days only if they are not empty, and they will be given file permissions of 600 for user root.

/data/backups/*.tgz {

daily
rotate 30
nocompress
missingok
notifempty
create 0600 root root
}

Activating logrotate :

The logrotate settings in the previous sections will not take effect until you issue the following command:
# logrotate -f

If you want logrotate to reload only a specific configuration file, and not all of them, then issue the logrotate command with just that filename as the argument like this:

[root@me]# logrotate -f /etc/logrotate.d/syslog

To compress log file use "compress" in main conf file.


How to check the logrotate status?

To check the current logrotate status (e.g. which files are covered by logrotate and when each was last processed):

You can check the /var/lib/logrotate/status file

Tuesday 23 August 2016

MMU Initialization on ARM


The AArch64 architecture allows up to 4 levels of translation tables with a 4KB page size and up to 3 levels with a 64KB page size. AArch64 Linux uses either 3 levels or 4 levels of translation tables with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit (256TB) virtual addresses, respectively, for both user and kernel. With 64KB pages, only 2 levels of translation tables are used, allowing 42-bit (4TB) virtual addresses, but the memory layout is the same. This blog post describes the virtual memory layout used by the AArch64 Linux kernel as well as its implementation in the kernel's start-up procedure.

Virtual Memory Layout

According to linux-4.2/Documentation/arm64/memory.txt, user space addresses have bits 63:48 set to 0 while the kernel addresses have the same bits set to 1. TTBRx selection is given by bit 63 of the virtual address. The swapper_pg_dir contains only kernel (global) mappings while the user pgd contains only user (non-global) mappings. The swapper_pg_dir address is written to TTBR1 and never written to TTBR0.
    AArch64 Linux memory layout with 4KB pages + 3 levels:

    Start           End         Size        Use
    -----------------------------------------------------------------------
    0000000000000000    0000007fffffffff     512GB      user
    ffffff8000000000    ffffffffffffffff     512GB      kernel


    AArch64 Linux memory layout with 4KB pages + 4 levels:

    Start           End         Size        Use
    -----------------------------------------------------------------------
    0000000000000000    0000ffffffffffff     256TB      user
    ffff000000000000    ffffffffffffffff     256TB      kernel


    AArch64 Linux memory layout with 64KB pages + 2 levels:

    Start           End         Size        Use
    -----------------------------------------------------------------------
    0000000000000000    000003ffffffffff       4TB      user
    fffffc0000000000    ffffffffffffffff       4TB      kernel


    AArch64 Linux memory layout with 64KB pages + 3 levels:

    Start           End         Size        Use
    -----------------------------------------------------------------------
    0000000000000000    0000ffffffffffff     256TB      user
    ffff000000000000    ffffffffffffffff     256TB      kernel


    For details of the virtual kernel memory layout please see the kernel
    booting log.


    Translation table lookup with 4KB pages:

    +--------+--------+--------+--------+--------+--------+--------+--------+
    |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
    +--------+--------+--------+--------+--------+--------+--------+--------+
     |                 |         |         |         |         |
     |                 |         |         |         |         v
     |                 |         |         |         |   [11:0]  in-page offset
     |                 |         |         |         +-> [20:12] L3 index
     |                 |         |         +-----------> [29:21] L2 index
     |                 |         +---------------------> [38:30] L1 index
     |                 +-------------------------------> [47:39] L0 index
     +-------------------------------------------------> [63] TTBR0/1


    Translation table lookup with 64KB pages:

    +--------+--------+--------+--------+--------+--------+--------+--------+
    |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
    +--------+--------+--------+--------+--------+--------+--------+--------+
     |                 |    |               |              |
     |                 |    |               |              v
     |                 |    |               |            [15:0]  in-page offset
     |                 |    |               +----------> [28:16] L3 index
     |                 |    +--------------------------> [41:29] L2 index
     |                 +-------------------------------> [47:42] L1 index
     +-------------------------------------------------> [63] TTBR0/1


    When using KVM, the hypervisor maps kernel pages in EL2, at a fixed
    offset from the kernel VA (top 24bits of the kernel VA set to zero):

    Start           End         Size        Use
    -----------------------------------------------------------------------
    0000004000000000    0000007fffffffff     256GB      kernel objects mapped in HYP
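The index fields in the 4KB-page lookup diagram above can be checked with a few lines of C (a sketch written for illustration, not kernel code):

```c
#include <stdint.h>

/* For 4KB pages + 4 levels, each level indexes 9 bits of the VA:
 * [47:39] L0, [38:30] L1, [29:21] L2, [20:12] L3, [11:0] page offset. */
static unsigned int va_index(uint64_t va, unsigned int level)
{
    unsigned int shift = 12 + 9 * (3 - level);  /* L3 -> 12, L0 -> 39 */
    return (unsigned int)(va >> shift) & 0x1ff; /* 9-bit index */
}

/* va_index(1ULL << 39, 0) -> 1 : bit 39 is the low bit of the L0 index */
```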

Before Initial Page Table Creation

The file linux-4.2/arch/arm64/kernel/head.S has some header handling in its first lines of code, followed by ENTRY(stext), which is jumped to by a "b stext" instruction at the start of the code.
    ENTRY(stext)
        bl  preserve_boot_args
        bl  el2_setup           // Drop to EL1, w20=cpu_boot_mode
        adrp    x24, __PHYS_OFFSET
        bl  set_cpu_boot_mode_flag
        bl  __create_page_tables        // x25=TTBR0, x26=TTBR1
        /*
         * The following calls CPU setup code, see arch/arm64/mm/proc.S for
         * details.
         * On return, the CPU will be ready for the MMU to be turned on and
         * the TCR will have been set.
         */
        ldr x27, =__mmap_switched       // address to jump to after
                            // MMU has been enabled
        adr_l   lr, __enable_mmu        // return (PIC) address
        b   __cpu_setup         // initialise processor
    ENDPROC(stext)
preserve_boot_args() preserves the arguments passed by the bootloader in x0 .. x3 into the boot_args array, defined in linux-4.2/arch/arm64/kernel/setup.c.
    /*
     * The recorded values of x0 .. x3 upon kernel entry.
     */
    u64 __cacheline_aligned boot_args[4];
el2_setup() tries to set up the processor to run in the EL1 exception level if the processor entered in EL2. It returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in x20, depending on whether it booted in EL1 or EL2. The registers set up in this routine are as follows:
  • hcr_el2 is set to (1 << 31), so RW, bit [31], is set, which ensures 64-bit EL1 (the Execution state for EL1 is AArch64; the Execution state for EL0 is determined by the current value of PSTATE.nRW when executing at EL0).
  • cnthctl_el2 is set to enable the EL1 physical timers, with EL1PCEN, bit [1] (traps Non-secure EL0 and EL1 accesses to the physical timer registers to EL2) and EL1PCTEN, bit [0] (traps Non-secure EL0 and EL1 accesses to the physical counter register to EL2) both set.
  • cntvoff_el2 is set to 0, clearing the virtual offset.
  • ICC_SRE_EL2 is set so that ICC_SRE_EL2.SRE==1 and ICC_SRE_EL2.Enable==1, making sure SRE is set. The ICC_SRE_EL2, Interrupt Controller System Register Enable register (EL2), controls whether the System register interface or the memory-mapped interface to the GIC CPU interface is used for EL2. Setting SRE, bit [0], to 1 enables the System register interface to the ICH_* registers and the EL1 and EL2 ICC_* registers for EL2.
  • ICH_HCR_EL2 is set to 0, resetting it to defaults. The ICH_HCR_EL2, Interrupt Controller Hyp Control Register, controls the environment for VMs.
  • midr_el1 and mpidr_el1 are copied to vpidr_el2 (holds the value of the Virtualization Processor ID; this is the value returned by Non-secure EL1 reads of MIDR_EL1) and vmpidr_el2 (holds the value of the Virtualization Multiprocessor ID; this is the value returned by Non-secure EL1 reads of MPIDR_EL1) respectively. This copying ensures the IDs appear unchanged whether read from EL1 or EL2.
  • sctlr_el1 is set so that EE and E0E are set on big-endian systems or cleared on little-endian systems. The EE, bit [25], in SCTLR_EL1, System Control Register (EL1), controls the endianness of data accesses at EL1 and of stage 1 translation table walks in the EL1&0 translation regime, and E0E, bit [24], controls the endianness of data accesses at EL0.
  • cptr_el2 and hstr_el2 are set to disable Coprocessor/CP15 traps to EL2.
  • vttbr_el2 is set to 0. The VTTBR_EL2, Virtualization Translation Table Base Register holds the base address of the translation table for the stage 2 translation of memory accesses from Non-secure EL0 and EL1.
  • vbar_el2 is set to __hyp_stub_vectors, which holds the vector base address for any exception that is taken to EL2.
  • spsr_el2 is set to (PSR_F_BIT | PSR_I_BIT | PSR_A_BIT | PSR_D_BIT | PSR_MODE_EL1h).
  • elr_el2 is set to lr, the address the calling code is to return to, so that after the eret instruction the processor resumes in the caller of el2_setup(), now running in EL1.
    /*
     * If we're fortunate enough to boot at EL2, ensure that the world is
     * sane before dropping to EL1.
     *
     * Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in x20 if
     * booted in EL1 or EL2 respectively.
     */
    ENTRY(el2_setup)
        mrs x0, CurrentEL
        cmp x0, #CurrentEL_EL2
        b.ne    1f
        mrs x0, sctlr_el2
    CPU_BE( orr x0, x0, #(1 << 25)  )   // Set the EE bit for EL2
    CPU_LE( bic x0, x0, #(1 << 25)  )   // Clear the EE bit for EL2
        msr sctlr_el2, x0
        b   2f
    1:  mrs x0, sctlr_el1
    CPU_BE( orr x0, x0, #(3 << 24)  )   // Set the EE and E0E bits for EL1
    CPU_LE( bic x0, x0, #(3 << 24)  )   // Clear the EE and E0E bits for EL1
        msr sctlr_el1, x0
        mov w20, #BOOT_CPU_MODE_EL1     // This cpu booted in EL1
        isb
        ret

        /* Hyp configuration. */
    2:  mov x0, #(1 << 31)          // 64-bit EL1
        msr hcr_el2, x0

        /* Generic timers. */
        mrs x0, cnthctl_el2
        orr x0, x0, #3          // Enable EL1 physical timers
        msr cnthctl_el2, x0
        msr cntvoff_el2, xzr        // Clear virtual offset

    #ifdef CONFIG_ARM_GIC_V3
        /* GICv3 system register access */
        mrs x0, id_aa64pfr0_el1
        ubfx    x0, x0, #24, #4
        cmp x0, #1
        b.ne    3f

        mrs_s   x0, ICC_SRE_EL2
        orr x0, x0, #ICC_SRE_EL2_SRE    // Set ICC_SRE_EL2.SRE==1
        orr x0, x0, #ICC_SRE_EL2_ENABLE // Set ICC_SRE_EL2.Enable==1
        msr_s   ICC_SRE_EL2, x0
        isb                 // Make sure SRE is now set
        msr_s   ICH_HCR_EL2, xzr        // Reset ICC_HCR_EL2 to defaults

    3:
    #endif

        /* Populate ID registers. */
        mrs x0, midr_el1
        mrs x1, mpidr_el1
        msr vpidr_el2, x0
        msr vmpidr_el2, x1

        /* sctlr_el1 */
        mov x0, #0x0800         // Set/clear RES{1,0} bits
    CPU_BE( movk    x0, #0x33d0, lsl #16    )   // Set EE and E0E on BE systems
    CPU_LE( movk    x0, #0x30d0, lsl #16    )   // Clear EE and E0E on LE systems
        msr sctlr_el1, x0

        /* Coprocessor traps. */
        mov x0, #0x33ff
        msr cptr_el2, x0            // Disable copro. traps to EL2

    #ifdef CONFIG_COMPAT
        msr hstr_el2, xzr           // Disable CP15 traps to EL2
    #endif

        /* Stage-2 translation */
        msr vttbr_el2, xzr

        /* Hypervisor stub */
        adrp    x0, __hyp_stub_vectors
        add x0, x0, #:lo12:__hyp_stub_vectors
        msr vbar_el2, x0

        /* spsr */
        mov x0, #(PSR_F_BIT | PSR_I_BIT | PSR_A_BIT | PSR_D_BIT |\
                  PSR_MODE_EL1h)
        msr spsr_el2, x0
        msr elr_el2, lr
        mov w20, #BOOT_CPU_MODE_EL2     // This CPU booted in EL2
        eret
    ENDPROC(el2_setup)
Then set_cpu_boot_mode_flag() sets the __boot_cpu_mode flag depending on the CPU boot mode passed in x20.
    /*
     * Sets the __boot_cpu_mode flag depending on the CPU boot mode passed
     * in x20. See arch/arm64/include/asm/virt.h for more info.
     */
    ENTRY(set_cpu_boot_mode_flag)
        adr_l   x1, __boot_cpu_mode
        cmp w20, #BOOT_CPU_MODE_EL2
        b.ne    1f
        add x1, x1, #4
    1:  str w20, [x1]           // This CPU has booted in EL1
        dmb sy
        dc  ivac, x1            // Invalidate potentially stale cache line
        ret
    ENDPROC(set_cpu_boot_mode_flag)

Creating Page Tables

The code that kick-starts the MMU setup is __create_page_tables() in linux-4.2/arch/arm64/kernel/head.S.
    /*
     * Setup the initial page tables. We only setup the barest amount which is
     * required to get the kernel running. The following sections are required:
     *   - identity mapping to enable the MMU (low address, TTBR0)
     *   - first few MB of the kernel linear mapping to jump to once the MMU has
     *     been enabled
     */
    __create_page_tables:
        adrp    x25, idmap_pg_dir
        adrp    x26, swapper_pg_dir
        mov x27, lr

        /*
         * Invalidate the idmap and swapper page tables to avoid potential
         * dirty cache lines being evicted.
         */
        mov x0, x25
        add x1, x26, #SWAPPER_DIR_SIZE
        bl  __inval_cache_range

        /*
         * Clear the idmap and swapper page tables.
         */
        mov x0, x25
        add x6, x26, #SWAPPER_DIR_SIZE
    1:  stp xzr, xzr, [x0], #16
        stp xzr, xzr, [x0], #16
        stp xzr, xzr, [x0], #16
        stp xzr, xzr, [x0], #16
        cmp x0, x6
        b.lo    1b

        ldr x7, =MM_MMUFLAGS

        /*
         * Create the identity mapping.
         */
        mov x0, x25             // idmap_pg_dir
        adrp    x3, __idmap_text_start      // __pa(__idmap_text_start)

    #ifndef CONFIG_ARM64_VA_BITS_48
    #define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
    #define EXTRA_PTRS  (1 << (48 - EXTRA_SHIFT))

        /*
         * If VA_BITS < 48, it may be too small to allow for an ID mapping to be
         * created that covers system RAM if that is located sufficiently high
         * in the physical address space. So for the ID map, use an extended
         * virtual range in that case, by configuring an additional translation
         * level.
         * First, we have to verify our assumption that the current value of
         * VA_BITS was chosen such that all translation levels are fully
         * utilised, and that lowering T0SZ will always result in an additional
         * translation level to be configured.
         */
    #if VA_BITS != EXTRA_SHIFT
    #error "Mismatch between VA_BITS and page size/number of translation levels"
    #endif

        /*
         * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
         * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
         * this number conveniently equals the number of leading zeroes in
         * the physical address of __idmap_text_end.
         */
        adrp    x5, __idmap_text_end
        clz x5, x5
        cmp x5, TCR_T0SZ(VA_BITS)   // default T0SZ small enough?
        b.ge    1f          // .. then skip additional level

        adr_l   x6, idmap_t0sz
        str x5, [x6]
        dmb sy
        dc  ivac, x6        // Invalidate potentially stale cache line

        create_table_entry x0, x3, EXTRA_SHIFT, EXTRA_PTRS, x5, x6
    1:
    #endif

        create_pgd_entry x0, x3, x5, x6
        mov x5, x3              // __pa(__idmap_text_start)
        adr_l   x6, __idmap_text_end        // __pa(__idmap_text_end)
        create_block_map x0, x7, x3, x5, x6

        /*
         * Map the kernel image (starting with PHYS_OFFSET).
         */
        mov x0, x26             // swapper_pg_dir
        mov x5, #PAGE_OFFSET
        create_pgd_entry x0, x5, x3, x6
        ldr x6, =KERNEL_END         // __va(KERNEL_END)
        mov x3, x24             // phys offset
        create_block_map x0, x7, x3, x5, x6

        /*
         * Since the page tables have been populated with non-cacheable
         * accesses (MMU disabled), invalidate the idmap and swapper page
         * tables again to remove any speculatively loaded cache lines.
         */
        mov x0, x25
        add x1, x26, #SWAPPER_DIR_SIZE
        dmb sy
        bl  __inval_cache_range

        mov lr, x27
        ret
    ENDPROC(__create_page_tables)

Perform Identity Mapping

There is a git commit, "ARM: idmap: populate identity map pgd at init time using .init.text", whose message contains the following comments:
When disabling and re-enabling the MMU, it is necessary to take out an identity mapping for the code that manipulates the SCTLR in order to avoid it disappearing from under our feet. This is useful when soft rebooting and returning from CPU suspend.
This patch allocates a set of page tables during boot and populates them with an identity mapping for the .idmap.text section. This means that users of the identity map do not need to manage their own pgd and can instead annotate their functions with __idmap or, in the case of assembly code, place them in the correct section.
To understand why identity mapping is required, suppose that before the MMU is turned on the PC is at physical address XXX, and that the instruction at XXX enables the MMU. The next instruction fetch is from XXX + 4, which is now a VIRTUAL address (the MMU is on), and which must still be translated (via the page tables set up as above) to XXX + 4 in physical memory. This is the identity mapping referred to in the comments.
When the kernel starts, the MMU is off, and the ARM core runs with an implicit identity mapping (i.e. each virtual address maps to the same physical address). If your physical memory starts at 0x80000000, then the PC will be 0x800xxxxx. When the MMU is turned on, the PC is still at 0x800xxxxx, so even though the kernel has 0xc00xxxxx mapped to 0x800xxxxx it also has to have 0x800xxxxx mapped to 0x800xxxxx. This mapping of 0x800xxxxx to 0x800xxxxx is the "identity" portion and is needed while switching the MMU on. The 0xc00xxxxx to 0x800xxxxx mapping is what is used while the kernel is running. For 64-bit systems, the 0xc00xxxxx addresses would instead be something like 0xffffffcxxxxxxxxx.
The IDMAP_TEXT, idmap_pg_dir and swapper_pg_dir are defined in linux-4.2/arch/arm64/kernel/vmlinux.lds.S.
    SECTIONS {
        . = PAGE_OFFSET + TEXT_OFFSET;

        .head.text : {
            _text = .;
            HEAD_TEXT
        }
    ...
        .text : {           /* Real text segment        */
            _stext = .;     /* Text and read-only data  */
                __exception_text_start = .;
                *(.exception.text)
                __exception_text_end = .;
                IRQENTRY_TEXT
                TEXT_TEXT
                SCHED_TEXT
                LOCK_TEXT
                HYPERVISOR_TEXT
                IDMAP_TEXT
                *(.fixup)
                *(.gnu.warning)
            . = ALIGN(16);
            *(.got)         /* Global offset table      */
        }

        BSS_SECTION(0, 0, 0)

        . = ALIGN(PAGE_SIZE);
        idmap_pg_dir = .;
        . += IDMAP_DIR_SIZE;
        swapper_pg_dir = .;
        . += SWAPPER_DIR_SIZE;
    ...
    }
The TEXT_OFFSET is generated in linux-4.2/arch/arm64/Makefile.
    # The byte offset of the kernel image in RAM from the start of RAM.
    ifeq ($(CONFIG_ARM64_RANDOMIZE_TEXT_OFFSET), y)
    TEXT_OFFSET := $(shell awk 'BEGIN {srand(); printf "0x%03x000\n", int(512 * rand())}')
    else
    TEXT_OFFSET := 0x00080000
    endif
The IDMAP_DIR_SIZE is defined in linux-4.2/arch/arm64/include/asm/page.h.
    /*
     * The idmap and swapper page tables need some space reserved in the kernel
     * image. Both require pgd, pud (4 levels only) and pmd tables to (section)
     * map the kernel. With the 64K page configuration, swapper and idmap need to
     * map to pte level. The swapper also maps the FDT (see __create_page_tables
     * for more information). Note that the number of ID map translation levels
     * could be increased on the fly if system RAM is out of reach for the default
     * VA range, so 3 pages are reserved in all cases.
     */
    #ifdef CONFIG_ARM64_64K_PAGES
    #define SWAPPER_PGTABLE_LEVELS  (CONFIG_PGTABLE_LEVELS)
    #else
    #define SWAPPER_PGTABLE_LEVELS  (CONFIG_PGTABLE_LEVELS - 1)
    #endif

    #define SWAPPER_DIR_SIZE    (SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
    #define IDMAP_DIR_SIZE      (3 * PAGE_SIZE)
The __create_page_tables() routine starts by loading the addresses of idmap_pg_dir and swapper_pg_dir into x25 and x26, then calls __inval_cache_range(start, end) to invalidate the idmap and swapper page tables, avoiding potential dirty cache lines being evicted. It then clears the idmap and swapper page tables. Note that because the MMU is still off at this point, the clearing stores are non-cacheable and do not allocate cache lines; invalidating first ensures that no stale dirty lines can later be evicted on top of the freshly cleared tables.
The IDMAP_TEXT in the linux-4.2/arch/arm64/kernel/vmlinux.lds.S is defined as below, which exports the __idmap_text_start (note the ALIGN(SZ_4K)) and __idmap_text_end.
    #define IDMAP_TEXT                  \
        . = ALIGN(SZ_4K);               \
        VMLINUX_SYMBOL(__idmap_text_start) = .;     \
        *(.idmap.text)                  \
        VMLINUX_SYMBOL(__idmap_text_end) = .;
With this, in __create_page_tables(), after invalidating and clearing idmap_pg_dir, the code maps the [__idmap_text_start .. __idmap_text_end] area with the idmap_pg_dir page tables.
If CONFIG_ARM64_VA_BITS_48 is not configured, which is the common default with either ARM64_4K_PAGES (39-bit VA) or ARM64_64K_PAGES (42-bit VA), then __create_page_tables() configures an additional translation level whenever the default virtual range is too small to allow an ID mapping that covers system RAM located sufficiently high in the physical address space.
    choice
        prompt "Virtual address space size"
        default ARM64_VA_BITS_39 if ARM64_4K_PAGES
        default ARM64_VA_BITS_42 if ARM64_64K_PAGES
        help
          Allows choosing one of multiple possible virtual address
          space sizes. The level of translation table is determined by
          a combination of page size and virtual address space size.

    config ARM64_VA_BITS_39
        bool "39-bit"
        depends on ARM64_4K_PAGES

    config ARM64_VA_BITS_42
        bool "42-bit"
        depends on ARM64_64K_PAGES

    config ARM64_VA_BITS_48
        bool "48-bit"

    endchoice
Consider the call create_table_entry x0, x3, EXTRA_SHIFT, EXTRA_PTRS, x5, x6, with the parameters described below.
  • x0 is the address of idmap_pg_dir.
  • x3 is the physical address of __idmap_text_start, i.e. __pa(__idmap_text_start).
  • EXTRA_SHIFT is defined as (PGDIR_SHIFT + PAGE_SHIFT - 3).
  • EXTRA_PTRS is defined as (1 << (48 - EXTRA_SHIFT)).
  • x5 is the maximum allowed value for TCR_EL1.T0SZ such that the entire ID map region can be mapped. As T0SZ == (64 - #bits used), this number conveniently equals the number of leading zeroes in the physical address of __idmap_text_end.
  • x6 is the address of the variable u64 idmap_t0sz = TCR_T0SZ(VA_BITS); defined in linux-4.2/arch/arm64/mm/mmu.c; the variable is updated to the value in x5.
    /*
     * Macro to create a table entry to the next page.
     *
     *  tbl:    page table address
     *  virt:   virtual address
     *  shift:  #imm page table shift
     *  ptrs:   #imm pointers per table page
     *
     * Preserves:   virt
     * Corrupts:    tmp1, tmp2
     * Returns: tbl -> next level table page address
     */
        .macro  create_table_entry, tbl, virt, shift, ptrs, tmp1, tmp2
        lsr \tmp1, \virt, #\shift
        and \tmp1, \tmp1, #\ptrs - 1    // table index
        add \tmp2, \tbl, #PAGE_SIZE
        orr \tmp2, \tmp2, #PMD_TYPE_TABLE   // address of next table and entry type
        str \tmp2, [\tbl, \tmp1, lsl #3]
        add \tbl, \tbl, #PAGE_SIZE      // next level table page
        .endm
According to the macro above, it does the following:
  • Calculates a table index into tbl (idmap_pg_dir in this call) from the virt parameter, according to the shift parameter (EXTRA_SHIFT in this call).
  • Computes the address of the page following tbl (idmap_pg_dir in this call), which becomes the next-level table page.
  • Sets the entry at that table index in tbl to point to this next-level table page.
  • Marks that entry as PMD_TYPE_TABLE.
  • Advances the tbl register (x0 in this call) to point to the next-level table page.
Execution then reaches create_pgd_entry x0, x3, x5, x6, with the parameters described below.
  • x0 points into idmap_pg_dir (advanced by one page if the extra translation level was created).
  • x3 is the physical address of __idmap_text_start, i.e. __pa(__idmap_text_start).
  • x5 is used as the 1st temporary register.
  • x6 is used as the 2nd temporary register.
    /*
     * Macro to populate the PGD (and possibily PUD) for the corresponding
     * block entry in the next level (tbl) for the given virtual address.
     *
     * Preserves:   tbl, next, virt
     * Corrupts:    tmp1, tmp2
     */
        .macro  create_pgd_entry, tbl, virt, tmp1, tmp2
        create_table_entry \tbl, \virt, PGDIR_SHIFT, PTRS_PER_PGD, \tmp1, \tmp2
    #if SWAPPER_PGTABLE_LEVELS == 3
        create_table_entry \tbl, \virt, TABLE_SHIFT, PTRS_PER_PTE, \tmp1, \tmp2
    #endif
        .endm
Note that in the previous calls to create_table_entry, the virt parameter is actually the physical address of __idmap_text_start. This is intentional: identity mapping the [__idmap_text_start .. __idmap_text_end] area means mapping each virtual address to the identical physical address, so the physical address is used as the virtual address while building the tables.
We then get to the create_block_map x0, x7, x3, x5, x6 macro call, with the following parameters:
  • x0 points to the lowest level of the idmap_pg_dir tables after the two or three create_table_entry() calls.
  • x7 is loaded with ldr x7, =MM_MMUFLAGS, so it holds PMD_ATTRINDX(MT_NORMAL) | PMD_FLAGS (or the PTE variants with 64K pages).
  • x3 is the physical address of __idmap_text_start, i.e. __pa(__idmap_text_start).
  • x5 is the same as x3, so it is also the physical address of __idmap_text_start (for the identity map, the virtual start equals the physical start).
  • x6 is the physical address of __idmap_text_end, i.e. __pa(__idmap_text_end).
    /*
     * Macro to populate block entries in the page table for the start..end
     * virtual range (inclusive).
     *
     * Preserves:   tbl, flags
     * Corrupts:    phys, start, end, pstate
     */
        .macro  create_block_map, tbl, flags, phys, start, end
        lsr \phys, \phys, #BLOCK_SHIFT
        lsr \start, \start, #BLOCK_SHIFT
        and \start, \start, #PTRS_PER_PTE - 1   // table index
        orr \phys, \flags, \phys, lsl #BLOCK_SHIFT  // table entry
        lsr \end, \end, #BLOCK_SHIFT
        and \end, \end, #PTRS_PER_PTE - 1       // table end index
    9999:   str \phys, [\tbl, \start, lsl #3]       // store the entry
        add \start, \start, #1          // next entry
        add \phys, \phys, #BLOCK_SIZE       // next block
        cmp \start, \end
        b.ls    9999b
        .endm
According to the macro above, it does the following:
  • Generates the initial phys entry with the flags ORed in place.
  • Calculates the starting table index from start.
  • Calculates the ending table index from end.
  • Loops to store one entry per block, incrementing phys by BLOCK_SIZE each iteration (the comparison is b.ls, so the end index is inclusive).
The corresponding definitions are as below.
    #ifdef CONFIG_ARM64_64K_PAGES
    #define BLOCK_SHIFT PAGE_SHIFT
    #define BLOCK_SIZE  PAGE_SIZE
    #define TABLE_SHIFT PMD_SHIFT
    #else
    #define BLOCK_SHIFT SECTION_SHIFT
    #define BLOCK_SIZE  SECTION_SIZE
    #define TABLE_SHIFT PUD_SHIFT
    #endif

    #define KERNEL_START    _text
    #define KERNEL_END  _end

    /*
     * Initial memory map attributes.
     */
    #ifndef CONFIG_SMP
    #define PTE_FLAGS   PTE_TYPE_PAGE | PTE_AF
    #define PMD_FLAGS   PMD_TYPE_SECT | PMD_SECT_AF
    #else
    #define PTE_FLAGS   PTE_TYPE_PAGE | PTE_AF | PTE_SHARED
    #define PMD_FLAGS   PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S
    #endif

    #ifdef CONFIG_ARM64_64K_PAGES
    #define MM_MMUFLAGS PTE_ATTRINDX(MT_NORMAL) | PTE_FLAGS
    #else
    #define MM_MMUFLAGS PMD_ATTRINDX(MT_NORMAL) | PMD_FLAGS
    #endif

Perform Kernel Mapping

The kernel swapper_pg_dir mapping is initialized as below.
        /*
         * Map the kernel image (starting with PHYS_OFFSET).
         */
        mov x0, x26             // swapper_pg_dir
        mov x5, #PAGE_OFFSET
        create_pgd_entry x0, x5, x3, x6
        ldr x6, =KERNEL_END         // __va(KERNEL_END)
        mov x3, x24             // phys offset
        create_block_map x0, x7, x3, x5, x6
The create_pgd_entry x0, x5, x3, x6 call has the parameters described below.
  • x0 is the virtual address of swapper_pg_dir.
  • x5 is the virtual address at PAGE_OFFSET which is defined as (UL(0xffffffffffffffff) << (VA_BITS - 1)) in linux-4.2/arch/arm64/include/asm/memory.h.
  • x3 is used as 1st temporary register.
  • x6 is used as 2nd temporary register.
The create_block_map x0, x7, x3, x5, x6 call has the parameters described below.
  • x0 points to the lowest level of the swapper_pg_dir tables after the two create_table_entry() calls made in create_pgd_entry().
  • x7 is loaded with ldr x7, =MM_MMUFLAGS, so it holds PMD_ATTRINDX(MT_NORMAL) | PMD_FLAGS (or the PTE variants with 64K pages).
  • x3 is loaded from x24, which was set by adrp x24, __PHYS_OFFSET, i.e. the physical address of (KERNEL_START - TEXT_OFFSET).
  • x5 is PAGE_OFFSET, the virtual start address of the kernel mapping (note it is not the same as x3 here: start and end are virtual, while phys supplies the physical side).
  • x6 is the virtual address of KERNEL_END, i.e. __va(KERNEL_END).
This creates the mapping between the virtual address PAGE_OFFSET and the physical address of KERNEL_START, covering the kernel image up to KERNEL_END.

References: