The LLVM compiler infrastructure project (formerly Low Level Virtual Machine) is a "collection of modular and reusable compiler and toolchain technologies" used to develop compiler front ends and back ends.
LLVM is written in C++ and is designed for compile-time, link-time, run-time, and "idle-time" optimization of programs written in arbitrary programming languages. Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#,[4][5][6] Common Lisp, Crystal, D, Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode, Julia, Lua, Objective-C, Pony,[7] Python, R, Ruby, Rust, CUDA, Scala,[8] and Swift.
The LLVM project started in 2000 at the University of Illinois at Urbana–Champaign, under the direction of Vikram Adve and Chris Lattner. LLVM was originally developed as a research infrastructure to investigate dynamic compilation techniques for static and dynamic programming languages. LLVM was released under the University of Illinois/NCSA Open Source License,[2] a permissive free software licence. In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM system for various uses within Apple's development systems.[9] LLVM is an integral part of Apple's latest development tools for OS X and iOS.[10] Since 2013, Sony has been using LLVM's primary front end Clang compiler in the software development kit (SDK) of its PS4 console.[11]
The name LLVM was originally an initialism for Low Level Virtual Machine, but this became increasingly less apt as LLVM became an "umbrella project" that included a variety of other compiler and low-level tool technologies, so the project abandoned the initialism.[12] Now, LLVM is a brand that applies to the LLVM umbrella project, the LLVM intermediate representation (IR), the LLVM debugger, the LLVM C++ Standard Library (with full support of C++11 and C++14[13]), etc. LLVM is administered by the LLVM Foundation. Its president is compiler engineer Tanya Lattner.
A Quick Introduction to Classical Compiler Design
The most popular design for a traditional static compiler (like most C compilers) is the three-phase design, whose major components are the front end, the optimizer, and the back end. The front end parses source code, checking it for errors, and builds a language-specific Abstract Syntax Tree (AST) to represent the input code. The AST is optionally converted to a new representation for optimization, and the optimizer and back end are run on the code.
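To make this concrete, here is a hypothetical example (the function and the tree sketch below are illustrative, loosely modeled on a Clang-style AST dump, not taken from any compiler's actual output):

    /* The front end tokenizes and parses this definition, checks it for
       errors, and builds a tree of AST nodes roughly like:

         FunctionDecl square 'int (int)'
           ReturnStmt
             BinaryOperator '*'
               DeclRefExpr x
               DeclRefExpr x

       A language-specific tree like this is what gets lowered into the
       representation the optimizer works on. */
    int square(int x) {
        return x * x;
    }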
Three Major Components of a Three-Phase Compiler
The optimizer is responsible for doing a broad variety of transformations to try to improve the code's running time, such as eliminating redundant computations, and is usually more or less independent of language and target. The back end (also known as the code generator) then maps the code onto the target instruction set. In addition to making correct code, it is responsible for generating good code that takes advantage of unusual features of the supported architecture. Common parts of a compiler back end include instruction selection, register allocation, and instruction scheduling.

This model applies equally well to interpreters and JIT compilers. The Java Virtual Machine (JVM) is also an implementation of this model, which uses Java bytecode as the interface between the front end and optimizer.
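To make the optimizer's job concrete, here is a small, hedged example (the function is illustrative; exact behavior depends on which passes run and at what optimization level):

    /* A language- and target-independent optimizer can prove that the
       two occurrences of x * y compute the same value and evaluate the
       multiply once (common subexpression elimination). The back end
       then handles instruction selection, register allocation, and
       scheduling for whatever target is in use. */
    int f(int x, int y) {
        int a = x * y + 1;
        int b = x * y + 2;  /* redundant multiply, folded into one */
        return a + b;
    }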
Implications of this Design
The most important win of this classical design comes when a compiler decides to support multiple source languages or target architectures. If the compiler uses a common code representation in its optimizer, then a front end can be written for any language that can compile to it, and a back end can be written for any target that can compile from it.
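As a rough sketch of what a shared representation buys you (the file names here are made up, and the exact tool flags vary between LLVM releases):

    /* add.c -- one front end, one optimizer, many back ends:

         clang -S -emit-llvm add.c -o add.ll    # front end emits LLVM IR
         llc -march=x86-64 add.ll -o add_x86.s  # x86 back end
         llc -march=arm add.ll -o add_arm.s     # ARM back end

       The same optimized IR feeds both code generators; only the back
       end changes per target. */
    int add(int a, int b) {
        return a + b;
    }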
Retargetability
With this design, porting the compiler to support a new source language (e.g., Algol or BASIC) requires implementing a new front end, but the existing optimizer and back end can be reused. If these parts weren't separated, implementing a new source language would require starting over from scratch, so supporting N targets and M source languages would need N*M compilers.

Another advantage of the three-phase design (which follows directly from retargetability) is that the compiler serves a broader set of programmers than it would if it only supported one source language and one target. For an open source project, this means that there is a larger community of potential contributors to draw from, which naturally leads to more enhancements and improvements to the compiler. This is the reason why open source compilers that serve many communities (like GCC) tend to generate better optimized machine code than narrower compilers like FreePASCAL. This isn't the case for proprietary compilers, whose quality is directly related to the project's budget. For example, the Intel ICC Compiler is widely known for the quality of code it generates, even though it serves a narrow audience.
A final major win of the three-phase design is that the skills required to implement a front end are different than those required for the optimizer and back end. Separating these makes it easier for a "front-end person" to enhance and maintain their part of the compiler. While this is a social issue, not a technical one, it matters a lot in practice, particularly for open source projects that want to reduce the barrier to contributing as much as possible.
When and Where Each Phase Runs
As mentioned earlier, LLVM IR can be efficiently (de)serialized to/from a binary format known as LLVM bitcode. Since LLVM IR is self-contained, and serialization is a lossless process, we can do part of compilation, save our progress to disk, then continue work at some point in the future. This feature provides a number of interesting capabilities, including support for link-time and install-time optimization, both of which delay code generation from "compile time".

Link-Time Optimization (LTO) addresses the problem where the compiler traditionally only sees one translation unit (e.g., a .c file with all its headers) at a time and therefore cannot do optimizations (like inlining) across file boundaries. LLVM compilers like Clang support this with the -flto or -O4 command line option. This option instructs the compiler to emit LLVM bitcode to the .o file instead of writing out a native object file, and delays code generation to link time.
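As a minimal sketch (the file names are made up, and driver details differ by platform), the cross-file inlining that LTO enables looks like this:

    /* util.c -- built with: clang -flto -c util.c -o util.o
       The .o file now holds LLVM bitcode rather than native code. */
    int square(int x) { return x * x; }

    /* main.c -- built with: clang -flto -c main.c -o main.o
       Without LTO the compiler cannot inline square() here, because its
       body lives in a different translation unit. Linking with
       "clang -flto util.o main.o -o app" runs the LLVM optimizer over
       both files, so this call can be inlined (and folded to 49). */
    int square(int x);
    int main(void) { return square(7); }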
Link-Time Optimization
Details differ depending on which operating system you're on, but the important bit is that the linker detects that it has LLVM bitcode in the .o files instead of native object files. When it sees this, it reads all the bitcode files into memory, links them together, then runs the LLVM optimizer over the aggregate. Since the optimizer can now see across a much larger portion of the code, it can inline, propagate constants, do more aggressive dead code elimination, and more across file boundaries. While many modern compilers support LTO, most of them (e.g., GCC, Open64, the Intel compiler, etc.) do so by having an expensive and slow serialization process. In LLVM, LTO falls out naturally from the design of the system, and it works across different source languages (unlike many other compilers) because the IR is truly source language neutral.

Install-time optimization is the idea of delaying code generation even later than link time, all the way to install time. Install time is a very interesting time (in cases when software is shipped in a box, downloaded, uploaded to a mobile device, etc.), because this is when you find out the specifics of the device you're targeting. In the x86 family, for example, there is a broad variety of chips and characteristics. By delaying instruction choice, scheduling, and other aspects of code generation, you can pick the best answers for the specific hardware an application ends up running on.
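A rough sketch of the idea using LLVM's standalone tools (the commands are illustrative; a real install-time system would wrap this in its packaging pipeline):

    /* Build time: ship LLVM bitcode instead of a native binary.
         clang -c -emit-llvm scale.c -o scale.bc
       Install time, on the user's machine: generate native code tuned
       to the CPU that is actually present.
         llc -mcpu=native scale.bc -o scale.s
       Instruction selection and scheduling for a loop like the one
       below can then match the installed chip rather than a lowest
       common denominator. */
    void scale(float *a, float n, int len) {
        for (int i = 0; i < len; i++)
            a[i] *= n;
    }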