CPU architectures often gain interesting new instructions as they evolve, but application developers often find it difficult to take advantage of those instructions. Reluctance to lose backward-compatibility is one of the main roadblocks slowing developers from using advancements in newer computing architectures. Function multi-versioning (FMV), which first appeared in GCC 4.8, is a way to have multiple implementations of a function, each using a different architecture's specialized instruction-set extensions. GCC 6 introduces changes to FMV that make it even easier to bring architecture-based optimizations to application code.
Despite the fact that newer versions of GCC and the kernel attempt to expose tools for using new architecture features before the platforms appear in the market, it can be tough for developers to start using those features when they become available. Currently, C code developers have a few choices:
- Write multiple versions of their code, each targeting a different instruction-set extension; this requires that they also handle runtime dispatching between those versions by hand (a minimal sketch of this approach appears after this list).
- Generate multiple versions of their binary, each targeting a different platform.
- Choose a minimum hardware requirement that doesn't take advantage of the technology in newer platforms.
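As a rough illustration of the first option, a developer can guard hand-written function versions with GCC's CPU-detection builtins. This is only a minimal sketch; the add_arrays_*() names are made up for the example:

    #include <stddef.h>

    /* Version compiled with AVX2 enabled; with optimization turned on, the
       compiler can vectorize this loop using 256-bit registers. */
    __attribute__((target("avx2")))
    static void add_arrays_avx2(int *dst, const int *a, const int *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Baseline version that runs on any supported x86 CPU. */
    static void add_arrays_default(int *dst, const int *a, const int *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* The caller is responsible for the runtime dispatch. */
    void add_arrays(int *dst, const int *a, const int *b, size_t n)
    {
        if (__builtin_cpu_supports("avx2"))
            add_arrays_avx2(dst, a, b, n);
        else
            add_arrays_default(dst, a, b, n);
    }

Every new function version multiplies this kind of boilerplate, which is exactly the maintenance burden FMV is meant to remove.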
Often, the benefits of using the new architecture's technologies are compelling enough to outweigh integration challenges. Math-heavy code, for example, can be significantly optimized by turning on the Intel Advanced Vector Extensions (AVX). The second version of AVX (AVX2), which was introduced in the 4th-generation Intel Core processor family (also known as "Haswell"), is one option. The benefits of AVX2 are well-understood in scientific computing fields. The use of AVX2 in the OpenBLAS library can make a project like the R language run up to twice as fast; it can also yield significant improvements in Python scientific libraries. These performance improvements are gained by doubling the number of floating-point operations per second (FLOPS), using 256-bit integer instructions, floating-point fused multiply-add instructions, and gather operations.
However, the use of AVX technology means a lot of work in terms of development, deployment, and manageability. The idea of having to maintain multiple versions of binaries (one for each architecture) discourages developers and distributions from supporting these features.
Would it be better to optimize some key functions for multiple architectures, then execute them when the binary detects the CPU capabilities at runtime? A feature to do that, FMV, has actually existed since GCC 4.8, but only for C++. FMV in GCC 4.8 made it easy for a developer to specify multiple versions of a function; each could be optimized for a specific target instruction-set feature. GCC would then take care of creating the dispatching code necessary to execute the right function version.
To use FMV in C++ code, the user would specify multiple versions of a function. For example, the code presented in the FMV documentation for GCC 4.8 shows:
__attribute__ ((target ("sse4.2"))) int foo(){ // foo version for SSE4.2 return 1; } __attribute__ ((target ("arch=atom"))) int foo(){ // foo version for the Intel Atom processor return 2; } int main() { int (*p)() = &foo; assert((*p)() == foo()); return 0; }
The target() directives will compile the functions for instruction-set extensions (e.g. sse4.2) or for specific architectures (e.g. arch=atom).
Here, the developer needed to write a separate implementation of the function for each target. That adds overhead to the code; the extra lines that FMV requires make a program clumsier to manage and maintain.
Fortunately, GCC 6 solves this problem: it supports FMV in both C and C++ code with a single attribute to define the minimum set of architectures to support. This makes it easier to develop Linux applications that can take advantage of enhanced instructions, without the overhead of replicating functions for each target.
A simple example of using FMV to take advantage of AVX can be shown using array addition (array_addition.c for this example):
    #define MAX 1000000
    int a[256], b[256], c[256];

    __attribute__((target_clones("avx2","arch=atom","default")))
    void foo(){
        int i,x;
        for (x=0; x<MAX; x++){
            for (i=0; i<256; i++){
                a[i] = b[i] + c[i];
            }
        }
    }

    int main() {
        foo();
        return 0;
    }
As we can see, the selection of the supported architectures is pretty simple using the target_clones() directive. The developer needs only to select the minimum set of architectures or instruction-set extensions to support: AVX2, Intel Atom, AMD, or almost any architecture option that GCC accepts from the command line. The compiler will create multiple versions of the function targeting the specified instruction sets and the right one will be chosen at runtime.
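Note that, with target_clones(), no architecture-specific flags are needed on the command line; an ordinary build is enough, since the compiler generates the clones and the dispatching code itself. Something along these lines (the output name is arbitrary) suffices:

    gcc -O3 array_addition.c -o array_addition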
Ultimately, the object dump of this code will have the optimized assembly instructions for each architecture. For example:
No AVX code (Atom):

    add    %eax,%edx

AVX:

    vpaddd 0x0(%rax),%xmm0,%xmm0

AVX2:

    vpaddd (%r9,%rax,1),%ymm0,%ymm0
Notice that the new implementation of FMV provides array_addition.c with the ability to use registers and instructions for Intel AVX, AVX2, and even Atom platforms. This capability increases the range of platforms where the application can run without illegal-instruction errors.
Before GCC 6, telling the compiler to use Intel AVX2 instructions would limit the compatibility of the binary to only Haswell and newer processors. With the added features in FMV, the compiler can also generate AVX-optimized versions of the code; at runtime, it will automatically ensure that only the appropriate versions are used. In other words, when the binary is run on Haswell or later generation CPUs, it will use Haswell-specific optimizations; when that same binary is run on a pre-Haswell generation processor, it will fall back to using the standard instructions supported by the older processor.
CPUID selection
In GCC 4.8, FMV had a dispatch priority rather than a CPUID selection. The dispatch order was prioritized for each function version based on the target attributes. Function versions with more advanced features got higher priority. For example, a version targeted for AVX2 would have a higher dispatch priority than a version targeted for SSE2.
To keep the cost of dispatching low, the indirect function (ifunc) mechanism is used. That mechanism is a feature of the GNU toolchain that allows a developer to create multiple implementations of a given function and to select among them at runtime using a resolver function. The resolver function is called by the dynamic loader during early startup to resolve which of the implementations will be used by the application. Once an implementation choice is made, it is fixed and may not be changed for the lifetime of the process.
In GCC 6, the resolver checks the CPUID and then calls the corresponding function. It does this once per binary execution. So when there are multiple calls to the FMV function, only the first call will execute the CPUID comparison; the subsequent calls will find the required version by a pointer. This technique is already used for almost all glibc functions. For example, glibc has memcpy() optimized for each architecture, so when it is called, glibc will call the proper optimized memcpy().
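To make that mechanism more concrete, here is a minimal hand-written sketch of the kind of dispatching that GCC and glibc arrange. The names (foo_avx2(), resolve_foo(), and so on) are illustrative rather than what the compiler actually emits, and the example assumes a GNU toolchain and glibc with ifunc support:

    #include <stdio.h>

    static int foo_avx2(void)    { return 2; }   /* version picked on AVX2-capable CPUs */
    static int foo_default(void) { return 1; }   /* fallback for everything else */

    /* The resolver runs once, when the dynamic loader resolves foo(); the
       implementation it returns is the one the symbol stays bound to for
       the lifetime of the process. */
    static int (*resolve_foo(void))(void)
    {
        __builtin_cpu_init();               /* make the CPUID data available */
        if (__builtin_cpu_supports("avx2"))
            return foo_avx2;
        return foo_default;
    }

    /* Bind foo() to whichever implementation the resolver returns. */
    int foo(void) __attribute__((ifunc("resolve_foo")));

    int main(void)
    {
        printf("selected implementation returned %d\n", foo());
        return 0;
    }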
Code size impact
FMV will increase the binary code size, but that impact can be minimized. The increase depends on the size of the functions to which FMV is applied and on the number of requested versions. If C is the initial binary code size, N is the number of requested versions (including the default), and R is the ratio of those functions' size to the whole application code size, the new code size will be:
(1 - R) * C + R * C * N
For an application where the hottest part of the code takes 1% of the whole code size, applying FMV to support three targets (default, sse4.2, avx2) increases the overall code size by 2%. That is a fairly small impact when considering the capacity of storage today. But that impact must be considered based on the deployment model. It is a tradeoff of performance and maintainability against the increased binary size, so FMV may not be the right choice for certain types of deployments (e.g. Internet of Things devices).
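Spelling out the arithmetic for that example, with R = 0.01 and N = 3:

    (1 - 0.01) * C + 0.01 * C * 3 = 0.99 * C + 0.03 * C = 1.02 * C

which is the 2% increase mentioned above.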
Results
The following table shows the execution time of running array_addition.c on various processors with different GCC flags:
Execution time (ms):

| GCC flags | Haswell | Skylake | Broadwell | Xeon | Atom | Ivy Bridge |
|---|---|---|---|---|---|---|
| None | 603 | 645 | 580 | 1413 | 2369 | 517 |
| -O3 | 38 | 44 | 37 | 107 | 96 | 60 |
| -O3 -mavx | 26 | 32 | 26 | 73 | SIGILL | 45 |
| -O3 -mavx2 | 26 | 32 | 26 | 73 | SIGILL | SIGILL |
| -O3 (with FMV) | 26 | 32 | 26 | 73 | 96 | 45 |
The FMV version used the following directive:
__attribute__((target_clones("avx2","arch=atom","default")))The "SIGILL" entries indicate illegal instructions for some of the combinations. The default CFLAGS (which are nothing particularly noteworthy) and configurations are specified as part of the Clear Linux for Intel Architecture project.
Real-world example
Today, more and more industry segments are benefiting from the use of cloud-based scientific computing. These segments include chemical engineering, digital content creation, financial services, and analytics applications. One of the more popular scientific-computing libraries is the NumPy library for Python. It includes support for large, multi-dimensional arrays and matrices. It also has special features for linear algebra, Fourier transforms, and random number generation, among others.
The advantages of using FMV technology in a scientific library such as NumPy are generally well-understood and accepted. If vectorization is not enabled, a lot of unused space in SIMD registers goes to waste. If vectorization is enabled, the compiler gets to use the additional registers to perform more operations (such as for the addition of more integers in our example) in a single instruction.
The performance boost due to FMV technology (running on a Haswell machine with AVX2 instructions) can be up to 3 percent in terms of execution time for scientific workloads. We used the OpenBenchmarking.org numpy-1.0.2 test on a 1.8GHz Skylake system, which resulted in a runtime of 8400 seconds using FMV, compared to 8600 seconds when compiled with -O3.
This performance improvement is due to the functions in NumPy that benefit from vectorization. To detect these functions, GCC provides the -fopt-info-vec flag, which reports the loops that are candidates for vectorization. For example, building NumPy with this flag will tell us that the file fftpack.c has code that can use vectorization:
numpy/fft/fftpack.c:813:7: note: loop peeled for vectorization to enhance alignment
Looking at the NumPy source code will show that the radfg() function, which is part of the fast Fourier transform (FFT) support in NumPy, performs heavy array additions that can be optimized using AVX. The patches for NumPy are not upstream yet, but should be headed that way soon.
Next steps
The Clear Linux project is currently focusing on applying FMV technology to packages where it is detected that AVX instructions can yield an improvement. To solve some of the issues involved with supporting FMV in a full Linux distribution, the project provides a patch generator based on vectorization-candidate detection (using the -fopt-info-vec flag). This tool can provide all the FMV patches that a Linux distribution might use. Clear Linux is selecting the ones that give a significant performance improvement based on a set of benchmarks.