Compiler tune benchmarks with the Yocto Project

Introduction

This work is sponsored by Reliable Embedded Systems. You can find more information about our training/consulting services here.

Objectives

The goal of this blog post is to look at the mystery of compiler tunes. Since I'm a non-believer in micro-optimizations, I will run some benchmarks built with different compiler tunes on the same boards, both natively and inside a Docker container. Since I have plenty of i.mx6 quad-core boards, I will use them for most of the tests; I might run a few tests on other boards as well.

We'll only look at CPU-bound calculations for now and try to answer a few questions:

  • What's the difference between armv7a soft float and armv7a hard float?
  • What's the difference between armv7a hard float and code optimized for a cortexa9 hard float?
  • What's the difference between different machines in general - ignoring the compiler tunes?
  • Does code with the same compiler tunes run slower inside a Docker container than on the native host?
  • Does code built with different compiler tunes run slower, or even faster, inside a Docker container?

Compiler tunes and docker

This is what multi-architecture Docker images sometimes have on offer with respect to "tuned" arm containers: arm32v5, arm32v6, arm32v7, arm64v8.

It's not entirely clear to me what exactly this means with respect to compiler tunes, but as far as I can see, for the arm32 chips I care about this means arm32v7, and for aarch64 chips arm64v8.
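
Just for illustration, selecting one of those variants explicitly looks roughly like this (the image name is only an example; arm32v7 is the corresponding Docker Hub organization):

# pull the arm32v7 variant of an official image explicitly
docker pull arm32v7/debian

# or, with newer Docker versions, select the platform on a multi-arch tag
docker pull --platform linux/arm/v7 debian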

We consider it a crucial feature of the Yocto Project that you can fine-tune the compiler tunes per board/processor, but would it be enough to do something like the following?

# This function changes the default tune for machines which
# are based on armv7a to use common tune value, note that we enforce hard-float
# which is default on Ångström for armv7+
# so if you have one of those machines which are armv7a but can't support
# hard-float, please change tune = 'armv7athf' to tune = 'armv7at'
# below but then this is for your own distro, Ångström will not support
# it
# - Khem

def arm_tune_handler(d):
    features = d.getVar('TUNE_FEATURES', True).split()
    if 'armv7a' in features or 'armv7ve' in features:
        tune = 'armv7athf'
        if 'bigendian' in features:
            tune += 'b'
        if 'neon' in features:
            tune += '-neon'
    else:
        tune = d.getVar('DEFAULTTUNE', True)
    return tune

DEFAULTTUNE_angstrom := "${@arm_tune_handler(d)}"
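
Note the immediate expansion operator := in the last line: the tune is computed once, at parse time, from the machine's TUNE_FEATURES, rather than every time DEFAULTTUNE is referenced.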

What do I mean by compiler tunes?

TUNE_FEATURES (copied from the Mega Manual):

Features used to "tune" a compiler for optimal use given a specific processor. The features are defined within the tune files and allow arguments (i.e. TUNE_*ARGS) to be dynamically generated based on the features.

The OpenEmbedded build system verifies the features to be sure they are not conflicting and that they are supported.

The BitBake configuration file (meta/conf/bitbake.conf) defines TUNE_FEATURES as follows:

     TUNE_FEATURES ??= "${TUNE_FEATURES_tune-${DEFAULTTUNE}}"
                    

See the DEFAULTTUNE variable for more information.
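
A quick way to see what these variables actually resolve to for your MACHINE is to dump BitBake's environment (core-image-minimal is just an example target):

$ bitbake -e core-image-minimal | grep -e '^DEFAULTTUNE=' -e '^TUNE_FEATURES=' -e '^TARGET_FPU='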

DEFAULTTUNE (copied from the Mega Manual):

The default CPU and Application Binary Interface (ABI) tunings (i.e. the "tune") used by the OpenEmbedded build system. The DEFAULTTUNE helps define TUNE_FEATURES.

The default tune is either implicitly or explicitly set by the machine (MACHINE). However, you can override the setting using available tunes as defined with AVAILTUNES.
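
As a sketch, overriding the default tune is a one-liner in conf/local.conf, as long as the tune you pick is listed in AVAILTUNES for your machine (cortexa9thf-neon is one of the tunes defined by the oe-core ARM tune files):

# conf/local.conf
DEFAULTTUNE = "cortexa9thf-neon"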

I will play around with those variables and see what happens.

Hardware

The cortex-a9 implements the ARMv7-A architecture, so one might wonder what those chip-specific compiler tunes are good for. Similarly, the cortex-a53 and the cortex-a72 implement ARMv8.0-A.

Benchmarks

I played around a bit with the Phoronix test suite, as you can read in another blog post, and took on the challenge of building several images with it (which need to contain a native SDK, since the test suite requires one), as well as OCI images with the Phoronix test suite inside, so I can run the tests from within a Docker container. Please note that my impression is that docker/container people typically ignore those compiler tunes, as you can see above. Rightfully so?

pts/scimark2

This test runs the ANSI C version of SciMark 2.0, which is a benchmark for scientific and numerical computing developed by programmers at the National Institute of Standards and Technology. This test is made up of Fast Fourier Transform, Jacobi Successive Over-relaxation, Monte Carlo, Sparse Matrix Multiply, and dense LU matrix factorization benchmarks.

armv7a soft float vs. hard float

armv7a soft float

This test was performed on an i.mx6 quad processor with very generic compiler tunes - in particular soft float, which is the most generic code I could think of, but apparently not good for floating-point operations.

TUNE_FEATURES        = "arm armv7a vfp neon"
TARGET_FPU           = "softfp"

Graphs are here.
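
A quick way to convince yourself which float ABI a given binary was actually built for is to look at its ARM attributes (the binary path is just an example):

# hard float (callconvention-hard) binaries carry Tag_ABI_VFP_args,
# softfp binaries do not
$ readelf -A /usr/bin/openssl | grep Tag_ABI_VFP_args
  Tag_ABI_VFP_args: VFP registers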

armv7a soft float vs. armv7a hard float

TUNE_FEATURES        = "arm armv7a vfp neon"
TARGET_FPU           = "softfp"

vs.

TUNE_FEATURES        = "arm armv7a vfp thumb callconvention-hard"
TARGET_FPU           = "hard"

Graphs are here, and yes, there is a significant improvement from soft to hard floating point.
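
Under the hood these TUNE_FEATURES end up as compiler flags in TUNE_CCARGS. Roughly - this is a sketch, the exact flags depend on the tune files in your oe-core revision:

# softfp tune, roughly:
#   TUNE_CCARGS = "-march=armv7-a -mfpu=neon -mfloat-abi=softfp"
# callconvention-hard tune, roughly:
#   TUNE_CCARGS = "-mthumb -march=armv7-a -mfpu=vfp -mfloat-abi=hard"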

armv7a soft float vs. armv7a hard float vs cortex-a9 hard float

TUNE_FEATURES        = "arm armv7a vfp neon"
TARGET_FPU           = "softfp"

vs.

TUNE_FEATURES        = "arm armv7a vfp thumb callconvention-hard"
TARGET_FPU           = "hard"

vs. 

TUNE_FEATURES        = "arm vfp cortexa9 neon thumb callconvention-hard"
TARGET_FPU           = "hard" 

Graphs are here, again with the significant change between soft and hard float, but with no apparent change between armv7a hard float and cortex-a9 hard float.
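
That result is perhaps less surprising if you look at what the cortexa9 tune actually changes on the compiler command line. Per the oe-core tune files it roughly boils down to swapping the generic -march for a CPU-specific -mcpu:

# generic armv7a hard float, roughly:  -march=armv7-a  -mfpu=vfp -mfloat-abi=hard
# cortexa9 hard float, roughly:        -mcpu=cortex-a9 -mfpu=vfp -mfloat-abi=hard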

different machines

armv7a soft float vs. cortex-a53 vs. cortex-a72

This test was performed on an i.mx6 quad, a cortex-a53 based board and a BCM2711.

i.mx6 quad:

TUNE_FEATURES        = "arm armv7a vfp neon"
TARGET_FPU           = "softfp"

cortex-a53 board:

TUNE_FEATURES        = "aarch64 cortexa53 crc crypto"
TARGET_FPU           = ""

BCM2711:

TUNE_FEATURES        = "aarch64 cortexa72 crc crypto"
TARGET_FPU           = ""

Graphs are here.

native vs. docker

armv7a soft float native vs. armv7a soft float container

This test was performed on an i.mx6 quad. It was run once natively, configured as armv7a soft float, and once inside a Docker container, also configured as armv7a soft float.

No apparent differences can be observed; graphs are here.
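
For reference, running the containerized test looks roughly like this (the image name is made up; batch-benchmark is a real phoronix-test-suite subcommand):

docker run --rm -it my-pts-image:armv7a-softfp \
    phoronix-test-suite batch-benchmark pts/scimark2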

as before plus armv7a soft float container over armv7a hard float native

The graphs are here, and armv7a hard float native does not show any difference if you run a soft float container on top of it.

as before plus armv7a soft float container over cortex-a9 hard float native

The graphs are here, and armv7a hard float native behaves identically to cortex-a9 hard float native if you run a soft float container on top of it.

armv7a soft float vs. armv7a hard float vs. armv7a hard float container over armv7a hard float 

Graphs are here. There is a major difference between soft and hard float, and the hard float container behaves like hard float native.

as above but now we run the hard float container over a soft float native system

Graphs are here. The results may seem interesting: here we can see that if we run a hard float container on top of a soft float native system, the container behaves like hard float running natively. This means all the runtime needed to perform hard float operations seems to be available inside the container. How about the kernel? Well, a container does not bring its own kernel; it runs on the host's kernel. Does our soft float kernel support NEON?

root@multi-v7-ml:/# zcat /proc/config.gz | grep NEON
CONFIG_NEON=y
CONFIG_KERNEL_MODE_NEON=y
CONFIG_CRYPTO_SHA1_ARM_NEON=m
CONFIG_CRYPTO_CHACHA20_NEON=m
# CONFIG_CRYPTO_NHPOLY1305_NEON is not set
root@multi-v7-ml:/# 

The complete kernel config is here for the curious.
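
Another quick check is what the CPU and kernel expose to userspace (and thereby to any container): vfp and neon should show up among the feature flags. The output below is abbreviated and illustrative for an i.mx6:

root@multi-v7-ml:/# grep Features /proc/cpuinfo
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32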

pts/aircrack

cortexa9-hf vs. armv7-hf

Graphs are here. No apparent changes. Funnily enough, armv7-hf seems to be slightly better.

pts/john-the-ripper

cortexa9-hf vs. armv7-hf

Graphs are here. No apparent changes. Funnily enough, armv7-hf seems to be slightly better.

pts/scimark2

cortex-a53-crypto vs. armv8a-crc-crypto

TUNE_FEATURES        = "aarch64 cortexa53 crc crypto"
TARGET_FPU           = ""

vs.

TUNE_FEATURES        = "aarch64 armv8a crc crypto"
TARGET_FPU           = ""

Graphs are here. No apparent changes.
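
As with the 32-bit tunes, the difference roughly boils down to -mcpu versus -march (a sketch, per the oe-core tune files):

# cortexa53 tune, roughly:  -mcpu=cortex-a53+crc+crypto
# armv8a tune, roughly:     -march=armv8-a+crc+crypto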

pts/coremark

cortex-a53-crypto vs. armv8a-crc-crypto

TUNE_FEATURES        = "aarch64 cortexa53 crc crypto"
TARGET_FPU           = ""

vs.

TUNE_FEATURES        = "aarch64 armv8a crc crypto"
TARGET_FPU           = ""

Graphs are here. No apparent changes. 

(No real) Conclusion

Well, in German I would say "Wer misst, misst Mist" (something like "he who measures, measures manure"), but I have yet to see a benchmark which actually shows any difference between generic and specific compiler tunes. I guess I would see something with very specific multimedia benchmarks and hand-crafted optimizations. The question remains open: why do we use those chip-specific compiler tunes at all, and where do they make sense?

Addendum

Jon Mason pointed out that different compiler tunes don't only affect performance, but also e.g. security features. He says: "Tuning for A76 versus a more generic armv8a allows for security features like branch-protection to be enabled (as it isn't supported in older versions). You get these kind of things 'by default' when tuning for the specific model." I don't have any A76 to test this on yet (which might be tricky anyway), but of course I believe Jon, and this also seems to concur with what he says.
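
For the curious, this is the kind of flag Jon is referring to. A sketch only, assuming GCC 9 or newer for aarch64 - the tune files would normally take care of this for you:

# enable return address signing and BTI on cores that support them
# (old-style append syntax, matching the era of this post)
TUNE_CCARGS_append = " -mbranch-protection=standard"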
