OpenGL ES Shading Language Potholes and Problems

Several months ago I changed my Android app Freeform Backgammon to use OpenGL ES 2.0 shaders. Since that change, I have had to fix and work around a number of platform-specific bugs with the shaders. In my experience OpenGL ES shaders are the real fragmentation pain point of OpenGL game development on Android.

OpenGL shaders are small programs that provide an incredible level of programmability and fine-grained control over the OpenGL rendering pipeline. One of the biggest downsides of shaders is the wide diversity of the compilers and infrastructure that support them. Similar to the standard Android fragmentation issue, problems arise from having open standards and a lot of competition:

  • there is a diversity of hardware implementations of the OpenGL ES 2.0 API.
  • there is a diversity of devices of different ages implementing the API.
  • there is a diversity of quality and performance targets these devices aim at.
  • the OpenGL specs often specify only a "minimum" or a range of in-specification behaviors.

This post details some specific issues I have run into trying to maintain relatively simple OpenGL ES fragment shaders across a variety of Android devices.

In my experience the problems arising from GLSL fragmentation are more subtle and annoying than those arising from standard Android fragmentation (like OS-version and display-size diversity), which have relatively well-understood band-aids.

If you don't want to read any of the details or motivations, here are my suggestions for making widely compatible fragment shaders (a minimal example shader follows the list):

  • Use the "#version 100" directive at the start of your shaders to ensure they are compatible with the maximum number of devices. This directive does (at least on my development platform) catch some extended and unsupported uses of GLSL. (This should not be news to anyone writing shaders.)
  • Avoid the preprocessor.
  • Make sure any for loops have obviously compile-time-fixed bounds.
  • Avoid inter-fragment flow-control differences; use straight-line code as much as possible.
  • Be aware of the magnitude of your floating point computations, scale them if necessary to avoid truncation.
  • Test your shaders on a wide variety of GPUs, and look at the results (compilation alone is insufficient).
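
For reference, here is a minimal sketch of a fragment shader that follows these suggestions (the uniform and varying names are placeholders of my own):

#version 100
// The GL_ES guard is the one conventional preprocessor use
// (see the precision section below).
#ifdef GL_ES
precision mediump float;
#endif

uniform vec4 u_color;     // placeholder uniform
varying vec2 v_texCoord;  // placeholder varying from the vertex shader

void main() {
    // Obvious compile-time loop bound; no uniforms in the condition.
    float accum = 0.0;
    for (int i = 0; i < 4; i++) {
        accum += 0.25;
    }
    // Branch-free selection with step()/mix() instead of an if.
    float inside = step(0.5, v_texCoord.x);
    gl_FragColor = mix(vec4(0.0), u_color, inside) * accum;
}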

Some Caveats

In my app I get some extra GLSL compiler diversity beyond just Android fragmentation because I develop with Libgdx, so my application and its shaders also run on my Windows desktop (and should work on iOS and WebGL, too!). In practice, though, I haven't run the shaders on any non-Android platform besides my personal desktop.

My shaders are not (as far as I can tell) very traditional shaders. I generally use fragment shaders to generate off-screen textures (like a nicely anti-aliased triangle or checker), and I do not use the shaders directly for "on screen" effects. This means my shaders can be a little less efficient and can try to do more work than would normally be attempted in a "real time" shader. More traditional shaders may be constrained enough by real-time requirements that they never run into these problems. (Though more likely there is an entirely new axis of performance problems that real-time shaders will run into.)

I am mostly interested in fragment shaders (aka "pixel shaders"). My vertex shaders are nearly trivial and as such I have not run into problems with them.

Lots of GLSL Compilers

While all Android devices run the "Dalvik" virtual machine to execute Java bytecode generated from (generally) Java application source, the story with OpenGL shaders is not nearly as neat or as well-defined. Shaders are included in an Android application as source code (i.e., with comments and even preprocessor macros). After the app starts up, it sends that source code to the vendor's GLSL compiler through OpenGL API calls. The GLSL compiler parses the source and compiles the shader into GPU-specific instructions. So each GPU family has a different compiler. This exposes the shader to an additional level of compatibility issues, because a source language has many more corner cases than a bytecode or otherwise compiled representation.
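
For concreteness, the hand-off looks roughly like the sketch below, using the standard android.opengl.GLES20 bindings (Libgdx drives the same GL entry points from its ShaderProgram class); the helper itself is my own illustration. Checking the compile status and info log is the only window into a vendor compiler's complaints:

import android.opengl.GLES20;

public final class ShaderUtil {
    public static int compileShader(int type, String source) {
        int shader = GLES20.glCreateShader(type); // e.g. GLES20.GL_FRAGMENT_SHADER
        GLES20.glShaderSource(shader, source);    // raw GLSL source, comments and all
        GLES20.glCompileShader(shader);           // the vendor-specific compiler runs here

        int[] status = new int[1];
        GLES20.glGetShaderiv(shader, GLES20.GL_COMPILE_STATUS, status, 0);
        if (status[0] == 0) {
            String log = GLES20.glGetShaderInfoLog(shader); // vendor-specific diagnostics
            GLES20.glDeleteShader(shader);
            throw new RuntimeException("Shader compile failed: " + log);
        }
        return shader;
    }
}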

GLSL Preprocessor Potholes

The ARM Mali-400 MP4 (at least) does not support line-continuation characters in macros. For example, this code:

#define MULTILINEMACRO \
    vec4(1.0, 0.5, 0.1, 1.0)

is technically outside the specification (see Section 3.4 of the spec), but most shader compilers will accept such macros without complaint.

Additionally, the Broadcom BCM2763 VideoCore IV (at least) GLSL compiler does not support macro arguments like this:

#define FUNCMACRO(x) vec4(x, x*0.5, x*0.1, 1.0)

Unlike line continuations, macros with parameters are within the specification, so this looks like a plain compiler bug rather than a missing extension; either way, it may not be discovered during development or testing.

In both of these cases the GLSL compiler rejects the shader at compile time, making these problems a bit more "visible". In general, I suggest developers simply avoid preprocessor macros in their shaders.
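
Both macros above have straightforward macro-free replacements; a sketch:

// A constant instead of MULTILINEMACRO.
const vec4 MULTILINE_REPLACEMENT = vec4(1.0, 0.5, 0.1, 1.0);

// A plain function instead of FUNCMACRO(x).
vec4 funcMacroReplacement(float x) {
    return vec4(x, x * 0.5, x * 0.1, 1.0);
}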

GLSL Language "Flexibility" Potholes

The Qualcomm Adreno 220 (at least in some implementations?) supports only minimal for loop controls. The GLSL specification allows (for valid performance reasons) implementations to limit for loops in a shader to a compile-time unrollable subset (see Appendix A, Section 4 of the spec). Specifically, this means the upper bound on a loop must be (on some platforms) an obvious and direct compile-time constant.

Again, many shader compilers support more complex for loop controls, so violations of this restriction may only surface when testing on the limiting hardware.

Adding to the complexity, these unsupported loop bounds (e.g., based on a uniform rather than a compile-time constant) will often just be silently ignored. No error is generated when the shader is compiled and linked. For example, a shader like this will run into problems on some systems:

uniform int max_loop_ct;

...

for (int i = 0; i < max_loop_ct; i++) {
  // Body is skipped on platforms that stay closer to spec minimums
}

In practice, when the body of the for loop is skipped, the result is simply incorrect rendering; no errors or exceptions are raised. This can be very hard to detect in testing without inspecting the actual shader output.
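
The usual workaround (assuming a sane upper limit is known when the shader is written) is to loop to a compile-time constant and cut the loop short with a break, which keeps the loop bound itself inside the Appendix A subset:

const int MAX_LOOP = 8;   // compile-time ceiling the compiler can unroll
uniform int max_loop_ct;  // runtime count; must stay <= MAX_LOOP

...

for (int i = 0; i < MAX_LOOP; i++) {
  if (i >= max_loop_ct) break; // runtime cutoff inside the unrollable loop
  // Body now runs max_loop_ct times on minimal and permissive platforms alike.
}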

GLSL SIMD Execution Potholes

GLSL shaders can accomplish so much (filling a screen 60 times per second is a lot of pixels on any platform) because they try hard to run as "SIMD" (single-instruction, multiple-data) parallel streams. Roughly, this means that a single shader runs its instructions on multiple pixels in parallel (often a batch of 4 or 16 neighboring pixels). (See Kayvon Fatahalian's SIGGRAPH 2008 presentation for a broader overview of GPU execution architectures.)

The GLSL language, however, provides a lot of ways to break SIMD execution. For example, an if branch that applies to only some pixels means the instruction streams can diverge. Generally this divergence is handled gracefully and execution simply slows down a bit. (This leads to counter-intuitive coding, where it can be more efficient to always execute instructions whose results are sometimes discarded than to add a branch that skips them.)

For example, on the Imagination PowerVR GPU, if a fragment shader executes a discard or return from within an if block, and other pixels within the batch do not take the same branch, then some of the code after the discard may get executed when it should not.

Unlike the other potholes listed, this is (as far as I can see) a bug in the PowerVR compiler or GPU. That said, it still needs to be handled and worked around just like the more "legitimate" potholes. It is in any case a good idea to avoid if-based flow control and instead use multiplication, mix, step, fract, etc. to achieve the same effect. (Though this can result in a rather different programming style than one is used to.)
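
As a concrete sketch, here is an if/else selection rewritten branch-free (inner, outer, dist, and radius are illustrative names):

// Branchy version; fragments within a batch can diverge:
//   if (dist < radius) gl_FragColor = inner;
//   else               gl_FragColor = outer;

// Branch-free version: step() returns 0.0 when dist < radius and 1.0
// otherwise, and mix() selects between the two always-computed values.
float t = step(radius, dist);
gl_FragColor = mix(inner, outer, t);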

GLSL Floating-point Precision Potholes

Mobile OpenGL hardware supports varying levels of "precision" for the floating point values used during a computation. The lowest precision (lowp) is defined as "intended to represent all color values for any color channel" (Section 4.5.2 of the spec). In practice this means it need only hold about 256 distinct values. This is fine for color channels, but when doing more involved calculations (for example, computing the dot product of screen-space vectors in a fragment shader) it becomes very easy to lose so much precision that the calculations become useless.

The generally understood "fix" is to add:

#ifdef GL_ES
precision mediump float;
#endif

at the top of a fragment shader to make it clear you want better precision. But the actual precision you get is a function of the device you are running on. The specification requires a minimum level of support, but that minimum can be quite a bit lower than what a modern or performance-oriented device actually provides.

In my specific case, I was computing products of dot products of vectors defined in "screen space". Since the magnitudes of those screen-space vectors could be in the mid-hundreds, the products of dot products (effectively four self-multiplications) could produce intermediate values in the billions, beyond the magnitude representable in some GPUs' mediump floating-point registers (the spec guarantees mediump a range of only about ±2^14). The fix was simply to scale the vectors into single-digit range, so the values lost a little low-order precision on weaker platforms but no longer got truncated by their sheer magnitude.
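
A sketch of that fix (the corner uniforms and the scale factor are illustrative; the point is to divide magnitudes down before multiplying them together):

uniform vec2 corner0; // screen-space points, magnitudes in the hundreds
uniform vec2 corner1;

...

// Scale into single-digit range first, so the intermediate products
// stay well inside even a minimal mediump range.
vec2 a = (gl_FragCoord.xy - corner0) / 512.0;
vec2 b = (gl_FragCoord.xy - corner1) / 512.0;

// Unscaled, dot(a, a) * dot(b, b) could reach 1e8 and beyond.
float d = dot(a, a) * dot(b, b);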

Again, there are no hard errors in these cases. The math simply doesn't do what is expected and the output generated by the shader can be undefined (in my experience the shader just didn't output any pixels -- I'm not sure if the pixels were discarded or my algorithm just ended up in the "fully transparent" case for more pixels than it should normally have).

Summary

I am sure there are additional potholes, and aspects of GLSL that I am currently misusing, that will not become apparent until the right combination of platform and use case comes along.

To combat this uncertainty, testing shaders on a variety of GPU hardware seems to be the best recourse. However, as a solo Android developer, I do not have direct access to a wide variety of GPU hardware. I've been using the Apkudo service to test small shader applications, but given that all the tests run remotely, discovering silently mis-rendered artifacts has been difficult.

If you have other tools or suggestions for dealing with GLSL fragmentation, or if you have your own examples of fragmentation issues, please let me know and I'll include them here.

While not directly useful for combating this problem, the GLSL Sandbox is an invaluable tool for developing shaders.
