Monday, July 6, 2015

The Myth of the Free ALUs

One of the myths which arised with the new generation of consoles was that arithmetic operations (aka ALUs) were from now on literally free. If long ago we used cube map LUTs to normalize vectors, on GCN it costs nothing. Sure, they do have some cost, but with 6-7 ALU cycles per byte read from VRAM recommended at the top memory bandwidth (which you even almost never get), they should be all hidden beyond the memory fetches latency.

Well, it turns out, they are not. And here are some of the reasons why.

Some ALUs are more expensive than others. 

If an extra mad would unlikely (I won quite significantly one day by optimizing out a redundant matrix multiplication from a tight loop) make a change, divisions or even trigonometry would. Even in an SSAO shader could benefit up to 13% performance increase when arithmetic is optimized.

There are no free lunches now. 

Everything you could have relied hardware to do before (cube map texcoord calculation, attribute interpolation etc.) is now done with ALUs. You can use nointerpolation to avoid, if needed.

The scalar/vector ALUs are unbalanced.

In case your shader utilizes too many SALUs, not only you may become scalar register bound, but also your VALU units might stall while waiting for a scalar op result. An example here is excessive lane swizzling: when I tried to use it as a "cheap shared memory replacement" for blur filter, I got a ridiculous SALU/VALU ratio.