Thursday, December 20, 2018

LCECBF: Linear Cost Exact Circular Bokeh Filter

Depth-of-field has always been a costly post-process for video games. In particular, the circle-of-confusion (bokeh) filter tends to be a bottleneck. While deceptively simple (all weights are either zero or a constant value), it is non-separable: unlike a box or Gaussian filter, you cannot split it into two O(r) passes (where r is the filter radius) and instead need O(r * r) samples per pixel to get accurate results.

There have been many efforts to solve this issue. I apologize for being too lazy to give proper bibliographical references, so I will simply list the approaches:
  • NFS approach: works only for polygonal (e.g., hexagonal) bokeh - run a horizontal pass and then two diagonal passes, and combine them with a max filter (has some artifacts)
  • Crytek's 2-pass approach with a rotating kernel and flood fill
  • the recent Fourier-series approach from EA (published at GDC18)
While practical, these approaches still lack the following qualities:
  • being accurate
  • requiring only one pass/no additional render targets
The approach I want to present here can apply a circular (or, basically, any convex) bokeh filter to an image:
  • with a number of samples that is O(r), where r is the filter radius
  • matching the ground truth up to floating-point error
  • in a single pass
  • without requiring any additional memory allocation
The proposed method builds on an idea that was hinted to me about 10 years ago by my supervisor, Alexey Ignatenko, so here is a shout-out to him. The key idea is that once you've computed a convolution with a constant-value filter kernel for one point, you can reuse most of it for the neighboring point:
If we compute the convolution for point A, we can reuse most of it (the blue part) for point B. This is due to the fact that the weights are all the same - an intrinsic property of the bokeh filter kernel. Now, to compute the convolution at B, all we have to do is take A's convolution, add all the pixels in green, and subtract all the pixels in pink.

How many pixels would this be? Perimeter-many :) For a circle (and regular polygons) this is a linear function of their radius.
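Concretely, for the radius-10 kernel in the code below, the full disc covers 317 texels, while its trailing and leading edges contain only 21 texels each - so after the first pixel, every subsequent pixel costs 42 fetches instead of 317.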

How does this translate to shader code? Effectively, we compute the full convolution for just one pixel and then propagate it to its neighbors at linear cost. Naturally, that requires each compute shader thread to output more than one pixel.

Here is the shader code (I apologize for hardcoded constants and other less-than-ideal things):

Texture2D Input : register( t0 );
RWTexture2D<float4> Result : register( u0 );


[numthreads( 1, 32, 1 )]
void CSMain( uint3 Gid : SV_GroupID, uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID )
{
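    // All integer offsets (x, y) with x*x + y*y <= 100 - a radius-10 disc of 317 texels inside a 21x21 footprint.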
    const int nTotalSamples = 317;
    const int2 vCircleSamples[317] =
    {
        int2(-10, 0),     int2(-9, -4),     int2(-9, -3),     int2(-9, -2),     int2(-9, -1),     int2(-9, 0),     int2(-9, 1),     int2(-9, 2),     int2(-9, 3),     int2(-9, 4),
        int2(-8, -6),     int2(-8, -5),     int2(-8, -4),     int2(-8, -3),     int2(-8, -2),     int2(-8, -1),     int2(-8, 0),     int2(-8, 1),     int2(-8, 2),     int2(-8, 3),
        int2(-8, 4),     int2(-8, 5),     int2(-8, 6),     int2(-7, -7),     int2(-7, -6),     int2(-7, -5),     int2(-7, -4),     int2(-7, -3),     int2(-7, -2),     int2(-7, -1),
        int2(-7, 0),     int2(-7, 1),     int2(-7, 2),     int2(-7, 3),     int2(-7, 4),     int2(-7, 5),     int2(-7, 6),     int2(-7, 7),     int2(-6, -8),     int2(-6, -7),
        int2(-6, -6),     int2(-6, -5),     int2(-6, -4),     int2(-6, -3),     int2(-6, -2),     int2(-6, -1),     int2(-6, 0),     int2(-6, 1),     int2(-6, 2),     int2(-6, 3),
        int2(-6, 4),     int2(-6, 5),     int2(-6, 6),     int2(-6, 7),     int2(-6, 8),     int2(-5, -8),     int2(-5, -7),     int2(-5, -6),     int2(-5, -5),     int2(-5, -4),
        int2(-5, -3),     int2(-5, -2),     int2(-5, -1),     int2(-5, 0),     int2(-5, 1),     int2(-5, 2),     int2(-5, 3),     int2(-5, 4),     int2(-5, 5),     int2(-5, 6),
        int2(-5, 7),     int2(-5, 8),     int2(-4, -9),     int2(-4, -8),     int2(-4, -7),     int2(-4, -6),     int2(-4, -5),     int2(-4, -4),     int2(-4, -3),     int2(-4, -2),
        int2(-4, -1),     int2(-4, 0),     int2(-4, 1),     int2(-4, 2),     int2(-4, 3),     int2(-4, 4),     int2(-4, 5),     int2(-4, 6),     int2(-4, 7),     int2(-4, 8),
        int2(-4, 9),     int2(-3, -9),     int2(-3, -8),     int2(-3, -7),     int2(-3, -6),     int2(-3, -5),     int2(-3, -4),     int2(-3, -3),     int2(-3, -2),     int2(-3, -1),
        int2(-3, 0),     int2(-3, 1),     int2(-3, 2),     int2(-3, 3),     int2(-3, 4),     int2(-3, 5),     int2(-3, 6),     int2(-3, 7),     int2(-3, 8),     int2(-3, 9),
        int2(-2, -9),     int2(-2, -8),     int2(-2, -7),     int2(-2, -6),     int2(-2, -5),     int2(-2, -4),     int2(-2, -3),     int2(-2, -2),     int2(-2, -1),     int2(-2, 0),
        int2(-2, 1),     int2(-2, 2),     int2(-2, 3),     int2(-2, 4),     int2(-2, 5),     int2(-2, 6),     int2(-2, 7),     int2(-2, 8),     int2(-2, 9),     int2(-1, -9),
        int2(-1, -8),     int2(-1, -7),     int2(-1, -6),     int2(-1, -5),     int2(-1, -4),     int2(-1, -3),     int2(-1, -2),     int2(-1, -1),     int2(-1, 0),     int2(-1, 1),
        int2(-1, 2),     int2(-1, 3),     int2(-1, 4),     int2(-1, 5),     int2(-1, 6),     int2(-1, 7),     int2(-1, 8),     int2(-1, 9),     int2(0, -10),     int2(0, -9),
        int2(0, -8),     int2(0, -7),     int2(0, -6),     int2(0, -5),     int2(0, -4),     int2(0, -3),     int2(0, -2),     int2(0, -1),     int2(0, 0),     int2(0, 1),
        int2(0, 2),     int2(0, 3),     int2(0, 4),     int2(0, 5),     int2(0, 6),     int2(0, 7),     int2(0, 8),     int2(0, 9),     int2(0, 10),     int2(1, -9),
        int2(1, -8),     int2(1, -7),     int2(1, -6),     int2(1, -5),     int2(1, -4),     int2(1, -3),     int2(1, -2),     int2(1, -1),     int2(1, 0),     int2(1, 1),
        int2(1, 2),     int2(1, 3),     int2(1, 4),     int2(1, 5),     int2(1, 6),     int2(1, 7),     int2(1, 8),     int2(1, 9),     int2(2, -9),     int2(2, -8),
        int2(2, -7),     int2(2, -6),     int2(2, -5),     int2(2, -4),     int2(2, -3),     int2(2, -2),     int2(2, -1),     int2(2, 0),     int2(2, 1),     int2(2, 2),
        int2(2, 3),     int2(2, 4),     int2(2, 5),     int2(2, 6),     int2(2, 7),     int2(2, 8),     int2(2, 9),     int2(3, -9),     int2(3, -8),     int2(3, -7),
        int2(3, -6),     int2(3, -5),     int2(3, -4),     int2(3, -3),     int2(3, -2),     int2(3, -1),     int2(3, 0),     int2(3, 1),     int2(3, 2),     int2(3, 3),
        int2(3, 4),     int2(3, 5),     int2(3, 6),     int2(3, 7),     int2(3, 8),     int2(3, 9),     int2(4, -9),     int2(4, -8),     int2(4, -7),     int2(4, -6),
        int2(4, -5),     int2(4, -4),     int2(4, -3),     int2(4, -2),     int2(4, -1),     int2(4, 0),     int2(4, 1),     int2(4, 2),     int2(4, 3),     int2(4, 4),
        int2(4, 5),     int2(4, 6),     int2(4, 7),     int2(4, 8),     int2(4, 9),     int2(5, -8),     int2(5, -7),     int2(5, -6),     int2(5, -5),     int2(5, -4),
        int2(5, -3),     int2(5, -2),     int2(5, -1),     int2(5, 0),     int2(5, 1),     int2(5, 2),     int2(5, 3),     int2(5, 4),     int2(5, 5),     int2(5, 6),
        int2(5, 7),     int2(5, 8),     int2(6, -8),     int2(6, -7),     int2(6, -6),     int2(6, -5),     int2(6, -4),     int2(6, -3),     int2(6, -2),     int2(6, -1),
        int2(6, 0),     int2(6, 1),     int2(6, 2),     int2(6, 3),     int2(6, 4),     int2(6, 5),     int2(6, 6),     int2(6, 7),     int2(6, 8),     int2(7, -7),
        int2(7, -6),     int2(7, -5),     int2(7, -4),     int2(7, -3),     int2(7, -2),     int2(7, -1),     int2(7, 0),     int2(7, 1),     int2(7, 2),     int2(7, 3),
        int2(7, 4),     int2(7, 5),     int2(7, 6),     int2(7, 7),     int2(8, -6),     int2(8, -5),     int2(8, -4),     int2(8, -3),     int2(8, -2),     int2(8, -1),
        int2(8, 0),     int2(8, 1),     int2(8, 2),     int2(8, 3),     int2(8, 4),     int2(8, 5),     int2(8, 6),     int2(9, -4),     int2(9, -3),     int2(9, -2),
        int2(9, -1),     int2(9, 0),     int2(9, 1),     int2(9, 2),     int2(9, 3),     int2(9, 4),     int2(10, 0)
    };
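
    // The two border tables below describe how the disc changes when it shifts one pixel to the right:
    // vCircleSamplesNeg holds offsets (relative to the new pixel) that leave the disc and get subtracted,
    // vCircleSamplesPos holds offsets that newly enter the disc and get added.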

    const int2 vCircleSamplesNeg[] =
    {
        int2(-11, 0),     int2(-10, -4),     int2(-10, -3),     int2(-10, -2),     int2(-10, -1),     int2(-10, 1),     int2(-10, 2),     int2(-10, 3),     int2(-10, 4),     int2(-9, -6),
        int2(-9, -5),     int2(-9, 5),     int2(-9, 6),     int2(-8, -7),     int2(-8, 7),     int2(-7, -8),     int2(-7, 8),     int2(-5, -9),     int2(-5, 9),     int2(-1, -10),
        int2(-1, 10)
    };
    const int totalSamplesBorderNeg = 21;

    const int2 vCircleSamplesPos[] =
    {
        int2(0, -10),     int2(0, 10),     int2(4, -9),     int2(4, 9),     int2(6, -8),     int2(6, 8),     int2(7, -7),     int2(7, 7),     int2(8, -6),     int2(8, -5),
        int2(8, 5),     int2(8, 6),     int2(9, -4),     int2(9, -3),     int2(9, -2),     int2(9, -1),     int2(9, 1),     int2(9, 2),     int2(9, 3),     int2(9, 4),
        int2(10, 0)
    };
    const int totalSamplesBorderPos = 21;
    

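    // First pixel of this thread's 64-pixel span: do the full O(r*r) convolution (317 taps).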
    float4 res = 0;
    int2 coord = int2(DTid.x * 64, DTid.y);
    [loop]
    for (int s = 0; s < nTotalSamples; ++s)
    {
        res += Input[coord + vCircleSamples[s]];
    }
    res /= float(nTotalSamples);
    Result[coord] = res;
    float4 prevRes = res;

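    // Remaining 63 pixels of the span: update the running sum incrementally - subtract the trailing edge, add the leading edge (O(r) taps per pixel).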
    [loop]
    for (int i = 1; i < 64; ++i)
    {
        res = 0;
        coord = int2(DTid.x * 64 + i, DTid.y);
        [loop]
        for (int s = 0; s < totalSamplesBorderNeg; ++s)
        {
            res -= Input[coord + vCircleSamplesNeg[s]];
        }
        [loop]
        for (int s = 0; s < totalSamplesBorderPos; ++s)
        {
            res += Input[coord + vCircleSamplesPos[s]];
        }

        res /= float(nTotalSamples);
        res += prevRes;
        Result[coord] = res;
        prevRes = res;
    }   
}


And here is the code for the ground truth (all samples) bokeh computation:
float4 res = 0;
int2 coord = int2(DTid.x, DTid.y);
[loop]
for (int s = 0; s < nTotalSamples; ++s)
{
    res += Input[coord + vCircleSamples[s]];
}

res /= float(nTotalSamples);
Result[coord] = res;


Here are the dispatch calls, just in case (in the proposed method, each thread walks a 64-pixel span of its row, hence the (width + 63) / 64 groups along x):
pd3dImmediateContext->Dispatch((width + 63) / 64, (height + 31) / 32, 1); // Proposed method
pd3dImmediateContext->Dispatch(width, (height + 31) / 32, 1); // Ground truth


Performance (I didn't do any thorough optimization, e.g., going half-res or making the fetches more cache-friendly); the numbers are for a 21x21 filter, Full HD, GeForce 1060, RGBA16_FLOAT src/dst:
Ground Truth - 5ms
Proposed Method - 1.7ms

Visual results (I apologize for not handling the borders properly - again, kinda lazy):

UPD: closeup for smartphone users :)

Original:

Ground Truth Filter:

Proposed Method Filter:

Obviously, there is still a lot of optimization and polishing to be done to make it production-ready, but I think the concept is original and worth digging into (and I like how it exploits the constant-weight property of the bokeh filter kernel).

Please let me know (in the comments, on Twitter, via private messages, etc.) if you want me to polish and publish a demo. If enough people are interested, I will try to find time for that :)

Friday, October 26, 2018

MinLod: A Cheap&Simple Method to Increase Texture Details at Acute Angles


UPD: a simpler and faster solution using textureGrad instead of textureLod has been uploaded (same ShaderToy link), thanks to Sergey Makeev's constructive feedback.

I haven't written here for over 3 years, and I think it's time to fix this :)

So, let's start with the problem definition. Without further ado, consider the following image (which I've stolen from somewhere on the internets):




So, ultimately, we want to get as close as possible to anisotropic filtering, but it's usually too expensive. Point filtering is way too aliased, while classic mipmapping is too blurry in the distance.

But why is mipmapping so blurry in this case, and what could be done about it (besides expensive anisotropic filtering)? Well, let's consult the OpenGL spec to figure out why:

float
mip_map_level(in vec2 texture_coordinate)
{
    // The OpenGL Graphics System: A Specification 4.2
    //  - chapter 3.9.11, equation 3.21
 
    vec2  dx_vtc        = dFdx(texture_coordinate);
    vec2  dy_vtc        = dFdy(texture_coordinate);
    float delta_max_sqr = max(dot(dx_vtc, dx_vtc), dot(dy_vtc, dy_vtc));
 
    //return max(0.0, 0.5 * log2(delta_max_sqr) - 1.0);
    return 0.5 * log2(delta_max_sqr);
}

Effectively, we're taking the texture coordinate gradients in screen space (this is, btw, the reason the rasterizer spits out 2x2 pixel quads, and why you should avoid subpixel triangles) and then taking the max of the gradient lengths in the x and y directions. Then we convert that metric distance to a mip level by taking a logarithm (with a nice optimization: since the code works with squared lengths, it multiplies the logarithm by 0.5 instead of taking the square root of its argument).

Is it good enough? For most cases, yes. However, it starts failing in our case - when the gradient in one direction (y) is significantly larger than in the other (x). That results in selecting a higher mip level (since we take the max), hence the blurrier result.
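To put some numbers on it (assuming texture coordinates scaled to texel units): at a grazing angle we might get dx_vtc ≈ (1, 0) and dy_vtc ≈ (0, 16), so delta_max_sqr = 256 and the formula above selects mip 0.5 * log2(256) = 4 - even though along x one screen pixel still covers just a single texel, so all that horizontal detail gets thrown away.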

What do you do if you want to tolerate a bit of noise (e.g., you are planning to get rid of it later with temporal anti-aliasing) for a bit more detail? What I've seen in practice is either introducing a mip bias, clipping the mip pyramid (i.e., having only 2-4 mip levels instead of a full mipmap chain), or, most radically, just forcing the most detailed mip (i.e., 0).

I propose a more precise, better-looking, and elegant solution. All we need to do is replace the max operator with the min operator when choosing between the x and y gradients.
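
To make it concrete, here is a minimal GLSL sketch of the idea (the wrapper function minLodSample and the texel-unit convention are my own illustration, not the exact ShaderToy code; as the UPD says, the updated demo uses textureGrad instead of textureLod):

float min_mip_map_level(in vec2 texture_coordinate) // coordinate scaled to texel units
{
    vec2  dx_vtc        = dFdx(texture_coordinate);
    vec2  dy_vtc        = dFdy(texture_coordinate);
    // min instead of the spec's max: follow the axis with the smaller footprint
    float delta_min_sqr = min(dot(dx_vtc, dx_vtc), dot(dy_vtc, dy_vtc));
    return 0.5 * log2(delta_min_sqr);
}

vec4 minLodSample(sampler2D tex, vec2 uv)
{
    vec2  size = vec2(textureSize(tex, 0));
    float lod  = max(0.0, min_mip_map_level(uv * size));
    return textureLod(tex, uv, lod);
}

A textureGrad-based variant would presumably feed the shorter of the two gradients to both derivative arguments, but that is my guess at how the updated demo works rather than its actual code.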

Here's a quick&dirty ShaderToy demo I've made to illustrate the concept. The demo basically cross-fades between max (default) and min (proposed) mip selection. You can also try forcing mip 0 or adding mip bias to compare or other cool hacks you have up your sleeve ;)

Let me know in comments what you think!