Peter Sikachev Dev Blog

Thursday, March 13, 2025

Why Nanite Feels so Good or the Sad Tale of Unreal Engine Instancing

Nanite is Great... Comparatively Speaking

Let's start with the hot take: Nanite feels good because the vanilla UE instancing has been less than stellar in more than one place. With Nanite, we can instance many meshlets at once, we can do fine culling for each of them, but hasn't it been the case before. Let's find out.

Instancing or LODs: Choose One

They hate each other. Let's say you have mesh LODs, for instance. Unless you group them in an HLOD together, each mesh component will be drawn as a separate draw call despite sharing the same geometry and material. If you group them in the same HLOD, they will benefit from instancing... at the cost of switching LODs altogether. Imagine a patch of the forest where trees just instantaneously change LOD all at once.

Occlusion-not-Culling

I've been spoiled by working with custom engines. There, you'd run your occlusion culling on the GPU, find out which instances are visible, and then compact the instance buffer and issue an indirect instanced draw call (or not, with GPU-driven rendering and task graphs).

That is not the Unreal way, however. In Unreal, you either render the whole instance buffer or not render it at all. You test the bounding box around all the instances. You don't even need to see one leaf of one tree to render all the instances; you may just see a portion of a bounding box containing empty space between two instances to render them all.

The Mandatory Prepass

Now, one unexpected corollary of this is that you almost always render everything in pre-pass. But what about the bUseAsOccluder checkbox, say you? The problem is multiple things can override it, so you may almost never have it actually working. It's a known issue, but then again, I don't expect it to ever work correctly on the instance level (or component, for that sake) because that would mean having different instance buffers for the pre-pass and the main pass.

But Won't We Do Everything Nanite, Anyway?

So, why bother? In my opinion, there are certain geometry types that don't 100% optimally fit Nanite. Any foliage or other heavily alpha-tested geometry comes to mind. Besides, in case of a very expensive vertex shader, one might not want to render geometry twice. In any case, it's good to have options — and ones that instance well.

Thursday, December 20, 2018

LCECBF: Linear Cost Exact Circular Bokeh Filter

Depth-of-field has always been a costly post process for video games. Particularly, a circle-of-confusion filter proved to be a bottleneck. While being deceptively simple (all weights are either zero or constant values), it is non-separable: it means that, unlike box or Gaussian filter, one cannot run a two-pass O(r) filter (where r is filter radius), and rather needs to run O(r * r) pass, in order to get accurate results.

There have been a lot of efforts to solve this issue. I apologize for being lazy to make a proper bibliographical reference, so I will simply list the approaches:

NFS approach: works only for polygonal (e.g., hexagonal) bokeh - make a horizontal and then 2 diagonal passes, combine with a max filter (has some artifacts)
Crytek 2-pass approach with rotating kernel and flood fill
recent Fourier-series approach from EA (published at GDC18)

While being practical, these approaches still lack the following qualities:

being accurate
requiring only one pass/no additional render targets

The approach I want to present here can apply a circular (or, basically, of any convex shape) bokeh filter to an image:

with number of samples which is O(r), where r is filter radius
matching ground truth up to floating point error
single pass
not requiring any additional memory allocation

The proposed method utilizes an idea that was hinted to me about 10 years by my supervisor, Alexey Ignatenko, so here is a shout out to him. The key idea is that once you've computed a convolution with a constant-value filter kernel for one point, you can reuse most of it for the neighbor point:

If we compute convolution for the point A, effectively, we can reuse most of it (blue part) for the point B. This is due to fact that the weights are the same - the intrinsic property of the bokeh filter kernel. Now, to compute the convolution at B, all we have to do is take the A's convolution, add all pixels in green and subtract all pixels in pink.

How many pixels would this be? Perimeter-many :) For a circle (and regular polygons) this is a linear function of their radius.

How does this translate to shader code? Well, effectively, we can compute a full convolution just for one pixel, and then propagate it to neighbors at linear cost. That would require compute shader threads to output more than pixel, naturally.

Here is the shader code (I apologize for hardcoded constants and other less-than-ideal things):

Texture2D Input : register( t0 );
RWTexture2D<float4> Result : register( u0 );

[numthreads( 1, 32, 1 )]
void CSMain( uint3 Gid : SV_GroupID, uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID )
{
    const int nTotalSamples = 317;
    const int2 vCircleSamples[317] =
    {
        int2(-10, 0),     int2(-9, -4),     int2(-9, -3),     int2(-9, -2),     int2(-9, -1),     int2(-9, 0),     int2(-9, 1),     int2(-9, 2),     int2(-9, 3),     int2(-9, 4),
        int2(-8, -6),     int2(-8, -5),     int2(-8, -4),     int2(-8, -3),     int2(-8, -2),     int2(-8, -1),     int2(-8, 0),     int2(-8, 1),     int2(-8, 2),     int2(-8, 3),
        int2(-8, 4),     int2(-8, 5),     int2(-8, 6),     int2(-7, -7),     int2(-7, -6),     int2(-7, -5),     int2(-7, -4),     int2(-7, -3),     int2(-7, -2),     int2(-7, -1),
        int2(-7, 0),     int2(-7, 1),     int2(-7, 2),     int2(-7, 3),     int2(-7, 4),     int2(-7, 5),     int2(-7, 6),     int2(-7, 7),     int2(-6, -8),     int2(-6, -7),
        int2(-6, -6),     int2(-6, -5),     int2(-6, -4),     int2(-6, -3),     int2(-6, -2),     int2(-6, -1),     int2(-6, 0),     int2(-6, 1),     int2(-6, 2),     int2(-6, 3),
        int2(-6, 4),     int2(-6, 5),     int2(-6, 6),     int2(-6, 7),     int2(-6, 8),     int2(-5, -8),     int2(-5, -7),     int2(-5, -6),     int2(-5, -5),     int2(-5, -4),
        int2(-5, -3),     int2(-5, -2),     int2(-5, -1),     int2(-5, 0),     int2(-5, 1),     int2(-5, 2),     int2(-5, 3),     int2(-5, 4),     int2(-5, 5),     int2(-5, 6),
        int2(-5, 7),     int2(-5, 8),     int2(-4, -9),     int2(-4, -8),     int2(-4, -7),     int2(-4, -6),     int2(-4, -5),     int2(-4, -4),     int2(-4, -3),     int2(-4, -2),
        int2(-4, -1),     int2(-4, 0),     int2(-4, 1),     int2(-4, 2),     int2(-4, 3),     int2(-4, 4),     int2(-4, 5),     int2(-4, 6),     int2(-4, 7),     int2(-4, 8),
        int2(-4, 9),     int2(-3, -9),     int2(-3, -8),     int2(-3, -7),     int2(-3, -6),     int2(-3, -5),     int2(-3, -4),     int2(-3, -3),     int2(-3, -2),     int2(-3, -1),
        int2(-3, 0),     int2(-3, 1),     int2(-3, 2),     int2(-3, 3),     int2(-3, 4),     int2(-3, 5),     int2(-3, 6),     int2(-3, 7),     int2(-3, 8),     int2(-3, 9),
        int2(-2, -9),     int2(-2, -8),     int2(-2, -7),     int2(-2, -6),     int2(-2, -5),     int2(-2, -4),     int2(-2, -3),     int2(-2, -2),     int2(-2, -1),     int2(-2, 0),
        int2(-2, 1),     int2(-2, 2),     int2(-2, 3),     int2(-2, 4),     int2(-2, 5),     int2(-2, 6),     int2(-2, 7),     int2(-2, 8),     int2(-2, 9),     int2(-1, -9),
        int2(-1, -8),     int2(-1, -7),     int2(-1, -6),     int2(-1, -5),     int2(-1, -4),     int2(-1, -3),     int2(-1, -2),     int2(-1, -1),     int2(-1, 0),     int2(-1, 1),
        int2(-1, 2),     int2(-1, 3),     int2(-1, 4),     int2(-1, 5),     int2(-1, 6),     int2(-1, 7),     int2(-1, 8),     int2(-1, 9),     int2(0, -10),     int2(0, -9),
        int2(0, -8),     int2(0, -7),     int2(0, -6),     int2(0, -5),     int2(0, -4),     int2(0, -3),     int2(0, -2),     int2(0, -1),     int2(0, 0),     int2(0, 1),
        int2(0, 2),     int2(0, 3),     int2(0, 4),     int2(0, 5),     int2(0, 6),     int2(0, 7),     int2(0, 8),     int2(0, 9),     int2(0, 10),     int2(1, -9),
        int2(1, -8),     int2(1, -7),     int2(1, -6),     int2(1, -5),     int2(1, -4),     int2(1, -3),     int2(1, -2),     int2(1, -1),     int2(1, 0),     int2(1, 1),
        int2(1, 2),     int2(1, 3),     int2(1, 4),     int2(1, 5),     int2(1, 6),     int2(1, 7),     int2(1, 8),     int2(1, 9),     int2(2, -9),     int2(2, -8),
        int2(2, -7),     int2(2, -6),     int2(2, -5),     int2(2, -4),     int2(2, -3),     int2(2, -2),     int2(2, -1),     int2(2, 0),     int2(2, 1),     int2(2, 2),
        int2(2, 3),     int2(2, 4),     int2(2, 5),     int2(2, 6),     int2(2, 7),     int2(2, 8),     int2(2, 9),     int2(3, -9),     int2(3, -8),     int2(3, -7),
        int2(3, -6),     int2(3, -5),     int2(3, -4),     int2(3, -3),     int2(3, -2),     int2(3, -1),     int2(3, 0),     int2(3, 1),     int2(3, 2),     int2(3, 3),
        int2(3, 4),     int2(3, 5),     int2(3, 6),     int2(3, 7),     int2(3, 8),     int2(3, 9),     int2(4, -9),     int2(4, -8),     int2(4, -7),     int2(4, -6),
        int2(4, -5),     int2(4, -4),     int2(4, -3),     int2(4, -2),     int2(4, -1),     int2(4, 0),     int2(4, 1),     int2(4, 2),     int2(4, 3),     int2(4, 4),
        int2(4, 5),     int2(4, 6),     int2(4, 7),     int2(4, 8),     int2(4, 9),     int2(5, -8),     int2(5, -7),     int2(5, -6),     int2(5, -5),     int2(5, -4),
        int2(5, -3),     int2(5, -2),     int2(5, -1),     int2(5, 0),     int2(5, 1),     int2(5, 2),     int2(5, 3),     int2(5, 4),     int2(5, 5),     int2(5, 6),
        int2(5, 7),     int2(5, 8),     int2(6, -8),     int2(6, -7),     int2(6, -6),     int2(6, -5),     int2(6, -4),     int2(6, -3),     int2(6, -2),     int2(6, -1),
        int2(6, 0),     int2(6, 1),     int2(6, 2),     int2(6, 3),     int2(6, 4),     int2(6, 5),     int2(6, 6),     int2(6, 7),     int2(6, 8),     int2(7, -7),
        int2(7, -6),     int2(7, -5),     int2(7, -4),     int2(7, -3),     int2(7, -2),     int2(7, -1),     int2(7, 0),     int2(7, 1),     int2(7, 2),     int2(7, 3),
        int2(7, 4),     int2(7, 5),     int2(7, 6),     int2(7, 7),     int2(8, -6),     int2(8, -5),     int2(8, -4),     int2(8, -3),     int2(8, -2),     int2(8, -1),
        int2(8, 0),     int2(8, 1),     int2(8, 2),     int2(8, 3),     int2(8, 4),     int2(8, 5),     int2(8, 6),     int2(9, -4),     int2(9, -3),     int2(9, -2),
        int2(9, -1),     int2(9, 0),     int2(9, 1),     int2(9, 2),     int2(9, 3),     int2(9, 4),     int2(10, 0)
    };

    const int2 vCircleSamplesNeg[] =
    {
        int2(-11, 0),     int2(-10, -4),     int2(-10, -3),     int2(-10, -2),     int2(-10, -1),     int2(-10, 1),     int2(-10, 2),     int2(-10, 3),     int2(-10, 4),     int2(-9, -6),
        int2(-9, -5),     int2(-9, 5),     int2(-9, 6),     int2(-8, -7),     int2(-8, 7),     int2(-7, -8),     int2(-7, 8),     int2(-5, -9),     int2(-5, 9),     int2(-1, -10),
        int2(-1, 10)
    };
    const int totalSamplesBorderNeg = 21;

    const int2 vCircleSamplesPos[] =
    {
        int2(0, -10),     int2(0, 10),     int2(4, -9),     int2(4, 9),     int2(6, -8),     int2(6, 8),     int2(7, -7),     int2(7, 7),     int2(8, -6),     int2(8, -5),
        int2(8, 5),     int2(8, 6),     int2(9, -4),     int2(9, -3),     int2(9, -2),     int2(9, -1),     int2(9, 1),     int2(9, 2),     int2(9, 3),     int2(9, 4),
        int2(10, 0)
    };
    const int totalSamplesBorderPos = 21;

    float4 res = 0;
    int2 coord = int2(DTid.x * 64, DTid.y);
    [loop]
    for (int s = 0; s < nTotalSamples; ++s)
    {
        res += Input[coord + vCircleSamples[s]];
    }
    res /= float(nTotalSamples);
    Result[coord] = res;
    float4 prevRes = res;

    [loop]
    for (int i = 1; i < 64; ++i)
    {
        res = 0;
        coord = int2(DTid.x * 64 + i, DTid.y);
        [loop]
        for (int s = 0; s < totalSamplesBorderNeg; ++s)
        {
            res -= Input[coord + vCircleSamplesNeg[s]];
        }
        [loop]
        for (int s = 0; s < totalSamplesBorderPos; ++s)
        {
            res += Input[coord + vCircleSamplesPos[s]];
        }

        res /= float(nTotalSamples);
        res += prevRes;
        Result[coord] = res;
        prevRes = res;
    }
}

And here is the code for the ground truth (all samples) bokeh computation:
float4 res = 0;
int2 coord = int2(DTid.x, DTid.y);
[loop]
for (int s = 0; s < nTotalSamples; ++s)
{
    res += Input[coord + vCircleSamples[s]];
}

res /= float(nTotalSamples);
Result[coord] = res;

Here are the dispatch calls, just in case:
pd3dImmediateContext->Dispatch((width + 63) / 64, (height + 31) / 32, 1); // Proposed method
pd3dImmediateContext->Dispatch(width, (height + 31) / 32, 1); // Ground truth

Performance (I didn't do any thorough optimization, e.g., half-res, making fetches more cache-friendly, etc), numbers for 21x21 filter, Full HD, GeForce 1060, RGBA16_FLOAT src/dst:
Ground Truth - 5ms
Proposed Method - 1.7ms

Visual results (apologize for not handling borders properly, again kinda lazy):

UPD: closeup for smartphone users :)

Original:

Ground Truth Filter:

Proposed Method Filter:

Obviously, there is yet a lot of optimization and polishing to be done to make it production-ready, but I think the concept is original and worth digging (and I like how it exploits bokeh filter kernel constant weight property).

Please, let me know (in the comments, Twitter, private messages etc) if you want me to polish&publish a demo. If there are enough people interested, I will try to find time for that :)

Friday, October 26, 2018

MinLod: A Cheap&Simple Method to Increase Texture Details at Acute Angles

UPD: a simpler and faster solution utilizing textureGrad instead of textureLod is uploaded (same ShaderToy link) thanks to Sergey Makeev's constructive feedback.

I haven't written here for over 3 years, and I think it's time to fix this :)

So, let's start with the problem definition. Without further ado, consider the following image (that I've stolen somewhere from the internets):

So, ultimately, we want to get as close as possible to the anisotropic filtering, but it's usually too expensive. Point filtering is way too aliased, while the classic mipmapping is too blurry in the distance.

But why mipmapping is too blurry in this case and what could be done (except expensive anisotropic filtering about it)? Well, let's consult the OpenGL Spec in order to figure out, why:

float
mip_map_level(in vec2 texture_coordinate)
{
    // The OpenGL Graphics System: A Specification 4.2
    //  - chapter 3.9.11, equation 3.21
 
    vec2  dx_vtc        = dFdx(texture_coordinate);
    vec2  dy_vtc        = dFdy(texture_coordinate);
    float delta_max_sqr = max(dot(dx_vtc, dx_vtc), dot(dy_vtc, dy_vtc));
 
    //return max(0.0, 0.5 * log2(delta_max_sqr) - 1.0);
    return 0.5 * log2(delta_max_sqr);
}

Effectively, we're taking texture coordinate gradients in screen-space (and this is, btw, the reason why rasterizer spits out 2x2 pixel quads and why you should avoid having subpixel triangles) and then taking the max between the lengths of gradients in x and y directions. Then, we convert the metric distances to mip levels by taking a logarithm (and there is a nice optimization of post-multiplying logarithm by 0.5 instead of taking a square root of an argument).

Is it good enough? For most cases, yes. However, it starts failing in our case - when a gradient in one direction, y, is significantly larger than in another direction, x. That results in selecting a higher mip level (since we take the max), hence, the blurrier result.

What do you do if you want to tolerate a bit of noise (e.g., you are planning to get rid of it later with temporal anti-aliasing) for a bit more detail? What I've seen in my practice was either introducing a mip bias, clipping the mip pyramid (i.e., having only 2-4 mip levels instead of a full mipmap chain), and, the most radical, just forcing the most detailed mip (i.e., 0).

I propose a more precise, better looking, and elegant solution. What we need to do is simply replace the max operator with min operator when we choose between x and y gradients.

Here's a quick&dirty ShaderToy demo I've made to illustrate the concept. The demo basically cross-fades between max (default) and min (proposed) mip selection. You can also try forcing mip 0 or adding mip bias to compare or other cool hacks you have up your sleeve ;)

Let me know in comments what you think!

Monday, July 6, 2015

The Myth of the Free ALUs

One of the myths which arised with the new generation of consoles was that arithmetic operations (aka ALUs) were from now on literally free. If long ago we used cube map LUTs to normalize vectors, on GCN it costs nothing. Sure, they do have some cost, but with 6-7 ALU cycles per byte read from VRAM recommended at the top memory bandwidth (which you even almost never get), they should be all hidden beyond the memory fetches latency.

Well, it turns out, they are not. And here are some of the reasons why.

Some ALUs are more expensive than others.

If an extra mad would unlikely (I won quite significantly one day by optimizing out a redundant matrix multiplication from a tight loop) make a change, divisions or even trigonometry would. Even in an SSAO shader could benefit up to 13% performance increase when arithmetic is optimized.

There are no free lunches now.

Everything you could have relied hardware to do before (cube map texcoord calculation, attribute interpolation etc.) is now done with ALUs. You can use nointerpolation to avoid, if needed.

The scalar/vector ALUs are unbalanced.

In case your shader utilizes too many SALUs, not only you may become scalar register bound, but also your VALU units might stall while waiting for a scalar op result. An example here is excessive lane swizzling: when I tried to use it as a "cheap shared memory replacement" for blur filter, I got a ridiculous SALU/VALU ratio.

Friday, December 26, 2014

Triangles got to go... or not?

Few months ago I had an argue with a colleague of mine of whether we still will be using triangles in 10 (or 15) years. While I naturally opposed the idea, I feel there is something into it. So why are we using triangles, really?

Saving by calculating in VS and interpolating in PS

In fact, this is less and less is a viable argument. Nowadays characters already feature near-to-one-pixel triangles. We are actually LOOSING in this case: we get 4-fold PS cost and even rasterizer choking on some architectures. Besides, one modern architectures VS parameter cache is a bottleneck, so it is even less expensive to recompute a view position in a PS, than calculate it in VS and interpolate.

To match DCC tools

It is actually quite awkward to model your characters with triangles - a lot of artists use tools like ZBrush where they sculpt with spheres instead.

Clipping/culling

Once again - it is not really optimal when you triangles become very tiny. Besides, one might use bounding volumes for other representations to clip invisible surface.

I probably see how one can gain benefit for fx/billboards - but that's quite a specific case. Again, overdraw due to unused space on those is a big issue.

All in all, to support triangles we now have a lot of overhead - VS, sometimes tessellation, rasterization - while sometimes they are nearly equivalent to final pixels (and this trend seem to continue).

I am curious to hear your opinions and rationales guys. Will triangles live forever or will be replaced with something else (voxels? splats?) in a decade or two?

Monday, November 10, 2014

Interview Feedback

What surprised me in the game industry when I moved to the West is the absolute absence of subj. I had done several interviews prior I joined Eidos, and in case it was not successful, nobody was giving any clue what was wrong. Even an HR would stop answering any emails.

It's hard for me to justify this. If you are afraid of legal issues - give a call instead of writing an email, same way as salary is usually discussed. It takes 5 minutes of your time - and it is nothing compared to the time you spent organizing and conduction the onsite.

It is especially crucial if you give a take-home test. Firstly, a candidate spends a weekend in implementing and polishing the assignment - so he/she deserves a small bit of your time being spent. Secondly, unlike onsite, he/she has absolutely no clue what he might have done wrong.

Game development companies are constantly complaining on how it is hard to get a qualified candidate, and that they need to pay significant ($15-30k) amount of money for headhunters and/or to relocate/make a visa for a candidate abroad. If you see a person is interested in gamedev/your company - why not explain him/her what are his/her weaknesses, so in 6 months or a year he/she studies and come better prepared next time? It costs virtually nothing.

P. S.
For the sake of justice, there are few companies that do give interview feedback, but the majority don't.

Thursday, November 6, 2014

Tessellation Topology Sucks... But It Doesn't Have To

Initially this all started with a deficiency (Easter egg? Gag?) found in DirectX documentation:

If you ever worked with a hardware tessellation in any GAPI, you would know that you cannot achieve the tessellation on the triangle above (or below right) with ANY combination. Instead, if you tessellate a triangle with a tessfactor = 5 (edge and inside), you would get an abomination like this:

Why is it bad? Obviously, compared to the former topology, the output triangles are no more equilateral, though the input one is. Moreover, even for a pretty uniform geometry you will get a 'spider web' effect at the vertices of input triangles:

Is there any rationale behind this? I think, the main reason to keep the latter tessellation is an ability to generate a meaningful result for an arbitrary inside tessellation factor: any route from inside a triangle from an input vertex to the opposite edge should hop through exactly inside_tess_factor - 1 output vertices.

Is that really needed? I think, not really. All tessellation algorithms I've seen so far do not actually care that much about the inside tess factor. Usually, a max or average edge tess factor is used.

The only problem with the former tessellation could be how to handle different edge tess factors, if adaptive tessellation is used. Well, one could simply eliminate the surplus output edge vertices (and the incident output edges), and split the resultant quads in two triangles each (sorry, no image for that). Yes, the tessellation would be imperfect - but this would be just locally for the transition area.

What are your ideas guys? Does it sound worth trying?