Peter Sikachev Dev Blog<br />
<h3 style="text-align: left;">
LCECBF: Linear Cost Exact Circular Bokeh Filter</h3>
Peter Sikachev, 2018-12-20 (updated 2018-12-25)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Depth-of-field has always been a costly post process for video games. In particular, the circle-of-confusion (bokeh) filter has proved to be a bottleneck. While deceptively simple (every weight is either zero or the same constant), it is non-separable: unlike a box or Gaussian filter, it cannot be run as two O(r) passes (where r is the filter radius) and instead needs an O(r * r) pass to get accurate results.<br />
<br />
There have been a lot of efforts to solve this issue. I apologize for being too lazy to compile proper bibliographical references, so I will simply list the approaches:<br />
<ul style="text-align: left;">
<li>NFS approach: works only for polygonal (e.g., hexagonal) bokeh - make a horizontal and then 2 diagonal passes, combine with a max filter (has some artifacts)</li>
<li>Crytek 2-pass approach with rotating kernel and flood fill</li>
<li>recent Fourier-series approach from EA (published at GDC18)</li>
</ul>
While being practical, these approaches still lack the following qualities:<br />
<ul style="text-align: left;">
<li>being accurate</li>
<li>requiring only one pass/no additional render targets</li>
</ul>
The approach I want to present here can apply a circular (or, really, any convex-shaped) bokeh filter to an image:<br />
<ul style="text-align: left;">
<li>with number of samples which is O(r), where r is filter radius</li>
<li>matching ground truth up to floating point error</li>
<li>single pass</li>
<li>not requiring any additional memory allocation</li>
</ul>
The proposed method utilizes an idea that was hinted to me about 10 years ago by my supervisor, Alexey Ignatenko, so here is a shout out to him. The key idea is that once you've computed a convolution with a constant-value filter kernel for one point, you can reuse most of it for the neighboring point:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3SNwg9oPozeUUHTgQgTydAPFLZ8OxgAUbv5GVV7dbf0SSJJ0B0AAxjRcx8moRfRQmvMSek8CUclMntvMifUUBaDb0Cun6DewoNvpM9bFIVM0VgIz8ZgzJZfSdYye9xeu6vi8yxV-PT4k/s1600/circles.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="208" data-original-width="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3SNwg9oPozeUUHTgQgTydAPFLZ8OxgAUbv5GVV7dbf0SSJJ0B0AAxjRcx8moRfRQmvMSek8CUclMntvMifUUBaDb0Cun6DewoNvpM9bFIVM0VgIz8ZgzJZfSdYye9xeu6vi8yxV-PT4k/s1600/circles.png" /></a></div>
If we compute the convolution for the point A, we can reuse most of it (blue part) for the point B. This is due to the fact that the weights are the same - the intrinsic property of the bokeh filter kernel. Now, to compute the convolution at B, all we have to do is take A's convolution, add all pixels in green and subtract all pixels in pink.<br />
<br />
How many pixels would this be? Perimeter-many :) For a circle (and regular polygons) this is a linear function of their radius.<br />
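The green and pink sets can be derived mechanically from the disc itself. Below is a minimal CPU-side sketch in Python (the function names are mine and purely illustrative), assuming the disc is defined as all integer offsets with x*x + y*y <= r*r and that the disc slides one pixel to the right:<br />

```python
def disc_offsets(radius):
    """All integer (x, y) offsets with x*x + y*y <= radius*radius."""
    r2 = radius * radius
    return [(x, y)
            for x in range(-radius, radius + 1)
            for y in range(-radius, radius + 1)
            if x * x + y * y <= r2]


def border_offsets(radius):
    """Offsets, relative to the new center B = A + (1, 0), of the pixels
    that leave (neg) and enter (pos) the disc when it slides right."""
    disc = set(disc_offsets(radius))
    # Entering (green): inside the disc at B, but not inside the disc at A.
    pos = sorted(p for p in disc if (p[0] + 1, p[1]) not in disc)
    # Leaving (pink): inside the disc at A, but not inside the disc at B.
    neg = sorted((x - 1, y) for (x, y) in disc if (x - 1, y) not in disc)
    return neg, pos


neg, pos = border_offsets(10)
print(len(disc_offsets(10)), len(neg), len(pos))
```

For a radius of 10 this gives one entering and one leaving pixel per row of the disc, i.e. 2 * (2r + 1) border samples in total, which is where the linear cost comes from.<br />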
<br />
How does this translate to shader code? Well, effectively, we can compute a full convolution for just one pixel, and then propagate it to its neighbors at linear cost. Naturally, that requires each compute shader thread to output more than one pixel.<br />
<br />
Here is the shader code (I apologize for hardcoded constants and other less-than-ideal things):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Texture2D Input : register( t0 ); <br />RWTexture2D<float4> Result : register( u0 );</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">[numthreads( 1, 32, 1 )]<br />void CSMain( uint3 Gid : SV_GroupID, uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID )<br />{<br /> const int nTotalSamples = 317;<br /> const int2 vCircleSamples[317] =<br /> {<br /> int2(-10, 0), int2(-9, -4), int2(-9, -3), int2(-9, -2), int2(-9, -1), int2(-9, 0), int2(-9, 1), int2(-9, 2), int2(-9, 3), int2(-9, 4),<br /> int2(-8, -6), int2(-8, -5), int2(-8, -4), int2(-8, -3), int2(-8, -2), int2(-8, -1), int2(-8, 0), int2(-8, 1), int2(-8, 2), int2(-8, 3),<br /> int2(-8, 4), int2(-8, 5), int2(-8, 6), int2(-7, -7), int2(-7, -6), int2(-7, -5), int2(-7, -4), int2(-7, -3), int2(-7, -2), int2(-7, -1),<br /> int2(-7, 0), int2(-7, 1), int2(-7, 2), int2(-7, 3), int2(-7, 4), int2(-7, 5), int2(-7, 6), int2(-7, 7), int2(-6, -8), int2(-6, -7),<br /> int2(-6, -6), int2(-6, -5), int2(-6, -4), int2(-6, -3), int2(-6, -2), int2(-6, -1), int2(-6, 0), int2(-6, 1), int2(-6, 2), int2(-6, 3),<br /> int2(-6, 4), int2(-6, 5), int2(-6, 6), int2(-6, 7), int2(-6, 8), int2(-5, -8), int2(-5, -7), int2(-5, -6), int2(-5, -5), int2(-5, -4),<br /> int2(-5, -3), int2(-5, -2), int2(-5, -1), int2(-5, 0), int2(-5, 1), int2(-5, 2), int2(-5, 3), int2(-5, 4), int2(-5, 5), int2(-5, 6),<br /> int2(-5, 7), int2(-5, 8), int2(-4, -9), int2(-4, -8), int2(-4, -7), int2(-4, -6), int2(-4, -5), int2(-4, -4), int2(-4, -3), int2(-4, -2),<br /> int2(-4, -1), int2(-4, 0), int2(-4, 1), int2(-4, 2), int2(-4, 3), int2(-4, 4), int2(-4, 5), int2(-4, 6), int2(-4, 7), int2(-4, 8),<br /> int2(-4, 9), int2(-3, -9), int2(-3, -8), int2(-3, -7), int2(-3, -6), int2(-3, -5), int2(-3, -4), int2(-3, -3), int2(-3, -2), int2(-3, -1),<br /> int2(-3, 0), int2(-3, 1), int2(-3, 2), int2(-3, 3), int2(-3, 4), int2(-3, 5), int2(-3, 6), int2(-3, 7), int2(-3, 8), int2(-3, 9),<br /> int2(-2, -9), int2(-2, -8), int2(-2, -7), int2(-2, -6), int2(-2, -5), int2(-2, -4), int2(-2, -3), int2(-2, -2), int2(-2, -1), int2(-2, 0),<br /> int2(-2, 1), 
int2(-2, 2), int2(-2, 3), int2(-2, 4), int2(-2, 5), int2(-2, 6), int2(-2, 7), int2(-2, 8), int2(-2, 9), int2(-1, -9),<br /> int2(-1, -8), int2(-1, -7), int2(-1, -6), int2(-1, -5), int2(-1, -4), int2(-1, -3), int2(-1, -2), int2(-1, -1), int2(-1, 0), int2(-1, 1),<br /> int2(-1, 2), int2(-1, 3), int2(-1, 4), int2(-1, 5), int2(-1, 6), int2(-1, 7), int2(-1, 8), int2(-1, 9), int2(0, -10), int2(0, -9),<br /> int2(0, -8), int2(0, -7), int2(0, -6), int2(0, -5), int2(0, -4), int2(0, -3), int2(0, -2), int2(0, -1), int2(0, 0), int2(0, 1),<br /> int2(0, 2), int2(0, 3), int2(0, 4), int2(0, 5), int2(0, 6), int2(0, 7), int2(0, 8), int2(0, 9), int2(0, 10), int2(1, -9),<br /> int2(1, -8), int2(1, -7), int2(1, -6), int2(1, -5), int2(1, -4), int2(1, -3), int2(1, -2), int2(1, -1), int2(1, 0), int2(1, 1),<br /> int2(1, 2), int2(1, 3), int2(1, 4), int2(1, 5), int2(1, 6), int2(1, 7), int2(1, 8), int2(1, 9), int2(2, -9), int2(2, -8),<br /> int2(2, -7), int2(2, -6), int2(2, -5), int2(2, -4), int2(2, -3), int2(2, -2), int2(2, -1), int2(2, 0), int2(2, 1), int2(2, 2),<br /> int2(2, 3), int2(2, 4), int2(2, 5), int2(2, 6), int2(2, 7), int2(2, 8), int2(2, 9), int2(3, -9), int2(3, -8), int2(3, -7),<br /> int2(3, -6), int2(3, -5), int2(3, -4), int2(3, -3), int2(3, -2), int2(3, -1), int2(3, 0), int2(3, 1), int2(3, 2), int2(3, 3),<br /> int2(3, 4), int2(3, 5), int2(3, 6), int2(3, 7), int2(3, 8), int2(3, 9), int2(4, -9), int2(4, -8), int2(4, -7), int2(4, -6),<br /> int2(4, -5), int2(4, -4), int2(4, -3), int2(4, -2), int2(4, -1), int2(4, 0), int2(4, 1), int2(4, 2), int2(4, 3), int2(4, 4),<br /> int2(4, 5), int2(4, 6), int2(4, 7), int2(4, 8), int2(4, 9), int2(5, -8), int2(5, -7), int2(5, -6), int2(5, -5), int2(5, -4),<br /> int2(5, -3), int2(5, -2), int2(5, -1), int2(5, 0), int2(5, 1), int2(5, 2), int2(5, 3), int2(5, 4), int2(5, 5), int2(5, 6),<br /> int2(5, 7), int2(5, 8), int2(6, -8), int2(6, -7), int2(6, -6), int2(6, -5), int2(6, -4), int2(6, -3), int2(6, -2), int2(6, -1),<br /> int2(6, 0), int2(6, 
1), int2(6, 2), int2(6, 3), int2(6, 4), int2(6, 5), int2(6, 6), int2(6, 7), int2(6, 8), int2(7, -7),<br /> int2(7, -6), int2(7, -5), int2(7, -4), int2(7, -3), int2(7, -2), int2(7, -1), int2(7, 0), int2(7, 1), int2(7, 2), int2(7, 3),<br /> int2(7, 4), int2(7, 5), int2(7, 6), int2(7, 7), int2(8, -6), int2(8, -5), int2(8, -4), int2(8, -3), int2(8, -2), int2(8, -1),<br /> int2(8, 0), int2(8, 1), int2(8, 2), int2(8, 3), int2(8, 4), int2(8, 5), int2(8, 6), int2(9, -4), int2(9, -3), int2(9, -2),<br /> int2(9, -1), int2(9, 0), int2(9, 1), int2(9, 2), int2(9, 3), int2(9, 4), int2(10, 0)<br /> };<br /><br /> const int2 vCircleSamplesNeg[] =<br /> {<br /> int2(-11, 0), int2(-10, -4), int2(-10, -3), int2(-10, -2), int2(-10, -1), int2(-10, 1), int2(-10, 2), int2(-10, 3), int2(-10, 4), int2(-9, -6),<br /> int2(-9, -5), int2(-9, 5), int2(-9, 6), int2(-8, -7), int2(-8, 7), int2(-7, -8), int2(-7, 8), int2(-5, -9), int2(-5, 9), int2(-1, -10),<br /> int2(-1, 10)<br /> };<br /> const int totalSamplesBorderNeg = 21;<br /><br /> const int2 vCircleSamplesPos[] =<br /> {<br /> int2(0, -10), int2(0, 10), int2(4, -9), int2(4, 9), int2(6, -8), int2(6, 8), int2(7, -7), int2(7, 7), int2(8, -6), int2(8, -5),<br /> int2(8, 5), int2(8, 6), int2(9, -4), int2(9, -3), int2(9, -2), int2(9, -1), int2(9, 1), int2(9, 2), int2(9, 3), int2(9, 4),<br /> int2(10, 0)<br /> };<br /> const int totalSamplesBorderPos = 21;<br /> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> float4 res = 0;<br /> int2 coord = int2(DTid.x * 64, DTid.y);<br /> [loop]<br /> for (int s = 0; s < nTotalSamples; ++s)<br /> {<br /> res += Input[coord + vCircleSamples[s]];<br /> }<br /> res /= float(nTotalSamples);<br /> Result[coord] = res;<br /> float4 prevRes = res;<br /><br /> [loop]<br /> for (int i = 1; i < 64; ++i)<br /> {<br /> res = 0;<br /> coord = int2(DTid.x * 64 + i, DTid.y);<br /> [loop]<br /> for (int s = 0; s < totalSamplesBorderNeg; ++s)<br /> {<br /> res -= Input[coord + vCircleSamplesNeg[s]];<br /> }<br /> [loop]<br /> for (int s = 0; s < totalSamplesBorderPos; ++s)<br /> {<br /> res += Input[coord + vCircleSamplesPos[s]];<br /> }<br /><br /> res /= float(nTotalSamples);<br /> res += prevRes;<br /> Result[coord] = res;<br /> prevRes = res;<br /> } <br />}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
And here is the code for the ground truth (all samples) bokeh computation:<br />
<span style="font-family: "courier new" , "courier" , monospace;">float4 res = 0;<br />int2 coord = int2(DTid.x, DTid.y);<br />[loop]<br />for (int s = 0; s < nTotalSamples; ++s)<br />{<br /> res += Input[coord + vCircleSamples[s]];<br />}<br /><br />res /= float(nTotalSamples);<br />Result[coord] = res;</span><br />
<br />
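The two shaders can be cross-checked on the CPU. This is only a sketch under my own assumptions: pixels are integers (so the sums compare exactly, without the final division), reads outside the image return zero, and the sliding sum runs across a whole row instead of restarting every 64 pixels as the shader does:<br />

```python
import random


def disc_offsets(radius):
    r2 = radius * radius
    return [(x, y)
            for x in range(-radius, radius + 1)
            for y in range(-radius, radius + 1)
            if x * x + y * y <= r2]


def fetch(image, x, y):
    """Read a pixel, returning 0 outside the image (assumed border policy)."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return 0


def brute_force_row(image, y, radius):
    """Ground truth: sum every disc sample for every pixel of row y."""
    disc = disc_offsets(radius)
    return [sum(fetch(image, x + dx, y + dy) for dx, dy in disc)
            for x in range(len(image[0]))]


def sliding_row(image, y, radius):
    """Proposed scheme: full sum for x = 0, then border updates only."""
    disc = disc_offsets(radius)
    inside = set(disc)
    pos = [p for p in disc if (p[0] + 1, p[1]) not in inside]
    neg = [(dx - 1, dy) for dx, dy in disc if (dx - 1, dy) not in inside]
    total = sum(fetch(image, dx, y + dy) for dx, dy in disc)
    row = [total]
    for x in range(1, len(image[0])):
        total += sum(fetch(image, x + dx, y + dy) for dx, dy in pos)
        total -= sum(fetch(image, x + dx, y + dy) for dx, dy in neg)
        row.append(total)
    return row


random.seed(0)
image = [[random.randint(0, 255) for _ in range(40)] for _ in range(30)]
matches = all(brute_force_row(image, y, 10) == sliding_row(image, y, 10)
              for y in range(len(image)))
print(matches)
```

With integer pixels the two rows should match exactly; with floating-point pixels a small accumulated error is expected, which is one reason to restart from a full sum every so often.<br />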
Here are the dispatch calls, just in case:<br />
<span style="font-family: "courier new" , "courier" , monospace;">pd3dImmediateContext->Dispatch((width + 63) / 64, (height + 31) / 32, 1); // Proposed method</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pd3dImmediateContext->Dispatch(width, (height + 31) / 32, 1); // Ground truth</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
Performance numbers for a 21x21 filter, Full HD, GeForce GTX 1060, RGBA16_FLOAT src/dst (I didn't do any thorough optimization, e.g., half-res, making fetches more cache-friendly, etc.):<br />
Ground Truth - 5ms<br />
Proposed Method - 1.7ms<br />
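As a rough sanity check, here is the fetch-count arithmetic for the constants used in the shader above. It only counts texture fetches and ignores caches, occupancy and the serial dependency between neighboring pixels, so it is a back-of-the-envelope estimate rather than a performance model:<br />

```python
DISC_SAMPLES = 317         # nTotalSamples in the shader
BORDER_SAMPLES = 21 + 21   # totalSamplesBorderNeg + totalSamplesBorderPos
STRIP_LENGTH = 64          # pixels written by one thread


def fetches_per_pixel(disc, border, strip):
    """Average texture fetches per output pixel for one strip."""
    return (disc + (strip - 1) * border) / strip


proposed = fetches_per_pixel(DISC_SAMPLES, BORDER_SAMPLES, STRIP_LENGTH)
print(f"ground truth: {DISC_SAMPLES} fetches/pixel")
print(f"proposed:     {proposed:.1f} fetches/pixel")
print(f"fetch ratio:  {DISC_SAMPLES / proposed:.1f}x")
print(f"measured:     {5.0 / 1.7:.1f}x")
```

The measured speed-up is noticeably smaller than the reduction in fetches, which suggests that the unoptimized version is limited by something other than the raw sample count.<br />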
<br />
Visual results (I apologize for not handling borders properly; again, kinda lazy):<br />
<br />
UPD: closeup for smartphone users :)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2JrJqIQBi80za9NZmHOPd_mHKTHDMj9E_3egr4m9F_6Sq8iZsM4Fvjs1lhS0GFnTgPrj-AoulduOS6FhKLUn0DlejUrhzC-S9k4H4Oebz6c-6efKNZNyCXjR_ptue7v8BOVHMY_oE0zE/s1600/closeup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="316" data-original-width="531" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2JrJqIQBi80za9NZmHOPd_mHKTHDMj9E_3egr4m9F_6Sq8iZsM4Fvjs1lhS0GFnTgPrj-AoulduOS6FhKLUn0DlejUrhzC-S9k4H4Oebz6c-6efKNZNyCXjR_ptue7v8BOVHMY_oE0zE/s1600/closeup.png" /></a></div>
<br />
Original:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6ixzmNbIAQOvdQ1MhSz-q1Mx-Q20nUExvM9bAiTPBdR7zPpwlrAFGLap4uFHYH43DUBvw4tPxcKA1FhHSJvRJDT3MUM6YAlFWr4ZKAGhQaoqz3ao3x6B7mLdRkZzKw60MD7OOkQcnQM/s1600/nodof.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6ixzmNbIAQOvdQ1MhSz-q1Mx-Q20nUExvM9bAiTPBdR7zPpwlrAFGLap4uFHYH43DUBvw4tPxcKA1FhHSJvRJDT3MUM6YAlFWr4ZKAGhQaoqz3ao3x6B7mLdRkZzKw60MD7OOkQcnQM/s640/nodof.png" width="640" /></a></div>
<br />
Ground Truth Filter:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzWP3rpM-qu33YvQzBJUdIzatuV8FuBM4oauNRrTdovsdn30XLor_LgkXyYYXlQt6R86RrHbaLsyeV-_7T0DhJ30gm2o0cHpE1I61RIYHScEBkslh3CoC0ZXpqZvPpxGjQ7QIkmJUhxjU/s1600/dofgroundtruth.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzWP3rpM-qu33YvQzBJUdIzatuV8FuBM4oauNRrTdovsdn30XLor_LgkXyYYXlQt6R86RrHbaLsyeV-_7T0DhJ30gm2o0cHpE1I61RIYHScEBkslh3CoC0ZXpqZvPpxGjQ7QIkmJUhxjU/s640/dofgroundtruth.png" width="640" /></a></div>
<br />
Proposed Method Filter:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSFNjNpzJK3rbxJ6YjW-2h38FJUx-FmQfFRzed0IZvqzG5_EfX7qfN1E5dSKbvrN1Sy8_mA3O1W0rh0pvAp9RJHDfc8OO9z_lAYw7Ew4d_g9wH6owZFQ9ZZpHIzfAsCiwZos1FLrRxQJs/s1600/doffast.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSFNjNpzJK3rbxJ6YjW-2h38FJUx-FmQfFRzed0IZvqzG5_EfX7qfN1E5dSKbvrN1Sy8_mA3O1W0rh0pvAp9RJHDfc8OO9z_lAYw7Ew4d_g9wH6owZFQ9ZZpHIzfAsCiwZos1FLrRxQJs/s640/doffast.png" width="640" /></a></div>
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: inherit;"></span> </span><br />
Obviously, there is still a lot of optimization and polishing to be done to make it production-ready, but I think the concept is original and worth digging into (and I like how it exploits the constant-weight property of the bokeh filter kernel). <br />
<br />
Please let me know (in the comments, Twitter, private messages etc.) if you want me to polish and publish a demo. If there are enough people interested, I will try to find time for that :)</div>
<h3 style="text-align: left;">
MinLod: A Cheap&Simple Method to Increase Texture Details at Acute Angles</h3>
Peter Sikachev, 2018-10-26<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<b>UPD</b>: a simpler and faster solution utilizing textureGrad instead of textureLod is uploaded (same ShaderToy <a href="https://www.shadertoy.com/view/Xl3fRs">link</a>) thanks to Sergey Makeev's constructive feedback.<br />
<br />
I haven't written here for over 3 years, and I think it's time to fix this :)<br />
<br />
So, let's start with the problem definition. Without further ado, consider the following image (that I've stolen somewhere from the internets):<br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi26DCRGRHARhBdE0-XuDd04FS_-2WD-PD9NPVteNu-9X_DYURicmbhRjBI82eBmJ9CUi5vWLgB_T9-KGWvi0RcFREynOUSywhH04Za9oeP1aaBTvULDvwQmr26j-fq7Np8DDpadA9OaOU/s1600/zmds7n.jpg.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="899" data-original-width="1599" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi26DCRGRHARhBdE0-XuDd04FS_-2WD-PD9NPVteNu-9X_DYURicmbhRjBI82eBmJ9CUi5vWLgB_T9-KGWvi0RcFREynOUSywhH04Za9oeP1aaBTvULDvwQmr26j-fq7Np8DDpadA9OaOU/s400/zmds7n.jpg.png" width="400" /></a><br />
<br />
So, ultimately, we want to get as close as possible to the anisotropic filtering, but it's usually too expensive. Point filtering is way too aliased, while the classic mipmapping is too blurry in the distance.<br />
<br />
But why is mipmapping too blurry in this case, and what can be done about it (other than expensive anisotropic filtering)? Well, let's consult the OpenGL spec to figure out why:<br />
<br />
<pre style="background: none; font: normal normal 1em/1.2em monospace; margin: 0; padding: 0; vertical-align: top;">float
mip_map_level(in vec2 texture_coordinate)
{
    // The OpenGL Graphics System: A Specification 4.2
    //  - chapter 3.9.11, equation 3.21
    // texture_coordinate is expected in texel units (uv * texture size).
    vec2 dx_vtc = dFdx(texture_coordinate);
    vec2 dy_vtc = dFdy(texture_coordinate);
    float delta_max_sqr = max(dot(dx_vtc, dx_vtc), dot(dy_vtc, dy_vtc));
    //return max(0.0, 0.5 * log2(delta_max_sqr) - 1.0);
    return 0.5 * log2(delta_max_sqr);
}</pre>
<br />
Effectively, we're taking texture coordinate gradients in screen-space (and this is, btw, the reason why the rasterizer spits out 2x2 pixel quads and why you should avoid having subpixel triangles) and then taking the <b>max </b>of the gradient lengths in the <b>x </b>and <b>y</b> directions. Then, we convert the metric distances to mip levels by taking a logarithm (and there is a nice optimization of post-multiplying the logarithm by 0.5 instead of taking a square root of the argument).<br />
<br />
Is it good enough? For most cases, yes. However, it starts failing in our case - when a gradient in one direction, <b>y</b>, is significantly larger than in another direction, <b>x</b>. That results in selecting a higher mip level (since we take the <b>max</b>), hence, the blurrier result.<br />
<br />
What do you do if you want to tolerate a bit of noise (e.g., you are planning to get rid of it later with temporal anti-aliasing) for a bit more detail? What I've seen in practice is introducing a mip bias, clipping the mip pyramid (i.e., having only 2-4 mip levels instead of a full mipmap chain), or, most radically, just forcing the most detailed mip (i.e., 0).<br />
<br />
I propose a more precise, better looking, and elegant solution. What we need to do is simply replace the <b>max </b>operator with <b>min </b>operator when we choose between <b>x </b>and <b>y </b>gradients.<br />
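To make the difference concrete, here is a small numeric sketch in Python. The gradient values are made up for illustration, and the gradients are assumed to be expressed in texels per pixel, as in the spec formula:<br />

```python
import math


def mip_level(dx, dy, select=max):
    """Mip level from screen-space gradients given in texels per pixel.

    dx and dy are (du, dv) pairs; select is max (the spec's choice)
    or min (the selection proposed in this post).
    """
    len_sqr_x = dx[0] * dx[0] + dx[1] * dx[1]
    len_sqr_y = dy[0] * dy[0] + dy[1] * dy[1]
    # 0.5 * log2(l^2) == log2(l), so no square root is needed.
    return 0.5 * math.log2(select(len_sqr_x, len_sqr_y))


# Hypothetical gradients for a floor seen at a grazing angle:
# 1 texel per pixel horizontally, 16 texels per pixel vertically.
dx, dy = (1.0, 0.0), (0.0, 16.0)
print(mip_level(dx, dy, max))  # 4.0
print(mip_level(dx, dy, min))  # 0.0
```

In this example <b>max </b>picks mip 4 while <b>min </b>stays at mip 0: sharper along the well-sampled axis, at the price of aliasing along the other one.<br />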
<br />
Here's a quick&dirty ShaderToy <a href="https://www.shadertoy.com/view/Xl3fRs">demo</a> I've made to illustrate the concept. The demo basically cross-fades between <b>max </b>(default) and <b>min </b>(proposed) mip selection. For comparison, you can also try forcing mip 0, adding a mip bias, or any other cool hacks you have up your sleeve ;)<br />
<br />
Let me know in the comments what you think!</div>
<h3 style="text-align: left;">
The Myth of the Free ALUs</h3>
Peter Sikachev, 2015-07-06<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
One of the myths which arose with the new generation of consoles was that arithmetic operations (aka ALUs) were from now on literally free. If long ago we used cube map LUTs to normalize vectors, on <a href="http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah">GCN</a> it costs nothing. Sure, they do have some cost, but with the recommended 6-7 ALU cycles per byte read from VRAM at peak memory bandwidth (which you almost never get anyway), they should all be hidden behind memory fetch latency.<br />
<br />
Well, it turns out, they are not. And here are <i>some</i> of the reasons why.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidG5ellEq5y65-u4OFCQc08NDvVeMaRl-BBoDLSS9koVSz21aYkB0YNFtZEK8PN5sdrlB3NrVBrNzN14VatdWrNnBvGyvu_iTkcYKTVge1YMp2hedIjwxnZYZQB43RQxZ_eN7738eaHGY/s1600/on_my_computer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidG5ellEq5y65-u4OFCQc08NDvVeMaRl-BBoDLSS9koVSz21aYkB0YNFtZEK8PN5sdrlB3NrVBrNzN14VatdWrNnBvGyvu_iTkcYKTVge1YMp2hedIjwxnZYZQB43RQxZ_eN7738eaHGY/s400/on_my_computer.png" width="400" /></a></div>
<h4 style="text-align: left;">
Some ALUs are more expensive than others. </h4>
While an extra mad is <i>unlikely </i>to make a difference (although I did once win quite significantly by optimizing a redundant matrix multiplication out of a tight loop), divisions and especially trigonometry will. Even an SSAO shader can gain <a href="https://michaldrobot.files.wordpress.com/2014/05/gcn_alu_opt_digitaldragons2014.pdf">up to 13%</a> in performance when its arithmetic is optimized.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0vFFg6ewqU7BemkZ8upMwgjVYwARKrnzFPt5l8SraXIT9a6Y4I5L4gD1YMYaiGtVWjv2WSwEkrDNJYpMQWI60P2mUrZ3vil1JuhTG11UAWRwnkbhzhbuXzwcRT-feENComBdQ4ls8CXk/s1600/kitten.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="347" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0vFFg6ewqU7BemkZ8upMwgjVYwARKrnzFPt5l8SraXIT9a6Y4I5L4gD1YMYaiGtVWjv2WSwEkrDNJYpMQWI60P2mUrZ3vil1JuhTG11UAWRwnkbhzhbuXzwcRT-feENComBdQ4ls8CXk/s400/kitten.png" width="400" /></a></div>
<div>
<h4 style="text-align: left;">
There are no free lunches now. </h4>
Everything you could have relied on the hardware to do before (cube map texcoord calculation, attribute interpolation etc.) is now <a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Low-level-Shader-Optimization-for-Next-Gen-and-DX11-Emil-Persson.pptx">done with ALUs</a>. You can use <b>nointerpolation </b>to avoid the interpolation cost, if needed.<br />
<h4 style="text-align: left;">
The scalar/vector ALUs are unbalanced.</h4>
<div style="text-align: left;">
In case your shader utilizes too many SALUs, not only may you become scalar register bound, but your VALU units might also stall while waiting for a scalar op result. An example here is excessive lane swizzling: when I tried to use it as a "cheap shared memory replacement" for a blur filter, I got a ridiculous SALU/VALU ratio.</div>
<div>
<br /></div>
</div>
</div>
<h3 style="text-align: left;">
Triangles got to go... or not?</h3>
Peter Sikachev, 2014-12-26<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
A few months ago I had an argument with a colleague of mine about whether we will still be using triangles in 10 (or 15) years. While I naturally opposed the idea, I feel there is something to it. So why are we using triangles, really?<br />
<br />
<h4 style="text-align: left;">
Saving by calculating in VS and interpolating in PS</h4>
<div style="text-align: left;">
In fact, this is less and less of a viable argument. Nowadays characters already feature near-to-one-pixel triangles. We are actually LOSING in this case: we get a 4-fold PS cost and even rasterizer choking on some architectures. Besides, on modern architectures the VS parameter cache is a bottleneck, so it can even be cheaper to recompute a view position in the PS than to calculate it in the VS and interpolate.<br />
</div>
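The 4-fold PS cost mentioned above comes from shading in 2x2 quads. Here is a deliberately simplified Python model of it (my own illustration: coverage is tested at pixel centers with no fill-rule handling, and every touched quad is assumed to run all four pixel shader lanes; real rasterizers are more involved):<br />

```python
def edge(a, b, p):
    """Signed area test: > 0 when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def shading_cost(v0, v1, v2, width, height):
    """Return (covered pixels, PS invocations) for one CCW triangle."""
    covered = 0
    quads = set()
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            if (edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0
                    and edge(v2, v0, p) >= 0):
                covered += 1
                quads.add((x // 2, y // 2))
    # Each touched 2x2 quad launches four lanes, covered or not.
    return covered, 4 * len(quads)


print(shading_cost((2.2, 2.2), (2.9, 2.3), (2.4, 2.9), 8, 8))  # (1, 4)
print(shading_cost((0, 0), (8, 0), (0, 8), 8, 8))              # (36, 40)
```

In this model the pixel-sized triangle pays for four invocations per visible pixel, while the large one wastes only about 10%.<br />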
<h4 style="text-align: left;">
To match DCC tools</h4>
<div style="text-align: left;">
It is actually quite awkward to model your characters with triangles - a lot of artists use tools like ZBrush where they sculpt with spheres instead.<br />
</div>
<h4 style="text-align: left;">
Clipping/culling</h4>
<div style="text-align: left;">
Once again - it is not really optimal when your triangles become very tiny. Besides, one might use bounding volumes for other representations to cull invisible surfaces.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I can probably see how one benefits from triangles for fx/billboards - but that's quite a specific case. Again, overdraw due to unused space on those is a big issue.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
All in all, to support triangles we now have a lot of overhead - VS, sometimes tessellation, rasterization - while sometimes they are nearly equivalent to final pixels (and this trend seems to continue).</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I am curious to hear your opinions and rationales, guys. Will triangles live forever, or will they be replaced with something else (voxels? splats?) in a decade or two?</div>
</div>
<h3 style="text-align: left;">
Interview Feedback</h3>
Peter Sikachev, 2014-11-10<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
What surprised me in the game industry when I moved to the West is the absolute absence of interview feedback. I had done several interviews prior to joining Eidos, and whenever one was unsuccessful, nobody gave me any clue as to what went wrong. Even HR would stop answering emails.<br />
<br />
It's hard for me to justify this. If you are afraid of legal issues - give a call instead of writing an email, the same way salary is usually discussed. It takes 5 minutes of your time - and it is nothing compared to the time you spent organizing and conducting the onsite.<br />
<br />
It is especially crucial if you give a take-home test. Firstly, a candidate spends a weekend implementing and polishing the assignment - so he/she deserves a small bit of your time in return. Secondly, unlike with an onsite, he/she has absolutely no clue what he/she might have done wrong.<br />
<br />
Game development companies are constantly complaining about how hard it is to get a qualified candidate, and that they need to pay a significant amount of money ($15-30k) to headhunters and/or to relocate/obtain a visa for a candidate from abroad. If you see that a person is interested in gamedev/your company, why not explain to him/her what his/her weaknesses are, so that in 6 months or a year he/she can study and come back better prepared? It costs virtually nothing.<br />
<br />
P. S.<br />
In all fairness, there are a few companies that do give interview feedback, but the majority don't.</div>
<h3 style="text-align: left;">
Tessellation Topology Sucks... But It Doesn't Have To</h3>
Peter Sikachev, 2014-11-06<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Initially this all started with a deficiency (Easter egg? Gag?) found in <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ff476340(v=vs.85).aspx">DirectX documentation</a>:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://i.msdn.microsoft.com/dynimg/IC534080.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.msdn.microsoft.com/dynimg/IC534080.png" height="293" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
If you have ever worked with hardware tessellation in any graphics API, you know that you cannot achieve the tessellation of the triangle above (or below right) with ANY combination of tessellation factors. Instead, if you tessellate a triangle with a <b>tessfactor</b> = 5 (edge and inside), you get an abomination like this:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTFggXldAdIEdYXzSry3iwQGJOiSZm46E1pTiBgK8b4LRxexpvj-wuzBA3Lo2zYW4tf5QulRMrks5BMx96O-c8FY1xIgh4K_HjTvDs0qzNnxMHXsyEK7EyxRjBkdqSOMA_03uCYkgHKm8/s1600/tesstri.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTFggXldAdIEdYXzSry3iwQGJOiSZm46E1pTiBgK8b4LRxexpvj-wuzBA3Lo2zYW4tf5QulRMrks5BMx96O-c8FY1xIgh4K_HjTvDs0qzNnxMHXsyEK7EyxRjBkdqSOMA_03uCYkgHKm8/s1600/tesstri.png" height="277" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Why is it bad? Obviously, compared to the former topology, the output triangles are no longer equilateral, even though the input one is. Moreover, even for pretty uniform geometry you will get a 'spider web' effect at the vertices of the input triangles:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.geeks3d.com/public/jegx/200902/gpu-tesselation-terrain-level-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.geeks3d.com/public/jegx/200902/gpu-tesselation-terrain-level-1.jpg" height="288" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Is there any rationale behind this? I think the main reason to keep the latter tessellation is the ability to generate a meaningful result for an <u>arbitrary</u> inside tessellation factor: any route inside a triangle from an input vertex to the opposite edge should hop through <u>exactly</u> <b>inside_tess_factor - 1</b> output vertices.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Is that really needed? I think, not really. All tessellation algorithms I've seen so far do not actually care that much about the inside tess factor. Usually, a max or average edge tess factor is used.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The only problem with the former tessellation could be how to handle different edge tess factors, if adaptive tessellation is used. Well, one could simply eliminate the surplus output edge vertices (and the incident output edges), and split the resultant quads into two triangles each (sorry, no image for that). Yes, the tessellation would be imperfect - but only locally, in the transition area.</div>
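For reference, the uniform ("former") topology is easy to generate on the CPU. The Python sketch below is only an illustration of that topology for a single tess factor n (the function name is mine); it is not what any hardware tessellator does, and it does not handle differing edge factors:<br />

```python
def uniform_tessellation(n):
    """Split a triangle into n*n congruent sub-triangles.

    Vertices are barycentric triples (i, j, n - i - j) / n; returns
    (vertices, triangles) with triangles as vertex index triples.
    """
    index = {}
    vertices = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            index[(i, j)] = len(vertices)
            vertices.append((i / n, j / n, (n - i - j) / n))

    triangles = []
    for i in range(n):
        for j in range(n - i):
            # "Upright" triangle of the cell.
            triangles.append((index[(i, j)], index[(i + 1, j)],
                              index[(i, j + 1)]))
            # "Inverted" triangle, absent in the last cell of each row.
            if j < n - i - 1:
                triangles.append((index[(i + 1, j)], index[(i + 1, j + 1)],
                                  index[(i, j + 1)]))
    return vertices, triangles


vertices, triangles = uniform_tessellation(5)
print(len(vertices), len(triangles))  # 21 25
```

For a tess factor of n this produces (n + 1)(n + 2) / 2 vertices and n * n triangles, all similar to the input triangle.<br />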
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What are your ideas, guys? Does it sound worth trying?</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
</div>
<h3 style="text-align: left;">
Computer Architecture - How To?</h3>
Peter Sikachev, 2014-08-25<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Some time ago I remember asking a more senior team member: how does one learn how hardware operates at the low level? I was told that this could only be learned through practice.<br />
While I don't think that this is a completely invalid approach - since practical experience is always more valuable than raw theory - I was searching for some theoretical basis I could get before diving into practice. I think I now have something that I can recommend.<br />
Firstly, there is this course on hardware architecture, which is really good: <a href="https://www.coursera.org/course/comparch">https://www.coursera.org/course/comparch</a><br />
Beware though, it is a grad-level course, so it is really very demanding. You should already know what caches, associativity, the memory hierarchy, etc. are. They recommend the following book:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X/"><img alt="http://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X/" border="0" src="http://ecx.images-amazon.com/images/I/51Z9WQLtkRL._SX258_BO1,204,203,200_.jpg" height="320" width="259" /></a></div>
I really recommend you start reading it from the appendix, which serves as a sort of primer. The book is also very up-to-date, even more so than the course itself.<br />
The second recommended book (<a href="http://www.amazon.com/Modern-Processor-Design-Fundamentals-Superscalar/dp/1478607831/">http://www.amazon.com/Modern-Processor-Design-Fundamentals-Superscalar/dp/1478607831/</a>) provides a lot of low-level details on reservation stations/register renaming, goes very in-depth on a pair of particular architectures, and gives a comprehensive description of branch prediction algorithms. Although it has just been republished, it is a bit outdated (e.g., the Intel P6 microarchitecture is used as an example), so it is really up to you whether to buy it or pass.<br />
I hope this has been helpful, and as previously, feel free to comment and discuss.</div>
<h3 style="text-align: left;">
SIGGRAPH Talk Accompanying Video</h3>
Peter Sikachev, 2014-08-19<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div style="text-align: left;">
Thanks to our marketing department, I got permission to post the video we showed during our talk, which presents the Thief artistic pipeline for reflection creation and setup. It might be somewhat unclear without commentary, sorry about that:</div>
<div style="text-align: center;">
<br /><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/vxKheqKvKR0?feature=player_embedded' frameborder='0'></iframe></div>
</div>
<h3 style="text-align: left;">
SIGGRAPH 2014 Talk 'Reflections in Thief'</h3>
Peter Sikachev, 2014-08-18<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Hi guys!<br />
I've decided to start a new dev blog where I'll write about stuff I'm working on as well as book reviews and my random thoughts on graphics in particular and games tech in general.<br />
<br />
I don't promise regular updates, but, hopefully, I'll find some time to do them. For a start, I'll upload my <a href="https://www.dropbox.com/s/eizabia22186umf/SIGGRAPH2014_Draft.pptx">SIGGRAPH talk (45.6 MB, .pptx)</a> from this year.<br />
<br />
Enjoy, and I'll be more than happy to discuss it in the comments.<br />
Peter</div>