RDNA 2 – 128 MB of L3 “Infinity” Cache – If True, What It Could Mean For Radeon

As of late, there have been ever-increasing murmurings and rumblings that RDNA2 GPUs, at the highest tiers anyway, may have a 128 MB shared L3 cache, and about the possible implications of that for the performance of this next generation of AMD GPUs.

Unfortunately, it would seem most people don’t, or at least pretend not to, have a clue what a huge game changer this could be, or how enormous the implications of such an advancement would be – if it comes to pass. Never mind the fact that a GPU with anywhere near that amount of shared L3 cache would be a historic first. Or the fact that the 3950X, a modern 16 core CPU – also from AMD – only has 64 MB of L3 cache. Which is still heaps more than any Intel CPU has, even with the same core count.

People don’t seem to realize or understand, or pretend not to understand, that rasterization on a GPU is entirely different from running code on a CPU. The things such a large amount of cache would be used for, and the way it would be used, would be completely different from how and what the L3 cache on a CPU is used for: caching data and instructions.

Consequently, most people seem to think that this could just make up for the purported bandwidth disadvantage of RDNA2, as RDNA2 GPUs are also rumoured to have narrow memory bus widths, at least compared to Ampere and to AMD GPUs based on the GCN architecture.

This is very myopic, obtuse and limited thinking and betrays the thick ignorance of most tech pundits and fanboys regarding how rasterization rendering works on the GPU, what it actually does and how it goes about it.

3D graphics relies heavily on the use of raster images, bitmaps, for information on what to draw on the triangular facets of geometric models, and also on how to present those surface details or how to dynamically alter their appearance. These bitmaps are called textures. Modern game engines which employ or are designed around physically based rendering use multiple – usually up to seven – different bitmaps per material. A material is the set of all of the data and instructions necessary to accurately represent the intrinsic surface details on the triangles of a (game’s) geometric model as the artist or developer intended.

The first and indispensable type of bitmap a material can be comprised of is the colour map. These are what are generally referred to as textures and, historically, the first type of bitmap used to depict surface details on the faces of the triangles which comprise the surface of 3D models in games and other 3D applications – models which are almost always effectively hollow and contain no information on what to pretend to contain or how to depict what would appear to be their contents or interior. This map may also be referred to as an ‘albedo’, ‘diffuse’ or ‘base colour’ map.

The colour bitmap gives the base colour for each of a multitude of small apparent patches on the faces of the triangles which make up a given object’s 3D model in game. Each such patch is the area which falls entirely within the confines of a single pixel of the texture bitmap (generally referred to as a texel), once the texture has been stretched and applied to the surface of the triangle in the particular position, orientation and relative stretching which the artist who made that 3D model intended during the game’s development. And that applies to every single triangle comprising the surface of the 3D object which might conceivably ever be seen in game: no visible triangle of a 3D model is left untextured, and non-visible triangles are generally not even included in the model, for performance optimisation reasons.
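
To make the patch-to-texel relationship concrete, here is a minimal nearest-neighbour sampling sketch in Python. The NumPy representation and the function name are mine, purely for illustration – real GPUs do this in fixed-function texture units, usually with filtering on top:

```python
import numpy as np

def sample_nearest(texture: np.ndarray, u: float, v: float) -> np.ndarray:
    """Return the texel an interpolated (u, v) coordinate falls inside.

    `texture` is an (H, W, channels) array; u and v are in [0, 1],
    as produced by interpolating the per-vertex UVs across a triangle.
    """
    h, w, _ = texture.shape
    # Map the continuous UV coordinate to a discrete texel index;
    # every point inside the same texel maps to the same index.
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return texture[y, x]

# A tiny 2x2 RGB "texture": every fragment whose UVs land in the
# top-left quadrant gets the same (red) texel, per the patch analogy.
tex = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 0]]], dtype=np.uint8)
print(sample_nearest(tex, 0.2, 0.3))  # -> [255 0 0]
```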

The second type of bitmap used in the materials which represent the surface texture of 3D models in modern game engines is the roughness map, which, for each and every one of the same small surface patches mentioned before, specifies how shiny or reflective it should be.

The third type of raster image used, the displacement map, isn’t necessarily always present. Its use is contingent upon the game utilizing tessellation to dynamically add geometry to the surface of the model, enabling extra surface details such as protrusions or depressions without these needing to be actually included in the base triangle mesh of the model or object. Tessellation also enables LOD (level of detail) style optimisation of very high triangle count meshes, by letting the game engine apply tessellation only to such high-geometry-detail objects near the player and not further away (a sketch of that decision follows below). By comparison, base mesh geometry level triangle count optimisation requires more manual effort from game developers, who need to hand-author several different instances of the same object or 3D model with varying degrees of geometric complexity (more or fewer triangles).
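
As a sketch of that distance-based decision – with entirely made-up threshold names and values, since every engine tunes this differently:

```python
def tessellation_factor(distance_to_camera: float,
                        full_detail_dist: float = 5.0,
                        cutoff_dist: float = 50.0,
                        max_factor: int = 16) -> int:
    """Crude distance-based tessellation level, as an engine might
    compute per object: full subdivision up close, none far away."""
    if distance_to_camera <= full_detail_dist:
        return max_factor
    if distance_to_camera >= cutoff_dist:
        return 1  # leave the base mesh untouched
    # Linear falloff between the two thresholds.
    t = (cutoff_dist - distance_to_camera) / (cutoff_dist - full_detail_dist)
    return max(1, int(max_factor * t))

for d in (2.0, 10.0, 30.0, 100.0):
    print(d, tessellation_factor(d))  # 16, 14, 7, 1
```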

Nevertheless, tessellation has historically been used quite excessively and needlessly liberally – in my opinion, very likely deliberately, in an intentional effort to degrade performance on AMD GPUs, especially ones based on earlier iterations of the GCN architecture such as Hawaii, more than it degraded performance on nVidia GPUs of the time (which simply happened to have other weaknesses or bottlenecks than choking on biblical floods’ worth of geometry) – all for practically unnoticeable improvements in the appearance and silhouette smoothness of in-game 3D models. Which was probably the objective of Hairworks as well.

The next type of raster bitmap used in 3D graphics is the normal map. These are used to store information on what the apparent surface orientation should be for each of the multitude of small individual patches on the surface of a 3D model’s triangles which fall within the confines of the various texels of this map. The map is used to fake apparent geometric detail in the surface of the model’s triangles without actually introducing more geometry (more vertices or triangles, as tessellation does) or the associated performance hit, and also without actually distorting or displacing portions of the surface of the model’s base triangle mesh, as tessellation does on the basis of the displacement map.
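
For concreteness, here is how a single normal-map texel is conventionally decoded from 8 bit RGB into a unit surface-orientation vector (a minimal sketch; real shaders do this per sample on the GPU):

```python
import numpy as np

def decode_normal(rgb: np.ndarray) -> np.ndarray:
    """Decode one normal-map texel (8-bit RGB) into a unit vector.

    The conventional encoding remaps each component of the tangent-space
    normal from [-1, 1] into [0, 255]; the flat "straight up" normal
    (0, 0, 1) is therefore the familiar bluish (128, 128, 255).
    """
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n)

print(decode_normal(np.array([128, 128, 255])))  # ~ [0, 0, 1]
```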

The next type of raster bitmap used in 3D graphics is the ambient occlusion map. The information these maps convey is simply how exposed to, or hidden from, uniform ambient lighting – which lights the model from all directions evenly – each individual patch on the surface of each of the triangles comprising the 3D model should be. Of course, this is only a very crude and inflexible approximation of global illumination and, in and of itself, doesn’t account for the surroundings of the model and how these might change how it is lit (from each direction, in what amount). It’s effectively a pre-baked lighting/shadow map for each individual object in the game, rather than for each individual map or level, as in Source and GoldSource games.

Next up is the Metalness map. This map tells the game engine how much a given patch on the surface of a triangle of a 3D model in the game should resemble metal in the way it looks and reflects light.

Next is the opacity mask. This lets the game engine know which patches on the surfaces of triangles should be transparent and which should be opaque.

Finally, and seldom used, the emission map tells the 3D renderer what colour and intensity of light each patch on the surface of the model’s triangles should appear to emit on its own, regardless of how the model is lit.

For more information on the above, please see this article.

Now for the respective and combined sizes of these maps. These depend on the resolution of each map, on how many colour channels each map uses, and on how many bits per colour channel each of those channels uses.

These parameters depend on the specific game engine, on the specific game it’s used in or the type of game that engine is used to make, and on the specific game asset.

So what we’ll discuss further is a typical example, in which the RGB colour channels of the ‘albedo’, ‘diffuse’ or ‘base colour’ bitmap are combined with one ‘alpha’ channel for the opacity map, resulting in a 32 bit RGBA bitmap with 4 colour channels and 8 bits of colour depth per channel (4 × 8 bpcc = 32 bit bitmap). In addition to this, a 24 bit bitmap would bundle the roughness, metalness and ambient occlusion maps, allowing 8 bits per texel for each. The normal map would be given 24 bits per texel, or 32 bits in the case of a parallax texture. Finally, the displacement map, if tessellation is used, would be either 16 or 8 bits per texel.

So, all in all, the typical very worst case material would carry 104 bits’ worth of various different information per texel, or 13 bytes per texel. At a texture resolution of 2048×2048, this means 52 MB – meaning that a 128 MB L3 cache could hold two entire 2K materials with 24 MB left over for things like geometry information (vertices, vertex indices, vertex normals, vertex UV coordinates). However, not all game assets have 2K resolution maps; many will have 1024×1024 or even 512×512 materials.
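
Written out as code, the arithmetic above looks like this (the layout is just the typical example assumed in this article, not any particular engine’s packing):

```python
# Per-texel bit budget for the worst-case material described above.
MAPS_BITS = {
    "albedo + opacity (RGBA)": 32,
    "roughness + metalness + AO": 24,
    "normal (parallax variant)": 32,
    "displacement": 16,
}
bits_per_texel = sum(MAPS_BITS.values())   # 104 bits
bytes_per_texel = bits_per_texel // 8      # 13 bytes

for res in (2048, 1024, 512):
    size_mb = res * res * bytes_per_texel / 2**20
    fits = 128 // size_mb
    print(f"{res}x{res}: {size_mb:.2f} MB per material, "
          f"{int(fits)} whole material(s) fit in 128 MB")
# 2048x2048: 52.00 MB per material, 2 whole material(s) fit in 128 MB
# 1024x1024: 13.00 MB per material, 9 whole material(s) fit in 128 MB
#   512x512:  3.25 MB per material, 39 whole material(s) fit in 128 MB
```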

Which would allow the GPU, when forward rendering, to process at least two – and probably more – typical AAA game assets at once, going to VRAM only once per asset per rendered frame to fetch the entire set of texture maps comprising each model’s materials, along with the geometry itself, which is also needed for rendering. This should save a lot of the latency cost that sampling texels of the various maps comprising modern PBR materials from VRAM, or even HBM, would normally incur. Especially as most texel samples don’t hit orderly, deterministically predictable, sequential or consecutive positions in the texture maps and, consequently, in VRAM; latency is therefore incurred on virtually all texel samples, even though this is mitigated massively by massive parallelization, so that the latencies don’t queue up and therefore don’t stack or add up.

However, where this 128 MB cache will really shine and come into its own is in deferred rendering, which most modern games use and which has always been the bane and the weakness of AMD’s GCN-era GPU architectures. Forward rendering performs every stage of the render pipeline for each triangle in the scene individually, and therefore incurs a lot of redundant texture mapping, shading and lighting computational cost and processing time; deferred rendering performs each stage of the render pipeline for all triangles in the scene at once, before moving on to the next stage, and is therefore able to cut down on the amount of texturing, shading and lighting it has to do by avoiding doing it on occluded surfaces. This enables computationally weaker hardware with less bandwidth (historically, typically nVidia’s at a given price point) to compete with computationally superior hardware (historically, typically AMD’s at the same price point) while using less power, simply because less computational work needs doing.
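
To make that structural difference concrete, here is a deliberately simplified sketch of the two loop orders – the stage and triangle representations are made up for illustration, not real renderer code:

```python
def forward_render(triangles, stages):
    # Triangle-major: the whole pipeline runs for one triangle before
    # the next begins, so expensive shading work is spent even on
    # triangles that later turn out to be occluded.
    for tri in triangles:
        for stage in stages:
            stage(tri)

def deferred_render(triangles, stages):
    # Stage-major: each stage runs over ALL triangles before the next
    # stage starts. Once an early stage has resolved visibility, the
    # later texturing/shading/lighting stages can skip occluded surfaces.
    for stage in stages:
        for tri in triangles:
            stage(tri)

stages = [lambda t: print("geometry", t), lambda t: print("shade", t)]
forward_render(["A", "B"], stages)   # geometry A, shade A, geometry B, shade B
deferred_render(["A", "B"], stages)  # geometry A, geometry B, shade A, shade B
```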

The tradeoff is that deferred rendering doesn’t natively cope with semi-opaque materials as well as forward rendering does, and requires workarounds and additional steps – expensive in both computation and processing time – to render them properly. This is why godrays and volumetric lighting tanked performance on Maxwell, and even Pascal to some degree. In addition to nVidia hardware always being weaker specced than the AMD hardware it historically competed against at the same price point, or even for more money.

However, the reason the 128 MB cache matters so much for deferred rendering is that, with deferred rendering, you don’t need all of the various texture maps mentioned above, which comprise the material a triangle uses, at once in order to render it. You only need the specific map or maps required by the specific stage of the render pipeline you’re currently working your way through.

Which means that you probably don’t need to hold all 104 bits’ worth of per-texel data in the L3 cache at once. For a particular render pipeline stage you might only need the normal map in there, for example.

Which, in turn, means that you can fit even larger texture maps in there. As long as you don’t need to keep more than 64 bits’ worth of per-texel data in the L3 at once (4096 × 4096 texels × 8 bytes = 128 MB, meaning you can store either two 4K 32 bit texture maps or one 4K 64 bit texture map), you can hold entire 4K texture maps in there, uncompressed. This means that, as long as you order the triangles you’re processing such that all triangles which use a particular diffuse map, or a particular normal map, or a particular ambient occlusion map etc. are enqueued consecutively, in one contiguous sequence, you would not need to go to VRAM more than once per material, per render stage, per rendered frame. Maybe even less than once, if a particular render stage can make use of two or more of the map types listed higher up in this article at once and they all fit in the L3 cache together.
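
As a toy illustration of that ordering idea – the draw-call format and key function are completely made up, not any real engine’s API – here is a minimal sketch that checks whether a single map fits in a 128 MB cache and batches draws so all draws sharing the map a given stage needs run back to back:

```python
from itertools import groupby

def fits_in_cache(map_bytes: int, cache_bytes: int = 128 * 2**20) -> bool:
    """Could this one texture map be fully resident in a 128 MB L3?"""
    return map_bytes <= cache_bytes

# One uncompressed 4K, 32-bit map: 4096 * 4096 * 4 bytes = 64 MB,
# so two of them (or one 64-bit map) exactly fill 128 MB.
print(fits_in_cache(4096 * 4096 * 4))  # True

def order_for_stage(draws, stage_map_key):
    """Sort draw calls so all triangles sharing the map a given render
    stage needs are processed consecutively: ideally one VRAM fetch
    per map, per stage, per frame."""
    for map_id, group in groupby(sorted(draws, key=stage_map_key),
                                 key=stage_map_key):
        yield map_id, list(group)

draws = [{"mesh": "crate", "normal_map": "n1"},
         {"mesh": "door",  "normal_map": "n2"},
         {"mesh": "wall",  "normal_map": "n1"}]
for map_id, batch in order_for_stage(draws, lambda d: d["normal_map"]):
    print(map_id, [d["mesh"] for d in batch])
# n1 ['crate', 'wall']
# n2 ['door']
```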

Which, evidently, collapses the need for VRAM bandwidth when deferred rendering on RDNA2. Which, in turn, means that a 256 bit bus may be quite overkill for even the top Navi card, despite the lamentations and protestations of dishonest people, people with ulterior motives, or people who simply don’t know any better or can’t understand how a revolutionary change in GPU architecture – such as going from no L3 cache to 128 MB of it in one generation – can yield paradigm-shifting leaps in performance and changes in how rendering is done.

And speaking of L3 cache: the 3950X’s 64 MB of L3 has a bandwidth of 1 TB/s. I think it’s a safe bet to assume that RDNA2’s L3 cache will have at least that much bandwidth. And if the cache is architected in a way in which its bandwidth scales with its size, 2 TB/s of bandwidth for RDNA2’s L3 cache might not be out of the question.

And texture map caching isn’t the only mechanism by which RDNA2’s rumoured 128 MB L3 could provide significant performance uplifts.

If AMD employ tiled rendering in conjunction with texture map caching, and also cache the corresponding tiles or sections of the z-buffer and the frame buffer in the L3 when performing forward rendering, further massive gains might be possible, in addition to those yielded by caching texture maps in the L3.
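
A minimal sketch of what that could look like, assuming a hypothetical tile-based pass where each tile’s z-buffer and colour buffer stay cache-resident until the tile is finished (the triangle format and the `shade` callback are invented for illustration):

```python
import numpy as np

def render_tiled(width, height, tile, triangles, shade):
    """Toy tile-based pass: the z-buffer and colour buffer for one tile
    are small enough to live entirely in cache while every triangle
    touching the tile is processed, so depth tests and blends never
    have to go out to VRAM mid-tile."""
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Per-tile working set: would stay resident in the L3.
            zbuf = np.full((tile, tile), np.inf, dtype=np.float32)
            cbuf = np.zeros((tile, tile, 3), dtype=np.uint8)
            for tri in triangles:
                shade(tri, tx, ty, tile, zbuf, cbuf)
            frame[ty:ty + tile, tx:tx + tile] = cbuf  # one write-back
    return frame

def flat_shade(tri, tx, ty, tile, zbuf, cbuf):
    cbuf[...] = tri["colour"]  # toy stand-in for rasterize + depth test

img = render_tiled(8, 8, 4, [{"colour": (30, 90, 200)}], flat_shade)
print(img.shape)  # (8, 8, 3)
```

At a plausible tile size of, say, 256×256, a 32 bit depth tile plus a 32 bit colour tile is only 0.5 MB – a trivial fraction of a 128 MB L3.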

Once one thinks through and becomes aware of all of the possible uses of 128 MB of L3 cache, as well as the performance uplift implications of all of those ways of making use of that amount of cache, the duplicitous, dishonest, two-faced, craven concern trolling about RDNA2’s likewise rumoured narrow VRAM bus widths – and the correspondingly narrow bandwidth they would entail compared with Ampere, with iterations of AMD’s GCN architecture, and even with the initial iteration of RDNA – immediately reveals itself as based in ignorance or stupidity.

However, it’s not all sunshine and rainbows. The limited size of the L3 cache also means that RDNA2 will probably suffer a significant performance penalty when rendering with monolithic 8K texture maps, which cannot be made to fit into the L3 in their entirety and will therefore guarantee a lot of cache misses and a lot of texel sampling from VRAM no matter what you do, in either forward or deferred rendering.
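
The numbers behind that limitation, for a single uncompressed 32 bit map:

```python
bytes_per_texel = 4  # one uncompressed 32-bit RGBA map
for res, label in ((4096, "4K"), (8192, "8K")):
    size_mb = res * res * bytes_per_texel / 2**20
    print(f"one {label} 32-bit map: {size_mb:.0f} MB "
          f"({'fits' if size_mb <= 128 else 'does NOT fit'} in 128 MB)")
# one 4K 32-bit map: 64 MB (fits in 128 MB)
# one 8K 32-bit map: 256 MB (does NOT fit in 128 MB)
```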

In any case, if the rumours prove true and RDNA2 has the chungus L3 cache they say it will, we’re in for quite an upturning and upheaval of the GPU market status quo. If the chungus cache is real and it’s used intelligently, RDNA2’s rasterization rendering performance WILL be in a class of its own and this WILL be AMD’s Maxwell moment, just as Ampere increasingly appears to be nVidia’s Fury X/Vega moment. In but a few short weeks, all the green-sheep fanboys who’ve bought Ampere, like the well trained simps for Huang they are, for anything but CUDA pro use or offline rendering might be kicking themselves over their dumb purchase decision and poor impulse control. Or just descend into a fantasy world where they can pretend they didn’t buy the much more expensive, much slower product with less VRAM, so that they don’t have to deal with raging buyer’s remorse.

And as a closing observation, I would like to ask: how else could the 60% improvement in performance per watt with RDNA2, on top of the 50% with RDNA1, be accounted for or explained, other than through a switch to hardware optimized for deferred rendering with RDNA1, and to deferred, tile based rendering that’s also cache centric and cache optimized – built around a massive shared L3 cache – with RDNA2?

Just AMD casually pushing technology forward while nVidia continues milking, for all they’re worth, the clueless, dull, dimwitted and technically illiterate sheep it has trained and inducted into its Apple-like cult so well over the years.