AMD

3D RAGE

This page is about GPUs from AMD/ATI.

RDNA - ps5/xbx

2019

Architecture Overview

- RDNA Architecture | Introducing RDNA Architecture - 2019

RDNA - new AMD GPU architecture
- RDNA Architecture - 2019
- RDNA 1.0 Instruction Set Architecture Reference Guide - 2020

Performance Guidelines

RDNA Performance Guide

GCN GPU - ps4/xbo

2012 - 2018 : Unified shader (scalar/vector)

Architecture Overview

A GCN GPU contains one Graphics Command Processor (CP), 8 Asynchronous Compute Engines (ACE) and one or more Shader Engines (SE). The CP manages the graphics commands from the CPU, and the ACEs handle compute shaders. Each SE contains multiple Compute Units (CU), 1 rasterizer and 1 geometry processor (GP). The GP runs geometry shaders and tessellation. The rasterizer reads one triangle and writes out up to 16 pixels per clock.

A Compute Unit has 4 SIMD-16 units and one scalar unit.

- General purpose registers are a limited resource used by shaders. There are Scalar General Purpose Registers (sGPR) and Vector General Purpose Registers (vGPR). Both store 32-bit data, and consecutive entries can be combined to store larger data types. Each CU has an 8 KB scalar register file, split so that each SIMD gets 512 32-bit entries. These are shared among the 10 wavefronts on the SIMD; a wavefront can allocate at most 112 registers, some of which are reserved. Each SIMD contains 256 vGPRs. Each vGPR holds 64 32-bit values, for a total size of 64 KB per SIMD. vGPRs are also assigned to wavefronts, and each thread sees only its own 32-bit value in a vGPR.
  - Texture Sampler: Four SGPRs (128-bit)
  - Texture Resource: Four or Eight SGPRs (128/256-bit)
  - Buffer: Four SGPRs (128-bit)
- The Local Data Share (LDS) is memory that can be used by threads in a wavefront/work-group.
- The Global Data Share (GDS) is memory that can be used by wavefronts on all compute units.
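The register figures above can be turned into a rough occupancy estimate. The following is a minimal sketch that uses only the numbers stated in this section (256 vGPRs per SIMD, 64 lanes of 32 bits each, at most 10 wavefronts per SIMD). The function names are made up for illustration, and the model deliberately ignores register allocation granularity as well as sGPR and LDS limits, so real hardware may report lower values.

```python
# Figures from the text above; everything else is a simplifying assumption.
VGPRS_PER_SIMD = 256       # vector registers available on one SIMD
LANES_PER_VGPR = 64        # one 32-bit value per thread of a wavefront
BYTES_PER_LANE = 4         # 32 bits
MAX_WAVES_PER_SIMD = 10    # upper bound on resident wavefronts per SIMD


def vgpr_file_bytes_per_simd():
    """Size of the vector register file of one SIMD in bytes."""
    return VGPRS_PER_SIMD * LANES_PER_VGPR * BYTES_PER_LANE


def waves_per_simd(vgprs_per_thread):
    """Estimate how many wavefronts fit on one SIMD for a given vGPR count.

    Hypothetical helper: ignores allocation granularity, sGPR and LDS limits.
    """
    if vgprs_per_thread < 1:
        raise ValueError("a shader uses at least one vGPR")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // vgprs_per_thread)


if __name__ == "__main__":
    print(vgpr_file_bytes_per_simd())
    for count in (24, 32, 64, 128):
        print(count, waves_per_simd(count))
```

Under this simplified model a shader needs to stay at or below 25 vGPRs to keep all 10 wavefronts resident, and every doubling of register use beyond that roughly halves the number of wavefronts available for latency hiding.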

Links

- Advanced Shader Programming On GCN - 2017
- AMD GCN3 ISA Architecture Manual - 2016
- GCN – two ways of latency hiding and wave occupancy - 2014
- The AMD GCN Architecture - A Crash Course - 2014
- AMD Graphics Cores Next (gcn) Architecture - 2012

Performance Guidelines

Depth Buffers

- Issues with Z-fighting? Use the D32_FLOAT_S8X24_UINT format, which has no performance or memory impact compared to D24S8. Notes: Depth and stencil are stored separately on GCN architectures. D32_FLOAT_S8X24_UINT is therefore not a 64-bit format, as it might appear to be. There is no performance or memory footprint penalty from using a 32-bit depth buffer compared to a 24-bit one.
- Render your skybox last and your first-person geometry first. Rendering close-up geometry first is a good way to ensure the depth buffer is primed with small values to maximize the potential for fragment rejection via Hierarchical Z and Early-Z testing.
- Using D16 shadow maps provides a modest performance boost and a large memory saving. D16 runs slightly faster than other depth-only formats because of reduced memory bandwidth. In most cases, pushing the front clip plane out as far as possible results in a much better depth distribution and therefore avoids precision issues.
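The effect of the front clip plane on depth distribution, and the memory saving of D16, can be checked with a little arithmetic. This is a sketch only: it assumes a standard, non-reversed D3D-style perspective projection that maps view-space distance `near` to depth 0 and `far` to depth 1, and the function names and plane distances are chosen purely for illustration.

```python
def depth_value(z, near, far):
    """Depth in [0, 1] for view-space distance z (assumed non-reversed projection)."""
    return (far / (far - near)) * (1.0 - near / z)


def distance_at_half_depth(near, far):
    """View-space distance at which half of the depth range has been used up.

    Obtained by solving depth_value(z, near, far) == 0.5 for z.
    """
    return 2.0 * far * near / (far + near)


def shadow_map_bytes(width, height, bytes_per_texel):
    """Footprint of a single depth-only surface, ignoring padding and metadata."""
    return width * height * bytes_per_texel


if __name__ == "__main__":
    for near in (0.1, 1.0):
        print(near, distance_at_half_depth(near, 1000.0))
    print(shadow_map_bytes(2048, 2048, 2), shadow_map_bytes(2048, 2048, 4))
```

With a far plane at 1000 units, a near plane at 0.1 spends half of the depth range on roughly the first 0.2 units, while a near plane at 1.0 spreads that half over roughly the first 2 units. A 2048x2048 shadow map takes 8 MiB at 16 bits per texel versus 16 MiB at 32 bits.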

Textures

- MIPMapping is underrated - don't forget to use it on displacement maps and volume textures too. Notes: The use of MIPMapping is essential to avoid aliasing issues and improve texture cache performance. It should especially be used on volume textures as they are more likely to have poor cache hit rates. Non-color data such as normal maps and displacement maps should also use MIPMapping.
- Trilinear is up to 2x the cost of bilinear. Bilinear on 3D textures is 2x the cost of 2D. Anisotropic filtering cost depends on the number of taps.
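To put a number on what a full MIP chain costs in memory, the sketch below counts levels and texels for 2D and volume textures. It assumes the usual convention that each level halves every dimension (rounding down, with a minimum of 1) until a 1x1x1 level is reached; the helper names are hypothetical.

```python
def mip_level_count(width, height, depth=1):
    """Number of levels in a full chain: floor(log2(largest dimension)) + 1."""
    return max(width, height, depth).bit_length()


def mip_chain_texels(width, height, depth=1):
    """Total texel count of the full chain, including the base level."""
    total = 0
    while True:
        total += width * height * depth
        if width == 1 and height == 1 and depth == 1:
            return total
        # Each level halves every dimension, never going below one texel.
        width = max(1, width // 2)
        height = max(1, height // 2)
        depth = max(1, depth // 2)


if __name__ == "__main__":
    base_2d = 256 * 256
    base_3d = 64 * 64 * 64
    print(mip_level_count(256, 256), mip_chain_texels(256, 256) / base_2d)
    print(mip_level_count(64, 64, 64), mip_chain_texels(64, 64, 64) / base_3d)
```

A full chain adds about one third to a 2D texture and only about one seventh to a volume texture, which is a small price given that volume textures are the case the note singles out for poor cache hit rates.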

Shaders

- Some shader instructions are costly; pre-compute constants and store them in a Constant Buffer (CB). Examples: SIN, COS, RCP, RSQ, integer MUL and DIV.
- Use bool instead of int or float.
- Use abs() on inputs; it is applied as a free input modifier.
- Use saturate() on outputs; it is applied as a free output modifier.
- clip, discard, alpha-to-mask and writing to oMask or oDepth disable Early-Z when depth writes are on.
- GetDimensions() is a TEX instruction; prefer storing texture dimensions in a Constant Buffer if TEX-bound.

Render Targets

- Always clear MSAA render targets before rendering

Shader Input/Output

- Minimize shader inputs and outputs to minimize IO bandwidth
- Limit Vertex and Domain Shader output size to 4 float4/int4 attributes. Outputs larger than 4 float4/int4 have increased parameter cache storage requirements which reduce wave occupancy. As an added bonus using fewer outputs will reduce PS interpolation cost.
- Use the smallest Input Layout necessary for a given VS; this is especially important for depth-only rendering. Vertex structures often contain a variety of inputs but only a small selection of those are required for depth-only rendering (position and texture coordinates for alpha-tested geometry). Only binding the required inputs in a separate vertex buffer will result in better cache utilization and therefore better performance.
- Pack Vertex Shader outputs into float4 vectors to optimize attribute storage, especially if this allows four or fewer float4 vector outputs to be used.
- Passing interpolated screenpos can be better than declaring SV_POSITION in the pixel shader, especially if the PS is short. Declaring SV_POSITION in the pixel shader will not be as efficient as passing screen coordinates from the previous shader stage because its use is hard-coded to fixed-function hardware that is also used for other purposes. If a pixel shader needs access to fragment position it is recommended to pass it via texture coordinates instead.
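The packing advice above can be illustrated with a small offline helper that groups scalar and vector outputs into float4 slots. This is a sketch using a first-fit-decreasing heuristic, which is one reasonable choice rather than anything the driver or compiler is known to do; the attribute names and sizes are invented for the example.

```python
def pack_into_float4(attributes):
    """Group attributes (name -> component count, 1..4) into float4 slots.

    Uses first-fit decreasing: largest attributes first, each placed in the
    first slot that still has room. Returns a list of slots, each a list of names.
    """
    slots = []  # each entry is [remaining_components, [attribute names]]
    for name, size in sorted(attributes.items(), key=lambda item: -item[1]):
        if not 1 <= size <= 4:
            raise ValueError("component count must be between 1 and 4")
        for slot in slots:
            if slot[0] >= size:
                slot[0] -= size
                slot[1].append(name)
                break
        else:
            # No existing slot has room, so open a new float4.
            slots.append([4 - size, [name]])
    return [names for _, names in slots]


if __name__ == "__main__":
    # Hypothetical vertex shader outputs, excluding position.
    outputs = {"normal": 3, "uv0": 2, "tangent": 3, "fog": 1, "uv1": 2, "ao": 1}
    for index, names in enumerate(pack_into_float4(outputs)):
        print(index, names)
```

Declared one attribute per output, this example would occupy six slots; packed, it fits in three, which keeps it under the four-attribute limit mentioned above.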

Rasterizer

- Tiny triangles dramatically reduce the efficiency of the rasterizer, as it processes one triangle at a time. Over-tessellation can create tiny triangles and also impacts ROPs and AA, which then consume more bandwidth.
- Avoid over-tessellating geometry that produces small triangles in screen space; in general avoid tiny triangles. Notes: The smallest work unit in modern GPUs is the pixel quad (2x2 pixels). Small triangles have efficiency problems because fitting 2x2 pixel quads to cover their area is very likely to produce poor quad occupancy. Poor quad occupancy results in a waste of GPU resources and should therefore be avoided by adopting suitable LOD systems for geometry, especially when tessellation is used.
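Quad occupancy can be estimated offline with a toy rasterizer. The sketch below tests pixel centers against the three edge functions of a screen-space triangle, then compares the number of covered pixels with the number of pixels in the 2x2 quads that were touched. It is a simplified model and not how the hardware rasterizer is implemented: there is no fill rule (centers exactly on an edge count as covered), no MSAA, and the function names are hypothetical.

```python
import math


def edge(a, b, p):
    """Signed area term: positive when p is on one side of edge a->b, negative on the other."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def quad_occupancy(v0, v1, v2):
    """Return (covered_pixels, touched_quads, occupancy) for a screen-space triangle."""
    min_x = math.floor(min(v0[0], v1[0], v2[0]))
    max_x = math.ceil(max(v0[0], v1[0], v2[0]))
    min_y = math.floor(min(v0[1], v1[1], v2[1]))
    max_y = math.ceil(max(v0[1], v1[1], v2[1]))
    covered = 0
    quads = set()
    for y in range(min_y, max_y):
        for x in range(min_x, max_x):
            center = (x + 0.5, y + 0.5)
            w0 = edge(v1, v2, center)
            w1 = edge(v2, v0, center)
            w2 = edge(v0, v1, center)
            # Accept either winding order.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                covered += 1
                quads.add((x // 2, y // 2))
    occupancy = covered / (4 * len(quads)) if quads else 0.0
    return covered, len(quads), occupancy


if __name__ == "__main__":
    print(quad_occupancy((0.0, 0.0), (8.2, 0.0), (0.0, 8.2)))   # medium triangle
    print(quad_occupancy((0.2, 0.2), (0.9, 0.3), (0.4, 0.9)))   # sub-pixel triangle
```

In this model the medium triangle covers 36 pixels across 10 quads (90% occupancy), while the sub-pixel triangle covers a single pixel yet still occupies a whole quad (25% occupancy).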

DirectX

- A dedicated thread solely responsible for making D3D calls is usually the best way to drive the API.
- The D3DXSHADER_IEEE_STRICTNESS shader compiler flag is likely to produce longer shader code. This flag enforces additional precision on certain ALU operations, leading to more/more costly instructions being used.
- Use D3D11_USAGE_IMMUTABLE on read-only resources. A surprising number of games don’t. The more information is provided to the drivers and the runtime the better. Games and applications often include resources that are never updated, and those should be created with the IMMUTABLE flag to optimize memory management. For example, skybox and HUD textures are likely to qualify for this flag.
- Avoid calling Map() on DYNAMIC textures as this may require a conversion from tiled to linear memory.
- Avoid unnecessary DISCARD when Map()ping resources; some apps still do this at least once a frame. There is no need to DISCARD a buffer once a frame; instead DYNAMIC buffers used with NO_OVERWRITE updates should only be DISCARD-ed when full.
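The DISCARD-only-when-full pattern from the last bullet can be modelled as a small ring allocator. The sketch below is a Python stand-in for the bookkeeping an application would do around Map(); it makes no D3D calls. The class and method names are hypothetical, and the returned strings simply mirror the D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD map types.

```python
class DynamicBufferRing:
    """Tracks the write cursor of one DYNAMIC buffer that is filled front to back."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0  # first byte not yet handed out

    def allocate(self, size):
        """Return (start_offset, map_mode) for the next update of `size` bytes."""
        if size > self.capacity:
            raise ValueError("update does not fit into the buffer at all")
        if self.offset + size > self.capacity:
            # Buffer is full: ask for fresh memory and restart at the front.
            self.offset = size
            return 0, "WRITE_DISCARD"
        # Still room: append behind data the GPU may be reading.
        start = self.offset
        self.offset += size
        return start, "WRITE_NO_OVERWRITE"


if __name__ == "__main__":
    ring = DynamicBufferRing(capacity=100)
    for _ in range(4):
        print(ring.allocate(40))
```

With a 100-byte buffer and 40-byte updates, only every third update maps with DISCARD; the others append with NO_OVERWRITE regardless of frame boundaries.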

* Create shaders before textures to give the driver enough time to convert the D3D ASM to GCN ASM. Notes: GCN drivers defer compilation of shaders onto separate threads. Creating shaders early on during the loading process ensures they have enough time to finish compiling before the game starts. To ensure all shaders have finished compiling, always warm the shader cache by binding all needed shaders in an offscreen rendering operation prior to rendering the game level.

* Do think about GPR utilization & LDS usage (impacts max # wavefronts)

* Don't forget to optimize geometry for index locality and sequential read access - including procedural geometry. Notes: Index re-use is important to minimize Vertex Shader execution cost, especially in depth-only rendering situations where the GPU front-end is more likely to be a bottleneck.

* Avoid indexing into arrays of shader variables - this has a high performance impact. If indexing cannot be resolved at compile time then indexing into arrays of shader variables will cause these to be stored in Vector General Purpose Registers or scratch memory.

* Avoid sparse shader resource slot assignments, e.g. binding resource slot #0 and #127 is a bad idea

* Ensure proxy and predicated geometry are spaced by a few draws when using predicated rendering

* Fetch indirections increase execution latency; keep them under control, especially for VS and DS stages. Notes: A “fetch indirection” refers to the process of fetching memory data whose address itself depends on a previous memory fetch operation. Such memory fetches cannot be grouped together since one depends on the other. Because of the latency involved in fetching memory, such dependencies increase total execution latency.

* Dynamic indexing into a Constant Buffer counts as fetch indirection and should be avoided. Notes: If a calculated index is different across all threads of a wavefront then the fetch of Constant Buffer data using such an index is akin to a memory fetch operation.

* With cascaded shadow maps use area culling to exclude geometry already rendered in finer shadow cascades.

* Avoid heavy switching between compute and rendering jobs. Jobs of the same type should be done consecutively. Notes: GCN drivers have to perform surface synchronization tasks when switching between compute and rendering tasks. Heavy back-and-forth switching may therefore increase synchronization overhead and reduce performance.

Optimizing GPU occupancy and resource usage with large thread groups

GCN Performance Tweets, AMD Developer Relations - 2013

Older GPU Architecture

TeraScale (Xenos/R600 - ) - 2008 - 2011 : Unified shader (VLIW)

VLIW-5 Element Very-Long-Instruction-Word (XYZWT).

Optimized for graphics workloads.

Ideal for 4-element vector and 4x4 matrix operations. Vector/vector math in single instruction.

T is the transcendental unit.

16 SIMDs x (1 VLIW inst x 4 ALU ops)

VLIW-4 Element Very-Long-Instruction-Word (XYZW) removed the T-unit.