KhronosGroup · gpx1000 · Aug 4, 2025 · Oct 7, 2025 · bacTlink · Oct 24, 2025
diff --git a/README.adoc b/README.adoc
@@ -66,6 +66,8 @@ The Vulkan Guide can be built as a single page using `asciidoctor guide.adoc`
 
 == xref:{chapters}ide.adoc[Development Environments & IDEs]
 
+== xref:{chapters}tile_based_rendering_best_practices.adoc[Tile Based Rendering (TBR) Best Practices]
+
 == xref:{chapters}vulkan_profiles.adoc[Vulkan Profiles]
 
 == xref:{chapters}loader.adoc[Loader]

diff --git a/antora/modules/ROOT/nav.adoc b/antora/modules/ROOT/nav.adoc
@@ -21,6 +21,7 @@
 ** xref:{chapters}validation_overview.adoc[]
 ** xref:{chapters}decoder_ring.adoc[]
 * Using Vulkan
+** xref:{chapters}tile_based_rendering_best_practices.adoc[]
 ** xref:{chapters}loader.adoc[]
 ** xref:{chapters}layers.adoc[]
 ** xref:{chapters}querying_extensions_features.adoc[]

diff --git a/chapters/tile_based_rendering_best_practices.adoc b/chapters/tile_based_rendering_best_practices.adoc
@@ -0,0 +1,306 @@
+// Copyright 2025 Holochip, Inc.
+// SPDX-License-Identifier: CC-BY-4.0
+
+// Required for both single-page and combined guide xrefs to work
+ifndef::chapters[:chapters:]
+ifndef::images[:images: images/]
+
+[[TileBasedRenderingBestPractices]]
+= Tile Based Rendering (TBR) Best Practices
+
+Tile Based Rendering (TBR) is a rendering architecture commonly found in mobile GPUs and some desktop GPUs that divides the screen into small rectangular tiles and renders each tile separately. This chapter covers TBR considerations, optimization techniques, and comparisons with Immediate Mode Rendering (IMR) architectures.
+
+Understanding the differences between TBR and IMR architectures helps when optimizing Vulkan applications across GPU architectures, tuning performance on mobile devices, and making architectural decisions for cross-platform applications.
+
+[[mobile-gpu-architectures]]
+== Mobile GPU Architectures
+
+Understanding the underlying hardware architecture is fundamental to optimizing for TBR systems. Mobile GPUs have evolved significantly, with each vendor implementing unique approaches to tile-based rendering that affect optimization strategies.
+
+[[tbr-hardware-implementations]]
+=== TBR Hardware Implementations
+
+Modern mobile GPUs implement TBR with varying degrees of sophistication and different architectural choices that directly impact application performance.
+
+**TBR Architecture Characteristics:**
+
+- **Tile size**: Chosen by the implementation and not queryable in core Vulkan. Do not assume a specific size. Some vendor extensions (for example `VK_QCOM_tile_shading`) expose tile parameters, but are specific to Qualcomm devices.
+- **Tile memory**: Implementations use on‑chip tile/local memory internally; its size/layout are not exposed to applications.
+- **MSAA**: Resolves can be efficient on tilers because they can resolve from tile memory. Choose sample counts based on image quality and device testing.
+- **Attachment usage**: Prefer `VK_ATTACHMENT_STORE_OP_DONT_CARE` for transient/intermediate attachments, and `VK_ATTACHMENT_LOAD_OP_CLEAR` when you overwrite contents.
+- **Render targets**: The number and formats of attachments affect performance; measure on target devices rather than relying on fixed limits.
+
+**Note on internal rendering modes:**
+Some implementations may internally select different rendering paths. Applications do not control this in Vulkan and should write rendering code that is agnostic to the underlying TBR/IMR implementation.
+
+**Advanced TBR considerations:**
+
+- Use subpasses and `VK_DEPENDENCY_BY_REGION_BIT` to enable local data reuse where beneficial; always measure on target devices.
+- Prefer smaller attachment formats where acceptable and avoid unnecessary attachments to reduce bandwidth.
+- When using MSAA, resolve within the render pass (via resolve attachments) so that multisampled data does not need to be stored to external memory.
+- Focus on render pass load/store/discard patterns to minimize external memory traffic.
+
+[[tbr-optimization-considerations]]
+=== TBR Optimization Considerations
+
+Modern TBR architectures share common optimization principles that can be applied generically across different hardware implementations:
+
+**Core TBR Optimization Principles:**
+
+- Tile-local memory is managed by the implementation; applications influence external memory traffic primarily via render pass loadOp/storeOp/discard, resolves, and using transient attachments.
+- Early fragment rejection and by-region dependencies can reduce work; ensure correct subpass dependencies and pipeline barriers to enable pipelining.
+- Bandwidth optimization is critical on mobile; minimize attachment traffic and unnecessary clears/stores.
+
+**Bandwidth Optimization Strategies:**
+
+- **Attachment configuration**: Final attachments use `VK_ATTACHMENT_STORE_OP_STORE`; intermediate attachments use `VK_ATTACHMENT_STORE_OP_DONT_CARE` when you do not need the results.
+- **Load operations**: Use `VK_ATTACHMENT_LOAD_OP_CLEAR` for new content; `VK_ATTACHMENT_LOAD_OP_DONT_CARE` for intermediate results you overwrite.
+- **MSAA**: Resolves can be efficient on many tilers; choose sample counts based on quality/performance testing on target hardware.
+- **Pipelining**: Use appropriate pipeline barriers and subpass dependencies to enable on-chip pipelining and reduce external memory traffic.
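+
+To get a feel for what these load/store choices are worth, the following is a minimal back-of-the-envelope sketch. It assumes a deliberately simple model: a load op of `LOAD` costs one full read of the attachment, a store op of `STORE` costs one full write, and everything else is free. The `LoadOp`, `StoreOp`, `AttachmentTraffic`, and `estimateExternalBytes` names are hypothetical helpers for illustration, not Vulkan API, and real hardware (compression, caches, partial tiles) will behave differently.
+
```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the Vulkan load/store ops; these are not Vulkan types.
enum class LoadOp { Load, Clear, DontCare };
enum class StoreOp { Store, DontCare };

// Hypothetical description of one attachment in one render pass.
struct AttachmentTraffic {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t bytesPerTexel; // e.g. 4 for an 8-bit RGBA or 32-bit depth format
    std::uint32_t samples;       // 1 for single-sampled images
    LoadOp loadOp;
    StoreOp storeOp;
};

// Rough per-pass estimate in bytes: one full read if the previous contents are
// loaded, one full write if the results are stored. Clear and DontCare are
// modeled as causing no external memory traffic.
inline std::uint64_t estimateExternalBytes(const AttachmentTraffic& a) {
    assert(a.samples >= 1 && a.bytesPerTexel >= 1);
    const std::uint64_t size =
        std::uint64_t{a.width} * a.height * a.bytesPerTexel * a.samples;
    const std::uint64_t reads = (a.loadOp == LoadOp::Load) ? size : 0;
    const std::uint64_t writes = (a.storeOp == StoreOp::Store) ? size : 0;
    return reads + writes;
}
```
+
+Under this model, a 1920x1080 attachment with 4 bytes per texel that is loaded and stored moves 16,588,800 bytes per pass, while the same attachment with a clear load op and a don't-care store op moves none. Storing a 4x multisampled version of that attachment instead of a resolved image would account for 33,177,600 bytes of writes. Treat these as order-of-magnitude figures only and confirm with profiling on target devices.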
+
+**TBR-friendly patterns:**
+
+- Use subpasses with `VK_DEPENDENCY_BY_REGION_BIT` to keep intermediate results on-chip where supported.
+- Prefer `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT` for temporary attachments and lazily allocated memory when supported.
+
+**Tile memory–friendly strategies:**
+
+- Use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for attachments you do not need after the pass (e.g., depth-stencil in many cases).
+- Prefer transient attachments (`VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT`) backed by lazily allocated memory (`VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT`) when the implementation exposes such a memory type.
+- Resolve MSAA attachments in the same pass where applicable to avoid exposing multisampled images to external memory.
+- Prefer smaller bit-depth formats where acceptable to reduce bandwidth.
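+
+Choosing memory for a transient attachment usually comes down to a small bitmask search over the device's memory types. The sketch below shows one possible way to do it, assuming the application has already copied each memory type's `propertyFlags` into a vector. The function name `pickTransientMemoryType`, the `kNoMemoryType` sentinel, and the device-local fallback policy are assumptions for illustration. The two bit constants mirror the values of `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` and `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT` so that the example compiles without the Vulkan headers.
+
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirror VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT so the sketch builds without vulkan.h.
constexpr std::uint32_t kDeviceLocalBit = 0x00000001;
constexpr std::uint32_t kLazilyAllocatedBit = 0x00000010;

// Hypothetical sentinel returned when no acceptable memory type exists.
constexpr std::uint32_t kNoMemoryType = UINT32_MAX;

// typeFlags[i] holds the propertyFlags of memory type i
// (VkPhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags).
// allowedTypeBits corresponds to VkMemoryRequirements::memoryTypeBits
// queried for the transient image.
inline std::uint32_t pickTransientMemoryType(
    const std::vector<std::uint32_t>& typeFlags, std::uint32_t allowedTypeBits) {
    assert(typeFlags.size() <= 32); // Vulkan reports at most 32 memory types
    std::uint32_t deviceLocalFallback = kNoMemoryType;
    for (std::uint32_t i = 0; i < typeFlags.size(); ++i) {
        if ((allowedTypeBits & (1u << i)) == 0) {
            continue; // this memory type cannot back the image
        }
        if (typeFlags[i] & kLazilyAllocatedBit) {
            return i; // preferred: memory may never need to be physically committed
        }
        if ((typeFlags[i] & kDeviceLocalBit) && deviceLocalFallback == kNoMemoryType) {
            deviceLocalFallback = i; // remember the first device-local type
        }
    }
    return deviceLocalFallback;
}
```
+
+Lazily allocated memory may only be bound to images created with `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT`, so a helper like this is only meaningful for such images. Many desktop implementations expose no lazily allocated memory type, which is why the sketch falls back to ordinary device-local memory.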
+
+[[tile-sizes-and-memory-constraints]]
+=== Tile Sizes and Memory Constraints
+
+Tile size and any on-chip tile memory characteristics are implementation-defined and not exposed by core Vulkan. Applications should not try to infer or configure tile size.
+
+Practical guidance:
+
+- Design render passes/subpasses and attachment usage assuming the implementation will manage tiles internally.
+- Minimize external memory traffic via loadOp/storeOp choices, resolves, and transient attachments.
+- Profile on target devices; do not rely on fixed tile-size assumptions or device RAM heuristics.
+
+[[tbr-vs-imr-detailed-analysis]]
+== TBR vs IMR Detailed Analysis
+
+This section provides a high-level comparison between Tile-Based Rendering and Immediate Mode Rendering architectures, focused on implications for Vulkan applications.
+
+[[tbr-architecture-deep-dive]]
+=== TBR Architecture Overview
+
+On tile-based renderers, the driver/GPU partitions work by screen regions (tiles) and executes per‑tile rendering using on‑chip tile/local memory. Intermediate results can remain on‑chip until the tile is resolved to external memory. The exact tiling scheme and memory management are implementation‑defined and opaque to applications.
+
+Implications for applications:
+
+- You cannot control tile size or tiling policy in core Vulkan; write code that does not assume a particular tile size or layout.
+- Minimize external memory traffic by using appropriate attachment load/store ops, resolves, and transient attachments.
+- Where beneficial and supported, organize work into subpasses with `VK_DEPENDENCY_BY_REGION_BIT` to allow local reuse of data without round‑tripping to external memory.
+
+[[imr-architecture-analysis]]
+=== IMR Architecture Analysis
+
+Immediate Mode Rendering does not perform screen-space binning prior to rasterization. Fragment results are typically written to external memory as they are produced.
+
+Key characteristics (high-level):
+
+- No explicit on-chip tile memory model exposed to applications.
+- Overdraw tends to generate more external memory traffic than on tilers; minimizing overdraw is important.
+- Applications should rely on standard Vulkan techniques (early depth/stencil, appropriate load/store ops, and subpasses where helpful) and profile on target devices.
+
+[[vulkan-extensions-comprehensive-guide]]
+== Vulkan Extensions Comprehensive Guide
+
+Several Vulkan extensions provide optimizations and capabilities relevant to TBR architectures. This section describes what each one offers and which applications may benefit from it.
+
+[[vk-khr-dynamic-rendering-local-read]]
+=== VK_KHR_dynamic_rendering_local_read
+
+Provides input-attachment style local reads from color, depth, and stencil attachments when using dynamic rendering, without needing subpasses or render pass objects.
+
+Key points:
+
+- Availability: Promoted to Vulkan 1.4; available as `VK_KHR_dynamic_rendering_local_read` on older versions. Requires `VkPhysicalDeviceDynamicRenderingLocalReadFeaturesKHR::dynamicRenderingLocalRead = VK_TRUE` at device creation.
+- What it enables: Fragment shaders can read the value produced for the current pixel/sample from attachments within the same dynamic rendering instance. This mirrors subpass input attachments, but for dynamic rendering.
+- Typical uses: Porting subpass-input workflows to dynamic rendering; reading the current pixel from a previous attachment write in the same pass (e.g., order-dependent blending logic per-fragment). Benefits are workload- and implementation-dependent; always profile on target devices.
+- Not a general feedback loop: This is not neighborhood sampling or arbitrary sampling of attachments, and not a cross-draw feedback loop for textures. For neighborhood filters or post-processing, use other techniques (e.g., separate passes).
+
+API usage outline:
+
+* Enable the feature at device creation
+
+[source,cpp]
+----
+VkPhysicalDeviceDynamicRenderingLocalReadFeaturesKHR localRead{};
+localRead.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DYNAMIC_RENDERING_LOCAL_READ_FEATURES_KHR;
+localRead.dynamicRenderingLocalRead = VK_TRUE;
+// Chain into pNext of VkDeviceCreateInfo (or VkPhysicalDeviceFeatures2 path)
+----
+
+* Specify attachment formats in the graphics pipeline (dynamic rendering)
+
+[source,cpp]
+----
+VkPipelineRenderingCreateInfo pipelineRendering{};
+pipelineRendering.sType = VK_STRUCTURE_TYPE_PIPELINE_RENDERING_CREATE_INFO;
+pipelineRendering.colorAttachmentCount = colorFormatCount;
+pipelineRendering.pColorAttachmentFormats = colorFormats;
+pipelineRendering.depthAttachmentFormat = depthFormat;   // optional
+pipelineRendering.stencilAttachmentFormat = stencilFormat; // optional
+// Chain into VkGraphicsPipelineCreateInfo::pNext
+----
+
+* Map attachments to locations and input indices (dynamic state)
+
+[source,cpp]
+----
+// Remap color attachments to the fragment shader output locations that write to them
+VkRenderingAttachmentLocationInfoKHR locInfo{};
+locInfo.sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_LOCATION_INFO_KHR;
+locInfo.colorAttachmentCount = colorCount;
+locInfo.pColorAttachmentLocations = colorLocations; // e.g., {0, 1, ...}
+
+// Map input_attachment_index -> attachment
+VkRenderingInputAttachmentIndexInfoKHR indexInfo{};
+indexInfo.sType = VK_STRUCTURE_TYPE_RENDERING_INPUT_ATTACHMENT_INDEX_INFO_KHR;
+indexInfo.colorAttachmentCount = colorCount;
+indexInfo.pColorAttachmentInputIndices = inputIndices; // e.g., {0, 1, ...}
+
+vkCmdSetRenderingAttachmentLocationsKHR(cmd, &locInfo);
+vkCmdSetRenderingInputAttachmentIndicesKHR(cmd, &indexInfo);
+----
+
+* Use input attachments in the fragment shader (current pixel only)
+
+[source,glsl]
+----
+#version 450
+layout(input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput inColor;
+
+void main() {
+    vec4 c = subpassLoad(inColor); // reads current pixel from the mapped color attachment
+    // ... use c
+}
+----
+
+Synchronization and hazards:
+
+- Local reads in the same rendering instance follow rasterization-order rules similar to subpass input; no extra barriers are needed within a single draw for the same fragment.
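+- Attachments that are read locally need to be in `VK_IMAGE_LAYOUT_RENDERING_LOCAL_READ_KHR` (or `VK_IMAGE_LAYOUT_GENERAL`) during the rendering instance.
+- For producer/consumer patterns across draws within the same rendering instance, synchronize attachment writes before reads using a by-region dependency, for example with `vkCmdPipelineBarrier2`: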
+
+[source,cpp]
+----
+VkMemoryBarrier2 barrier{ VK_STRUCTURE_TYPE_MEMORY_BARRIER_2 };
+barrier.srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
+barrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT;
+barrier.dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT;
+barrier.dstAccessMask = VK_ACCESS_2_INPUT_ATTACHMENT_READ_BIT;
+
+VkDependencyInfo dep{ VK_STRUCTURE_TYPE_DEPENDENCY_INFO };
+dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT; // prefer region-local sync on tilers
+dep.memoryBarrierCount = 1;
+dep.pMemoryBarriers = &barrier;
+
+vkCmdPipelineBarrier2(cmd, &dep);
+----
+
+TBR relevance:
+
+- On many tilers, local reads can be serviced from on-chip tile memory, avoiding external memory round-trips. This can reduce bandwidth versus sampling from images written in earlier passes. Actual gains are implementation- and workload-dependent; profile on target devices.
+
+Specification:
+
+- VK_KHR_dynamic_rendering_local_read: https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_dynamic_rendering_local_read.html[Extension/man page]
+
+[[vk-ext-shader-tile-image]]
+=== VK_EXT_shader_tile_image
+
+This extension allows a fragment shader to read the value of the current pixel from an attachment in the tile.
+
+Note: Access is limited to the current pixel; this extension is not suitable for neighborhood filters (e.g., bloom, FXAA, SSR) that require reading adjacent pixels.
+
+[source,glsl]
+----
+#version 450
+#extension GL_EXT_shader_tile_image : require
+
+// Color attachment 0, readable at the current pixel from the fragment shader
+layout(location = 0) tileImageEXT highp attachmentEXT colorTile;
+
+layout(location = 0) out highp vec4 fragColor;
+
+void main() {
+    // Reads the current pixel only; on tilers this can be served from tile memory
+    highp vec4 tileColor = colorAttachmentReadEXT(colorTile);
+
+    // Example per-pixel operation: invert the existing color
+    fragColor = vec4(1.0 - tileColor.rgb, tileColor.a);
+}
+}
+----
+
+**Notes:**
+
+- Avoids round-tripping through external memory for the current pixel value.
+- Scope is limited to the current pixel; broader post-processing still requires other techniques.
+
+[[performance-considerations]]
+== Performance Considerations
+
+[[memory-bandwidth]]
+=== Memory Bandwidth
+
+TBR architectures excel when external memory bandwidth is minimized:
+
+**Optimization Strategies:**
+
+- Use appropriate load/store operations for attachments
+- Minimize attachment resolution and bit depth when possible
+- Keep intermediate results on-chip (subpasses or local reads) instead of writing them to external memory
+
+[[overdraw-impact]]
+=== Overdraw Impact
+
+Tilers can mitigate some external memory cost of overdraw because many fragments can be resolved in on‑chip memory before writing out. Overdraw still incurs shader work and can impact bandwidth. Prefer techniques that reduce overdraw (e.g., front‑to‑back rendering, effective early depth/stencil) and profile on target devices.
+
+[[multisampling-considerations]]
+=== Multisampling Considerations
+
+On many tilers, resolving MSAA attachments can be efficient because the implementation may resolve from on‑chip memory. Choose sample counts and resolve strategies based on image‑quality goals and profiling results on target hardware.
+
+[[best-practices-summary]]
+== Best Practices Summary
+
+**For TBR Optimization:**
+
+1. **Minimize External Memory Traffic**
+   - Use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for temporary data
+   - Prefer `VK_ATTACHMENT_LOAD_OP_CLEAR` over loading existing data
+   - Keep intermediate results in tile memory using subpasses
+
+2. **Leverage TBR-relevant Extensions**
+   - Use `VK_EXT_shader_tile_image` for direct access to the current pixel value in the tile
+   - Consider `VK_KHR_dynamic_rendering_local_read` where supported; evaluate benefits by profiling
+
+3. **Optimize Render Pass Design**
+   - Use subpasses instead of multiple render passes
+   - Apply `VK_DEPENDENCY_BY_REGION_BIT` for tile-local dependencies
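+   - Keep the per-pixel attachment footprint small (fewer attachments, smaller formats), since tile memory is limited and its size is not exposed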
+
+4. **Validate on Target Devices**
+   - Always profile; benefits are workload- and implementation-dependent.
+
+**For Cross-Platform Compatibility:**
+
+- Profile on both TBR and IMR architectures
+- Select architecture-specific optimizations at runtime from queried features and extensions; reserve conditional compilation for platform-level differences
+- Implement fallback paths for unsupported extensions
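+
+A fallback path for the current-pixel read extensions discussed in this chapter could be selected along the following lines. This is only a sketch: `TileReadCaps`, `CurrentPixelReadPath`, `selectReadPath`, and `pathName` are hypothetical names, the struct is assumed to be filled in by the application after querying device features, and the preference order is an arbitrary choice that should be validated by profiling.
+
```cpp
#include <cassert>

// Hypothetical summary of queried device features. The field names echo the
// corresponding Vulkan feature members, but this struct is not part of the API.
struct TileReadCaps {
    bool dynamicRenderingLocalRead;      // VK_KHR_dynamic_rendering_local_read
    bool shaderTileImageColorReadAccess; // VK_EXT_shader_tile_image
};

// Ways an application might read back the current pixel's color.
enum class CurrentPixelReadPath {
    LocalRead,       // input attachments within a dynamic rendering instance
    ShaderTileImage, // tile image reads in the fragment shader
    SeparatePass     // end the pass and sample the image in a following pass
};

// Pick the first supported path in an assumed order of preference.
inline CurrentPixelReadPath selectReadPath(const TileReadCaps& caps) {
    if (caps.dynamicRenderingLocalRead) {
        return CurrentPixelReadPath::LocalRead;
    }
    if (caps.shaderTileImageColorReadAccess) {
        return CurrentPixelReadPath::ShaderTileImage;
    }
    return CurrentPixelReadPath::SeparatePass; // always available
}

// Human-readable name, e.g. for logging which path was chosen at startup.
inline const char* pathName(CurrentPixelReadPath path) {
    switch (path) {
        case CurrentPixelReadPath::LocalRead:       return "local read";
        case CurrentPixelReadPath::ShaderTileImage: return "shader tile image";
        case CurrentPixelReadPath::SeparatePass:    return "separate pass";
    }
    assert(false && "unhandled CurrentPixelReadPath");
    return "";
}
```
+
+Making the decision once at device creation keeps the rest of the renderer free of per-draw feature checks, and the separate-pass path ensures the application still works where neither extension is available.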
+
+[[additional-resources]]
+== Additional Resources
+
+**GPU Vendor Documentation and Performance Guides:**
+
+* **ARM Mali GPU Best Practices Guide**: https://developer.arm.com/documentation/101897/latest/[Comprehensive optimization strategies for Mali TBR architecture]
+* **ARM Mali GPU Application Developer Best Practices**: https://developer.arm.com/documentation/102662/latest/[Detailed bandwidth optimization and power consumption analysis]
+* **Imagination PowerVR Architecture Guide**: https://docs.imgtec.com/starter-guides/powervr-architecture/html/index.html[Tile-based deferred rendering and memory hierarchy optimization]
+* **HUAWEI Maleoon GPU Best Practices**: https://developer.huawei.com/consumer/en/doc/best-practices/bpta-maleoon-gpu-best-practices[TBR-relevant best practices for Huawei Maleoon]
diff --git a/guide.adoc b/guide.adoc
@@ -62,6 +62,8 @@ include::{chapters}decoder_ring.adoc[]
 
 include::{chapters}ide.adoc[]
 
+include::{chapters}tile_based_rendering_best_practices.adoc[]
+
 include::{chapters}descriptor_arrays.adoc[]
 
 include::{chapters}loader.adoc[]