CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Updated May 11, 2026 - Cuda
This is my 🔥 100 Days of GPU — a wild, hands-on journey through CUDA/CUTLASS kernels, Triton spells, and PTX sorcery.
Profiling with NVIDIA Nsight Tools Bootcamp
References content from the OLCF CUDA Training Series (https://github.com/olcf/cuda-training-series).
A comprehensive, hardware-agnostic GPU benchmarking suite that compares CUDA, OpenCL, and DirectCompute performance using identical workloads. Built from scratch with professional architecture, extensive documentation, and production-ready GUI.
CUDA Samples and Nsight Guided Profiling Samples
Learning CUDA GEMM optimization and profiling for AI infrastructure.
CUDA FP32 GEMM optimization with loop unrolling, shared memory tiling, register tiling, benchmarking, and Nsight profiling.
This project demonstrates the integration of a CUDA kernel within an NVIDIA Holoscan application. It consists of two custom operators: one for memory allocation and data initialization, and another for executing the CUDA kernel. The application was profiled with Nsight Systems and the kernel with Nsight Compute.
libHPC is a high-performance computing library focused on Linux and Windows environments. It provides SIMD-optimized kernels, concurrent data structures, GPU utilities, and HPC-oriented memory management components.
CUDA reduction primitive using warp shuffles, grid-stride loading, and memory-bandwidth profiling with Nsight Compute.
Single-head CUDA attention kernel: naive SDPA --> fused softmax --> occupancy-tuned variants, benchmarked against cuDNN SDPA with Nsight Compute profiling.
Nsight-driven CUDA kernel profiling studies for GEMM, Tensor Core GEMM, reductions, softmax, and attention against vendor baselines.
CUDA learning lab: kernel exercises, profiling experiments, and writeups.
Kernel-only profiling workflow for CUDA and Triton kernels with Nsight Compute, standardized reports, visual analysis, and vendor-portable adapters.
Sparse binary 2D FFT on CUDA/cuFFT with memory-footprint optimization, streaming tiles, Hermitian symmetry, and Nsight analysis.
Row-wise CUDA softmax kernels: shared-memory reduction, warp-shuffle reduction, and online softmax benchmarked against cuDNN.
GPU-accelerated Number-Theoretic Transform for ZK-Proof generation. Targets the NTT bottleneck (91% of Groth16 prover time) via two CUDA optimizations: async double-buffered pipeline eliminating CPU-GPU transfer overhead, and IADD3-path Montgomery multiplication reducing finite-field instruction latency. BLS12-381, Ampere sm_86, Nsight-profiled.
Profile-driven FP32 CUDA GEMM optimization: naive --> tiled --> coalesced --> register-blocked --> bank-padded, benchmarked against cuBLAS.
University project for the "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a parallelized nearest-neighbor upscaler using CUDA.