nsight-compute

Here are 24 public repositories matching this topic...

psmarter / CUDA-Practice

CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.

parallel-computing cuda high-performance-computing cuda-kernels quantization cutlass gemm performance-optimization nccl gpu-programming roofline-model tensor-core llm-inference flash-attention nsight-compute

Updated May 11, 2026
Cuda

bikrammajhi / 100-days-of-GPU

This is my 🔥 100 Days of GPU — a wild, hands-on journey through CUDA/CUTLASS kernels, Triton spells, and PTX sorcery.

mojo cuda triton cutlass ptx nsight-compute thunderkittens

Updated Apr 6, 2026
HTML

openhackathons-org / HPC_Profiler

Profiling with NVIDIA Nsight Tools Bootcamp

hpc cuda openacc nsight-systems nsight-compute

Updated Feb 4, 2026
C++

SeungjaeLim / CUDA.tutorial

References content from the OLCF CUDA Training Series (https://github.com/olcf/cuda-training-series).

cuda gpu-programming nsight-systems nsight-compute

Updated Nov 21, 2024
Cuda

davesohamm / GPU-Benchmark

A comprehensive, hardware-agnostic GPU benchmarking suite that compares CUDA, OpenCL, and DirectCompute performance using identical workloads. Built from scratch with professional architecture, extensive documentation, and production-ready GUI.

benchmarking cmake opengl kernel cpp opencl imgui cuda hlsl gpu-computing low-level-programming nvcc directcompute cool-stuff cuda-programming fxc nsight-compute dgxi

Updated Jan 16, 2026
C++

j3soon / hpc-samples

CUDA Samples and Nsight Guided Profiling Samples

cuda profiling nsight nsight-compute

Updated May 20, 2026
Python

nrjwfersonguimaraesb601-hue / GEMM-Personal-Learning

Learning CUDA GEMM optimization and profiling for AI infrastructure.

gpu cuda gemm ai-infra nsight-compute kernel-optimization

Updated May 23, 2026
Cuda

LioEinaudi / GEMM-optimization

CUDA FP32 GEMM optimization with loop unrolling, shared memory tiling, register tiling, benchmarking, and Nsight profiling.

hpc gpu parallel-computing cuda cublas matrix-multiplication cuda-kernels gemm nsight-compute

Updated May 16, 2026
Cuda

abhiMishra98 / Holoscan-Add-Matrices

This project demonstrates the integration of a CUDA kernel within an NVIDIA Holoscan application. It consists of two custom operators: one for memory allocation and data initialization, and another for executing the CUDA kernel. The application was profiled with Nsight Systems and the kernel with Nsight Compute.

cuda gpu-programming holoscan nsight-systems nsight-compute gpu-profiling

Updated Nov 11, 2025
C++

Liupeter01 / libHPC

libHPC is a high-performance computing library focused on Linux and Windows environments. It provides SIMD-optimized kernels, concurrent data structures, GPU utilities, and HPC-oriented memory management components.

c cpp hpc cuda intrinsics lock-free sparse-matrix memorypool ghost-cell-exchange nsight-systems nsight-compute ghost-cell

Updated Apr 22, 2026
C++

Pupking / 03_parallel_reduction

CUDA reduction primitive using warp shuffles, grid-stride loading, and memory-bandwidth profiling with Nsight Compute.

cuda memory-bandwidth gpu-performance parallel-reduction nsight-compute warp-shuffle

Updated Apr 20, 2026
Cuda

Pupking / 05_flash_attention_lite

Single-head CUDA attention kernel: naive SDPA --> fused softmax --> occupancy-tuned variants, benchmarked against cuDNN SDPA with Nsight Compute profiling.

cuda attention cudnn gpu-performance flash-attention nsight-compute

Updated Apr 20, 2026
Cuda

Pupking / cuda-kernel-profiling-studies

Nsight-driven CUDA kernel profiling studies for GEMM, Tensor Core GEMM, reductions, softmax, and attention against vendor baselines.

cuda cublas profiling cudnn gemm gpu-performance tensor-cores flash-attention nsight-compute

Updated May 1, 2026

5ara5t1 / cuda-learning

CUDA learning lab: kernel exercises, profiling experiments, and writeups.

gpu cuda cuda-programming nsight-compute

Updated May 21, 2026
Cuda

ZJtoast / kernel-profiler-skill

Kernel-only profiling workflow for CUDA and Triton kernels with Nsight Compute, standardized reports, visual analysis, and vendor-portable adapters.

agent cuda nvidia developer-tools triton performance-analysis profiling llm-inference nsight-compute gpu-profiling agent-skills codex-skill

Updated May 15, 2026
Python

Pupking / fft_final

Sparse binary 2D FFT on CUDA/cuFFT with memory-footprint optimization, streaming tiles, Hermitian symmetry, and Nsight analysis.

cuda fft sparse-matrix suitesparse cufft gpu-performance nsight-compute

Updated May 2, 2026
Cuda

Pupking / 04_fused_softmax

Row-wise CUDA softmax kernels: shared-memory reduction, warp-shuffle reduction, and online softmax benchmarked against cuDNN.

cuda cudnn softmax gpu-performance nsight-compute online-softmax

Updated Apr 20, 2026
Cuda

Artemarius / cuda-zkp-ntt

GPU-accelerated Number-Theoretic Transform for ZK-Proof generation. Targets the NTT bottleneck (91% of Groth16 prover time) via two CUDA optimizations: async double-buffered pipeline eliminating CPU-GPU transfer overhead, and IADD3-path Montgomery multiplication reducing finite-field instruction latency. BLS12-381, Ampere sm_86, Nsight-profiled.

gpu cuda cuda-kernels number-theoretic-transform ntt zero-knowledge-proofs finite-field-arithmetic bls12-381 montgomery-multiplication nsight-compute

Updated Mar 16, 2026
Cuda

Pupking / 01_tiled_gemm

Profile-driven FP32 CUDA GEMM optimization: naive --> tiled --> coalesced --> register-blocked --> bank-padded, benchmarked against cuBLAS.

cuda cublas shared-memory gemm gpu-performance nsight-compute memory-coalescing

Updated May 5, 2026
Cuda

itm-unipi / Parallelized-Nearest-Neighbor-Upscaler

University Project for "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a Parallelized Nearest Neighbor Upscaler using CUDA.

gpu nvidia nvidia-cuda nvidia-gpu nsight image-upscaling parallelized nearest-neighbor-algorithm nsight-compute

Updated Dec 29, 2023
C
