- install the turingas compiler
  - `git clone --recursive git@github.com:sjfeng1999/gpu-arch-microbenchmark.git`
  - `cd turingas`
  - `python setup.py install`
- build and run
  - `mkdir build && cd build`
  - `cmake .. && make`
  - `python ../compile_sass.py -arch=(70|75|80)`
  - `./(memory_latency|reg_bankconflict|...)`

| Device | Latency | Turing RTX-2070 (TU104) |
|---|---|---|
| Global Latency | cycle | 1000 ~ 1200 |
| TLB Latency | cycle | 472 |
| L2 Latency | cycle | 236 |
| L1 Latency | cycle | 32 |
| Shared Latency | cycle | 23 |
| Constant Latency | cycle | 448 |
| Constant L2 Latency | cycle | 62 |
| Constant L1 Latency | cycle | 4 |
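
Latencies like these are commonly measured by pointer chasing: each load's address depends on the result of the previous load, so the loads cannot overlap, and the elapsed cycles divided by the number of loads gives the per-load latency. The Python sketch below only models that bookkeeping on the host. `build_chain`, `walk`, and `cycles_per_load` are hypothetical names, the clock readings are made up, and whether the repository's kernels follow exactly this scheme is an assumption.

```python
def build_chain(n_elems, stride):
    """Index array in which element i stores the index of the next load."""
    return [(i + stride) % n_elems for i in range(n_elems)]


def walk(chain, start, n_loads):
    """Follow the chain; every step depends on the value just loaded."""
    visited = []
    idx = start
    for _ in range(n_loads):
        visited.append(idx)
        idx = chain[idx]
    return visited


def cycles_per_load(start_clock, stop_clock, n_loads):
    """Average latency of one dependent load, in cycles."""
    return (stop_clock - start_clock) / n_loads


chain = build_chain(n_elems=8, stride=2)
print(walk(chain, start=0, n_loads=5))          # [0, 2, 4, 6, 0]
# Made-up clock readings: 64 dependent loads taking 1472 cycles in total.
print(cycles_per_load(1000, 2472, n_loads=64))  # 23.0
```

Choosing the array size and stride so that the walk stays inside (or deliberately exceeds) one cache level is what would separate the rows of the table.
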
- At 4 cycles, a constant L1 cache hit is about as fast as a register access.
- memory bandwidth within one thread
| Device | Bandwidth | Turing RTX-2070 |
|---|---|---|
| Global LDG.128 | GB/s | 194.12 |
| Global LDG.64 | GB/s | 140.77 |
| Global LDG.32 | GB/s | 54.18 |
| Shared LDS.128 | GB/s | 152.96 |
| Shared LDS.64 | GB/s | 30.58 |
| Shared LDS.32 | GB/s | 13.32 |
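
A GB/s figure of this kind can be derived from a cycle count as bytes moved divided by elapsed time. This is a minimal sketch, assuming the kernel reports elapsed SM cycles and the clock rate is known. `bandwidth_gbs` is a hypothetical helper, and the numbers in the example are illustrative, not measurements from this repository.

```python
def bandwidth_gbs(bytes_moved, elapsed_cycles, clock_hz):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) from a cycle count."""
    bytes_per_second = bytes_moved * clock_hz / elapsed_cycles
    return bytes_per_second / 1e9


# Illustrative only: 1024 LDG.128 loads (16 bytes each) taking 8192 cycles
# on a hypothetical 1.5 GHz clock.
print(bandwidth_gbs(1024 * 16, 8192, 1.5e9))  # 3.0
```
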
- global memory bandwidth with 64 blocks × 256 threads
| Device | Bandwidth | Turing RTX-2070 |
|---|---|---|
| LDG.32 | GB/s | 246.65 |
| LDG.32 Group1 Stride1 | GB/s | 118.73(2X) |
| LDG.32 Group2 Stride2 | GB/s | 119.08(2X) |
| LDG.32 Group4 Stride4 | GB/s | 117.11(2X) |
| LDG.32 Group8 Stride8 | GB/s | 336.27 |
| LDG.64 | GB/s | 379.24 |
| LDG.64 Group1 Stride1 | GB/s | 126.40(2X) |
| LDG.64 Group2 Stride2 | GB/s | 124.51(2X) |
| LDG.64 Group4 Stride4 | GB/s | 398.84 |
| LDG.64 Group8 Stride8 | GB/s | 371.28 |
| LDG.128 | GB/s | 391.83 |
| LDG.128 Group1 Stride1 | GB/s | 125.25(2X) |
| LDG.128 Group2 Stride2 | GB/s | 402.55 |
| LDG.128 Group4 Stride4 | GB/s | 394.22 |
| LDG.128 Group8 Stride8 | GB/s | 396.10 |
| Device | Linesize | Turing RTX-2070(TU104) |
|---|---|---|
| L2 Linesize | bytes | 64 |
| L1 Linesize | bytes | 32 |
| Constant L2 Linesize | bytes | 256 |
| Constant L1 Linesize | bytes | 32 |
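
The line size determines how many lines a contiguous access spans. The sketch below counts the lines covered by a byte range, using the L1 and L2 line sizes from the table. `lines_touched` is a hypothetical helper, and treating a warp's 32 consecutive 4-byte loads as one 128-byte range is an assumption about the access pattern.

```python
L1_LINE = 32  # bytes, from the table above
L2_LINE = 64  # bytes, from the table above


def lines_touched(base_addr, n_bytes, line_size):
    """Number of cache lines covered by [base_addr, base_addr + n_bytes)."""
    first = base_addr // line_size
    last = (base_addr + n_bytes - 1) // line_size
    return last - first + 1


warp_bytes = 32 * 4  # 32 threads, one 4-byte word each
print(lines_touched(0, warp_bytes, L1_LINE))   # 4
print(lines_touched(0, warp_bytes, L2_LINE))   # 2
print(lines_touched(16, warp_bytes, L1_LINE))  # 5 (misaligned start)
```
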
| Instruction | CPI | conflict | without conflict | reg reuse | double reuse |
|---|---|---|---|---|---|
| FFMA | cycle | 3.516 | 2.969 | 2.938 | 2.938 |
| IADD3 | cycle | 3.031 | 2.062 | 2.031 | 2.031 |
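
Jia et al. report for Volta that 32-bit registers are split across two banks by register index modulo 2, and that an FFMA conflicts when all three source registers fall in the same bank. Assuming the same layout holds on this Turing part (an assumption, not something the table proves), the conflict check and the per-instruction penalty implied by the table can be sketched as follows. `reg_bank` and `has_bank_conflict` are hypothetical helpers.

```python
NUM_BANKS = 2  # assumption taken from Jia et al. (Volta); not measured here


def reg_bank(reg_index):
    """Bank of a 32-bit register under the assumed index-modulo layout."""
    return reg_index % NUM_BANKS


def has_bank_conflict(*src_regs):
    """True when three or more distinct source registers share one bank."""
    distinct = set(src_regs)
    banks = {reg_bank(r) for r in distinct}
    return len(distinct) >= 3 and len(banks) == 1


print(has_bank_conflict(0, 2, 4))  # True: R0, R2, R4 are all in bank 0
print(has_bank_conflict(1, 2, 4))  # False: R1 is in bank 1

# CPI values copied from the table: (conflict, without conflict).
CPI = {"FFMA": (3.516, 2.969), "IADD3": (3.031, 2.062)}
for op, (conflict, clean) in CPI.items():
    print(op, round(conflict - clean, 3))
# FFMA 0.547
# IADD3 0.969
```
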
| Memory Load | Latency | Turing RTX-2070 (TU104) |
|---|---|---|
| Single | cycle | 23 |
| Vector2 X 2 | cycle | 27 |
| Conflict Strided | cycle | 41 |
| Conflict-Free Strided | cycle | 32 |
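
Assuming this table concerns shared-memory loads (the 23-cycle single load matches the Shared Latency figure), the gap between the two strided rows is consistent with bank conflicts. Shared memory on NVIDIA GPUs of this generation has 32 banks of 4-byte words, so word index `i` maps to bank `i % 32`. The sketch below computes the conflict degree for a warp in which thread `t` reads word `t * stride`. The strides shown are illustrative, since the ones used by the benchmark are not stated here.

```python
from collections import Counter

NUM_BANKS = 32  # 4-byte-wide shared memory banks
WARP_SIZE = 32


def conflict_degree(stride_words):
    """Largest number of distinct words a warp requests from a single bank."""
    # Threads reading the same word are served by a broadcast, so count
    # distinct words only.
    words = {t * stride_words for t in range(WARP_SIZE)}
    per_bank = Counter(w % NUM_BANKS for w in words)
    return max(per_bank.values())


for stride in (1, 2, 32, 33):
    print(stride, conflict_degree(stride))
# 1 1
# 2 2
# 32 32
# 33 1
```

A degree of 1 means the access is conflict-free; higher degrees mean the request is replayed that many times.
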
- warp schedule
- L1/L2 cache n-way k-set
- Jia, Zhe, et al. "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking." arXiv preprint arXiv:1804.06826 (2018).
- Jia, Zhe, et al. "Dissecting the NVidia Turing T4 GPU via microbenchmarking." arXiv preprint arXiv:1903.07486 (2019).
- Yan, Da, Wei Wang, and Xiaowen Chu. "Optimizing batched Winograd convolution on GPUs." Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2020. (turingas)