AI Inference
Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at large batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and Large Language Model (LLM) deployments aim to deliver great experiences by lowering latency, so developers and infrastructure managers need to strike a balance between throughput and latency that yields the best possible throughput at acceptable latency while containing deployment costs.
When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section shows the best throughput achievable with a TTFT limit of one second, which delivers low latency for most users while making efficient use of compute resources.
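The balancing approach described above can be sketched as a simple selection rule: keep only the measured configurations whose time to first token (TTFT) fits the budget, then pick the one with the highest throughput. The configurations below are hypothetical examples, not measurements from this page.

```python
def best_config(measurements, ttft_budget_s=1.0):
    """Return the config with the highest throughput whose TTFT fits the budget."""
    within = [m for m in measurements if m["ttft_s"] <= ttft_budget_s]
    # max() over the feasible set; None if no configuration meets the budget
    return max(within, key=lambda m: m["tokens_per_sec"], default=None)

# Illustrative measurements only (larger batches: more throughput, worse TTFT)
measurements = [
    {"batch": 16, "ttft_s": 0.4, "tokens_per_sec": 1800},
    {"batch": 64, "ttft_s": 0.9, "tokens_per_sec": 2100},
    {"batch": 256, "ttft_s": 2.5, "tokens_per_sec": 2600},
]

print(best_config(measurements))  # batch 64: highest throughput under 1 s
```

With a tighter budget (say 0.3 s) no configuration qualifies and the rule returns nothing, which is the signal to re-tune batch size or add capacity.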
MLPerf Inference v4.0 Performance Benchmarks
Offline Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 31,712 tokens/sec | 8x H200 | NVIDIA DGX H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 22,290 tokens/sec | 8x H100 | GIGABYTE G593-SD1 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 3,871 tokens/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Llama2 70B | 15,086 tokens/sec | 8x H100 NVL | SYS-521GE-TNRT | NVIDIA H100 NVL | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca |
Stable Diffusion XL | 13.2 samples/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
Stable Diffusion XL | 1.8 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200-GraceHopper-Superchip | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
Stable Diffusion XL | 5.04 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
ResNet-50 | 705,887 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224) |
ResNet-50 | 369,341 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 76.46% Top1 | ImageNet (224x224) |
RetinaNet | 14,291 samples/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800) |
RetinaNet | 6,401 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 0.3755 mAP | OpenImages (800x800) |
BERT | 70,759 samples/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 90.874% f1 | SQuAD v1.1 |
BERT | 26,430 samples/sec | 8x L40S | SYS-521GE-TNRT | NVIDIA L40S | 90.87% f1 | SQuAD v1.1 |
GPT-J | 243 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 32 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 98 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
DLRMv2 | 354,151 samples/sec | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 49,651 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 101,691 samples/sec | 1x L40S | ESC8000-E11 | NVIDIA L40S | 80.31% AUC | Synthetic Multihot Criteo Dataset |
3D-UNET | 52 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.863 DICE mean | KiTS 2019 |
3D-UNET | 32 samples/sec | 1x L40S | SYS-521GE-TNRT | NVIDIA L40S | 0.863 DICE mean | KiTS 2019 |
RNN-T | 191,355 samples/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 7.45% WER | Librispeech dev-clean |
RNN-T | 91,782 samples/sec | 1x L40S | ESC8000-E11 | NVIDIA L40S | 7.45% WER | Librispeech dev-clean |
Server Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
---|---|---|---|---|---|---|---|
Llama2 70B | 29,526 tokens/sec | 8x H200 | NVIDIA DGX H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 21,504 tokens/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 3,617 tokens/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Llama2 70B | 14,275 tokens/sec | 8x H100 NVL | SYS-521GE-TNRT | NVIDIA H100 NVL | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca |
Stable Diffusion XL | 13.6 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
Stable Diffusion XL | 1.68 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
Stable Diffusion XL | 4.96 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
ResNet-50 | 630,172 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
ResNet-50 | 355,029 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 76.46% Top1 | 15 ms | ImageNet (224x224) |
RetinaNet | 13,676 queries/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
RetinaNet | 5,798 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 0.3755 mAP | 100 ms | OpenImages (800x800) |
BERT | 57,293 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
BERT | 25,121 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 90.87% f1 | 130 ms | SQuAD v1.1 |
GPT-J | 240 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
GPT-J | 31 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
GPT-J | 98 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail |
DLRMv2 | 333,218 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 48,788 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 94,969 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
RNN-T | 179,985 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 7.45% WER | 1000 ms | Librispeech dev-clean |
RNN-T | 87,974 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 7.45% WER | 1000 ms | Librispeech dev-clean |
Power Efficiency Offline Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 17,099 tokens/sec | 2.99 tokens/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | OpenOrca |
Stable Diffusion XL | 9.65 samples/sec | 0.00203 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Subset of coco-2014 val |
Stable Diffusion XL | 4.24 samples/sec | 0.00119 samples/sec/watt | 8x L40S | PRIMERGY CDI | NVIDIA L40S | Subset of coco-2014 val |
ResNet-50 | 456,575 samples/sec | 113 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | ImageNet (224x224) |
RetinaNet | 10,106 samples/sec | 2 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenImages (800x800) |
BERT | 53,727 samples/sec | 11 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | SQuAD v1.1 |
GPT-J | 174 samples/sec | 0.0377 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | CNN Dailymail |
DLRMv2 | 283,714 samples/sec | 50 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset |
3D-UNET | 37 samples/sec | 0.009 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | KiTS 2019 |
RNN-T | 139,938 samples/sec | 32 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Librispeech dev-clean |
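The throughput-per-watt column is simply sustained throughput divided by average power draw. A minimal sketch of the arithmetic, where the wattage is an assumed value chosen only for illustration, not a measured figure from these submissions:

```python
def perf_per_watt(throughput_per_sec: float, avg_power_watts: float) -> float:
    """Efficiency metric: work items per second per watt of average power draw."""
    return throughput_per_sec / avg_power_watts

# Illustrative: ~456,575 samples/sec at an assumed ~4,040 W average draw
# comes out near 113 samples/sec/watt, the order of magnitude seen above.
print(round(perf_per_watt(456575, 4040)))
```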
Power Efficiency Server Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
Llama2 70B | 15,487 tokens/sec | 2.62 tokens/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenOrca |
Stable Diffusion XL | 8.78 queries/sec | 0.00196 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Subset of coco-2014 val |
Stable Diffusion XL | 4.12 queries/sec | 0.00117 queries/sec/watt | 8x L40S | PRIMERGY CDI | NVIDIA L40S | Subset of coco-2014 val |
ResNet-50 | 400,031 queries/sec | 103 queries/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | ImageNet (224x224) |
RetinaNet | 8,794 queries/sec | 2 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenImages (800x800) |
BERT | 42,386 queries/sec | 8 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | SQuAD v1.1 |
GPT-J | 150 queries/sec | 0.0326 queries/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | CNN Dailymail |
DLRMv2 | 255,995 queries/sec | 44 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset |
RNN-T | 123,981 queries/sec | 27 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Librispeech dev-clean |
MLPerf™ v4.0 Inference Closed: Llama2 70B, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 4.0-0002, 4.0-0033, 4.0-0042, 4.0-0044, 4.0-0047, 4.0-0062, 4.0-0063, 4.0-0064, 4.0-0065, 4.0-0066, 4.0-0068, 4.0-0070, 4.0-0071, 4.0-0082, 4.0-0085, 4.0-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA H200 and NVIDIA GH200 GraceHopper-Superchip 144GB are preview submissions
Llama2 Max Sequence Length = 1,024.
BERT-Large Max Sequence Length = 384.
LLM Inference Performance of NVIDIA Data Center Products
H200 Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1024 | 1 | 128 | 128 | 29,169 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
GPT-J 6B | 120 | 1 | 128 | 2048 | 9,472 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,962 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
GPT-J 6B | 64 | 1 | 2048 | 2048 | 4,149 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
Llama v2 7B | 896 | 1 | 128 | 128 | 20,618 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 120 | 1 | 128 | 2048 | 8,348 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 84 | 1 | 2048 | 128 | 2,430 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
Llama v2 7B | 56 | 1 | 2048 | 2048 | 3,522 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 1024 | 1 | 128 | 128 | 3,989 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 512 | 2 | 128 | 2048 | 3,963 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 64 | 1 | 2048 | 128 | 418 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 64 | 1 | 2048 | 2048 | 1,458 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 1024 | 4 | 128 | 128 | 1,118 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 1024 | 4 | 128 | 2048 | 990 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 64 | 4 | 2048 | 128 | 118 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 64 | 4 | 2048 | 2048 | 265 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 896 | 1 | 128 | 128 | 20,460 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 120 | 1 | 128 | 2048 | 8,950 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 84 | 1 | 2048 | 128 | 2,450 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200 |
Mistral 7B | 56 | 1 | 2048 | 2048 | 3,867 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
TP: Tensor Parallelism
Batch size is per GPU
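In these high-throughput tables, the Throughput/GPU column appears to report aggregate throughput divided by the number of GPUs, so the total throughput of a multi-GPU (TP > 1) entry can be recovered by multiplying back. A small sketch under that assumption, with illustrative numbers:

```python
def total_throughput(per_gpu_tokens_per_sec: float, num_gpus: int) -> float:
    """Aggregate tokens/sec across all GPUs serving one model replica,
    assuming the per-GPU figure is aggregate throughput / GPU count."""
    return per_gpu_tokens_per_sec * num_gpus

# Illustrative: 3,963 tokens/sec/GPU on a 2-GPU (TP=2) entry
print(total_throughput(3963, 2))  # 7926 tokens/sec total
```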
GH200 Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1024 | 1 | 128 | 128 | 28,946 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
GPT-J 6B | 120 | 1 | 128 | 2048 | 8,882 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,783 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,832 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Llama v2 70B | 256 | 1 | 128 | 128 | 3,401 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Llama v2 70B | 256 | 2 | 128 | 2048 | 2,904 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Llama v2 70B | 96 | 2 | 2048 | 128 | 305 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Llama v2 70B | 64 | 2 | 2048 | 2048 | 1,028 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Falcon 180B | 1024 | 4 | 128 | 128 | 1,132 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Falcon 180B | 512 | 4 | 128 | 2048 | 946 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Falcon 180B | 64 | 4 | 2048 | 128 | 121 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
Falcon 180B | 64 | 4 | 2048 | 2048 | 277 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB |
TP: Tensor Parallelism
Batch size is per GPU
H100 Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1024 | 1 | 128 | 128 | 27,358 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB |
GPT-J 6B | 120 | 1 | 128 | 2048 | 7,832 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB |
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,661 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB |
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,409 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB |
Llama v2 70B | 1024 | 2 | 128 | 128 | 3,269 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB |
Llama v2 70B | 512 | 4 | 128 | 2048 | 2,725 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 96 | 2 | 2048 | 128 | 346 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 64 | 2 | 2048 | 2048 | 1,011 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
Batch size is per GPU
H100 NVL Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1024 | 1 | 128 | 128 | 20,484 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
GPT-J 6B | 120 | 1 | 128 | 2048 | 7,134 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,124 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,062 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 7B | 896 | 1 | 128 | 128 | 15,044 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 7B | 120 | 1 | 128 | 2048 | 6,153 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 7B | 84 | 1 | 2048 | 128 | 1,736 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 7B | 56 | 1 | 2048 | 2048 | 2,591 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 70B | 256 | 1 | 128 | 128 | 2,335 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 70B | 96 | 1 | 2048 | 128 | 264 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
Llama v2 70B | 64 | 2 | 2048 | 2048 | 846 total tokens/sec | 2x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL |
TP: Tensor Parallelism
Batch size is per GPU
L40S Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 512 | 1 | 128 | 128 | 7,859 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
GPT-J 6B | 64 | 1 | 128 | 2048 | 1,904 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
GPT-J 6B | 32 | 1 | 2048 | 128 | 684 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
GPT-J 6B | 32 | 1 | 2048 | 2048 | 768 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 7B | 256 | 1 | 128 | 128 | 5,885 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 7B | 64 | 1 | 128 | 2048 | 1,654 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 7B | 32 | 1 | 2048 | 128 | 574 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 7B | 16 | 1 | 2048 | 2048 | 537 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 70B | 256 | 2 | 128 | 128 | 562 total tokens/sec | 2x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 70B | 256 | 4 | 128 | 2048 | 478 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 70B | 16 | 2 | 2048 | 128 | 49 total tokens/sec | 2x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Llama v2 70B | 64 | 4 | 2048 | 2048 | 185 total tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.9.0 | NVIDIA L40S |
Mistral 7B | 896 | 1 | 128 | 128 | 9,562 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S |
Mistral 7B | 120 | 1 | 128 | 2048 | 4,387 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S |
Mistral 7B | 84 | 1 | 2048 | 128 | 971 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S |
Mistral 7B | 56 | 1 | 2048 | 2048 | 1,721 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S |
TP: Tensor Parallelism
Batch size is per GPU
H200 Inference Performance - High Throughput at Low Latency Under 1 Second
Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 512 | 1 | 128 | 128 | 0.64 seconds | 25,126 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
GPT-J 6B | 64 | 1 | 128 | 2048 | 0.08 seconds | 7,719 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.68 seconds | 2,469 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 3,167 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 512 | 1 | 128 | 128 | 0.84 seconds | 19,975 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 64 | 1 | 128 | 2048 | 0.11 seconds | 7,149 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.9 seconds | 2,101 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.9 seconds | 3,008 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 2,044 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 64 | 1 | 128 | 2048 | 0.93 seconds | 2,238 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 4 | 1 | 2048 | 128 | 0.95 seconds | 128 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Llama v2 70B | 16 | 8 | 2048 | 2048 | 0.97 seconds | 173 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 32 | 4 | 128 | 128 | 0.36 seconds | 365 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 64 | 8 | 128 | 2048 | 0.43 seconds | 408 total tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 4 | 4 | 2048 | 128 | 0.71 seconds | 43 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Falcon 180B | 4 | 4 | 2048 | 2048 | 0.71 seconds | 53 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 512 | 1 | 128 | 128 | 0.88 seconds | 19,975 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 120 | 1 | 128 | 2048 | 0.21 seconds | 8,951 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 32 | 1 | 2048 | 128 | 0.94 seconds | 2,111 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
Mistral 7B | 32 | 1 | 2048 | 2048 | 0.94 seconds | 3,194 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.9.0 | NVIDIA H200 |
TP: Tensor Parallelism
Batch size is per GPU
Low Latency Target: highest measured throughput with time to first token under 1 second
H100 Inference Performance - High Throughput at Low Latency Under 1 Second
Model | Batch Size | TP | Input Length | Output Length | Time to 1st Token | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 512 | 1 | 128 | 128 | 0.63 seconds | 24,167 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
GPT-J 6B | 120 | 1 | 128 | 2048 | 0.16 seconds | 7,351 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
GPT-J 6B | 32 | 1 | 2048 | 128 | 0.67 seconds | 2,257 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
GPT-J 6B | 32 | 1 | 2048 | 2048 | 0.68 seconds | 2,710 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 7B | 512 | 1 | 128 | 128 | 0.83 seconds | 19,258 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 7B | 120 | 1 | 128 | 2048 | 0.2 seconds | 6,944 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 7B | 32 | 1 | 2048 | 128 | 0.89 seconds | 1,904 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 7B | 32 | 1 | 2048 | 2048 | 0.89 seconds | 2,484 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 64 | 1 | 128 | 128 | 0.92 seconds | 1,702 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 128 | 4 | 128 | 2048 | 0.73 seconds | 1,494 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 4 | 8 | 2048 | 128 | 0.74 seconds | 105 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Llama v2 70B | 8 | 4 | 2048 | 2048 | 0.74 seconds | 141 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Falcon 180B | 64 | 4 | 128 | 128 | 0.71 seconds | 372 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Falcon 180B | 64 | 4 | 128 | 2048 | 0.7 seconds | 351 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Falcon 180B | 8 | 8 | 2048 | 128 | 0.87 seconds | 45 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Falcon 180B | 8 | 8 | 2048 | 2048 | 0.87 seconds | 61 total tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Mistral 7B | 512 | 1 | 128 | 128 | 0.88 seconds | 19,276 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Mistral 7B | 120 | 1 | 128 | 2048 | 0.21 seconds | 8,623 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Mistral 7B | 32 | 1 | 2048 | 128 | 0.94 seconds | 2,033 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
Mistral 7B | 32 | 1 | 2048 | 2048 | 0.94 seconds | 2,981 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.9.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
Batch size is per GPU
Low Latency Target: highest measured throughput with time to first token under 1 second
Inference Performance of NVIDIA Data Center Products
GH200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 2.12 images/sec | - | 471.9 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
Stable Diffusion v2.1 (512x512) | 4 | 3.3 images/sec | - | 1210.63 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
Stable Diffusion XL | 1 | 0.35 images/sec | - | 2899.48 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
ResNet-50 | 8 | 21,350 images/sec | 78 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
ResNet-50 | 128 | 63,745 images/sec | 118 images/sec/watt | 2.01 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
ResNet-50 | 542 | 77,857 images/sec | - | 6.96 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
ResNet-50v1.5 | 128 | 61,867 images/sec | 112 images/sec/watt | 2.07 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
ResNet-50v1.5 | 472 | 74,489 images/sec | - | 6.98 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
BERT-BASE | 8 | 9,328 sequences/sec | 22 sequences/sec/watt | 0.86 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
BERT-LARGE | 8 | 4,073 sequences/sec | 9 sequences/sec/watt | 1.96 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B0 | 8 | 16,357 images/sec | 80 images/sec/watt | 0.49 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B0 | 128 | 56,136 images/sec | 126 images/sec/watt | 2.28 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B0 | 479 | 68,736 images/sec | - | 6.97 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B4 | 8 | 4,521 images/sec | 15 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B4 | 55 | 7,911 images/sec | - | 6.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
EfficientNet-B4 | 128 | 8,673 images/sec | 15 images/sec/watt | 14.76 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF Swin Base | 8 | 4,109 samples/sec | 10 samples/sec/watt | 1.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF Swin Base | 32 | 6,432 samples/sec | 11 samples/sec/watt | 4.97 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF Swin Large | 8 | 2,727 samples/sec | 5 samples/sec/watt | 2.93 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF Swin Large | 32 | 3,926 samples/sec | 6 samples/sec/watt | 8.15 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF ViT Base | 8 | 6,698 samples/sec | 12 samples/sec/watt | 1.19 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF ViT Large | 8 | 2,715 samples/sec | 4 samples/sec/watt | 2.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
HF ViT Large | 64 | 3,819 samples/sec | 5 samples/sec/watt | 16.76 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
Megatron BERT Large QAT | 8 | 4,987 sequences/sec | 14 sequences/sec/watt | 1.6 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
QuartzNet | 8 | 6,415 samples/sec | 26 samples/sec/watt | 1.25 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB |
QuartzNet | 128 | 33,527 samples/sec | 94 samples/sec/watt | 3.82 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
H100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 2.13 images/sec | - | 468.79 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
4 | 3.12 images/sec | - | 1284.05 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
Stable Diffusion XL | 1 | 0.33 images/sec | - | 3023.54 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
ResNet-50 | 8 | 20,766 images/sec | 73 images/sec/watt | 0.39 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
128 | 59,967 images/sec | 101 images/sec/watt | 2.13 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
495 | 70,882 images/sec | - images/sec/watt | 6.98 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
ResNet-50v1.5 | 128 | 58,467 images/sec | 106 images/sec/watt | 2.19 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
472 | 67,927 images/sec | - images/sec/watt | 6.95 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
BERT-BASE | 8 | 9,319 sequences/sec | 22 sequences/sec/watt | 0.86 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
BERT-LARGE | 8 | 3,985 sequences/sec | 8 sequences/sec/watt | 2.01 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
EfficientNet-B0 | 8 | 15,995 images/sec | 63 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
128 | 54,695 images/sec | 108 images/sec/watt | 2.34 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
467 | 66,922 images/sec | - images/sec/watt | 6.98 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
EfficientNet-B4 | 8 | 4,479 images/sec | 12 images/sec/watt | 1.79 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
53 | 7,681 images/sec | - images/sec/watt | 6.9 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
128 | 8,484 images/sec | 14 images/sec/watt | 15.09 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
HF Swin Base | 8 | 3,965 samples/sec | 9 samples/sec/watt | 2.02 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
32 | 6,256 samples/sec | 10 samples/sec/watt | 5.12 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
HF Swin Large | 8 | 2,694 samples/sec | 5 samples/sec/watt | 2.97 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
32 | 3,732 samples/sec | 5 samples/sec/watt | 8.57 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
HF ViT Base | 8 | 6,688 samples/sec | 12 samples/sec/watt | 1.2 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
HF ViT Large | 8 | 2,683 samples/sec | 4 samples/sec/watt | 2.98 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
64 | 3,270 samples/sec | 5 samples/sec/watt | 19.57 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB | |
Megatron BERT Large QAT | 8 | 4,794 sequences/sec | 13 sequences/sec/watt | 1.67 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
QuartzNet | 8 | 6,448 samples/sec | 22 samples/sec/watt | 1.24 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
128 | 32,691 samples/sec | 80 samples/sec/watt | 3.92 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
L40S Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion XL | 1 | 0.16 images/sec | - | 6454.46 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.12-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40S |
ResNet-50 | 8 | 23,704 images/sec | 79 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
32 | 39,363 images/sec | 114 images/sec/watt | 0.81 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
ResNet-50v1.5 | 8 | 23,034 images/sec | 75 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
32 | 37,522 images/sec | 109 images/sec/watt | 0.85 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
BERT-BASE | 8 | 8,271 sequences/sec | 29 sequences/sec/watt | 0.97 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
128 | 12,910 sequences/sec | 37 sequences/sec/watt | 9.91 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
BERT-LARGE | 8 | 3,167 sequences/sec | 10 sequences/sec/watt | 2.53 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
24 | 4,321 sequences/sec | 13 sequences/sec/watt | 5.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
EfficientDet-D0 | 2 | - images/sec | 13 images/sec/watt | 0.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40S
8 | 4,530 images/sec | 16 images/sec/watt | 1.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
EfficientNet-B0 | 8 | 20,456 images/sec | 105 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
32 | 40,357 images/sec | 137 images/sec/watt | 0.79 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
EfficientNet-B4 | 8 | 5,082 images/sec | 18 images/sec/watt | 1.57 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
16 | 5,307 images/sec | 18 images/sec/watt | 2.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
HF Swin Base | 8 | 3,138 samples/sec | 10 samples/sec/watt | 2.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
16 | 3,624 samples/sec | 11 samples/sec/watt | 4.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
HF Swin Large | 8 | 1,598 samples/sec | 5 samples/sec/watt | 5.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
16 | 1,778 samples/sec | 6 samples/sec/watt | 9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
HF ViT Base | 12 | 4,019 samples/sec | 13 samples/sec/watt | 2.99 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
HF ViT Large | 8 | 1,365 samples/sec | 4 samples/sec/watt | 5.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
Megatron BERT Large QAT | 8 | 4,228 sequences/sec | 13 sequences/sec/watt | 1.89 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
24 | 5,102 sequences/sec | 15 sequences/sec/watt | 4.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S | |
QuartzNet | 8 | 7,625 samples/sec | 34 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
128 | 22,232 samples/sec | 64 samples/sec/watt | 5.76 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S |
1,024x1,024 image size, 50 denoising steps for Stable Diffusion XL
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
L4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 0.45 images/sec | - | 2230.89 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
4 | 0.46 images/sec | - | 8612.55 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
Stable Diffusion XL | 1 | 0.05 images/sec | - | 20540.47 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
ResNet-50 | 8 | 10,164 images/sec | 141 images/sec/watt | 0.79 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
32 | 10,426 images/sec | 145 images/sec/watt | 3.07 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
ResNet-50v1.5 | 8 | 9,761 images/sec | 135 images/sec/watt | 0.82 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
32 | 10,076 images/sec | 140 images/sec/watt | 3.18 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
BERT-BASE | 8 | 3,511 sequences/sec | 50 sequences/sec/watt | 2.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
24 | 4,034 sequences/sec | 56 sequences/sec/watt | 5.95 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
BERT-LARGE | 8 | 1,109 sequences/sec | 15 sequences/sec/watt | 7.22 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
12 | 1,293 sequences/sec | 18 sequences/sec/watt | 9.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
EfficientNet-B4 | 8 | 1,816 images/sec | 25 images/sec/watt | 4.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
HF Swin Base | 8 | 1,100 samples/sec | 15 samples/sec/watt | 7.27 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
HF Swin Large | 8 | 541 samples/sec | 8 samples/sec/watt | 14.78 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
HF ViT Base | 8 | 1,304 samples/sec | 18 samples/sec/watt | 6.13 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
HF ViT Large | 8 | 393 samples/sec | 5 samples/sec/watt | 20.35 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
Megatron BERT Large QAT | 8 | 1,517 sequences/sec | 21 sequences/sec/watt | 5.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
QuartzNet | 8 | 4,600 samples/sec | 64 samples/sec/watt | 1.74 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
128 | 5,998 samples/sec | 83 samples/sec/watt | 21.34 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 | |
RetinaNet-RN34 | 8 | 373 images/sec | 5 images/sec/watt | 21.43 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4 |
512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
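The Efficiency column also implies the board power each measurement drew: power ≈ throughput / efficiency. Applied to the L4 ResNet-50 batch-8 row above (10,164 images/sec at 141 images/sec/watt), this gives roughly 72 W, consistent with the L4's 72 W power limit. A minimal sketch (efficiency values in the table are rounded, so the result is approximate):

```python
# Implied power draw from the table's Throughput and Efficiency columns.
# Row used: L4 ResNet-50, batch 8 (10,164 images/sec, 141 images/sec/watt).
throughput = 10164  # images/sec
efficiency = 141    # images/sec/watt (rounded in the table)
power_w = throughput / efficiency
print(f"~{power_w:.0f} W")  # roughly the L4's 72 W board power
```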
A40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 11,549 images/sec | 44 images/sec/watt | 0.69 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
113 | 16,444 images/sec | - images/sec/watt | 6.87 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
128 | 16,247 images/sec | 54 images/sec/watt | 7.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
ResNet-50v1.5 | 8 | 11,116 images/sec | 41 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
109 | 15,626 images/sec | - images/sec/watt | 6.91 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
128 | 15,626 images/sec | 52 images/sec/watt | 8.19 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
BERT-BASE | 8 | 4,392 sequences/sec | 15 sequences/sec/watt | 1.82 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
128 | 5,704 sequences/sec | 20 sequences/sec/watt | 22.44 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
BERT-LARGE | 8 | 1,596 sequences/sec | 5 sequences/sec/watt | 5.01 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
128 | 1,964 sequences/sec | 7 sequences/sec/watt | 65.17 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
EfficientNet-B0 | 8 | 10,900 images/sec | 59 images/sec/watt | 0.73 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
128 | 20,003 images/sec | 67 images/sec/watt | 6.4 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
138 | 19,890 images/sec | - images/sec/watt | 6.94 | 1x A40 | GIGABYTE G482-Z52-00 | 23.12-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 | |
EfficientNet-B4 | 8 | 2,106 images/sec | 8 images/sec/watt | 3.8 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
15 | 2,274 images/sec | - images/sec/watt | 6.6 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
128 | 2,690 images/sec | 9 images/sec/watt | 47.58 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
HF Swin Base | 8 | 1,444 samples/sec | 5 samples/sec/watt | 5.54 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
32 | 1,465 samples/sec | 5 samples/sec/watt | 21.84 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40 | |
HF Swin Large | 8 | 829 samples/sec | 3 samples/sec/watt | 9.65 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40 |
32 | 840 samples/sec | 3 samples/sec/watt | 38.1 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40 | |
HF ViT Base | 8 | 2,176 samples/sec | 7 samples/sec/watt | 3.68 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
64 | 2,182 samples/sec | 7 samples/sec/watt | 29.33 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
HF ViT Large | 8 | 694 samples/sec | 2 samples/sec/watt | 11.53 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
64 | 713 samples/sec | 2 samples/sec/watt | 89.73 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
Megatron BERT Large QAT | 8 | 2,101 sequences/sec | 8 sequences/sec/watt | 3.81 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
128 | 2,688 sequences/sec | 9 sequences/sec/watt | 47.62 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
QuartzNet | 8 | 4,501 samples/sec | 21 samples/sec/watt | 1.78 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
128 | 8,492 samples/sec | 28 samples/sec/watt | 15.07 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 | |
RetinaNet-RN34 | 8 | 706 images/sec | 2 images/sec/watt | 11.33 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A30 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 10,446 images/sec | 74 images/sec/watt | 0.77 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
117 | 17,104 images/sec | - images/sec/watt | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 17,328 images/sec | 106 images/sec/watt | 7.39 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
ResNet-50v1.5 | 8 | 10,167 images/sec | 71 images/sec/watt | 0.79 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
118 | 16,540 images/sec | - images/sec/watt | 6.95 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 16,759 images/sec | 102 images/sec/watt | 7.64 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 4,337 sequences/sec | 26 sequences/sec/watt | 1.84 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 5,784 sequences/sec | 35 sequences/sec/watt | 22.13 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 1,488 sequences/sec | 9 sequences/sec/watt | 5.38 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 2,043 sequences/sec | 12 sequences/sec/watt | 62.65 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
EfficientNet-B0 | 8 | 8,928 images/sec | 82 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
117 | 17,178 images/sec | - images/sec/watt | 6.81 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 17,251 images/sec | 105 images/sec/watt | 7.42 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
EfficientNet-B4 | 8 | 1,866 images/sec | 13 images/sec/watt | 4.29 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
14 | 2,091 images/sec | - images/sec/watt | 6.69 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
128 | 2,395 images/sec | 15 images/sec/watt | 53.44 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
HF Swin Base | 8 | 1,456 samples/sec | 9 samples/sec/watt | 5.49 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30 |
32 | 1,604 samples/sec | 10 samples/sec/watt | 19.96 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
HF Swin Large | 8 | 811 samples/sec | 5 samples/sec/watt | 9.87 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30 |
32 | 841 samples/sec | 5 samples/sec/watt | 38.04 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30 | |
HF ViT Base | 8 | 2,028 samples/sec | 12 samples/sec/watt | 3.94 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30 |
64 | 2,140 samples/sec | 13 samples/sec/watt | 29.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
HF ViT Large | 8 | 648 samples/sec | 4 samples/sec/watt | 12.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
64 | 698 samples/sec | 4 samples/sec/watt | 91.71 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
Megatron BERT Large QAT | 8 | 1,816 sequences/sec | 13 sequences/sec/watt | 4.41 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
128 | 2,766 sequences/sec | 17 sequences/sec/watt | 46.28 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
QuartzNet | 8 | 3,429 samples/sec | 30 samples/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
128 | 9,891 samples/sec | 71 samples/sec/watt | 12.94 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 | |
RetinaNet-RN34 | 8 | 695 images/sec | 4 images/sec/watt | 11.52 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A30 1/4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 4,047 images/sec | 47 images/sec/watt | 1.98 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
32 | 4,650 images/sec | - images/sec/watt | 6.88 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 4,788 images/sec | 54 images/sec/watt | 26.73 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
ResNet-50v1.5 | 8 | 3,894 images/sec | 47 images/sec/watt | 2.05 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
31 | 4,463 images/sec | - images/sec/watt | 6.95 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 4,636 images/sec | 51 images/sec/watt | 27.61 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-BASE | 8 | 1,571 sequences/sec | 17 sequences/sec/watt | 5.09 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 1,705 sequences/sec | 18 sequences/sec/watt | 75.05 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-LARGE | 8 | 519 sequences/sec | 6 sequences/sec/watt | 15.42 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 592 sequences/sec | 6 sequences/sec/watt | 216.21 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE
A30 4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 15,541 images/sec | 94 images/sec/watt | 2.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
29 | 17,253 images/sec | - images/sec/watt | 6.75 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 18,158 images/sec | 111 images/sec/watt | 28.26 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
ResNet-50v1.5 | 8 | 14,924 images/sec | 91 images/sec/watt | 2.15 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
28 | 16,678 images/sec | - images/sec/watt | 6.74 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 17,511 images/sec | 106 images/sec/watt | 29.31 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-BASE | 8 | 5,715 sequences/sec | 35 sequences/sec/watt | 5.71 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 6,004 sequences/sec | 37 sequences/sec/watt | 86.99 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-LARGE | 8 | 1,885 sequences/sec | 12 sequences/sec/watt | 17.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 2,090 sequences/sec | 13 sequences/sec/watt | 246.09 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE
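Comparing the two MIG tables shows near-linear scaling: running all four 1/4 MIG instances delivers close to four times the single-instance throughput. For ResNet-50 at batch 128 (4,788 images/sec on one 1/4 instance versus 18,158 images/sec across four), the scaling efficiency works out to about 95%:

```python
# MIG scaling efficiency, using the ResNet-50 batch-128 rows from the
# "A30 1/4 MIG" and "A30 4 MIG" tables above.
single_instance = 4788  # images/sec, one 1/4 MIG slice
four_instances = 18158  # images/sec, all four slices in parallel
efficiency = four_instances / (4 * single_instance)
print(f"{efficiency:.1%}")  # ~94.8% of perfect linear scaling
```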
A10 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 8,877 images/sec | 59 images/sec/watt | 0.9 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
75 | 11,019 images/sec | - images/sec/watt | 3.02 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
128 | 11,526 images/sec | 77 images/sec/watt | 11.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
ResNet-50v1.5 | 8 | 8,469 images/sec | 57 images/sec/watt | 0.94 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
70 | 10,801 images/sec | - images/sec/watt | 6.48 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
128 | 10,868 images/sec | 73 images/sec/watt | 11.78 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 3,227 sequences/sec | 22 sequences/sec/watt | 2.48 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
128 | 3,768 sequences/sec | 25 sequences/sec/watt | 33.97 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 1,120 sequences/sec | 7 sequences/sec/watt | 7.14 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
128 | 1,267 sequences/sec | 9 sequences/sec/watt | 101.05 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
EfficientNet-B0 | 8 | 9,496 images/sec | 64 images/sec/watt | 0.84 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
128 | 14,315 images/sec | 96 images/sec/watt | 8.94 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
EfficientNet-B4 | 8 | 1,592 images/sec | 11 images/sec/watt | 5.02 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
128 | 1,853 images/sec | 12 images/sec/watt | 69.09 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
HF Swin Base | 8 | 1,061 samples/sec | 7 samples/sec/watt | 7.54 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10 |
32 | 1,046 samples/sec | 7 samples/sec/watt | 30.61 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
HF Swin Large | 8 | 554 samples/sec | 4 samples/sec/watt | 14.45 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
32 | 575 samples/sec | 4 samples/sec/watt | 55.68 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10 | |
HF ViT Base | 8 | 1,384 samples/sec | 9 samples/sec/watt | 5.78 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10 |
64 | 1,438 samples/sec | 10 samples/sec/watt | 44.51 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10 | |
HF ViT Large | 8 | 462 samples/sec | 3 samples/sec/watt | 17.32 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
64 | 446 samples/sec | 3 samples/sec/watt | 143.47 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10 | |
Megatron BERT Large QAT | 8 | 1,596 sequences/sec | 11 sequences/sec/watt | 5.01 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
128 | 1,846 sequences/sec | 13 sequences/sec/watt | 69.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
QuartzNet | 8 | 3,999 samples/sec | 27 samples/sec/watt | 2 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
128 | 5,875 samples/sec | 39 samples/sec/watt | 21.79 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 | |
RetinaNet-RN34 | 8 | 503 images/sec | 3 images/sec/watt | 15.89 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
NVIDIA Performance with Triton Inference Server
H100 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|
BERT Base Inference | H100 SXM5-80GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.207 | 3,311 inf/sec | 24.02-py3 |
BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 14.784 | 1,082 inf/sec | 24.02-py3 |
BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 12.715 | 1,258 inf/sec | 24.02-py3 |
DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 1 | 1 | 32 | 0.94 | 34,027 inf/sec | 24.02-py3 |
DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 4 | 2 | 32 | 0.913 | 70,071 inf/sec | 24.02-py3 |
FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 119.531 | 4,281 inf/sec | 24.02-py3 |
FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 119.36 | 4,287 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.977 | 8,090 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 4.101 | 7,801 inf/sec | 24.02-py3 |
TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 1 | 1024 | 33.027 | 30,996 inf/sec | 24.02-py3 |
TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 2 | 512 | 25.522 | 40,114 inf/sec | 24.02-py3 |
H100 NVL Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|
BERT Base Inference | NVIDIA H100 NVL | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.333 | 3,000 inf/sec | 24.02-py3 |
BERT Large Inference | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 19.055 | 840 inf/sec | 24.02-py3 |
BERT Large Inference | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 17.025 | 940 inf/sec | 24.02-py3 |
DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 32 | 0.804 | 39,745 inf/sec | 24.02-py3 |
DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 1.071 | 59,691 inf/sec | 24.02-py3 |
FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 151.079 | 3,386 inf/sec | 24.02-py3 |
FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 1 | 2 | 128 | 82.113 | 3,117 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.978 | 8,086 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 3.812 | 8,392 inf/sec | 24.02-py3 |
TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 16.846 | 30,387 inf/sec | 24.02-py3 |
TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 13.848 | 36,966 inf/sec | 24.02-py3 |
L40S Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|
BERT Base Inference | NVIDIA L40S | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.396 | 2,863 inf/sec | 24.02-py3 |
BERT Large Inference | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 1 | 8 | 15.677 | 510 inf/sec | 24.02-py3 |
BERT Large Inference | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 2 | 4 | 15.077 | 531 inf/sec | 24.02-py3 |
DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 64 | 1.545 | 41,403 inf/sec | 24.02-py3 |
DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 0.929 | 68,867 inf/sec | 24.02-py3 |
FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 64 | 64.515 | 2,413 inf/sec | 24.02-py3 |
FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 26.384 | 2,425 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 2.017 | 7,928 inf/sec | 24.02-py3 |
ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 3.964 | 8,070 inf/sec | 24.02-py3 |
TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 10.122 | 25,288 inf/sec | 24.02-py3 |
TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 64 | 4.988 | 25,658 inf/sec | 24.02-py3 |
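The Triton columns relate in a similar way: with a given number of concurrent client requests outstanding, each carrying a client batch and completing in the reported latency, steady-state throughput is approximately concurrency × client batch / latency (a Little's-law style estimate, ignoring queueing and batching effects inside the server). A rough sketch, with an illustrative function name:

```python
def triton_throughput_estimate(concurrency: int, client_batch: int,
                               latency_ms: float) -> float:
    """Little's-law estimate: `concurrency` outstanding requests of
    `client_batch` inferences each, completing in `latency_ms` on average,
    sustain roughly concurrency * client_batch / latency inferences/sec."""
    return concurrency * client_batch / (latency_ms / 1000.0)

# H100 BERT Base row: concurrency 4, client batch 1, 1.207 ms latency
print(round(triton_throughput_estimate(4, 1, 1.207)))  # ~3,314 inf/sec
```

For the H100 BERT Base row (concurrency 4, client batch 1, 1.207 ms) this predicts about 3,314 inf/sec against the measured 3,311; the H100 TFT row at concurrency 512, batch 2, 25.522 ms comes out to about 40,122 inf/sec against the measured 40,114.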
Inference Performance of NVIDIA GPUs in the Cloud
A100 Inference Performance in the Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 13,768 images/sec | - images/sec/watt | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 30,338 images/sec | - images/sec/watt | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
BERT-LARGE | 8 | 2,308 sequences/sec | - sequences/sec/watt | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 4,045 sequences/sec | - sequences/sec/watt | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
BERT-Large: Sequence Length = 128
View More Performance Data
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most rigorous way to test whether an AI system is ready to be deployed in the field and deliver meaningful results.
Learn More
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Learn More