2025-03-11
With the release of the DeepSeek V3 and R1 series models, their strong performance has quickly drawn widespread attention from the AI community both in China and abroad. The Mixture of Experts (MoE) architecture adopted by DeepSeek-V3 has become a subject of study and reference across the industry, positioning DeepSeek at the forefront of large language model technology.
Recently, demand for on-premises deployment of DeepSeek models in enterprise environments has been steadily increasing, particularly for the full-scale versions of DeepSeek-V3 and R1 671B, as well as the distilled models optimized for efficient inference. However, deploying the full-scale DeepSeek 671B models imposes extremely high requirements on computing resources, involving multiple critical factors such as GPU compute power, memory capacity, and communication bandwidth.
This article will explore the computational requirements for both full-scale and distilled versions of DeepSeek models in enterprise server environments, analyze deployment solutions suitable for different GPU hardware, and help enterprises utilize AI computing resources more efficiently.
Parameter/Metric | A100 (80GB) | RTX 4090 (Standard) | RTX 4090 (48GB) | H20 (Standard) | H20 (141GB) | H200 |
---|---|---|---|---|---|---|
Memory Capacity | 80GB | 24GB | 48GB | 96GB | 141GB | 141GB |
Memory Type | HBM2e | GDDR6X | GDDR6X | HBM3 | HBM3e | HBM3e |
Memory Bandwidth | 2TB/s | 1TB/s | 1TB/s | 4TB/s | 4.8TB/s | 4.8TB/s |
FP32 Performance | 19.5 TFLOPS | 82.6 TFLOPS | 82.6 TFLOPS | 44 TFLOPS | 67 TFLOPS | 67 TFLOPS |
TF32 Performance (Sparse) | 156 TFLOPS | Not Supported | Not Supported | 88 TFLOPS | 134 TFLOPS | 134 TFLOPS |
FP16 Performance | 312 TFLOPS | 165.3 TFLOPS | 165.3 TFLOPS | 176 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS |
FP8 Performance | Not Supported | Not Supported | Not Supported | 352 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS |
INT8 Performance | 624 TOPS | 661 TOPS | 661 TOPS | 704 TOPS | 7,916 TOPS | 7,916 TOPS |
GPU Interconnect | NVLink 600GB/s | Not Supported | Not Supported | NVLink 900GB/s | NVLink 900GB/s | NVLink 900GB/s |
TDP | 400W | 450W | 450W | 300W | 300W | 300W |
In deep learning and high-performance computing, memory capacity (VRAM) is one of the critical factors determining whether a model can run smoothly. Memory capacity directly affects the model's processable scale, inference efficiency, and training/inference stability. Different versions of DeepSeek models range from billions (1.5B, 7B, 8B, 14B, 32B, and 70B) to hundreds of billions of parameters (671B full version). Insufficient memory can prevent models from loading or cause low operational efficiency.
DeepSeek-V3 and R1 are trained and served natively in FP8, so on GPUs with hardware FP8 support the weight memory roughly equals the parameter count in bytes (671B parameters ≈ 671GB, or around 700GB with overhead). However, not all GPUs have FP8 compute units (hardware support). On GPUs that support BF16 but not FP8 (such as the NVIDIA A100 or Ascend 910B), the model must first be converted to a supported format; because BF16 uses twice the bits of FP8, memory usage roughly doubles to approximately 1.4TB.

> Learn more about DeepSeek full version definitions and different forms
The DeepSeek-R1 distilled model series natively supports the BF16 computation format for training and inference, requiring no additional precision conversion. Compared with the traditional FP16 (half-precision floating point) format, BF16 offers a wider exponent range, making it more stable in deep learning training and inference, especially in large-scale model computing environments.
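As a back-of-the-envelope illustration of the sizing rules above, the sketch below (a rough estimate, not a deployment calculator) converts parameter count and numeric precision into weight memory. Real deployments also need headroom for the KV cache, activations, and framework overhead, which is why the GPU counts in the tables below include a margin.

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Illustrative: real deployments also need headroom for KV cache,
# activations, and framework overhead (often 20-50% or more).

BYTES_PER_PARAM = {"fp8": 1, "int8": 1, "bf16": 2, "fp16": 2}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB for a given parameter count and precision."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(671, "fp8"))   # ~671 GB  -> ~700 GB with overhead
print(weight_memory_gb(671, "bf16"))  # ~1342 GB -> ~1.4 TB with overhead
print(weight_memory_gb(70, "bf16"))   # ~140 GB  -> e.g. 4x 80GB or 8x 48GB GPUs
```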
Memory bandwidth directly affects data transfer speed and inference efficiency, especially in deep learning inference and training processes where large amounts of model parameters, activation values, and KV Cache need to be transferred within or between GPUs. If memory bandwidth is insufficient, data transfer becomes a performance bottleneck, affecting computational efficiency.
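One way to see the effect of bandwidth on decoding is a simple ceiling estimate: each generated token must stream the active weights from memory at least once, so tokens per second cannot exceed aggregate memory bandwidth divided by bytes read per token. The sketch below applies this idea, assuming roughly 37B of the 671B parameters are activated per token in the MoE and using the bandwidth figures from the table above; it ignores KV-cache reads, inter-GPU communication, and compute time, so it is an upper bound only, not a throughput prediction.

```python
# Rough ceiling on single-stream decode speed:
#   tokens/s <= aggregate_memory_bandwidth / bytes_read_per_token
# Ignores KV-cache traffic, communication, and compute time.

def decode_ceiling_tok_s(active_params_b: float, bytes_per_param: float,
                         per_gpu_bw_tb_s: float, num_gpus: int) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    aggregate_bw = per_gpu_bw_tb_s * 1e12 * num_gpus
    return aggregate_bw / bytes_per_token

# Example: FP8 weights, ~37B active params, 8x H20 141GB at 4.8 TB/s each.
print(decode_ceiling_tok_s(37, 1, 4.8, 8))  # ~1000 tokens/s upper bound
```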
Model Version | Parameter Scale | Computation Precision | GPU | Minimum Deployment Requirements |
---|---|---|---|---|
DeepSeek-V3, R1 Full Version | 671B | FP8 (Native Support) | H200, H20, H100, H800 | 8x H200 (141GB) / 8x H20 (141GB), or 16x H100 (80GB) / 16x H800 (80GB) |
DeepSeek-V3, R1 Full Version | 671B | BF16 (Converted) | A100, A800 | 16x A100 (80GB) / 16x A800 (80GB) |
DeepSeek-R1-Distill-Llama-70B | 70B | BF16 (Native Support) | H20, H100, L20, RTX 4090 | 4x H100 (80GB) / 4x A100 (80GB), or 8x L20 (48GB) / 8x RTX 4090 (48GB) |
DeepSeek-R1-Distill-Qwen-32B | 32B | BF16 (Native Support) | A100, RTX 4090 | 1x A100 (80GB), or 4x RTX 4090 (24GB) |
DeepSeek-R1-Distill-Qwen-14B | 14B | BF16 (Native Support) | RTX 4090 | 1x RTX 4090 (24GB) |
DeepSeek-R1-Distill-Qwen-7B | 7B | BF16 (Native Support) | RTX 4090 | 1x RTX 4090 (24GB) |
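As one concrete illustration of a single-node row from this table, the sketch below serves the 32B distilled model across four GPUs with tensor parallelism using vLLM's Python API. It is a minimal sketch under assumptions (model identifier, memory-utilization value, and sampling settings are illustrative); verify the options against the vLLM version you actually deploy.

```python
# Illustrative single-node, multi-GPU serving sketch using vLLM's Python API.
# Model name and parallelism settings follow the 32B row of the table above;
# check flag names and defaults against your installed vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=4,       # e.g. 4x RTX 4090 (24GB) as in the table
    dtype="bfloat16",             # distilled models natively use BF16
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)

outputs = llm.generate(
    ["Explain the difference between FP8 and BF16 in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.6),
)
print(outputs[0].outputs[0].text)
```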
Model Version | Parameter Scale | Computation Precision | Accelerator | Minimum Deployment Requirements |
---|---|---|---|---|
DeepSeek-V3, R1 Full Version | 671B | W8A8 (Quantized) | Ascend 910B | 2 Atlas 800I A2 (8x64GB) servers |
DeepSeek-V3, R1 Full Version | 671B | BF16 (Converted) | Ascend 910B | 4 Atlas 800I A2 (8x64GB) servers |
DeepSeek-R1-Distill-Llama-70B | 70B | BF16 (Native Support) | Ascend 910B | 1 Atlas 800I A2 (8x64GB) server |
DeepSeek-R1-Distill-Qwen-32B | 32B | BF16 (Native Support) | Ascend 910B, Ascend 310P | 1 Atlas 800I A2 (8x32GB) server, or 1 Atlas 300I Duo (1x96GB) server |
DeepSeek-R1-Distill-Qwen-14B | 14B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I Duo (4x48GB) server |
DeepSeek-R1-Distill-Qwen-7B | 7B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I Duo (1x48GB) server |
DeepSeek-R1-Distill-Llama-8B | 8B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I Duo (1x48GB) server |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I Duo (1x48GB) server |
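The server counts in this table follow from the same weight-memory arithmetic as above. Below is a minimal sanity-check sketch, assuming a rough 30% headroom factor for KV cache and runtime overhead (the real factor depends on the framework, batch size, and context length).

```python
# Rough sanity check of "how many servers" for the full 671B model.
# The 1.3 headroom factor is an assumption, not a measured value.
import math

def servers_needed(params_b: float, bytes_per_param: float,
                   gpus_per_server: int, gb_per_gpu: int,
                   headroom: float = 1.3) -> int:
    """Minimum whole servers to hold the weights plus a rough overhead margin."""
    need_gb = params_b * bytes_per_param * headroom   # e.g. 671 * 1 * 1.3
    per_server_gb = gpus_per_server * gb_per_gpu      # e.g. 8 * 64 = 512
    return math.ceil(need_gb / per_server_gb)

print(servers_needed(671, 1, 8, 64))  # W8A8  -> 2 Atlas 800I A2 servers
print(servers_needed(671, 2, 8, 64))  # BF16  -> 4 Atlas 800I A2 servers
```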
With the growing deployment demands for DeepSeek large models, traditional single-machine deployment solutions can no longer meet enterprise requirements. The Rise MAX-DS Pooled Appliance provides a more flexible and efficient deployment option.
For more detailed information about the Rise MAX-DS Pooled Appliance, please refer to DeepSeek AI Computing Appliance Details.
As quantization techniques and distributed inference frameworks mature, DeepSeek's cost-effectiveness and strong performance put on-premises deployment of high-performance large models within reach of more enterprises, directly promoting the democratization of computing power and accelerating the adoption of AI technology.
For example, the H20 offers 900GB/s of NVLink bandwidth, higher than the A100's 600GB/s and roughly 7 times the 128GB/s of PCIe 5.0. This allows the H20 to substantially reduce communication latency in multi-GPU computing and improve overall computational efficiency.
Single GPU Operation: small distilled models (roughly 1.5B to 14B) fit within a single GPU's memory, so no parallelism or high-speed interconnect is required.
Single-Node Multi-GPU Operation (NVLink Interconnect): mid-sized models (32B to 70B) are sharded across the GPUs of one server with tensor parallelism, keeping the per-token communication on the fast NVLink fabric.
Multi-Node Multi-GPU Operation (NVLink + InfiniBand): the full 671B models exceed a single server's memory, so they are split across multiple servers, with NVLink inside each node and InfiniBand between nodes (see the sketch after this list).
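A hedged sketch of how the multi-node case maps onto common parallelism settings, using vLLM's Python API as one example. A Ray (or similar) multi-node runtime is assumed, and the model identifier and parallel sizes mirror the 16-GPU rows of the deployment table above; they are illustrative, not prescriptive.

```python
# Multi-node sketch for the full 671B model:
# - tensor parallelism shards each layer across the 8 GPUs inside a node,
#   so the all-reduce traffic stays on NVLink;
# - pipeline parallelism splits the layer stack across the 2 nodes,
#   so only activations cross InfiniBand.
# Running this requires a multi-node (e.g. Ray) cluster already set up.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,    # GPUs per node, connected via NVLink
    pipeline_parallel_size=2,  # nodes, connected via InfiniBand
)
```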
In AI computing, W8A8 and W8A16 denote low-bit quantization schemes that reduce computational overhead and memory usage while preserving inference accuracy as much as possible. Here they are used primarily to adapt DeepSeek models to Huawei Ascend series AI accelerators so that they can run efficiently.
W8A8 characteristics: both the weights and the activations are quantized to 8-bit integers (INT8), minimizing memory traffic and making full use of INT8 compute units, at the cost of comparatively larger precision loss.
W8A16 characteristics: the weights are quantized to INT8 while the activations remain in 16-bit floating point (FP16/BF16), trading some of the speed and memory savings for better numerical accuracy.
The AI compute units (Cube units) in Huawei Ascend accelerators (such as the Ascend 910B and Ascend 310P) are specially optimized for low-bit quantized computation and are particularly efficient at INT8 and FP16 arithmetic. DeepSeek's adaptation to W8A8 and W8A16 quantization means its models can run efficiently on the Ascend platform, reducing power consumption and compute resource usage while improving inference throughput.
Scheme | Computation Speed | Memory Usage | Precision Loss | Suitable Scenarios |
---|---|---|---|---|
W8A8 | Highest | Lowest | Relatively Larger | Ultra-high throughput inference: search, recommendations, real-time AI applications |
W8A16 | Medium | Medium | Smaller (higher precision) | AI tasks requiring higher precision, such as NLP large model inference |
If your goal is efficient deployment of DeepSeek on Ascend hardware, then W8A8 is suitable for extreme performance optimization, while W8A16 is suitable for scenarios balancing precision and performance.
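To make the W8A8/W8A16 distinction concrete, here is a minimal NumPy sketch of symmetric INT8 quantization. It is illustrative only and not the Ascend toolchain's actual kernels: in W8A8 both weights and activations are quantized, while in W8A16 only the weights are.

```python
# Minimal illustration of the W8A8 vs W8A16 idea (not Ascend's actual kernels):
# weights are always stored as INT8; the schemes differ in whether activations
# are also quantized to INT8 (W8A8) or kept in 16-bit floats (W8A16).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: returns int8 values and a scale."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_w8a8(x_fp16, w_fp16):
    qx, sx = quantize_int8(x_fp16)            # activations quantized too
    qw, sw = quantize_int8(w_fp16)
    return (qx.astype(np.int32) @ qw.astype(np.int32).T) * (sx * sw)

def linear_w8a16(x_fp16, w_fp16):
    qw, sw = quantize_int8(w_fp16)            # only weights quantized
    return x_fp16 @ (qw.astype(np.float16) * sw).T

x = np.random.randn(4, 64).astype(np.float16)
w = np.random.randn(128, 64).astype(np.float16)
ref = x @ w.T
print(np.abs(linear_w8a8(x, w) - ref).mean())   # larger error, fastest on INT8 units
print(np.abs(linear_w8a16(x, w) - ref).mean())  # smaller error, medium speed
```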
Architecture | Pascal | Volta | Turing | Ampere | Ada | Hopper | Blackwell |
---|---|---|---|---|---|---|---|
Release Year | 2016 | 2017 | 2018 | 2020 | 2022 | 2022 | 2024 |
Typical GPUs | Tesla P40, GTX 1080 | Tesla V100 | T4, Quadro RTX 6000, RTX 2080 | A100, A40, RTX 3090 | RTX 6000 Ada, L40, RTX 4090 | H100, H200 | B200, RTX 5090 |
China-specific Variants | – | V100S | T4G | A800 (80GB, PCIe, SXM4) | L20, L40S | H800, H20 | B100 |
NPU Model | FP16 Performance | Memory | Corresponding Huawei Host |
---|---|---|---|
Ascend 910B4 | 280 TFLOPS | 32GB HBM2 | Atlas 800I A2 |
Ascend 910B3 | 313 TFLOPS | 64GB HBM2 | Atlas 800T A2 |
Ascend 910B2 | 376 TFLOPS | 64GB HBM2 | n/a |
Ascend 910B1 | 414 TFLOPS | 64GB HBM2 | n/a |
Comparison Item | Atlas 300I Duo | Atlas 300I Pro |
---|---|---|
Processor | 2 Ascend 310P AI processors | 1 Ascend 310P AI processor |
AI Cores | 8 AI cores | 8 AI cores |
CPU Cores | 8 self-developed CPU cores | 8 self-developed CPU cores |
Compute Power (INT8) | 280 TOPS | 140 TOPS |
Compute Power (FP16) | 140 TFLOPS | 70 TFLOPS |
Memory | 48GB or 96GB LPDDR4X | 24GB LPDDR4X |
Memory Bandwidth | 408GB/s | 204.8GB/s |
Power Consumption | 150W | 72W |
Form Factor | Single-slot full-height, full-length PCIe card | Single-slot half-height, half-length PCIe card |
Application Scenarios | High compute demand scenarios: search recommendations, content moderation, OCR recognition, video analysis, etc. | Medium compute demand scenarios: OCR recognition, speech analysis, search recommendations, etc. |
Positioning Difference: the Atlas 300I Duo packs two Ascend 310P processors and up to 96GB of memory into a full-height, full-length card for high-throughput inference, while the Atlas 300I Pro offers roughly half the compute at less than half the power in a half-height, half-length card aimed at medium-load scenarios.
To learn more about RiseUnion's GPU virtualization and computing power management solutions, please reach out to contact@riseunion.io.