
DeepSeek-V3/R1 671B Deployment Guide: GPU Requirements

睿思智联
1/1/2025

Background

With the release of DeepSeek V3 and R1 series models, their powerful performance has quickly garnered widespread attention in the AI community both domestically and internationally. The MoE (Mixture of Experts) architecture adopted by DeepSeek-V3 has become a subject of study and reference in the industry, positioning DeepSeek at the forefront of large language model technology.

Recently, demand for on-premises deployment of DeepSeek models in enterprise environments has been steadily increasing, particularly for the full-scale versions of DeepSeek-V3 and R1 671B, as well as the distilled models optimized for efficient inference. However, deploying the full-scale DeepSeek 671B models imposes extremely high requirements on computing resources, involving multiple critical factors such as GPU compute power, memory capacity, and communication bandwidth.

This article will explore the computational requirements for both full-scale and distilled versions of DeepSeek models in enterprise server environments, analyze deployment solutions suitable for different GPU hardware, and help enterprises utilize AI computing resources more efficiently.

Comparison of Mainstream Nvidia GPUs

GPU Specifications Comparison

| Parameter/Metric | A100 (80GB) | RTX 4090 (Standard) | RTX 4090 (48GB) | H20 (Standard) | H20 (141GB) | H200 |
|---|---|---|---|---|---|---|
| Memory Capacity | 80GB | 24GB | 48GB | 96GB | 141GB | 141GB |
| Memory Type | HBM2e | GDDR6X | GDDR6X | HBM3 | HBM3e | HBM3e |
| Memory Bandwidth | 2TB/s | 1TB/s | 1TB/s | 4TB/s | 4TB/s | 4.8TB/s |
| FP32 | 19.5 TFLOPS | 82.6 TFLOPS | 82.6 TFLOPS | 44 TFLOPS | 44 TFLOPS | 66.9 TFLOPS |
| FP16 | 312 TFLOPS | 165.3 TFLOPS | 165.3 TFLOPS | 148 TFLOPS | 148 TFLOPS | 1,979 TFLOPS |
| FP8 | Not Supported | Not Supported | Not Supported | 3,958 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS |
| INT8 | 624 TOPS | 661 TOPS | 661 TOPS | 296 TOPS | 296 TOPS | 3,958 TOPS |
| NVLink | 600GB/s | Not Supported | Not Supported | 900GB/s | 900GB/s | 900GB/s |
| TDP | 400W | 450W | 450W | 400W | 400W | 700W |

Three Key GPU Metrics

1. Memory Capacity (VRAM)

In deep learning and high-performance computing, memory capacity (VRAM) is one of the critical factors determining whether a model can run smoothly. Memory capacity directly affects the model’s processable scale, inference efficiency, and training/inference stability. Different versions of DeepSeek models range from billions (1.5B, 7B, 8B, 14B, 32B, and 70B) to hundreds of billions of parameters (671B full version). Insufficient memory can prevent models from loading or cause low operational efficiency.
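As a rule of thumb, the weight footprint is parameter count times bytes per parameter, plus headroom for activations and KV Cache. A minimal sketch (the 20% headroom factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight bytes plus ~20% headroom for
    activations and KV Cache (the headroom factor is an assumption)."""
    return params_billion * bytes_per_param * overhead

# 671B in native FP8 (1 byte/param): far beyond any single GPU
print(round(estimate_vram_gb(671, 1), 1))  # 805.2
# 671B converted to BF16 (2 bytes/param): roughly double
print(round(estimate_vram_gb(671, 2), 1))  # 1610.4
```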

2. Computation Precision (FP8 vs. INT8)

DeepSeek is trained and served natively in FP8, so on GPUs with hardware FP8 support the weights occupy roughly one byte per parameter (671B parameters ≈ 700GB including overhead).

However, not all GPUs have FP8 compute units (hardware support).

For GPUs that lack FP8 units but support BF16 (such as the Nvidia A100 or Ascend 910B), the model must first be converted to a supported format. Because BF16 uses two bytes per parameter instead of one, memory usage doubles to approximately 1.4TB.

DeepSeek-R1 distilled model series natively support BF16 computation format for training and inference, requiring no additional precision conversion. Compared to traditional FP16 (Half-Precision Floating Point) format, BF16 offers a wider exponent range, making it more stable in deep learning training and inference processes, especially in large-scale model computing environments.

3. Memory Bandwidth

Memory bandwidth directly affects data transfer speed and inference efficiency, especially in deep learning inference and training processes where large amounts of model parameters, activation values, and KV Cache need to be transferred within or between GPUs. If memory bandwidth is insufficient, data transfer becomes a performance bottleneck, affecting computational efficiency.
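For token-by-token generation, a memory-bound LLM must stream all active weights from VRAM once per generated token, so memory bandwidth sets a hard ceiling on single-stream decode speed. A back-of-the-envelope sketch (DeepSeek-V3's MoE activates roughly 37B of its 671B parameters per token; the numbers are idealized upper bounds that ignore KV Cache traffic and kernel overhead):

```python
def decode_ceiling_tok_s(active_weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-batch decode speed: each generated token
    requires reading all active weights from VRAM once."""
    return bandwidth_gb_s / active_weight_gb

# ~37B active params/token at FP8 -> ~37 GB read per token
print(round(decode_ceiling_tok_s(37, 4800)))  # H200 at 4.8 TB/s -> ~130 tok/s
print(round(decode_ceiling_tok_s(37, 2000)))  # A100 at 2 TB/s   -> ~54 tok/s
```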


Deployment Requirements for DeepSeek Full-Scale and Distilled Models

Support for Mainstream Nvidia Cards

| Model Version | Parameter Scale | Computation Precision | GPU | Minimum Deployment Requirements |
|---|---|---|---|---|
| DeepSeek-V3, R1 Full Version | 671B | FP8 (Native Support) | H200, H20, H100, H800 | 8x H200 (141GB) / 8x H20 (141GB), or 16x H100 (80GB) / 16x H800 (80GB) |
| DeepSeek-V3, R1 Full Version | 671B | BF16 (Converted) | A100, A800 | 16x A100 (80GB) / 16x A800 (80GB) |
| DeepSeek-R1-Distill-Llama-70B | 70B | BF16 (Native Support) | H20, H100, L20, RTX 4090 | 4x H100 (80GB) / 4x A100 (80GB), or 8x L20 (48GB) / 8x RTX 4090 (48GB) |
| DeepSeek-R1-Distill-Qwen-32B | 32B | BF16 (Native Support) | RTX 4090 | 1x A100 (80GB), or 4x RTX 4090 (24GB) |
| DeepSeek-R1-Distill-Qwen-14B | 14B | BF16 (Native Support) | RTX 4090 | 1x RTX 4090 (24GB) |
| DeepSeek-R1-Distill-Qwen-7B | 7B | BF16 (Native Support) | RTX 4090 | 1x RTX 4090 (24GB) |
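The GPU counts above can be approximated by dividing the weight footprint by the usable memory per GPU and rounding up to a power of two (tensor parallelism is conventionally run at power-of-two degrees). A sketch, assuming ~90% of VRAM is usable for weights (both figures are illustrative assumptions):

```python
import math

def tp_degree(model_gb: float, vram_per_gpu_gb: float,
              usable_frac: float = 0.9) -> int:
    """Minimum GPU count: fit the weights into usable VRAM, then round
    up to a power of two for tensor parallelism."""
    need = math.ceil(model_gb / (vram_per_gpu_gb * usable_frac))
    return 1 << (need - 1).bit_length()

print(tp_degree(700, 141))  # 671B FP8 on H200/H20 141GB -> 8
print(tp_degree(700, 80))   # 671B FP8 on H100/H800 80GB -> 16
```

Both results match the table's minimum configurations for the FP8 full version.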

Support for Ascend Cards

  • DeepSeek-V3 and R1 671B FP8 versions cannot run directly on the Ascend 910B because its hardware does not support FP8. The weights must be converted to BF16, doubling the model’s memory footprint to approximately 1.4TB.
  • Software adaptation is based on MindIE, and official support has been provided quickly.
  • As a result, full-precision inference requires 32 Ascend 910B cards (64GB memory each). For hardware and deployment details, refer to: Deploying DeepSeek-V3 in Ascend Environment and Deploying DeepSeek-R1 in Ascend Environment.
| Model Version | Parameter Scale | Computation Precision | Accelerator | Minimum Deployment Requirements |
|---|---|---|---|---|
| DeepSeek-V3, R1 Full Version | 671B | W8A8 (Quantized) | Ascend 910B | 2 Atlas 800I A2 (8x64GB) servers |
| DeepSeek-V3, R1 Full Version | 671B | BF16 (Converted) | Ascend 910B | 4 Atlas 800I A2 (8x64GB) servers |
| DeepSeek-R1-Distill-Llama-70B | 70B | BF16 (Native Support) | Ascend 910B | 1 Atlas 800I A2 (8x64GB) server |
| DeepSeek-R1-Distill-Qwen-32B | 32B | BF16 (Native Support) | Ascend 910B, Ascend 310P | 1 Atlas 800I A2 (8x32GB) server, or 1 Atlas 300I DUO (1x96GB) server |
| DeepSeek-R1-Distill-Qwen-14B | 14B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I DUO (4x48GB) server |
| DeepSeek-R1-Distill-Qwen-7B | 7B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I DUO (1x48GB) server |
| DeepSeek-R1-Distill-Llama-8B | 8B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I DUO (1x48GB) server |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | BF16 (Native Support) | Ascend 310P | 1 Atlas 300I DUO (1x48GB) server |

Rise MAX-DS Pooled Appliance: Optimal Solution for Enterprise Deployment

With the growing deployment demands for DeepSeek large models, traditional single-machine deployment solutions can no longer meet enterprise requirements. The Rise MAX-DS Pooled Appliance provides a more flexible and efficient deployment solution:

Intelligent Compute Pooling and Scheduling

  • Supports multiple DeepSeek models (from 1.5B to 671B) running collaboratively within the same compute pool
  • Dynamically adjusts compute allocation, improving GPU utilization by 30%+
  • Automatic load balancing, avoiding resource waste and inefficient occupation

Elastic Scaling Capability

  • Supports on-demand expansion without architecture reconstruction
  • Breaks through single-machine physical limitations, achieving resource pooling and sharing
  • Supports cloud-edge collaborative deployment, optimizing resource configuration

Reduced Deployment Costs and Improved Efficiency

  • Reduces initial hardware investment with on-demand expansion
  • Improves resource utilization efficiency, lowering operational costs
  • Simplifies operations management, reducing labor costs

For more detailed information about the Rise MAX-DS Pooled Appliance, please refer to DeepSeek AI Computing Appliance Details.

Conclusion

  1. To leverage DeepSeek’s native performance without quantization, FP8 compute units are the key factor determining DeepSeek deployment effectiveness.
  2. H100, H200, H800, and H20 have clear advantages in DeepSeek inference due to native FP8 support. To avoid network overhead, single-machine deployments capable of running full models (such as H200, H20) are preferred.
  3. A100/A800 require double the memory for deploying 671B versions due to lack of hardware FP8 support.
  4. Ascend 910B requires BF16 conversion due to lack of hardware FP8 support, necessitating more memory and devices to complete full-version deployment. Alternatively, W8A8 quantization can be used to support 671B deployment with fewer resources.

As quantization techniques and distributed frameworks mature, DeepSeek’s cost-effectiveness and excellent performance enable more enterprises to afford on-premises deployment of high-performance large models, directly promoting the democratization of computing power and accelerating the popularization and application of AI technology.


Appendix

GPU Interconnect (NVLink) Bandwidth

For example, H20 has 900GB/s NVLink bandwidth, higher than A100’s 600GB/s, and 7 times higher than PCIe 5.0 (128GB/s). This allows H20 to effectively reduce communication latency in multi-GPU computing, improving overall computational efficiency.

Impact of Memory Bandwidth on Model Performance

Single GPU Operation:

  • Memory bandwidth primarily affects the loading speed of model data, including parameters, activation values, and KV Cache transfer.
  • Low-precision computation modes like FP8/BF16 have lower bandwidth requirements, but when models exceed the memory capacity of a single GPU, frequent data swapping becomes limited by bandwidth.

Single-Node Multi-GPU Operation (NVLink Interconnect):

  • High bandwidth (such as H20’s 900GB/s) can accelerate parameter synchronization and KV Cache sharing between multiple GPUs, reducing communication overhead and improving inference and training efficiency.
  • For example, when inferencing 671B-level models across multiple GPUs, insufficient bandwidth makes KV Cache synchronization a bottleneck, affecting inference speed.
  • A100 uses NVLink at 600GB/s, performing adequately in single-node multi-GPU mode, but with lower communication efficiency compared to H20’s 900GB/s.
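The size of the KV Cache that must be stored and synchronized can be estimated from the model shape. Below is a generic sketch for standard attention (MHA/GQA); note that DeepSeek-V3/R1 use Multi-head Latent Attention, which compresses the KV Cache well below this formula, so the figures are illustrative for a conventional dense model only:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: int = 2) -> float:
    """KV Cache size: a K and a V vector for every layer, KV head,
    and cached token (bytes_per_el=2 for FP16/BF16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# A 70B-class GQA model (80 layers, 8 KV heads, head_dim 128) at 8K context:
print(round(kv_cache_gb(80, 8, 128, 8192, 1), 2))  # ~2.68 GB per sequence
```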

Multi-Node Multi-GPU Operation (NVLink + InfiniBand):

  • When single-node GPU count is insufficient, InfiniBand (such as 200Gb/s, 400Gb/s) is used to connect multiple servers.
  • The combination of NVLink (high-speed intra-node communication) + InfiniBand (inter-node communication) can improve cross-machine communication efficiency, but InfiniBand bandwidth is typically much lower than NVLink, resulting in higher communication overhead for multi-node deployment.
  • If the model requires cross-node Tensor Parallel or Pipeline Parallel operations, low-bandwidth InfiniBand becomes a performance bottleneck.
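The NVLink-vs-InfiniBand gap can be made concrete with the standard ring all-reduce cost model, in which each GPU sends and receives roughly 2(n-1)/n of the buffer. A sketch (400Gb/s InfiniBand ≈ 50GB/s; figures are idealized and ignore latency and protocol overhead):

```python
def ring_allreduce_ms(buffer_gb: float, n_gpus: int, link_gb_s: float) -> float:
    """Idealized ring all-reduce time: each GPU moves 2*(n-1)/n of the
    buffer over its link; latency and overhead are ignored."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_gb / link_gb_s * 1e3

# All-reducing a 1 GB buffer across 8 GPUs:
print(round(ring_allreduce_ms(1, 8, 900), 1))  # NVLink 900 GB/s -> ~1.9 ms
print(round(ring_allreduce_ms(1, 8, 50), 1))   # IB 400Gb/s (~50 GB/s) -> 35.0 ms
```

The order-of-magnitude difference is why cross-node tensor parallelism is far more sensitive to interconnect choice than single-node deployment.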

Summary

  • Higher memory bandwidth means faster data transfer, reducing communication overhead within and between GPUs, improving inference and training efficiency.
  • In single-node multi-GPU mode, high-bandwidth NVLink (such as H20’s 900GB/s) effectively reduces cross-GPU data transfer latency, improving multi-GPU computational efficiency.
  • In multi-node multi-GPU mode, although InfiniBand provides inter-node connectivity, its bandwidth is typically lower than NVLink, potentially becoming a communication bottleneck for model training and inference, especially when deploying large models (such as 671B-level).

What do W8A8 and W8A16 Quantization Mean for Ascend DeepSeek?

In AI computing, W8A8 and W8A16 represent low-bit quantization techniques used to reduce model computational overhead and memory usage while maintaining inference accuracy as much as possible. These quantization schemes are primarily used to adapt to Huawei Ascend series AI accelerators, enabling DeepSeek models to run efficiently.

1. W8A8 Quantization

  • W8 (Weight Quantization 8-bit): Weights are quantized to 8-bit (int8 or uint8).
  • A8 (Activation Quantization 8-bit): Activation values (intermediate computation results) are also quantized to 8-bit (int8 or uint8).

Characteristics:

  • Significantly reduces computational requirements and storage usage, saving 50% memory compared to FP16.
  • Suitable for extreme performance optimization scenarios, such as high-throughput AI server inference deployment.
  • Since activation values also use 8-bit quantization, compared to A16 quantization, it may introduce greater precision loss.
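The W8 side of these schemes can be illustrated with symmetric per-tensor int8 quantization. This is a minimal sketch of the general technique, not Ascend's actual kernels:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map the largest absolute
    value onto 127, then round each weight to the integer grid."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights; error is bounded by scale/2."""
    return [v * scale for v in q]

q, s = quantize_int8([0.5, -1.0, 0.25])
print(q)  # [64, -127, 32]
restored = dequantize(q, s)  # each value within one quantization step
```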

2. W8A16 Quantization

  • W8 (Weight Quantization 8-bit): Weights are still quantized to 8-bit (int8 or uint8).
  • A16 (Activation Quantization 16-bit): Activation values use 16-bit (typically FP16 or BF16).

Characteristics:

  • Computational cost falls between W8A8 and FP16: faster than FP16 while retaining some of FP16’s precision advantages.
  • Memory usage reduced by 25%-30% compared to FP16, but slightly higher than W8A8.
  • Suitable for inference tasks requiring higher precision, such as NLP or CV tasks with higher numerical stability requirements.

3. Significance for Ascend Adaptation

Huawei Ascend (such as Ascend 910B, Ascend 310P) AI compute units (Cube Units) are specially optimized for low-bit quantized computation, particularly efficient in INT8 and FP16 calculations. DeepSeek’s adaptation to W8A8 and W8A16 quantization means its models can run efficiently on the Ascend platform, reducing power consumption and computational resource usage while optimizing inference throughput.

4. Choosing Between W8A8 vs. W8A16

| Scheme | Computation Speed | Memory Usage | Precision Loss | Suitable Scenarios |
|---|---|---|---|---|
| W8A8 | Highest | Lowest | Relatively Larger | Ultra-high throughput inference: search, recommendations, real-time AI applications |
| W8A16 | Medium | Medium | Higher Precision | AI tasks requiring higher precision, such as NLP large model inference |

5. Summary

  • W8A8: Maximizes computation and memory optimization, suitable for high-throughput scenarios, may affect inference precision.
  • W8A16: Balances computation and precision, suitable for AI tasks requiring higher precision.

If your goal is efficient deployment of DeepSeek on Ascend hardware, then W8A8 is suitable for extreme performance optimization, while W8A16 is suitable for scenarios balancing precision and performance.


Nvidia GPU Architecture Series

| Architecture | Release Year | Typical GPUs | China-Specific Variants |
|---|---|---|---|
| Pascal | 2016 | Tesla P40, GTX 1080 | — |
| Volta | 2017 | Tesla V100 | V100S |
| Turing | 2018 | T4, Quadro RTX 6000, RTX 2080 | T4G |
| Ampere | 2020 | A100, A40, RTX 3090 | A800, A800 80GB, A800 PCIe, A800 SXM4 |
| Ada | 2022 | RTX 6000 Ada, L40, RTX 4090 | L20, L40S |
| Hopper | 2022 | H100, H200 | H20, H800 |
| Blackwell | 2024 | B200, RTX 5090 | B100 |

Ascend 910B Models

| NPU Model | FP16 Performance | Memory | Corresponding Huawei Host |
|---|---|---|---|
| Ascend 910B4 | 280T | 32GB HBM2 | Atlas 800I A2 |
| Ascend 910B3 | 313T | 64GB HBM2 | Atlas 800T A2 |
| Ascend 910B2 | 376T | 64GB HBM2 | n/a |
| Ascend 910B1 | 414T | 64GB HBM2 | n/a |


Comparison Between Atlas 300I DUO and Atlas 300I Pro

| Comparison Item | Atlas 300I Duo | Atlas 300I Pro |
|---|---|---|
| Processor | 2 Ascend 310P AI processors | 1 Ascend 310P AI processor |
| AI Cores | 8 AI cores | 8 AI cores |
| CPU Cores | 8 self-developed CPU cores | 8 self-developed CPU cores |
| Compute Power (INT8) | 280 TOPS | 140 TOPS |
| Compute Power (FP16) | 140 TFLOPS | 70 TFLOPS |
| Memory | 48GB or 96GB LPDDR4X | 24GB LPDDR4X |
| Memory Bandwidth | 408GB/s | 204.8GB/s |
| Power Consumption | 150W | 72W |
| Form Factor | Single-slot full-height, full-length PCIe card | Single-slot half-height, half-length PCIe card |
| Application Scenarios | High compute demand scenarios: search recommendations, content moderation, OCR recognition, video analysis, etc. | Medium compute demand scenarios: OCR recognition, speech analysis, search recommendations, etc. |

Positioning Difference:

  • Atlas 300I Duo: Suitable for high compute demand tasks, with higher compute power and memory capacity, ideal for large-scale AI inference tasks.
  • Atlas 300I Pro: Focuses on energy efficiency and compact design, suitable for medium-load AI inference tasks, with lower power consumption and more compact design.
