General

Fluent GPU Solver Hardware Buying Guide

Tagged:

    • FAQ

      INTRODUCTION

      GPU hardware is very different from CPU hardware. Understanding these differences will help you make the right buying decision.

       

      Table of Contents

      Preface

      FAQs

      1. What are the requirements to run Fluent on GPUs?
      2. How do I choose GPU cards that work for me?
      3. Which GPU cards are recommended for use with the Fluent GPU solver?
      4. Won’t the (non-recommended) card I already have work just as well as a recommended one?
      5. Assuming I use a recommended GPU card, how much faster can I expect my simulations to run?
      6. I have only a mid-range budget. Can you recommend a card for me?
      7. If you had to recommend one, all-around best card for most situations, which would it be?
      8. What if I want to use Ansys cloud solutions instead of buying my own GPU hardware?
      9. Can you recommend a card for specific models?
      10. Benchmark before buying
      11. Are there any other resources I can learn from?

       


      Preface

      What Are CPUs?

      Central Processing Units (CPUs) have relatively few flexible cores that can handle complex instruction sets. Each CPU handles serial computation, file input/output, networking, and communication with peripherals such as USB ports, keyboards, and mice. For memory-bound applications like CFD, a CPU can often calculate single-precision and double-precision variables at similar speed. (For compute-bound applications such as molecular dynamics, the CPU is roughly half as fast in double precision.)

      What Are GPUs?

      Graphics Processing Unit (GPU) hardware, on the other hand, is tuned to specific applications. GPUs use different groupings of specialized cores and are massively parallel in nature. The groupings of cores are called Streaming Multiprocessors (SMs) by NVIDIA or Compute Units (CUs) by AMD. GPU cores can only process simple instructions and are clocked slower (and are therefore more energy efficient) than CPU cores. The dramatic GPU simulation speedup occurs because there are far more cores in a GPU than in a CPU.

      GPU cores fall into several specialized categories:

      1. FP32 – single precision cores
      2. FP64 – double precision cores
      3. INT32 – long integer cores
      4. Tensor – accelerate matrix operations
      5. RT – “Ray Tracing” cores that calculate rays from an object to a camera

      GPUs do not necessarily contain all core types!  Different GPU models will have varying amounts of these cores, or none at all. This is why it is so important to understand your application and how it fits with the capabilities of a particular GPU.  

       

      Using GPUs with the Fluent GPU Solver

      The Fluent GPU solver, when run in single precision (3d), takes advantage of all the FP32 cores but does not use any FP64 cores, even if they are available. Tensor and integer cores are used as needed and are grouped nearby in the SM or CU.
      Fluent does not currently use the RT (ray tracing) cores for a CFD solution; in the future, certain radiation models may benefit from them. The Ansys Optics codes that perform ray-tracing optical analysis do use the RT cores and are faster by orders of magnitude.

       

      The Good News and Bad News

      Most of the less expensive GPU cards have no FP64 cores at all. This includes everything up to and including the Nvidia RTX 6000 Ada, which is the workstation-packaged version of the Nvidia L40 server GPU. The good news is that Fluent can be run in double precision (fluent 3ddp) on these cards even though they have no FP64 cores: the CUDA libraries emulate FP64 by using two FP32 CUDA cores. The bad news is that on such hardware, double precision means half the solve speed.

      Perhaps it is time to think carefully about double precision. Traditionally, the double-precision solver was reserved for cases that truly benefit from it: simulations driven by weak gradients, such as natural convection, or by complex physics, such as multiphase flows. In recent years, cheap, plentiful CPU RAM and a small speed penalty have made the double-precision solver an easy default.

      Today’s progress requires a closer look at the choice of “3d” or “3ddp”. The Fluent GPU solver tends to be more robust and to converge better in single precision than the Fluent CPU solver. In addition, the latest builds of Fluent introduce a hybrid-precision capability that delivers double-precision-like convergence and accuracy using more FP32 calculations. The speed-versus-cost tradeoff is therefore much larger than it has been for some time.

      Inexpensive GPU cards offer large cost efficiency if single precision solutions or raytracing are of primary interest.

      Figure 1.  Nvidia H100 GPU internals

      Figure 2.  Nvidia H100 SM layout
      All the small green squares are SMs. The number of SMs differs slightly by variant: H100 PCIe (114 SMs), H100 SXM5 (132 SMs), GH100 (144 SMs). The Grace Hopper architecture co-locates ARM CPUs with the GPUs on the card.

      Figure 3.  Nvidia H100 SM internals showing different types of cores.  Note the FP64 cores for native double precision calculations

       

      When Double Precision is Required

      Higher-end GPUs like the Nvidia H100 offer dedicated FP64 cores, so double-precision solves are fast. Lower-end GPU cards provide double precision by using two FP32 (single-precision) cores to emulate an FP64 core. Where native FP64 cores exist, no emulation is needed. The H100 cards are also faster for single-precision calculations, thanks to faster FP32 cores and higher memory bandwidth.

      [Reference comparison:  https://vast.ai/article/nvidia-h100-vs-l40s-power-meets-versatility]

      Interestingly, the H100 GPUs have no RT cores. Ansys optics simulations, which benefit from RT cores, are still fast on H100 GPUs, but cost efficiency is much better on L40 hardware, which has 142 RT cores.

      Now that we have a better understanding of the unique aspects of the GPU cards, let’s move on to discuss more fundamental aspects including particular models, RAM requirements and licensing.

       


      FAQs

      1. What are the requirements to run Fluent on GPUs?

      General requirements:

      • Fluent benefits from GPUs because of their dedicated architecture for matrix operations.
      • You can use more than one GPU on the same or on multiple computers if your model does not fit in the memory of a single GPU.
      • The sum of the memory of all GPU cards must be able to hold the model and the computation overhead.
      • Certain input/output operations still require the CPU and the main memory. Each system should have at least the same amount of system memory as the sum of the GPU memory in this system. For example, a system with two GPU cards with 40 GB each should have at least 80 GB of system memory. It is possible that more system memory than GPU memory is needed, especially for polyhedral meshes. This applies to all computers involved in the calculation.
      • You need an Ansys CFD Enterprise license with enough HPC Packs, HPC tasks or an Ansys CFD HPC Ultimate license. See also Fluent GPU Solver FAQ.

      Requirements specific to Nvidia cards:

      • The graphics card and its driver must be compatible with CUDA 11.8 or newer for Fluent 2025 R1 and CUDA 12.8 for Fluent 2025 R2. Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper, and Blackwell architectures should be compatible with CUDA 12.8. Kepler GPUs (introduced in 2013) are only supported up to Fluent 2025 R1.
      • CUDA 12.8 must be installed together with the driver for 2025 R2.

      Requirements specific to AMD cards:

      • The graphics card and its driver must be compatible with ROCm 6.0 or newer. This version was released in 2023. All RDNA and CDNA architectures are compatible.
      • AMD cards can only be used under Linux with Fluent 2025 R1 or later.

       

      2. How do I choose GPU cards that work for me?

      Consider budget and needed memory first. The best approach is to benchmark yourself using cloud services. Ansys publishes benchmark results, but we cannot consider every possible use case and every possible combination of models.

      The exact memory needs depend on the type of cells, the number of cells and boundary facets, the number of equations, the type of the solver, and the specifics of the used models. Still, it is possible to provide a rough rule of thumb as a lower limit:

      • The Fluent Cortex process needs about 55 MB of GPU memory and 725 MB of system memory. This is only needed once, independent of the number of GPUs used. Post-processing is done with this process. It can take a lot more system and GPU memory depending on what is shown.
      • Each Fluent compute process requires about 95 MB of GPU memory and 210 MB of system memory, regardless of single or double precision mode. Data passes through the system memory of these processes during I/O operations (file access). The system memory needed for this process during these operations can be larger than the GPU memory needed for computation.
      • Approximate GPU memory for the calculation of 1 million fluid cells with a two-equation turbulence model and active energy equation:

       

      Mesh type   | Single precision, segregated | Single precision, coupled | Double precision, segregated | Double precision, coupled
      Tetrahedral | 1.0 GB | 1.8 GB | 1.6 GB | 3.0 GB
      Hexahedral  | 1.2 GB | 2.2 GB | 1.9 GB | 3.6 GB
      Polyhedral  | 1.8 GB | 3.4 GB | 2.8 GB | 5.6 GB
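      The rule of thumb above can be wrapped in a quick estimator. This is a minimal sketch, not an official Ansys sizing tool: the function name is ours, and the numbers are simply the table values plus the per-process overheads quoted above.

```python
# Rough GPU memory estimate, based on the rule-of-thumb table above
# (per 1 million fluid cells, two-equation turbulence + energy equation).
GB_PER_MILLION_CELLS = {
    # (mesh, precision, solver): GB per 1M cells
    ("tet", "single", "segregated"): 1.0,
    ("tet", "single", "coupled"): 1.8,
    ("tet", "double", "segregated"): 1.6,
    ("tet", "double", "coupled"): 3.0,
    ("hex", "single", "segregated"): 1.2,
    ("hex", "single", "coupled"): 2.2,
    ("hex", "double", "segregated"): 1.9,
    ("hex", "double", "coupled"): 3.6,
    ("poly", "single", "segregated"): 1.8,
    ("poly", "single", "coupled"): 3.4,
    ("poly", "double", "segregated"): 2.8,
    ("poly", "double", "coupled"): 5.6,
}

def estimate_gpu_memory_gb(million_cells, mesh="hex", precision="double",
                           solver="coupled", n_gpus=1):
    """Lower-bound GPU memory needed per card, in GB, including the
    fixed overheads quoted above (~95 MB per compute process plus
    ~55 MB once for the Cortex process)."""
    solve = million_cells * GB_PER_MILLION_CELLS[(mesh, precision, solver)]
    overhead = 0.095 * n_gpus + 0.055  # GB
    return (solve + overhead) / n_gpus

# Example: 50 million hex cells, double precision, coupled solver, 4 GPUs
print(round(estimate_gpu_memory_gb(50, "hex", "double", "coupled", 4), 1))
```

      Remember that this is a lower limit; actual requirements grow with additional physics models and boundary facets, so benchmark before committing to hardware.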

       

      Now that you have a rough understanding of the minimum memory requirements, you can select candidates. For your convenience, you can find a few characteristics of common cards below. The theoretical computation speed in single and double precision refers to the specialized cores. Even though most cards do not have specialized double-precision cores, they can still calculate in double precision at roughly a quarter to half the speed of single precision. Memory bandwidth is important for transporting data to the SMs/CUs.

       

      Card name | Memory (GB) | Bandwidth (GB/s) | SMs/CUs | Single precision (TFLOPS) | Double precision (TFLOPS) | Released

      Workstation cards:
      RTX 6000 Ada | 48 | 960 | 142 | 91.1 | – | 2023
      RTX 5000 Ada | 32 | 576 | 100 | 65.3 | – | 2023
      RTX 4500 Ada | 24 | 432 | 60 | 39.6 | – | 2023
      RTX 4000 Ada | 20 | 360 | 48 | 26.7 | – | 2023
      RTX 2000 Ada | 16 | 224 | 22 | 12 | – | 2024
      RTX A6000 | 48 | 768 | 84 | 38.7 | – | 2022
      RTX A5500 | 24 | 768 | 80 | 34.1 | – | 2022
      RTX A5000 | 24 | 768 | 64 | 27.8 | – | 2022
      RTX A4500 | 20 | 640 | 56 | 23.7 | – | 2023
      RTX A4000 | 16 | 448 | 48 | 19.2 | – | 2022
      RTX A2000 | 6 or 12 | 288 | 26 | 8 | – | 2022
      RTX 5000 A Mobile | 16 | 576 | 76 | 42.6 | – | 2023
      RTX 4000 A Mobile | 12 | 432 | 58 | 42.6 | – | 2023
      RTX 2000 A Mobile | 8 | 256 | 24 | 14.5 | – | 2023
      Radeon Pro W7900 | 48 | 864 | 96 | 61.3 | – | 2023
      Radeon Pro W7800 48GB | 48 | 864 | 70 | 45.2 | – | 2023
      Radeon Pro W7800 | 32 | 576 | 70 | 45.2 | – | 2023
      Radeon Pro W7700 | 16 | 576 | 48 | 28.3 | – | 2023
      Radeon Pro W7600 | 8 | 288 | 32 | 21.4 | – | 2023
      Radeon Pro W6800 | 32 | 512 | 60 | 17.83 | 1.11 | 2021
      Radeon Pro W6600 | 8 | 224 | 28 | 10.4 | – | 2021

      Server cards:
      H200 SXM | 141 | 4800 | – | 67 | 37 | 2024
      H200 NVL (PCIe) | 141 | 4800 | – | 60 | 30 | 2024
      H100 SXM | 80 | 3350 | 132 | 67 | 34 | 2024
      H100 NVL (PCIe) | 94 | 3900 | 114 | 60 | 30 | 2024
      A100 SXM | 80 | 2039 | 108 | 19.5 | 9.7 | 2022
      A100 PCIe | 80 | 1935 | 108 | 19.5 | 9.7 | 2022
      L40 | 48 | 864 | 142 | 90.5 | – | 2022
      L40S | 48 | 864 | 142 | 91.6 | – | 2024
      A30 | 24 | 933 | 72 | 10.3 | 5.2 | 2022
      Instinct MI325X | 256 | 6000 | 304 | 163.4 | 163.4 | 2024
      Instinct MI300X | 192 | 5300 | 304 | 163.4 | 163.4 | 2023
      Instinct MI250X | 128 | 3200 | 220 | 95.7 | 95.7 | 2021
      Instinct MI250 | 128 | 3200 | 208 | 45.3 | 45.3 | 2021
      Instinct MI210 | 64 | 1600 | 104 | 22.6 | 22.6 | 2022

       

      3. Which GPU cards are recommended for use with the Fluent GPU solver?

      A list of tested hardware for the different Ansys products is available here: Native GPU Accelerator Capabilities

      Fluent is tested and verified with all the following Nvidia and AMD GPU cards:

      Workstation: RTX A4000, RTX A5000, RTX A6000, RTX 6000 Ada, Quadro RTX 6000
      PROS: These cards are typically affordable and readily available. They can be used for many other applications, including high-end visualization.
      CONS: Compared with high-end server cards, they are slow in double precision and do not offer as much memory.

      Server: A100, H100, Instinct MI210
      PROS: Offer the maximum calculation speed and memory for a single card that is currently available.
      CONS: The cards are expensive and can be difficult to get.

      Server: A40, L40
      PROS: Performance and price are slightly above the high-end workstation cards but much lower than the high-end server cards. For single-precision calculations they are an excellent choice.
      CONS: Compared with the high-end server cards they are slower in double precision.

      *These cards have all been internally tested by the Ansys team. However, the Fluent GPU Solver supports many more GPU cards than those mentioned above. We recommend benchmarking your GPU cards to find the best one for your application.

       

      4. Won’t the (non-recommended) card I already have work just as well as a recommended one?

      If your existing hardware is compatible with CUDA 12.8 or ROCm 6.0, Fluent should run even if the card is not recommended.

      If “non-recommended” means a gaming card, be aware that Ansys does not test gaming cards. Fluent will most likely still run on one as long as it fulfills the minimum requirements. Like the Nvidia L40, L40S, and workstation cards, gaming cards have no native FP64 cores, which reduces the speed of double-precision calculations compared to server cards, even when the GPU generation and number of streaming multiprocessors (SMs) are identical.

       

       

      5. Assuming I use a recommended GPU card, how much faster can I expect my simulations to run?

      You can get an impression of the possible speed-up from the difference in theoretical single-precision or double-precision computation speed and the difference in memory bandwidth. Which of these categories has more impact depends on the specific calculation.
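      As a rough back-of-the-envelope: when a solve is memory-bound, run time scales approximately with memory bandwidth. A minimal sketch under that assumption (the function is ours, and real speed-ups depend on the specific calculation and models):

```python
# Rough speed-up estimate for a memory-bandwidth-bound solve, assuming
# run time scales inversely with memory bandwidth (an assumption; the
# actual speed-up depends on the specific calculation).
def bandwidth_speedup(bw_new_gbs, bw_old_gbs):
    return bw_new_gbs / bw_old_gbs

# Example with bandwidth values from the card table:
# RTX A4000 (448 GB/s) -> H100 SXM (3350 GB/s)
print(round(bandwidth_speedup(3350, 448), 1))  # roughly 7.5x
```

      For compute-bound portions of a solve, compare the FP32 or FP64 TFLOPS columns instead.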

       

      6. I have only a mid-range budget. Can you recommend a card for me?

      Any graphics card that fulfills the minimum requirements is faster than the fastest workstation CPU of the same generation. Obviously, you still need a CPU, but it can be a cheaper one when combined with a powerful graphics card.

      Consider the size of your models and select a card or multiple cards that have enough memory to run your simulations. In most cases you benefit from a higher memory bandwidth.

      The Nvidia L40 and L40S come with a slightly higher price tag than the high-end workstation graphics cards. Speed and memory are also slightly higher. Both high-end workstation cards and the visualization server cards are good choices for a limited budget.

       

       

      7. If you had to recommend one, all-around best card for most situations, which would it be?

      If the computer is used for computation and visualization, a card like the L40 or L40S is an interesting choice because compared to A100, H100, and H200 it is affordable and has hardware for visualization. The A100, H100, and H200 are significantly faster for computations in double precision but lack visualization capabilities.

      If the computer is only used for computation, A100, H100, and H200 are good choices from Nvidia. These are different generations of the same class of cards. Support for Nvidia hardware is spread widely across many different software packages.

      The AMD Instinct cards are very compelling in terms of computation speed and memory. If you also plan to use other GPU-based software products besides Fluent, first check whether they support AMD hardware.

       

       

      8. What if I want to use Ansys cloud solutions instead of buying my own GPU hardware?

      “Ansys Access on Microsoft Azure” and “Ansys Gateway powered by AWS” offer single instances of one or multiple GPUs to run Fluent jobs on different configurations. Contact us to discuss if one of these offers is suitable for your needs.

       

      9. Can you recommend a card for specific models?

      There are so many possible model combinations that we cannot recommend a specific card without detailed context.
      The most important question to consider is the requirement of double precision. If this is needed, the high-end server products are more appealing despite their price tag. When judging the need for double precision, consider benchmarking your model with the GPU solver in single and double precision for accuracy and speed. Due to the different architecture of the solver, you might be able to run in single precision even when the CPU solver requires double precision.

      The second question is about the required memory. Every additional model adds to the memory consumption. Again, benchmarking can help you find the optimal amount of memory that is needed for your applications. Remember that it is not necessary that the model fits into a single card. You can distribute it across multiple GPU cards in one or multiple computers.
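      The card-count arithmetic implied above can be sketched as follows. The 1.2 overhead factor is our own placeholder assumption, not an Ansys figure; benchmark your case to find the real overhead.

```python
import math

def cards_needed(model_memory_gb, card_memory_gb, overhead_factor=1.2):
    """Smallest number of identical GPU cards whose combined memory
    holds the model plus computation overhead. The 1.2 overhead
    factor is an assumed placeholder, not an Ansys figure."""
    return math.ceil(model_memory_gb * overhead_factor / card_memory_gb)

# Example: a model needing ~180 GB on 48 GB cards (e.g. L40-class)
print(cards_needed(180, 48))
```

      Keep in mind that the system memory rule still applies: each computer should have at least as much RAM as the sum of its GPU memory.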

       

      10. Benchmark before buying

      Avoid surprises. GPU cards are bespoke hardware that require a much closer look at benefits and tradeoffs. Hopefully this document has given you a better understanding of what to look for in a GPU. However, nothing beats running your case on a GPU to gain experience. If you have absolutely no GPUs available, at least run the Fluent GPU solver on your CPU-based system (use -gpu=-1). This verifies that the models you need are available in the GPU solver and gives an accurate RAM estimate.
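      A minimal sketch of such a dry run, assuming a standard Fluent installation on the PATH; the journal file name is hypothetical and the exact flags may vary by version:

```shell
# Launch the Fluent GPU solver on a CPU-only machine to verify model
# availability and get a RAM estimate. "-gpu=-1" disables GPU offload,
# "-t8" requests 8 CPU processes, and "journal.jou" is a hypothetical
# journal file containing your case setup.
fluent 3ddp -gpu=-1 -t8 -i journal.jou
```

      Watch the reported memory usage during this run to size the GPU memory you will actually need.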

       

      11. Are there any other resources I can learn from?

      Yes, please reference the below resources to learn more: