Rocky GPU Buying Guide

July 11, 2024 at 1:56 pm

FAQ

With Ansys Rocky™ particle dynamics simulation software you can use one or more Graphics Processing Units (GPUs) to process your simulations. Before investing in new hardware, see the FAQs below for guidelines and recommendations.

Mastering Multi-GPU in Ansys Rocky Software and Enhancing Its Performance

Rocky GPU FAQs

1. Which license is required to run Rocky on GPUs?

The Ansys Rocky base product license allows you to run a single job on up to 112 GPU Streaming Multiprocessors (SMs). It makes no difference whether these SMs are on a single GPU card or spread across multiple cards.

For example, if you have one A100 card (108 SMs), you can run your Rocky simulation without needing any additional HPC license. In the same way, if you have four RTX 3060 cards (28 SM’s each), you can run on multi-GPU as the total SM’s count in this case is 112.*

To run any job with more than 112 SMs, you need to add Ansys Rocky HPC Licensing. One Ansys Rocky HPC task enables 14 SMs. For example, one Ansys Rocky HPC-8 license includes 8 tasks, which enable 112 additional SMs.

SM’s	Rocky HPC tasks
1 – 112	0
113 – 224	1 – 8
225 – 336	9 – 16
337 – 560	17 – 32
561 – 1008	33 – 64

Now consider another situation, in which you have one RTX 4090 card (128 SM’s) or five RTX 3060 cards (140 SM’s). In both cases you will need to invest in 1 Rocky HPC-8 License (see the table below).

HPC features required according to the card(s) SM count

RTX 3060			RTX 4090
Cards	SM Count	Rocky HPC-8	Cards	SM Count	Rocky HPC-8
1	28	0	1	128	1
2	56	0	2	256	2
3	84	0	3	384	3
4	112	0	4	512	4
5	140	1	5	640	5
6	168	1	6	768	6

*For more information about SMs, refer to the APPENDIX section.

Notes:

When using multiple GPU’s, licensing is based on the total number of SM’s across all GPU’s irrespective of the number of GPU’s.
All available SM’s are used on a GPU card. It is not possible to restrict usage to a subset of SM’s.
All GPU cards should reside on a single server (i.e., Ansys Rocky does not support distributed GPU computing).
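
The licensing arithmetic above can be sketched in a few lines of Python. This is an illustrative helper only, not an Ansys tool: the function names are hypothetical, and the constants (112 SMs covered by the base license, 14 SMs per HPC task, 8 tasks per HPC-8 license) are taken from the text above.

```python
import math

BASE_SMS = 112       # SMs covered by the Rocky base license (from the text above)
SMS_PER_TASK = 14    # SMs enabled by one Rocky HPC task
TASKS_PER_HPC8 = 8   # tasks included in one Rocky HPC-8 license


def hpc_tasks_needed(sm_counts):
    """Return the HPC tasks needed for a list of per-card SM counts."""
    total_sms = sum(sm_counts)  # licensing uses the total across all cards
    extra_sms = max(0, total_sms - BASE_SMS)
    return math.ceil(extra_sms / SMS_PER_TASK)


def hpc8_licenses_needed(sm_counts):
    """Return the number of HPC-8 licenses that cover the required tasks."""
    return math.ceil(hpc_tasks_needed(sm_counts) / TASKS_PER_HPC8)


print(hpc8_licenses_needed([28] * 4))   # four RTX 3060 cards, 112 SMs in total
print(hpc8_licenses_needed([128]))      # one RTX 4090 card, 128 SMs
print(hpc8_licenses_needed([128] * 2))  # two RTX 4090 cards, 256 SMs in total
```

The three calls print 0, 1 and 2, which matches the corresponding rows of the RTX 3060 / RTX 4090 table.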

2. Which GPU cards are recommended for use with Rocky?

Rocky has been tested and verified with all the following NVIDIA GPU cards:

Workstation: Titan V, Titan RTX, Quadro GP100, and Quadro GV100
- PROS: Faster when using spherical and/or shaped particles and/or SPH elements
- CONS: More expensive

Server: Tesla P100, Tesla V100, A30, A100 and H100
- PROS: Faster when using spherical and/or shaped particles and/or SPH elements
- CONS: More expensive; must be installed in a server rack; no video output

Gaming: RTX 3060, RTX 3070, RTX 3080, RTX 3090 and RTX 4090
- PROS: Faster when using only spherical particles and/or SPH elements; inexpensive; can be installed on individual workstations; has video output
- CONS: Slower when using shaped particles

For better results, use only the above recommended GPU cards during Rocky processing.

Gaming cards can have good performance when running small cases with spherical particles and/or SPH elements but may not be the best choice for simulations with shaped particles.

3. What are the minimum requirements for GPU cards that will be used for running Rocky?

There are some minimum requirements for GPU or multi-GPU processing, and you must choose one or more NVIDIA GPU cards (computing or gaming), according to the following criteria:

At least 4 GB memory.

Fast double-precision processing capabilities.

A CUDA compute capability of 6.0 or higher.

A graphics driver version that supports the CUDA version 12.8 toolkit or higher.

(See the NVIDIA website for a CUDA driver table listing which driver version supports which toolkit version. For example, the release notes of CUDA toolkit 12.9 can be useful. This documentation was released in May 2025.)
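
As a rough illustration, the two numeric criteria above (memory size and CUDA compute capability) can be checked with a small script. This is a minimal sketch: the `GpuSpec` class and `unmet_requirements` function are hypothetical names, the card specifications must be supplied by you, and the driver version and double-precision criteria are deliberately not checked here.

```python
from dataclasses import dataclass

MIN_MEMORY_GB = 4.0               # "At least 4 GB memory"
MIN_COMPUTE_CAPABILITY = (6, 0)   # "CUDA compute capability of 6.0 or higher"


@dataclass
class GpuSpec:
    name: str
    memory_gb: float
    compute_capability: tuple  # (major, minor), for example (8, 0)


def unmet_requirements(gpu):
    """Return a list describing each numeric minimum requirement the card fails."""
    problems = []
    if gpu.memory_gb < MIN_MEMORY_GB:
        problems.append(f"memory {gpu.memory_gb} GB is below {MIN_MEMORY_GB} GB")
    if tuple(gpu.compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append("CUDA compute capability is below 6.0")
    return problems


a100 = GpuSpec("A100", 40, (8, 0))
old_card = GpuSpec("hypothetical older card", 2, (5, 0))  # made-up example values

print(unmet_requirements(a100))      # empty list: both numeric criteria are met
print(unmet_requirements(old_card))  # two entries: memory and compute capability
```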

4. What cards are best for running only spherical particles? What about cases using shaped particles?

Regarding particle shapes, here are some guidelines:

When running cases with shaped particles, choosing GPUs with higher double-precision performance should be your primary focus.

When running cases using only spherical particles, choosing GPUs with higher memory bandwidth will get you faster results in your processing.

If you intend to run very large cases, with millions of particles, you should consider GPUs with larger memory size.

It is important to note that all the 3090 and 4090 cards have poor double precision but considerable memory bandwidth performance. This means they will perform very well when simulating only spherical particles, but very poorly with shaped particles. This is a critical point when you are deciding which card to purchase.
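
To make these guidelines concrete, the sketch below ranks a few cards by the metric named for each workload: double precision for shaped particles, memory bandwidth for spheres. The figures come from the comparison table in question 7; the `rank_cards` function is a hypothetical helper, and ranking on a single metric is a simplification rather than a performance prediction.

```python
# (name, memory bandwidth in GB/s, double precision in GFLOPS)
# Values copied from the comparison table in question 7.
CARDS = [
    ("Titan V", 653.0, 7450.0),
    ("A100 40 GB", 1555.0, 9746.0),
    ("RTX 3090", 936.2, 556.0),
    ("RTX 4090", 1008.0, 1290.0),
]


def rank_cards(cards, workload):
    """Return card names sorted best-first by the metric suggested for the workload."""
    if workload == "shaped":
        metric = lambda card: card[2]   # double precision matters most
    elif workload == "spheres":
        metric = lambda card: card[1]   # memory bandwidth matters most
    else:
        raise ValueError(f"unknown workload: {workload}")
    return [card[0] for card in sorted(cards, key=metric, reverse=True)]


print(rank_cards(CARDS, "shaped"))
print(rank_cards(CARDS, "spheres"))
```

With these four cards, the Titan V ranks second for shaped particles but last for spheres, while the RTX 3090 and RTX 4090 move the other way, which is the trade-off described above.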

5. Which cards are best for running SPH?

For simulations with only SPH elements, choose a GPU with high single-precision performance and high memory bandwidth to speed up your simulations. GPUs with larger memory allow you to run bigger cases with millions of SPH elements, so keep this in mind when selecting the hardware.

If you are going to run simulations with both SPH elements and DEM particles together, you must also take the tips from the previous section into account, since the performance bottleneck can be either the SPH or the DEM calculation, depending on the number of elements/particles and the particle shape.

6. Can you provide some examples for comparison?

The tables below show that the RTX 3070 Ti is less than 10% faster than the RTX 3070, which is attributed to its higher double-precision performance, with both having the same memory size. However, the RTX 3080 offers a more substantial improvement: compared with the RTX 3070, it has 25% more memory (10 GB versus 8 GB) and is about 45% faster. In this case, it would be beneficial to get the RTX 3080 card.

Comparing the RTX 3090 with the RTX 3090 Ti, both have the same memory size, and the Ti version is 12% faster. Although the performance gain is modest, the cost increase is also small, so in this case the Ti version would be the better choice. Meanwhile, if performance is a bottleneck, the RTX 4090 could be considered as an option, as it is 2x faster than the RTX 3090 with the same memory size. In this case, an assessment of the pros and cons is required, as the RTX 4090 has a higher cost and requires an HPC license due to its SM count.
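
The comparisons above can be reproduced from the table in question 7. The sketch below assumes that "faster" refers to the ratio of the double-precision figures, which is an interpretation rather than something the text states outright; the `percent_gain` and `compare` helpers are hypothetical names.

```python
# name: (memory size in GB, double precision in GFLOPS), from the table in question 7
CARDS = {
    "RTX 3070": (8, 317.4),
    "RTX 3070 Ti": (8, 339.8),
    "RTX 3080": (10, 465.1),
    "RTX 3090": (24, 556.0),
    "RTX 3090 Ti": (24, 625.0),
    "RTX 4090": (24, 1290.0),
}


def percent_gain(new, old):
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


def compare(candidate, baseline):
    """Return (memory gain %, double-precision gain %) of candidate over baseline."""
    mem_c, dp_c = CARDS[candidate]
    mem_b, dp_b = CARDS[baseline]
    return percent_gain(mem_c, mem_b), percent_gain(dp_c, dp_b)


for candidate, baseline in [
    ("RTX 3070 Ti", "RTX 3070"),
    ("RTX 3080", "RTX 3070"),
    ("RTX 3090 Ti", "RTX 3090"),
    ("RTX 4090", "RTX 3090"),
]:
    mem_gain, dp_gain = compare(candidate, baseline)
    print(f"{candidate} vs {baseline}: memory {mem_gain:+.0f}%, double precision {dp_gain:+.0f}%")
```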

7. There are a lot of cards on that list! How do I choose the one that is right for me?

Choosing the card that will work best for you depends upon the type of simulations you will be running, how fast you need those simulations to complete, and the budget available to spend on your hardware.

The tables below provide a quick comparison of the most common workstation, server and gaming cards.

*Last updated December 2023

	Card Name	Memory Size (GB)	Memory Bandwidth (GB/s)	SMs	Single Precision (TFLOPS)	Double Precision (GFLOPS)	MSRP (USD)*
Workstation Cards	Titan V	12	653	80	14.9	7450	3000
	RTX A6000	48	768	84	38.71	605	4650
	RTX 6000 Ada	48	960	142	91	1423	6800
	Quadro GP100	16	732	56	10.3	5168	7000
	Quadro GV100	32	868	80	16.66	8330	9000
	RTX A2000	12	288	26	7.9	124.8	–
	RTX A4000	16	448	48	19.2	299.5	–
	RTX A5000	24	768	64	27.8	433.9	–

Server Cards	Tesla P100	16	733	56	9.53	4763	3000
	Tesla V100	32	900	80	14	7014	8000
	A30	24	930	56	10.3	5161	6300
	A100	40	1555	108	19.5	9746	–
	A100	80	1935	108	19.5	9746	–
	H100	80	2039	114	51.22	25610	–
	L40	48	864	142	90.52	1414	–

Gaming Cards	RTX 3060 Ti	8	448	38	16.2	253.1	400
	RTX 3070	8	448	46	20.31	317.4	500
	RTX 3070 Ti	8	608.3	48	21.75	339.8	600
	RTX 3080	10	760	68	29.77	465.1	–
	RTX 3080 Ti	12	912.4	80	34.1	532.8	–
	RTX 3090	24	936.2	82	35.58	556	1000
	RTX 3090 Ti	24	1008	84	40	625	1080
	RTX 4090	24	1008	128	82.58	1290	1600

In the mid-price range, there are two generations of RTX workstation cards: the RTX A6000 and the RTX 6000 Ada. Both cost between USD 4000 and USD 7000, have the same memory size (48 GB), and have poor double-precision performance. On the other hand, the A30 has half the memory (24 GB) but blazing-fast double precision. Another mid-range budget option is the Titan V, which performs well in double precision but has a smaller memory size (12 GB).

Thus, you need to choose what you need: larger memory (better for running larger cases with only spherical particles or with SPH elements) or faster double precision (better for running cases with shaped particles).

To get both together (larger memory and faster double precision), you would need a Quadro GV100, which is even faster than the Titan V and has a memory size (32 GB) closer to that of the RTX cards, but costs 3 times as much (USD 9000).

All in all, the A100 is by far the Rocky team’s preferred choice. It has a good amount of memory, blazing-fast double precision, and it delivers the most in terms of processing capacity given its cost.

And if it turns out your simulation does not fit onto a single GPU, you can always use Rocky’s multi-GPU support to combine the memory of several GPUs.

10. Won’t the (non-recommended) card I already have work just as well as a recommended one?

Different GPU cards can differ in performance by an order of magnitude, which is why we have recommended only the cards that will have the best performance with Rocky. Just because Rocky appears to run fine on a non-recommended GPU card does not mean that the card is helping the processing performance. And if it is not helping the performance, then there is no point in running your simulations on GPUs.

To see for yourself the huge range of performance differences, visit the NVIDIA website and review the Processing Power / Single Precision / Double Precision figures of the GPU cards.

11. Assuming I use a recommended GPU card, how much faster can I expect my simulations to run?

Compared to a CPU with 8 cores, adding even one GTX 980 has been shown to speed up processing fivefold; add in three P100s and what was once a 3-day simulation can be completed in just over an hour. But it all depends upon what you are simulating, how large your case is, and how much budget you have.

Appendix: What are Streaming Multiprocessors (SMs)?

Streaming Multiprocessors (SMs) are key components of NVIDIA GPUs. They are responsible for executing parallel computations and performing tasks related to rendering and other general-purpose computing. An SM consists of multiple CUDA cores, and more powerful GPU cards typically contain more SMs.

GH100 Full GPU architecture with 144 SMs

Rocky GPU Performance Benchmark

In the past, DEM simulations were restricted to relatively small problems that used, for example, only thousands of larger particles that were mostly spherical in shape.

Continual improvements in both DEM codes and computational power have enabled closer-to-reality particle simulations. Users today can expect to simulate problems using the real particle shape and the actual particle size distribution (PSD), creating DEM simulations with many millions of particles.

However, these enhancements in simulation accuracy have come at the cost of increased computational loads in both processing time and memory requirements. Within Rocky, these loads can be offset considerably by using GPU processing abilities, which provides users with the capacity to obtain results in a more practical time frame.

The benefits of GPU

The addition of GPU processing has helped to make DEM a practical tool for engineering design. For example, the speed-up experienced by processing a simulation with even an inexpensive gaming GPU is remarkable when compared to a standard 32-core CPU machine working alone.

Since release 4 of Rocky, users have been able to use multi-GPU technology, which facilitates large-scale and/or complicated solutions that were previously impossible to tackle due to memory limitations. By combining the memory of multiple GPU cards, users have been able to overcome these limitations and achieve a substantial performance increase by aggregating the cards’ computing power.

From an investment perspective, there are many benefits to multi-GPU processing. The hardware cost of running cases with several millions of particles using multiple GPUs is much smaller than buying an equivalent CPU-based machine. The energy consumption is also less with GPUs, and GPU-based machines are also easier to upgrade by adding more cards or buying newer ones.

Moreover, in a world where we push multi-physics simulations ever farther, Rocky GPU and multi-GPU processing enable you to free up all your CPUs for coupled simulations, avoiding hardware competition.

Performance Benchmark

To better illustrate the gains in processing speed that are possible for common applications, a performance benchmark of a rotating drum (Figure 1) was developed. Multiple runs using different criteria were evaluated as explained below.

Figure 1 – Rotating drum benchmark case.

Criterion 1: Particle shape

Two different particle shapes were evaluated at the same equivalent size (Figure 2):

Spheres
Polyhedrons (shaped from 16 triangles)

The drum geometry was lengthened as the number of particles increased to keep the material cross-section consistent across the various runs.

Figure 2 – Sphere (left) and 16-triangle polyhedron (right) particle shapes used in the benchmark case.

Criterion 2: Processing type

Four different processing combinations were evaluated:

CPU: Intel(R) Xeon(R) Gold 6542Y @ 2.90 GHz on 48 cores
1 GPU: NVIDIA H100, NVIDIA A100, NVIDIA L40
2 GPUs: NVIDIA H100, NVIDIA A100, NVIDIA L40
4 GPUs: NVIDIA L40

Criterion 3: Performance measurement

Two measurements were taken at steady state to evaluate performance:

Simulation Pace (speed up), which is the amount of hardware processing time (duration) required to advance the simulation by one second. In general, a lower simulation pace indicates faster processing. The speed-up metric reported here uses the CPU pace as the reference.
GPU Memory Usage, which is the amount of memory being used on the GPU while processing the simulation. In general, a lower memory usage allows for more particles to be processed, and/or more calculations to be performed.
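
The relationship between pace and speed up can be written out as a short sketch. It assumes that speed up is the CPU pace divided by the GPU pace, which is how "considering the CPU pace as reference" is read here; the function names are hypothetical, and the timings in the example are made-up values, not benchmark results.

```python
def simulation_pace(processing_seconds, simulated_seconds):
    """Hardware processing time needed per second of simulated time."""
    return processing_seconds / simulated_seconds


def speed_up(cpu_pace, gpu_pace):
    """Speed up relative to the CPU reference (assumed to be cpu_pace / gpu_pace)."""
    return cpu_pace / gpu_pace


# Made-up timings for illustration only.
cpu_pace = simulation_pace(processing_seconds=36000.0, simulated_seconds=1.0)
gpu_pace = simulation_pace(processing_seconds=400.0, simulated_seconds=1.0)

print(speed_up(cpu_pace, gpu_pace))  # 90.0 for these illustrative numbers
```

A lower pace on the GPU therefore shows up as a speed up greater than 1.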

Benchmark results for Ansys Rocky 2025 R1

Relevant conclusions on simulation performance

The following plots (Figures 3 to 6) show the performance gains for spheres and polyhedrons for different numbers of particles, using different models and numbers of GPUs.

The results show a significant performance gain with multi-GPU versus CPU simulations: up to around 100 times faster for spheres and up to around 80 times faster for polyhedrons when comparing 2 H100 GPUs with a 48-core CPU.
Excellent scalability is achieved with all the GPU cards tested.

Figure 3 – GPU speed up based upon Simulation Pace (compared with CPU 48x cores) achieved using 16 million spheres.

Figure 4 – GPU speed up based upon Simulation Pace (compared with CPU 48x cores) achieved using 32 million spheres.

Figure 5 – GPU speed up based upon Simulation Pace (compared with CPU 48x cores) achieved using 16 million polyhedrons.

Figure 6 – GPU speed up based upon Simulation Pace (compared with CPU 48x cores) achieved using 32 million polyhedrons.

Relevant conclusions on GPU memory consumption

The following plots (Figures 7 to 10) show the total GPUs memory usage for spheres and polyhedrons for different numbers of particles using different models and numbers of GPUs.

GPU memory consumption per million particles is less than 2 GB for spheres and less than 3 GB for polyhedrons. Note: This ratio is just a general guideline and can vary with case behavior, setup, and enabled models.
A total GPU memory usage of about 90 GB to run a case with 32 million polyhedrons means that the Rocky solver can handle a similar case with more than 200 million real-shaped particles on a similar hardware configuration.
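
As a rough sizing aid, the per-million guideline above can be turned into an upper-bound estimate. This is a minimal sketch under stated assumptions: the 2 GB and 3 GB figures are treated as upper bounds on total GPU memory per million particles, memory is assumed to split evenly across cards with no per-card overhead, and the function names are hypothetical.

```python
import math

# Upper-bound guideline from the text above, in GB per million particles.
GB_PER_MILLION = {"spheres": 2.0, "polyhedrons": 3.0}


def memory_upper_bound_gb(particles_millions, shape):
    """Upper-bound estimate of total GPU memory for a case, in GB."""
    return particles_millions * GB_PER_MILLION[shape]


def min_cards_for(particles_millions, shape, card_memory_gb):
    """Smallest number of identical cards whose combined memory covers the estimate."""
    needed_gb = memory_upper_bound_gb(particles_millions, shape)
    return math.ceil(needed_gb / card_memory_gb)


print(memory_upper_bound_gb(32, "polyhedrons"))  # 96.0 GB upper bound
print(min_cards_for(32, "polyhedrons", 48))      # 2 cards with 48 GB each
print(min_cards_for(32, "polyhedrons", 80))      # 2 cards with 80 GB each
```

The 96 GB upper bound for 32 million polyhedrons is consistent with the roughly 90 GB measured in the benchmark, but actual usage varies with case behavior, setup, and enabled models.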

Figure 7 – Total GPU memory consumption using 16 million spheres.

Figure 8 – Total GPU memory consumption using 32 million spheres.

Figure 9 – Total GPU memory consumption using 16 million polyhedrons.

Figure 10 – Total GPU memory consumption using 32 million polyhedrons.

Ansys Rocky™ particle dynamics simulation software

Learn more about Rocky software in the Ansys Rocky Innovation Space.