
Ansys Fluent running slow on HPC

    • mkhademi
      Subscriber

      Hello,

      I have a question about how to optimize the solution speed on our campus HPC. I have run Fluent cases on my own workstation with 48 CPU cores and have transferred them to the HPC. Running the same cases with 128 CPU cores on the HPC is even slower than running them on my workstation. I contacted our HPC administrator, and he put forward some scenarios. He proposed that:

      "Fluent is a vendor supplied binary; like most proprietary software packages it is not compiled locally and is given to us basically as a "black box". It uses MPI for parallelization, but relies on its own MPI libraries shipped with the rest of the package (in general, MPI enabled packages want to run with the same MPI used for compilation). We have found that a number of proprietary packages appear to have been compiled with Intel compilers, and therefore (as the Intel compilers only optimize well for Intel CPUs, and are suspected of not optimizing well at all for AMD processors) do not perform well on AMD based clusters like the one on campus. But as we do not control the compilation of the package, there is little we can do about it."

      I am also using OpenMPI for parallelization on the HPC. Do you think the AMD-based cluster could be the main reason for this issue? Do you have any general guidelines? For example, when preparing the case and data files, should I use the same version of Ansys as the one on the cluster? Would it make any difference if I had Fluent read the mesh on the cluster and build and run the case there, instead of preparing it on my Windows PC?
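
      For reference, a typical Slurm batch launch of Fluent on a cluster looks roughly like the sketch below; the module name, journal file, resource numbers, and flags are illustrative placeholders rather than values confirmed for this cluster.

      #!/bin/bash
      #SBATCH --job-name=fluent_test
      #SBATCH --nodes=1
      #SBATCH --ntasks=128
      #SBATCH --time=04:00:00

      module load ansys/2023R1   # assumed module name; adjust to the cluster

      # Build a machine file from the Slurm allocation
      scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID

      # Batch launch: 3D double precision, no GUI, InfiniBand interconnect,
      # using the MPI shipped with the Fluent installation.
      fluent 3ddp -g -t$SLURM_NTASKS \
             -cnf=hosts.$SLURM_JOB_ID \
             -pib -mpi=intel \
             -i run.jou > fluent.$SLURM_JOB_ID.log 2>&1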

      Thank you so much,

      Mahdi.

       

    • Rob
      Forum Moderator

      How many cells are in the model? Are the cores all "real" as opposed to virtual, and is the data transfer hardware up to the task of passing all of the data? We've seen some hardware where running on around half of the cores gave far better performance because the data handling was lacking.

    • mkhademi
      Subscriber

      Hi Rob,

      Sorry for getting back to this question very late, and I appreciate your response. I gave up on Fluent on the HPC because it remained very slow, and I don't know whether the cores on the HPC are virtual or real. Now I am going to use Rocky DEM coupled with Fluent on the cluster, so I think I should talk to the IT staff again about Fluent running slowly. I will ask them about your questions.

      The Ansys version on the cluster is 2021 R2, the Linux system is Red Hat, and I have to use OpenMPI for parallel runs. Do you think switching to 2023 R1 would solve the issue? Does it matter which version of OpenMPI I use? Is it better to use the OpenMPI that is distributed with the Fluent package? And lastly, how can I gather diagnostic data while running the simulations that could help you identify the problem? I read somewhere that "show affinity" may help with that.

       

      Thank you so much.

    • mkhademi
      Subscriber

      Also, do you think it is important to load a specific version of OpenMPI with Ansys 2021 R2? There are multiple versions on the cluster.
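
      One way to see what is available before deciding (module and command names below are assumptions about the cluster environment):

      module avail openmpi      # list the OpenMPI modules installed on the cluster
      module list               # show which one, if any, is currently loaded
      mpirun --version          # version of the OpenMPI currently on $PATH

      # Fluent can also be told explicitly which MPI to use at launch, e.g.
      #   fluent 3ddp -g -t8 -mpi=openmpi ...
      # or left at its default so that it uses the MPI bundled with the install.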

    • Rob
      Forum Moderator

      If you're running coupled Fluent and Rocky you'll benefit from 2023 R1, as the coupling was new in 2021 R2 and has since been improved. You also want to check CPU load when running Fluent; you've not mentioned the cell count or what the operating system is.

      No idea about the MPI settings - I rely on our system already being set up. I'll get one of the install group to comment.

    • MangeshANSYS
      Ansys Employee
      Hello,

      1. Were you able to try using Ansys 2023 R1? Is the performance gap still seen when running on these two different hardware setups? If you are still seeing differences with 2023 R1:

      2. What are the processor make and model on your workstation, and how many cores do you use? What are the processor make and model on the cluster, and how many cores are used when running on the cluster?

      3. Can you please run a few iterations on the workstation and also on the cluster, and post the portions of the two transcripts showing the slowdown? Additionally, you can use these commands in the Fluent GUI to generate information on the hardware:

      (proc-stats)
      (sys-stats)
      (show-affinity)

      You can test these commands by running a simple test on your workstation. When running Fluent in batch on the cluster, you may need to use the corresponding TUI commands in a journal file.

      4. Can you please also find out more about the cluster from the cluster resources or the cluster administrator: the exact operating system version and patch level, the recommended interconnect, the file system to be used for HPC and whether you are using it, the scheduler used, etc.
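
      For the batch case on the cluster, a minimal sketch of how these diagnostics could be collected is shown below; the journal file name, case file name, and iteration count are placeholders, and /parallel/timer/usage is an extra timing command worth confirming against your release.

      # Assumed journal file diag.jou containing, one command per line:
      #   /file/read-case-data test.cas.h5
      #   (proc-stats)
      #   (sys-stats)
      #   (show-affinity)
      #   /solve/iterate 100
      #   /parallel/timer/usage
      #   exit
      #   yes
      # Run it in batch on 8 cores and keep the transcript for comparison:
      fluent 3ddp -g -t8 -i diag.jou > transcript_cluster.log 2>&1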
    • mkhademi
      Subscriber

      Thank you so much for your replies, Rob and Mangesh.

      I got some information from our IT administrators. Here is what they provided:

      1) The cluster is running RHEL 8.6. The nodes have dual AMD EPYC 7763 64-Core Processor with 512 GiB of RAM
      2) The compute nodes have HDR100 Infiniband interconnects
      3) The scratch filesystem is BeeGFS 7.3.0
      4) Scheduler is Slurm 22.05.07

      I ran a simulation on both my laptop and the cluster, using 8 cores in each case. Following your instructions, I collected the information below after running the same unsteady case for 100 iterations:

      The cluster gave me this information:

      > (proc-stats)
       
      ------------------------------------------------------------------------------
             | Virtual Mem Usage (GB)   | Resident Mem Usage(GB)   |              
      ID     | Current      Peak        | Current      Peak        | Page Faults  
      ------------------------------------------------------------------------------
      host   | 0.591854     0.592873    | 0.174324     0.174324    | 591        
      n0     | 0.818371     0.826019    | 0.175053     0.182316    | 259        
      n1     | 0.814728     0.82201     | 0.167213     0.173817    | 214        
      n2     | 0.815907     0.822296    | 0.165989     0.171776    | 255        
      n3     | 0.81435      0.821392    | 0.168133     0.174389    | 175        
      n4     | 0.814678     0.82206     | 0.165474     0.172218    | 168        
      n5     | 0.814739     0.822018    | 0.164593     0.171059    | 252        
      n6     | 0.814575     0.821838    | 0.165981     0.172546    | 253        
      n7     | 0.815868     0.822704    | 0.164452     0.1703      | 258        
      ------------------------------------------------------------------------------
      Total  | 7.11507      7.17321     | 1.51121      1.56274     | 2425       
      ------------------------------------------------------------------------------
       
      ------------------------------------------------------------------------------------------------
                          | Virtual Mem Usage (GB)    | Resident Mem Usage(GB)    | System Mem (GB)          
      Hostname            | Current      Peak         | Current      Peak         |                          
      ------------------------------------------------------------------------------------------------
      compute-b6-21.zarata| 7.11507      7.17321      | 1.51121      1.56274      | 503.141      
      ------------------------------------------------------------------------------------------------
      Total               | 7.11507      7.17321      | 1.51121      1.56274      |           
      ------------------------------------------------------------------------------------------------
      ()
       
       
      >  (sys-stats)
       
      ---------------------------------------------------------------------------------------
                          | CPU                                  | System Mem (GB)                  
      Hostname            | Sock x Core x HT  Clock (MHz) Load   | Total        Available  
      ---------------------------------------------------------------------------------------
      compute-b6-21.zarata| 2 x 64 x 1        0           58.77  | 503.141      362.15     
      ---------------------------------------------------------------------------------------
      Total               | 128               -           -      | 503.141      362.15     
      ---------------------------------------------------------------------------------------
      ()
       
       
      >  (show-affinity)
      999999: 0 22 23 24 72 73 74 75 
      0: 0 
      1: 22 23 24 
      2: 72 73 74 75 
      3: 22 23 24 
      4: 22 23 24 
      5: 72 73 74 75 
      6: 72 73 74 75 
      7: 72 73 74 75 
       
      999999: 0 22 23 24 72 73 74 75 
      0: 0 22 23 24 
      1: 0 22 23 24 
      2: 0 22 23 24 
      3: 0 22 23 24 
      4: 72 73 74 75 
      5: 72 73 74 75 
      6: 72 73 74 75 
      7: 72 73 74 75 
       
      999999: 0 22 23 24 72 73 74 75 
      0: 0 22 23 24 
      1: 0 22 23 24 
      2: 0 22 23 24 
      3: 0 22 23 24 
      4: 72 73 74 75 
      5: 72 73 74 75 
      6: 72 73 74 75 
      7: 72 73 74 75 

      The laptop gave me this information for the same case, 8 cores, and the same number of iterations:

      > (proc-stats)
       
      ----------------------------------------------
             | Virtual Mem Usage (GB)   |              
      ID     | Current      Peak        | Page Faults  
      ----------------------------------------------
      host   | 0.112896     0.144833    | 7.166e+04
      n0     | 0.0862007    0.153492    | 2.659e+06
      n1     | 0.0807686    0.134022    | 1.793e+06
      n2     | 0.0791664    0.135555    | 3.148e+06
      n3     | 0.0825119    0.134621    | 2.002e+06
      n4     | 0.0780334    0.134251    | 3.438e+06
      n5     | 0.0720711    0.130085    | 1.969e+06
      n6     | 0.0833321    0.135273    | 1.328e+06
      n7     | 0.0814285    0.135128    | 1.962e+06
      ----------------------------------------------
      Total  | 0.756409     1.23726     | 1.837e+07
      ----------------------------------------------
       
      -----------------------------------------------------------------
                      | Virtual Mem Usage (GB)    | System Mem (GB)
      Hostname        | Current      Peak         |
      -----------------------------------------------------------------
      DESKTOP-C6GP8UV | 0.75642      1.23726      | 15.7449
      -----------------------------------------------------------------
      Total           | 0.75642      1.23726      |
      -----------------------------------------------------------------
      ()
       
      > (sys-stats)
       
      ---------------------------------------------------------------------------------------
                      | CPU                                    | System Mem (GB)
      Hostname        | Sock x Core x HT  Clock (MHz) Load (%) | Total        Available
      ---------------------------------------------------------------------------------------
      DESKTOP-C6GP8UV | 1 x 8 x 2         2304        14.8064  | 15.7449      3.93291
      ---------------------------------------------------------------------------------------
      Total           | 16                -           -        | 15.7449      3.93291
      ---------------------------------------------------------------------------------------
      ()
       
      > (show-affinity)
      999999: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      0: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      1: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      2: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      3: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      4: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      5: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      6: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      7: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      ()
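
      One thing the two affinity listings above appear to show is that on the cluster several ranks share the same small sets of cores (22-24 and 72-75), while on the laptop every rank is free to float over all 16 logical CPUs. If the binding looks suspicious, a hedged Slurm-side check (flags depend on site policy) is to request one core per rank in the batch script and print what each task actually gets:

      #SBATCH --ntasks=8
      #SBATCH --cpus-per-task=1
      #SBATCH --hint=nomultithread

      # Print the CPU set assigned to each task before launching Fluent
      srun --cpu-bind=cores bash -c 'taskset -cp $$'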

       

       

       

      • kevin1986
        Subscriber

        We also have EPYC workstations and we found a similar, though not identical, issue.

        Actually, we are a vendor producing clusters. On our cluster (CentOS) with AMD EPYC CPUs, we tested Fluent 2022 and it scales almost perfectly linearly: 10 nodes, 10 times faster. So, back to your topic, I don't think it's a problem with Fluent.

        On the other hand, we are also facing a similar problem: Fluent 2022 is slower than 2020, but it only happens on Windows.

        /forum/forums/topic/new-version-fluent-is-slower-than-2020r2/

        Do you have any updates on your issue?

        • mkhademi
          Subscriber

          Hi Kevin,

          Our HPC admin installed Ansys 2023 R1, and it now seems faster to me. Still, I don't know how to compare the speed.

           

        • kevin1986
          Subscriber

          > Still, I don't know how to compare the speed. 

          At least for this question, what we typically do is test the scaling performance. You can simply try running a case on 1 node and on N nodes: if N nodes is N times faster, everything is fine. But be sure to use a relatively large mesh (more than 3 million cells, say).
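
          A minimal sketch of that kind of test (the submission script and job names are placeholders):

          # Submit the same case and journal on 1, 2 and 4 nodes, then compare
          # the wall-clock time per iteration reported in each transcript.
          for nodes in 1 2 4; do
              sbatch --nodes=$nodes --ntasks-per-node=128 \
                     --job-name=scale_$nodes fluent_run.sh
          done
          # If 4 nodes is close to 4 times faster than 1 node, scaling is fine.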

      • MangeshANSYS
        Ansys Employee

        Hello, and thank you for the processor information. Please see this document on the processor manufacturer's site and compare the BIOS and other settings that were used:
        https://www.amd.com/system/files/documents/ansys-fluent-performance-amd-epyc7003-series-processors.pdf
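
        For what it's worth, a quick way to see the current node configuration from the OS side, for comparison against that document, is:

        lscpu | grep -E 'Model name|Socket|Thread|NUMA'   # sockets, SMT, NUMA node count
        numactl --hardware                                # NUMA topology and memory per node
        cat /sys/devices/system/cpu/smt/control           # whether SMT is enabled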

    • Rob
      Forum Moderator

      It's been a holiday weekend in Europe and (I think) the US. There may be a slight delay. 

    • Rob
      Forum Moderator

      In the Parallel menu in Fluent you'll find some statistics options. They will also give you some figures on calculation speed and network performance.
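
      For a batch run, the text-interface counterparts can go straight into the journal, for example (exact menu paths may vary slightly between releases):

      /parallel/timer/reset
      /solve/iterate 50
      /parallel/timer/usage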

    • ryan.careno
      Subscriber

      Check whether the “exclusive” flag in the HPC Pack 2019 job template is set to “true”. We were seeing the same thing running on MS HPC Pack with MS-MPI, and noticed about a 30% boost in performance after changing it.

      To check or set it, you need administrative access to HPC Pack Cluster Manager (or reach out to whoever manages the HPC cluster). The setting is shown below:

       

      [Image: HPC Setting for Exclusive Mode]

       

       

      We have a 256-core cluster at the moment and mostly run 128-core jobs. The jobs were slower previously when this was set to false. Say you have a 32-core server in your cluster and you request a 24-core job: exclusive mode just means nobody else can use the remaining 8 cores on that server, but users can still use the unused servers in the cluster. This is at least my understanding of it.
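
      Since the cluster discussed earlier in this thread is scheduled with Slurm rather than HPC Pack, the closest equivalent there would presumably be the job's exclusive flag, e.g. (fluent_run.sh is a placeholder script name):

      #SBATCH --exclusive              # reserve the whole node(s) for this job
      # or per submission:
      sbatch --exclusive fluent_run.sh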

       

       

    • Rob
      Forum Moderator

      Ryan, that's correct. Running exclusive is often useful as it prevents a user from using all the RAM with (for example) 4 cores, after which someone else jumps onto the other 28 (or whatever) and everything slows down.

      Disciplined users are also useful: we launch jobs to fill a box, so multiples of 28 or 32, to avoid wasting CPU with (say) a 36-core task. We may also run 4-10 core tasks, but that "wastes" the remaining cores.

    • mkhademi
      Subscriber

      Thank you Ryan for your help. It is very helpful. I will ask our HPC staff about this. 

      Thank you Rob for commenting.

       
