Fluids

Topics related to Fluent, CFX, Turbogrid and more.

Fluent fails with Intel MPI protocol on 2 nodes

    • dv.makarov
      Subscriber

      Hello,

      ANSYS Fluent fails to start on 2 compute nodes with Intel MPI, though it starts and runs fine with the same settings (InfiniBand, Shared Memory) on a single compute node. I understand the issue may be with the OS (Rocky 8.10, which is not yet formally supported by ANSYS).

      Does anybody have experience fixing Intel MPI in a setup like this?

      Thank you!

      PS. Fluent starts and runs on 2 nodes with the OpenMPI protocol, but 4 times slower than on 1 node with Intel MPI.

    • George Karnos
      Ansys Employee

      What Version of Fluent are you using?
      What error are you receiving?

    • dv.makarov
      Subscriber

      Morning George, thank you for replying.

      1. Tested both Fluent 2023R1 and 2024R1.
      2. Following a similar discussion on this forum, we tried to set the environment variable "I_MPI_PLATFORM" to "none". Starting Fluent from the same terminal (in either text or graphical mode) still resulted in Fluent reporting "I_MPI_PLATFORM=auto".
      3. Both Fluent versions crashed with the same output:

      --------------------------------------------------------------------------------------------------------------------

      Opening input/output transcript to file "/users/ssivaraman/fluent-20241022-100917-1713090.trn".

      Auto-Transcript Start Time:  10:09:17, 22 Oct 2024

      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -host -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:43609:33349

      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:43609:33349 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node[187-188]") (rpsetvar (QUOTE gpuapp/devices) ""))"

                    Welcome to ANSYS Fluent 2024 R1

       

                    Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.

                    Unauthorized use, distribution or duplication is prohibited.

                    This product is subject to U.S. laws governing export and re-export.

                    For full Legal Notice, see documentation.

       

      Build Time: Nov 22 2023 10:07:25 EST  Build Id: 10184 

       

      Connected License Server List:  1055@193.61.145.219

       

           --------------------------------------------------------------

           This is an academic version of ANSYS FLUENT. Usage of this product

           license is limited to the terms and conditions specified in your ANSYS

           license form, additional terms section.

           --------------------------------------------------------------

      Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).

      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh -mport 10.10.15.27:10.10.15.27:41037:0

      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1713921 --rsh=ssh -genv FI_PROVIDER tcp -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1713921 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic default -mport 10.10.15.27:10.10.15.27:41037:0

      [mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node[187-188] (pid 1714765, exit code 65280)

      [mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error

      [mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error

      [mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event

      [mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies

      [mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:

      [mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.

      [mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.

      [mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.

      [mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.

      [mpiexec@node187.pri.kelvin2.alces.network]    You may try using -bootstrap option to select alternative launcher.

      --------------------------------------------------------------------------------------------------------------------

    • MangeshANSYS
      Ansys Employee

       

      Hello,

      Please set the environment variable below:

      I_MPI_HYDRA_BOOTSTRAP=ssh

      You could simply edit your .bashrc file and add the line below to it. Log out, log back in, and then try running Fluent:
      export I_MPI_HYDRA_BOOTSTRAP=ssh
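
      For example, a minimal sketch of that change (the shell commands below are illustrative, not part of the original reply):

            # append the setting to ~/.bashrc so every new login shell picks it up
            echo 'export I_MPI_HYDRA_BOOTSTRAP=ssh' >> ~/.bashrc
            # after logging out and back in, confirm it is set
            echo $I_MPI_HYDRA_BOOTSTRAP    # should print: ssh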

       

    • dv.makarov
      Subscriber

      Hello,

      The result is still the same; here is the output:

      [ssivaraman@node187 [kelvin2] ~]$  export I_MPI_BOOTSTRAP=ssh
      [ssivaraman@node187 [kelvin2] ~]$ 
      [ssivaraman@node187 [kelvin2] ~]$ echo $I_MPI_BOOTSTRAP
      ssh
      [ssivaraman@node187 [kelvin2] ~]$ 
      [ssivaraman@node187 [kelvin2] ~]$ fluent 3ddp -t256 -mpi=intel -cnf=node[187-188] -ssh -g
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -t256 -mpi=intel -cnf=node[187-188] -ssh -g
      Hostfile does not exist, will try to use it as hostname!
      ssh: Could not resolve hostname node[187-188]: Name or service not known
      ssh: Could not resolve hostname node[187-188]: Name or service not known
      /opt/apps/ansys/v241/fluent/fluent24.1.0/cortex/lnamd64/cortex.24.1.0 -f fluent -g (fluent "3ddp  -host -r24.1.0 -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh")

       

      Opening input/output transcript to file "/users/ssivaraman/fluent-20241023-145644-1854988.trn".
      Auto-Transcript Start Time:  14:56:44, 23 Oct 2024 
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -host -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:44871:44849
      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:44871:44849 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node[187-188]") (rpsetvar (QUOTE gpuapp/devices) ""))"

       

                    Welcome to ANSYS Fluent 2024 R1

       

                    Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.
                    Unauthorized use, distribution or duplication is prohibited.
                    This product is subject to U.S. laws governing export and re-export.
                    For full Legal Notice, see documentation.

       

      Build Time: Nov 22 2023 10:07:25 EST  Build Id: 10184  

      Connected License Server List:  1055@193.61.145.219

       

           --------------------------------------------------------------
           This is an academic version of ANSYS FLUENT. Usage of this product
           license is limited to the terms and conditions specified in your ANSYS
           license form, additional terms section.
           --------------------------------------------------------------
      Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh -mport 10.10.15.27:10.10.15.27:38735:0
      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1855431 --rsh=ssh -genv FI_PROVIDER tcp -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1855431 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic default -mport 10.10.15.27:10.10.15.27:38735:0
      [mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node[187-188] (pid 1856274, exit code 65280)
      [mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
      [mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:
      [mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.
      [mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
      [mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
      [mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
      [mpiexec@node187.pri.kelvin2.alces.network]    You may try using -bootstrap option to select alternative launcher.

       

      Thank you!

      Yours,
      Dmitriy

    • MangeshANSYS
      Ansys Employee

      My apologies, I had an error in the variable name; it was missing HYDRA. It needs to be

      export I_MPI_HYDRA_BOOTSTRAP=ssh

    • dv.makarov
      Subscriber

       

      Morning,
      Still crashing when attempting to start on 2 nodes:

      ============================================================================== 

      [ssivaraman@node187 [kelvin2] ~]$ module load ansys/v241/ulster
      ansys/v241/ulsterutility.c(2245):ERROR:50: Cannot open file '/opt/apps/etc/modules/ansys/v241/qub' for 'reading'

       

      |
      OK
      [ssivaraman@node187 [kelvin2] ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
      [ssivaraman@node187 [kelvin2] ~]$ echo $I_MPI_HYDRA_BOOTSTRAP
      ssh
      [ssivaraman@node187 [kelvin2] ~]$ fluent 3ddp -t256 -mpi=intel -cnf=node187,node188 -ssh -g
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -t256 -mpi=intel -cnf=node187,node188 -ssh -g
      /opt/apps/ansys/v241/fluent/fluent24.1.0/cortex/lnamd64/cortex.24.1.0 -f fluent -g (fluent "3ddp -pmpi-auto-selected  -host -r24.1.0 -t256 -mpi=intel -cnf=node187,node188 -path/opt/apps/ansys/v241/fluent -ssh")

       

      Opening input/output transcript to file "/users/ssivaraman/fluent-20241024-094521-1942286.trn".
      Auto-Transcript Start Time:  09:45:21, 24 Oct 2024 
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -pmpi-auto-selected -host -t256 -mpi=intel -cnf=node187,node188 -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:44677:44067
      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:44677:44067 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pmpi-auto-selected -mpi=intel -cnf=node187,node188 -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node187,node188") (rpsetvar (QUOTE gpuapp/devices) ""))"

       

                    Welcome to ANSYS Fluent 2024 R1

       

                    Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.
                    Unauthorized use, distribution or duplication is prohibited.
                    This product is subject to U.S. laws governing export and re-export.
                    For full Legal Notice, see documentation.

       

      Build Time: Nov 22 2023 10:07:25 EST  Build Id: 10184  

      Connected License Server List:  1055@193.61.145.219

       

           --------------------------------------------------------------
           This is an academic version of ANSYS FLUENT. Usage of this product
           license is limited to the terms and conditions specified in your ANSYS
           license form, additional terms section.
           --------------------------------------------------------------
      Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).
      /opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pmpi-auto-selected -mpi=intel -cnf=node187,node188 -ssh -mport 10.10.15.27:10.10.15.27:46493:0
      Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1942769 --rsh=ssh -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1942769 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic mpi-auto-selected -mport 10.10.15.27:10.10.15.27:46493:0
      [mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node188 (pid 1943623, exit code 65280)
      [mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
      [mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
      [mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:
      [mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.
      [mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
      [mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
      [mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
      [mpiexec@node187.pri.kelvin2.alces.network]    You may try using -bootstrap option to select alternative launcher.

    • MangeshANSYS
      Ansys Employee

      Hello,

      Can you please check the 4 possible reasons shown in the error?

      Check that node187 can resolve the hostname of node188 to an IPv4 address on the correct interface.
      Verify that passwordless ssh works from node187 (running something like "ssh node188 whoami" should print your username).
      Also verify that a firewall is not blocking the ports used by Intel MPI. (A few example checks are sketched below.)

      What is the scheduler on this cluster? Check whether the cluster administrator requires something other than ssh.
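
      A minimal sketch of those checks, run from node187 (the hostnames are taken from this thread; the port range is an illustrative assumption):

            getent hosts node188             # should resolve node188 to the expected IPv4 address
            ssh node188 whoami               # passwordless ssh should print your username with no prompt
            ssh node188 hostname -I          # shows which addresses/interfaces the remote host reports
            # if a firewall is involved, restrict Intel MPI to an open port range, e.g.
            export I_MPI_PORT_RANGE=30000:30100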

    • dv.makarov
      Subscriber

      Hello Mangesh,

      My colleague performed some tests together with the HPC team and was able to run ANSYS 2024R2 on two nodes. The environment variable was set as instructed:
            > export I_MPI_HYDRA_BOOTSTRAP=ssh
      Then Fluent was started via a batch file with the following fluent command:
            > fluent 3ddp -g -t$SLURM_NTASKS -pib -cnf=hosts_cpus_list -i task.jou
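
      (For reference, a minimal Slurm batch sketch along these lines; the resource numbers and the way hosts_cpus_list is built are assumptions, not the exact script used:)

            #!/bin/bash
            #SBATCH --nodes=2
            #SBATCH --ntasks-per-node=128
            # assumed sketch: build the Fluent hosts file from the Slurm allocation
            scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts_cpus_list
            export I_MPI_HYDRA_BOOTSTRAP=ssh
            fluent 3ddp -g -t$SLURM_NTASKS -pib -cnf=hosts_cpus_list -i task.jou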

      Fluent ran with Intel MPI using both nodes (each with 128 cores). However, it was only 27% faster than on a single node - would that performance boost be typical or expected? The case uses a 2.5M-CV mesh and includes LES turbulence modelling, non-reacting species transport and two-phase flow.
      Thank you!

      Yours,
      Dmitriy

    • MangeshANSYS
      Ansys Employee

      Great!
      Adding -pib seems to have done the trick.

    • dv.makarov
      Subscriber

      Would you expect a 27% boost with InfiniBand, or higher? With CentOS (before the HPC facility moved to Rocky 8.10), our simulations on 2 nodes ran nearly twice as fast - close to a 100% boost, compared to 27% now!

       
