TAGGED: ansys-fluent, intel-mpi, parallel-processing
October 18, 2024 at 3:19 pm, dv.makarov (Subscriber)
Hello,
ANSYS Fluent fails to start on 2 compute nodes with Intel MPI, though it starts and runs fine with the same settings (InfiniBand, Shared Memory) on a single compute node. I understand the issue may be with the OS (Rocky 8.10, which is not yet formally supported by ANSYS).
If anybody has experience fixing this kind of Intel MPI problem, please share it.
Thank you!
PS. Fluent starts and runs on 2 nodes with OpenMPI, but 4 times slower than on 1 node with Intel MPI.
-
October 21, 2024 at 3:31 pm, George Karnos (Ansys Employee)
What version of Fluent are you using?
What error are you receiving?
-
October 22, 2024 at 10:58 am, dv.makarov (Subscriber)
Morning George, thank you for replying.
- Tested both Fluent 2023R1 and 2024R1.
- Following a similar discussion on this forum, we tried setting the environment variable "I_MPI_PLATFORM" to "none". Starting Fluent from the same terminal (in either text or graphical mode) still resulted in Fluent reporting "I_MPI_PLATFORM=auto".
- Both Fluent versions crashed with the same output:
--------------------------------------------------------------------------------------------------------------------
Opening input/output transcript to file "/users/ssivaraman/fluent-20241022-100917-1713090.trn".
Auto-Transcript Start Time: 10:09:17, 22 Oct 2024
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -host -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:43609:33349
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:43609:33349 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node[187-188]") (rpsetvar (QUOTE gpuapp/devices) ""))"
Welcome to ANSYS Fluent 2024 R1
Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.
Unauthorized use, distribution or duplication is prohibited.
This product is subject to U.S. laws governing export and re-export.
For full Legal Notice, see documentation.
Build Time: Nov 22 2023 10:07:25 EST Build Id: 10184
Connected License Server List: 1055@193.61.145.219
--------------------------------------------------------------
This is an academic version of ANSYS FLUENT. Usage of this product
license is limited to the terms and conditions specified in your ANSYS
license form, additional terms section.
--------------------------------------------------------------
Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh -mport 10.10.15.27:10.10.15.27:41037:0
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1713921 --rsh=ssh -genv FI_PROVIDER tcp -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1713921 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic default -mport 10.10.15.27:10.10.15.27:41037:0
[mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node[187-188] (pid 1714765, exit code 65280)
[mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:
[mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node187.pri.kelvin2.alces.network] You may try using -bootstrap option to select alternative launcher.
--------------------------------------------------------------------------------------------------------------------
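In the transcript above, the Fluent launcher passes its own values to mpirun through `-genv` arguments (for example `-genv I_MPI_PLATFORM auto` and `-genv FI_PROVIDER tcp`), which would be consistent with the exported I_MPI_PLATFORM value not showing up. A minimal sketch for listing those pairs from a pasted mpirun line follows; the function name `list_genv` is hypothetical and not part of Fluent or Intel MPI.

```shell
# Print NAME=VALUE for every "-genv NAME VALUE" triple found on stdin.
# list_genv is a hypothetical helper written for this thread.
list_genv() {
  # Put one token per line, then read the two tokens after each "-genv".
  tr -s ' ' '\n' | awk '$0 == "-genv" { getline name; getline value; print name "=" value }'
}

# Usage (the transcript path is whatever Fluent reported on startup):
#   grep 'bin/mpirun' fluent-20241022-100917-1713090.trn | list_genv
```

Comparing this list between a working and a failing launch is one way to see which settings the launcher chose on its own.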
-
October 22, 2024 at 5:36 pm, MangeshANSYS (Ansys Employee)
Hello,
Please set the environment variable below.
I_MPI_HYDRA_BOOTSTRAP=ssh
You could simply edit your .bashrc file and add the line below to it. Log out, log back in, and then try running Fluent.
export I_MPI_HYDRA_BOOTSTRAP=ssh
-
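For anyone scripting the .bashrc change, here is a small sketch that appends the export line only if it is not already present, so repeated runs do not duplicate it. The helper name `add_export_once` is an assumption made for illustration.

```shell
# Append a line to a file unless an identical line is already there.
# add_export_once is a hypothetical helper, not an Ansys or Intel tool.
add_export_once() {
  line=$1
  file=$2
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   add_export_once 'export I_MPI_HYDRA_BOOTSTRAP=ssh' "$HOME/.bashrc"
```

After logging back in, `echo $I_MPI_HYDRA_BOOTSTRAP` should print `ssh` if the line was picked up.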
October 23, 2024 at 2:14 pm, dv.makarov (Subscriber)
Hello,
The result is still the same, here we are:
[ssivaraman@node187 [kelvin2] ~]$ export I_MPI_BOOTSTRAP=ssh
[ssivaraman@node187 [kelvin2] ~]$
[ssivaraman@node187 [kelvin2] ~]$ echo $I_MPI_BOOTSTRAP
ssh
[ssivaraman@node187 [kelvin2] ~]$
[ssivaraman@node187 [kelvin2] ~]$ fluent 3ddp -t256 -mpi=intel -cnf=node[187-188] -ssh -g
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -t256 -mpi=intel -cnf=node[187-188] -ssh -g
Hostfile does not exist, will try to use it as hostname!
ssh: Could not resolve hostname node[187-188]: Name or service not known
ssh: Could not resolve hostname node[187-188]: Name or service not known
/opt/apps/ansys/v241/fluent/fluent24.1.0/cortex/lnamd64/cortex.24.1.0 -f fluent -g (fluent "3ddp -host -r24.1.0 -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh")
Opening input/output transcript to file "/users/ssivaraman/fluent-20241023-145644-1854988.trn".
Auto-Transcript Start Time: 14:56:44, 23 Oct 2024
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -host -t256 -cnf=node[187-188] -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:44871:44849
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:44871:44849 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node[187-188]") (rpsetvar (QUOTE gpuapp/devices) ""))"
Welcome to ANSYS Fluent 2024 R1
Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.
Unauthorized use, distribution or duplication is prohibited.
This product is subject to U.S. laws governing export and re-export.
For full Legal Notice, see documentation.
Build Time: Nov 22 2023 10:07:25 EST Build Id: 10184
Connected License Server List: 1055@193.61.145.219
--------------------------------------------------------------
This is an academic version of ANSYS FLUENT. Usage of this product
license is limited to the terms and conditions specified in your ANSYS
license form, additional terms section.
--------------------------------------------------------------
Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pdefault -mpi=intel -cnf=node[187-188] -ssh -mport 10.10.15.27:10.10.15.27:38735:0
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1855431 --rsh=ssh -genv FI_PROVIDER tcp -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1855431 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic default -mport 10.10.15.27:10.10.15.27:38735:0
[mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node[187-188] (pid 1856274, exit code 65280)
[mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:
[mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node187.pri.kelvin2.alces.network] You may try using -bootstrap option to select alternative launcher.
Thank you!
Yours,
Dmitriy
-
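The "Could not resolve hostname node[187-188]" lines in the transcript above suggest that the bracketed range passed to -cnf was used as a literal hostname rather than expanded. A minimal sketch of expanding such a range into a comma-separated list is shown below; `expand_nodes` is a hypothetical helper, it handles only a single trailing `[first-last]` range, and it does not preserve zero padding. On a Slurm cluster, `scontrol show hostnames` is the more robust way to do this.

```shell
# Expand "prefix[first-last]" into "prefixFIRST,...,prefixLAST".
# Anything that does not end in a bracketed numeric range is printed unchanged.
expand_nodes() {
  printf '%s\n' "$1" | awk '
    match($0, /\[[0-9]+-[0-9]+\]$/) {
      prefix = substr($0, 1, RSTART - 1)           # text before the bracket
      range  = substr($0, RSTART + 1, RLENGTH - 2) # "first-last" without brackets
      split(range, bounds, "-")
      out = ""
      for (i = bounds[1] + 0; i <= bounds[2] + 0; i++)
        out = out (out == "" ? "" : ",") prefix i
      print out
      next
    }
    { print }'
}

# Usage:
#   fluent 3ddp -t256 -mpi=intel -cnf="$(expand_nodes 'node[187-188]')" -ssh -g
```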
October 23, 2024 at 2:35 pm, MangeshANSYS (Ansys Employee)
My apologies, I had an error in the variable name: it was missing HYDRA. It needs to be
export I_MPI_HYDRA_BOOTSTRAP=ssh
-
October 24, 2024 at 8:17 am, dv.makarov (Subscriber)
Morning,
Still crashing on an attempt to start on 2 nodes:
==============================================================================
[ssivaraman@node187 [kelvin2] ~]$ module load ansys/v241/ulster
ansys/v241/ulsterutility.c(2245):ERROR:50: Cannot open file '/opt/apps/etc/modules/ansys/v241/qub' for 'reading'|
OK
[ssivaraman@node187 [kelvin2] ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
[ssivaraman@node187 [kelvin2] ~]$ echo $I_MPI_HYDRA_BOOTSTRAP
ssh
[ssivaraman@node187 [kelvin2] ~]$ fluent 3ddp -t256 -mpi=intel -cnf=node187,node188 -ssh -g
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -t256 -mpi=intel -cnf=node187,node188 -ssh -g
/opt/apps/ansys/v241/fluent/fluent24.1.0/cortex/lnamd64/cortex.24.1.0 -f fluent -g (fluent "3ddp -pmpi-auto-selected -host -r24.1.0 -t256 -mpi=intel -cnf=node187,node188 -path/opt/apps/ansys/v241/fluent -ssh")
Opening input/output transcript to file "/users/ssivaraman/fluent-20241024-094521-1942286.trn".
Auto-Transcript Start Time: 09:45:21, 24 Oct 2024
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -pmpi-auto-selected -host -t256 -mpi=intel -cnf=node187,node188 -path/opt/apps/ansys/v241/fluent -ssh -cx node187.pri.kelvin2.alces.network:44677:44067
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_host/fluent.24.1.0 host -cx node187.pri.kelvin2.alces.network:44677:44067 "(list (rpsetvar (QUOTE parallel/function) "fluent 3ddp -flux -node -r24.1.0 -t256 -pmpi-auto-selected -mpi=intel -cnf=node187,node188 -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "256") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/opt/apps/ansys/v241/fluent") (rpsetvar (QUOTE parallel/hostsfile) "node187,node188") (rpsetvar (QUOTE gpuapp/devices) ""))"
Welcome to ANSYS Fluent 2024 R1
Copyright 1987-2024 ANSYS, Inc. All Rights Reserved.
Unauthorized use, distribution or duplication is prohibited.
This product is subject to U.S. laws governing export and re-export.
For full Legal Notice, see documentation.
Build Time: Nov 22 2023 10:07:25 EST Build Id: 10184
Connected License Server List: 1055@193.61.145.219
--------------------------------------------------------------
This is an academic version of ANSYS FLUENT. Usage of this product
license is limited to the terms and conditions specified in your ANSYS
license form, additional terms section.
--------------------------------------------------------------
Host spawning Node 0 on machine "node187.pri.kelvin2.alces.network" (unix).
/opt/apps/ansys/v241/fluent/fluent24.1.0/bin/fluent -r24.1.0 3ddp -flux -node -t256 -pmpi-auto-selected -mpi=intel -cnf=node187,node188 -ssh -mport 10.10.15.27:10.10.15.27:46493:0
Starting /opt/apps/ansys/v241/fluent/fluent24.1.0/multiport/mpi/lnamd64/intel2021/bin/mpirun -f /tmp/fluent-appfile.ssivaraman.1942769 --rsh=ssh -genv FLUENT_ARCH lnamd64 -genv I_MPI_DEBUG 0 -genv I_MPI_ADJUST_GATHERV 3 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_PLATFORM auto -genv PYTHONHOME /opt/apps/ansys/v241/fluent/fluent24.1.0/../../commonfiles/CPython/3_10/linx64/Release/python -genv FLUENT_PROD_DIR /opt/apps/ansys/v241/fluent/fluent24.1.0 -genv FLUENT_AFFINITY 0 -genv I_MPI_PIN enable -genv KMP_AFFINITY disabled -machinefile /tmp/fluent-appfile.ssivaraman.1942769 -np 256 /opt/apps/ansys/v241/fluent/fluent24.1.0/lnamd64/3ddp_node/fluent_mpi.24.1.0 node -mpiw intel -pic mpi-auto-selected -mport 10.10.15.27:10.10.15.27:46493:0
[mpiexec@node187.pri.kelvin2.alces.network] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node188 (pid 1943623, exit code 65280)
[mpiexec@node187.pri.kelvin2.alces.network] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node187.pri.kelvin2.alces.network] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node187.pri.kelvin2.alces.network] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node187.pri.kelvin2.alces.network] Possible reasons:
[mpiexec@node187.pri.kelvin2.alces.network] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node187.pri.kelvin2.alces.network] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node187.pri.kelvin2.alces.network] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node187.pri.kelvin2.alces.network] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node187.pri.kelvin2.alces.network] You may try using -bootstrap option to select alternative launcher.
-
October 28, 2024 at 3:29 pm, MangeshANSYS (Ansys Employee)
Hello,
Can you please check the 4 possible reasons shown in the error?
- Check whether node187 can resolve the hostname of node188 to an IPv4 address on the correct interface.
- Verify that passwordless ssh works from node187 (running something like "ssh node188 whoami" should print your username).
- Also verify that a firewall is not blocking the ports used by Intel MPI.
What is the scheduler on this cluster? Check whether the cluster administrator requires something other than ssh.
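The first two checks above can be scripted roughly as follows. This is only a sketch: `check_node` is a hypothetical helper, the 5-second timeout is an arbitrary choice, and the firewall and scheduler questions are not covered. `BatchMode=yes` makes ssh fail instead of prompting, so a password prompt counts as a failure.

```shell
# Check IPv4 name resolution and passwordless ssh for one host.
# Returns 1 if the name does not resolve, 2 if ssh cannot log in unattended.
check_node() {
  host=$1
  if ! getent ahostsv4 "$host" >/dev/null 2>&1; then
    echo "$host: hostname does not resolve to an IPv4 address"
    return 1
  fi
  if ! user=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" whoami 2>/dev/null); then
    echo "$host: passwordless ssh failed"
    return 2
  fi
  echo "$host: ok (remote user: $user)"
}

# Usage, run from node187:
#   check_node node188
```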
-
October 28, 2024 at 5:26 pm, dv.makarov (Subscriber)
Hello Mangesh,
My colleague performed some tests together with the HPC team and was able to run ANSYS 2024R2 on two nodes. The environment variable was set as instructed:
> export I_MPI_HYDRA_BOOTSTRAP=ssh
Then Fluent was started from a batch file with the following fluent command:
> fluent 3ddp -g -t$SLURM_NTASKS -pib -cnf=hosts_cpus_list -i task.jou
Fluent ran with Intel MPI using both nodes (each having 128 cores). However, it was only 27% faster than on a single node. Would this speedup be typical or expected? The case uses a mesh of 2.5M CVs and includes LES turbulence modelling, non-reacting species transport, and two-phase flow.
Thank you!
Yours,
Dmitriy
-
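The batch file itself is not shown in the thread, so the following is only a guess at its general shape under Slurm. The `#SBATCH` values are assumptions taken from "2 nodes, each having 128 cores", and writing one hostname per line into hosts_cpus_list is an assumption about that file's format; `run_fluent` is a hypothetical wrapper around the command quoted above.

```shell
#!/bin/bash
#SBATCH --nodes=2                # assumption: two nodes, as in the thread
#SBATCH --ntasks-per-node=128    # assumption: 128 cores per node

export I_MPI_HYDRA_BOOTSTRAP=ssh

# Wrap the fluent command quoted in the post.
# $1 = hosts file, $2 = journal file; SLURM_NTASKS is set by Slurm.
run_fluent() {
  fluent 3ddp -g -t"$SLURM_NTASKS" -pib -cnf="$1" -i "$2"
}

# Only launch inside a Slurm allocation, where the node list is defined.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
  # Assumption: the hosts file holds one allocated hostname per line.
  scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts_cpus_list
  run_fluent hosts_cpus_list task.jou
fi
```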
October 28, 2024 at 5:49 pm, MangeshANSYS (Ansys Employee)
Great!
Adding -pib seems to have done the trick.
-
October 28, 2024 at 6:04 pm, dv.makarov (Subscriber)
Would you expect a 27% boost with InfiniBand, or higher? With CentOS (before the HPC facility moved to Rocky 8.10), our simulations on 2 nodes ran nearly twice as fast, close to a 100% boost compared to 27% now!
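To put the two figures on a common scale: a run that is 27% faster on 2 nodes has a speedup of 1.27, while "nearly twice as fast" is a speedup close to 2. The sketch below converts a speedup into parallel efficiency and also computes the average number of cells per MPI rank for the 2.5M-cell mesh on 256 ranks. Both function names are hypothetical, and the thread does not establish whether partition size plays any role in the slowdown.

```shell
# Parallel efficiency in percent: 100 * speedup / number_of_nodes.
parallel_efficiency() {
  awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * s / n }'
}

# Average number of mesh cells handled by each MPI rank.
cells_per_rank() {
  awk -v c="$1" -v r="$2" 'BEGIN { printf "%.0f\n", c / r }'
}

parallel_efficiency 1.27 2    # current run on Rocky: prints 63.5
parallel_efficiency 2.00 2    # ideal doubling on 2 nodes: prints 100.0
cells_per_rank 2500000 256    # prints 9766
```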
© 2024 Copyright ANSYS, Inc. All rights reserved.