TAGGED: ansys-hfss, batch-hpc, hpc-cluster, mpi-with-slurm, multiple-nodes, slurm
-
May 30, 2024 at 2:35 pm - JOHN LAGRONE, Subscriber
We have not been able to get SLURM integration working with Electronics Desktop components (e.g., HFSS) on multi-node jobs (our preference), or with RSM (which previously worked for us), in 24.1.
We have modified the ansoftrsmservice.cfg to contain:
   $begin 'Scheduler'
       'SchedulerName'='generic'
       'ConfigString'='{"Proxy":"slurm"}'
    $end 'Scheduler'

and set multiple relevant environment variables, e.g.:
ANSYS_HOME=/hpc/m3/apps/ansys/24R1/AnsysEM/v241/Linux64
export PATH=${ANSYS_HOME}:$PATH
export ANSOFT_DEBUG_LOG=$HOME/ansys_debug.log
export ANSYSEM_ROOT241=${ANSYS_HOME}
export ANSYSEM_COMMON_PREFIX=${ANSYS_HOME}/common
export ANSYSEM_TASKS_PER_NODE=$( echo ${SLURM_TASKS_PER_NODE} | cut -f 1 -d \()
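# (SLURM_TASKS_PER_NODE can come back as e.g. "4(x2)" on uniform multi-node
#  allocations, so the cut above keeps just the leading task count)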
export ANSYS_EM_EXEC_DIR=${ANSYS_HOME}
export ANSYSEM_GENERIC_MPI_WRAPPER=${ANSYS_HOME}/schedulers/scripts/utils/slurm_srun_wrapper.sh
export ANS_NODEPCHECK=1
export ANSYSEM_GENERIC_EXEC_PATH=${ANSYS_HOME}/common/mono/Linux64/bin:${ANSYS_HOME}/common/IronPython
export ANSYS_EM_GENERIC_COMMON_TEMP=/tmp

Our batch options look like:
$begin 'Config'
'HFSS/NumCoresPerDistributedTask'=4
'HFSS 3D Layout Design/NumCoresPerDistributedTask'=4
'HFSS/HPCLicenseType'='Pool'
'HPCLicenseType'='Pool'
'HFSS/SolveAdaptiveOnly'=0
'HFSS/MPIVendor'='Intel'
'HFSS 3D Layout Design/MPIVendor'='Intel'
'Maxwell 2D/MPIVendor'='Intel'
'Maxwell 3D/MPIVendor'='Intel'
'Q3D Extractor/MPIVendor'='Intel'
'Icepak/MPIVendor'='Intel'
'HFSS/RemoteSpawnCommand'='scheduler'
'HFSS 3D Layout Design/RemoteSpawnCommand'='scheduler'
'Maxwell 3D/RemoteSpawnCommand'='scheduler'
'Maxwell 2D/RemoteSpawnCommand'='scheduler'
'Q3D Extractor/RemoteSpawnCommand'='scheduler'
'Icepak/RemoteSpawnCommand'='scheduler'
'Desktop/Settings/ProjectOptions/ProductImprovementOptStatus'='0'
'Desktop/Settings/ProjectOptions/AnsysEMPreferredSubnetAddress'='10.215.24.0/21'
$end 'Config'

It appears that slurm_srun_wrapper.sh is not setting the host(s) correctly. For example, this is from a log that was generated:
(02:14:48 PM May 24, 2024) Command = /hpc/m3/apps/ansys/24R1/AnsysEM/v241/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh
(02:14:48 PM May 24, 2024) All args = --nodelist c001.infiniband.cluster,c002.infiniband.cluster -N 2 -n 2 --input none --external-launcher /hpc/m3/apps/ansys/24R1/AnsysEM/v241/Linux64/common/fluent_mpi/multiport/mpi/lnamd64/intel/bin/pmi_proxy --control-port c001.cm.cluster:36271 --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher slurm --launcher-exec /hpc/m3/apps/ansys/24R1/AnsysEM/v241/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh --launcher-exec-args --external-launcher --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1142573241 --usize -2 --proxy-id -1

The wrapper sets host=$3, which in this case is "-N". We assume something is not being passed correctly, but the end result is a malformed srun command that kills our SLURM daemons. Changing it to host=$2 does pick up an actual hostname, but the solve still does not appear to communicate between nodes.
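To illustrate what we think is going wrong (this is just our reading of the argument list above, not the shipped wrapper): with those args the positional variables land on the flag rather than on the hostnames, and something like a scan for --nodelist is needed to recover the node list reliably:

# Illustration only -- not the vendor slurm_srun_wrapper.sh. With the logged
# argument list:
#   $1 = --nodelist
#   $2 = c001.infiniband.cluster,c002.infiniband.cluster
#   $3 = -N
# so host=$3 picks up "-N". Scanning for the flag instead:
host=""
prev=""
for arg in "$@"; do
    if [ "$prev" = "--nodelist" ]; then
        host="$arg"
        break
    fi
    prev="$arg"
done
echo "node list resolved to: ${host}"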
We've also been unable to get a multi-node job to work with RSM the way we have previously (we did something very similar to https://www.chpc.utah.edu/documentation/software/scripts/run_ansysedt.slr; a trimmed-down sketch of our submission script follows at the end of this post). We also see "errors" in the RSM logs saying we have multiple network interfaces (which is true) and telling us to see the documentation on how to set the appropriate interface. We have been unable to find that documentation for the RSM bundled with Electronics Desktop; we can only find the documentation for the full RSM, which says to set the interface with rsmutil, but that utility does not exist in the Electronics Desktop installation (e.g. https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v241/en/wb_rsm/wb_rsm_mult_nic.html).
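For reference, here is roughly what that submission script looks like (job name, partition, walltime, core counts, and project name are placeholders; the exports are the ones listed earlier in this post):

#!/bin/bash
#SBATCH --job-name=hfss_multinode      # placeholder job name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=02:00:00
#SBATCH --partition=standard           # placeholder partition

# Environment setup as listed earlier in this post
ANSYS_HOME=/hpc/m3/apps/ansys/24R1/AnsysEM/v241/Linux64
export PATH=${ANSYS_HOME}:$PATH
export ANSYSEM_ROOT241=${ANSYS_HOME}
export ANSYSEM_GENERIC_MPI_WRAPPER=${ANSYS_HOME}/schedulers/scripts/utils/slurm_srun_wrapper.sh
export ANSYSEM_TASKS_PER_NODE=$( echo ${SLURM_TASKS_PER_NODE} | cut -f 1 -d \()
# ... plus the remaining exports from the list above

# Non-graphical distributed batch solve; batch_options.cfg contains the
# 'Config' block shown above, and project.aedt is a placeholder project.
${ANSYS_HOME}/ansysedt -ng -monitor -distributed \
    -batchoptions batch_options.cfg \
    -batchsolve project.aedt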
-
July 12, 2024 at 2:10 pm - randyk, Ansys Employee
Discussing directly
-
July 12, 2024 at 5:36 pm - randyk, Ansys Employee
Case resolution:
The issue is related to SLURM 23.11, which changed the behaviour of mpirun when I_MPI_HYDRA_BOOTSTRAP=slurm is used (the default): it now injects two environment variables and passes the --external-launcher option to the launcher command.
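If it helps others decide whether they are affected, a generic check (not part of the knowledgebase patch) is to confirm the scheduler release and the Hydra bootstrap in use:

# Generic check -- not from the KB article: confirm the SLURM release and
# whether Intel MPI's Hydra launcher is bootstrapping through SLURM.
sinfo --version        # 23.11 and later exhibit the changed mpirun behaviour
echo "I_MPI_HYDRA_BOOTSTRAP=${I_MPI_HYDRA_BOOTSTRAP:-unset (defaults to slurm inside a SLURM allocation)}"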
Patches were provided to resolve the issue.
-
August 10, 2024 at 1:13 am - benjamin.choi, Subscriber
Hello, could you provide the exact solution for this? That would be greatly appreciated.
-
August 12, 2024 at 12:08 pm - randyk, Ansys Employee
Hi Benjamin,
The solution is Ansys knowledgebase article 000064207.
This is available on the Ansys Partner Community; your Ansys ASC would have access if you do not. That solution provides two patches that can be applied to AnsysEM 2022R2 through 2024R2.
AnsysEM 2025R1 will natively work with SLURM 23.11.
Thanks,
Randy
-
The topic ‘SLURM integration’ is closed to new replies.