TAGGED: intel-mpi, mpi-with-slurm, slurm
-
January 29, 2024 at 5:26 pm - krrs87 (Subscriber)
Hello all,
I have been trying to run HFSS simulations on the HPC available at my institution. AnsysEDT 2023R2 has recently been installed on Rocky Linux 8.8.
Running jobs on an exclusive node obtained through salloc works fine: I can salloc --nodes=1 --exclusive, ssh in, launch AnsysEDT, and run both manual-configuration and auto-configuration jobs with no problem after changing the MPI Version to 2021. The job distributes and runs very well.
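Roughly, the interactive workflow that works is the following (the node name is hypothetical and the installation path is assumed):

salloc --nodes=1 --exclusive        # request one whole node
ssh node042                         # ssh to the allocated node (hypothetical name)
/opt/AnsysEM/v232/Linux64/ansysedt  # launch the AnsysEDT GUI there (path assumed)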
Issues arise when trying to use the available SLURM integration.
When running a job with auto settings and auto setup across, say, 2 nodes, the multiple hf3d processes launched on each node all get pinned to a single CPU per node. I verified this by enabling the debug options and inspecting the log_mpirun.fl_xxxxxxxxx.log file. This happens both with the version of Intel MPI bundled with AnsysEDT and with the one on the cluster (selected via the $INTELMPI_ROOT variable). I have set $AnsTempDir and the tempdirectory batch option to locations accessible by all nodes, and I have tried various $ANSOFT_MPI_INTERCONNECT and $ANSOFT_MPI_INTERCONNECT_VARIANT options.
When running a job with manual settings and the manual setup, again across 2 nodes, the pre-processing completes successfully and in a distributed fashion; however, upon solving the first frequency for adaptive meshing the process exits with a message to contact customer support. From the log files I have tracked the issue down to a SIGSEGV (signal 11) in the MUMPS driver called by hf3d. Again, this happens with all of the variations mentioned above.
My batchoptions are as follows (sketched as a .options file after this list):
HPCLicenseType is pool
tempdirectory is set to something reasonable
HFSS/MPIVendor intel
HFSS/MPIVersion 2021
HFSS/RemoteSpawnCommand scheduler
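A rough sketch of those settings written out as a batchoptions file, using the AEDT 'Config' block format; the tempdirectory path is a hypothetical shared location and the file name is just an example:

# Sketch: generate the batchoptions file (values taken from the list above;
# the tempdirectory path is a made-up shared location)
cat > job.options <<'EOF'
$begin 'Config'
'HPCLicenseType'='pool'
'tempdirectory'='/shared/scratch/ansys_tmp'
'HFSS/MPIVendor'='intel'
'HFSS/MPIVersion'='2021'
'HFSS/RemoteSpawnCommand'='scheduler'
$end 'Config'
EOF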
Any help would be great! Thank you.
-
January 30, 2024 at 5:02 pm - randyk (Ansys Employee)
Hi,
Please create folder: $HOME/anstest
Then create file: $HOME/anstest/job.sh with the following contents (correct the partition and installation path):
#!/bin/bash
#SBATCH -N 3
#SBATCH -n 12
#SBATCH -J AnsysEMTest   # sensible name for the job
#SBATCH -p default       # partition name

InstFolder=/opt/AnsysEM/v232/Linux64
JobFolder=$(pwd)
JobName=AnsysEMTest      # base name for the generated batchoptions file

# SLURM setup
export ANSYSEM_GENERIC_MPI_WRAPPER=${InstFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
export ANSYSEM_COMMON_PREFIX=${InstFolder}/common
srun_cmd="srun --overcommit --export=ALL -n 1 -N 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
export ANSYSEM_TASKS_PER_NODE="${SLURM_TASKS_PER_NODE}"

# Setup batchoptions
echo "\$begin 'Config'" > ${JobFolder}/${JobName}.options
echo "'HFSS/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
echo "'HFSS 3D Layout Design/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
echo "'HFSS/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
echo "'HFSS 3D Layout Design/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
echo "'tempdirectory'='/tmp'" >> ${JobFolder}/${JobName}.options
echo "\$end 'Config'" >> ${JobFolder}/${JobName}.options

# Skip OS/dependency check
export ANS_IGNOREOS=1
export ANS_NODEPCHECK=1

# MPI timeout: the 30 min default suits cloud; 120 or 240 seconds is suggested for on-prem
export MPI_TIMEOUT_SECONDS=120

# Copy the test project
cp ${InstFolder}/schedulers/diagnostics/Projects/HFSS/OptimTee-DiscreteSweep.aedt ${JobFolder}/OptimTee-DiscreteSweep.aedt

# Submit the AEDT job (SLURM tight integration requires 'srun' and the slurm_srun_wrapper.sh)
${srun_cmd} ${InstFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -machinelist numcores=12 -auto -batchoptions ${JobFolder}/${JobName}.options -batchsolve TeeModel:Nominal:Setup1 ${JobFolder}/OptimTee-DiscreteSweep.aedt
$ cd $HOME/anstest/
$ dos2unix $HOME/anstest/job.sh
$ chmod +x $HOME/anstest/job.sh
$ sbatch $HOME/anstest/job.sh

Does this solve successfully, and does it distribute across the three machines?

Thanks,
Randy
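One generic way to check whether the job actually spread across the three nodes (standard Slurm commands, nothing Ansys-specific; <jobid> is a placeholder):

$ squeue -j <jobid> -o "%N"                           # node list while the job is running
$ sacct -j <jobid> --format=JobID,NodeList,AllocCPUS  # allocation summary afterwards
-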
February 1, 2024 at 2:13 pm - krrs87 (Subscriber)
Hello Randy,
Thank you for the response. Unfortunately not. While the processes do get distributed across the nodes, the same thing is happening again: all hf3d processes are pinned to just 1 core, as verified with top. It does solve successfully, but very slowly.
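For reference, the pinning can be inspected on a compute node with standard Linux tools (a rough sketch, nothing Ansys-specific):

# Print the CPU affinity list of every hf3d process on this node
for pid in $(pgrep hf3d); do
    taskset -cp "$pid"
done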
Thanks
-
February 6, 2024 at 3:28 pm - randyk (Ansys Employee)
The issue was identified: the following environment variables had been set externally:
BLIS_NUM_THREADS
MKL_NUM_THREADS
OMP_NUM_THREADS
OMP_PLACES
OMP_PROC_BIND
OPENBLAS_NUM_THREADS
These affect not only the core pinning by MPI, but also the number of threads the solvers use.
Depending on the system and design, we set their value internally, but typically do not override them if the user has set them externally.
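A minimal workaround sketch, assuming these variables can simply be cleared in the job script (e.g. in job.sh above) before ansysedt is launched; whether that is appropriate depends on what else runs in the job:

# Clear externally set threading/pinning variables so MPI core binding
# and the solver thread counts are chosen internally (sketch only)
unset BLIS_NUM_THREADS MKL_NUM_THREADS OMP_NUM_THREADS
unset OMP_PLACES OMP_PROC_BIND OPENBLAS_NUM_THREADS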
-
- The topic ‘AnsysEDT SLURM tight IntelMPI integration issue’ is closed to new replies.