Electronics

Topics related to HFSS, Maxwell, SIwave, Icepak, Electronics Enterprise and more.

AnsysEDT SLURM tight IntelMPI integration issue

    • krrs87
      Subscriber

      Hello all,

       

      I have been trying to run HFSS simulations on the HPC available at my institution. AnsysEDT 2023R2 has recently been installed on Rocky Linux 8.8.

       

      Running jobs on an exclusive node obtained through salloc works fine: I can salloc --nodes=1 --exclusive, ssh in, launch AnsysEDT, and run both manual-configuration and auto-configuration jobs with no problem after changing the MPI Version to 2021. The job distributes and runs very well.

      Issues arise when trying to use the available SLURM integration.

      When running a job with auto settings and auto setup, say across 2 nodes, the multiple hf3d processes launched on each node all get pinned to a single CPU per node. I verified this by enabling the debug options and inspecting the log_mpirun.fl_xxxxxxxxx.log file. It happens both with the IntelMPI bundled with AnsysEDT and with the one installed on the cluster (selected via the $INTELMPI_ROOT variable). I have set $AnsTempDir and the tempdirectory batch option to locations accessible by all nodes, and I have tried various $ANSOFT_MPI_INTERCONNECT and $ANSOFT_MPI_INTERCONNECT_VARIANT options as well.
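      For reference, a quick way to spot-check the pinning on a compute node (assuming pgrep and taskset are available there) would be something like:

      # print the CPU affinity list of every hf3d process on this node
      for pid in $(pgrep hf3d); do
          taskset -cp "$pid"
      done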

      When running a job with manual settings and manual setup, again across 2 nodes, the pre-processing completes successfully and in a distributed fashion; however, when solving the first frequency for adaptive meshing the process exits with a message to contact customer support. Through the log files I have tracked the issue down to a SIGSEGV (signal 11) from the MUMPS driver called by hf3d. Again, this happens with all the variations mentioned above.

      My batch options are:

      HPCLicenseType: pool
      tempdirectory: set to a location accessible by all nodes
      HFSS/MPIVendor: intel
      HFSS/MPIVersion: 2021
      HFSS/RemoteSpawnCommand: scheduler
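      Written out as a -batchoptions file, these would look roughly like the following (the tempdirectory path is a placeholder for my shared scratch location):

      $begin 'Config'
      'HPCLicenseType'='pool'
      'tempdirectory'='/path/to/shared/scratch'
      'HFSS/MPIVendor'='intel'
      'HFSS/MPIVersion'='2021'
      'HFSS/RemoteSpawnCommand'='scheduler'
      $end 'Config'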

       

      Any help would be great! Thank you.

       

    • randyk
      Ansys Employee

      Hi,

      Please create folder:  $HOME/anstest
      Then create file:  $HOME/anstest/job.sh   with the following contents (correct the partition name and installation path):


      #!/bin/bash

      #SBATCH -N 3
      #SBATCH -n 12
      #SBATCH -J AnsysEMTest     # sensible name for the job
      #SBATCH -p default         # partition name
       
      InstFolder=/opt/AnsysEM/v232/Linux64
      JobFolder=$(pwd)
      JobName=${SLURM_JOB_NAME}   # name used for the .options file below (assumed; any fixed name also works)
       
      # SLURM setup
      export ANSYSEM_GENERIC_MPI_WRAPPER=${InstFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
      export ANSYSEM_COMMON_PREFIX=${InstFolder}/common
      srun_cmd="srun --overcommit --export=ALL  -n 1 -N 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
      export ANSYSEM_TASKS_PER_NODE="${SLURM_TASKS_PER_NODE}"
       
      # Setup Batchoptions
      echo "\$begin 'Config'" > ${JobFolder}/${JobName}.options
      echo "'HFSS/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS 3D Layout Design/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS 3D Layout Design/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
      echo "'tempdirectory'='/tmp'" >> ${JobFolder}/${JobName}.options
      echo "\$end 'Config'" >> ${JobFolder}/${JobName}.options
       
      # skip OS/Dependency check
      export ANS_IGNOREOS=1
      export ANS_NODEPCHECK=1
       
      # The MPI timeout defaults to 30 min (a cloud-oriented default); 120 or 240 seconds is suggested on-prem
      export MPI_TIMEOUT_SECONDS=120
       
      #copy test project
      cp ${InstFolder}/schedulers/diagnostics/Projects/HFSS/OptimTee-DiscreteSweep.aedt ${JobFolder}/OptimTee-DiscreteSweep.aedt
       
      # Submit the AEDT job (SLURM tight integration requires launching through 'srun' and the slurm_srun_wrapper.sh)
      ${srun_cmd} ${InstFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -machinelist numcores=12 -auto -batchoptions ${JobFolder}/${JobName}.options -batchsolve TeeModel:Nominal:Setup1 ${JobFolder}/OptimTee-DiscreteSweep.aedt



      $ cd $HOME/anstest/
      $ dos2unix $HOME/anstest/job.sh
      $ chmod +x $HOME/anstest/job.sh
      $ sbatch $HOME/anstest/job.sh 
       
      Does this solve successfully, and does it distribute across the three machines?
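      One way to verify the distribution (standard SLURM and Linux commands; the job id and node name below are placeholders):

      squeue -j <jobid> -o "%.10i %.9T %.20R"      # job state and the node list it landed on
      ssh <node> "top -b -n 1 | grep hf3d"         # confirm several busy hf3d processes on a node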
       
      thanks
      Randy
    • krrs87
      Subscriber

      Hello Randy,

      Thank you for the response. Unfortunately not. While the processes do get spread across the nodes, the same thing is happening again: all hf3d processes are pinned to just 1 core, verified using top. It still leads to a successful solve, but it is very slow.

       

      Thanks

    • randyk
      Ansys Employee

      Issue identified: the following environment variables had been set externally:
      BLIS_NUM_THREADS
      MKL_NUM_THREADS
      OMP_NUM_THREADS
      OMP_PLACES
      OMP_PROC_BIND
      OPENBLAS_NUM_THREADS

      These affect not only the core pinning by MPI, but also the number of threads the solvers use.

      Depending on the system and design, we set their value internally, but typically do not override them if the user has set them externally.
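      As a sketch, assuming you want AnsysEDT to manage threading and pinning itself, you could clear them in the job script before the srun/ansysedt line:

      # clear externally set threading/pinning variables so the solver can pick its own values
      for v in BLIS_NUM_THREADS MKL_NUM_THREADS OMP_NUM_THREADS OMP_PLACES OMP_PROC_BIND OPENBLAS_NUM_THREADS; do
          unset "$v"
      done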



