Electronics

Topics related to HFSS, Maxwell, SIwave, Icepak, Electronics Enterprise and more.

AnsysEDT SLURM tight IntelMPI integration issue

    • krrs87
      Subscriber

      Hello all,

       

      I have been trying to run HFSS simulations on the HPC available at my institution. AnsysEDT 2023R2 has recently been installed on Rocky Linux 8.8.

       

      Running jobs on an exclusive node obtained through salloc works fine: I can salloc --nodes=1 --exclusive, ssh in, launch AnsysEDT, and run both manual-configuration and auto-configuration jobs with no problem after changing the MPI Version to 2021. The job distributes and runs very well.

      Issues arise when trying to use the available SLURM integration.

      When running a job with auto settings and auto setup, say across 2 nodes, the multiple hf3d processes launched on each node all get pinned to a single CPU per node. I verified this by enabling the debug options and inspecting the log_mpirun.fl_xxxxxxxxx.log file. It happens both with the IntelMPI bundled with AnsysEDT and with the one installed on the cluster (selected via the $INTELMPI_ROOT variable). I have set $AnsTempDir and the tempdirectory batch option to locations accessible by all nodes, and I have tried various $ANSOFT_MPI_INTERCONNECT and $ANSOFT_MPI_INTERCONNECT_VARIANT options as well.
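      For reference, a quick way to spot-check the pinning on a compute node (assuming pgrep and taskset are available there) would be something like:

      # print the CPU affinity list of every hf3d process on this node
      for pid in $(pgrep hf3d); do
          taskset -cp "$pid"
      done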

      When running a job with manual settings and manual setup, again across 2 nodes, the pre-processing completes successfully and in a distributed fashion; however, when solving the first frequency for adaptive meshing the process exits with a message to contact customer support. Through the log files I have tracked the issue down to a SIGSEGV (signal 11) from the MUMPS driver called by hf3d. Again, this happens with all the variations mentioned above.

      My batch options are:

      HPCLicenseType: pool
      tempdirectory: set to a location accessible by all nodes
      HFSS/MPIVendor: intel
      HFSS/MPIVersion: 2021
      HFSS/RemoteSpawnCommand: scheduler
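      Written out as a -batchoptions file, these would look roughly like the following (the tempdirectory path is a placeholder for my shared scratch location):

      $begin 'Config'
      'HPCLicenseType'='pool'
      'tempdirectory'='/path/to/shared/scratch'
      'HFSS/MPIVendor'='intel'
      'HFSS/MPIVersion'='2021'
      'HFSS/RemoteSpawnCommand'='scheduler'
      $end 'Config'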

       

      Any help would be great! Thank you.

       

    • randyk
      Ansys Employee

      Hi,

      Please create folder:  $HOME/anstest
      Then create file:  $HOME/anstest/job.sh   with the following contents (correct the partition name and installation path):


      #!/bin/bash

      #SBATCH -N 3
      #SBATCH -n 12
      #SBATCH -J AnsysEMTest     # sensible name for the job
      #SBATCH -p default         # partition name
       
      InstFolder=/opt/AnsysEM/v232/Linux64
      JobFolder=$(pwd)
      JobName=${SLURM_JOB_NAME}   # name used for the .options file below (assumed; any fixed name also works)
       
      # SLURM setup
      export ANSYSEM_GENERIC_MPI_WRAPPER=${InstFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
      export ANSYSEM_COMMON_PREFIX=${InstFolder}/common
      srun_cmd="srun --overcommit --export=ALL  -n 1 -N 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
      export ANSYSEM_TASKS_PER_NODE="${SLURM_TASKS_PER_NODE}"
       
      # Setup Batchoptions
      echo "\$begin 'Config'" > ${JobFolder}/${JobName}.options
      echo "'HFSS/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS 3D Layout Design/RemoteSpawnCommand'='scheduler'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
      echo "'HFSS 3D Layout Design/MPIVersion'='2021'" >> ${JobFolder}/${JobName}.options
      echo "'tempdirectory'='/tmp'" >> ${JobFolder}/${JobName}.options
      echo "\$end 'Config'" >> ${JobFolder}/${JobName}.options
       
      # skip OS/Dependency check
      export ANS_IGNOREOS=1
      export ANS_NODEPCHECK=1
       
      # The MPI timeout defaults to 30 min (a cloud-oriented default); 120 or 240 seconds is suggested on-prem
      export MPI_TIMEOUT_SECONDS=120
       
      #copy test project
      cp ${InstFolder}/schedulers/diagnostics/Projects/HFSS/OptimTee-DiscreteSweep.aedt ${JobFolder}/OptimTee-DiscreteSweep.aedt
       
      # Submit the AEDT job (SLURM tight integration requires launching through 'srun' and the slurm_srun_wrapper.sh)
      ${srun_cmd} ${InstFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -machinelist numcores=12 -auto -batchoptions ${JobFolder}/${JobName}.options -batchsolve TeeModel:Nominal:Setup1 ${JobFolder}/OptimTee-DiscreteSweep.aedt



      $ cd $HOME/anstest/
      $ dos2unix $HOME/anstest/job.sh
      $ chmod +x $HOME/anstest/job.sh
      $ sbatch $HOME/anstest/job.sh 
       
      Does this solve successfully, and does it distribute across the three machines?
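      One way to verify the distribution (standard SLURM and Linux commands; the job id and node name below are placeholders):

      squeue -j <jobid> -o "%.10i %.9T %.20R"      # job state and the node list it landed on
      ssh <node> "top -b -n 1 | grep hf3d"         # confirm several busy hf3d processes on a node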
       
      thanks
      Randy
    • krrs87
      Subscriber

      Hello Randy,

      Thank you for the response. Unfortunately not. While the processes do get spread across the nodes, the same thing is happening again: all hf3d processes are pinned to just 1 core, verified using top. It still leads to a successful solve, but it is very slow.

       

      Thanks

    • randyk
      Ansys Employee

      Issue identified: the following environment variables had been set externally:
      BLIS_NUM_THREADS
      MKL_NUM_THREADS
      OMP_NUM_THREADS
      OMP_PLACES
      OMP_PROC_BIND
      OPENBLAS_NUM_THREADS

      These affect not only the core pinning by MPI, but also the number of threads the solvers use.

      Depending on the system and design, we set their value internally, but typically do not override them if the user has set them externally.
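      As a sketch, assuming you want AnsysEDT to manage threading and pinning itself, you could clear them in the job script before the srun/ansysedt line:

      # clear externally set threading/pinning variables so the solver can pick its own values
      for v in BLIS_NUM_THREADS MKL_NUM_THREADS OMP_NUM_THREADS OMP_PLACES OMP_PROC_BIND OPENBLAS_NUM_THREADS; do
          unset "$v"
      done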



