Ansys Products

Running Ansys Fluent in a HPC Cluster with LSF scheduler, Intel MPI, and Docker without SSH

    • sleong
      Subscriber

      Our HPC cluster uses the IBM LSF scheduler. Jobs run in a Docker container with Intel MPI across multiple nodes. The problem is that our cluster nodes do not allow SSH access. Is there a way to disable SSH for Ansys Fluent?

    • ANSYS_MMadore
      Ansys Employee
      Please set the following two system environment variables.
      In Bash Shell:
      export FLUENT_SSH=blaunch
      export SCHEDULER_RSH=1
      In C Shell:
      setenv FLUENT_SSH blaunch
      setenv SCHEDULER_RSH 1

      You also need to add -scheduler_tight_coupling to your Fluent command line.

      If you aren't using blaunch, you could instead try adding SSH_SPAWN=0 -pcheck=0 to the fluent command. You still need -scheduler_tight_coupling on the command line in that case.
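A minimal sketch of how the blaunch settings above might be combined in a job script. Only FLUENT_SSH, SCHEDULER_RSH, and -scheduler_tight_coupling come from the reply; the 3ddp solver, the -g, -t, and -i options, and the NPROCS and JOURNAL values are placeholders assumed for illustration. The script prints the command rather than running it, so it can be checked before being submitted through LSF.

```shell
#!/bin/sh
# Environment variables from the reply above.
export FLUENT_SSH=blaunch
export SCHEDULER_RSH=1

# Hypothetical placeholders: adjust to your own job.
NPROCS=64
JOURNAL=run.jou

# Assumed invocation: double-precision 3D solver, no GUI, NPROCS processes,
# reading commands from a journal file. Only -scheduler_tight_coupling is
# taken from the thread.
FLUENT_CMD="fluent 3ddp -g -t${NPROCS} -scheduler_tight_coupling -i ${JOURNAL}"

# Dry run: show the command instead of executing it.
echo "${FLUENT_CMD}"
```

Once the printed command looks right, the echo line could be replaced with the command itself inside the LSF job script.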

    • sleong
      Subscriber
      Hi mmadore, thank you very much for the solution; I can now submit jobs to our cluster. However, I get the error "Received signal SIGSEGV.". A test job on 2 nodes occasionally runs to completion, but most of the time it fails with that error. Jobs on more than 2 nodes always fail with it. How can I fix the problem?

      Building...
      mesh
      auto partitioning mesh by Metis (fast)
      distributing mesh
      parts..................................................

      ==============================================================================

      Node 47: Process 24: Received signal SIGSEGV.

      ==============================================================================

      ==============================================================================

      Node 45: Process 22: Received signal SIGSEGV.

      ==============================================================================

      ==============================================================================

      Node 44: Process 21: Received signal SIGSEGV.

      ==============================================================================

      ==============================================================================

      Node 42: Process 19: Received signal SIGSEGV.

      ==============================================================================
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x0000000007c26520 ***
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x000000000692a870 ***
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x000000000566beb0 ***
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x00000000057102b0 ***
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x00000000058c4740 ***
      ======= Backtrace: =========
      ======= Backtrace: =========
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x81299)[0x7fc52cba4299]
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x81299)[0x7f95ced34299]
      /lib64/libc.so.6(+0x81299)[0x7f2134f55299]
      /export/ansys21/v211/fluent/fluent21.1.0/lnamd64/syslib/libstdc++.so.6(_ZNSsD1Ev+0x3e)[0x7f2139c6bede]
      /lib64/libc.so.6(+0x81299)[0x7f3d58ee1299]
      /export/ansys21/v211/fluent/fluent21.1.0/lnamd64/syslib/libstdc++.so.6(_ZNSsD1Ev+0x3e)[0x7f3d5dbf7ede]
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x81299)[0x7f1a1e11e299]
      /export/ansys21/v211/fluent/lib/lnamd64/libansysfluidssettingsparsers.so(_ZN5ansys21GenericSettingsParserD2Ev+0x46)[0x7f1a2f8ebcd6]
      /lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f1a1e0d705a]
      /lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f2134f0e05a]
      /export/ansys21/v211/fluent/lib/lnamd64/libansysfluidsproject.so(+0x37e43)[0x7f2146207e43]
      ======= Memory map: ========
      /lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f3d58e9a05a]
      /export/ansys21/v211/fluent/lib/lnamd64/libansysfluidsproject.so(+0x37e43)[0x7f3d6a193e43]
      ======= Memory map: ========
      /export/ansys21/v211/fluent/fluent21.1.0/lnamd64/syslib/libstdc++.so.6(_ZNSsD1Ev+0x3e)[0x7fc5318baede]
      /lib64/libc.so.6(+0x39ce9)[0x7fc52cb5cce9]
      /lib64/libc.so.6(+0x39d37)[0x7fc52cb5cd37]
      /export/ansys21/v211/fluent/fluent21.1.0/multiport/lnamd64/mpi/shared/libmport.so(+0x87012)[0x7fc53c5b2012]
      /export/ansys21/v211/fluent/fluent21.1.0/lnamd64/syslib/libstdc++.so.6(_ZNSsD1Ev+0x3e)[0x7f95d3a4aede]
      /lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f95ceced05a]
      /export/ansys21/v211/fluent/fluent21.1.0/cortex/lnamd64/libExpr.so(+0xac0e3)[0x7f95df0d60e3]
      ======= Memory map: ========
      /export/ansys21/v211/fluent/lib/lnamd64/libansysfluidsfactory.so(+0x5c63)[0x7f1a2fd94c63]
      ======= Memory map: ========
      /export/ansys21/v211/fluent/fluent21.1.0/multiport/lnamd64/mpi/shared/libmport.so(+0x8712d)[0x7fc53c5b212d]
      /export/ansys21/v211/fluent/fluent21.1.0/multiport/lnamd64/mpi/shared/libmport.so(+0x8f1a3)[0x7fc53c5ba1a3]
      /lib64/libpthread.so.0(+0x7ea5)[0x7fc539664ea5]
      /lib64/libc.so.6(clone+0x6d)[0x7fc52cc2196d]
      ======= Memory map: ========
      *** Error in `/export/ansys21/v211/fluent/fluent21.1.0/lnamd64/3ddp_node/fluent_mpi.21.1.0': double free or corruption (fasttop): 0x0000000007d47910 ***

      ===============Message from the Cortex Process================================

      Fatal error in one of the compute processes.

      ==============================================================================
      ======= Backtrace: =========
      /lib64/libc.so.6(+0x81299)[0x7f24a88bf299]
      /export/ansys21/v211/fluent/fluent21.1.0/lnamd64/syslib/libstdc++.so.6(_ZNSsD1Ev+0x3e)[0x7f24ad5d5ede]
      /lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f24a887805a]
      /export/ansys21/v211/fluent/lib/lnamd64/libansysfluidsproject.so(+0x37e43)[0x7f24b9b71e43]
      ======= Memory map: ========

    • ANSYS_MMadore
      Ansys Employee
      Can you share the full text of the .trn file for review?
    • sleong
      Subscriber
      Attached please find the trn output.
    • ANSYS_MMadore
      Ansys Employee
      n32-63  compute1-exec-6.ris.  32/72  Linux-64  9-40  Intel(R) Xeon(R) Gold 6154
      n0-31   compute1-exec-98.ris  32/32  Linux-64  9-40  Intel(R) Xeon(R) Gold 6242
      host    compute1-exec-98.ris         Linux-64  539   Intel(R) Xeon(R) Gold 6242

      exec-6 (Xeon Gold 6154) has a different CPU than exec-98 (Xeon Gold 6242), which runs the host process and n0-31, and hyperthreading appears to be enabled on exec-6. Perhaps keep the host on exec-98 and have the compute processes solve only on exec-6?
      Since it looks like you have HT enabled, perhaps also try disabling it.
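As a rough illustration of the mismatch check described above, the sketch below counts the distinct CPU models in the three host-table lines quoted in this reply. It assumes the model number is always the last whitespace-separated field on each line, which holds for these three lines but is not a documented property of the .trn format.

```shell
#!/bin/sh
# The three host-table lines quoted in the reply, pasted as-is.
hosts='n32-63 compute1-exec-6.ris. 32/72 Linux-64 9-40 Intel(R) Xeon(R) Gold 6154
n0-31 compute1-exec-98.ris 32/32 Linux-64 9-40 Intel(R) Xeon(R) Gold 6242
host compute1-exec-98.ris Linux-64 539 Intel(R) Xeon(R) Gold 6242'

# Assumption: the CPU model number is the last field on each line.
models=$(printf '%s\n' "$hosts" | awk '{print $NF}' | sort -u)
count=$(printf '%s\n' "$models" | wc -l | tr -d ' ')

echo "distinct CPU models: $count"
printf '%s\n' "$models"
```

A count above 1 means the job spans nodes with different CPU models, as it does here.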

      Can you try:
      -mpi=intel2019
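One way the suggested flag might be added to an existing launch line is sketched below. BASE_CMD is a hypothetical command made up for illustration (solver, process count, and journal name are placeholders); only -mpi=intel2019 and -scheduler_tight_coupling appear in this thread. The script only builds and prints the command.

```shell
#!/bin/sh
# Hypothetical baseline command; substitute the one from your own job script.
BASE_CMD="fluent 3ddp -g -t64 -scheduler_tight_coupling -i run.jou"

# Flag suggested in the reply above.
MPI_FLAG="-mpi=intel2019"

# Append the flag only if the command does not already select an MPI.
case " ${BASE_CMD} " in
  *" -mpi="*) FLUENT_CMD="${BASE_CMD}" ;;
  *)          FLUENT_CMD="${BASE_CMD} ${MPI_FLAG}" ;;
esac

# Dry run: show the resulting command instead of executing it.
echo "${FLUENT_CMD}"
```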
    • sleong
      Subscriber
      Hi Mmadore, thank you very much! I tried 2, 4, and 8 nodes with 64, 16, and 144 processes using "-mpi=intel2019", and all of the jobs completed successfully without any problem. Thank you very much for your help!