LS Dyna

Topics related to LS-DYNA, Autodyn, Explicit STR and more.

LS-DYNA MPP multi-node errors: RDMA_READ

    • oscsoft
      Subscriber

      Hello,

      We are the Ohio Supercomputer Center, and we are seeing errors like the following when running multi-node LS-DYNA MPP jobs:

      [c0054:22206:0:22206] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
      [c0054:22206:0:22206] ib_mlx5_log.c:179  RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]

      This occurs on our Cardinal cluster, which runs RHEL9, with the LS-DYNA MPP Intel MPI double-precision versions R15.0.2, R13.1.0, and R11.2.2.

      It does not occur on our Ascend cluster, which also runs RHEL9, or on our Pitzer cluster, which currently runs RHEL7.

      We have confirmed that these errors occur only for multi-node jobs.

      We also tried the Open MPI version of LS-DYNA with Open MPI 5.0.2, but the job just seems to hang.

      Here are our results:

      On Cardinal, MPP double R15.0.2 using Intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:

           shell element 363345  failed at time  1.2250E-02
      [c0024:297464:0:297464] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
      [c0024:297464:0:297464] ib_mlx5_log.c:179  RC QP 0xadef wqe[4334]: RDMA_READ s-- [rva 0x374f94a0 rkey 0x80000] [va 0x371564c0 len 6024 lkey 0xea2d1] [rqpn 0xc51a dlid=1893 sl=0 port=1 src_path_bits=0]
      forrtl: error (76): Abort trap signal
      Image              PC                Routine            Line        Source
      ls-dyna_mpp_d_R15  000000000EF88494  Unknown               Unknown  Unknown
      libc.so.6          000015243B98E6F0  Unknown               Unknown  Unknown
      libc.so.6          000015243B9DB94C  Unknown               Unknown  Unknown
      libc.so.6          000015243B98E646  raise                 Unknown  Unknown
      libc.so.6          000015243B9787F3  abort                 Unknown  Unknown
      libucs.so.0.0.0    000015217B2D850B  Unknown               Unknown  Unknown
      libucs.so.0.0.0    000015217B2E6131  ucs_log_default_h     Unknown  Unknown
      libucs.so.0.0.0    000015217B2DCA25  ucs_log_dispatch      Unknown  Unknown
      libuct_ib.so.0.0.  000015217B1E433D  uct_ib_mlx5_compl     Unknown  Unknown
      libuct_ib.so.0.0.  000015217B1FB0D4  Unknown               Unknown  Unknown
      libuct_ib.so.0.0.  000015217B1E46BD  uct_ib_mlx5_check     Unknown  Unknown
      libuct_ib.so.0.0.  000015217B1F8D57  Unknown               Unknown  Unknown
      libucp.so.0.0.0    000015217B49930A  ucp_worker_progre     Unknown  Unknown
      libmlx-fi.so       000015217B52CDF1  Unknown               Unknown  Unknown
      libmlx-fi.so       000015217B545EAD  Unknown               Unknown  Unknown
      libmlx-fi.so       000015217B54539B  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   000015243C38BE40  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   000015243BF30715  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   000015243C4BC6AF  PMPI_Wait             Unknown  Unknown
      libmpifort.so.12.  000015243D7F93ED  PMPI_WAIT             Unknown  Unknown
      ls-dyna_mpp_d_R15  00000000092B327F  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R15  0000000009194292  datacom_comr_            1165  datacom.F
      ls-dyna_mpp_d_R15  000000000919241D  datacom_com_              579  datacom.F
      ls-dyna_mpp_d_R15  0000000009327B4E  mpp_forceshare_           101  mpp_forceshare.F
      ls-dyna_mpp_d_R15  0000000001809564  fem3d_                  28537  _fem3d.f
      ls-dyna_mpp_d_R15  00000000018258C8  soltn_                   5881  soltn.f
      ls-dyna_mpp_d_R15  000000000193BE30  overly_                  3578  overly.f
      ls-dyna_mpp_d_R15  00000000007CB719  MAIN__                   5104  lsdyna.f
      ls-dyna_mpp_d_R15  000000000042BF62  Unknown               Unknown  Unknown
      libc.so.6          000015243B979590  Unknown               Unknown  Unknown
      libc.so.6          000015243B979640  __libc_start_main     Unknown  Unknown
      ls-dyna_mpp_d_R15  000000000042BE6F  Unknown               Unknown  Unknown
      

      On Cardinal, MPP double R13.1.0 using Intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:

           node  number    1244750 deleted at time 1.21957E-02
      [c0058:400169:0:400169] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
      [c0058:400169:0:400169] ib_mlx5_log.c:179  RC QP 0x1a39f wqe[31387]: RDMA_READ s-- [rva 0x31428368 rkey 0x20000] [va 0x33aa1308 len 8328 lkey 0x46f6f6] [rqpn 0xf030 dlid=1957 sl=0 port=1 src_path_bits=0]
      forrtl: error (76): Abort trap signal
      Image              PC                Routine            Line        Source
      ls-dyna_mpp_d_R13  000000000DACF6A4  Unknown               Unknown  Unknown
      libc.so.6          0000151F60ECC6F0  Unknown               Unknown  Unknown
      libc.so.6          0000151F60F1994C  Unknown               Unknown  Unknown
      libc.so.6          0000151F60ECC646  raise                 Unknown  Unknown
      libc.so.6          0000151F60EB67F3  abort                 Unknown  Unknown
      libucs.so.0.0.0    0000151CA080950B  Unknown               Unknown  Unknown
      libucs.so.0.0.0    0000151CA0817131  ucs_log_default_h     Unknown  Unknown
      libucs.so.0.0.0    0000151CA080DA25  ucs_log_dispatch      Unknown  Unknown
      libuct_ib.so.0.0.  0000151CA072F33D  uct_ib_mlx5_compl     Unknown  Unknown
      libuct_ib.so.0.0.  0000151CA07460D4  Unknown               Unknown  Unknown
      libuct_ib.so.0.0.  0000151CA072F6BD  uct_ib_mlx5_check     Unknown  Unknown
      libuct_ib.so.0.0.  0000151CA0743D57  Unknown               Unknown  Unknown
      libucp.so.0.0.0    0000151CA09CA30A  ucp_worker_progre     Unknown  Unknown
      libmlx-fi.so       0000151CA0A5DDF1  Unknown               Unknown  Unknown
      libmlx-fi.so       0000151CA0A76EAD  Unknown               Unknown  Unknown
      libmlx-fi.so       0000151CA0A7639B  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   0000151F618C9E40  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   0000151F6146E715  Unknown               Unknown  Unknown
      libmpi.so.12.0.0   0000151F619FA6AF  PMPI_Wait             Unknown  Unknown
      libmpifort.so.12.  0000151F62D373ED  PMPI_WAIT             Unknown  Unknown
      ls-dyna_mpp_d_R13  0000000006C398FB  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R13  0000000006B3B362  datacom_comr_           14217  datacom.f
      ls-dyna_mpp_d_R13  0000000006B394ED  datacom_com_             7063  datacom.f
      ls-dyna_mpp_d_R13  0000000006C9C1EE  mpp_forceshare_          2264  mpp_forceshare.f
      ls-dyna_mpp_d_R13  00000000014E693D  fem3d_                  24833  fem3d_p.f
      ls-dyna_mpp_d_R13  00000000014D37FA  soltn_                   4933  soltn.f
      ls-dyna_mpp_d_R13  000000000162DAAE  overly_                  2692  overly.f
      ls-dyna_mpp_d_R13  00000000004247FB  MAIN__                   4163  lsdyna.f
      ls-dyna_mpp_d_R13  000000000041CFC2  Unknown               Unknown  Unknown
      libc.so.6          0000151F60EB7590  Unknown               Unknown  Unknown
      libc.so.6          0000151F60EB7640  __libc_start_main     Unknown  Unknown
      ls-dyna_mpp_d_R13  000000000041CEC4  Unknown               Unknown  Unknown


      On Cardinal, MPP double R11.2.2 using Intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:

           node  number    1158014 deleted at time 1.07762E-02
      [c0023:3167986:0:3167986] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
      [c0023:3167986:0:3167986] ib_mlx5_log.c:179  RC QP 0x1d7b9 wqe[46172]: RDMA_READ s-- [rva 0x2facd550 rkey 0x20000] [va 0x4651e270 len 8328 lkey 0x928677] [rqpn 0x1ebb9 dlid=1890 sl=0 port=1 src_path_bits=0]
      forrtl: error (76): Abort trap signal
      Image              PC                Routine            Line        Source
      ls-dyna_mpp_d_R11  000000000AD7B94D  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  000000000AD797E7  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  000000000ACD99A4  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  000000000ACD97B6  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  000000000AC73D46  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  000000000AC7A808  Unknown               Unknown  Unknown
      Unknown            000014C808C246F0  Unknown               Unknown  Unknown
      Unknown            000014C808C7194C  Unknown               Unknown  Unknown
      libc.so.6          000014C808C24646  Unknown               Unknown  Unknown
      libc.so.6          000014C808C0E7F3  Unknown               Unknown  Unknown
      Unknown            000014C5485FF50B  Unknown               Unknown  Unknown
      libucs.so.0        000014C54860D131  Unknown               Unknown  Unknown
      libucs.so.0        000014C548603A25  Unknown               Unknown  Unknown
      libuct_ib.so.0     000014C54850B33D  Unknown               Unknown  Unknown
      Unknown            000014C5485220D4  Unknown               Unknown  Unknown
      libuct_ib.so.0     000014C54850B6BD  Unknown               Unknown  Unknown
      Unknown            000014C54851FD57  Unknown               Unknown  Unknown
      libucp.so.0        000014C5487C030A  Unknown               Unknown  Unknown
      Unknown            000014C548853DF1  Unknown               Unknown  Unknown
      Unknown            000014C54886CEAD  Unknown               Unknown  Unknown
      Unknown            000014C54886C39B  Unknown               Unknown  Unknown
      Unknown            000014C809621E40  Unknown               Unknown  Unknown
      libmpi.so.12       000014C8090BAD04  Unknown               Unknown  Unknown
      libmpifort.so.12   000014C80AA891E4  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  0000000005BCE963  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  0000000001978D97  mpp_alesum1_             2318  mpp_alesum1.f
      ls-dyna_mpp_d_R11  000000000198B62D  mpp_aletie2_             2465  mpp_aletie2.f
      ls-dyna_mpp_d_R11  000000000195FA0C  aleties_                 3046  aleties.f
      ls-dyna_mpp_d_R11  0000000001153E9D  fem3d_                  14984  fem3d_p.f
      ls-dyna_mpp_d_R11  000000000116DC32  soltn_                   4801  soltn.f
      ls-dyna_mpp_d_R11  000000000125973A  overly_                  2693  overly.f
      ls-dyna_mpp_d_R11  000000000041B3C8  MAIN__                   4123  lsdyna.f
      ls-dyna_mpp_d_R11  000000000041482E  Unknown               Unknown  Unknown
      Unknown            000014C808C0F590  Unknown               Unknown  Unknown
      libc.so.6          000014C808C0F640  Unknown               Unknown  Unknown
      ls-dyna_mpp_d_R11  0000000000414745  Unknown               Unknown  Unknown
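
To compare the failing work requests across the three runs, the "RC QP ... RDMA_READ" lines can be pulled apart with a short script. This is only a minimal sketch that assumes the line format shown in the logs above; the name `parse_qp_line` and the returned field names are hypothetical choices for illustration, not part of UCX or LS-DYNA.

```python
import re

# Pattern for the "RC QP ... RDMA_READ" diagnostic line as it appears in the
# logs above. The layout is assumed from those samples only.
QP_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+:\d+\]\s+ib_mlx5_log\.c:\d+\s+"
    r"RC QP (?P<qp>0x[0-9a-f]+) wqe\[(?P<wqe>\d+)\]: (?P<op>\w+) \S+ "
    r"\[rva (?P<rva>0x[0-9a-f]+) rkey (?P<rkey>0x[0-9a-f]+)\] "
    r"\[va (?P<va>0x[0-9a-f]+) len (?P<length>\d+) lkey (?P<lkey>0x[0-9a-f]+)\] "
    r"\[rqpn (?P<rqpn>0x[0-9a-f]+) dlid=(?P<dlid>\d+)"
)


def parse_qp_line(line):
    """Return a dict of fields from one RC QP log line, or None if no match."""
    match = QP_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Decimal fields become integers; hex fields stay as strings so they can
    # be compared directly with the log text.
    for key in ("pid", "wqe", "length", "dlid"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "[c0024:297464:0:297464] ib_mlx5_log.c:179  RC QP 0xadef wqe[4334]: "
        "RDMA_READ s-- [rva 0x374f94a0 rkey 0x80000] "
        "[va 0x371564c0 len 6024 lkey 0xea2d1] "
        "[rqpn 0xc51a dlid=1893 sl=0 port=1 src_path_bits=0]"
    )
    print(parse_qp_line(sample))
```

Running it over each job's stderr gives one record per failing request, which makes it easier to line up the host, remote key, transfer length, and destination LID for each failure.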

      Please let us know of anything that might help us resolve this issue.

      Best,

      Ohio Supercomputer Center
