May 12, 2025 at 1:47 pm
oscsoft
Subscriber
Hello,
We are the Ohio Supercomputer Center, and we are seeing errors like the following when running multi-node LS-DYNA MPP jobs:
[c0054:22206:0:22206] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179 RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
This occurs on our Cardinal cluster, which runs RHEL9, with the LS-DYNA MPP intelmpi double builds R15.0.2, R13.1.0, and R11.2.2.
It does not occur on our Ascend cluster, which also runs RHEL9, or on our Pitzer cluster, which currently runs RHEL7.
We have confirmed that these errors occur only for multi-node jobs.
We also tried the openmpi build with openmpi 5.0.2, but the job just hangs.
Here are our results:
For Cardinal MPP double R15.0.2 using intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:
shell element 363345 failed at time 1.2250E-02
[c0024:297464:0:297464] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0024:297464:0:297464] ib_mlx5_log.c:179 RC QP 0xadef wqe[4334]: RDMA_READ s-- [rva 0x374f94a0 rkey 0x80000] [va 0x371564c0 len 6024 lkey 0xea2d1] [rqpn 0xc51a dlid=1893 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
ls-dyna_mpp_d_R15 000000000EF88494 Unknown Unknown Unknown
libc.so.6 000015243B98E6F0 Unknown Unknown Unknown
libc.so.6 000015243B9DB94C Unknown Unknown Unknown
libc.so.6 000015243B98E646 raise Unknown Unknown
libc.so.6 000015243B9787F3 abort Unknown Unknown
libucs.so.0.0.0 000015217B2D850B Unknown Unknown Unknown
libucs.so.0.0.0 000015217B2E6131 ucs_log_default_h Unknown Unknown
libucs.so.0.0.0 000015217B2DCA25 ucs_log_dispatch Unknown Unknown
libuct_ib.so.0.0. 000015217B1E433D uct_ib_mlx5_compl Unknown Unknown
libuct_ib.so.0.0. 000015217B1FB0D4 Unknown Unknown Unknown
libuct_ib.so.0.0. 000015217B1E46BD uct_ib_mlx5_check Unknown Unknown
libuct_ib.so.0.0. 000015217B1F8D57 Unknown Unknown Unknown
libucp.so.0.0.0 000015217B49930A ucp_worker_progre Unknown Unknown
libmlx-fi.so 000015217B52CDF1 Unknown Unknown Unknown
libmlx-fi.so 000015217B545EAD Unknown Unknown Unknown
libmlx-fi.so 000015217B54539B Unknown Unknown Unknown
libmpi.so.12.0.0 000015243C38BE40 Unknown Unknown Unknown
libmpi.so.12.0.0 000015243BF30715 Unknown Unknown Unknown
libmpi.so.12.0.0 000015243C4BC6AF PMPI_Wait Unknown Unknown
libmpifort.so.12. 000015243D7F93ED PMPI_WAIT Unknown Unknown
ls-dyna_mpp_d_R15 00000000092B327F Unknown Unknown Unknown
ls-dyna_mpp_d_R15 0000000009194292 datacom_comr_ 1165 datacom.F
ls-dyna_mpp_d_R15 000000000919241D datacom_com_ 579 datacom.F
ls-dyna_mpp_d_R15 0000000009327B4E mpp_forceshare_ 101 mpp_forceshare.F
ls-dyna_mpp_d_R15 0000000001809564 fem3d_ 28537 _fem3d.f
ls-dyna_mpp_d_R15 00000000018258C8 soltn_ 5881 soltn.f
ls-dyna_mpp_d_R15 000000000193BE30 overly_ 3578 overly.f
ls-dyna_mpp_d_R15 00000000007CB719 MAIN__ 5104 lsdyna.f
ls-dyna_mpp_d_R15 000000000042BF62 Unknown Unknown Unknown
libc.so.6 000015243B979590 Unknown Unknown Unknown
libc.so.6 000015243B979640 __libc_start_main Unknown Unknown
ls-dyna_mpp_d_R15 000000000042BE6F Unknown Unknown Unknown
For Cardinal MPP double R13.1.0 using intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:
node number 1244750 deleted at time 1.21957E-02
[c0058:400169:0:400169] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0058:400169:0:400169] ib_mlx5_log.c:179 RC QP 0x1a39f wqe[31387]: RDMA_READ s-- [rva 0x31428368 rkey 0x20000] [va 0x33aa1308 len 8328 lkey 0x46f6f6] [rqpn 0xf030 dlid=1957 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
ls-dyna_mpp_d_R13 000000000DACF6A4 Unknown Unknown Unknown
libc.so.6 0000151F60ECC6F0 Unknown Unknown Unknown
libc.so.6 0000151F60F1994C Unknown Unknown Unknown
libc.so.6 0000151F60ECC646 raise Unknown Unknown
libc.so.6 0000151F60EB67F3 abort Unknown Unknown
libucs.so.0.0.0 0000151CA080950B Unknown Unknown Unknown
libucs.so.0.0.0 0000151CA0817131 ucs_log_default_h Unknown Unknown
libucs.so.0.0.0 0000151CA080DA25 ucs_log_dispatch Unknown Unknown
libuct_ib.so.0.0. 0000151CA072F33D uct_ib_mlx5_compl Unknown Unknown
libuct_ib.so.0.0. 0000151CA07460D4 Unknown Unknown Unknown
libuct_ib.so.0.0. 0000151CA072F6BD uct_ib_mlx5_check Unknown Unknown
libuct_ib.so.0.0. 0000151CA0743D57 Unknown Unknown Unknown
libucp.so.0.0.0 0000151CA09CA30A ucp_worker_progre Unknown Unknown
libmlx-fi.so 0000151CA0A5DDF1 Unknown Unknown Unknown
libmlx-fi.so 0000151CA0A76EAD Unknown Unknown Unknown
libmlx-fi.so 0000151CA0A7639B Unknown Unknown Unknown
libmpi.so.12.0.0 0000151F618C9E40 Unknown Unknown Unknown
libmpi.so.12.0.0 0000151F6146E715 Unknown Unknown Unknown
libmpi.so.12.0.0 0000151F619FA6AF PMPI_Wait Unknown Unknown
libmpifort.so.12. 0000151F62D373ED PMPI_WAIT Unknown Unknown
ls-dyna_mpp_d_R13 0000000006C398FB Unknown Unknown Unknown
ls-dyna_mpp_d_R13 0000000006B3B362 datacom_comr_ 14217 datacom.f
ls-dyna_mpp_d_R13 0000000006B394ED datacom_com_ 7063 datacom.f
ls-dyna_mpp_d_R13 0000000006C9C1EE mpp_forceshare_ 2264 mpp_forceshare.f
ls-dyna_mpp_d_R13 00000000014E693D fem3d_ 24833 fem3d_p.f
ls-dyna_mpp_d_R13 00000000014D37FA soltn_ 4933 soltn.f
ls-dyna_mpp_d_R13 000000000162DAAE overly_ 2692 overly.f
ls-dyna_mpp_d_R13 00000000004247FB MAIN__ 4163 lsdyna.f
ls-dyna_mpp_d_R13 000000000041CFC2 Unknown Unknown Unknown
libc.so.6 0000151F60EB7590 Unknown Unknown Unknown
libc.so.6 0000151F60EB7640 __libc_start_main Unknown Unknown
ls-dyna_mpp_d_R13 000000000041CEC4 Unknown Unknown Unknown
For Cardinal MPP double R11.2.2 using intel compiler 2021.10.0 and intel-oneapi-mpi 2021.10.0:
node number 1158014 deleted at time 1.07762E-02
[c0023:3167986:0:3167986] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0023:3167986:0:3167986] ib_mlx5_log.c:179 RC QP 0x1d7b9 wqe[46172]: RDMA_READ s-- [rva 0x2facd550 rkey 0x20000] [va 0x4651e270 len 8328 lkey 0x928677] [rqpn 0x1ebb9 dlid=1890 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
ls-dyna_mpp_d_R11 000000000AD7B94D Unknown Unknown Unknown
ls-dyna_mpp_d_R11 000000000AD797E7 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 000000000ACD99A4 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 000000000ACD97B6 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 000000000AC73D46 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 000000000AC7A808 Unknown Unknown Unknown
Unknown 000014C808C246F0 Unknown Unknown Unknown
Unknown 000014C808C7194C Unknown Unknown Unknown
libc.so.6 000014C808C24646 Unknown Unknown Unknown
libc.so.6 000014C808C0E7F3 Unknown Unknown Unknown
Unknown 000014C5485FF50B Unknown Unknown Unknown
libucs.so.0 000014C54860D131 Unknown Unknown Unknown
libucs.so.0 000014C548603A25 Unknown Unknown Unknown
libuct_ib.so.0 000014C54850B33D Unknown Unknown Unknown
Unknown 000014C5485220D4 Unknown Unknown Unknown
libuct_ib.so.0 000014C54850B6BD Unknown Unknown Unknown
Unknown 000014C54851FD57 Unknown Unknown Unknown
libucp.so.0 000014C5487C030A Unknown Unknown Unknown
Unknown 000014C548853DF1 Unknown Unknown Unknown
Unknown 000014C54886CEAD Unknown Unknown Unknown
Unknown 000014C54886C39B Unknown Unknown Unknown
Unknown 000014C809621E40 Unknown Unknown Unknown
libmpi.so.12 000014C8090BAD04 Unknown Unknown Unknown
libmpifort.so.12 000014C80AA891E4 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 0000000005BCE963 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 0000000001978D97 mpp_alesum1_ 2318 mpp_alesum1.f
ls-dyna_mpp_d_R11 000000000198B62D mpp_aletie2_ 2465 mpp_aletie2.f
ls-dyna_mpp_d_R11 000000000195FA0C aleties_ 3046 aleties.f
ls-dyna_mpp_d_R11 0000000001153E9D fem3d_ 14984 fem3d_p.f
ls-dyna_mpp_d_R11 000000000116DC32 soltn_ 4801 soltn.f
ls-dyna_mpp_d_R11 000000000125973A overly_ 2693 overly.f
ls-dyna_mpp_d_R11 000000000041B3C8 MAIN__ 4123 lsdyna.f
ls-dyna_mpp_d_R11 000000000041482E Unknown Unknown Unknown
Unknown 000014C808C0F590 Unknown Unknown Unknown
libc.so.6 000014C808C0F640 Unknown Unknown Unknown
ls-dyna_mpp_d_R11 0000000000414745 Unknown Unknown Unknown
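To make the three failures easier to compare side by side, here is a minimal Python sketch that pulls the fields out of the "RC QP" log lines quoted above. It is only an illustration: the regular expression and the helper name parse_qp_line are our own assumptions, derived from nothing more than the three lines in this post, and may not match the format produced by other UCX versions.

```python
import re

# Assumed pattern, fitted only to the "RC QP" lines quoted in this post.
QP_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+:\d+\]\s+ib_mlx5_log\.c:\d+\s+"
    r"RC QP (?P<qp>0x[0-9a-f]+) wqe\[(?P<wqe>\d+)\]: (?P<op>\w+) .*?"
    r"rkey (?P<rkey>0x[0-9a-f]+)\].*?len (?P<len>\d+) .*?dlid=(?P<dlid>\d+)"
)


def parse_qp_line(line):
    """Return a dict of fields from one 'RC QP' line, or None if it does not match."""
    match = QP_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the decimal fields to integers; hex fields stay as strings.
    for key in ("pid", "wqe", "len", "dlid"):
        fields[key] = int(fields[key])
    return fields


# The three "RC QP" lines from the R15.0.2, R13.1.0 and R11.2.2 runs above.
SAMPLES = [
    "[c0024:297464:0:297464] ib_mlx5_log.c:179 RC QP 0xadef wqe[4334]: RDMA_READ s-- "
    "[rva 0x374f94a0 rkey 0x80000] [va 0x371564c0 len 6024 lkey 0xea2d1] "
    "[rqpn 0xc51a dlid=1893 sl=0 port=1 src_path_bits=0]",
    "[c0058:400169:0:400169] ib_mlx5_log.c:179 RC QP 0x1a39f wqe[31387]: RDMA_READ s-- "
    "[rva 0x31428368 rkey 0x20000] [va 0x33aa1308 len 8328 lkey 0x46f6f6] "
    "[rqpn 0xf030 dlid=1957 sl=0 port=1 src_path_bits=0]",
    "[c0023:3167986:0:3167986] ib_mlx5_log.c:179 RC QP 0x1d7b9 wqe[46172]: RDMA_READ s-- "
    "[rva 0x2facd550 rkey 0x20000] [va 0x4651e270 len 8328 lkey 0x928677] "
    "[rqpn 0x1ebb9 dlid=1890 sl=0 port=1 src_path_bits=0]",
]

PARSED = [parse_qp_line(line) for line in SAMPLES]

if __name__ == "__main__":
    for fields in PARSED:
        print(fields["host"], fields["op"], fields["rkey"], fields["len"], fields["dlid"])
```

Run over full job logs, the same helper would let us check whether every failure is an RDMA_READ and whether particular nodes or message sizes recur.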
Please let us know of anything that might help us resolve this issue.
Best,
Ohio Supercomputer Center
© 2025 Copyright ANSYS, Inc. All rights reserved.