-
-
April 18, 2019 at 9:45 pm
swheat
Subscriber
V192, RSM from a Windows client to a Linux cluster: a parallel job gets a SIGSEGV error on all processes. Serial jobs behave the same.
However, if we "update the simulation one iteration" and then submit the job to run the remaining iterations, it all runs fine, whether parallel or not.
The SIGSEGV seems to happen during initialization, before the simulation gets going.
The last entry in the .trn file before the faults is "Hybrid initialization is done".
We captured stdout.live, and it showed the following at the end. The output is identical whether the job fails or succeeds (starting at iteration 2).
Running Solver : /opt/apps/ansys/v192/fluent/bin/fluent --albion --run --launcher_setting_file "fluentLauncher.txt" --fluent_options " -gu -driver null -driver null -workbench-session -i "SolutionPending.jou" -mpi=intel -t4"
/opt/apps/ansys/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 --albion --run --launcher_setting_file fluentLauncher.txt
-
April 22, 2019 at 7:17 pm
JakeC
Ansys Employee
Hi Stephen,
Were you able to try running the solver manually, outside of RSM, from a terminal on the cluster?
Does this happen with all projects or just this one?
Can you download and try running the following Workbench project:
https://drive.google.com/file/d/1-bcXljd-BYvnORbq3bYWhLnk_AptnSUf/view?usp=sharing
Please be sure the following are on each compute node:
Linux
For all 64-bit Linux platforms, the OpenMotif and Mesa libraries should be installed. These libraries are typically installed during a normal Linux installation. You will also need the xpdf package to view the online help.
ANSYS products require OpenMotif. After installing your Linux platform, review the tables below and install the appropriate version of OpenMotif. (You may need to use "rpm -iv --force" to install these.)
Table 2.1: OpenMotif Versions for SUSE Linux Enterprise
SUSE Linux Enterprise Release    OpenMotif Version                       OpenMotif Zypper Package
SUSE Linux Enterprise 12 SP 2    SLES 12 SP2: motif-2.3.4-4.15.x86_64    motif
SUSE Linux Enterprise 12 SP 3    SLES 12 SP3: motif-2.3.4-4.15.x86_64    motif
Table 2.2: OpenMotif Versions for Red Hat Enterprise Linux
Red Hat Enterprise Linux Release    OpenMotif Version
Red Hat Enterprise Linux 6.x        motif-2.3.4-1
Red Hat Enterprise Linux 7.x        motif-2.2.4-0
For more information on OpenMotif libraries for your platform, see the Motif download site.
For Red Hat Enterprise Linux 6.9 and 7.3 through 7.5, you need to install the following libraries:
libpng12
libXp.x86_64
xorg-x11-fonts-cyrillic.noarch
xterm.x86_64
openmotif.x86_64
compat-libstdc++-33.x86_64
compat-libstdc++-44.x86_64
libstdc++.x86_64
libstdc++.i686
gcc-c++.x86_64
compat-libstdc++-33.i686
compat-libstdc++-44.i686
libstdc++-devel.x86_64
libstdc++-devel.i686
compat-gcc-34.x86_64
gtk2.i686
libXxf86vm.i686
libSM.i686
libXt.i686
xorg-x11-fonts-ISO8859-1-75dpi.noarch
glibc-2.12-1.166.el6_7.1 (or greater)
Red Hat no longer includes the 32-bit libraries in the base configuration, so you must install those separately.
For more information on Red Hat Enterprise Linux libraries, see the Red Hat Libraries site.
For CentOS 7.3 and 7.4, you need to install the following libraries:
glibc.i686
glib2.i686
bzip2-libs.i686
libpng.i686
libtiff.i686
libXft.i686
libXxf86vm.i686
sssd-client.i686
libpng12
libpng12.i686
libXp
libXp.i686
openmotif
zlib
Thank you,
Jake
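One way to see which of the listed packages a node still lacks is to query the RPM database for each name. The following is a minimal sketch, assuming an RPM-based distribution where `rpm -q` is available. The function name `missing_rpms` is made up for illustration, and the package names passed to it are only a small sample taken from the list above.

```shell
#!/bin/sh
# Print each package name that `rpm -q` does not report as installed.
# missing_rpms is a hypothetical helper name, not part of any ANSYS tooling.
missing_rpms() {
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed
        rpm -q "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Sample of the prerequisite names quoted above; extend with the full list.
missing_rpms libpng12 libXp.x86_64 xterm.x86_64 openmotif.x86_64
```

Running the same script on the login node and on a compute node should make any difference between the two easy to spot.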
-
April 22, 2019 at 7:24 pm
JakeC
Ansys Employee
Here are the docs for manually using Fluent to run the job on a compute node:
https://drive.google.com/open?id=1bThnS0TXJ6qnmFe_MXsFeUt6DuKCZb8w
Thank you,
Jake
-
April 25, 2019 at 1:13 pm
swheat
Subscriber
Jake,
I have not been able to figure out how to run a job locally; the users are kind of caught up in Workbench and don't know how to give me a journal file to just run. My searches of the internet for an example journal file have been fruitless. I'm not a fluent user, so I have no experience with how to configure a run. Is there a link to a sample .jou file (and the rest of the needed files) that I could use to try this out?
I saw the link for a download of a project, but I'm not sure what to do with that. I've asked my user to download it and run from workbench; I don't know how to extract from that set of files what I need to run it manually.
Stephen
-
April 25, 2019 at 2:40 pm
JakeC
Ansys Employee
Hi Stephen,
OK, I have put together a package for you.
https://drive.google.com/open?id=1Yi4nHugidwWKEujbtgPNwjbBcZZ-o5pG
Untar it in a location that is shared across nodes.
Then, to run it, cd to the directory where the .cas and .jou files are and run:
/ansys_inc/v193/fluent/bin/fluent 3d -g -t2 -i elbow1-rel-path-no-dat.jou
Adjust paths as necessary.
That will run it on two cores on the machine you are logged into.
Let's see if that works first.
If it completes without issue you should see something like:
Thank you,
Jake
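When repeating this test with different core counts or install paths, it can help to generate the command line rather than retype it. This is only a sketch: `fluent_batch_cmd` and the `FLUENT_BIN` variable are names invented here, and the default path is simply the one quoted in the post above.

```shell
#!/bin/sh
# Print the batch-mode command line from the post above for a given core
# count and journal file. FLUENT_BIN is a hypothetical variable introduced
# here; it defaults to the path quoted in the post.
fluent_batch_cmd() {
    cores=$1
    journal=$2
    printf '%s 3d -g -t%s -i %s\n' \
        "${FLUENT_BIN:-/ansys_inc/v193/fluent/bin/fluent}" "$cores" "$journal"
}

# Show the commands for a 2-core and a 4-core run without executing them.
for n in 2 4; do
    fluent_batch_cmd "$n" elbow1-rel-path-no-dat.jou
done
```

To run a command instead of printing it, the output can be handed to a shell, for example `sh -c "$(fluent_batch_cmd 2 elbow1-rel-path-no-dat.jou)"`.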
-
April 25, 2019 at 3:04 pm
swheat
Subscriber
Jake,
On my login node, it worked fine, with both -t2 and -t16. However, on one of my compute nodes, I got the output shown below. I'd put this in a file, but I can't see how to attach a file to this post. Both nodes use the same NFS-mounted file systems for home directories and for the Fluent executables.
As for the RPMs you said I need, the login node did not have all of them, and the compute nodes had even fewer. It would seem that my next step is to get at least the RPMs that the login node has onto the compute nodes. It appears we're on the way to a solution. Let me know if you recommend any other step. I should be able to at least test-load the RPMs on the one node I am using before I change my node images for every compute node. I will likely get to that today.
Thanks,
Stephen
/ansys_inc/v192/fluent/bin/fluent 3d -g -t2 -i elbow1-rel-path-no-dat.jou
/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -g -t2 -i elbow1-rel-path-no-dat.jou
/ansys_inc/v192/fluent/fluent19.2.0/cortex/lnamd64/cortex.19.2.0 -f fluent -g -i elbow1-rel-path-no-dat.jou (fluent "3d -pshmem -host -r19.2.0 -t2 -mpi=ibmmpi -path/ansys_inc/v192/fluent -ssh")
/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -pshmem -host -t2 -mpi=ibmmpi -path/ansys_inc/v192/fluent -ssh -cx s1n1:43213
3954
Starting /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/3d_host/fluent.19.2.0 host -cx s1n1:43213
3954 "(list (rpsetvar (QUOTE parallel/function) "fluent 3d -flux -node -r19.2.0 -t2 -pshmem -mpi=ibmmpi -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "2") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/ansys_inc/v192/fluent") (rpsetvar (QUOTE parallel/hostsfile) "") )"

       Welcome to ANSYS Fluent Release 19.2

       Copyright 1987-2018 ANSYS, Inc. All Rights Reserved.
       Unauthorized use, distribution or duplication is prohibited.
       This product is subject to U.S. laws governing export and re-export.
       For full Legal Notice, see documentation.

Build Time: Aug 08 2018 12:59:03 EDT  Build Id: 10236

   This is an academic version of ANSYS FLUENT. Usage of this product
   license is limited to the terms and conditions specified in your ANSYS
   license form, additional terms section.

Host spawning Node 0 on machine "s1n1" (unix).
/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -flux -node -t2 -pshmem -mpi=ibmmpi -ssh -mport 192.168.1.111:192.168.1.111
4230:0
Starting /ansys_inc/v192/fluent/fluent19.2.0/multiport/mpi/lnamd64/ibmmpi/bin/mpirun -e MPI_IBV_NO_FORK_SAFE=1 -e MPI_USE_MALLOPT_MMAP_MAX=0 -np 2 /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/3d_node/fluent_mpi.19.2.0 node -mpiw ibmmpi -pic shmem -mport 192.168.1.111:192.168.1.111
4230:0

ID    Hostname  Core  O.S.      PID          Vendor
n0-1  s1n1      2/24  Linux-64  30181-30182  Intel(R) Xeon(R) E5-2643 v4
host  s1n1            Linux-64  30004        Intel(R) Xeon(R) E5-2643 v4

MPI Option Selected: ibmmpi
Selected system interconnect: shared-memory

Cleanup script file is /home/swheat/fluent/parallel/cleanup-fluent-s1n1-30004.sh
terminate called after throwing an instance of 'std::runtime_error'
  what(): locale::facet::_S_create_c_locale name not valid

==============================================================================
Stack backtrace generated for process id 29853 on signal 6 :
1000000: fluent() [0x67f3b9]
1000000: /usr/lib64/libc.so.6(+0x362f0) [0x7fa75da7b2f0]
1000000: /usr/lib64/libc.so.6(gsignal+0x37) [0x7fa75da7b277]
1000000: /usr/lib64/libc.so.6(abort+0x148) [0x7fa75da7c968]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x7fa75e3ba2dd]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e2b6) [0x7fa75e3b82b6]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e301) [0x7fa75e3b8301]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e518) [0x7fa75e3b8518]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZSt21__throw_runtime_errorPKc+0x37) [0x7fa75e3e0c07]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0xb00f4) [0x7fa75e3da0f4]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6locale5_ImplC2EPKcm+0x49) [0x7fa75e3cc269]
1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6localeC1EPKc+0x88c) [0x7fa75e3cd4dc]
1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(+0xd8393) [0x7fa75e782393]
1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys10ApipHelper13GetInstallDirEPKw+0x66f) [0x7fa75e7221bf]
1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys17ApipConfiguration20initializeForVersionEPKw+0x1d) [0x7fa75e72f08d]
Please include this information with any bug report you file on this issue!
==============================================================================

No error handler available

Error: Cortex received a fatal signal (unrecognized signal).
Error Object: ()

version> exit
-
April 25, 2019 at 3:40 pm
JakeC
Ansys Employee
Hi Stephen,
Yes, please get all of the required prerequisites installed on the compute nodes.
Once that is done, please rerun the manual test on the compute nodes.
Thank you,
Jake
-
April 29, 2019 at 6:08 pm
swheat
Subscriber
Jake,
I have finally gotten back to work on this. I installed the RPMs that my login node had. It still did not work. So, I installed the rest of the required RPMs. It still did not work. I then tried running Fluent interactively on the compute node without first allocating the node via SLURM, and it worked. To summarize: an interactive session inside a SLURM allocation fails, while a normal interactive session works. Long story short, I found that the cause was an environment variable. While running in SLURM, I inherit the environment variable LANG=en_US.UTF-8. Within an interactive SLURM job, if I "unset LANG", it works fine.
It seems that either my locale setup on the compute nodes is not correct or Fluent doesn't like LANG being set.
I get a bit of an error on the compute node when executing the "locale" command. I'm going to see if I can clean that up and whether that makes it all work.
Have you seen this issue before?
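Since the failure depended on how the shell was started, comparing the environments of the two sessions is a quick way to find the variable responsible. The following is a minimal sketch, assuming the output of `env` was saved to a file in each session (for example `env > plain.env` in the ordinary login and `env > slurm.env` inside the SLURM job) and that no variable holds a multi-line value. The file names and the helper name `env_only_in_second` are illustrative.

```shell
#!/bin/sh
# Print the NAME=value lines that appear only in the second snapshot.
# env_only_in_second is a hypothetical helper name.
env_only_in_second() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -13 suppresses lines unique to file 1 and lines common to both
    comm -13 "$a" "$b"
    rm -f "$a" "$b"
}

# Compare the two snapshots if they exist in the current directory.
if [ -f plain.env ] && [ -f slurm.env ]; then
    env_only_in_second plain.env slurm.env
fi
```

A variable that is set in both sessions but with different values also shows up, because the comparison is on whole `NAME=value` lines.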
-
April 29, 2019 at 7:59 pm
swheat
Subscriber
Jake,
I have fixed the node configurations so that the "locale" command works properly.
As a result, the sample .jou file you gave me now runs correctly within the SLURM environment.
My student will be trying the remote job submission again this evening. If it works, I'll be closing this out.
Before I got this fix in place, he ran the project you sent us, and it ran remotely OK without hitting the SIGSEGV issue. Just to be sure the problem still existed, he tried the original project, and it still had the SIGSEGV error. Maybe the sample project you sent didn't trigger the locale issue. Or maybe we're facing two problems. At least the interactive job seems to be working just fine now.
Thanks for all of the help!
Stephen
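To confirm on a given node that the locale named by LANG is actually installed, the value can be compared against the output of `locale -a`. This is a rough sketch rather than a definitive check: the function names are invented here, and it assumes that differences in case and hyphens (such as `en_US.UTF-8` versus `en_US.utf8`) are the only spelling variations that matter.

```shell
#!/bin/sh
# Lowercase a locale name and drop hyphens so that "en_US.UTF-8" and
# "en_US.utf8" (the spelling `locale -a` usually prints) compare equal.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Succeed if the given locale name appears in the output of `locale -a`.
locale_is_installed() {
    wanted=$(normalize_locale "$1")
    for name in $(locale -a 2>/dev/null); do
        if [ "$(normalize_locale "$name")" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Report whether the current LANG setting is usable on this node.
check_lang() {
    if [ -z "${LANG:-}" ]; then
        echo "LANG is not set"
    elif locale_is_installed "$LANG"; then
        echo "LANG=$LANG is installed"
    else
        echo "LANG=$LANG is NOT installed"
        return 1
    fi
}

check_lang || true
```

Running this inside a SLURM allocation and in a plain login on the same node shows whether the two sessions disagree about LANG or about the installed locales.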
-
April 30, 2019 at 1:57 am
swheat
Subscriber
Jake,
Unfortunately, the original project does not work. Is there some way we could transfer that to you for you to look at?
The project you sent us works fine. The .jou deck works fine.
What next?
-
April 30, 2019 at 2:48 pm
JakeC
Ansys Employee
Hi Stephen,
At this point it sounds like things are working from a systems perspective.
Please ask the student to create a post in the Fluids Physics section of the forum for help from an engineer.
In the meantime, I would suggest starting with a very simple project from the student and making sure that works. Then add the features of the failing project one at a time to see which of them may be causing the issue.
Thank you,
Jake
-
April 30, 2019 at 3:52 pm
swheat
Subscriber
Jake,
Thanks again for all of the help. That sample .jou file was a life saver.
Stephen
-
December 19, 2022 at 4:21 pm
Arindam
Subscriber
Hi srwheat,
I am facing the same problem. I am trying to run Fluent on a cluster through Slurm. When I launch the job, I get the following error. It is the same error you faced. Can you please let me know how you solved it?
Thanks

Build Time: May 27 2022 08:43:47 EDT  Build Id: 10212

Connected License Server List: 1055@172.16.1.22
Host spawning Node 0 on machine "n001.cluster.pssclabs.com" (unix).
/opt/ansys_inc/v222/fluent/fluent22.2.0/bin/fluent -r22.2.0 3ddp -flux -node -alnamd64 -t4 -pshmem -mpi=openmpi -ssh -mport 172.16.32.1:172.16.32.1:40169:0
Starting fixfiledes /opt/ansys_inc/v222/fluent/fluent22.2.0/multiport/mpi/lnamd64/openmpi/bin/mpirun --map-by numa --mca btl self,vader,tcp --mca pml ^ucx --mca btl_sm_use_knem 0 --prefix /opt/ansys_inc/v222/fluent/fluent22.2.0/multiport/mpi/lnamd64/openmpi -x LD_LIBRARY_PATH --np 4 --host n001.cluster.pssclabs.com:4 /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/3ddp_node/fluent_mpi.22.2.0 node -mpiw openmpi -pic shmem -mport 172.16.32.1:172.16.32.1:40169:0

-------------------------------------------------------------------------------
ID    Hostname              Core  O.S.      PID        Vendor
-------------------------------------------------------------------------------
n0-3  n001.cluster.pssclab  4/64  Linux-64  5335-5338  AMD EPYC 7513 32-Core
host  n001.cluster.pssclab        Linux-64  5088       AMD EPYC 7513 32-Core

MPI Option Selected: openmpi
Selected system interconnect: shared-memory
-------------------------------------------------------------------------------

Cleanup script file is /net/clusterhn.cluster.pssclabs.com/home/asingha/slurmtest/cleanup-fluent-n001.cluster.pssclabs.com-5088.sh
terminate called after throwing an instance of 'std::runtime_error'
  what(): locale::facet::_S_create_c_locale name not valid

==============================================================================
Stack backtrace generated for process id 4882 on signal 6 :
1000000: fluent() [0x7a5969]
1000000: /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x15302303f090]
1000000: /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x15302303f00b]
1000000: /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x15302301e859]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0x8b833) [0x1530233e3833]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0x919a6) [0x1530233e99a6]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0x919e1) [0x1530233e99e1]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0x91c13) [0x1530233e9c13]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0x8d767) [0x1530233e5767]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(+0xb2574) [0x15302340a574]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6locale5_ImplC2EPKcm+0x4c) [0x1530233fcf7c]
1000000: /opt/ansys_inc/v222/fluent/fluent22.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6localeC1EPKc+0x4b2) [0x1530233fde32]
1000000: /opt/ansys_inc/v222/fluent/lib/lnamd64/libApipWrapper.so(+0xf0c73) [0x1530237cbc73]
1000000: /opt/ansys_inc/v222/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys10ApipHelper13GetInstallDirEPKw+0x591) [0x153023762c41]
1000000: /opt/ansys_inc/v222/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys17ApipConfiguration20initializeForVersionEPKw+0x1d) [0x15302377092d]
Please include this information with any bug report you file on this issue!
==============================================================================

No error handler available

Error: Cortex received a fatal signal (unrecognized signal).
Error Object: ()
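Earlier in this thread, the same `locale::facet::_S_create_c_locale name not valid` abort went away after running `unset LANG` inside the SLURM job, and later after the node's locale configuration was repaired. A minimal sketch of the first workaround as a batch script follows. It assumes a single-node, 4-process job. The job name, the journal file name `run.jou`, the helper `run_without_lang`, and the presence of `fluent` on PATH are all assumptions for illustration, and the `3ddp -g -t4 -i` arguments are pieced together from command lines quoted in this thread rather than taken from a tested setup.

```shell
#!/bin/bash
#SBATCH --job-name=fluent-locale-test
#SBATCH --nodes=1
#SBATCH --ntasks=4

# Run a command in a subshell with LANG removed from its environment,
# leaving the calling shell's LANG untouched.
run_without_lang() {
    ( unset LANG; "$@" )
}

# Launch only if a fluent launcher is on PATH (an assumption; adjust the path).
if command -v fluent >/dev/null 2>&1; then
    run_without_lang fluent 3ddp -g -t4 -i run.jou
fi
```

This only hides the symptom. Per the earlier posts, fixing the locale configuration so that the `locale` command runs cleanly on the compute node is the more durable fix.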
-
- The topic ‘SIGSEGV on all nodes in a parallel job’ is closed to new replies.
© 2026 Copyright ANSYS, Inc. All rights reserved.
