March 29, 2019 at 3:23 pm
swheat
Subscriber
After getting RSM running and the firewall set, we now see the job get to the cluster, and qstat shows the four cores on a node allocated.
But the Windows client sees the job fail, while on the Linux node the job still occupies its queue slot in the running state.
How do I find at least a hint of the cause?
The output from qstat shows:
[root@titan .ansys]# qstat
Job id   Name            Username    Time Use  S  Queue
-------  --------------  ----------  --------  -  ------
419      Josh_Workbench  fluentuser  00:01:28  R  normal
The following is an excerpt from the user log:
2019-03-29 09:52  5 [DEBUG] ProcessActivityTracker shows activity.
2019-03-29 09:53  5 [DEBUG] ProcessActivityTracker shows activity.
2019-03-29 09:54  5 [DEBUG] Proxy found inactive and will shutdown
2019-03-29 09:54  5 [FATAL] Unhandled Exception has occurred and the program will exit.
2019-03-29 09:54  5 [FATAL] System.Runtime.Remoting.RemotingException: Requested service not found (Ansys.Rsm.UPHost.HostController, Ans.Rsm.UPHost, Version=19.2.0.0, Culture=neutral, PublicKeyToken=null). No receiver for uri /HostController
Server stack trace:
  at System.Runtime.Remoting.Messaging.MethodCall.ResolveMethod () [0x00064] in <filename unknown>:0
  at System.Runtime.Remoting.Messaging.MethodCall..ctor (System.Object handlerObject, System.Runtime.Serialization.Formatters.Binary.BinaryMethodCallMessage smuggledMsg) [0x00088] in <filename unknown>:0
  at System.Runtime.Serialization.Formatters.Binary.BinaryMethodCall.Read (System.Object[] callA, System.Object handlerObject) [0x00189] in <filename unknown>:0
-
March 29, 2019 at 4:49 pm
tsiriaks
Ansys Employee
On this titan node (I guess it is the submission node, where the RSM Launcher service is installed): does it have multiple NICs?
If so, you must specify one of the public IPs in (default location) /ansys_inc/v1xx/RSM/Config/Ans.Rsm.AppSettings.config, setting both of the relevant address values to that IP,
then restart the RSM Launcher service.
Thanks,
Win
-
March 29, 2019 at 11:38 pm
swheat
Subscriber
Win, thanks for the quick response. Yes, we have three networks. I have set the IP address of the network through which the Windows client connects to the master node. On a subsequent run, we got the same hang.
The log file looks quite similar. Interestingly, it is shorter: there were far fewer DEBUG lines, but the same ending was recorded as captured in my earlier post. We did get this on the Windows client.
It is possible that something odd is going on in the job configuration, because it is looking for a DISPLAY setting, yet in this configuration we aren't enabling remote graphics; we're supposed to be doing a compute-only run. Then again, it is complaining about the node configuration not being available, yet qstat shows four nodes allocated to this job number.
Is there a way to get the parameters that RSM is passing to the queuing system? I am beginning to wonder whether this is a Workbench job configuration issue. I've asked my user to send me a screenshot of his job configuration showing the parallel cluster request parameters; I'll post it when I get it. Otherwise, any other ideas as to what might be causing this?
-
March 30, 2019 at 4:22 pm
swheat
Subscriber
I have received the information from the user regarding the Windows client job submission. Below is the text from his email; I've tried to put the screenshots inline with the text. One thing I have also noted: although the job asks for 4 processors, it is allocated 4 complete nodes. The job is set up to use shared memory if it can, but it is distributed across four distributed-memory nodes, each with 40 cores, even though it only needs 4 cores.
Yes, here is a screenshot of the settings for the Solution. There is one setting not shown in row 34, labeled "Specify Number of Processes Restriction", and this box is unchecked.
There might be something to this error in the setup part of the process, though. I attached a PDF of the tutorial I am semi-following to get my results. If you look at step 13, there are some instructions to show contour plots. I’m not sure if this is where the error is coming from or not.
Also, here is the settings pane for the Setup. Rows 7 and 8 seem to be of interest, especially considering they appear in the solution settings as well. I disabled them in this screenshot and attempted to run the simulation again, but the same display error shows up still. Any thoughts?
-
March 30, 2019 at 8:39 pm
swheat
Subscriber
Slight update: the failed job is no longer hanging onto its node allocation on the cluster. The nodes are freed after the job fails.
-
April 1, 2019 at 8:21 pm
JakeC
Ansys Employee
Hi swheat,
It sounds like SLURM is set up for round-robin rather than fill-up core selection.
Additionally, the error in yellow/orange above is from srun, not from the Ansys side.
You may need to edit your slurm.conf.
In your Client Side RSM Cluster Configuration, what do you have set as the HPC type?
Thank you,
Jake
-
April 1, 2019 at 8:23 pm
JakeC
Ansys Employee
Sorry, one more thing: the graphics warnings can be ignored; they appear just in case graphs need to be drawn to a bitmap.
That should not halt the solve.
-
April 3, 2019 at 12:40 pm
swheat
Subscriber
JCallery, thanks for the insight. As for the RSM cluster configuration HPC type, I think that is "PBSPro", but I'll have to check with my user to be sure.
As for round robin vs. fill-up core selection, that is true. But this job should not be requesting four whole nodes just to run on four cores; it should be invoking the job manager to ask for 4 cores, not 4 nodes. In sbatch terms, that would mean -n4, not -N4. I can't see the qsub parameters, so I'll have to do some work on the qsub script to log its parameters somewhere.
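To spell out the distinction I mean (standard Slurm and PBS Pro flags; "job.sh" is just a placeholder script name):
sbatch -n 4 job.sh                   # Slurm: 4 tasks/cores, which the scheduler may pack onto one node
sbatch -N 4 job.sh                   # Slurm: 4 whole nodes, regardless of how many cores are actually needed
qsub -l select=1:ncpus=4 job.sh      # PBS Pro: one chunk with 4 cores on a single node
qsub -l select=4:ncpus=1 job.sh      # PBS Pro: 4 one-core chunks, which can land on 4 different nodes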
As it turns out, the user also tried to submit the job via ssh. It didn't seem to pass all the parameters quite right, but it did show that sbatch was invoked with -N4, which means it wants 4 whole nodes. So I'm wondering whether the RSM qsub is doing the same thing.
The basic question is: if the job expects to run on a single node but gets distributed across multiple nodes, will that cause it to hang? In other words, is the use of 4 nodes instead of 4 cores a root cause of our failed runs, or is it just a node-allocation optimization issue, with the root cause of the hang still not understood?
Thanks!
Stephen
-
April 3, 2019 at 1:52 pm
swheat
Subscriber
Confirmed, it is PBSPro.
-
April 4, 2019 at 12:32 pm
JakeC
Ansys Employee
Hi Stephen,
Thank you for confirming.
You can modify the actual job submission command to suit your needs by editing:
/ansys_inc/v193/RSM/Config/xml/hpc_commands_PBS.xml
Look for the job submission section for distributed jobs; inside that XML element you will see the argument
-l select=%RSM_HPC_CORES%:ncpus=1:mpiprocs=1
together with a value of TRUE. That is the section that is called for a distributed job.
Specifically, the line of interest is:
-l select=%RSM_HPC_CORES%:ncpus=1:mpiprocs=1
It sounds like you need to modify that line to match what your scheduler is expecting.
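For example, if the goal is to pack the requested cores onto a single node instead of asking for that many one-core chunks, something along these lines is one possibility (only a sketch; check it against your site's PBS Pro configuration before relying on it):
-l select=1:ncpus=%RSM_HPC_CORES%:mpiprocs=%RSM_HPC_CORES%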
Thank you,
Jake
-
April 5, 2019 at 4:05 am
swheat
Subscriber
Jake,
Thanks for the very positive suggestion; I shall try that tomorrow. I was able to capture the qsub command by intercepting the execution of qsub and sbatch and logging their arguments. Thus, with the job hung, I could try the same command on other nodes, submitting with -N and -n set the way I want. It still hung with an error. Perhaps my method was flawed; I shall try your change and see whether I get the -N/-n configuration we want.
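The kind of interception I mean is essentially a wrapper placed ahead of the real qsub on the PATH; a minimal sketch of the idea (the binary location and log path below are placeholders, not our exact setup):
#!/bin/sh
# Wrapper ahead of the real qsub on the PATH: record the arguments, then hand off to the real binary.
# /opt/pbs/bin/qsub and /tmp/qsub_args.log are placeholders; adjust for the actual installation.
echo "$(date '+%F %T') qsub $*" >> /tmp/qsub_args.log
exec /opt/pbs/bin/qsub "$@"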
I have noticed one thing, though: the job does allocate the nodes before it hangs/fails. And I have seen this statement in the job output files:
"Didn't find valid cluster type. Script running in non-cluster mode." I think it is related to the control.txt file, but I haven't figured that out yet. What would cause it to not find a "valid cluster type"?
By the way, when we turn the job into a serial job in the client submission, it does successfully enqueue and run a single-thread job. The output looks positive, but we then have a disconnect in getting the output transferred back to the Windows client. We will keep looking at that as well; it will likely end up as a distinct thread on this site unless we figure it out first.
Stephen
-
April 6, 2019 at 4:44 pm
swheat
Subscriber
An additional observation (I just don't know how many of these issues are connected to this thread). We changed one thing on the client side: we chose sequential instead of parallel. This time it appeared to allocate one node on the cluster with 4 processes, exactly what we wanted. The reports back to the client indicate a return code of 0, but then there was a SIGSEGV; please see the attached picture. It didn't appear that the job output successfully made it back to the client, but we don't know that for sure, as each output line seems to indicate successful transfer of something. Perhaps we don't know how to use the returned data.
The question is: is the SIGSEGV expected behavior? If not, what should we look for?
The follow-up question is: did the job really run to a successful end, and were the results communicated back to the client?
-
April 8, 2019 at 6:11 pm
tsiriaks
Ansys Employee
Jake might be able to answer this better than I can, but judging from the RSM job report, I think the solve ran successfully and the SIGSEGV happened when it tried to transfer data back to the client.
There should be User proxy logs in /tmp, named rsm_username_..._pid.log, that might give you some clue.
Please also check the staging directory to see if there are any files in it; these files should be deleted if the transfer was successful.
Another thing: can you ask the user to try to run/solve the project under a shorter path that doesn't contain spaces?
Thanks,
Win
-
April 8, 2019 at 10:37 pm
JakeC
Ansys Employee
Hi Stephen,
Regarding: What would cause it to not find a "valid cluster type"?
I'm not sure where it might be looking for that, but since you are running SLURM and that is not technically a valid cluster type for the Fluids solver, that isn't too surprising. It is most likely looking for environment variables in connection with PBS that are not there, but I can't say for sure.
Can you post the whole log where you see that error?
Regarding the Crash:
Chances are very good that it did not run correctly.
Please look in the Solution.trn file; that may provide some clues.
Chances are that you are missing some required libraries on the compute nodes, but the .trn file should show more.
It looks like the two files were copied back to the client okay, so the file transfer is fine; for Fluent in this state, those are the only two files expected to be copied back.
So I think the RSM parts are working ok.
Are you able to run a fluent batch directly on a compute node bypassing the scheduler?
The command would look something like this:
/ansys_inc/v192/fluent/bin/fluent 3d -g -t4 -i TestJournal.jou
You could also try something like the following:
In the RSM Configuration Utility on the client (Windows) machine, in the "File Transfer" section at the bottom, check "Keep files in staging directory".
Then submit a job.
Wait for it to fail.
Go to the Linux side and cd to that staging directory from one of the compute nodes.
Then run the solver command manually in the Linux terminal that is logged into the compute node, with the staging directory as the working directory.
In your case, in the screenshot above, it would be lines 197 and 198.
See if you get more output that way.
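Put together, the manual check would look roughly like this (the node name, staging path, and journal name are placeholders; use the actual values from your RSM job report):
ssh computenode01                                    # log into one of the compute nodes
cd /path/to/rsm/staging/directory                    # the directory preserved by "Keep files in staging directory"
/ansys_inc/v192/fluent/bin/fluent 3d -g -t4 -i TestJournal.jou   # the solver command, as above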
Thank you,
Jake
-
April 11, 2019 at 9:40 pm
swheat
Subscriber
Jake,
We have two new observations. Getting rid of the spaces in names didn't seem to make any difference.
The sequential run still fails with SIGSEGV. Below is my manual run of it as suggested above; note that we use 2ddp, not 3d. I can't figure out how to attach a file, so the text of the .trn is given below.
Next, we found that if the user ran one iteration on the Windows client and then told it to run the rest of the iterations on the cluster (still sequential, using a single core), the job ran to completion, whether via the client or RSM. No SIGSEGV issues when we "continue" a run.
I tried to run that job manually, but it appears that the run is already updated, and after loading and using a bunch of cores, it just exits without doing any work.
Quick question, is there a way to tell the system to run sequential, but to use more than one core?
Here are the last few lines from an experiment where I ran a job that should be successful (one iteration run on the client, a few run on the server node, then cancelled). It gives the same result: it doesn't seem to do anything after everything is loaded.
Done.
#f
> (wb-update-solution-points-by-one (list ) -1 )
#f
> (wb-update-solution-points-by-one (list 0 (cons 'unsteady? #t) (cons 'flow-time 0) (cons 'time-step 0) (cons 'case-file "FFF-3-00000.cas.gz") (cons 'data-file "FFF-3-00000.dat.gz") (cons 'crank-angle 0)) #f )
#f
> (wb-update-solution-points-by-one (list 7 (cons 'unsteady? #t) (cons 'flow-time 0.00100000004749745) (cons 'time-step 1) (cons 'case-file "FFF-4-00001.cas.gz") (cons 'data-file "FFF-4-00001.dat.gz") (cons 'crank-angle 0)) #f )
#f
> (sm-set-current-solution 7)
#f
> (wb-enable-progress-writing #t)
monitor-lambdas
> (wb-update-run-number 4)
autosave/run-number
> (wb-update-solution #t )
#f
> (wb-save #t)
#f
> (wb-stop-transcript-file)
Transcript closed.
> (wb-exit)
==============================================================
Here's the .trn file for when it failed.
Build Time: Aug 08 2018 12:59:02 EDT Build Id: 10236
Executable Path: /ansys_inc/v192/fluent/bin/fluent
ID     Hostname       Core   O.S.      PID          Vendor
n0-15  titan.oru.edu  16/32  Linux-64  34470-34485  Intel(R) Xeon(R) E5-2470 0
host   titan.oru.edu         Linux-64  34297        Intel(R) Xeon(R) E5-2470 0
MPI Option Selected: ibmmpi
Selected system interconnect: shared-memory
()
> (wb-load-pre-setup-info '((work-dir . "./")(system-type . "FFF")(is-sol . #t)(wb-version . "16.0.2013.11.18.01")(is-connected-to-sc . #f)))
#f
> (wb-set-base-name "FFF")
base-name
> (wb-set-preferences (list (cons 'show-beta-options #f)(cons 'write-setup-output #t)(cons 'write-interpolation #t)(cons 'enable-solution-monitoring #t)))
> (wb-set-container-info 1 0 0 #f)
n-incoming-connections
> (wb-update-run-number 2)
autosave/run-number
> (wb-set-base-name "FFF")
base-name
> (wb-load-case-setting-info '((set-file . "FFF.set")(mesh-ops . "((expected-zone-names outlet-porous_zone outlet-inlet_zone outlet-filter interface1 interface2 axis-porous_zone axis-inlet_zone porous_zone inlet_zone) (operations-list ((type . zone-rename) (id . zone-rename-4) (arg-pair ("Old Zone Name" . axis-inlet_zone) ("New Zone Name" . symmetry-inlet_zone))) ((type . zone-rename) (id . zone-rename-3) (arg-pair ("Old Zone Name" . axis-porous_zone) ("New Zone Name" . symmetry-porous_zone))) ((type . zone-type-change) (id . zone-type-change-1) (arg-pair ("Zone Name" . outlet-filter) ("New Zone Type" . wall))) ((type . zone-type-change) (id . zone-type-change-2) (arg-pair ("Zone Name" . outlet-filter) ("New Zone Type" . pressure-outlet))) ((type . zone-type-change) (id . zone-type-change-3) (arg-pair ("Zone Name" . symmetry-inlet_zone) ("New Zone Type" . axis))) ((type . zone-rename) (id . zone-rename-1) (arg-pair ("Old Zone Name" . symmetry-inlet_zone) ("New Zone Name" . axis-inlet_zone))) ((type . zone-type-change) (id . zone-type-change-4) (arg-pair ("Zone Name" . symmetry-porous_zone) ("New Zone Type" . axis))) ((type . zone-rename) (id . zone-rename-2) (arg-pair ("Old Zone Name" . symmetry-porous_zone) ("New Zone Name" . axis-porous_zone))) ((type . create-mesh-interface) (id . create-mesh-interface-1) (arg-pair ("Sliding Interface Input" interface:01 (sb1-id 11) (sb2-id 10) (interior-id 6) (bnd1-id 14) (bnd2-id 15) (periodic . #f) (coupled . #f) (matching . #f) (mapped . #f) (static . #f) (face-periodic . #f) (turbo . #f) (stretched . #f) (mixing . #f) (stationary . #f) (phase-lag-fp . #f) (phase-lag-fp-param) (mperiodic-ids) (sb1-str-id) (sb2-str-id) (mperiodic-str-ids) (interior-str-id) (cperiodic-id) (cperiodic-str-id) (mti1-id) (mti2-id) (nper . 0) (nper-str . 0) (mpm-avg . 1) (mpm-prof . 3) (mpm-relax . 1) (mpm-bins . 10) (side-switched . #f))))))")(case-file . "FFF-Setup-Output.cas.gz")(file-type . "Case")(setup-output . "#t")))
Multicore SMT processors detected. Processor affinity set!
Reading ""| gunzip -c \"FFF-Setup-Output.cas.gz\"""...
Buffering for file scan...
6058 triangular cells, zone 3, binary.
23254 triangular cells, zone 4, binary.
434 mixed interior faces, zone 6, binary.
8851 2D interior faces, zone 1, binary.
34653 2D interior faces, zone 2, binary.
13 2D pressure-outlet faces, zone 7, binary.
77 2D pressure-outlet faces, zone 8, binary.
236 2D pressure-outlet faces, zone 9, binary.
228 2D interface faces, zone 10, binary.
208 2D interface faces, zone 11, binary.
15 2D axis faces, zone 12, binary.
151 2D axis faces, zone 13, binary.
434 interface face parents, binary.
434 interface metric data, zone 6, binary.
15122 nodes, binary.
15122 node flags, binary.
Building...
mesh
auto partitioning mesh by Metis (fast),
distributing mesh
parts................,
faces................,
nodes................,
cells................,
bandwidth reduction using Reverse Cuthill-McKee: 365/42 = 8.69048
materials,
interface,
domains,
mixture
air
water
interaction
zones,
porous_zone (water)
inlet_zone (water)
interior-porous_zone (water)
interior-inlet_zone (water)
outlet-porous_zone (water)
outlet-inlet_zone (water)
interface1 (water)
interface2 (water)
porous_zone (air)
inlet_zone (air)
interior-porous_zone (air)
interior-inlet_zone (air)
outlet-porous_zone (air)
outlet-inlet_zone (air)
interface1 (air)
interface2 (air)
outlet-filter (air)
outlet-filter (water)
axis-inlet_zone (air)
axis-inlet_zone (water)
axis-porous_zone (air)
axis-porous_zone (water)
interface2-non-overlapping (air)
interface2-non-overlapping (water)
interface1-non-overlapping (air)
interface1-non-overlapping (water)
interface:01-interior-1-1 (air)
interface:01-interior-1-1 (water)
interface2
interface1
outlet-inlet_zone
outlet-porous_zone
interior-inlet_zone
interior-porous_zone
outlet-filter
axis-inlet_zone
axis-porous_zone
interface2-non-overlapping
interface1-non-overlapping
interface:01-interior-1-1
inlet_zone
porous_zone
mesh interfaces,
parallel,
Done.
#f
> (wb-apply-cx-state-and-surfaces "((((gui-processing? #t) (cell-function-defs ((0 ((name _rdabsolute-volume-flow-rate) (display "volume-flow-rate
* ( - 1)") (syntax-tree ("*" "volume-flow-rate " -1)) (code (field-* (cx-report-eval "volume-flow-rate") -1)))))) (graphics/scenes (((name . "contour-2") (graphics-object-names "contour-2") (camera-setting (position 0.1145503595471382 0.01046664360910654 0.6500812768936157) (target 0.1145503595471382 0.01046664360910654 0.) (up-vector 0. 1. 0.) (target-width . 0.2600325047969818) (target-height . 0.2600325047969818) (projection-type . "perspective"))) ((name . "contour-1") (graphics-object-names "contour-1") (camera-setting (position 0.1238249987363815 0.07633239775896072 0.6500812768936157) (target 0.1238249987363815 0.07633239775896072 0.) (up-vector 0. 1. 0.) (target-width . 0.2600325047969818) (target-height . 0.2600325047969818) (projection-type . "perspective"))))) (surfaces/groups ((volume-fraction-17 (17)) (interior-porous_zone (
) (interior-inlet_zone (7)) (outlet-porous_zone (6)) (outlet-inlet_zone (5)) (outlet-filter (4)) (interface1 (3)) (interface2 (2)) (axis-inlet_zone (0)) (axis-porous_zone (1)) (swirl-velocity (9)) (interface2-non-overlapping (10)) (interface1-non-overlapping (11)) (interface:01-interior-1-1 (12)) (porous_zone (13)) (inlet_zone (14)) (volume-fraction-15 (15)) (frustum-free-surface-time-0s (16)))) (cx-virtual-id-list (4198 4199 4201 4202 4203 4204 4206 4207 4208 4210 4211 4212 4213 4214 4262 4263 4264 4265)) (cx-surface-id-map ((17 4265) (16 4264) (15 4263) (9 4262) (14 4214) (13 4213) (12 4212) (11 4211) (10 4210) (1 4208) (0 4207) (4 4206) (8 4204) (7 4203) (6 4202) (5 4201) (3 4199) (2 4198))) (cx-surface-type ((9 0) (2 0) (3 0) (5 0) (6 0) (7 0) (8 0) (4 0) (0 0) (1 0) (10 0) (11 0) (12 0) (13 0) (14 0))) (cx-surface-def-list ((4265 () (iso-surface 4265 () "air-vof" '(0.)) #f) (4264 (4214) (iso-surface 4264 (4214) "water-vof" '(0.5)) #f) (4263 () (iso-surface 4263 () "air-vof" '(0.)) #f) (4262 () (point-surface 4262 global '((0.098425 0.04 0.))) #f) (4214 () (zone-surface 4214 4) #f) (4213 () (zone-surface 4213 3) #f) (4212 () (zone-surface 4212 6) #f) (4211 () (zone-surface 4211 15) #f) (4210 () (zone-surface 4210 14) #f) (4208 () (zone-surface 4208 12) #f) (4207 () (zone-surface 4207 13) #f) (4206 () (zone-surface 4206 9) #f) (4204 () (zone-surface 4204 1) #f) (4203 () (zone-surface 4203 2) #f) (4202 () (zone-surface 4202 7) #f) (4201 () (zone-surface 4201
#f) (4199 () (zone-surface 4199 10) #f) (4198 () (zone-surface 4198 11) #f))) (cx-surface-list #((0 ((zid 13) (type zone-surf) (name axis-inlet_zone) (status susp) (facet-info (0 0 0 0)))) (1 ((zid 12) (type zone-surf) (name axis-porous_zone) (status susp) (facet-info (0 0 0 0)))) (2 ((zid 11) (type zone-surf) (name interface2) (status susp) (facet-info (0 0 0 0)))) (3 ((zid 10) (type zone-surf) (name interface1) (status susp) (facet-info (0 0 0 0)))) (4 ((zid 9) (type zone-surf) (name outlet-filter) (status susp) (facet-info (0 0 0 0)))) (5 ((zid
(type zone-surf) (name outlet-inlet_zone) (status susp) (facet-info (0 0 0 0)))) (6 ((zid 7) (type zone-surf) (name outlet-porous_zone) (status susp) (facet-info (0 0 0 0)))) (7 ((zid 2) (type zone-surf) (name interior-inlet_zone) (status susp) (facet-info (0 0 0 0)))) (8 ((zid 1) (type zone-surf) (name interior-porous_zone) (status susp) (facet-info (0 0 0 0)))) (9 ((type point-surf) (name swirl-velocity) (status susp) (facet-info (0 0 0 0)))) (10 ((zid 14) (type zone-surf) (name interface2-non-overlapping) (status susp) (facet-info (0 0 0 0)))) (11 ((zid 15) (type zone-surf) (name interface1-non-overlapping) (status susp) (facet-info (0 0 0 0)))) (12 ((zid 6) (type zone-surf) (name interface:01-interior-1-1) (status susp) (facet-info (0 0 0 0)))) (13 ((zid 3) (type zone-surf) (name porous_zone) (status active) (facet-info (0 0 6058 3266)))) (14 ((zid 4) (type zone-surf) (name inlet_zone) (status active) (facet-info (0 0 23254 11856)))) (15 ((ref-thread 3) (zones ()) (quantity "air-vof") (units *null*) (type iso-surf) (name volume-fraction-15) (status susp) (facet-info (0 0 0 0)))) (16 ((ref-thread 3) (zones (4)) (quantity "water-vof") (units *null*) (type iso-surf) (name frustum-free-surface-time-0s) (status susp) (facet-info (0 0 0 0)))) (17 ((ref-thread 3) (zones ()) (quantity "air-vof") (units *null*) (type iso-surf) (name volume-fraction-17) (status susp) (facet-info (0 0 0 0)))) #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f #f)) (mirror-zones (13 12)) (view-list ((front ((0.123825 0.07633240067796 0.6500812499999999) (0.123825 0.07633240067796 0.) (0. 1. 0.) 0.2600325 0.2600325 "perspective") #(1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1)) (back ((0.123825 0.07633240067796 -0.6500812499999999) (0.123825 0.07633240067796 0.) (0. 1. 0.) 0.2600325 0.2600325 "perspective") #(1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1)))) (lights/headlight/on? #f) (render/surfaces (6 5 3 2)) (render/grid/surfaces (6 5 3 2)) (filled-grid? #f) (xy/bottom -1) (scale/right -0.7) (cx-case-version (19 2 0)) (cx-case-read-event-called-already #t))) (3 porous_zone) (4 inlet_zone) (6 interface:01-interior-1-1) (15 interface1-non-overlapping) (14 interface2-non-overlapping) (12 axis-porous_zone) (13 axis-inlet_zone) (9 outlet-filter) (1 interior-porous_zone) (2 interior-inlet_zone) (7 outlet-porous_zone) (8 outlet-inlet_zone) (10 interface1) (11 interface2))" )
Setting Post Processing and Surfaces information ... Done.
saved-mesh-id
> (wb-load-post-setup-info '((original-mesh . #t)(run-number . 2)(check-grid . #t)(sol-needs-refresh . #f)))
#f
> (wb-update-em-loss-data '#f"
"#f)
#f
> (wb-init-flow)
Initialization based on settings...
Initialize using the hybrid initialization method.
Checking case topology...
-This case has only outlets
-Case will be initialized with constant parameters
Hybrid initialization is done.
==============================================================================
==============================================================================
==============================================================================
Node 11: Process 34481: Received signal SIGSEGV.
Node 12: Process 34482: Received signal SIGSEGV.
Node 13: Process 34483: Received signal SIGSEGV.
==============================================================================
==============================================================================
==============================================================================
==============================================================================
==============================================================================
==============================================================================
Node 5: Process 34475: Received signal SIGSEGV.
==============================================================================
Node 8: Process 34478: Received signal SIGSEGV.
Node 3: Process 34473: Received signal SIGSEGV.
==============================================================================
==============================================================================
==============================================================================
==============================================================================
==============================================================================
Node 15: Process 34485: Received signal SIGSEGV.
Node 2: Process 34472: Received signal SIGSEGV.
==============================================================================
==============================================================================
==============================================================================
Node 0: Process 34470: Received signal SIGSEGV.
==============================================================================
Node 10: Process 34480: Received signal SIGSEGV.
==============================================================================
==============================================================================
Node 6: Process 34476: Received signal SIGSEGV.
==============================================================================
==============================================================================
==============================================================================
Node 9: Process 34479: Received signal SIGSEGV.
==============================================================================
Node 7: Process 34477: Received signal SIGSEGV.
==============================================================================
==============================================================================
Node 14: Process 34484: Received signal SIGSEGV.
==============================================================================
==============================================================================
Node 4: Process 34474: Received signal SIGSEGV.
==============================================================================
==============================================================================
Node 1: Process 34471: Received signal SIGSEGV.
==============================================================================
===============Message from the Cortex Process================================
Fatal error in one of the compute processes.
==============================================================================
-
April 11, 2019 at 9:46 pm
swheat
Subscriber
Jake,
We tried one other thing, to no avail. I used a script modification so that the batch submission step did not actually submit the job; that way I had a pristine setup environment in the RSM directory in which to run the fluent command. It loaded, but then didn't do anything.
Stephen
-
April 11, 2019 at 9:49 pm
swheat
Subscriber
Real quick: is there any way to tell it to use more than one core in serial mode? If so, we can limp along with the first iteration done on the client and the system doing the rest.
-
April 14, 2019 at 12:29 am
swheat
Subscriber
The problem has been identified; now we are looking for the solution. We found that the mpirun command used to launch Fluent takes the node name from whatever `hostname` returns somewhere in the scripts. The scheduler is configured with short versions of the hostnames, but `hostname` returns an FQDN, so when mpirun/srun tries to use it, no such resources exist.
There are two possible solutions: 1) change the cluster configuration so that `hostname` returns the short name, or 2) change the Ansys scripts (somewhere) to use the short version of `hostname`.
We're down to one simple question: how do we get the Ansys scripts to use the equivalent of `hostname -s`?
Is there a way to do that?
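To make the behavior concrete (using one of our node names as the example):
$ hostname
s7n3.titan.oru.edu      # FQDN, which the scheduler does not recognize as a resource
$ hostname -s
s7n3                    # short form, which matches the scheduler's node names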
-
April 15, 2019 at 1:15 pm
JakeC
Ansys Employee
Hi Stephen,
There are only a couple of user-editable locations where the hostname is gathered:
/ansys_inc/v193/RSM/Config/scripts/ClusterJobs.py and pbsParsing.py
I don't "think" you need to change anything in pbsParsing.py, but I'm not 100% sure where the fqdn is really causing you an issue.
However, if you search in ClusterJobs.py for "hostname" and "machineName" you will see how that is being used.
So you could look for any socket.gethostname() and add something like .split(".")[0]
For instance:
_currentMachineName = socket.gethostname()
Let's say that printing _currentMachineName in this case printed "test.nodes.edu".
That line would then become:
_currentMachineName = socket.gethostname().split(".")[0]
which would print "test".
Could you also post the final cluster command where you are seeing the FQDN hostnames?
You might be able to just do the parse/split on that final command, or just before it, so as to leave a smaller footprint for the modifications, but I'm not sure yet.
Thank you,
Jake
-
April 15, 2019 at 5:16 pm
swheat
Subscriber
Jake,
I have lost the results of the ps -aef that showed how fluent was being called.
In /tmp, the fluent-appfile.user.34355-style files have the hostname info; a sample result where it was using the FQDN is:
#Intel MPI machinefile
s7n3.titan.oru.edu
For successful runs, the name is just s7n3.
Does that help?
-
April 15, 2019 at 5:38 pm
JakeC
Ansys Employee
Hi Stephen,
I would start with the python modifications I mentioned and see what the final solver command looks like.
That might just be all you need to do.
Thank you,
Jake
-
April 16, 2019 at 6:53 pm
swheat
Subscriber
Jake,
I changed ClusterJobs.py as suggested. Some outputs, such as the std* files, show the shorter name being used. However, the actual mpiexec execution still has the FQDN. See this:
fluentu+ 74803 74798 0 17:01 ? 00:00:00 mpiexec.hydra -f /tmp/fluent-appfile.fluentuser.74769 -genv I_MPI_FABRICS shm -genv
There is more text after that, but the forum's "add post" function seemed to have a problem with the long line.
I could not figure out where/how to modify the other file.
I have reconfigured the cluster to use short hostnames. I needed to do some reconfiguration anyway in order to get some other features enabled.
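For reference, on a systemd-based node the persistent hostname can be switched to the short form like this (assuming systemd; older setups use /etc/hostname or /etc/sysconfig/network directly instead):
hostnamectl set-hostname s7n3      # per-node short name; DNS and /etc/hosts still carry the domain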
Let's mark this done. Thanks for all of the help.
Stephen
-
The topic ‘Job hang in RSM and Cluster’ is closed to new replies.