Ansys Products

Discuss installation & licensing of our Ansys Teaching and Research products.

Setting up DCS services for a cluster

    • dbhands
      Subscriber

      Hello, I currently have an HPC Slurm cluster that users submit Fluent simulations to through RSM, and that works well. When submitting parameter sets, the meshing and solving tasks are submitted together with the same core count, and I would like the ability to specify a different machine for meshing than for solving.

      To address this I have set up a DCS server on the head node of the cluster, a DC Evaluator on the same head node, and a DC Evaluator on a Windows server for geometry updates (SpaceClaim). The goal is to have the DCE on the head node submit the run to the Slurm cluster partition. When submitting, the project updates the geometry, then starts updating the solution for about 1.5 minutes before Workbench reports that the project failed and to check DPS for more information. The error log from DPS is posted below. When the DCE is set to solve directly, the solution calculates as expected; however, I need to be able to run across the cluster.

      What can I do to fix this error? In addition, it seems that meshing and solution are bundled into the same task in DPS. Is it possible to break them into two tasks so they can be run on different machines or with different core counts?

       

      output from DPS:

      Job is running on hostname node00.*****. (removed from this post)
      Job user from this host: ********* (removed from this post)
      Starting directory: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
      Reading control file /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/control_ffcba1bc-b678-4069-80e1-67b0aa785470.rsm .... 
      Correct Cluster verified
      Cluster Type: SLURM
      Underlying Cluster: SLURM
          RSM_CLUSTER_TYPE = SLURM
      Compute Server is running on NODE00.REDACTED.CR
      Reading commands and arguments...
          Command 1: C:\Program Files\ANSYS Inc\v222\Framework\bin\Win64\runwb2.bat, arguments: -B -R "test2d_Workbench_Solution.wbjn" -Z Dpdb.EvaluatorRun,Dpdb.EvaluatingProjectUpdate --output "test2d_Workbench_Solution_log.txt", redirectFile: None
      Running from shared staging directory ...
          RSM_USE_LOCAL_SCRATCH = False
          RSM_LOCAL_SCRATCH_DIRECTORY = 
          RSM_LOCAL_SCRATCH_PARTIAL_UNC_PATH = 
      Cluster Shared Directory: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
          RSM_SHARE_STAGING_DIRECTORY = /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
      Job file clean up: True
      Use SSH on Linux cluster nodes: True
          RSM_USE_SSH_LINUX = True
      LivelogFile: NOLIVELOGFILE
      StdoutLiveLogFile: stdout_ffcba1bc-b678-4069-80e1-67b0aa785470.live
      StderrLiveLogFile: stderr_ffcba1bc-b678-4069-80e1-67b0aa785470.live 
      Reading input files...
          test2d.wbpz
          test2d_Workbench_Solution.wbjn
          test2d_Workbench_Geometry.wppz
      Reading cancel files...
          *.abt
      Reading output files...
          test2d_output.wbpz
          test2d_Workbench_Solution_log.txt
          console_output.txt
          *.out
          *.trn
          *.log
          *.txt
      Reading exclude files...
          persistedStorage/*
          stdout_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          stderr_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          control_ffcba1bc-b678-4069-80e1-67b0aa785470.rsm
          hosts.dat
          exitcode_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          exitcodeCommands_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          stdout_ffcba1bc-b678-4069-80e1-67b0aa785470.live
          stderr_ffcba1bc-b678-4069-80e1-67b0aa785470.live
          ClusterJobCustomization.xml
          ClusterJobs.py
          clusterjob_ffcba1bc-b678-4069-80e1-67b0aa785470.sh
          clusterjob_ffcba1bc-b678-4069-80e1-67b0aa785470.bat
          inquire.request
          inquire.confirm
          request.upload.rsm
          request.download.rsm
          wait.download.rsm
          scratch.job.rsm
          volatile.job.rsm
          restart.xml
          cancel_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          liveLogLastPositions_ffcba1bc-b678-4069-80e1-67b0aa785470.rsm
          stdout_ffcba1bc-b678-4069-80e1-67b0aa785470_kill.rsmout
          stderr_ffcba1bc-b678-4069-80e1-67b0aa785470_kill.rsmout
          sec.interrupt
          stdout_ffcba1bc-b678-4069-80e1-67b0aa785470_*.rsmout
          stderr_ffcba1bc-b678-4069-80e1-67b0aa785470_*.rsmout
          stdout_task_*.live
          stderr_task_*.live
          control_task_*.rsm
          stdout_task_*.rsmout
          stderr_task_*.rsmout
          exitcode_task_*.rsmout
          exitcodeCommands_task_*.rsmout
          persistedStorage/*
          *.abt
      Reading environment variables...
          ANSYS_FRAMEWORK_UNDER_RSM = True
          ANSYS_FRAMEWORK_DEVELOPMENT = 1
          ANSYS_TEST_ME = 2
          ANSYS_FRAMEWORK_UNDER_RSM = True
          ANSYS_FRAMEWORK_DEVELOPMENT = 1
          ANSYS_TEST_ME = 2
          RSM_IRON_PYTHON_HOME = /ansys_inc/v222/commonfiles/IronPython
          RSM_TASK_WORKING_DIRECTORY = /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
          RSM_USE_SSH_LINUX = True
          RSM_QUEUE_NAME = Aero
          RSM_CONFIGUREDQUEUE_NAME = Aero
          RSM_COMPUTE_SERVER_MACHINE_NAME = node00.redacted.cr
          RSM_HPC_JOBNAME = RemoteJobName
          RSM_HPC_DISPLAYNAME = task_2
          RSM_HPC_CORES = 94
          RSM_HPC_DISTRIBUTED = TRUE
          RSM_HPC_NODE_EXCLUSIVE = FALSE
          RSM_HPC_QUEUE = Aero
          RSM_HPC_USER = redacted
          RSM_HPC_WORKDIR = /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
          RSM_HPC_JOBTYPE = NotUsed
          RSM_HPC_ANSYS_LOCAL_INSTALL_DIRECTORY = /ansys_inc/v222
          RSM_HPC_VERSION = 222
          RSM_HPC_STAGING = /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
          RSM_HPC_LOCAL_PLATFORM = Linux
          RSM_HPC_CLUSTER_TARGET_PLATFORM = Linux
          RSM_HPC_STDOUTFILE = stdout_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          RSM_HPC_STDERRFILE = stderr_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          RSM_HPC_STDOUTLIVE = stdout_ffcba1bc-b678-4069-80e1-67b0aa785470.live
          RSM_HPC_STDERRLIVE = stderr_ffcba1bc-b678-4069-80e1-67b0aa785470.live
          RSM_HPC_SCRIPTS_DIRECTORY_LOCAL = /ansys_inc/v222/RSM/Config/scripts
          RSM_HPC_SCRIPTS_DIRECTORY = /ansys_inc/v222/RSM/Config/scripts
          RSM_HPC_SUBMITHOST = 10.115.50.220
          RSM_HPC_STORAGEID =    ec2368dc-fca3-4690-80ef-8d47d4885614    RsmJobRunnerStorage=LocalOS#CRAC$\\192.0.0.100\ansys\qfb2scrd.gmu    Friday, March 01, 2024 09:43:55.922 AM    True
          RSM_HPC_PLATFORMSTORAGEID = \\192.0.0.100\ansys\qfb2scrd.gmu
          RSM_HPC_NATIVEOPTIONS = 
          ARC_ROOT = /ansys_inc/v222/RSM/Config/scripts/../../ARC
          RSM_HPC_KEYWORD = SLURM
          RSM_PYTHON_LOCALE = en-us
      Reading AWP_ROOT environment variable name ...
          AWP_ROOT environment variable name is: AWP_ROOT222
      Reading Low Disk Space Warning Limit ...
          Low disk space warning threshold set at: 2.0GiB
      Reading File identifier ...
          File identifier found as: ffcba1bc-b678-4069-80e1-67b0aa785470
      Done reading control file.
      RSM_AWP_ROOT_NAME = AWP_ROOT222
      AWP_ROOT222 install directory: /ansys_inc/v222
      SLURM_JOB_NODELIST = node00.redacted.cr,node[01-03]<
      SLURM_TASKS_PER_NODE = 22,24(x3)<
      RSM_MACHINES = node00.redacted.cr:22:node01:24:node02:24:node03:24
      ALTERNATE_MACHINES = node00.redacted.cr:22:node01:24:node02:24:node03:24
      Number of nodes assigned for current job = 4 
      Machine list: ['node00.redacted.cr', 'node01', 'node02', 'node03'] 
      Start running job commands ...
      Running on machine : node00.redacted.cr
      Current Directory: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu
      Running command: C:\Program Files\ANSYS Inc\v222\Framework\bin\Win64\runwb2.bat -B -R "test2d_Workbench_Solution.wbjn" -Z Dpdb.EvaluatorRun,Dpdb.EvaluatingProjectUpdate --output "test2d_Workbench_Solution_log.txt"
      Redirecting output to  None
      Final command arg list : ['C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat', '-B', '-R', 'test2d_Workbench_Solution.wbjn', '-Z', 'Dpdb.EvaluatorRun,Dpdb.EvaluatingProjectUpdate', '--output', 'test2d_Workbench_Solution_log.txt']
      Running Process
      ** Traceback (most recent call last):
      **   File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 292, in main
          exitCodeList, exitCode = runCommandList(_commandList, _commandArgList, _commandRedirectList, _commandProgressMonitoringFlags, _targetCluster, _usingLocalScratch)
      **   File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1189, in runCommandList
          cFiles, cmdProgMonFlagList[cmdIndex], enablePrints)
      **   File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1214, in runCommand
          clusterCmd.begin()
      **   File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1802, in begin
          self.process = subprocess.Popen(self.argList, bufsize=-1, stdin=subprocess.PIPE, stdout=stdoutStream, stderr=stderrStream, cwd=os.getcwd(), universal_newlines=True)
      **   File "/ansys_inc/v222/commonfiles/CPython/3_7/linx64/Release/python/lib/python3.7/subprocess.py", line 800, in __init__
          restore_signals, start_new_session)
      **   File "/ansys_inc/v222/commonfiles/CPython/3_7/linx64/Release/python/lib/python3.7/subprocess.py", line 1551, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      ** FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat': 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat'
      Saving exit code file: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/exitcode_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          Exit code file: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/exitcode_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout has been created.
      Saving exit code file: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/exitcodeCommands_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout
          Exit code file: /HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/exitcodeCommands_ffcba1bc-b678-4069-80e1-67b0aa785470.rsmout has been created.
      ClusterJobs Exiting with code: 9999
      Individual Command Exit Codes are: [None]
      Fatal error when running job command(s).
      [Errno 2] No such file or directory: 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat': 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat'
      A Job command did not return exit code. The job will fail with exit code 9999
      Traceback (most recent call last):
        File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 292, in main
          exitCodeList, exitCode = runCommandList(_commandList, _commandArgList, _commandRedirectList, _commandProgressMonitoringFlags, _targetCluster, _usingLocalScratch)
        File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1189, in runCommandList
          cFiles, cmdProgMonFlagList[cmdIndex], enablePrints)
        File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1214, in runCommand
          clusterCmd.begin()
        File "/ansys_inc/v222/RSM/Config/scripts/ClusterJobs.py", line 1802, in begin
          self.process = subprocess.Popen(self.argList, bufsize=-1, stdin=subprocess.PIPE, stdout=stdoutStream, stderr=stderrStream, cwd=os.getcwd(), universal_newlines=True)
        File "/ansys_inc/v222/commonfiles/CPython/3_7/linx64/Release/python/lib/python3.7/subprocess.py", line 800, in __init__
          restore_signals, start_new_session)
        File "/ansys_inc/v222/commonfiles/CPython/3_7/linx64/Release/python/lib/python3.7/subprocess.py", line 1551, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat': 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat'

       
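      For context, the failure at the end of the log is the generic Python behavior when subprocess.Popen on a Linux node is handed a Windows-style executable path. A minimal sketch (paths copied from the log, run on any Linux machine) reproduces the same FileNotFoundError:

      import subprocess

      # Sketch: on a Linux machine, Popen raises FileNotFoundError when the
      # executable path does not exist there -- which is what happens when a
      # Windows-style command (runwb2.bat) is dispatched to a Linux cluster node.
      win_cmd = r"C:\Program Files\ANSYS Inc\v222\Framework\bin\Win64\runwb2.bat"
      try:
          subprocess.Popen([win_cmd])
      except FileNotFoundError as exc:
          print(exc)  # [Errno 2] No such file or directory: 'C:\\Program Files\\...'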

    • MangeshANSYS
      Ansys Employee

      Hello,

      Please check the target platform mix.

      The log shows:
       RSM_HPC_CLUSTER_TARGET_PLATFORM = Linux


      but then further down I see:

      **   File "/ansys_inc/v222/commonfiles/CPython/3_7/linx64/Release/python/lib/python3.7/subprocess.py", line 1551, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      ** FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat': 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat'


      and further down:
      [Errno 2] No such file or directory: 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat': 'C:\\Program Files\\ANSYS Inc\\v222\\Framework\\bin\\Win64\\runwb2.bat'
      A Job command did not return exit code. The job will fail with exit code 9999
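
      A quick way to confirm the mismatch on the cluster node is to check that the command RSM generated points at the Linux install rather than a Windows one. A sketch follows; the Linux64 launcher sub-path is an assumption based on the AWP_ROOT222 value reported in the log:

      import os

      # Sketch: verify the Linux Workbench launcher exists under the install
      # directory reported in the log (AWP_ROOT222 = /ansys_inc/v222); the exact
      # Framework/bin/Linux64 sub-path is an assumption.
      awp_root = "/ansys_inc/v222"
      linux_launcher = os.path.join(awp_root, "Framework", "bin", "Linux64", "runwb2")
      print(linux_launcher, "exists:", os.path.isfile(linux_launcher))

      # The command seen in the log instead starts with a Windows path, which can
      # never resolve on a Linux node:
      bad_cmd = r"C:\Program Files\ANSYS Inc\v222\Framework\bin\Win64\runwb2.bat"
      print("looks like a Windows path:", bad_cmd.lower().startswith("c:\\"))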


    • dbhands
      Subscriber

      Hello,
      Thanks for the reply. I noticed this as well. Where would I check this setting? I intend to run on Linux, so the Windows file paths are incorrect. In the DC Evaluator settings on the head node of the Slurm cluster, the machine is set to Linux and solves when set to direct. When set to submit to RSM, however, the submission fails with the error above. When set to RSM, the only other option I see is "queue", where I have typed in the name of one of the RSM queues. I do not see anywhere to specify the platform type; however, I would assume this should be handled by the RSM head node on the Slurm cluster as it is during normal submissions to the RSM head node.
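
      While tracking down where the setting lives, one way to see which platform the command was generated for is to scan the control file that RSM drops in the staging directory (path taken from the log above; treating the file as plain text is an assumption):

      # Sketch: scan the RSM control file in the staging directory (path from the
      # log) for Windows-style paths, which would indicate the evaluator that
      # generated the job command was configured for a Windows install.
      control = "/HARDDRIVE/ANSYS/Staging/qfb2scrd.gmu/control_ffcba1bc-b678-4069-80e1-67b0aa785470.rsm"
      with open(control, errors="ignore") as fh:
          for lineno, line in enumerate(fh, 1):
              if "C:\\" in line or "Win64" in line:
                  print(lineno, line.rstrip())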

    • MangeshANSYS
      Ansys Employee

      Can you please explain the setup and which machine is set to what?
      1. Which computer is the project being opened and submitted from? What is the OS? What is the DCS setting?
      2. Which is the computer where meshing needs to happen? What is the OS?

      3. How and where is the submission to Slurm configured?

      Please add screenshots, obscuring any information that should not be on a public forum.

       

  • The topic ‘Setting up DCS services for a cluster’ is closed to new replies.