Photonics

Topics related to Lumerical and more

Lumerical cluster job crashes arbitrarily

    • Vighnesh Natarajan
      Subscriber

      I run many parallel jobs on a cluster. Sometimes, for no apparent reason, a job crashes citing "std::bad_alloc" with no further information. The RAM is sufficient: I have allocated 10x the memory that the requirements ask for, and I have monitored with htop that this limit is never reached. The crash does not happen if I use a single core with the local computer resource setting, but it does happen on parallel Slurm jobs with multiple cores. I need multiple cores to speed up the simulations.

      The crash is not reproducible: if I run the same job again (same simulation with the same resources in the resource manager), it sometimes runs and sometimes does not. This leads me to believe it has nothing to do with RAM and is some other problem.

      Sometimes the issue is tolerable, with many job submissions completing without a crash, but at other times it crashes quite often, which is not desirable and interrupts the simulation sweeps that I run.
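
      (Besides watching htop live, peak usage can also be checked after a job finishes from Slurm accounting, if accounting is enabled on the cluster; the job ID below is a placeholder:)

      ## peak resident memory, requested memory, and exit status of a finished Slurm job ##
      sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State,ExitCode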

      This is with Lumerical version 2022 R2.

    • Lito
      Ansys Employee

      @Vighnesh Natarajan, 

      I am sorry to hear that you are having issues running Lumerical simulations on your cluster. To debug/troubleshoot the issue: 

      • What is the operating system and version running on the cluster? 
      • Which Lumerical product/solver are you running? FDTD, MODE, CHARGE, HEAT, or INTERCONNECT?  
      • How are you running the simulation job? Are you using a job scheduler submission script and running this from the Terminal on the cluster head/login node? 
      • What is the job scheduler on your cluster - if you are using one? 
      • Please paste the execution line in the submission script, or the command you run on the cluster if you are not submitting to a job scheduler.
        (e.g. running FDTD with bundled MPICH2)
      /install_path/lumerical/v222/mpich2/nemesis/bin/mpiexec -n 32 /install_path/lumerical/v222/bin/fdtd-engine-mpich2nem -t 1 /path_to/simulationfile.fsp

       

    • Vighnesh Natarajan
      Subscriber

      Hello,

      • The operating system is Ubuntu 20.04.6 LTS (Focal Fossa). I log in via SSH and can open a GUI window using X11 forwarding.
      • This issue comes up with FDTD.
      • The issue occurs only when I use SLURM as the job scheduler. In the resource manager in the GUI, I use SLURM as the job launcher and have set up a batch script that runs the job. An example is below (a sketch of the full submission script follows this list):
        • the sbatch command is "sbatch --mem=84G --time=48:00:00 -N 12 -n 96 --ntasks-per-node=8"
        • this invokes "mpirun -np 96 --use-hwthread-cpus /share/apps/lumerical/v222/bin/fdtd-engine-ompi-lcl -logall -remote {PROJECT_FILE_PATH}"
      • SLURM is the only way I can submit large batch jobs to the cluster across multiple nodes. If I set the resource manager to "Local computer", it uses the mpich2nem solver and never crashes, but it uses only one core - the core from which I launched the GUI in the remote SSH session.
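
      For reference, the pieces above combine into a submission script roughly like the following (a sketch only; the same resources are written as #SBATCH directives, and {PROJECT_FILE_PATH} is the placeholder filled in by the resource manager):

      #!/bin/bash
      #SBATCH --mem=84G
      #SBATCH --time=48:00:00
      #SBATCH -N 12
      #SBATCH -n 96
      #SBATCH --ntasks-per-node=8

      ## launch the FDTD engine over 96 MPI ranks with the cluster's OpenMPI ##
      mpirun -np 96 --use-hwthread-cpus /share/apps/lumerical/v222/bin/fdtd-engine-ompi-lcl -logall -remote {PROJECT_FILE_PATH}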

      Hope this information is useful in helping diagnose the problem.

    • Lito
      Ansys Employee

      @Vighnesh Natarajan, 

      • Are you running the job from the Lumerical CAD/GUI on the master/login node on the cluster? Or are you "sending the job to the cluster" from your local computer?
      • Which version of Lumerical are you using? e.g. latest release, 2024 R1.2.
      • Send a screenshot of your "Resources advanced options". (similar to the image below)
        See this KB for the job integration details on the cluster. >>Lumerical job scheduler integration (Slurm, Torque, LSF, SGE) – Ansys Optics

      [Image: res-config-sweeps-slurm-integration.png]

       

    • Vighnesh Natarajan
      Subscriber

      Hello,

      • I run the job from a GUI launched on a login node of the cluster. The resource manager in the GUI is set to use SLURM, which submits a batch job to the cluster.
      • The Lumerical version is 2022 R2 (from the About page under Help in the GUI).
      • I have attached an image below. OpenMPI is already loaded, version 4.1.0. I have seen that KB and think I am following the right guidelines; the script I use does run the job. The main mystery is that the job occasionally crashes with the error "std::bad_alloc": as stated earlier, it runs about 90% of the time and fails about 10% of the time. The memory allocated for the job is at least 10x what it needs (what I observe is described in more detail in the first message of this post), and the same job does not crash with a single core on the local computer, it just takes forever, so it is definitely not a RAM issue. But I am unable to figure out why there is a bad_alloc or where it is happening. I have also attached below an image of the error thrown in the .out file of a job (a quick check for spotting the failed jobs in a sweep is sketched after this list).
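
      (A quick way to see which jobs in a sweep hit the error, assuming the default slurm-<jobid>.out naming; adjust the pattern to the actual output file names:)

      ## list the Slurm output files that contain the bad_alloc error ##
      grep -l "std::bad_alloc" slurm-*.out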

    • Lito
      Ansys Employee

      @Vighnesh Natarajan, 

      Try using 'mpiexec' with OpenMPI instead of 'mpirun', as shown in the example in our KB.
      >>Running simulations with MPI on Linux – Ansys Optics 

      ## using the install path for OpenMPI shown ##
      /usr/lib64/openmpi/bin/mpiexec -n 96 /share/apps/lumerical/v222/bin/fdtd-engine-ompi-lcl -t 1 {PROJECT_FILE_PATH}

      Have you tried running with the Lumerical bundled MPICH2? Does it have the same issue?  

      /share/apps/lumerical/v222/mpich2/nemesis/bin/mpiexec -n 96 /share/apps/lumerical/v222/bin/fdtd-engine-mpich2nem -t 1 {PROJECT_FILE_PATH}
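
      If you keep submitting through Slurm, a minimal sketch of that change is to replace only the launch line inside the existing batch script with the bundled MPICH2 line below (kept exactly as above; note that the bundled mpiexec may need a host file or extra options to span multiple nodes, depending on the cluster setup):

      ## inside the existing sbatch script, replacing the mpirun line with the bundled MPICH2 launcher ##
      /share/apps/lumerical/v222/mpich2/nemesis/bin/mpiexec -n 96 /share/apps/lumerical/v222/bin/fdtd-engine-mpich2nem -t 1 {PROJECT_FILE_PATH}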

       

    • Vighnesh Natarajan
      Subscriber

       

      Hello,

      Thank you very much for your suggestions. I may have tried mpiexec, but I do not recall. I have not tried the Lumerical bundled MPICH2 yet. I will try your suggestions and report back here on how they work. Thank you very much.

       

       
