How to stop Lumerical FDTD from crashing on a cluster

- September 25, 2021 at 3:52 pm
  
  ihammond
  Subscriber
  
  I am running Lumerical FDTD on a cluster, but after running for a while the simulation always terminates with:
  "terminate called after throwing an instance of 'std::bad_alloc'
  what(): std::bad_alloc
  APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)"
  This looks like a memory issue, but I have allocated hundreds of GB per node and confirmed that this is plenty based on the memory estimate in the GUI. The crash happens randomly: of two identical simulations, one will sometimes crash while the other runs to completion. To work around this, I automated my Slurm submission file to keep rerunning the simulation file with the -resume flag for Lumerical FDTD, and configured my simulation to checkpoint every 10 minutes. This lets me finish a simulation no matter what, but it often requires many crashes and resumes. I would be fine continuing with this method, except that every time the simulation does not run successfully in one shot, it ruins the data. Somehow during the checkpoint and resume process, my monitor data is lost, my calculation of Purcell enhancement becomes wildly inaccurate, and my monitors lack any field data. Advice on how to prevent the crash would be ideal, but another solution would be to figure out why data is lost after a crash. Thanks!
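
The resume-on-crash workaround described above could be sketched as a small wrapper. This is a minimal sketch, not the poster's actual submission script: `run_until_success` and its parameters are hypothetical names, and the two command lists (the first launch, and the same launch with the `-resume` flag added) are assumed to be supplied by the caller for their own cluster setup.

```python
import subprocess
import sys


def run_until_success(first_cmd, resume_cmd, max_attempts=20):
    """Run first_cmd once, then resume_cmd until a run exits with code 0.

    first_cmd and resume_cmd are argument lists as accepted by
    subprocess.run. Returns the number of attempts used, or raises
    RuntimeError once max_attempts runs have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Only the very first attempt starts from scratch; every later
        # attempt uses the resume command so it continues from a checkpoint.
        cmd = first_cmd if attempt == 1 else resume_cmd
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} exited with code {result.returncode}; resuming",
            file=sys.stderr,
        )
    raise RuntimeError(f"simulation did not finish in {max_attempts} attempts")
```

Capping the number of attempts keeps a job that can never finish from looping until the Slurm time limit, and logging each exit code makes it easier to see afterwards how many resumes a given result went through.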
- September 27, 2021 at 4:16 pm
  
  Guilin Sun
  Ansys Employee
  
  Please check the memory requirements and post them here. Usually a sufficient amount of memory needs to be allocated to each node/process. In addition, please provide your operating system and software version information.
- September 27, 2021 at 4:38 pm
  
  ihammond
  Subscriber
  
  I am running on a supercomputer with Red Hat 7.8 and Slurm, using Lumerical FDTD 2021 R2 version 8.26.2717. I am using the attached Slurm submission script (submit.txt, normally a .bash file but renamed so it could be uploaded to this forum). The attached png is a screenshot of the memory requirements. Thanks!
- September 27, 2021 at 4:55 pm
  
  Guilin Sun
  Ansys Employee
  
  Thank you! Please disable all monitors, test on a single node/process/thread, and see what happens.
- September 27, 2021 at 4:57 pm
  
  ihammond
  Subscriber
  
  Okay I will do so and then post the results here afterwards. Thanks
- September 28, 2021 at 4:04 pm
  
  ihammond
  Subscriber
  
  It appears to work without any crashing on one node with the monitors disabled
- September 28, 2021 at 4:23 pm
  
  Guilin Sun
  Ansys Employee
  
  Since the file is not large, please test one full simulation on ONE node only; do not use more than one node. Try it and let me know if it works.
  Before you try, please copy all objects except the FDTD object, paste them into a new project file, and then add/modify an FDTD object there to do the test.
- September 28, 2021 at 4:30 pm
  
  ihammond
  Subscriber
  
  On one node, with monitors enabled this time, it appears as though the simulation will take 40 hours. Also, is the new FDTD object supposed to be any different from the one previously used?
- September 28, 2021 at 5:01 pm
  
  Guilin Sun
  Ansys Employee
  
  40 hours is an estimate based on the simulation time you set. The simulation can terminate early.
  When the simulation crashed on the cluster, where had the file been created? If it was created on a different computer, the crash could be due to a version issue. Make sure the two computers have the same version of the software.
- September 28, 2021 at 5:47 pm
  
  ihammond
  Subscriber
  
  I generate the script using the Python API, but it is run on the same computer and the same install of Lumerical. The only difference is that the API calls the GUI version, lumerical-fdtd-solutions, whereas the cluster calls fdtd-engine, which should be the same version, considering they are from the same install.
- September 29, 2021 at 4:02 pm
  
  ihammond
  Subscriber
  
  With monitors enabled on a single node, it runs without crashing
- September 29, 2021 at 4:09 pm
  
  Guilin Sun
  Ansys Employee
  
  Great! Sometimes the generated file may not be perfect for some unknown reason. It is unknown because the issue is not repeatable.
- September 29, 2021 at 4:11 pm
  
  ihammond
  Subscriber
  
  So is there not a way to run my scripts in parallel on multiple nodes/processes without crashing? One processor takes quite a bit of time
- September 29, 2021 at 4:23 pm
  
  Guilin Sun
  Ansys Employee
  
  Since this is a non-repeatable issue, you can try again. But for this specific case, the meshing itself only needs a very small amount of memory. It is the monitors that require a large amount of memory, so using more processes may cause some issues. You may try using more threads and see if that helps.
- September 29, 2021 at 4:27 pm
  
  ihammond
  Subscriber
  
  Interesting. Is this related to why crashing and loading checkpoints destroys monitor data?
- September 29, 2021 at 11:27 pm
  
  Guilin Sun
  Ansys Employee
  
  It is hard to say at this moment, as it is not repeatable. Checkpointing should not affect anything. If it does, then it is a bug. Please confirm whether the checkpoints create any issues by running different tests.
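
One way to run the kind of test suggested above is to compare monitor data from a run that finished in one shot against data from a run that went through checkpoint and resume. The following is a minimal sketch under stated assumptions: the data is assumed to have already been exported from each result file into plain dictionaries of flat value lists, and `compare_monitor_data` and the monitor names are hypothetical, not part of any Lumerical API.

```python
import cmath


def compare_monitor_data(reference, candidate, rel_tol=1e-6):
    """Compare two {monitor_name: list_of_values} dictionaries.

    Returns a dict mapping each monitor name in `reference` to one of
    "ok", "missing", "empty", "length mismatch", or "differs".
    Values may be real or complex numbers.
    """
    report = {}
    for name, ref_values in reference.items():
        values = candidate.get(name)
        if values is None:
            report[name] = "missing"          # monitor absent from candidate
        elif len(values) == 0:
            report[name] = "empty"            # monitor present but no data
        elif len(values) != len(ref_values):
            report[name] = "length mismatch"
        elif all(
            cmath.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-30)
            for a, b in zip(ref_values, values)
        ):
            report[name] = "ok"
        else:
            report[name] = "differs"
    return report


# Hypothetical example data: one-shot run versus resumed run.
one_shot = {"purcell": [1.0, 2.0], "field": [1 + 1j, 2 - 1j]}
resumed = {"purcell": [1.0, 2.0], "field": []}
print(compare_monitor_data(one_shot, resumed))
```

A report like this separates the two symptoms mentioned in the thread: monitors with no field data show up as "empty" or "missing", while an inaccurate derived quantity shows up as "differs".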
- October 1, 2021 at 6:41 pm
  
  Lito
  Ansys Employee
  
  "I generate the script using the Python API, but it is run on the same computer and the same install of Lumerical. The only difference is that the API calls the GUI version, lumerical-fdtd-solutions, whereas the cluster calls fdtd-engine, which should be the same version, considering they are from the same install."
  Do you have a GUI connection to the cluster and run the python script in your cluster to create the simulation file?
- October 1, 2021 at 7:46 pm
  
  ihammond
  Subscriber
  
  I use the Python API with the GUI to create the fsp file without any cluster, then use the attached submission script with Slurm on the cluster to run the simulation with the engine. Once it completes, I open the GUI again without the cluster.
- October 1, 2021 at 7:53 pm
  
  Lito
  Ansys Employee
  
  So you create the simulation file on your local computer using the script, and not from the Lumerical installation on the cluster? If this is the case, can you send the About page of the FDTD installation that you used to create the simulation file with your script?
- October 4, 2021 at 4:51 pm
  
  ihammond
  Subscriber
  
  Yes, sort of. The installation for the cluster is on the same computer as the local machine. I remote into the supercomputer (without using a cluster, just accessing the front end) and create the script using fdtd-solutions on the supercomputer. Then I run the file on the cluster (which is governed by the same computer and the same Lumerical installation), but this time I use fdtd-engine to run it because it is a cluster. This is the About page, and I have attached the Python script (and the supporting text files it calls) that creates the fsp file. The Python script is labeled python_script.
- October 5, 2021 at 12:08 am
  
  Lito
  Ansys Employee
  
  Based on your cluster submission script, you are running on only 1 node with 5 processes. Does the issue happen with monitors enabled when using more than 5 processes on 1 node? Or does it only happen when you use more than 1 node?
- October 5, 2021 at 12:10 am
  
  ihammond
  Subscriber
  
  The issue happens as long as there's more than 1 process, not necessarily dependent on the number of nodes.
- October 6, 2021 at 1:42 am
  
  Lito
  Ansys Employee
  
  Can you try running the FDTD simulation from this example? Run it with 6 processes on your cluster. Let us know if you run into the same issue.
- October 6, 2021 at 1:59 am
  
  ihammond
  Subscriber
  
  The download link for the zip file on the linked page appears to be broken. It says to log in to download, but I am already logged in and still cannot download it. Also, do you want 6 processes on one node or on multiple nodes?
- October 6, 2021 at 7:07 pm
  
  Lito
  Ansys Employee
  
  Access to the Lumerical Application Gallery requires support registration. Kindly register for support using your current/active Lumerical license, then download the example file. You can try running on either one or multiple nodes; let us know how it goes with the example simulation file.
  Best, Lito