


{"id":363324,"date":"2024-04-20T17:43:39","date_gmt":"2024-04-20T17:43:39","guid":{"rendered":"\/forum\/forums\/reply\/363324\/"},"modified":"2024-04-20T17:43:39","modified_gmt":"2024-04-20T17:43:39","slug":"363324","status":"publish","type":"reply","link":"https:\/\/innovationspace.ansys.com\/forum\/forums\/reply\/363324\/","title":{"rendered":"Reply To: Lumerical cluster job crashes arbitrarily"},"content":{"rendered":"<p>Hello,<\/p>\n<ul>\n<li>I run the job from a GUI launched on a login node of the cluster. The resource manager in the GUI is set to use SLURM, which submits a batch job to the cluster.<\/li>\n<li>The Lumerical version is 2022 R2 (from the About page under Help in the GUI).<\/li>\n<li>I have attached an image below. Open MPI is already loaded, version 4.1.0. I have seen that KB, and I think I am following the right guidelines: the script I use does run the job. The main mystery is that the job occasionally crashes with the error &#8220;std::bad_alloc&#8221;. As stated earlier, the job runs 90% of the time and fails 10%. The memory allocated for the job is at least 10x what it would need (what I observe is described in more detail in the first message of this thread), and the same job would not crash with a single core on a local computer; it would just take forever. So it is definitely not a RAM issue, but I&#8217;m unable to figure out why there is a bad_alloc or where it is happening. 
I have also attached below an image of the error thrown in the .out file of a job.<\/li>\n<\/ul>\n<ul>\n<li><img loading=\"lazy\" decoding=\"async\" src=\"\/forum\/wp-content\/uploads\/sites\/2\/2024\/04\/20-04-2024-1713634416-mceclip0.png\" width=\"865\" height=\"507\" \/><\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"\/forum\/wp-content\/uploads\/sites\/2\/2024\/04\/20-04-2024-1713635014-mceclip0.png\" \/><\/p>\n","protected":false},"template":"","class_list":["post-363324","reply","type-reply","status-publish","hentry"],"aioseo_notices":[],"acf":[],"_links":{"self":[{"href":"https:\/\/innovationspace.ansys.com\/forum\/wp-json\/wp\/v2\/replies\/363324","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/innovationspace.ansys.com\/forum\/wp-json\/wp\/v2\/replies"}],"about":[{"href":"https:\/\/innovationspace.ansys.com\/forum\/wp-json\/wp\/v2\/types\/reply"}],"version-history":[{"count":0,"href":"https:\/\/innovationspace.ansys.com\/forum\/wp-json\/wp\/v2\/replies\/363324\/revisions"}],"wp:attachment":[{"href":"https:\/\/innovationspace.ansys.com\/forum\/wp-json\/wp\/v2\/media?parent=363324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}