Senior R&D Engineer (17197)
Key Duties and Responsibilities
- Deploys, configures, and maintains cloud-based HPC infrastructure
- Implements and manages infrastructure automation using IaC tools such as Terraform, Spacelift, and Helm
- Manages the cluster lifecycle, including scaling, upgrades, patching, and monitoring, for SLURM- and Kubernetes-based clusters
- Monitors and responds to HPC issues
- Works with observability tools such as Datadog to shorten response times to issues and outages
- Develops and maintains internal documentation, operational runbooks, and support playbooks
- Works independently with minimal supervision and may take on some planning and mentoring responsibilities
- May be responsible for managing interns or co-ops but typically does not have direct reports