We’re updating our badges platform. Badge issuance is temporarily paused, but all completions are being recorded and will be fulfilled once the platform is live. Thank you for your patience.

Senior R&D Engineer (17197)

Key Duties and Responsibilities 

  • Deploys, configures, and maintains cloud-based HPC infrastructure
  • Implement and manage automation for infrastructure using IaC tools such as Terraform, Spacelift, Helm, etc.
  • Manage cluster lifecycle including scaling, upgrades, patching, and monitoring for SLURM and Kubernetes based clusters
  • Monitors and responds to HPC issues
  • Works with observability tools like Datadog to improve response time to issues and outages
  • Develop and maintain internal documentation, operational runbooks, and support playbooks
  • Works independently with minimal supervision and may take on some planning and mentoring responsibilities
  • May be responsible for managing interns or co-ops but typically does not have direct reports
    0
    Your Backpack
    Your backpack is empty