  • Senior DGX Cloud Software Engineer

    NVIDIA (Santa Clara, CA)
    …operational capacity of our bare-metal, accelerated compute infrastructure and codify reliability best practices in the broader DGX Cloud platform ecosystem. What ... working with or developing multi-cloud infrastructure services. Experience teaching reliability engineering (e.g., SRE) and/or other scale-oriented cloud systems…
    NVIDIA (07/26/25)
  • Sr. Utilities Engineer

    Florida Crystals Corporation (Crockett, CA)
    …RESPONSIBILITIES + Drives utility improvements in cost, availability, and reliability, while ensuring full regulatory compliance, including environment, health, ... ability to support equipment and technology. + Develops and reviews operation, reliability, and maintenance reports and statistics to plan and modify utilities…
    Florida Crystals Corporation (07/19/25)
  • Principal Firmware Engineer - Data Center…

    NVIDIA (Santa Clara, CA)
    …other leads to design & build data center health management workflow. + Drive reliability and optimization in firmware architecture from a data center viewpoint. + ... Speed of Light + Own firmware delivered to data centers in terms of quality, reliability and telemetry performance. What we need to see: + 15+ years of relevant…
    NVIDIA (07/11/25)
  • Senior GPU and HPC Infrastructure Engineer

    NVIDIA (Santa Clara, CA)
    …+ Implement monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... systems (Kubernetes, SLURM) + Understanding of performance, security and reliability in complex distributed systems. Familiarity with system-level architecture,…
    NVIDIA (07/10/25)
  • Sr. Hardware Development Engineer / Signal…

    Amazon (Cupertino, CA)
    …systems that validate hardware quality in manufacturing; monitoring and improving hardware reliability in data centers and platform. We cover everything from low ... while taking into consideration our customer needs from a cost, performance, and reliability perspective. About the team Why AWS Amazon Web Services (AWS) is the…
    Amazon (07/05/25)
  • Sr. Hardware Development Engineer - PCIe,…

    Amazon (Cupertino, CA)
    …systems that validate hardware quality in manufacturing; monitoring and improving hardware reliability in data centers and platform. We cover everything from low ... while taking into consideration our customer needs from a cost, performance, and reliability perspective. About the team Within AWS AWS Board Core Design & Services…
    Amazon (07/05/25)
  • Distributed Systems Engineer, Cassandra,…

    DoorDash (San Francisco, CA)
    …operating a high-performance, scalable, reliable data abstraction layer that optimizes reliability and efficiency. You will help us bootstrap and scale our internal ... distributed database infrastructure centered around Cassandra, with a focus on reliability, operability, security and efficiency. This infrastructure will be the…
    DoorDash (07/04/25)
  • Staff Software Engineer, Search Platform

    DoorDash (Sunnyvale, CA)
    …and support strategic initiatives in the search platform. + Build for scale, reliability, and efficiency - Lead design and implementation of critical components to ... or equivalent + 8+ years of industry experience + Passion for reliability & performance - you've built and operated large-scale low-latency distributed systems…
    DoorDash (07/02/25)
  • Senior Software Engineer, Bare Metal…

    NVIDIA (Santa Clara, CA)
    …Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be ... Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving…
    NVIDIA (06/30/25)
  • Staff Software Engineer, Experimentation…

    DoorDash (San Francisco, CA)
    …platform into one that can operate at massive scale with unwavering reliability. + Enjoy solving complex distributed systems problems where reliability, ... latency, throughput, and correctness are all critically important. + Are driven by impact at scale, knowing that your work will directly influence decision-making across the company. + Get energy from simplifying complex architectures without compromising…
    DoorDash (06/27/25)