• Senior Systems Software Engineer, AV…

    NVIDIA (Santa Clara, CA)
    …you will work with internal teams and external partners to integrate distributed systems, manage large-scale data pipelines, and operationalize next-generation ... pipelines using Go, Python, Bash, and Bazel to ensure reproducibility, efficiency, and reliable distributed execution. + Integrate simulation and drive logs (eg…
    NVIDIA (09/19/25)
  • Software Engineer - Distributed

    Rubrik (Palo Alto, CA)
    …/Kernel or Networking domain + Strong fundamentals in data structures, algorithms, and distributed systems design + Strong background in Systems Programming ... and CTO, our mission is to build a highly reliable, secure, and scalable software-defined platform. We are the ... Go, and either C++, Java, or Scala + Large distributed systems design and development experience is…
    Rubrik (08/07/25)
  • Senior Software Engineer, Distributed

    NVIDIA (Austin, TX)
    …from the crowd: + Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience ... part of a DGX Cloud team responsible for production systems that enable large scalable GPU clusters to be ... Bright Cluster Manager) + Proven operational excellence in maintaining reliable and performant AI infrastructure. NVIDIA is…
    NVIDIA (10/04/25)
  • Senior Distributed Software Engineer,…

    NVIDIA (Santa Clara, CA)
    …achieve this goal, we are looking for an engineer with a deep understanding of distributed systems, outstanding design skills, and a track record in building and ... the broader NVIDIA team to design and build a reliable, scalable, and efficient storage-as-a-service tailored to AI ... years of industry experience + Strong background in developing distributed systems involving Golang, Kubernetes, and Cloud…
    NVIDIA (08/08/25)
  • Research Intern - Reliability of Cloud…

    Microsoft Corporation (Redmond, WA)
    …healthcare, economics, and the environment. Are you passionate about building the future of reliable, large-scale cloud and AI systems? The Systems ... Interns to tackle cutting-edge challenges at the intersection of distributed systems, AI systems ... letter. **Preferred Qualifications** + Experience of building scalable and reliable systems. + Demonstrated ability to develop…
    Microsoft Corporation (09/30/25)
  • Senior Technical Systems AI

    NVIDIA (Santa Clara, CA)
    …design, or enterprise platform engineering. + Deep expertise in architecting large-scale distributed systems with a focus on reliability, performance, and ... record of publishing technical papers, architecture patterns, or thought leadership in AI systems. + Knowledge of observability tools, telemetry dashboards, and…
    NVIDIA (10/16/25)
  • Senior Systems Software Engineer, AI

    NVIDIA (Santa Clara, CA)
    …is ideal + Demonstrated ability in building scalable, agile, and robust distributed systems + Successful product rollouts and collaboration with early ... NVIDIA DGX Cloud is a fully managed, cloud-based AI supercomputing platform that provides organizations with direct ... Software Engineer with experience in building highly agile and reliable software to join us. We are building and…
    NVIDIA (10/15/25)
  • Senior Software Engineer, Agent Services (Core…

    Microsoft Corporation (Redmond, WA)
    …This is an opportunity to deepen your expertise in distributed systems, programming models, and multi-modal AI integration (text, audio, video), while ... in solving complex technical challenges in one or more domains such as distributed systems, AI/ML infrastructure, developer platforms, or cloud services.…
    Microsoft Corporation (10/13/25)
  • Software Engineer, SystemML - AI Networking

    Meta (Menlo Park, CA)
    …leverage our large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. ... learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or…
    Meta (10/16/25)
  • Senior Software Engineer, AI Resiliency

    NVIDIA (Santa Clara, CA)
    …and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear ... driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable ... detection. + Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and…
    NVIDIA (10/15/25)