- NVIDIA (Santa Clara, CA)
- …you will work with internal teams and external partners to integrate distributed systems, manage large-scale data pipelines, and operationalize next-generation ... pipelines using Go, Python, Bash, and Bazel to ensure reproducibility, efficiency, and reliable distributed execution. + Integrate simulation and drive logs (eg…
- NVIDIA (Santa Clara, CA)
- …design, or enterprise platform engineering. + Deep expertise in architecting large-scale distributed systems with a focus on reliability, performance, and ... record of publishing technical papers, architecture patterns, or thought leadership in AI systems. + Knowledge of observability tools, telemetry dashboards, and…
- Cisco (San Jose, CA)
- …platforms, such as AWS, Azure, or Google Cloud. + Understanding of distributed systems concepts, including scalability, reliability, fault tolerance, and data ... Team** Our dedicated team members are building the future of Cisco's AI-driven platforms and data infrastructure, supporting innovation across the globe. You will…
- NVIDIA (Austin, TX)
- …from the crowd: + Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience ... part of a DGX Cloud team responsible for production systems that enable large scalable GPU clusters to be...Bright Cluster Manager) + Proven operational excellence in maintaining reliable and performant AI infrastructure. NVIDIA is…
- NVIDIA (Santa Clara, CA)
- …hosts a heterogeneous mix of machines and devices with various operating systems (Windows/Linux/Android), a multitude of hardware platforms both NVIDIA GPUs and ... Tegra Processors. Are you passionate about distributed infrastructure and looking for sophisticated, critical issues, ready to build the next generation of cloud…
- Amazon (Seattle, WA)
- …base. You'll bring a passion for innovation, data, search, analytics, and distributed systems. You'll also: - Solve challenging technical problems, often ... one of several AWS tools used for building Generative AI on AWS. The Neuron Compiler Engineering team is...for identifying and designing solutions that enable efficient and reliable build, test, and release mechanisms for the Neuron…
- Oracle (Nashville, TN)
- …Work closely with a collaborative and experienced global team. - Expand your knowledge in AI, cloud computing, and distributed systems. - Contribute to one ... tools to operationalize Large Language Models (LLMs) and agentic AI systems. Our goal is to empower...will contribute to the design and implementation of scalable, distributed systems that serve LLMs and support…
- Oracle (San Juan, PR)
- …Work closely with a collaborative and experienced global team. - Expand your knowledge in AI, cloud computing, and distributed systems. - Contribute to one ... tools to operationalize Large Language Models (LLMs) and agentic AI systems. Our goal is to empower...will contribute to the design and implementation of scalable, distributed systems that serve LLMs and support…
- NVIDIA (Santa Clara, CA)
- …and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear ... driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable...detection. + Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and…
- Oracle (Columbus, OH)
- …. This is a highly technical, hands-on role where you'll build large-scale distributed systems, optimize AI/ML workflows, and collaborate with ... observability, CI/CD pipelines, and operational excellence. Troubleshoot complex issues in distributed systems and participate in on-call rotations as needed.…