- NVIDIA (Santa Clara, CA)
- …you will work with internal teams and external partners to integrate distributed systems, manage large-scale data pipelines, and operationalize next-generation ... pipelines using Go, Python, Bash, and Bazel to ensure reproducibility, efficiency, and reliable distributed execution. + Integrate simulation and drive logs (e.g.…
- NVIDIA (Austin, TX)
- …from the crowd: + Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience ... part of a DGX Cloud team responsible for production systems that enable large scalable GPU clusters to be...Bright Cluster Manager) + Proven operational excellence in maintaining reliable and performant AI infrastructure. NVIDIA is…
- Cisco (Dallas, TX)
- …platforms, such as AWS, Azure, or Google Cloud. + Understanding of distributed systems concepts, including scalability, reliability, fault tolerance, and data ... Team** Our dedicated team members are building the future of Cisco's AI-driven platforms and data infrastructure, supporting innovation across the globe. You will…
- NVIDIA (Santa Clara, CA)
- …design, or enterprise platform engineering. + Deep expertise in architecting large-scale distributed systems with a focus on reliability, performance, and ... record of publishing technical papers, architecture patterns, or thought leadership in AI systems. + Knowledge of observability tools, telemetry dashboards, and…
- LinkedIn (Mountain View, CA)
- …such as Scala or other relevant coding languages + Hands-on experience developing distributed systems or other large-scale systems. Preferred Qualifications ... in production. Why join us: If you're passionate about AI infra, scalable evaluation systems, or model...Beam, Spark etc., feature engineering, + Experience with search systems or similar large-scale distributed systems…
- NVIDIA (Santa Clara, CA)
- …and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear ... driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable...detection. + Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and…
- Amazon (Redmond, WA)
- …is for a Software Engineer who will design, implement, and operate globally distributed systems that enable Leo to achieve low single-digit-second query ... real-time analytics layer or lakehouse, and to support agentic AI capabilities on top. You'll build these systems...user experience in real time. We combine expertise in distributed systems, data lakehouse architectures, and applied…
- Walmart (Sunnyvale, CA)
- …build dynamic, context-aware systems. 2. **Architecture & Scalability:** + Architect scalable, distributed AI systems with a focus on performance, fault ... to lead the design, development, and deployment of advanced AI systems. This role involves architecting scalable...Walmart GTP, you will be building highly scalable and reliable APIs, services and applications which will drive the…
- Amazon (Redmond, WA)
- …You will lead your team to design, implement, and operate globally available distributed systems geared towards enabling Leo to achieve low single-digit-second ... real-time analytics layer or lakehouse, and to support agentic AI capabilities on top. You'll build these systems...user experience in real time. We combine expertise in distributed systems, data lakehouse architectures, and applied…
- GE Vernova (Niskayuna, NY)
- …neural network architectures (e.g., CNNs, RNNs, Transformers). + Expertise in designing scalable, distributed architectures for AI systems. + Strong experience ... Azure, GCP) and containerization (Kubernetes, Docker). + Familiarity with large-scale distributed systems and database technologies. + Experience in creating…