- Broadcom (Palo Alto, CA)
- …solutions. We are dedicated to building robust, scalable, and high-performance distributed systems that empower enterprises to achieve their digital transformation ... for defining the technical vision, architecting, and leading the implementation of complex distributed systems that are central to our VCF offerings. You will work…
- NVIDIA (Santa Clara, CA)
- …We are looking for an engineer who has a deep understanding of distributed systems development, object storage, network file transfer protocols, and file systems. ... + Solve technical problems spanning the areas of orchestration, distributed systems, service modeling, API modeling, monitoring, deployment, and automation…
- Amazon (Cupertino, CA)
- …ML training workloads on AWS Trainium through deep understanding of distributed training, compilation systems, and hardware acceleration. The ideal candidate will ... have a solid understanding of AI/ML models training, distributed training architectures, and performance optimization techniques. They should be able to assess…
- LinkedIn (Mountain View, CA)
- …resolve issues in popular libraries like Huggingface, Horovod and PyTorch, enable distributed training over 100s of billions of parameter models, debug and optimize ... problems. -Designing, implementing, and optimizing the performance of large-scale distributed serving or training for personalized recommendation as well as…
- Rubrik (Palo Alto, CA)
- …and stack. At the heart of Rubrik's architecture is an open-source scalable, distributed SQL database. This is a fundamental building block for all infrastructure ... components (e.g., distributed file system) and applications (e.g., Oracle DB backup)...early career software engineer with a strong interest in distributed database technologies and cloud computing platforms and a…
- Meta (Menlo Park, CA)
- …has been integrated into PyTorch and is on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in ... GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the team's focus is…
- Panasonic Avionics Corporation (Irvine, CA)
- …IOT, Cloud or similar industry. + 10+ years of experience architecting distributed systems using Java, C++ or GoLang. + Experience implementing virtualization ... both bare metal and Cloud environments. + 10+ years of experience architecting distributed systems using Java, C++, GoLang or similar languages. + Deep technical…
- Amazon (Sunnyvale, CA)
- …Edge AI team at Amazon Devices (Lab126) where you'll architect and implement distributed training systems that scale to hundreds of billions of parameters. Your work ... versions that run on constrained edge devices. Lead the development of our distributed training platform for large language models up to 400B parameters Design…
- DoorDash (San Francisco, CA)
- …We're hiring a Data Solutions Engineer with deep expertise in distributed databases, particularly Apache Cassandra, Redis, Kafka, and database agnostic abstractions. ... In this role, you will design, optimize, and scale distributed data access layers that power DoorDash's most critical systems, ensuring high availability, low…
- General Motors (Sunnyvale, CA)
- …model training performance analysis and optimization solutions to scale distributed training workflows and maximize resource utilization across heterogeneous ... experience + 3+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models + Strong programming skills in…