- Microsoft Corporation (Redmond, WA)
- …+ Hands-on experience with distributed training frameworks (Ray, Slurm, HPC), containerization and orchestration technologies (Docker, Kubernetes) for ML model ... deployment, and ML lifecycle management in production environments. + Experience designing evaluation frameworks for LLM-based applications and implementing observability for agent systems using tools such as Phoenix, MLFlow, LangFuse, or custom eval…
- Amazon (Seattle, WA)
- …cloud offerings that enable high performance and scalability in AI/ML and HPC workloads. Utility Computing (UC) AWS Utility Computing (UC) provides product ... innovations - from foundational services such as Amazon's Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2), to consistently released new product innovations that continue to set AWS's services and features apart in the industry. As a member…
- Amazon (Seattle, WA)
- …AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads. You are intrigued by the continuous release of newer AWS services ... and instance types that solve newer, bigger and more interesting business problems every day? Does that make you wish your talents were applied to those at cloud scale? If yes, then come join us - we are looking for builders like you. The AWS Hardware…
- Pacific Northwest National Laboratory (Richland, WA)
- …+ Strong publication record in advanced architectures, design automation, co-design, HPC, quantum computing, AI, microelectronics, or other related area + ... Demonstrated success securing competitive research funding (DOE, DoD, NSF, or industry) + Experience leading large, interdisciplinary R&D efforts and managing multi-million-dollar portfolios + Proven ability to mentor and develop staff, including supporting…
- Microsoft Corporation (Redmond, WA)
- …performance analysis and optimization of state-of-the-art LLMs, HPC applications including proficiency using GPU profiling tools Cross-team collaboration skills ... and the desire to collaborate in a team of researchers and developers + Ability to independently lead projects Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:…
- Microsoft Corporation (Redmond, WA)
- …hardware development lifecycle. + Proficient understanding of state of the art of AI/HPC physical infrastructure. + Ability to analyze solutions from a full TCO ... perspective, including but not limited to: CAPEX, OPEX, the system constraints that drive design tradeoffs, technology, quality and serviceability. + Deep and broad knowledge of current and emerging copper- and optical-based interconnect technologies. +…
- Amazon (Seattle, WA)
- …cross-functional strategic capacity modeling to support emerging workloads (such as AI/ML and HPC) and future region expansions will be a crucial part of your role. ... Additionally, you'll drive long-term innovation in planning frameworks and digital planning tools to support dynamic business shifts. Enterprise Level: Your responsibilities include delivering enterprise-wide SIOP transformations that break down silos between…
- Microsoft Corporation (Redmond, WA)
- …+ Hands-on experience with distributed training frameworks (Ray, Slurm, HPC), containerization and orchestration technologies (Docker, Kubernetes) for ML model ... deployment, and ML lifecycle management in production environments + Experience designing evaluation frameworks for LLM-based applications and implementing observability for agent systems using tools such as Phoenix, MLFlow, LangFuse, or custom eval harnesses;…
- Meta (Bellevue, WA)
- …10. Experience in leading teams working on high performance computing (HPC) and AI/ML systems, including: GPU/ASIC-based kernel development and optimization (e.g. ... CUDA, ROCm), distributed systems for large scale training and serving, and systems architecture and performance 11. Accelerator (GPU/ASIC) kernel development and optimization 12. Experience in accelerating libraries on AI hardware, similar to cuBLAS, cuDNN,…
- Oracle (Olympia, WA)
- …(OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads. Join us to design systems that scale from tens to ... hundreds of thousands of GPUs without sacrificing performance. Our team develops and tunes the software and hardware stack for distributed workloads using libraries such as NCCL on high-speed networks. Strong knowledge and practical experience with NCCL is…