- Amazon (Bellevue, WA)
- …post-training multimodal LLMs. - Scale training of models on hyper-large GPU and AWS Trainium clusters - Optimize training workflows using distributed training/parallelism techniques - Optimize low-level details of the training stack, including CUDA kernels, communication collectives, and network I/O. - Utilize, build and extend upon industry-leading frameworks (NeMo, Megatron Core, PyTorch, JAX, vLLM, TRT, etc.)…
- Amazon (Seattle, WA)
- …as data, tensor, model, and pipeline parallelism. - Monitor and optimize GPU memory and throughput for training large models efficiently. - Collaborate cross-functionally with research and data infra teams to integrate new models and features - Deep understanding of LLM algorithms and deep learning frameworks like PyTorch - Mathematics and Statistics: Strong understanding of linear algebra, calculus, probability,…
- Amazon (Seattle, WA)
- …multiple projects written in C, our team enables customers to network thousands of GPU and CPU instance types to handle the toughest clustered workloads. Be a part of a dynamic, fast-paced group that has a big impact every day on the hottest companies doing AI and HPC today. Key job responsibilities: You will write the highest-performing code in C for multiple open source projects supporting EFA, such as Libfabric and…
- Meta (Bellevue, WA)
- …best enterprise modern parallel environments: distributed clusters, multicore SMP, or GPU. 9. Developing highly scalable classifiers and tools leveraging machine learning, statistics, regression, rules-based models, or mathematical models. 10. Java, C++, Perl, PHP, or Python. **Public Compensation:** $184,695/year to $200,200/year + bonus + equity + benefits **Industry:** Internet **Equal Opportunity:** Meta is proud…
- NVIDIA (Seattle, WA)
- NVIDIA's invention of the GPU fueled the PC gaming market. The company's groundbreaking work in accelerated computing, a supercharged form of computing at the intersection of computer graphics, high performance computing and AI, is reshaping industries, such as transportation, healthcare and manufacturing, and fueling the growth of others. In 2020, NVIDIA acquired Mellanox, a leading supplier of end-to-end Ethernet…
- Amazon (Seattle, WA)
- …large models, working with PyTorch and/or TensorFlow using large distributed fleets of GPU or other accelerated systems. - Experience with Linux distributions such as Ubuntu or CentOS, kernel development, and tooling such as perf and gdb. - Experience with performance profiling, tracing, and analysis of AI training/inference applications. - Experience with large-scale, distributed AI training/inference applications,…
- Microsoft Corporation (Redmond, WA)
- …continual pre-training, large-scale deep reinforcement learning running on extensive GPU resources, and significant efforts to curate and synthesize training data. In addition, the team employs various fine-tuning approaches to support both research and product development. The team also develops advanced AI technologies that integrate language and multi-modality for a range of Microsoft products. The team is…
- Meta (Redmond, WA)
- …to best exploit modern parallel environments (e.g., distributed clusters, multicore SMP, and GPU) 5. Work with a large and globally distributed team 6. Contribute to publications and open-sourcing efforts **Minimum Qualifications:** 7. Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience 8. Research experience in machine learning,…
- Meta (Bellevue, WA)
- …best enterprise modern parallel environments: distributed clusters, multicore SMP, or GPU. 10. Developing highly scalable classifiers and tools leveraging machine learning, statistics, regression, rules-based models, or mathematical models; and 11. Java, C++, Perl, PHP, or Python. **Public Compensation:** $186,437/year to $200,200/year + bonus + equity + benefits **Industry:** Internet **Equal Opportunity:**…
- Amazon (Seattle, WA)
- …equivalent - Experience with Large Language Model inference - Experience with GPU programming (e.g., TensorRT-LLM) or Amazon AI chip programming (Trainium) - Experience with Python, PyTorch, and C++ programming, particularly performance optimization. Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. Our inclusive culture…