- NVIDIA (Santa Clara, CA)
- …APIs for integration with NVIDIA's resiliency stacks. + Define meaningful and actionable reliability metrics to track and improve system and service reliability. ... + Skilled in problem-solving, root cause analysis, and optimization. What we need to see: + 8+ years of experience in developing software infrastructure for large scale AI systems. + Bachelor's degree or higher in Computer Science or a related… more
- Chevron Corporation (El Segundo, CA)
- …may choose to focus on other areas of interest, including: * **Reliability & Integrity Engineering** - focused on asset integrity, reliability, maintainability, ... and operability of operating assets. * **Technical Safety Engineering** - identify hazards, assess risk, and develop mitigation strategies through application of engineering design principles to a managed level of risk for operating facilities and capital… more
- Allied Universal (Santa Clara, CA)
- …your attention to detail and commitment to Allied Universal's values of agility, reliability, innovation, teamwork, and integrity will be highly valued. Join our team ... procedures. **What We're Looking For:** + Availability across various days and shifts + Reliability and ability to adapt to different post assignments + A desire to… more
- Meta (Menlo Park, CA)
- …and support the latest technology while maintaining a high level of reliability to ensure seamless operations for the executive team. 3. Maintain confidentiality ... in the pursuit of resolving complex issues while maintaining a high level of reliability. 6. Work across the industry to improve or resolve bugs with third-party… more
- Meta (Fremont, CA)
- …benchmark of innovation for the data center industry and improve reliability, safety, and efficiency through innovative solutions, enabling Infrastructure teams to ... and develop new data center cooling technologies/systems to reduce cost, improve reliability, efficiency, and speed to market. Contribute to development of data… more
- Meta (Menlo Park, CA)
- …an active participant in deep technical discussions on how to improve the reliability and efficiency of Meta's networks. 6. Collaborate with partner teams in ... 12. Work on performance measurement to ensure improved efficiency and reliability of the network. **Minimum Qualifications:** 13. Requires… more
- Meta (Menlo Park, CA)
- …products 18. Lead package development to establish package manufacturability and reliability 19. Collaborate with multi-functional teams within Meta and define ... products 24. Lead package development to establish package manufacturability and reliability 25. Collaborate with multi-functional teams within Meta and define… more
- Meta (Burlingame, CA)
- …problems, as well as complex robotics that balance ergonomics, performance, reliability, and usability while maintaining the highest safety and quality standards. ... communication hardware/firmware and systems integration while balancing ergonomics, performance, reliability 4. Research and design planning and control algorithms… more
- Meta (Fremont, CA)
- …testing, deployment, and ongoing maintenance 2. System Performance & Reliability: Monitor application performance, identify bottlenecks, and implement solutions to ... ensure high availability, reliability, and scalability 3. Security & Compliance: Ensure IAM applications adhere to internal security policies, industry best… more
- Meta (Menlo Park, CA)
- …SW stacks around NCCL and PyTorch to improve the full-stack distributed ML reliability and performance (e.g., Large-Scale GenAI/LLM training) from the trainer down to ... seeking engineers to work in the space of GenAI/LLM scaling reliability and performance. **Required Skills:** Software Engineer, SystemML - AI Networking… more