- NVIDIA (Santa Clara, CA)
- Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and ... internal and external facing GPU cloud services run with maximum reliability and uptime as promised to the users and...be doing: + Design, implement and support operational and reliability aspects of large scale Kubernetes clusters with focus…
- NVIDIA (Santa Clara, CA)
- …SOL quality and efficiency. The DFP team is looking for a Speed and Reliability Lead. You will be leading and crafting testability features related to Speed, Timing ... and Reliability from ground up as you help turbocharge NVIDIA's...bringup and tuning a plus, related to timing, speed, reliability and power. + Familiarity with STA timing closure,…
- NVIDIA (Santa Clara, CA)
- …NTP/PTP, DHCP, and LDAP. This includes building for performance and reliability at global scale, covering automation, monitoring, high availability, capacity ... architectures and identify opportunities for containerization to improve scalability, reliability, and efficiency. + Strong analytical skills with the ability…
- Rubrik (Sacramento, CA)
- …and services with the objective of achieving and exceeding availability and reliability goals * Manage and streamline monitoring systems to enhance observability and ... visibility * Perform Production Readiness Assessments of new services to identify reliability needs and surface potential gaps * Develop and maintain documentation…
- Huntington Ingalls Industries (San Diego, CA)
- …Action System (FRACAS) to streamline corrective action processes using Reliability, Availability, Maintainability - Cost (RAM-C). Tracks parts consumption, maintains ... documentation, and deploys technical support to troubleshoot maintenance issues. Provides reports on repair action costs, particularly for high-cost scenarios, justifying the economic feasibility of corrective actions. Contributes to technology refreshment…
- LiveRamp (San Francisco, CA)
- …issues with Engineering teams + Set up and maintain Infrastructure & Product Reliability monitoring and alerting + Maintain and enhance CI/CD Tooling and ... Terraform scripts in support of the mission in close collaboration with DevOps team + Maintain and enhance Engineering Operational Documentation for supported products. + Provide expertise to build and maintain products operational documentation and…
- Palo Alto Networks (Santa Clara, CA)
- …automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes, ... GitLab CI/CD, GitOps, Prometheus, Grafana, Loki, Docker, GCP, Backstage, MySQL, PagerDuty, FireHydrant, Python, Bash, Java, NodeJS and Go. **Your Impact** + Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments +…
- Safran (Carson, CA)
- Engineer II, RMS (Reliability, Maintainability, Safety) Company: Safran Cabin Job field: Architecture and systems engineering Location: Carson, California, ... and Failure Modes and Effects Summary per MIL-STD-1629A and D6-56674. -Prepare Engineering Reliability Parts Prediction Count Reports (ERPPC) RMS Engineer II is…
- MongoDB (San Francisco, CA)
- …to build next-generation, AI-powered applications. We are looking for an experienced Staff Engineer for our SRE, InfraSec team, to guide the security of our ... cloud-based infrastructure. As a Staff SRE, you will be very hands-on technically while also mentoring a small team of SREs. The InfraSec team collaborates closely with other engineering teams to ensure that our infrastructure adheres to the highest security…
- Insight Global (Santa Clara, CA)
- …Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains ... sophisticated internal cloud provisioning products. The team works with various other business units such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars to cater to their infrastructure & systems needs. As…