- Cadence Design Systems, Inc. (San Jose, CA)
- …platform and processes to improve operations. Key Responsibilities: + Implement monitoring framework to improve infrastructure reliability, observability, and ... alerts. + Identify and implement automation opportunities to reduce manual work and accelerate delivery. + Drive technical decisions on architecture, automation, and tooling. + Develop processes to track and scale key metrics for reliability,…
- Walmart (Sunnyvale, CA)
- …of the ML lifecycle: data sourcing, feature engineering, model training, deployment, monitoring, and continuous improvement. * Apply MLOps best practices such as CI/CD ... for ML, automated training pipelines, model versioning, and telemetry-based monitoring. * Implement robust evaluation frameworks for model performance, data quality,…
- Palo Alto Networks (Santa Clara, CA)
- …robust and performant. This includes automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform ... and automation frameworks**, championing **Infrastructure as Code (IaC)** and **Monitoring as Code (MaC)** principles. + **Automate robust deployments** and…
- Oracle (Sacramento, CA)
- …networking protocols, data center designs, infrastructure as a service, network monitoring and network automation. **Responsibilities** As a Senior Principal AI ... agents, and inference systems into the software stack for designing, monitoring, troubleshooting and deploying networks. + Evaluate, Integrate, and Optimize…
- NVIDIA (Santa Clara, CA)
- …building for performance and reliability at global scale, covering automation, monitoring, high availability, capacity planning, and lifecycle management. + Define ... optimizations (SR-IOV/DPU) + Experience with technologies like eBPF and XDP for observability and DDoS mitigation + Collect and review system data for capacity and…
- General Motors (Sunnyvale, CA)
- …reliability or stability regressions. + **Integrate data pipelines** for continuous monitoring of release health, including automated collection of test, simulation, ... or equivalent). + Prior experience implementing **ELT/ETL pipelines** for quality monitoring, reliability, or release metrics. + Solid understanding of **system…
- Coinbase (Sacramento, CA)
- …* Lead end-to-end delivery of projects through implementation, deployment, and monitoring * Improve and maintain operational excellence standards across the team, ... proactively addressing technical debt and driving improvements in reliability and observability * Participate in code reviews and on-call rotation, lead incident…
- VetsEZ (CA)
- …while fostering a culture of experimentation and delivery excellence. + Observability and Reliability: Implement monitoring, logging, and automated alerting ... (e.g., CloudWatch, Datadog, Prometheus) to ensure system reliability and traceability of AI workflows. + Governance and Compliance: Ensure all AI-enabled components meet HIPAA, VA, and NIST security requirements, aligning with enterprise healthcare standards. +…
- Oracle (Sacramento, CA)
- …media tools). + Ensure services are built for scale, availability, observability, performance, and security, optimized for graphics and rendering pipelines. + ... workflows. + Drive operational excellence for GPU-powered services, including performance monitoring, failure analysis, and workload optimization. + Stay ahead of…