- NVIDIA (Santa Clara, CA)
- …systems by pushing for changes that improve reliability and velocity + Practice sustainable incident response and blameless postmortems + Be part of an on-call ... with a focus on performance at scale, real-time monitoring, logging and alerting + Engage in and improve ... Maintain services once they are live by measuring and monitoring availability, latency and overall system health + Scale… more
- NVIDIA (Santa Clara, CA)
- …systems by pushing for changes that improve reliability and velocity + Practice sustainable incident response and blameless postmortems + Be part of an on-call ... clusters with a focus on performance at scale, real-time monitoring, logging and alerting + Engage in and improve ... Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale… more
- Coinbase (Sacramento, CA)
- …* Collaborate with Coinbase product teams to reduce service disruptions and automate incident response * Proactively find and analyze reliability problems across ... of handling high throughput and low latency * Experience with observability and monitoring systems such as Kibana, Datadog, etc. * Familiarity with working in rapid… more
- Coinbase (Sacramento, CA)
- …configurations and maintain state using configuration management tools * Facilitate incident response, conduct root cause analysis, and blameless retrospectives ... * Define metrics and bolster monitoring/observability across corporate IAM systems * Participate in regular on-call rotation to ensure 24x7 uptime for critical… more
- Google (Sunnyvale, CA)
- …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... launch reviews. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
- Google (Mountain View, CA)
- …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... reviews. + Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
- Banc of California (Santa Ana, CA)
- …remediation of same. + Establishes and maintains Security Operations team triage and incident response playbooks to protect and recover information assets from ... as assigned. **WHAT YOU'LL BRING** + Demonstrates knowledge of, adherence to, monitoring and responsibility for compliance with state and federal regulations and… more
- Google (Sunnyvale, CA)
- …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... reviews. + Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
- LinkedIn (Mountain View, CA)
- …direction across orgs, and contributing deeply to culture, hiring, and technical excellence. Lead incident response and post-incident reviews to identify root ... focused engineering, or distributed systems. Preferred Qualifications: Hands-on experience with large-scale incident response, root cause analysis, and resiliency… more
- Stanford Health Care (Palo Alto, CA)
- …Safety Consultant is responsible for the management of the hospital's incident/event reporting system, including staff and manager education, and for conducting ... to ensure timely reporting of and follow-up/corrective action is taken in response to such events/incidents in accordance with the requirements of hospital policies,… more