  • Senior Site Reliability Engineer…

    NVIDIA (Santa Clara, CA)
    …systems by pushing for changes that improve reliability and velocity + Practice sustainable incident response and blameless postmortems + Be part of an on-call ... with a focus on performance at scale, real-time monitoring, logging and alerting + Engage in and improve ... Maintain services once they are live by measuring and monitoring availability, latency and overall system health + Scale… more
    NVIDIA (08/02/25)
  • Senior Site Reliability Engineer - DGX…

    NVIDIA (Santa Clara, CA)
    …systems by pushing for changes that improve reliability and velocity + Practice sustainable incident response and blameless postmortems + Be part of an on-call ... clusters with focus on performance at scale, real-time monitoring, logging and alerting + Engage in and improve ... Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale… more
    NVIDIA (08/01/25)
  • Senior Site Reliability Engineer…

    Coinbase (Sacramento, CA)
    …* Collaborate with Coinbase product teams to reduce service disruptions and automate incident response * Proactively find and analyze reliability problems across ... of handling high throughput and low latency * Experience with observability and monitoring systems such as Kibana, Datadog, etc. * Familiarity with working in rapid… more
    Coinbase (08/09/25)
  • Senior Site Reliability Engineer - Identity…

    Coinbase (Sacramento, CA)
    …configurations and maintain state using configuration management tools * Facilitate incident response, conduct root cause analysis, and blameless retrospectives ... * Define metrics and bolster monitoring/observability across corporate IAM systems * Participate in regular on-call rotation to ensure 24x7 uptime for critical… more
    Coinbase (08/09/25)
  • Senior Staff Software Engineer, SRE, Core…

    Google (Sunnyvale, CA)
    …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... launch reviews. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
    Google (08/08/25)
  • Senior Software Engineer, Site Reliability…

    Google (Mountain View, CA)
    …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... reviews. + Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
    Google (08/08/25)
  • Officer, Senior Information Security…

    Banc of California (Santa Ana, CA)
    …remediation of same. + Establishes and maintains Security Operations team triage and incident response playbooks to protect and recover information assets from ... as assigned. **WHAT YOU'LL BRING** + Demonstrates knowledge of, adherence to, monitoring and responsibility for compliance with state and federal regulations and… more
    Banc of California (07/16/25)
  • Senior Software Developer, Site Reliability…

    Google (Sunnyvale, CA)
    …by pushing for changes that improve reliability and velocity. + Practice sustainable incident response and blameless postmortems. Google is proud to be an ... reviews. + Maintain services once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through… more
    Google (06/21/25)
  • Distinguished Software Engineer, Reliability Infra

    LinkedIn (Mountain View, CA)
    …direction across orgs, and contributing deeply to culture, hiring, and technical excellence. Lead incident response and post-incident reviews to identify root ... focused engineering, or distributed systems. Preferred Qualifications: Hands-on experience with large-scale incident response, root cause analysis, and resiliency… more
    LinkedIn (06/04/25)
  • Patient Safety Consultant (RN)

    Stanford Health Care (Palo Alto, CA)
    …Safety Consultant is responsible for the management of the hospital's incident/event reporting system, including staff and manager education, and for conducting ... to ensure timely reporting of and follow-up/corrective action is taken in response to such events/incidents in accordance with the requirements of hospital policies,… more
    Stanford Health Care (07/12/25)