Site Reliability Engineer
Insight Global (Woonsocket, RI)
Job Description
Insight Global is seeking two Site Reliability Engineers to drive observability, resiliency, executive-ready reporting, and Level 3+ support for our Sales & Acquisition platform. You will build and evolve the tooling, dashboards, and automation that keep our cloud-native, omnichannel ecosystem reliable, performant, and compliant. You'll partner closely with platform engineering, product, and security teams to turn telemetry into action and continuously improve customer outcomes.
Key Responsibilities:
• Implement and evolve observability standards across Sales & Acquisition services, including metrics, logs, traces, alerting, dashboards, and SLOs/SLIs.
• Build reliability reporting and scorecards (uptime, latency, error budgets, MTTR) that provide actionable insights for engineering and leadership.
• Contribute to resiliency engineering: define/validate fault domains, design failover patterns, run chaos/failover tests, and close gaps through prioritized backlog items.
• Serve as an L3+ escalation engineer for all incidents: participate in on-call rotations, coordinate with responders, and drive root cause analysis and corrective actions.
• Improve change readiness by adding guardrails and automated health checks to CI/CD pipelines (pre- and post-deployment validations, canary/blue-green readiness, automatic rollback signals).
• Monitor capacity, performance, and cost; recommend and implement optimizations for scale, efficiency, and spend stewardship.
• Author runbooks, SLO/SLA definitions, alerting standards, and best-practice guides; champion adoption across squads through hands-on pairing and enablement.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to [email protected]. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
• 5+ years in application development and/or platform engineering, including 2+ years focused on SRE, observability, or production operations.
• Hands-on experience implementing observability frameworks (metrics/logs/traces), incident/problem/change management, and post-incident reviews.
• Strong understanding of resiliency engineering (fault isolation, timeouts/retries, bulkheads, circuit breakers, disaster recovery patterns).
• Ability to translate technical telemetry into clear, business-aligned insights and recommended actions.
• Proficiency with modern CI/CD, infrastructure as code, automated testing, and progressive delivery concepts.
• Excellent communication skills; comfortable collaborating across engineering, product, security, and leadership stakeholders.
• Experience in healthcare or health insurance (e.g., Medicare, Medicaid).