Site Reliability Engineer
- Insight Global (Plano, TX)
Job Description
-Contribute to developing and implementing policies, tools, and initiatives that improve platform health and developer productivity.
-Infrastructure monitoring: measure, analyze, regularly assess and improve the reliability of core infrastructure components (networking equipment, compute, databases, caching layers) with emphasis on redundancy, fault tolerance, and scalable failover strategies.
-Participate in setting service level objectives (SLOs) and RPO/RTO targets; implement the capabilities (such as backup/restore procedures) needed to meet them; develop and conduct regular exercises to validate recovery procedures.
-Ensure robust backup/restore procedures: perform regular backup validation, and protect critical data across regions and environments.
-Forecast growth and model failure domains; ensure capacity buffers and scalable architectures that avoid single points of failure and tolerate component failures.
-Maintain and improve the reliability, availability, and performance of production services, with a focus on reducing incident frequency and recovery/restoration time.
-Design, implement, and operate monitoring, alerting, logging, and tracing solutions to provide end-to-end visibility of systems and dependencies.
-Respond to and resolve production incidents, participate in post-incident reviews, and help implement corrective actions.
-Build and maintain runbooks, standard operating procedures, and automation to reduce toil in common operations tasks.
-Collaborate with software engineers to optimize code for reliability, scalability, and resilience; assist with capacity planning and performance tuning.
-Implement modern CI/CD pipelines and deployment strategies, including blue/green and canary releases, that ensure safe software rollouts.
-Manage infrastructure as code to provision, scale, and maintain cloud environments.
-Enforce security and compliance best practices in the production environment, including access controls and secrets management.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to [email protected]. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
-7 years of experience in DevOps or a related field.
-Strong Linux/Unix administration skills and proficiency in at least one scripting language (e.g., Python, Bash).
-Experience with cloud platforms (AWS/Azure/GCP).
-Experience with containerization (Docker) and container orchestration (Kubernetes).
-Experience with monitoring/observability tools (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
-Solid understanding of incident management processes, on-call practices, and post-mortem analysis.
-Knowledge of CI/CD concepts and tooling (e.g., Jenkins, GitHub Actions, GitLab CI) and automation scripting.
-Strong problem-solving, debugging, and communication skills; ability to work in a collaborative cross-functional environment.
-Observability experience (logging and monitoring).
-Any background with AI tools.
-Bachelor’s degree in Information Technology, Computer Science, or a related field (or equivalent practical experience).
-Possession of IT service management certifications (ITIL Foundation or equivalent).
-Possession of government security clearances or experience in regulated environments is preferred.