Senior DevOps Engineer - Insight Global (San Jose, CA)
Job Description
Insight Global is seeking a team of experienced, driven Senior DevOps Engineers to join an established health technology company, based in San Jose, CA or Austin, TX. This is a full-time, permanent role with competitive salary, bonus, and comprehensive benefits.
Responsibilities:
• Design and maintain multi-cloud infrastructure (AWS, Azure) with a strong emphasis on high availability, low latency, and fault tolerance for mission-critical workloads.
• Architect and optimize distributed systems to achieve horizontal scalability, performance efficiency, and operational automation across compute, storage, and networking layers.
• Develop and integrate AI-driven workflows, including OpenAI Python API modules, ensuring secure, performant, and compliant automation in production environments.
• Scale systems to handle concurrent requests through distributed coordination, Kubernetes autoscaling, and resource optimization, improving throughput and latency under heavy load.
• Deploy and manage enterprise-grade Kubernetes clusters (EKS/AKS), implementing advanced networking, multi-tenancy, autoscaling policies, and cluster lifecycle management.
• Implement Infrastructure as Code (IaC) using Terraform and GitOps tooling (ArgoCD/Flux) for consistent, auditable, and secure deployments across environments.
• Build robust CI/CD pipelines with secure artifact management, automated testing, and progressive delivery strategies (blue/green, canary deployments) to ensure safe and reliable releases.
• Optimize distributed data systems (PostgreSQL, Valkey/Redis, Kafka, Elasticsearch) for high availability, replication, performance tuning, and observability.
• Establish comprehensive observability frameworks using Prometheus, Grafana, OpenTelemetry, and ELK/EFK, enabling real-time monitoring, distributed tracing, and SLO/SLA compliance.
• Implement security best practices across cloud environments, including IAM governance, policy enforcement, and zero-trust principles, in collaboration with Security and IT teams.
• Drive reliability engineering initiatives, including capacity planning, cost optimization, disaster recovery, and incident response, ensuring operational resilience at scale.
• Produce detailed technical documentation, runbooks, and architecture diagrams to support maintainability, onboarding, and knowledge continuity.
• Apply expertise in Bash scripting and reasonable fluency in Go (Golang).
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to [email protected]. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
• Bachelor’s or Master’s degree in Computer Science, Software Engineering, Information Systems, or related fields.
• Equivalent professional experience demonstrating advanced infrastructure engineering proficiency will be fully considered.
• Professional certifications—such as CKA/CKAD, AWS Solutions Architect/DevOps Engineer, or GCP Professional Cloud Engineer—are advantageous.
• 10+ years of progressive experience in DevOps, SRE, Platform Engineering, or Infrastructure Engineering roles.
• Proven track record architecting, deploying, and operating multi-cloud, high-throughput, low-latency systems in production.
• Demonstrated expertise scaling distributed workloads to 10k–20k+ concurrent transactions.
• Experience supporting mission-critical environments with stringent availability, performance, and security requirements.
• Strong history of cross-functional collaboration with Product Engineering, Security, and IT teams.
• Expert-level proficiency with Kubernetes (EKS/GKE/AKS), including cluster internals, CNI, ingress, multi-tenancy, autoscaling constructs, and workload optimization.
• Advanced scripting/automation capability in Golang (preferred) and Python, including production use of the OpenAI Python API.
• Deep technical knowledge of distributed systems including:
o PostgreSQL (replication, tuning, HA)
o Valkey/Redis (clustered caching, persistence models)
o Kafka (partitioning, consumer group optimization, broker scaling)
o Elasticsearch (indexing strategy, sharding, resilience)
• Expertise in IaC methodologies with Terraform, Pulumi, or CloudFormation, and GitOps tooling including ArgoCD or Flux.
• Proficiency in Linux systems engineering, container runtimes, service mesh, API gateways, and cloud networking.
• Strong capability in designing high-performance, horizontally scalable compute architectures.
• Advanced knowledge of observability, distributed tracing, log correlation, and SLO/SLA instrumentation.