Engineering Manager, Observability Platform
- NVIDIA (Santa Clara, CA)
At NVIDIA, we pride ourselves on data-driven decision-making, and the data science platform team is at the heart of this initiative. NVIDIA runs some of the most demanding AI, data, and platform workloads on the planet, and none of it works without a reliable, high-scale observability foundation. We’re hiring an Engineering Manager to lead the team that builds and operates NVIDIA’s global observability platform: the system that carries every metric, log, trace, profile, and event our engineers rely on to understand and debug their services. This isn’t a traditional people-manager role. You’ll stay close to the technology, guide architecture decisions, review designs and code, and help the team solve real distributed-systems challenges. You’ll work with engineers to shape how services instrument themselves, how we ingest and store high-cardinality telemetry, and how observability fits cleanly into NVIDIA’s broader platform ecosystem.
You’ll partner directly with platform, infrastructure, and application teams to evolve how telemetry flows across metrics, logs, traces, profiling, and events. You’ll coach and mentor engineers, build strong technical habits, and drive a roadmap that keeps the platform reliable and ready for NVIDIA’s rapid growth. If you enjoy deep technical work, high-throughput pipelines, open-source observability stacks, and helping engineers do the best work of their careers, this role is built for you.
What you’ll be doing:
+ Leading a team of engineers who design and build the core services, pipelines, and storage layers behind NVIDIA’s observability platform.
+ Creating a clear technical direction for the team and supporting work that emphasizes simplicity, performance, and maintainability.
+ Defining the architecture for distributed ingestion services, time-series storage, log and trace pipelines, query paths, and multi-region data flows.
+ Partnering with platform, infrastructure, and application teams to define data models, instrumentation patterns, APIs, and integration standards.
+ Strengthening engineering practices through better tooling, automated tests, schema management, API versioning, documentation, and safe rollout processes.
+ Helping engineers solve distributed-systems issues including ingestion load, indexing pressure, compaction behavior, query fan-out, and replication patterns.
+ Driving predictable execution through clear priorities, collaborative planning, and strong alignment across teams.
+ Representing the observability platform across NVIDIA, gathering feedback, and evolving the system to support future AI workloads.
What we need to see:
+ Bachelor’s or Master’s degree in Computer Science or a related technical field (or equivalent experience).
+ 8+ years overall building distributed systems, with a focus on observability and monitoring systems, and 3+ years managing or leading engineers.
+ Experience with modern observability stacks such as Prometheus, Thanos, Mimir, Loki, OpenSearch, Jaeger, Tempo, or OpenTelemetry, or equivalent experience.
+ Strong foundations in distributed systems concepts including replication, sharding, durability, consensus, and performance tuning.
+ Hands-on experience designing or scaling ingestion pipelines, time-series engines, trace backends, or log indexing systems, especially in high-cardinality environments.
+ Ability to read and review Go or Python code and support engineers through technical decision-making.
+ Clear architectural thinking with a focus on stable APIs, predictable performance, and long-term evolution.
+ Experience mentoring engineers, improving technical judgment, and contributing to a healthy and inclusive engineering culture.
+ Strong communication skills and the ability to explain complex challenges with clarity.
Ways to stand out from the crowd:
+ Experience building or contributing to an observability or telemetry platform used at significant scale.
+ Contributions to open-source projects such as OpenTelemetry, Prometheus, Loki, Thanos, Tempo, Jaeger, ClickHouse, Mimir, or Elasticsearch.
+ Experience with high-throughput systems like Kafka, Flink, Spark, or large-scale data collectors.
+ Deep knowledge of cardinality management, query performance, storage design, or retention optimization.
+ Experience designing multi-region architectures with a focus on consistency, availability, and data locality.
NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional people like you to help us accelerate the next wave of artificial intelligence.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 224,000 USD - 356,500 USD.
You will also be eligible for equity and benefits (https://www.nvidia.com/en-us/benefits/).
Applications for this job will be accepted at least until January 11, 2026.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.