- Google (Sunnyvale, CA)
- …Learning (ML) use cases. We're part of the broader team behind Borglet, Google's node management agent. We're a strong, systems-heavy team that works closely with Borg ... cluster management software, the kernel, hardware platforms, and user applications. The Borglet team runs as a set of smaller 10-15 person sub-teams, with Offloads and ML being two of them, executing on various projects in all major areas of…
- NVIDIA (Santa Clara, CA)
- …GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. ... Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity to push the limits of the state of the art and deliver platforms the…
- NVIDIA (Santa Clara, CA)
- …new approaches for improving HPC schedulers for serving many simultaneous, large multi-node GPU workloads with many complex dependencies. This role offers you an ... excellent opportunity to deliver production-grade solutions, get hands-on with ground-breaking technology, and work closely with technical leaders solving some of the biggest challenges in machine learning, cloud computing, and system co-design. What you'll be…
- TP-Link North America, Inc. (Irvine, CA)
- …Help analyze and resolve production risks caused by insufficient resources, such as node groups, CPU, memory, HPA scheduling, JVM pre-warming, etc. + Write and ... maintain scripts for automation using languages like Python, Go, or Bash. + Assist in defining and maintaining the KPIs (SLA/SLO/SLI) for all cloud microservices with development teams to better understand the business. + Create and maintain technical…