Alerted.org

  • Sr. HPC Systems Engineer (IT@JH Research…

    Johns Hopkins University (Baltimore, MD)




    IT@JH Research Computing is seeking a **_Sr. HPC Systems Engineer_** who will design, build, and maintain advanced high-performance computing environments supporting Johns Hopkins University’s research mission. This position focuses on the reliable operation, configuration, and optimization of HPC and AI systems, including multi-node CPU and GPU clusters, high-speed InfiniBand and Ethernet networks, and large-scale parallel and object storage. The engineer implements and automates secure, efficient, and reproducible computing platforms used by faculty, researchers, and students across diverse scientific disciplines. Assignments include both ticket-based support and project-based deployments. The role operates with moderate independence, collaborating closely with the IT Architect for Research Computing and reporting to the IT Manager for Research Computing, to ensure scalable, sustainable, and high-performance systems that enable cutting-edge scientific discovery.

    Specific Duties & Responsibilities

    + Support and administer production systems used by researchers and Research Centers.

    + Provide technical leadership/project management for system configuration, implementation, management, and user support for both new and existing systems.

    + Research and recommend new functionality for HPC management and administration tools by exploring system-wide impacts and working with functional users to define current and future processes.

    + Apply expertise in architecting, operating, and debugging large-scale HPC network and storage infrastructure, including MPI, NCCL, RDMA, InfiniBand, and parallel file systems.

    + Work with scientific support specialists to assign tasks and provide oversight, as appropriate, to the HPC engineering team in support of scientific researchers who use a broad spectrum of applications from diverse fields.

    + Analyze results of server monitoring and implement changes to improve performance, processing, and utilization.

    + Propose, maintain, and enforce policies, practices, and security procedures.

    + Provide break/fix support, setup/installation support, escalation support, and solutions support.

    + Collaborate closely with a variety of stakeholders, both internal and external, on all aspects of projects.

    + Other duties as assigned.

    _In Addition to the Duties Described Above_

    + Deploy, configure, and maintain large-scale Linux-based HPC clusters comprising CPU and GPU nodes, high-speed interconnects, and parallel file systems.

    + Implement and optimize workload schedulers (Slurm) and job submission policies to maximize system throughput and fair-share usage.

    + Administer and monitor distributed storage systems (GPFS, Lustre, WekaFS, Ceph, MinIO) to ensure reliability and performance across multi-petabyte environments.

    + Maintain high-speed fabric and network infrastructure (InfiniBand, Ethernet) to support low-latency data transfer and MPI workloads.

    + Support research groups in deploying, testing, and optimizing scientific applications and AI/ML workflows on shared computing resources.

    + Develop and maintain automation and monitoring frameworks for system provisioning, metrics collection, and alerting (Prometheus, Grafana, ELK).

    + Participate in capacity planning, hardware lifecycle management, and evaluation of new technologies in collaboration with architects and management.

    + Ensure security and compliance through configuration hardening, patch management, and integration with campus identity and access control systems.

    + Document system designs, procedures, and troubleshooting guides to support knowledge transfer and team continuity.

    + Contribute to a collaborative engineering culture that emphasizes service quality, innovation, and continuous improvement in research computing operations.

    + Participate in on-call rotation to ensure high availability and timely response to system alerts.

    Minimum Qualifications

    + Bachelor’s Degree.

    + Six years related experience.

    + Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.

    Preferred Qualifications

    + Eight or more years of experience in high-performance computing systems administration or engineering, including experience with cluster management, workload scheduling (e.g., Slurm), and distributed or parallel storage.

    + Deep proficiency in Linux systems administration, configuration management (Ansible, Puppet, or Salt), performance monitoring, and tuning for HPC workloads.

    + Experience with high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and parallel file systems (e.g., GPFS, Lustre, BeeGFS, or WekaFS).

    + Working knowledge of containerization and orchestration (Singularity, Docker, Kubernetes for HPC).

    + Ability to automate deployments and routine operations through scripting (Bash, Python).

    + Familiarity with data-center operations, GPU acceleration, and research software environments (e.g., CUDA, MPI, AI/ML frameworks).

    + Strong analytical and troubleshooting skills, with proven ability to support complex research workloads in multi-user, multi-tenant environments.

    + Experience collaborating with faculty and research groups to translate scientific requirements into practical and performant computing solutions.

    Technical Skills & Expected Level of Proficiency

    + Automation - Authority

    + Cloud Infrastructure - Authority

    + Cloud Migration - Authority

    + Cloud Security - Authority

    + Cloud Strategy - Authority

    + Job Scheduling Systems - Authority

    + Operating Software - Authority

    + Scripting - Authority

    + Software Development Life Cycle - Authority

    + Systems Architecture - Authority

    + Systems Analysis - Authority

    + Systems Configuration - Authority

    + Systems Design - Authority

    + Systems Development - Authority

    + Systems Engineering - Authority

    + Systems Integration - Authority

     

    Classified Title: Sr. HPC Systems Engineer

    Job Posting Title (Working Title): Sr. HPC Systems Engineer (IT@JH Research Computing)

    Role/Level/Range: ATP/04/PF

    Starting Salary Range: $85,500 - $149,800 Annually (Commensurate w/exp.)

    Employee group: Full Time

    Schedule: Mon-Fri, 8:30am-5pm

    FLSA Status: Exempt

    Department name: IT@JH Research Computing

    Personnel area: University Administration

    Equal Opportunity Employer

    All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.

     





© 2025 Alerted.org