Senior Systems Engineer responsible for HPC cluster management and optimization at Rackspace. The role collaborates with scientists and provides technical support for high-performance computing.
Responsibilities
Install, configure, and maintain HPC clusters (hardware, software, operating systems).
Perform regular updates and patching, and manage user accounts and permissions.
Troubleshoot and resolve hardware and software issues.
Monitor and analyze system and application performance, identify bottlenecks, and implement tuning solutions.
Manage job scheduling and resource allocation with schedulers such as Slurm and LSF, and administer clusters with cluster management tools such as Bright Cluster Manager, OpenHPC, and Warewulf.
Configure Linux networking (TCP/IP, DNS, routing) and HPC interconnects (InfiniBand, Ethernet).
Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensuring data integrity and managing backups.
Implement security controls and manage authentication services like LDAP and Active Directory.
Automate deployments and system configurations using tools like Ansible, Terraform, Jenkins, and Git.
Provide technical support, documentation, and training to researchers, and collaborate with scientists and HPC architects.
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for a degree).
Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC.
Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning.
Experience building and managing RPM and DEB packages.
Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf.
Proficiency with job schedulers and resource managers such as Slurm and LSF.
Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet), including performance tuning.
Knowledge of parallel file systems such as Lustre, Ceph, or GPFS.
Working knowledge of Linux authentication and directory services such as LDAP and Active Directory.
Proficiency in scripting languages (e.g., Python, Bash, R); familiarity with MPI libraries for parallel and distributed computing is nice to have.
Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git.
Knowledge of HPC in cloud environments (e.g., AWS, Azure, GCP HPC offerings) is a plus.
Strong knowledge of Linux security, compliance standards, and data protection best practices.
Excellent communication, interpersonal, and problem-solving skills.
Automation & Systems Engineer on the N2P Messaging Team handling the AWS environment and Jenkins pipelines for Net2Phone. Responsibilities include cloud networking, infrastructure as code, and systems maintenance.
Staff Collaboration Systems Engineer managing Google Workspace for Pinterest. Leading technical authority and mentoring in collaboration platform solutions.
Contract Systems Analyst at Sunshine Enterprise managing PeopleSoft Financials and HCM systems. Implementing, upgrading, and supporting enterprise applications, ensuring data integrity and business operations continuity.
Structural Systems Engineer specializing in structural analysis of aerospace vehicle pressurized systems. The role involves design, development, and execution of test programs for launch and space structures.
Systems Engineer at Quevera collaborating with experts to deliver innovative solutions. Join our dynamic team recognized as a top employer in the Baltimore/DC area.
Staff Systems Engineer working on delivering complex software applications into operations with a talented team at CACI. Supporting development and verification of mission capabilities while ensuring operational efficiency.
Senior Systems Engineer supporting mission-critical software and AI/ML product development. Collaborating within an Agile team to transition complex systems to operational use.
IT Support Specialist ensuring installation, support, and maintenance of IT systems in healthcare settings. Focusing on efficiency, stability, and customer service with a team-oriented approach.
RF Systems Engineer III developing spacecraft communication systems for civil, commercial, and National Security Space programs. Collaborating with cross-functional teams to enhance RF communications technology.
Systems Engineer supporting deployment and operational reliability in a cloud-based healthcare platform. Collaborate with engineering and QA teams to manage cloud environments and troubleshoot issues.