Senior HPC and AI Cluster Administrator at NVIDIA specializing in high-performance computing infrastructure. Responsible for deploying and managing AI clusters while supporting R&D initiatives.
Responsibilities
Deploy, manage, and maintain large-scale HPC/AI clusters
Manage Linux job/workload schedulers and orchestration tools
Support and maintain continuous integration and delivery pipelines
Troubleshoot and fix issues bottom-up, from bare metal through the operating system and software stack to the application level
Support Research & Development activities and engage in POCs/POVs for future improvements
Requirements
Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
5+ years of experience
Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP, and DNS
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS
Familiarity with newer and emerging storage technologies.
Python programming and Bash scripting experience, plus experience with automation and configuration management tools and practices such as Jenkins, Ansible, and GitOps
Knowledge of networking technologies such as InfiniBand and Ethernet
Experience with virtual systems (for example VMware, Hyper-V, KVM)
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
Benefits
Highly competitive salaries
An extensive benefits package
A work environment that promotes diversity, inclusion, and flexibility
Applications, Data and AI Sales Expert at Kyndryl focusing on engaging clients and driving sales growth. Seeking experienced sales professionals in the technical domain with a focus on AI solutions.
Building automations and using AI to enhance operational efficiency at Aurixus GmbH. Engaging in hands-on projects to connect tools and streamline processes for better productivity.
Founder Associate role collaborating with founders to build AI tools for construction. Involvement in strategy, fundraising, and process optimization with high responsibility and impact.
Join Sally Beauty as an IT PMO & AI Solutions Intern, building AI-driven project management tools to enhance IT portfolio efficiency and collaboration.
Software Development Engineer for oneDNN project at Intel, focusing on AI performance in various frameworks. Responsible for design, development, and optimization of AI workloads.
Generative AI Analyst developing prompts and datasets for machine learning models. Collaborates with teams on labeling initiatives and LLM training best practices.
Head of AI developing and scaling AI solutions and sales across EMEA for Nortal. Leading client engagement and partnerships to drive market growth in AI.
Research Engineer AI developing modern applications using AI technologies for business solutions. Collaborating in a motivated team to advance AI vision with clean and maintainable code in Python.
Senior Advanced AI Engineer at Honeywell focusing on AI-driven solutions for smart buildings and industrial automation. Collaborating cross-functionally and mentoring junior engineers.
AI Delivery Lead overseeing integration of AI platform into customer environments with a focus on delivering measurable results. Collaborating with cross-functional teams to drive strategic outcomes that matter.