AI Infrastructure Engineer managing high-performance AI infrastructure for large-scale GPU clusters at 42dot. Contributing to scaling, monitoring, and operational optimization of computing environments.
Responsibilities
Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm.
Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery.
Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency.
Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources.
Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.
Requirements
Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.
Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.
Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.
Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.
Logical approach to problem-solving with the persistence to identify and resolve root causes in complex, large-scale systems.
Strong communication skills to effectively collaborate with cross-functional teams and external partners.
Experience in building observability stacks with Prometheus, Grafana, and Datadog for large-scale clusters.
Experience in building or operating infrastructure on public cloud platforms such as AWS or GCP.
Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.
Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.
Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.
Familiarity with Infrastructure as Code (IaC) tools such as Terraform to manage complex infrastructure.
Benefits
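When submitting your resume, please leave out any information that the Fair Hiring Procedure Act prohibits employers from requesting, such as resident registration number, family relations, marital status, salary, photo, physical characteristics, and place of origin.
Please upload all files as PDFs of 30MB or less. (If a problem occurs while uploading your resume, please send it to [email protected] together with the URL of the position you are applying for.)
After the interview process ends, a reference check may be conducted with the applicant's consent.
Persons eligible for national veterans' benefits and employment protection receive preferential treatment in accordance with applicable laws.
Holders of a disability registration certificate receive preferential treatment in accordance with the Act on the Employment Promotion and Vocational Rehabilitation of Persons with Disabilities.
42dot does not accept resumes from search firms it has not engaged and does not pay fees for unsolicited resumes.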