HPC Architect designing high-performance computing solutions at Applied Materials. Focused on optimizing compute, storage and networking for semiconductor manufacturing processes.
Responsibilities
architect high-performance computing solutions from scratch
design and optimize all aspects (compute, memory, networking, storage) for a lower cost of ownership
design HPC infrastructure solutions, including compute, networking, storage, and workload management components
work closely with cross-functional teams, including hardware, software, product management, and business stakeholders
create and maintain detailed system architecture diagrams and specifications
evaluate and select appropriate hardware and software components for HPC environments
install, configure, and maintain HPC systems, including hardware, software, and networking components
develop and implement automation scripts for system management and deployment
act as the subject matter expert in the HPC domain to unblock dependent teams
develop system benchmarks, profile systems to understand bottlenecks, and optimize workflows and processes to improve cost of ownership
identify and mitigate technical risks and issues throughout the HPC development life cycle
ensure that the compute cluster is resilient, reliable, and maintainable
stay abreast of the latest HPC technologies, including hardware, software, and networking solutions
focus on understanding the compute workload and designing an HPC cluster with the right combination of nodes, CPU/GPU, memory, interconnects, and storage to achieve optimum performance at minimum cost of ownership
Requirements
In-depth experience with Linux system administration and hardware/software configuration
Strong knowledge of HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, RoCE), and parallel filesystems (Lustre, GPFS, BeeGFS, etc.)
Experience in creating and maintaining operating system images with different installation and boot schemes
Strong proficiency with automation tools such as Ansible, Chef, and SaltStack, and with scripting languages (Python and Bash)
Experience in creating and maintaining storage solutions with different RAID configurations
Ability to design storage solutions for different IOPS requirements and access patterns (random vs. sequential read/write), and to tune storage and filesystems for better performance
Good knowledge of networking concepts, including IP addressing, routing, protocols, switch configuration for RDMA, VLAN configuration, network bonding, etc.
Good knowledge of virtualization and of hardware and software hypervisors
Good knowledge of containerization technologies such as Docker and Singularity
Experience in software-defined networking and storage
Experience in setting up remote management protocols such as IPMI, Redfish, etc.
Experience in setting up and using monitoring systems such as Prometheus and Grafana
Experience in system profiling and custom tuning for target workloads to achieve higher performance and lower cost of ownership
Very good written and verbal communication skills
Very good at writing technical documentation meant to serve as manuals for non-experts in the field
Experience in HPC cluster management and workload orchestration software (e.g. SLURM, Torque, LSF)
Experience in setting up deep learning training and inference solutions
Experience in private cloud infrastructure such as Kubernetes, OpenStack, CloudStack, etc.
Experience in distributed high-performance computing and parallel programming frameworks
Good knowledge of low-latency and high-throughput data transfer technologies (RDMA on RoCE, InfiniBand)
Benefits
supportive work culture that encourages you to learn, develop, and grow your career
commitment to providing programs and support that encourage personal and professional growth
health and wellbeing programs
Job title
Principal Software Architect – High-Performance Computing