Onsite Software Engineer – Machine Learning Infrastructure

Posted 3 hours ago

About the role

  • As a Software Engineer for ML Infrastructure at Slack, you will architect systems that support large-scale AI deployment and reliability. The role involves deep systems engineering focused on the ML lifecycle and infrastructure scalability.

Responsibilities

  • Design, build, and operate systems to train, serve, and deploy machine learning models at scale, with a focus on reliability, performance, and operational simplicity
  • Evolve GPU-backed inference infrastructure to support high-throughput, latency-sensitive workloads, including large-scale model serving
  • Architect and optimize distributed training and data processing systems using platforms such as Ray, Airflow, Spark, or similar technologies
  • Build and maintain Kubernetes-based platforms and orchestration layers using tools such as KubeRay, vLLM, and internally developed services
  • Architect solutions that bridge legacy systems with modern technologies while keeping the monolithic application stable
  • Develop robust monitoring, observability, and alerting for production ML workloads to ensure operational excellence
  • Partner closely with AI Platform, ML modeling, security, and product engineering teams to design infrastructure that supports evolving AI use cases
  • Provide technical leadership through design reviews and mentorship, and by setting engineering standards and long-term architectural direction for ML infrastructure
  • Author technical design and architecture documentation, and contribute thought leadership through engineering blog posts

Requirements

  • Significant professional experience in software engineering with a strong focus on infrastructure, backend systems, platform engineering, or MLOps
  • Deep experience building and operating distributed systems, including expert-level knowledge of Kubernetes and container-based platforms
  • Hands-on experience with modern ML infrastructure and serving stacks such as Ray or KubeRay, vLLM, or similar training and inference orchestration frameworks
  • Experience working with GPU infrastructure, including performance optimization and operational management at scale
  • Strong experience with data infrastructure and orchestration technologies such as Airflow, Spark, or similar systems
  • Experience building and operating cloud-native systems on public cloud platforms such as AWS, GCP, or Azure, including infrastructure as code
  • A demonstrated ability to drive technical direction for complex systems and balance short-term delivery with long-term architectural goals
  • Excellent written communication and the ability to thrive in an asynchronous, globally distributed infrastructure team
  • A related technical degree is required

Benefits

  • time-off programs
  • medical
  • dental
  • vision
  • mental health support
  • paid parental leave
  • life and disability insurance
  • 401(k)
  • employee stock purchase program

Experience level

Mid level, Senior

Salary

$164,000 - $313,700 per year

Degree requirement

Bachelor's Degree
