Site Reliability Engineer ensuring scalable infrastructure in AI product deployment for top AI companies. The role involves building automated processes and collaborating across teams.
Responsibilities
Build and maintain scalable infrastructure to support the deployment and operation of machine learning models.
Establish standards and best practices for reliability and performance across the infrastructure.
Automate processes when relevant, particularly for managing CI/CD pipelines.
Own products and projects end-to-end, functioning as both an engineer and a project manager, with a focus on user empathy, project specification, and end-to-end execution.
Collaborate with cross-functional teams to understand project requirements and translate them into technical solutions.
Mentor junior team members and contribute to knowledge sharing within the organization.
Navigate ambiguity and exercise good judgment on tradeoffs and tools needed to solve problems, avoiding unnecessary complexity.
Demonstrate pride, ownership, and accountability for your work, expecting the same from your teammates.
Requirements
Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or related field.
Extensive experience with Kubernetes.
Experience in building and maintaining scalable infrastructure.
Experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins).
Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus.
Ability to own projects end-to-end, from project specification to execution.
No prior machine learning experience is required, but candidates should be open to learning about it.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Site Reliability Engineer focused on designing and maintaining observability platform for dLocal. Collaborating with global teams and optimizing system performance for major clients.
Staff Site Reliability Engineer focused on product engineering for Civica. Leading technical practices and architectural alignment while improving service delivery and quality.
Senior Cloud Operations Engineer at CELUM focusing on cloud infrastructure and system security. Collaborating on IT projects and optimizing hosting environments.
DevOps Engineer at FormativGroup focusing on Kubernetes management and automation solutions. Designing, implementing, and securing infrastructure for efficient application deployment in a remote setting.
Senior AWS Cloud Engineer designing and building cloud infrastructure at Emergn. Collaborating with global teams to enhance scalable and reliable delivery of products.
SRE/DevOps Engineer improving platform reliability for multi-award-winning digital payments platform. Working from UK offices and collaborating with engineers to build a developer-friendly platform.
Senior SRE designing and implementing infrastructure to support real-time data processing for Pigment's AI-powered business planning. Collaborating closely with software engineers and taking ownership of performance challenges.
DevOps Engineer responsible for Azure infrastructure development and optimization at Bromcom. Ensuring stability, security, and scalability of the cloud platform with CI/CD automation and monitoring.
DevOps Engineer developing and maintaining CI/CD pipelines using Azure DevOps at RebelDot. Collaborating with teams on cloud and hybrid deployments in Romania.
Staff Software Engineer joining Site Reliability team ensuring performance and reliability of legal AI platform. Designing monitoring and alerting systems while managing operations across global regions.