Lead SRE managing a team of two engineers and the cloud infrastructure for Robin AI's Legal AI platform. Drive monitoring strategies for high availability and service reliability in collaboration with the CTO.
Responsibilities
Lead and mentor a team of two SRE Engineers, providing technical guidance and career development
Work closely with the CTO to define and implement the technical infrastructure roadmap
Establish monitoring strategies and implement solutions to enhance reliability, scalability, and cost-efficiency
Collaborate with development team leads to optimise build, test, and deployment processes
Lead incident response and establish processes for troubleshooting production issues
Organise and oversee on-call rotations to ensure 24/7 system reliability
Drive documentation standards and knowledge sharing within the engineering organisation
Requirements
5+ years of experience in DevOps or Site Reliability Engineering roles, with 2+ years in a managerial position
Proven experience managing and mentoring technical team members
Proficiency in at least one backend programming language (we use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed with Terraform
Knowledge of observability frameworks and tools (we use OpenTelemetry, CloudWatch, and Datadog)
Excellent leadership, communication, and problem-solving skills
Experience with AI/ML infrastructure deployment and scaling
Benefits
Generous equity scheme - everyone gets to be an owner of Robin AI!
20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.