Lead SRE team to design and operate scalable, secure cloud infrastructure for Instabase's AI platform. Manage CI/CD, Kubernetes, production reliability, and release processes.
Responsibilities
Define and steer the technical direction for the team, collaborating with cross-functional partners
Develop and execute comprehensive short- and long-term roadmaps balancing business needs, user experience, and technical foundations
Oversee cloud infrastructure and deployment automation to ensure efficient and reliable operations
Guarantee uptime and reliability for production systems through proactive monitoring and production support
Manage vulnerability assessments and facilitate prompt remediation
Maintain and enhance CI/CD and build infrastructure to support development workflows
Implement and optimize tools to enhance developer productivity
Drive improvements in release management processes and tooling to ensure smooth, reliable software delivery
Build scalable, distributed, and fault-tolerant systems integrating Software and Systems Engineering to drive performance, capacity, and reliability
Requirements
5+ years of experience in Site Reliability Engineering, Software Engineering, or Production Engineering
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
Proven track record of setting technical and cultural standards for engineering teams
Demonstrated experience in managing and sustaining SaaS production environments
Hands-on experience with major cloud providers such as AWS and Azure
Proficient in containerization technologies like Docker
Expertise in container orchestration platforms, especially Kubernetes
Skilled in overseeing and managing software release processes to ensure smooth deployments
Systematic approach to solving platform and production issues, strong problem-solving abilities, and a passion for automation
Benefits
Bonus
Equity
Benefits
Hybrid work
Offices in San Francisco, New York, London and Bengaluru
SRE responsible for ensuring reliability and performance of IT systems at a digital transformation company specializing in public sector efficiency. Collaborating on system health, incident response, and automation tasks.
DevOps Senior role at Beyond Soluções managing CI/CD for .NET and Kubernetes applications. Collaborating on cloud solutions while fostering a culture of innovation and quality.
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Build tools for incident response and enhance performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T required to design scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owns operational excellence across software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi-cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.