Staff Software Engineer joining the Site Reliability team to ensure the performance and reliability of a legal AI platform. The role covers designing monitoring and alerting systems and managing operations across global regions.
Responsibilities
Design, implement, and manage monitoring, alerting, and infrastructure resources (compute, storage, networking) across 50+ global regions
Lead incident management processes, including postmortems, root cause analyses, and driving actionable improvements
Automate operational tasks and workflows, building tools and processes for capacity planning, graceful rollouts, and safe data access to maintain high reliability and reduce manual intervention
Establish best practices for security, compliance, and reliability, and collaborate across teams to apply them throughout the software lifecycle
Optimize infrastructure costs through strategic capacity planning and build-versus-buy decisions while maintaining system performance, reliability, and functionality
Provide technical mentorship and leadership, promoting best practices and fostering team growth
Requirements
10+ years of experience in Site Reliability Engineering or similar roles supporting production environments, with proven ability to mentor and guide technical teams
Expertise in infrastructure as code (IaC) tools (Pulumi, Terraform, CloudFormation, etc.)
Deep familiarity with observability tools (Datadog, Sentry, etc.) and incident response tooling and practices (PagerDuty, incident.io, etc.)
Proficiency with cloud infrastructure platforms (Azure, GCP, AWS, etc.)
Strong programming skills (Python, Bash, Go, or similar languages)
Proven track record of diagnosing complex system problems and implementing durable solutions
Solid understanding of CI/CD, Kubernetes, containerization, networking, databases, and cloud security principles
Excellent problem-solving skills, meticulous attention to detail, and a commitment to operational excellence
Work eligibility: Must be authorized to work in India. Visa sponsorship is not available for this role.