Site Reliability Engineer maintaining cloud infrastructure reliability for Tecsys solutions. Collaborating across teams to support services and implement automation, observability, and frameworks.
Responsibilities
Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Act as an agent orchestrator using Amazon Kiro: run multiple activities in parallel by leveraging AI agents to accelerate execution, while personally validating results and completing selected tasks manually when needed.
Be on-call.
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
Implement monitoring, logging, alerting, and SLA reporting.
Create and maintain technical documentation.
Implement, maintain and mature SRE best practices.
Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining a high level of performance, availability, and reliability for our users.
Requirements
5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
Experience with incident management, on-call participation, escalation, and structured postmortems.
Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.
Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
Experience with FedRAMP (Federal Risk and Authorization Management Program) compliance is a strong asset.
Basic knowledge of Java- or .NET-based development required.
Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.
Escalation on-call rotation
Occasional travel (quarterly offsites, conferences – less than 10%)
Software Engineer - DevSecOps designing modern software systems for aerospace programs at Northrop Grumman. Collaborating with multi-disciplinary teams in an Agile environment to implement the DevSecOps lifecycle.
Principal Software Engineer focused on the DevSecOps software factory at Northrop Grumman. Working with multi-disciplinary teams to implement DevSecOps practices for aerospace programs across various locations.
Sr. Systems Engineer implementing and optimizing CI/CD platforms at Arch Capital Group. Collaborating with teams and driving DevOps strategy with expertise in cloud technologies.
Java Full Stack and AWS DevOps Developer for Boeing's Manufacturing Quality Information Technology Team, maintaining and enhancing software systems and DevOps environments while ensuring compliance.
Senior DevOps Engineer at One Pass redefining health engagement, managing scalable cloud infrastructure and enhancing automation. Collaborating across teams to ensure system reliability and performance.
DevOps Engineer at One Pass building and improving cloud infrastructure in AWS. Collaborating with engineers on deployments, reliability, and automation in a fast-paced environment.
Senior Release Engineer designing CI/CD pipelines for Kaseware’s mission-critical software. Collaborating with engineering, security, and operations teams to ensure fast and reliable deployments.
DevOps Engineer managing Kubernetes and cloud infrastructure for an innovative legal software startup. Collaborating with development teams and ensuring smooth deployment processes.
DevOps Architect defining and evolving AgencyBloc’s cloud and DevOps strategy. Leading design of infrastructure and CI/CD frameworks for secure and scalable SaaS platforms.
DevOps Engineer at VERBI Software GmbH managing AWS-centric infrastructure and driving reliability, scalability, and modernization. Hands-on role applying SRE principles to evolve towards cloud-native best practices.