Senior Site Reliability Engineer responsible for designing reliable infrastructure that supports Fixify's SaaS platform, collaborating with product engineering teams, and maintaining operational standards for infrastructure performance.
Responsibilities
Design and maintain scalable, fault-tolerant infrastructure that supports our SaaS platform and keeps pace with business growth.
Implement observability best practices: tracing-first approaches, meaningful metrics, and monitoring that actually helps during incidents.
Define, document, and maintain SLIs, SLOs, and SLAs in partnership with product engineering, translating business commitments into technical guardrails.
Build automation that eliminates manual intervention across CI/CD, deployments, configuration management, and recovery—because your time is better spent on strategic problems.
Lead incident response with steady judgment, facilitate blameless postmortems, and drive remediation efforts that prevent recurrence.
Partner with engineering and product teams during design reviews to ensure new features are production-ready and operationally scalable.
Optimize infrastructure costs through performance tuning, capacity planning, and smart use of cloud resources.
Mentor engineers on operational best practices and champion reliability thinking across the organization.
Document infrastructure architecture clearly and maintain the kind of runbooks that your future self will thank you for.
Requirements
4+ years of experience in SRE, DevOps, or infrastructure engineering roles, with demonstrated experience supporting SaaS platforms in production.
Expert-level knowledge of an infrastructure-as-code framework (Pulumi, Terraform, CDK)—you should be the kind of person who thinks "if it's not in code, it doesn't exist."
Strong working knowledge of AWS (or equivalent cloud platforms), including designing for availability, scalability, and security.
Proficiency in TypeScript or Python for infrastructure automation and tooling.
Experience with containerization and orchestration (ECS Fargate, Kubernetes, or similar).
Deep familiarity with observability tools and practices (OpenTelemetry, CloudWatch, Honeycomb)—bonus points if you embrace a tracing-first philosophy.
Solid understanding of networking, load balancing, and distributed systems concepts.
Experience with CI/CD tooling (GitHub Actions, CodeBuild, or equivalent).
The ability to communicate complex operational issues clearly to both technical and non-technical stakeholders.
Calm effectiveness during high-pressure incidents and the judgment to balance competing priorities like performance, cost, and reliability.
A collaborative spirit and the ability to build strong relationships with engineering, product, and operations teams.
Prior experience working closely with product engineering teams is a strong plus—this role thrives on cross-disciplinary understanding.
A commitment to continuous learning and improving team practices, systems, and culture.
Benefits
Give you ownership over infrastructure that powers a globally used platform, with clear visibility into how your work drives collaboration and productivity.
Provide meaningful opportunities to learn and grow, whether that's diving deeper into distributed systems, exploring new observability paradigms, or mastering the latest cloud-native technologies.
Surround you with a team that values blameless postmortems, continuous improvement, and the kind of operational culture where everyone learns from every incident.
Share the "why" behind architectural decisions and give you a voice in shaping Fixify's reliability engineering principles as we scale.
Connect you directly with product engineers and users, so you see firsthand how reliable infrastructure translates into delighted customers.
Let you work across a hybrid container and serverless infrastructure environment, using what works best and leaning into a service’s strengths.