Hybrid Senior Site Reliability Engineer

Posted 9 hours ago

About the role

  • Incedo is hiring a Senior Site Reliability Engineer to drive observability and reliability for its business-critical systems. The role works with engineering teams to improve system resilience and performance.

Responsibilities

  • Design, implement, and maintain observability solutions across distributed systems
  • Build and optimize logging, metrics, and tracing pipelines using tools like Dynatrace, Datadog, Splunk, ELK, Grafana, and OpenTelemetry
  • Enable end-to-end transaction tracing across microservices and APIs
  • Develop dashboards and alerting strategies for proactive issue detection
  • Own service reliability, uptime, and operational performance for critical systems
  • Lead incident response, root cause analysis (RCA), and postmortems
  • Reduce MTTD and MTTR through automation and improved observability
  • Create and maintain runbooks and incident response playbooks
  • Monitor and optimize system performance (latency, throughput, error rates)
  • Partner with application and database teams to troubleshoot bottlenecks
  • Use distributed tracing and telemetry data to identify and resolve issues
  • Implement performance testing and tuning strategies
  • Build and maintain fault-tolerant, highly available systems
  • Implement resiliency patterns (failover, retries, circuit breakers, self-healing)
  • Drive chaos engineering practices to validate system reliability
  • Automate operational tasks using scripting (Python, Go, etc.)
  • Define and enforce SLOs, SLIs, and error budgets aligned to business goals
  • Promote SRE principles across engineering teams
  • Partner with DevOps and platform teams to improve CI/CD reliability
  • Contribute to building a culture of operational excellence and accountability

Requirements

  • 7–10+ years of experience in Site Reliability Engineering or Production Support Engineering
  • Strong hands-on experience with observability tools (Dynatrace, Datadog, Splunk, ELK, Grafana, OpenTelemetry, Jaeger)
  • Experience supporting cloud-native environments (AWS, Azure, or GCP)
  • Deep understanding of microservices architecture and distributed systems
  • Proficiency in scripting/programming (Python, Go, Java, or similar)
  • Experience with monitoring, alerting, and incident management in production environments

Job title

Senior Site Reliability Engineer

Experience level

Senior

Salary

Not specified

Degree requirement

No Education Requirement
