Site Reliability Engineer focused on designing and maintaining observability platform for dLocal. Collaborating with global teams and optimizing system performance for major clients.
Responsibilities
Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion.
Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team.
Support Incident Management: Be the Engineering side of our Incident Management Team, designing the processes, playbooks, checklists, and automations for them and other engineers to follow during an incident.
Collaborate Across Teams: Interact with members from almost all teams across the business to understand their monitoring, alerting and SLO / SLA requirements and design systems and processes that ensure we meet or exceed these requirements.
Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and our observability configurations across OTEL Pipelines.
Define Baseline Observability Standards: Design base level requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately at a basic level.
Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing fatigue and improving response efficiency.
Requirements
Over 4 years’ of experience as SRE Engineer or in a very similar role more focused on observability.
Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices.
Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization.
Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar).
Senior DevOps Engineer I managing automation tooling and multi - cloud infrastructure at Spring Health. Collaborating with AI and Infrastructure teams in a hybrid Seattle office.
Site Reliability Engineer for cloudified backup platform using Commvault technology at Expleo. Joining a dynamic team to ensure backup infrastructure scalability and reliability.
Site Reliability Engineer responsible for designing and maintaining scalable services with high availability. Collaborating with development teams to enhance reliability and operational excellence.
Technical Staff leading the architecture, reliability, and modernization of enterprise ALM and DevOps tools. Driving strategy and influencing product development in collaboration with various teams.
Site Reliability Engineer responsible for reliability and availability, collaborating with development teams on scalable systems. Applying software engineering practices to improve production operations.
DevOps Engineer in the Security Data and AI Lab at Lloyds Banking Group driving data and cloud infrastructure's influence on product operations and customer service improvements.
Senior Platform DevOps Engineer at Code Metal designing and implementing cloud and hybrid infrastructure to support customer deployments and internal platforms. Collaborating with software and security teams for reliable delivery.
DevOps Platform Intern managing cloud infrastructure and deployment pipelines for AI - native software delivery. Partnering with a Product Development Intern, set up and manage containerized applications on Azure Kubernetes Service.
UNIX DevOps Engineer managing AIX and Solaris server operations for a Swiss telecom company. Focusing on automation, optimization and 7x24h monitoring responsibilities across multiple locations.