About the role

Site Reliability Engineer focused on designing and maintaining the observability platform for dLocal. The role involves collaborating with global teams and optimizing system performance for major clients.

Responsibilities

Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals (logs, metrics, and traces), ensuring standardized, scalable, and efficient data ingestion.
Empower Engineering Teams: Build self-service automation and tooling that enable development teams to instrument and leverage observability without requiring manual intervention from the SRE team.
Support Incident Management: Act as the engineering counterpart of our Incident Management Team, designing the processes, playbooks, checklists, and automations that the team and other engineers follow during an incident.
Collaborate Across Teams: Work with members of almost every team across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that ensure we meet or exceed them.
Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and observability configurations across our OpenTelemetry (OTel) pipelines.
Define Baseline Observability Standards: Design baseline requirements for new and existing services so that all dLocal infrastructure and code are monitored consistently and accurately.
Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and keep every alert actionable, reducing alert fatigue and improving response efficiency.

Requirements

Over 4 years of experience as an SRE or in a very similar role, preferably with a focus on observability.
Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices.
Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization.
Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar).
Experience integrating incident management platforms (PagerDuty, Jira) with automated alerting workflows.
Strong scripting abilities (Python, Go, or similar) for automating observability tasks.
A problem-solving mindset, with the ability to collaborate across cross-functional teams to drive reliability improvements.
Cloud experience, especially with AWS and ECS-based workloads.
Experience managing observability pipelines at scale in high-throughput environments.
Familiarity with Configuration-as-Code (Ansible, Chef, or SaltStack) for managing configurations across legacy instances.
Database performance monitoring experience, particularly in large-scale distributed environments.

Benefits

Flexibility: we have flexible schedules and we are driven by performance.
Fintech industry: work in a dynamic and ever-evolving environment, with plenty to build and boost your creativity.
Referral bonus program: our internal talents are the best recruiters - refer someone ideal for a role and get rewarded.
Learning & development: get access to a Premium Coursera subscription.
Language classes: we provide free English, Spanish, or Portuguese classes.
Social budget: you'll get a monthly budget to chill out with your team (in person or remotely) and deepen your connections!
dLocal Houses: want to rent a house to spend one week anywhere in the world coworking with your team? We’ve got your back!

Hybrid Site Reliability Engineer, Technical Referent

at dLocal