Site Reliability Engineer focused on designing and maintaining observability solutions for a fintech company. Collaborating across teams and automating infrastructure for global payment processing.
Responsibilities
Own OpenTelemetry pipelines: design, implement, and maintain observability pipelines across the three primary signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies to balance cost, performance, and usability.
Empower engineering teams: build self-service automation and tooling that enables development teams to instrument and use observability without manual intervention from the SRE team. Drive adoption of best practices while ensuring teams own their telemetry.
Support incident management: act as the engineering lead for our Incident Management Team by designing processes, playbooks, checklists, and automations for engineers to follow during incidents.
Collaborate across teams: work with members from nearly every team across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that meet or exceed those requirements. Influence architectural decisions during initial design to ensure resilience and scalability from the outset.
Automate observability infrastructure: use Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and observability configurations across OTEL pipelines.
Define baseline observability standards: design baseline requirements for new and existing services to ensure all dLocal infrastructure and code are monitored consistently and accurately.
Own technical and security health: take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
Optimize alerting systems: continuously refine alerting signals to minimize noise, ensure alerts are actionable, reduce fatigue, and improve response efficiency.
Requirements
4+ years of experience as an SRE or in a closely related observability-focused role.
Expertise in Kubernetes, including core components, deployment methods, and monitoring best practices.
Familiarity with OpenTelemetry, including configuring OTEL collectors, instrumentation, and pipeline optimization.
Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
Hands-on experience with IaC tools (Terraform) and GitOps and CI/CD solutions (Argo CD, GitHub Actions, or similar).
Strong scripting skills (Python, Go, or similar) for automating observability tasks.
Problem-solving mindset with the ability to collaborate across cross-functional teams to drive reliability improvements.
Cloud experience, especially AWS and ECS-based workloads.
Experience managing observability pipelines at scale in high-throughput environments.
Familiarity with Configuration-as-Code tools (Ansible, Chef, or SaltStack) for managing configurations across legacy instances.
Database performance monitoring experience, particularly in large-scale distributed environments.
Benefits
Flexibility: flexible schedules driven by performance.
Fintech industry: work in a dynamic, fast-evolving environment with ample opportunity to build and innovate.
Referral bonus program: our employees are our best recruiters—refer a great candidate for a role and get rewarded.
Social budget: receive a monthly budget to spend with your team (in person or remotely) to strengthen connections.
dLocal Houses: rent a house to work remotely with your team for a week anywhere in the world—we’ve got you covered.
Flexible work approach: we focus on impact and productivity over fixed hours. Depending on your role and location, you’ll combine self-managed focused time with in-person collaboration in our hubs.