Hybrid Site Reliability Engineer – Data Platform

Posted 2 days ago

About the role

  • As a Site Reliability Engineer, you will ensure the reliability and performance of data platform services for Veepee, and collaborate on cloud migration, Kubernetes operations, and observability best practices.

Responsibilities

  • Ensure the reliability and performance of our data platform services (Trino, Iceberg, S3, Kafka, Flink)
  • Define and implement SRE best practices: SLIs/SLOs, error budgets, and observability
  • Build and maintain monitoring, alerting, and incident response frameworks (Prometheus, Grafana, etc.)
  • Contribute to the migration from a public cloud data warehouse to VeepeeCloud’s lakehouse stack
  • Support coexistence between cloud and on-prem systems and ensure data consistency and service reliability
  • Help design resilient architectures for ingestion, transformation, and serving layers
  • Operate and improve services running on Kubernetes (GKE/EKS and on-prem clusters)
  • Automate infrastructure provisioning using Terraform, Atlantis, and/or Crossplane
  • Improve GitOps workflows for platform deployment and configuration
  • Collaborate with teams to optimize compute and storage usage (Trino queries, BigQuery slots, etc.)
  • Build tools and dashboards to track cost, usage, and efficiency
  • Support the transition toward cost-efficient on-prem workloads
  • Improve self-service capabilities for data teams (e.g., provisioning Trino/Iceberg resources)
  • Help teams adopt best practices in reliability, observability, and deployment
  • Write clear technical documentation and runbooks
  • Contribute to the definition and implementation of the Disaster Recovery Plan (DRP)
  • Ensure multi-DC resilience (FR1 / NL1) and implement data replication strategies
  • Participate in incident management and postmortems

Requirements

  • Strong experience with Kubernetes in production environments
  • Experience with distributed data systems (or a strong willingness to learn)
  • Solid understanding of SRE principles (monitoring, alerting, SLAs/SLOs)
  • Experience with Infrastructure as Code (Terraform or similar tools)
  • Familiarity with GitOps workflows
  • Experience with observability tools (Prometheus, Grafana, logging systems)
  • Comfortable working in cloud environments
  • Strong collaboration mindset and the ability to work across teams
  • Fluent in English

Benefits

  • Variable bonus
  • Dynamic and creative environment within international teams
  • Access to a variety of self-learning courses on our e-learning platform
  • Opportunity to participate in local and international meetups and conferences
  • Flexible office policy with up to 3 days remote work per week

Job title

Site Reliability Engineer – Data Platform

Job type

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

No Education Requirement

Location requirements