Product Reliability Engineer role at Kraken focused on scalable energy management solutions. Collaborating with teams to ensure product performance and system resilience in a hybrid work environment.
Responsibilities
Teach and support product teams on best practices for reliability, implementation patterns and effective usage of our existing platforms
Support product teams in improving the performance and availability of their systems
Be hands-on in code and infrastructure to help product teams with reliability improvements
Provide comprehensive feedback to the wider Platform group on improvements to be made to core infrastructure based on observations and first-hand experience in the code base
Support product teams in building proofs of concept, as needed, to evolve the application deployment architecture in line with business growth and to enhance scalability and system resilience
Collaborate with product teams to support the release of new features and services, ensuring adherence to reliability and performance standards
Guide product teams in designing systems for resilience and graceful failure under heavy load
Assist application teams with post-incident tasks and follow-ups, and contribute to the creation and review of post-mortem documentation
Analyse incident metrics to identify trends and potential improvements, communicating these insights to the product teams
Help solve interesting and difficult problems. There’s a great opportunity for disruption in the global energy market
Requirements
Previous experience as a Site Reliability Engineer
Experience working on SaaS platforms, including engaging with product teams to ensure upskilling and knowledge sharing across teams
Experience managing and supporting a large-scale, internet-facing service
Experience in responding to incidents and outages, writing technical incident reports and organising incident retrospectives
Experience working with very large relational databases
Experience in using service level objectives to improve application performance
A proactive, innovative mindset
Benefits
Great communication skills, working effectively with developers, product managers and other business stakeholders to understand, design and deliver impactful projects and reliability improvements
Solid hands-on experience across our core platform stack:
AWS (supporting and improving cloud infrastructure used by product teams)
Terraform (infrastructure as code; comfortable operating with Terraform day-to-day)
Kubernetes (container orchestration and deployment management; comfortable working with Kubernetes day-to-day)
Experience using industry-standard observability tooling - we use Datadog, Grafana, Prometheus and Rootly (experience with other monitoring/alerting platforms is transferable)
Strong collaboration and communication skills - able to work effectively with developers, product managers, and other stakeholders to design and deliver impactful observability “golden paths” and monitoring experiences
Exposure to Python (or a similar language such as TypeScript, Go or C#) - able to understand how applications behave in production to support observability and reliability improvements
Previous experience working in small, highly autonomous teams
A working style that fits how we operate:
Comfortable with ambiguity and able to create structure in unclear situations
Proactive learning mindset (experiment, iterate, and adapt as the team evolves approaches)
Strong asynchronous written communication (Slack/Notion/docs) and a habit of keeping others in the loop
Autonomy and accountability - making progress independently and owning outcomes