Site Reliability / DevOps Engineer developing Big Data platforms for clients in Telco and Retail industries. Focus on stability, scalability, and performance of large-scale data processing systems.
Responsibilities
Design, deploy, and maintain infrastructure for Big Data platforms, including Cloudera on-prem solutions and cloud-based environments on AWS or Azure
Implement, manage, and optimize monitoring solutions using Zabbix and Prometheus to ensure performance, availability, and reliability of data platforms
Troubleshoot and resolve platform-related issues, focusing on system performance, reliability, and scalability when processing large data sets
Use Terraform and Ansible for infrastructure automation and configuration management to ensure repeatable, scalable deployments
Implement and maintain CI/CD pipelines for seamless deployments and continuous integration
Deploy and manage Kubernetes clusters for orchestrating Big Data workloads and ensuring efficient resource utilization
Collaborate with network teams to ensure seamless communication between data services, optimize data traffic, and enhance security practices
Proactively identify and resolve performance bottlenecks within Big Data platforms, including resource management, cluster tuning, and workload optimization
Requirements
2+ years of hands-on experience in DevOps or SRE roles
Proficiency in monitoring systems with Zabbix or Prometheus, including tracking system metrics and alerting on them
Knowledge of Linux (RHEL) systems, including scripting, system administration, and troubleshooting
Hands-on experience with cloud environments, particularly AWS or Azure, including the deployment of cloud-native services and infrastructure
Expertise in deploying, managing, and scaling applications using Kubernetes
Experience with Terraform and Ansible for infrastructure automation and configuration management
Experience with CI/CD pipelines and tools such as Argo CD, Flux CD, or similar
Understanding of networking concepts, including security, VPNs, and performance tuning in hybrid environments
Analytical skills with experience in identifying and solving complex platform issues, particularly performance bottlenecks
Knowledge of best practices and tools for optimizing system and application performance in large-scale distributed environments
Benefits
Participation in the company’s stock options program
Flexible Benefits & Personal learning budget from day 1
10 Growth Days per year - dedicated time for learning and development
Ownership and dynamics in your role
All the support you need from our experienced team to become an even better professional
Hybrid work environment, preferably with at least 1 day per week in the office
Deployment Engineer at WRITER architecting AI solutions for enterprise customers. Collaborating with cross-functional teams to deliver impactful technologies and drive business outcomes.
DevSecOps Engineer utilizing open-source frameworks and collaboration to address client challenges at Booz Allen. Delivering user-oriented solutions consistently while mastering new tools and techniques.
DevOps Engineer designing and implementing CI/CD pipelines and supporting cloud-based solutions at eInfochips. Collaborating with QA and Engineering teams for release readiness.
DevOps Engineer III providing L3 support for Operations across Edge/on-prem and cloud environments. Building automations and handling incidents for customer deployments.
SRE leading reliability and operational excellence at a mortgage tech platform. Designing systems, tooling, and processes for managing Pylon's production systems in Palo Alto.
Senior Build & Release Engineer at GXO Logistics responsible for CI/CD solutions and build automation across various environments. Collaborating with teams for smooth software deployments and mentoring staff.
Senior Site Reliability Engineer improving the reliability of Acuity’s cloud services. Collaborating across teams to define observability standards and incident response in Cork Digital Centre of Excellence.
Azure Senior DevOps Engineer supporting critical cloud systems in the Azure Government Cloud environment. Leading CI/CD pipeline design and implementation with operational best practices.
Automation Engineer enhancing infrastructure and automating operations for client systems. Working in a complex environment oriented towards automation, security, and performance.
Graduate Reliability Engineer at GKN Aerospace enhancing operational excellence through data analysis and project participation within large structural assemblies.