Senior Site Reliability Engineer managing cloud infrastructure for SaaS solutions at PROS Holdings, with a focus on reliability, automation, and team collaboration in a hybrid work environment.
Responsibilities
Design, implement, and maintain secure, scalable infrastructure across cloud environments
Analyze cloud environment requirements from various sources, document system designs, and implement necessary modifications
Automate repetitive system tasks and manage system-related activities for internal and external clients, including Professional Services support
Ensure system reliability through robust failover mechanisms, disaster recovery processes, and 24/7 support strategies
Design, implement, and improve monitoring tools to meet SLOs, ensuring a “Monitor by Design” approach is adopted across product teams
Continuously drive reliability improvements through proactive initiatives, data-driven SLO adjustments, and advanced monitoring/alerting solutions
Lead and coordinate disaster recovery testing exercises and capacity planning to enhance system reliability
Identify and reduce operational toil through automation and tool development
Apply and enforce security best practices across cloud environments, while mentoring team members on SLO achievement
Facilitate cross-team communication, provide training, and maintain clear documentation (e.g., runbooks and procedures)
Support cloud environment management and propose technology changes to improve performance and reliability
Requirements
7+ years of experience as a System Administrator, DevOps Engineer, SRE, or in a similar role
Deep knowledge of Linux administration, including performance monitoring, tuning, and troubleshooting
Experience with cloud network design (Azure preferred, AWS or GCP also considered)
Proficiency in scripting (e.g., Bash, Python) for automation
Experience with version control software (preferably Git)
Experience with configuration management tools (e.g., Puppet, Foreman, Ansible, or similar)
Knowledge of container orchestration tools (e.g., Kubernetes, Docker Swarm)
In-depth knowledge of monitoring and logging solutions for cloud infrastructure (e.g., Prometheus, Grafana)
Bachelor’s degree in Computer Science or a related field
Excellent time management, organizational, crisis management, and problem-solving skills
Self-starter, able to work independently without direct supervision
Willingness to innovate, learn, and share knowledge
Excellent verbal and written communication skills
Experience developing and implementing IT security best practices and procedures
Willingness to participate in on-call rotations and respond to incidents in a timely and effective manner