Staff Site Reliability Engineer managing global infrastructure for NordVPN, automating systems and ensuring service reliability.
Responsibilities
Deliver projects on time: Plan, delegate, execute, and oversee key projects;
Collaborate: Work closely with stakeholders and other teams. Mentor colleagues and lead knowledge transfer;
Ensure quality and reduce technical debt: Deliver solutions with solid design and address blockers, toil, and debt to keep systems healthy;
Drive engineering excellence: Aim for quality and choose the right solution for the problems we face;
Protect solution quality: Ensure designs are implemented with proper quality and minimal tech debt;
Data‑backed decisions: Help teams and stakeholders navigate data and act on insights;
Build resilient infrastructure: Design and maintain highly available, scalable infrastructure with monitoring, alerting, and anomaly detection;
Automate everything: Create and optimize automation to streamline deployments, improve speed, and cut manual work;
Solve complex issues: Troubleshoot, debug, and resolve critical issues in complex systems;
Use AI: Integrate AI into workflows and processes to speed up delivery and reduce toil.
Requirements
Observability: Experience with monitoring tools and frameworks to ensure system observability (OpenSearch, VictoriaMetrics, Prometheus, Thanos, Mimir, OpenTelemetry, Nagios);
Databases and storage systems: Experience operating highly available SQL and NoSQL databases and object stores at scale (MySQL, Percona, PostgreSQL, Cassandra, ClickHouse, Timescale, Druid, MinIO);
Data visualization: Ability to build meaningful dashboards that show the right insights (Grafana, OpenSearch Dashboards);
Alerting and anomaly detection: Ability to build anomaly detection and alerting pipelines;
Programming: Proficiency in one or more programming languages for automation scripts and integrations (Python, Go, Rust, C);
Linux: Strong knowledge of Linux systems, especially Debian‑based distributions;
Workflow: Ability to use workflow automation frameworks (Airflow, Prefect, n8n);
Configuration management: Ability to design and develop configuration management codebases and deployment pipelines (SaltStack, Ansible, Rundeck);
Networking: Strong understanding of networking protocols and concepts (overlay networks, VPN, proxies, DNS, HTTP, SSL, TCP, UDP);
Security: Ability to design secure systems and working knowledge of security concepts and tools (Vault, PKI, mTLS).
DevOps Engineer role at Keylane focuses on optimizing software processes and client interaction. Collaborate in agile teams using Java, SQL, and various technologies.
DevOps Engineer supporting automation and cloud platform technologies with team collaboration at Workday. Developing and managing CI/CD pipelines while enhancing infrastructure efficiency in a SaaS environment.
Maintenance Reliability Engineer specializing in various automated electrical/mechanical components at Northrop Grumman. Supporting manufacturing operations in Magna, Utah, for optimal equipment performance.
Senior Systems Operations Engineer supporting Payments Modernization at Wells Fargo. Managing systems operations and ensuring resilience and observability in payment platforms.
Database Reliability Engineer managing PostgreSQL infrastructure that underpins transactions at Nodal Exchange. Ensuring data integrity and performance in a regulated financial environment.
Senior Information Security Analyst responsible for integrating security practices in development. Join Panvel’s team focusing on securing applications and infrastructure.
DevOps Engineer leading the automation and adoption of DevOps best practices. Collaborating with teams to enhance agile delivery in cloud environments.
Senior Backend Engineer designing and developing backend services in Rust for Mobile DevOps. Collaborating on the Employee Superapp and implementing digital wallet services.
AI Development Operations Engineer responsible for the internal AI infrastructure empowering developers. Integrating AI systems into engineering workflows for efficient software design and maintenance.
Reliability Engineer responsible for availability and performance of U.S. Air Force Cloud services. Collaborates with teams to deliver reliable mission-critical systems in a hybrid environment.