Site Reliability Engineer responsible for the design, deployment, and operation of customer-facing SaaS platforms. Collaborates across teams to ensure high availability and performance in the cloud environment.
Responsibilities
Design, deploy, and operate SaaS platforms on AWS.
Work with Kubernetes, Terraform, Crossplane, and GitOps practices to automate infrastructure.
Develop and maintain ArgoCD pipelines and reusable automation assets.
Manage monitoring and observability using tools like Prometheus, Grafana, Loki, OpenTelemetry, and Datadog.
Investigate and resolve system, application, and network issues.
Ensure platforms adhere to security and compliance standards.
Requirements
3–7+ years of experience in SRE, DevOps, CloudOps, or cloud engineering roles.
Strong background working with AWS services and SaaS architectures.
Experience managing reliability metrics and applying SRE principles in production environments.
Proficiency with AWS (networking, compute, storage, IAM, multi-account environments).
Strong understanding of containers and Kubernetes (EKS preferred).
Experience with Terraform, Git, CI/CD, ArgoCD, and Infrastructure-as-Code practices.
Scripting skills (Python, Bash/PowerShell), fluency with YAML, and experience with tools like Crossplane or Ansible.
Solid experience with observability stacks (Grafana, Prometheus, Loki, Datadog, OpenTelemetry).
Good knowledge of system design, troubleshooting, and performance analysis.
Clear communicator with strong organizational skills.