Onsite Site Reliability Engineer II

Posted 50 minutes ago

About the role

  • As a Site Reliability Engineer, you will improve the reliability and availability of Forcepoint products through automation and operational efficiency, engage in incident response, and collaborate with development teams.

Responsibilities

  • Monitor, measure and improve the reliability, availability and scalability of Forcepoint products and infrastructure
  • Engage in incident response and participate in post-mortem analysis to investigate root causes and capture contributing factors for remediation
  • Perform analytics on previous incidents and trend/usage patterns to better predict issues and take proactive actions
  • Design and build custom tools as needed to support process optimization, challenging the status quo and improving operational efficiency
  • Participate in 24/7 rotational shifts and on-call duty to handle production operations issues
  • Identify manual routine operational practices and build robust automation capabilities using code and modern tools
  • Review and create dashboards and reports for application telemetry and infrastructure health to proactively identify performance constraints and bottlenecks
  • Monitor product performance and availability, and provide feedback to develop, test, and implement robust monitoring, alerting, and logging solutions
  • Work collaboratively with software developers to promote best practices in reliability and operability, including code reviews and architectural discussions
  • Work with stakeholders to monitor our products, ensuring that they meet architecture and observability design requirements

Requirements

  • Strong understanding of cloud-based architecture and operations
  • Hands-on experience with Amazon Web Services is preferred
  • Experience in administration/build/management of Linux systems
  • Foundational understanding of Infrastructure and Platform Technology stacks
  • Strong understanding of networking concepts and theory, including protocols (TCP/IP, UDP, routing protocols, etc.), VLAN configuration, DNS, the OSI layers, and load balancing
  • Understanding of security architecture and certificate management
  • Working knowledge of infrastructure and application monitoring platforms such as Grafana Cloud, Xymon, LibreNMS, etc.
  • Working knowledge of incident response and alerting platforms such as PagerDuty, Opsgenie, xMatters, etc.
  • Understanding of core DevOps practices (CI/CD pipelines, release management, etc.)
  • Ability to write code in at least one modern programming language (Python, JavaScript, Ruby, etc.)
  • Additional scripting skills are preferred
  • Configuration management platform understanding and experience (Chef/Puppet/Ansible)
  • Prior experience with cloud management automation tools (Terraform, CloudFormation, etc.)
  • Experience with source code management software and API automation is crucial
  • Cloud certifications or equivalent experience are highly regarded
  • Service-availability-oriented mindset with a proactive approach to problem solving
  • Ability and willingness to challenge the status quo and optimize current procedures and processes
  • Strong sense of ownership and an ability to drive cross-functional process improvement
  • Excellent interpersonal, written, and verbal communication skills
  • Analytical and logical approach to problem solving and a willingness to automate repetitive tasks and reduce manual and reactive workload
  • Ability and willingness to coach and mentor team members and colleagues

Benefits

  • Flexible work arrangements
  • Professional development opportunities
  • Paid time off

Job title

Site Reliability Engineer II

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

No Education Requirement
