DevOps Engineer developing and maintaining observability products and automation tools at Salesforce. Overseeing system monitoring, incident response, and infrastructure reliability enhancements.
Responsibilities
Develop and maintain observability products, build tools and automation to eliminate manual touch, and respond to system failures and outages.
Gain insights into platform incidents and address repeat issues.
Leverage AIOps platforms to improve anomaly detection, automate runbooks, and meet MTTD & MTTR goals.
Oversee system monitoring, incident response, and root cause analysis in a timely manner.
Solid understanding of logging frameworks and APM tools like Splunk, Prometheus, and Grafana.
Drive continuous improvement initiatives to enhance system reliability and performance.
Collaborate with development teams and drive reliability/availability improvements.
Manage deployments, oversee patching, and mitigate security vulnerabilities.
Proactively plan and manage potential risks to ensure system reliability.
Prioritize tasks and projects effectively in a fast-paced environment to ensure critical issues are addressed promptly.
Design, implement, and maintain scalable, fault-tolerant cloud infrastructure.
Leverage container technologies like Kubernetes and Docker to enhance system reliability and efficiency.
Monitor, troubleshoot, and resolve production issues, ensuring system availability and performance.
Collaborate with cross-functional teams to diagnose incidents, improve observability, and drive post-mortem analysis.
Write clear technical documentation and communicate complex issues effectively with both technical and non-technical stakeholders.
Develop automation scripts and tooling using at least one object-oriented language (e.g., Java, Python, Go) and one scripting language (e.g., Bash, Python).
Manage network technologies, including DNS, Load Balancing, TCP/IP, HTTP, and tools like curl and OpenSSL to ensure seamless connectivity.
Continuously improve system performance, security, and cost efficiency through proactive monitoring and optimizations.
Proficiency with source control, continuous integration, and testing pipelines.
Requirements
Mastery of multiple programming languages and platforms
3 to 12 years of software development experience
Expertise in managing large, fault-tolerant, cloud-hosted systems.
Proficiency with container technologies like Kubernetes and Docker.
Proficiency with AWS, GCP or other cloud solutions.
Clear technical communication, especially about problems and incidents.
Understanding of fundamental mesh and network technologies and tools, e.g., DNS, Load Balancing, TCP/IP, HTTP, curl, and OpenSSL.
Strong problem-solving, troubleshooting, and analytical skills demonstrated in past projects.
Proven experience managing large-scale, cloud-hosted systems in AWS, Azure, or GCP.
Strong expertise in Kubernetes, Docker, and container orchestration.
Solid understanding of networking fundamentals, including DNS, TCP/IP, HTTP, and Load Balancing.
Proficiency in at least one object-oriented and one scripting language.
Exceptional troubleshooting, problem-solving, and analytical skills, demonstrated in past projects.
Excellent communication and collaboration skills, with the ability to clearly articulate technical challenges and solutions.
Benefits
Comprehensive benefits package including well-being reimbursement, generous parental leave, adoption assistance, fertility benefits, and more!
World-class enablement and on-demand training with Trailhead.com
Exposure to executive thought leaders and regular 1:1 coaching with leadership
Volunteer opportunities and participation in our 1:1:1 model for giving back to the community
SRE responsible for ensuring reliability and performance of IT systems at a digital transformation company specializing in public sector efficiency. Collaborating on system health, incident response, and automation tasks.
DevOps Senior role at Beyond Soluções managing CI/CD for .NET and Kubernetes applications. Collaborating on cloud solutions while fostering a culture of innovation and quality.
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Building tools for incident response and enhancing performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T required to design scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owns operational excellence across software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi-cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.