Software Engineer involved in reliability engineering at Cursor, focusing on high reliability and durability across the stack. Collaborating with product and infrastructure teams to improve overall system stability.
Responsibilities
Own reliability work end-to-end, from user-facing symptoms (crashes, latency, streaming failures) to root causes in services, infrastructure, or vendor dependencies.
Design and implement resilience patterns for upstream dependency failures (for example model providers): fallbacks, routing strategies, and degraded-mode designs.
Build and maintain reliability guardrails that make teams faster and safer: deployment safety, rollbacks, operational playbooks, automated checks, and standards for production readiness.
Improve observability (metrics, logs, traces, and client telemetry) so engineers can quickly answer 'Is it up?' and 'What changed?'.
Reduce operational toil through automation and better tooling.
Partner with product and infrastructure engineering teams as a drop-in reliability multiplier: embed on the highest-impact problems and drive them to a durable technical outcome.
Participate in an on-call rotation and help improve incident response practices over time (severity definitions, runbooks, retrospectives, and clear ownership of follow-up fixes).
You will own a small set of high-leverage reliability 'themes' at a time (for example client crash rate, streaming reliability, deploy safety). You drive these end-to-end until the reliability bar measurably moves.
Requirements
Strong experience owning reliability for production systems, including both incident response and long-term engineering fixes.
Expert-level experience in at least one of: Go, Node/TypeScript, or Python.
Deep practical knowledge of cloud infrastructure (AWS) and modern deployment/orchestration patterns (Kubernetes and/or ECS).
Experience with observability systems and practices (metrics, logs, traces, and alerting).
Software Engineer designing and developing SaaS microservices for the Aternity platform. Collaborate with cross-functional teams to troubleshoot and optimize distributed systems for digital experience management.
Senior GenAI Architect leading transformation and migration of legacy systems for clients using AI technologies. Collaborate with teams to ensure robust and secure cloud solutions.
Software Developer at Peregrine Advisors Benefit Inc. designing applications for a Department of Defense agency. Focus on SharePoint, Power Platform, and compliance with federal cybersecurity requirements.
Technical Lead managing cloud and platform technologies for The Dufresne Group in Manitoba, enhancing system design and mentoring teams across departments.
Software Engineer developing applications for Emory University's Language Biomarker Lab. Contributing to AI and NLP projects involving psychosis and other conditions aiming to understand language indicators.
Software Developer supporting critical mission work for government customers in a hybrid environment with comprehensive cloud and cybersecurity expertise.
Senior Software Engineer developing mobile applications for Headspace's B2B partnerships. Collaborating on technical design, implementation, and ensuring scalable mobile architecture.
Linux Systems Administrator for the Eng Cloud division, responsible for managing and evolving Linux systems and infrastructure. Overseeing installations, configurations, and support on cloud platforms and high-availability environments.
Principal Engineer leading stability, performance, and operational maturity efforts for Ascend at Henry Schein. Focusing on telemetry, observability, incidents, and proactive reliability engineering.
Software Engineer at Walmart designing scalable backend systems with Java. Mentoring junior engineers and collaborating with cross-functional teams to deliver enterprise applications.