AI Support Engineer ensuring rapid triage, root cause analysis, and resolution for production AI incidents. Monitoring system health and collaborating with engineers to implement observability best practices.
Responsibilities
Serve as the first line of defense for production AI incidents, ensuring rapid triage, root cause analysis, and resolution.
Monitor system health and performance of deployed AI applications, agentic and RAG-based solutions, MCPs, and orchestration platforms.
Track and investigate issues related to latency, failures, model drift, hallucination, prompt misbehavior, or broken integrations, escalating to the AI engineering group where appropriate.
Collaborate with AI and platform engineers to implement observability, logging, and alerting best practices for all AI services.
Build diagnostic tools, runbooks, and automated workflows to improve incident response time and reduce manual intervention.
Maintain knowledge bases and playbooks for repeatable troubleshooting and knowledge transfer.
Partner with governance and compliance teams to ensure incidents are documented and remediated in line with internal policy.
Contribute to postmortems and continuous improvement efforts to harden production systems.
Requirements
4+ years of experience in production support, software engineering, site reliability engineering (SRE), or DevOps, preferably supporting GenAI and/or ML systems.
Strong understanding of cloud infrastructure (AWS, GCP) and AI observability tools (e.g., Fiddler AI, Arize AI, IBM watsonx.governance).
Experience with LLM and GenAI systems (OpenAI, Azure OpenAI, Bedrock, Vertex AI, or similar).
Familiarity with modern orchestration and agentic frameworks such as LangChain, LangGraph, AutoGen, or CrewAI.
Proficiency in Python or shell scripting for automation and troubleshooting.
Strong analytical, communication, and incident management skills.
Bachelor’s degree in Computer Science, Engineering, or a related field.
1+ years of experience in AI/ML engineering, with a focus on Generative AI.
Proficiency in programming languages such as Python.
Strong understanding of Generative AI models (e.g., GPT, Transformer architectures) and experience in distilling, tuning and training them.
Familiarity with Retrieval Augmented Generation (RAG) techniques and their implementation.
Experience with agentic AI concepts and developing autonomous AI workflows.
Hands-on experience with GCP Vertex AI, AWS Bedrock and SageMaker, and Snowflake Cortex platforms and their AI/ML capabilities.
Experience building production-grade AI/ML systems at scale.
Knowledge of MLOps practices, including model deployment and lifecycle management.
Technical Analyst supporting ITV's Enterprise Architecture team to analyse and document architecture, workflows, and automation opportunities across technical systems. Collaborating with diverse stakeholders and ensuring alignment with architectural standards.
Manager of Engineering Technical Support overseeing product structure management and engineering change processes. Leading cross-functional teams and maintaining product data for industrial lift trucks at Hyster-Yale.
Technical Support Manager at ADI Centre of Excellence providing pre- and post-sales support for Fire & Security products. Driving team success and improving customer satisfaction and retention.
Technical Support Team Leader overseeing Support Consultants at Procentia. Focusing on Python troubleshooting and team development for efficient technical operations.
Technical Support Engineer ensuring exceptional support for B2B customers in cloud infrastructure and financial operations. Assisting Ocean & Elastigroup customers integrating with the Flexera One platform for unified cost optimization.
Technical Support Engineer providing technical support for semiconductor equipment in South Korea. Responsible for troubleshooting and guiding field engineers at customer sites.
Technical Support Agent supporting Key Management Infrastructure Help Desk for CACI. Collaborating with Department of Navy and resolving technical issues for client accounts.
IT Client Services role providing Tier 1-3 support for software, hardware, and meeting technology issues. Involvement in user education and asset management within an inclusive workplace.
Resource Management & Budgeting Support Analyst at EY responsible for resource allocation and budget management. Collaborating with project managers and operational teams to ensure effective resource utilization.
On-Site Support Technician at SONDA providing customer support. Responsibilities include remote and on-site support, inventory management, and equipment maintenance.