AIOps/LLMOps Engineer at Parspec designing and managing AI infrastructure to help transform the construction materials supply chain.
Responsibilities
Design and build document AI platforms powered by generative AI, leveraging asynchronous architectures for scalable inference.
Implement event-driven and queue-based systems to support elastic scaling and non-blocking AI workflows.
Architect and maintain self-hosted LLM infrastructure using tools such as vLLM or Ollama on Kubernetes or EC2 with GPU orchestration.
Manage production systems for LLM serving, inference pipelines, and AI workflow orchestration.
Implement LLM gateways and routing systems (e.g., LiteLLM, Portkey) to ensure proper model usage and governance.
Develop guardrails and monitoring systems to reduce hallucinations, misuse, and unsafe outputs in generative AI systems.
Implement end-to-end observability for AI/ML pipelines using distributed tracing and monitoring tools.
Monitor AI system health using platforms such as OpenTelemetry, AWS X-Ray, Prometheus, and Grafana.
Track performance metrics including latency, token usage, inference quality, and model drift.
Manage machine learning workflows using tools such as MLflow, Kubeflow, or SageMaker-managed MLflow.
Enable experiment tracking, model versioning, and deployment pipelines for production AI systems.
Work closely with engineering teams to integrate AI workflows into scalable backend systems.
Implement AI platform security controls including Bedrock Guardrails, KMS encryption, IAM least-privilege policies, VPC endpoints, and CloudTrail auditing.
Optimize AWS infrastructure (including Bedrock, SageMaker, and EKS) for cost efficiency, performance, and reliability.
Ensure production AI systems maintain high availability and security standards.
Requirements
Strong experience with AWS cloud infrastructure including services such as EC2, Lambda, S3, EKS, Bedrock, Step Functions, API Gateway, EventBridge, and SQS/SNS.
Experience building ML infrastructure using Infrastructure-as-Code tools such as Terraform or CloudFormation.
Hands-on experience deploying and operating LLM serving infrastructure using platforms such as vLLM or Text Generation Inference.
Experience managing vector databases and retrieval systems such as Pinecone, pgvector, or Weaviate.
Strong experience designing event-driven or asynchronous systems using queues (SQS, Kafka) and micro-batching patterns.
Experience implementing observability and monitoring for distributed AI systems using tools such as ELK, Prometheus, Grafana, and OpenTelemetry.
Strong programming experience in Python, including frameworks such as FastAPI and asynchronous programming patterns (asyncio).
Experience with Docker, Kubernetes, and CI/CD pipelines using tools such as GitHub Actions or ArgoCD.
5+ years of experience in MLOps, LLMOps, AIOps, or DevOps supporting machine learning or AI systems.
Proven track record building production generative AI systems with high availability and scalability.
Experience deploying self-hosted LLMs on AWS infrastructure and building production-grade document AI platforms.
Experience operating AI systems with >99.9% uptime and cost-efficient infrastructure management.
Benefits
Competitive salary and benefits, including family insurance coverage
Free health teleconsultations
Learning/upskilling budgets
Equity in the company
Flexible hours and a hybrid work setup
Unlimited PTO
Opportunity to grow with a fast-scaling company transforming a large market