AI Agent Evaluation & Observability Framework

Built comprehensive evaluation and observability infrastructure for LogicMonitor’s experimental monitoring agent project, enabling rapid iteration and production visibility.

Role

AI Engineer on a team improving LogicMonitor’s Edwin AI agent.

Challenge

LogicMonitor had developed an experimental observability AI agent and brought in additional engineering help to take it to production for their users.

Contributions

  • Evaluation Framework: Designed and implemented a dev-time and production-time evaluation process using Promptfoo and LangFuse
  • Evaluation Types: Created tool-choice, tool-correctness, and LLM-as-a-judge evaluations to assess agent performance from multiple angles (a tool-choice assertion is sketched after this list)
  • Model Sweep Architecture: Built infrastructure for systematic model comparison, providing the client with performance metrics to guide model selection decisions
  • Real-time Observability: Enabled client SMEs to watch agent interactions in real time and surface potential problems before customers reported them (a tracing sketch follows the list)
  • Rapid Iteration: The evaluation harness let the team validate changes before pushing any code
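
The tool-choice checks mentioned above can be wired into Promptfoo as custom Python assertions via its get_assert hook. The sketch below is illustrative rather than the production code: the expected_tools test variable, the file name, and the JSON shape assumed for the agent's tool-call output are all assumptions about how the harness is wired up.

    # Custom Python assertion for Promptfoo, referenced from promptfooconfig.yaml as:
    #   assert:
    #     - type: python
    #       value: file://tool_choice_assert.py
    # The expected_tools variable and the tool-call output shape are illustrative.
    import json


    def get_assert(output: str, context: dict) -> dict:
        """Check that the agent called exactly the tools the test case expects."""
        expected = set(context["vars"].get("expected_tools", []))

        # Assume the provider returns the agent's tool calls as a JSON list of
        # {"name": ..., "arguments": ...} objects; adapt to the agent's real format.
        try:
            called = {call["name"] for call in json.loads(output)}
        except (ValueError, KeyError, TypeError):
            return {"pass": False, "score": 0.0,
                    "reason": "output was not parseable tool-call JSON"}

        missing, extra = expected - called, called - expected
        passed = not missing and not extra
        return {
            "pass": passed,
            "score": 1.0 if passed else 0.0,
            "reason": f"missing={sorted(missing)}, unexpected={sorted(extra)}",
        }

The tool-correctness and LLM-as-a-judge evaluations follow the same pattern; the judge cases map naturally onto Promptfoo's built-in llm-rubric assertion type.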

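For the real-time observability side, the snippet below is a minimal sketch of decorator-based tracing with the LangFuse Python SDK and its OpenAI drop-in wrapper; the handle_user_query entry point and the prompt are hypothetical stand-ins for the agent's actual code paths.

    # Minimal LangFuse tracing sketch. Credentials come from the standard
    # LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment variables.
    # handle_user_query is a hypothetical stand-in for the agent's real entry point.
    from langfuse.decorators import observe
    from langfuse.openai import openai  # drop-in wrapper that auto-traces OpenAI calls


    @observe()  # one LangFuse trace per user interaction
    def handle_user_query(query: str) -> str:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Answer this monitoring question: {query}"}],
        )
        return response.choices[0].message.content


    if __name__ == "__main__":
        print(handle_user_query("Which hosts alerted in the last hour?"))

Because tracing lives in decorators and the wrapped client rather than in the agent's business logic, each interaction appears in the LangFuse UI as it happens, which is what let the client's SMEs spot problems before customers reported them.
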
Results

Delivered a complete evaluation and observability stack that transformed how the team developed and monitored its AI agent, along with concrete model recommendations backed by measured performance metrics.

Technologies

  • Promptfoo (evaluation framework)
  • LangFuse (observability)
  • OpenTelemetry
  • Python
  • OpenAI