AI Agent Evaluation & Observability Framework

Built comprehensive evaluation and observability infrastructure for LogicMonitor’s experimental monitoring agent project, enabling rapid iteration and production visibility.

Role

AI Engineer on a team improving LogicMonitor’s Edwin AI agent.

Challenge

LogicMonitor had developed an experimental observability AI agent and brought in additional engineering help to take it to production for their users.

Contributions

  • Evaluation Framework: Designed and implemented a dev-time and production-time evaluation process using Promptfoo and LangFuse
  • Evaluation Types: Created tool-choice, tool-correctness, and LLM-as-a-judge evaluations to assess agent performance from multiple angles (a tool-choice assertion is sketched after this list)
  • Model Sweep Architecture: Built infrastructure for systematic model comparison, providing the client with performance metrics to guide model selection decisions
  • Real-time Observability: Enabled client SMEs to watch agent interactions in real time and surface potential problems before customers reported them (a tracing sketch follows the list)
  • Rapid Iteration: The evaluation harness let the team validate changes before pushing any code
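
The tool-choice checks mentioned above can be wired into Promptfoo as custom Python assertions via its get_assert hook. The sketch below is illustrative rather than the production code: the expected_tools test variable, the file name, and the JSON shape assumed for the agent's tool-call output are all assumptions about how the harness is wired up.

    # Custom Python assertion for Promptfoo, referenced from promptfooconfig.yaml as:
    #   assert:
    #     - type: python
    #       value: file://tool_choice_assert.py
    # The expected_tools variable and the tool-call output shape are illustrative.
    import json


    def get_assert(output: str, context: dict) -> dict:
        """Check that the agent called exactly the tools the test case expects."""
        expected = set(context["vars"].get("expected_tools", []))

        # Assume the provider returns the agent's tool calls as a JSON list of
        # {"name": ..., "arguments": ...} objects; adapt to the agent's real format.
        try:
            called = {call["name"] for call in json.loads(output)}
        except (ValueError, KeyError, TypeError):
            return {"pass": False, "score": 0.0,
                    "reason": "output was not parseable tool-call JSON"}

        missing, extra = expected - called, called - expected
        passed = not missing and not extra
        return {
            "pass": passed,
            "score": 1.0 if passed else 0.0,
            "reason": f"missing={sorted(missing)}, unexpected={sorted(extra)}",
        }

The tool-correctness and LLM-as-a-judge evaluations follow the same pattern; the judge cases map naturally onto Promptfoo's built-in llm-rubric assertion type.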

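For the real-time observability side, the snippet below is a minimal sketch of decorator-based tracing with the LangFuse Python SDK and its OpenAI drop-in wrapper; the handle_user_query entry point and the prompt are hypothetical stand-ins for the agent's actual code paths.

    # Minimal LangFuse tracing sketch. Credentials come from the standard
    # LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment variables.
    # handle_user_query is a hypothetical stand-in for the agent's real entry point.
    from langfuse.decorators import observe
    from langfuse.openai import openai  # drop-in wrapper that auto-traces OpenAI calls


    @observe()  # one LangFuse trace per user interaction
    def handle_user_query(query: str) -> str:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Answer this monitoring question: {query}"}],
        )
        return response.choices[0].message.content


    if __name__ == "__main__":
        print(handle_user_query("Which hosts alerted in the last hour?"))

Because tracing lives in decorators and the wrapped client rather than in the agent's business logic, each interaction appears in the LangFuse UI as it happens, which is what let the client's SMEs spot problems before customers reported them.
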
Results

Delivered a complete evaluation and observability stack that transformed how the team developed and monitored its AI agent, along with concrete model recommendations backed by measured performance metrics.

Technologies

  • Promptfoo (evaluation framework)
  • LangFuse (observability)
  • OpenTelemetry
  • Python
  • OpenAI