Built comprehensive evaluation and observability infrastructure for LogicMonitor's experimental AI monitoring agent, enabling rapid iteration during development and visibility in production.
Role
AI Engineer on a team improving LogicMonitor’s Edwin AI agent.
Challenge
LogicMonitor had developed an experimental observability AI agent and engaged outside engineering help to take it to production for their users.
Contributions
- Evaluation Framework: Designed and implemented dev-time and production-time evaluation pipelines using Promptfoo and LangFuse
- Evaluation Types: Created tool-choice, tool-correctness, and LLM-as-a-judge evaluations to assess agent performance end to end (see the first sketch after this list)
- Model Sweep Architecture: Built infrastructure for systematic model comparison, giving the client performance metrics to guide model selection decisions (sweep sketch below)
- Real-time Observability: Instrumented the agent so client SMEs could watch agent interactions in real time and spot problems before customers reported them (tracing sketch below)
- Rapid Iteration: The evaluation harness let the team evaluate every change before pushing any code
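The three evaluation types can be illustrated with a minimal Python sketch. The `ToolCall`/`EvalCase` shapes, grading prompt, and default judge model below are illustrative assumptions, not the production Promptfoo suite; they only show what each check measures.

```python
# Sketch of the three evaluation types, assuming a hypothetical ToolCall
# result shape and an illustrative grading prompt; not the production suite.
import json
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI()


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class EvalCase:
    question: str
    expected_tool: str
    expected_args: dict = field(default_factory=dict)


def eval_tool_choice(case: EvalCase, call: ToolCall) -> bool:
    """Tool-choice eval: did the agent pick the expected tool?"""
    return call.name == case.expected_tool


def eval_tool_correctness(case: EvalCase, call: ToolCall) -> bool:
    """Tool-correctness eval: right tool AND the key arguments we expect."""
    return eval_tool_choice(case, call) and all(
        call.arguments.get(k) == v for k, v in case.expected_args.items()
    )


JUDGE_PROMPT = (
    "You are grading an AI monitoring assistant.\n"
    "Question: {question}\nAnswer: {answer}\n"
    'Reply with JSON like {{"score": 0.0, "reason": "..."}} where score is '
    "between 0 and 1 for factual accuracy and helpfulness."
)


def llm_as_judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """LLM-as-a-judge eval: a strong model grades the agent's final answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```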
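The model sweep followed the same pattern: run one eval set against every candidate model and compare aggregate scores. A sketch under the same assumptions, with a caller-supplied agent runner and scorer rather than the client's actual harness:

```python
# Hypothetical model-sweep harness: the helper callables are placeholders,
# and the candidate model list is supplied by the caller.
from collections import defaultdict
from typing import Callable, Iterable


def model_sweep(
    cases: Iterable,
    candidate_models: list[str],
    run_agent: Callable[[str, str], object],        # (question, model) -> agent output
    score_case: Callable[[object, object], float],  # (case, output) -> score in [0, 1]
) -> dict[str, float]:
    """Run every eval case against every candidate model; return mean scores."""
    scores: dict[str, list[float]] = defaultdict(list)
    for model in candidate_models:
        for case in cases:
            output = run_agent(case.question, model)
            scores[model].append(score_case(case, output))
    return {model: sum(s) / len(s) for model, s in scores.items()}
```

Aggregating per-model scores like this is what turns a sweep into concrete, metrics-backed model recommendations.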
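For the real-time view, each agent turn was traced. The sketch below uses the OpenTelemetry Python API with illustrative span and attribute names; one plausible wiring, given the stack listed under Technologies, is an exporter that forwards these spans to LangFuse, and that configuration is omitted here as deployment-specific.

```python
# Sketch of per-turn tracing with OpenTelemetry. Span/attribute names are
# illustrative; exporter setup (e.g., forwarding spans to LangFuse) lives in
# deployment configuration, not here.
from opentelemetry import trace

tracer = trace.get_tracer("edwin.agent")


def call_model(question: str, model: str) -> str:
    # Placeholder for the real agent/LLM call.
    return f"(answer to {question!r} from {model})"


def answer_question(question: str, model: str) -> str:
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("agent.question", question)
        answer = call_model(question, model)
        span.set_attribute("agent.answer_chars", len(answer))
        return answer
```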
Results
Delivered a complete evaluation and observability stack that transformed how the team developed and monitored their AI agent, with concrete model recommendations backed by real performance metrics.
Technologies
- Promptfoo (evaluation framework)
- LangFuse (observability)
- OpenTelemetry
- Python
- OpenAI