zachderhake 2 minutes ago

This solves a real pain point. At the last startup I worked at, setting up mock Linear/Slack/Notion accounts for agent evals was surprisingly annoying: cleaning up test data, writing a bunch of verifiers, and dealing with rate limits. Being able to spin up isolated fake services with the API perfectly imitated means you can just run evals without any of that setup overhead. Pretty neat approach.

hubertmarek an hour ago

Hi HN, I noticed it is almost impossible to run evals or train models on 3rd party integrations, so I built interactive environments for them. Feedback is more than welcome. Thanks!
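To make the idea concrete, here is a rough sketch of what an eval loop over one of these isolated environments could look like. Everything below (the environment endpoints, response shape, agent object, and verifier) is hypothetical and only meant to illustrate the spin-up / run-agent / verify-state pattern, not the actual API of this project:

    import requests

    def run_task(task, agent, base_url="http://localhost:8080"):
        # Hypothetical: spin up a fresh, isolated fake Linear instance for this task.
        env = requests.post(f"{base_url}/environments", json={"service": "linear"}).json()
        env_url = env["url"]  # the agent talks to this URL as if it were the real API

        # Let the agent work against the imitated API; no real account, no rate limits.
        agent.run(prompt=task["prompt"], api_base=env_url)

        # Verify by inspecting the environment's final state instead of parsing agent output.
        state = requests.get(f"{env_url}/state").json()
        passed = task["check"](state)

        # Tear down; nothing to clean up in a real workspace.
        requests.delete(f"{base_url}/environments/{env['id']}")
        return passed

    # Example task: did the agent create an issue with the requested title?
    task = {
        "prompt": "Create a Linear issue titled 'Fix login bug' in the Backend team.",
        "check": lambda s: any(i["title"] == "Fix login bug" for i in s.get("issues", [])),
    }

The nice part of state-based verification is that the checker never depends on how the agent phrased its answer, only on what actually changed in the environment.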

Interesting finding: on an eval of 40 tasks against the Linear API, most frontier models scored surprisingly well:

- Claude Opus 4.5: 95% (38/40)
- GLM 4.6: 87.5% (35/40)
- Claude Sonnet 4.5: 85% (34/40)
- Claude Haiku 4.5: 82.5% (33/40)
- Kimi K2: 82.5% (33/40)
- Grok 4.1 Fast: 80% (32/40)
- GPT 5.1: 77.5% (31/40)

This makes me wonder whether we really need to reinvent the wheel and build special interfaces (MCP servers) for agents interacting with services, when they can just use the APIs as they are.
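For what it's worth, "use the API as it is" can be as simple as handing the model one generic HTTP tool plus the service's docs in context. A minimal sketch, assuming a generic tool-calling setup (the tool schema and dispatcher below are illustrative, not tied to any particular framework); Linear's real API is GraphQL at https://api.linear.app/graphql:

    import requests

    # One generic tool instead of a bespoke MCP server per service.
    HTTP_TOOL = {
        "name": "http_request",
        "description": "Make an HTTP request to a third-party API (e.g. Linear's GraphQL endpoint).",
        "input_schema": {
            "type": "object",
            "properties": {
                "method": {"type": "string"},
                "url": {"type": "string"},
                "headers": {"type": "object"},
                "body": {"type": "object"},
            },
            "required": ["method", "url"],
        },
    }

    def http_request(method, url, headers=None, body=None):
        # The model decides the endpoint and payload from the API docs in its context.
        resp = requests.request(method, url, headers=headers or {}, json=body)
        return {"status": resp.status_code, "body": resp.text[:4000]}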