Many agent benchmarks focus on whether the final task was completed. That is useful, but it is not enough for reliable deployment. An agent can arrive at a correct-looking answer through an unsafe tool call, a lucky retrieval result, or a reasoning path that cannot recover when the environment changes.
Better evaluation should include the trace: what the agent observed, which tools it selected, whether it verified evidence, how it handled missing information, and whether it stopped when it should. These signals help separate robust behavior from accidental success.
For practical systems, trace-level evaluation also improves engineering. It gives developers a way to identify brittle prompts, weak retrieval, missing guardrails, and unclear fallback routes.