- Default the evaluation decision to None when either the agent or the
  evaluator LLM fails. This fixes accuracy calculations on errors (see
  the first sketch after this list)
- Fix the color shown for a True decision
- Add CLI flags to specify the output results file paths (see the
  second sketch after this list)
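
A minimal sketch of the None-default behavior, assuming the eval loop
records one decision per example; the function and variable names here
are illustrative, not the actual script's API:

```python
from typing import Optional

def decide(agent_response: Optional[str], judge_verdict: Optional[bool]) -> Optional[bool]:
    """Return the judge's decision, or None when either side failed."""
    if agent_response is None or judge_verdict is None:
        return None  # error case: excluded from accuracy below
    return judge_verdict

def accuracy(decisions: list[Optional[bool]]) -> float:
    """Compute accuracy over successfully graded examples only."""
    graded = [d for d in decisions if d is not None]
    return sum(graded) / len(graded) if graded else 0.0
```

Keeping errored examples as None rather than False means a failed API
call no longer drags the accuracy number down.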
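
A sketch of the output-path flags using argparse; the flag names and
defaults here are hypothetical and may differ from the actual script:

```python
import argparse

parser = argparse.ArgumentParser(description="Run the FRAMES eval against Khoj")
parser.add_argument("--output-results-file", default="results.csv",
                    help="Path to write per-example evaluation results")
parser.add_argument("--output-summary-file", default="summary.json",
                    help="Path to write aggregate accuracy metrics")
args = parser.parse_args()
```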
Previously the batch start index wasn't being passed, so all batches
started in parallel showed the same processing-example index. This
change doesn't affect the evaluation itself, only the index shown for
the example currently being evaluated.
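
An illustrative sketch of the fix, assuming batches run concurrently
via a thread pool; the names and concurrency mechanism are assumptions,
not the script's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch, batch_start: int):
    for offset, example in enumerate(batch):
        # Without batch_start, every parallel batch would report 0, 1, ...
        print(f"Processing example {batch_start + offset}")
        # ... evaluate example ...

def run_batches(dataset, batch_size: int = 10):
    batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
    with ThreadPoolExecutor() as pool:
        for i, batch in enumerate(batches):
            # Pass the batch's global start index so logs show the true position
            pool.submit(process_batch, batch, i * batch_size)
```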
Google's FRAMES benchmark evaluates the multi-step retrieval and
reasoning capabilities of an agent.

The script uses Gemini as an LLM judge to evaluate Khoj's responses to
the FRAMES benchmark prompts against the ground truth the benchmark
provides.
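
A minimal sketch of the judging step using the google-generativeai
client; the model name, prompt wording, and verdict parsing are
assumptions for illustration, not the script's actual logic:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name

def grade(prompt: str, khoj_response: str, ground_truth: str) -> bool | None:
    """Ask Gemini whether the Khoj response matches the ground truth."""
    verdict = judge.generate_content(
        f"Question: {prompt}\n"
        f"Ground truth: {ground_truth}\n"
        f"Candidate answer: {khoj_response}\n"
        "Does the candidate answer match the ground truth? Reply TRUE or FALSE."
    )
    text = verdict.text.strip().upper()
    # An unparseable verdict maps to None, matching the error-handling fix above
    return True if "TRUE" in text else False if "FALSE" in text else None
```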