An empty response from Khoj is always incorrect, so there is no need to call the evaluator agent to check it.
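The short-circuit described above could look roughly like the sketch below. This is a minimal illustration, not the script's actual code: the function name `evaluate_response`, the `judge` callable, and the shape of the returned dictionary are all hypothetical.

```python
from typing import Callable, Dict


def evaluate_response(
    response: str,
    ground_truth: str,
    judge: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Grade one response, skipping the judge when the response is empty.

    `judge` is a hypothetical stand-in for the LLM evaluator call. It takes
    (response, ground_truth) and returns True if the response is correct.
    """
    # An empty or whitespace-only response can never match the ground truth,
    # so mark it incorrect without spending an evaluator call on it.
    if not response or not response.strip():
        return {
            "decision": False,
            "explanation": "Agent response is empty.",
            "judge_called": False,
        }

    # Otherwise defer to the evaluator.
    return {
        "decision": judge(response, ground_truth),
        "explanation": "Decided by the evaluator.",
        "judge_called": True,
    }
```

Passing the evaluator in as a callable keeps the early return easy to test with a stub, without making a network call.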
Google's FRAMES benchmark evaluates the multi-step retrieval and reasoning capabilities of an agent. The script uses Gemini as an LLM judge to evaluate Khoj's responses to the FRAMES benchmark prompts against the ground truth that the benchmark provides.