sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-12 08:43:03 +01:00

Author	SHA1	Message	Date
Debanjum	9fc44f1a7f	Enable evaluating Khoj on the Talc Search Bench using the eval script - Just load the raw jsonl from Github and normalize it into FRAMES format - Color printed accuracy in eval script to blue for readability	2024-11-13 22:50:14 -08:00
Debanjum	f4e37209a2	Improve error handling, display and configurability of eval script - Default to evaluation decision of None when either agent or evaluator llm fails. This fixes accuracy calculations on errors - Fix showing color for decision True - Enable arg flags to specify output results file paths	2024-11-13 14:32:22 -08:00
Debanjum	f967bdf702	Show correct example index being currently processed in frames eval. Previously the batch start index wasn't being passed, so all batches started in parallel were showing the same processing example index. This change doesn't impact the evaluation itself, just the index shown of the example currently being evaluated.	2024-11-10 14:49:51 -08:00
Debanjum	84a8088c2b	Only evaluate non-empty responses to reduce eval script latency and cost. Empty responses by Khoj will always be an incorrect response, so there is no need to make a call to an evaluator agent to check that.	2024-11-10 14:49:51 -08:00
Debanjum	1ccbf72752	Use logger instead of print to track eval	2024-11-04 00:40:26 -08:00
Debanjum	791eb205f6	Run prompt batches in parallel for faster eval runs	2024-11-02 04:58:03 -07:00
Debanjum	96904e0769	Add script to evaluate khoj on Google's FRAMES benchmark. Google's FRAMES benchmark evaluates multi-step retrieval and reasoning capabilities of an agent. The script uses Gemini as an LLM Judge to evaluate Khoj responses to the FRAMES benchmark prompts against the ground truth provided by it.	2024-11-02 04:57:42 -08:00

7 commits
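
Commits 84a8088c2b and f4e37209a2 describe two decision rules: an empty response is marked incorrect without calling the evaluator, and a failure of the agent or the evaluator yields a decision of None. The sketch below shows one possible shape of that logic. It is not the code from the repository. The names evaluate_response, accuracy and judge are hypothetical, representing an agent failure as a response of None is an assumption, and leaving None decisions out of the accuracy denominator is also an assumption, since the commit messages do not say how None is counted.

```python
from typing import Callable, List, Optional


def evaluate_response(
    response: Optional[str], judge: Callable[[str], bool]
) -> Optional[bool]:
    """Return True/False for a judged response, or None when no decision exists.

    Assumption: response is None when the agent itself failed.
    """
    if response is None:
        # Agent failed, so there is nothing to judge
        return None
    if not response.strip():
        # Empty responses are always incorrect, so skip the evaluator call
        return False
    try:
        return judge(response)
    except Exception:
        # Evaluator failed, so record no decision instead of a wrong one
        return None


def accuracy(decisions: List[Optional[bool]]) -> Optional[float]:
    """Fraction of True among decided examples (assumption: None is excluded)."""
    decided = [d for d in decisions if d is not None]
    if not decided:
        return None
    return sum(decided) / len(decided)
```

With this split, a failed evaluator call no longer counts as a wrong answer, and empty responses cost no evaluator call at all.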