Commit graph

7 commits

Author SHA1 Message Date
Debanjum
29e801c381 Add MATH500 dataset to eval
Evaluate simpler MATH500 responses with Gemini 1.5 Flash

This makes the eval both faster and cheaper to run
2024-11-28 12:48:25 -08:00
Debanjum
22aef9bf53 Add GPQA (diamond) dataset to eval 2024-11-28 12:48:25 -08:00
Debanjum
8dd2122817 Set sample size to 200 for automated eval runs as well 2024-11-23 14:48:38 -08:00
Debanjum
50d8405981 Enable Khoj to use the Terrarium code sandbox as a tool in eval workflow 2024-11-20 14:19:27 -08:00
Debanjum
ffbd0ae3a5 Fix eval GitHub workflow to run on releases, i.e. on tag push 2024-11-20 12:57:42 -08:00
Debanjum
a2ccf6f59f Fix GitHub workflow to start Khoj, connect to PG and upload results
- Do not trigger CI test runs on updates to evals
2024-11-18 04:25:15 -08:00
Debanjum
7c0fd71bfd Add GitHub workflow to quiz Khoj across modes and specified evals (#982)
- Evaluate Khoj on 200 random questions from each of the Google FRAMES and OpenAI SimpleQA benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as test giver and Gemini 1.5 Pro as test evaluator models
- Trigger eval workflow on release or manually
- Make dataset, khoj mode and sample size configurable when triggered via manual workflow
- Enable web search and webpage read tools during evaluation
2024-11-18 02:19:30 -08:00
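
The sampling described in 7c0fd71bfd (a random 200 questions per benchmark, run across each Khoj mode) could be sketched roughly as below. This is only an illustration and is not taken from the repository: the function names `sample_questions` and `build_eval_plan`, the `seed` parameter, and the in-memory dataset layout are all hypothetical.

```python
import random

# Hypothetical defaults mirroring the commit message: 200 questions, three modes.
DEFAULT_SAMPLE_SIZE = 200
DEFAULT_MODES = ("general", "default", "research")


def sample_questions(questions, sample_size=DEFAULT_SAMPLE_SIZE, seed=None):
    """Return up to `sample_size` randomly chosen questions without replacement."""
    questions = list(questions)
    if sample_size >= len(questions):
        # Dataset is smaller than the requested sample: use all of it.
        return questions
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(questions, sample_size)


def build_eval_plan(datasets, modes=DEFAULT_MODES,
                    sample_size=DEFAULT_SAMPLE_SIZE, seed=42):
    """Pair every sampled question with every mode for each dataset.

    `datasets` maps a dataset name to a list of questions (assumed layout).
    The same sampled subset is reused across modes so results are comparable.
    """
    plan = []
    for name, questions in datasets.items():
        subset = sample_questions(questions, sample_size, seed)
        for mode in modes:
            plan.extend((name, mode, question) for question in subset)
    return plan
```

Reusing one sampled subset for all modes is a design choice of this sketch, made so that mode-to-mode score differences are not confounded by different question sets. The actual workflow may sample differently.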