Commit graph

12 commits

Author SHA1 Message Date
Debanjum
2db7a1ca6b Restart code sandbox on crash in eval GitHub workflow (#1007)
See e3fed3750b for the corresponding change to use pm2 to auto-restart the code sandbox
2024-12-12 14:32:03 -08:00
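
For illustration, a minimal sketch of a workflow step that keeps the sandbox alive with pm2. The step name, checkout directory, and npm script name are assumptions, not taken from the actual eval workflow:

    # Hypothetical step; directory and npm script name are assumptions.
    - name: Start terrarium sandbox under pm2
      run: |
        npm install -g pm2
        cd terrarium
        # pm2 relaunches the server automatically if it crashes
        pm2 start npm --name terrarium -- run dev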
Debanjum
9eb863e964 Restart code sandbox on crash in eval GitHub workflow 2024-12-12 11:28:54 -08:00
sabaimran
9c403d24e1 Fix reference to directory in the eval workflow for starting terrarium 2024-12-08 13:03:05 -08:00
sabaimran
6940c6379b Add sudo when running installations in order to install relevant packages
Add --legacy-peer-deps temporarily to see if it helps mitigate the issue
2024-12-08 11:11:13 -08:00
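
For illustration, a minimal sketch of an install step under those assumptions. The fork URL is inferred from the commit below it; the exact commands in the workflow may differ:

    # Hypothetical step; the fork URL is an assumption.
    - name: Install sandbox dependencies
      run: |
        git clone https://github.com/khoj-ai/terrarium.git
        cd terrarium
        sudo npm install -g npm  # sudo for global installs on the runner
        # --legacy-peer-deps relaxes npm's strict peer-dependency resolution
        npm install --legacy-peer-deps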
sabaimran
4c4b7120c6 Use the Khoj terrarium fork instead of building from the official Cohere repo 2024-12-08 11:06:33 -08:00
Debanjum
29e801c381 Add MATH500 dataset to eval
Evaluate the simpler MATH500 responses with Gemini 1.5 Flash.

This improves the speed and reduces the cost of running this eval.
2024-11-28 12:48:25 -08:00
Debanjum
22aef9bf53 Add GPQA (Diamond) dataset to eval 2024-11-28 12:48:25 -08:00
Debanjum
8dd2122817 Set sample size to 200 for automated eval runs as well 2024-11-23 14:48:38 -08:00
Debanjum
50d8405981 Enable Khoj to use the terrarium code sandbox as a tool in the eval workflow 2024-11-20 14:19:27 -08:00
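
For illustration, a hypothetical sketch of pointing Khoj at a locally running sandbox. The environment variable name and port are assumptions, not confirmed from the Khoj codebase:

    # Hypothetical wiring; KHOJ_TERRARIUM_URL and port 8080 are assumptions.
    - name: Point Khoj at the local code sandbox
      run: echo "KHOJ_TERRARIUM_URL=http://localhost:8080" >> "$GITHUB_ENV"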
Debanjum
ffbd0ae3a5 Fix eval GitHub workflow to run on releases, i.e. on tag pushes 2024-11-20 12:57:42 -08:00
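
For illustration, a minimal sketch of such a tag-push trigger; the actual tag pattern in the eval workflow may differ:

    # Hypothetical trigger block; the tag pattern is an assumption.
    on:
      push:
        tags:
          - "*"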
Debanjum
a2ccf6f59f Fix GitHub workflow to start Khoj, connect to Postgres and upload results
- Do not trigger tests to run in CI on updates to evals
2024-11-18 04:25:15 -08:00
Debanjum
7c0fd71bfd Add GitHub workflow to quiz Khoj across modes and specified evals (#982)
- Evaluate Khoj on 200 random questions from each of the Google FRAMES and OpenAI SimpleQA benchmarks across *general*, *default* and *research* modes
- Run eval with Gemini 1.5 Flash as the test giver and Gemini 1.5 Pro as the test evaluator model
- Trigger eval workflow on release or manually
- Make dataset, Khoj mode and sample size configurable when triggered via the manual workflow (see the sketch after this entry)
- Enable web search and webpage read tools during evaluation
2024-11-18 02:19:30 -08:00
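
For illustration, a hypothetical sketch of that release-or-manual trigger. The input names and defaults are assumptions based on the bullet points above, not the actual workflow's:

    # Hypothetical trigger; input names and defaults are illustrative.
    on:
      release:
        types: [published]
      workflow_dispatch:
        inputs:
          dataset:
            description: Benchmark to evaluate (e.g. frames, simpleqa)
            default: frames
          khoj_mode:
            description: Khoj mode to test (general, default or research)
            default: default
          sample_size:
            description: Number of questions to sample
            default: "200"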