- Khoj supports indexing subdirectories, but the Khoj docker config
wasn't updated to do the same
- This should also allow Khoj docker users to index multiple separate
directory trees by mounting them into separate subfolders within
/data/<content-type>/.
E.g /data/org/dir1, /data/org/dir2 etc. in khoj_docker.yml, as
sketched below
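A minimal sketch of what this enables. The host paths and compose
service name are hypothetical, and the config keys are assumptions
based on the sample configs; the /data mount points follow
khoj_docker.yml:

  # docker-compose.yml (sketch): mount separate trees under /data/org/
  services:
    server:
      volumes:
        - /home/user/notes:/data/org/dir1
        - /home/user/journal:/data/org/dir2

  # khoj_docker.yml (sketch): recursive glob picks up both subfolders
  content-type:
    org:
      input-filter: ["/data/org/**/*.org"]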
- Default to using `text-davinci-003' if the conversation model isn't
explicitly configured by the user. Stop using the older `davinci' and
`davinci-instruct' models
- Use `model' instead of `engine' as the parameter, as shown below.
Usage of the `engine' parameter in the OpenAI API is deprecated
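A minimal sketch of the updated call against the legacy openai Python
client of that era; the API key and prompt are placeholders:

  import openai

  openai.api_key = "sk-..."  # placeholder

  # Default used when the user hasn't configured a conversation model
  chat_model = "text-davinci-003"

  response = openai.Completion.create(
      model=chat_model,  # previously passed as engine=
      prompt="Summarize my notes on quantum computing",
      max_tokens=200,
  )
  print(response["choices"][0]["text"])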
- The CLIP image score and XMP metadata score are not combining well.
When combined they give nonsensical results. Enable only once we
figure out how best to combine the two.
- Show scores with higher precision for image search
- Image search scores seem to mostly fall between 0.2 and 0.3 for some reason
- Higher precision scores make it easier to understand the quality
of returned results as perceived by the model itself; see the sketch below
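A hypothetical sketch of the change in display precision; the
filenames and scores are made up:

  results = [("beach.jpg", 0.28731), ("sunset.jpg", 0.23142)]
  for image, score in results:
      # More decimal places distinguish closely clustered scores that
      # coarser rounding would collapse together
      print(f"{image}: {score:.5f}")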
- Reason
- Simplifies code. No merge_dict required
- One place for the user to see all configurables, defaults and
required values
- Details
- Remove default_config from code. Set defaults in khoj_sample.yml
itself. See the sketch after this list
- Keep fields that the user is required to set empty in khoj_sample.yml
- Set defaults for fields not requiring configuration by the user
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic and
more easily identifiable with the application
- Update docs and app accordingly
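A sketch of the resulting khoj_sample.yml; the exact keys and default
values are assumptions. The point is that defaults live in the sample
file while required fields stay empty:

  # khoj_sample.yml (sketch)
  content-type:
    org:
      input-files:                          # required, left empty for the user to fill
      compressed-jsonl: .notes.jsonl.gz     # default, no user action needed
      embeddings-file: .note_embeddings.pt  # default, no user action needed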
- The all-MiniLM-L6-v2 model is more accurate
- The exact previous model isn't benchmarked, but going by the
performance of the closest benchmarked model to it, the new model
seems similar in speed and size
- On very preliminary evaluation, the new model seems faster, with
pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 model is more extensively benchmarked[1]
- It has the right mix of query speed, model size and performance on benchmarks
- On Hugging Face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (indexing time drops
from ~8 min to ~4 min)
- It returns more entries that stay relevant to the query (3/5 vs 1/5
earlier); see the sketch below
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
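A minimal sketch of swapping in the new bi-encoder via the
sentence-transformers library; the entries and query are made up:

  from sentence_transformers import SentenceTransformer, util

  # Switch the bi-encoder from the msmarco model to the multi-qa model
  encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

  entries = ["Notes on conda environments", "Recipe for dal makhani"]
  corpus_embeddings = encoder.encode(entries, convert_to_tensor=True)
  query_embedding = encoder.encode("how to set up conda", convert_to_tensor=True)

  # The model produces normalized embeddings, so cosine similarity applies
  hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
  print(hits)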
Conda doesn't support using the same environment file across platforms.
We were able to get away with this till now by manually curating the
conda environment.yml.
But it's more robust to just add a conda environment YAML file for each
platform as necessary
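A sketch of what a per-platform file could look like; the filename and
pinned dependencies are hypothetical:

  # config/environment-linux.yml (hypothetical): one file per platform
  name: khoj
  channels:
    - conda-forge
  dependencies:
    - python=3.10
    - pip
    - pip:
        - torch                    # resolves to platform-specific wheels
        - sentence-transformers

Each is created on its platform with e.g. `conda env create -f
config/environment-linux.yml'.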
- Keeps directory paths consistent between host and container volumes
- Consistency simplifies the documentation and the updates required to
set up sample_config.yml for a local installation