Commit graph

13 commits

Author SHA1 Message Date
Debanjum Singh Solanky
a4eb55dd00 Rename khoj config yml file to follow more specific khoj*.yml pattern
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic,
  more easily identifiable with the application
- Update docs, app accordingly
2022-08-03 12:06:55 +03:00
Debanjum Singh Solanky
1f4b5ac112 Create test markdown files. Use them in sample config, docker-compose 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
732b2d287f Give the project a short, less generic name. Rename it to Khoj
- Semantic Search was just a placeholder name used to test the idea out.
  Didn't want to get into naming at that point in time
2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 model is more accurate
  - The exact previous model isn't benchmarked, so this is based on the
    performance of the closest benchmarked model to it. The new model
    seems like it may be similar in speed and size

- In a very preliminary evaluation, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
4a90972e38 Use a better model for asymmetric semantic search
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On Hugging Face it has far more downloads and likes than the msmarco model[2]
- In a very preliminary evaluation of the model
  - It doubles the encoding speed of all entries (down from ~8 min to 4 min)
  - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)

[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
50658453cd Add separate conda environment.yml for osx-arm64
Conda doesn't support using the same environment across platforms.
We were able to get away with this until now because the conda
environment.yml was set up manually.
But it's more robust to add a conda environment YAML file for each
platform as necessary.
2022-07-14 23:16:49 +04:00
Debanjum Singh Solanky
e96253a7c1 Add dateparser library to conda environment YAML 2022-07-14 22:29:07 +04:00
Saba
07a56c4ab6 Add specific version for Python packages and downgrade miniconda Docker image to potentially fix build issues 2022-07-04 18:01:55 -04:00
Saba
092d0f2f21 Move Dockerfile to project root to avoid permissions issues. Allocate more memory to docker-compose to avoid OOM 2022-07-04 12:33:55 -04:00
Debanjum Singh Solanky
78b76d65a0 Minor fix to notes jsonl file extension in sample_config.yml 2022-01-29 04:13:36 -05:00
Debanjum Singh Solanky
c31abad0a6 Mount embeddings to /data/embeddings for directory naming consistency
- Keeps directory paths consistent between host and container volumes
- Consistency simplifies documentation and the updates required to set up
  sample_config.yml for local installation
2022-01-29 03:24:02 -05:00
Debanjum Singh Solanky
b0067fc32e Store docker, conda, semantic-search configuration in a config directory
- Improves organization of config files required for application
- Declutters the application root directory from configs
2022-01-29 02:41:11 -05:00