- Index markdown test data as knowledge base. As easier to get good
markdown content (vs org)
- Setup markdown_content_config, processor_config and chat_client to
test chat API
- Why
The khoj pypi packages should be installed in `khoj' directory.
Previously it was being installed into `src' directory, which is a
generic top level directory name that is discouraged from being used
- Changes
- move src/* to src/khoj/*
- update `setup.py' to `find_packages' in `src' instead of project root
- rename imports to form `from khoj.*' in complete project
- update `constants.web_directory' path to use `khoj' directory
- rename root logger to `khoj' in `main.py'
- fix image_search tests to use the newly rename `khoj' logger
- update config, docs, workflows to reference new path `src/khoj'
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl; scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl; processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Update existings code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves#84
- It's more of a hassle to not let word filter go stale on entry
updates
- Generating index on 120K lines of notes takes 1s. Loading from file
takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
- Remove unused model_dir pytest fixture. It was only being used by
the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
Downloading search models for each pytest sessions seems excessive and
slows down tests quite a bit
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
required by looking for ML compute device in global state
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Put test data for each content type into separate directories
- Makes config.yml for docker and local host consistent
- Prepending tests to /data in sample_config.yml makes application
run on local host using test data
- Allows mounting separate volume for each content type in docker-compose
- Ignore gitignore to only add tests content, not generated models or embeddings
- Rename pytest fixture search_config to more appropriate
content_config
- Create search_config pytest fixture
- Use search_config where search being setup, used in tests
- Move search config fixture to conftests.py to be shared across tests
- Move image search type specific tests to test_image_search.py file
- Move, create asymmetric search type specific tests in new file