Commit graph

79 commits

Author SHA1 Message Date
Debanjum Singh Solanky
d40076fcd6 Deduplicate test code, make teardown more robust using pytest fixtures 2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
53cd2e5605 Regenerate initial model in asymmetric reload test to reduce flakyness
- Fix logger message when converting org node to entries
- Remove unused import from conftest
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
2b58218b56 Reuse search models across sessions. Merge unused pytest fixtures
- Remove unused model_dir pytest fixture. It was only being used by
  the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
  Downloading search models for each pytest sessions seems excessive and
  slows down tests quite a bit
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
30c3eb372a Update Tests to Configure Filters and Setup Text Search 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
972523e8a9 Re-enable tests for image search
Verify if recent fixes resolve test flakiness
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
bc423d8f76 Disable image search in tests. Import global state from constants module
- Upstream issues causing load of image search model to fail.
  Disable tests related to image search for now
2022-08-06 02:47:52 +03:00
Debanjum Singh Solanky
4788143aa6 Set clip model name in conftest to sentence-tranformers/clip as well 2022-08-04 22:54:39 +03:00
Debanjum Singh Solanky
f50f343f73 Rename org-mode test data directory to more specific org/ from notes/ 2022-08-04 22:29:57 +03:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types were mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and making it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 is more accurate
  - The exact previous model isn't benchmarked but based on the
    performance of the closest model to it. Seems like the new model
    maybe similar in speed and size

- On very preliminary evaluation of the model, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
4a90972e38 Use a better model for asymmetric semantic search
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
  - It doubles the encoding speed of all entries (down from ~8min to 4mins)
  - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)

[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
eda4b65ddb Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU
- Move embeddings to CUDA GPU for compute, when available
- Normalize embeddings and Use Dot Product instead of Cosine
2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky
79c2224eaa Improve test data organization and update correspoding conftests
- Put test data for each content type into separate directories
- Makes config.yml for docker and local host consistent
  - Prepending tests to /data in sample_config.yml makes application
    run on local host using test data
  - Allows mounting separate volume for each content type in docker-compose
- Ignore gitignore to only add tests content, not generated models or embeddings
2022-01-29 02:03:17 -05:00
Debanjum Singh Solanky
179153dc5a Rename RawConfig Types for Consistency
- Naming convention - [ContentType][ConfigType]Config
  - Where [ConfigType] ~ Content, Search, Processor
  - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation

- Current Configs:
  - Content:
    - Org Notes
    - Org Music
    - Image
    - Ledger/Beancount

  - Search:
     - Asymmetric
     - Symmetric
     - Image

  - Processor:
    - Conversation
2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky
ed144f7984 Setup Search with Search_Config to Fix Tests
- Rename pytest fixture search_config to more appropriate
content_config
- Create search_config pytest fixture
- Use search_config where search being setup, used in tests
2022-01-14 20:13:14 -05:00
Saba
ba8dc9ed5f Update the search_config instantiated for tests in conftest 2021-12-11 14:14:31 -05:00
Saba
d65190c3ee Update unit tests, files with removing model suffix to config types 2021-12-09 08:50:38 -05:00
Saba
76e9e9da2f Update unit tests to use the new BaseModel types 2021-12-05 09:31:39 -05:00
Saba
c4cd4b57f1 Update types used in conftest.py 2021-12-04 12:02:19 -05:00
Debanjum Singh Solanky
7e0d9bafa7 Split test_main into client and search type specific test files
- Move search config fixture to conftests.py to be shared across tests
- Move image search type specific tests to test_image_search.py file
- Move, create asymmetric search type specific tests in new file
2021-10-02 20:28:33 -07:00
Debanjum Singh Solanky
da33e9e743 Create test directory with model data to reuse for pytest session
- Use pytest fixture with session scope
- Instantiate test directory with model data to reuse for tests
2021-10-02 19:46:29 -07:00