- Add custom validator to throw if neither input_filter or
input_<files|directories> are specified
- Set field expecting paths to type Path
- Now that default_config isn't used in code. We can update
fields in rawconfig to specify whether they're required or not.
This lets pydantic validate config file and throw appropriate error
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic,
more easily identifiable with the application
- Update docs, app accordingly
- Improve search speed by ~10x
Tested on corpus of 125K lines, 12.5K entries
- Allow cross-encoder to re-rank results by settings &?r=true when querying /search API
- It's an optional param that default to False
- Earlier all results were re-ranked by cross-encoder
- Making this configurable allows for much faster results, if desired
but for lower accuracy
- Formalize filters into class with can_filter() and filter() methods
- Use can_filter() method to decide whether to apply filter and
create deep copies of entries and embeddings for it
- Improve search speed for queries with no filters
as deep copying entries, embeddings takes the most time
after cross-encodes scoring when calling the /search API
Earlier we would create deep copies of entries, embeddings
even if the query did not contain any filter keywords
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
- While it's true those strings are going to be used to generated
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_use
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Fix date_filter date_in_entry within query range check
- Extracted_date_range is in [included_date, excluded_date) format
- But check was checking for date_in_entry <= excluded_date
- Fixed it to do date_in_entry < excluded_date
- Fix removal of date filter from query
- Add tests for date_filter
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
dates in the future but dateparser's parse doesn't parse it at all.
E.g parse('5 months from now') returns nothing
- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
parse('5 months') to dateparser.parse works as expected
- test_regenerate_with_valid_content failed when run after test_asymmetric_search
- test_asymmetric_search did't clean the temporary update to config it had made
- This was resulting in regenerate looking for a file that didn't exist
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
- Notes are being processed on a different machine and used on a different machine
- But the notes directory is in the same location relative to home on both the machines
- Put test data for each content type into separate directories
- Makes config.yml for docker and local host consistent
- Prepending tests to /data in sample_config.yml makes application
run on local host using test data
- Allows mounting separate volume for each content type in docker-compose
- Ignore gitignore to only add tests content, not generated models or embeddings
- Rename pytest fixture search_config to more appropriate
content_config
- Create search_config pytest fixture
- Use search_config where search being setup, used in tests
- Allow conversing with user using GPT's contextually aware, generative capability
- Extract metadata, user intent from user's messages using GPT's general understanding
- Move search config fixture to conftests.py to be shared across tests
- Move image search type specific tests to test_image_search.py file
- Move, create asymmetric search type specific tests in new file