Long words (>500 characters) provide less useful context to models.
Dropping very long words allow models to create better embeddings by
passing more of the useful context from the entry to the model
- Remove property drawer from test entry for max_words splitting test
- Property drawer is not required for the test
- Keep minimal test case to reduce chance for confusion
- Issue
ML Models truncate entries exceeding some max token limit.
This lowers the quality of search results
- Fix
Split entries by max tokens before indexing.
This should improve searching for content in longer entries.
- Miscellaneous
- Test method to split entries by max tokens
- Reason
- All clients that currently consume the API are part of Khoj
- Any breaking API changes will be fixed in clients immediately
- So decoupling client from API is not required
- This removes the burden of maintaining muliple versions of the API
- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given it's usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries' class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries' objects
instead of as dictionaries
- `text_search' and `text_to_jsonl' methods are annotated with
type hints for the new `Entries' type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update' method into `TextToJsonl' base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl' classes
- Avoid circular reference issues on importing `Entries' class
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl; scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl; processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
- For queries with only filters in them short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
generated by a separate server/app instance
- Update existings code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves#84
- Issue
- Indent regex was previously catching escape sequences like newlines
- This was resulting in entries with only escape sequences in body to
be prepended to property drawers etc during rendering
- Fix
- Update indent regex to only look for spaces in each line
- Only render body when body contains non-escape characters
- Create test to prevent this regression from silently resurfacing
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
- Previously updates to index required explicitly setting `regenerate=True`
- Now incremental update check made everytime on `text_search.setup` now
- Test if index automatically updates when call `text_search.setup`
with new content even with `regenerate=False`
- It's more of a hassle to not let word filter go stale on entry
updates
- Generating index on 120K lines of notes takes 1s. Loading from file
takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
- Having Tags as sets was returning them in a different order
everytime
- This resulted in spuriously identifying existing entries as new
because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
entries for incremental update correctly
- What
- Hash the entries and compare to find new/updated entries
- Reuse embeddings encoded for existing entries
- Only encode embeddings for updated or new entries
- Merge the existing and new entries and embeddings to get the updated
entries, embeddings
- Why
- Given most note text entries are expected to be unchanged
across time. Reusing their earlier encoded embeddings should
significantly speed up embeddings updates
- Previously we were regenerating embeddings for all entries,
even if they had existed in previous runs
- Remove unused model_dir pytest fixture. It was only being used by
the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
Downloading search models for each pytest sessions seems excessive and
slows down tests quite a bit
- Parsed `level` argument passed to OrgNode during init is expected to
be a string, not an integer
- This was resulting in app failure only when parsing org files with
no headings, like in issue #83, as level is set to string of `*`s
the moment a heading is found in the current file
- Previously we were failing if no valid entries while computing
embeddings. This was obscuring the actual issue of no valid entries
found in the specified content files
- Throwing an exception early with clear message when no entries found
should make clarify the issue to be fixed
- See issue #83 for details
- Filter entries, embeddings by ids satisfying all filters in query
func, after each filter has returned entry ids satisfying their
individual acceptance criteria
- Previously each filter would return a filtered list of entries.
Each filter would be applied on entries filtered by previous filters.
This made the filtering order dependent
- Benefits
- Filters can be applied independent of their order of execution
- Precomputed indexes for each filter is not in danger of running
into index out of bound errors, as filters run on original entries
instead of on entries filtered by filters that have run before it
- Extract entries satisfying filter only once instead of doing
this for each filter
- Costs
- Each filter has to process all entries even if previous filters
may have already marked them as non-satisfactory
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
entry in text_search as all text search content types do not have
file being set in jsonl converters
- Specify just file name to get all notes associated with file at path
- E.g `query` with `file:"file1.org"` will return `entry1`
if `entry1` is in `file1.org` at `~/notes/file.org`
- Test
- Test converting simple file name filter to regex for path match
- Test file filter with space in file name
- Do not run the more expensive explicit filter until the word to be
filtered is completed by user. This requires an end sequence marker
to identify end of explicit word filter to trigger filtering
- Space isn't a good enough delimiter as the explicit filter could be
at the end of the query in which case no space
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
required by looking for ML compute device in global state
Conflicts:
- src/main.py
- router functions have moved to router
- move logic to handle null query perf timer variables into
router.py
- set main.py to current branch, not master
- Test invalid config file path throws. Remove redundant cli test
- Simplify cli parser code
- Do not need to explicitly check if args.config_file set.
argparser checks for positional arguments automatically
- Use standard semantics for cli args
- All positional args are required. Non positional args are optional
- Improve command line --help description
- Add custom validator to throw if neither input_filter or
input_<files|directories> are specified
- Set field expecting paths to type Path
- Now that default_config isn't used in code. We can update
fields in rawconfig to specify whether they're required or not.
This lets pydantic validate config file and throw appropriate error
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic,
more easily identifiable with the application
- Update docs, app accordingly
- Improve search speed by ~10x
Tested on corpus of 125K lines, 12.5K entries
- Allow cross-encoder to re-rank results by settings &?r=true when querying /search API
- It's an optional param that default to False
- Earlier all results were re-ranked by cross-encoder
- Making this configurable allows for much faster results, if desired
but for lower accuracy
- Formalize filters into class with can_filter() and filter() methods
- Use can_filter() method to decide whether to apply filter and
create deep copies of entries and embeddings for it
- Improve search speed for queries with no filters
as deep copying entries, embeddings takes the most time
after cross-encodes scoring when calling the /search API
Earlier we would create deep copies of entries, embeddings
even if the query did not contain any filter keywords
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
- While it's true those strings are going to be used to generated
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_use
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Fix date_filter date_in_entry within query range check
- Extracted_date_range is in [included_date, excluded_date) format
- But check was checking for date_in_entry <= excluded_date
- Fixed it to do date_in_entry < excluded_date
- Fix removal of date filter from query
- Add tests for date_filter
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
dates in the future but dateparser's parse doesn't parse it at all.
E.g parse('5 months from now') returns nothing
- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
parse('5 months') to dateparser.parse works as expected
- test_regenerate_with_valid_content failed when run after test_asymmetric_search
- test_asymmetric_search did't clean the temporary update to config it had made
- This was resulting in regenerate looking for a file that didn't exist
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
- Notes are being processed on a different machine and used on a different machine
- But the notes directory is in the same location relative to home on both the machines
- Put test data for each content type into separate directories
- Makes config.yml for docker and local host consistent
- Prepending tests to /data in sample_config.yml makes application
run on local host using test data
- Allows mounting separate volume for each content type in docker-compose
- Ignore gitignore to only add tests content, not generated models or embeddings
- Rename pytest fixture search_config to more appropriate
content_config
- Create search_config pytest fixture
- Use search_config where search being setup, used in tests
- Allow conversing with user using GPT's contextually aware, generative capability
- Extract metadata, user intent from user's messages using GPT's general understanding
- Move search config fixture to conftests.py to be shared across tests
- Move image search type specific tests to test_image_search.py file
- Move, create asymmetric search type specific tests in new file