- While it's true those strings are going to be used to generated
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
across machines.
- Use case:
A more powerful, always on machine actually computes the image embeddings regularly
The client machine just load these periodically to provide semantic search functionality
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
- return empty strings in such a scenario instead of ". "
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_use
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Avoids having to click the query input box
- Just open page, type whatever and hit enter to do image search
- For other search types select appropriate type from dropdown
- Use shr to render image response from html in result buffer
Earlier was using org-mode. But rendering HTML with shr seems cleaner
- Use Headings to Add highlights
- Use Random to Force fetch of Image. Similar to what was done for Web interface
- Remove trailing elisp brackets from response
- Show query match scores by image model for each image in results
- Metadata match score were consistently giving higher scores by a
factor of ~3x wrt to image match score. This was resulting in all
results being from the metadata match with query and none from the
image match with query.
- Scaling the metadata match scores down by scaling factor seems to
give more consistently give a blend of results from both image and
metadata matches
Adding a random, unused url param at the end of the img.src string
fixes the issue. As the browser thinks it's a new image and doesn't
use the image data that's already cached because of which it wasn't
even making the fetch call for the image
- Allow viewing image results returned by Semantic Search.
Until now there wasn't any interface within the app to view image
search results. For text results, we at least had the emacs interface
- This should help with debugging issues with image search too
For text the Swagger interface was good enough
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
directly in browser
- Return image, metadata scores for each image in response as well
This should help get a better sense of image scores along both
XMP metadata and whole image axis
- With \t Last Word in Headings was suffixed by \t and so couldn't be
filtered by
- User interacts with raw entries, so run explicit filters on raw entry
- For semantic search using the filtered entry is cleaner, still
- Fix date_filter date_in_entry within query range check
- Extracted_date_range is in [included_date, excluded_date) format
- But check was checking for date_in_entry <= excluded_date
- Fixed it to do date_in_entry < excluded_date
- Fix removal of date filter from query
- Add tests for date_filter
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
dates in the future but dateparser's parse doesn't parse it at all.
E.g parse('5 months from now') returns nothing
- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
parse('5 months') to dateparser.parse works as expected
Details
--
- The filters to apply are configured for each type in the search controller
- Muliple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
query, entries and embeddings before semantic search is performed
Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query
Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
- Issue
- Explicit filtering was earlier being done after search by bi-encoder
but before re-ranking by cross-encoder
- This was limiting the quality of results being returned. As the
bi-encoder returned results which were going to be excluded. So the
burden of improving those limited results post filtering was on the
cross-encoder by re-ranking the remaining results based on query
- Fix
- Given the embeddings corresponding to an entry are at the same index
in their respective lists. We can run the filter for blocked,
required words before the search by the bi-encoder model. And limit
entries, embeddings being considered for the current query
- Result
- Semantic search by the bi-encoder gets to return most relevant
results for the query, knowing that the results aren't going to be
filtered out after. So the cross-encoder shoulders less of the
burden of improving results
- Corollary
- This pre-filtering technique allows us to apply other explicit
filters on entries relevant for the current query
- E.g limit search for entries within date/time specified in query
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
- The reload API adds the ability to separate out the loading of
embeddings from file without having to restart app or (re-)generate embeddings
- Before this the only way to load model from file was by restarting app
- The other way to reload the model embeddings by regenerating them
was to expensive for larger datasets
- This unlocks at least 1 use-case, where
- we regenerate model via an app instance running on a separate server and
- just reload the generated embeddings on the client device
- This allows us to offload the expensive embedding generation
compute to a background server while letting
- This avoids having to (re-)restart application on client device or
be forced to generate embeddings on the client device itself
- But it requires the model relevant files to be synced to the client device
This can be done with any file syncing application like Syncthing
- We can then call /regenerate on server and /reload client on a
regular schedule to keep our data up to date on semantic search
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
- Notes are being processed on a different machine and used on a different machine
- But the notes directory is in the same location relative to home on both the machines