- Improve search speed by ~10x
Tested on corpus of 125K lines, 12.5K entries
- Allow cross-encoder to re-rank results by settings &?r=true when querying /search API
- It's an optional param that default to False
- Earlier all results were re-ranked by cross-encoder
- Making this configurable allows for much faster results, if desired
but for lower accuracy
- Formalize filters into class with can_filter() and filter() methods
- Use can_filter() method to decide whether to apply filter and
create deep copies of entries and embeddings for it
- Improve search speed for queries with no filters
as deep copying entries, embeddings takes the most time
after cross-encodes scoring when calling the /search API
Earlier we would create deep copies of entries, embeddings
even if the query did not contain any filter keywords
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
- The logic for compiling a beancount entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows symmetric search to be generic and not be aware of
beancount specific properties that were extracted by the
beancount-to-jsonl processor layer
- Now symmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be location, transaction, chat etc, it doesn't have to care
- The logic for compiling an org-mode entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows asymmetric search to be generic and not be aware of
org-mode specific properties that were extracted by the org-to-jsonl
processor layer
- Now asymmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be mail, chat, markdown, org-mode etc, it doesn't have to care
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings
- The (new?) model seems to understand dates. So can give more
relevant entries if date in natural language mentioned in query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
will give different results, with the second prioritizing entries
mentioning any entries with closed, scheduled dates from 1984
- While it's true those strings are going to be used to generated
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
across machines.
- Use case:
A more powerful, always on machine actually computes the image embeddings regularly
The client machine just load these periodically to provide semantic search functionality
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
- return empty strings in such a scenario instead of ". "
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_use
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- Metadata match score were consistently giving higher scores by a
factor of ~3x wrt to image match score. This was resulting in all
results being from the metadata match with query and none from the
image match with query.
- Scaling the metadata match scores down by scaling factor seems to
give more consistently give a blend of results from both image and
metadata matches
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
directly in browser
- Return image, metadata scores for each image in response as well
This should help get a better sense of image scores along both
XMP metadata and whole image axis
Details
--
- The filters to apply are configured for each type in the search controller
- Muliple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
query, entries and embeddings before semantic search is performed
Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query
Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
- Issue
- Explicit filtering was earlier being done after search by bi-encoder
but before re-ranking by cross-encoder
- This was limiting the quality of results being returned. As the
bi-encoder returned results which were going to be excluded. So the
burden of improving those limited results post filtering was on the
cross-encoder by re-ranking the remaining results based on query
- Fix
- Given the embeddings corresponding to an entry are at the same index
in their respective lists. We can run the filter for blocked,
required words before the search by the bi-encoder model. And limit
entries, embeddings being considered for the current query
- Result
- Semantic search by the bi-encoder gets to return most relevant
results for the query, knowing that the results aren't going to be
filtered out after. So the cross-encoder shoulders less of the
burden of improving results
- Corollary
- This pre-filtering technique allows us to apply other explicit
filters on entries relevant for the current query
- E.g limit search for entries within date/time specified in query
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
- Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
for code readability
Sentence Transformer MSMarco Model isn't date aware
So no use of adding scheduled, deadline dates to model embeddings for consideration
This reverts commit a2a08d1354.
- Fix loading entries from jsonl in extract_entries method
- Only extract Title from jsonl of each entry
This is the only thing written to the jsonl for symmetric ledger
- This fixes the trailing escape seq in loaded entries
- Remove the need for semantic-search.el response reader to do pointless complicated cleanup
- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
Both methods were doing similar work
- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl