- Issue
  - Extracted entries had a different schema for symmetric_ledger vs asymmetric search
  - Entry extraction for asymmetric was messy. It relied on cryptic
    indices to store the raw entry vs the cleaned entry meant to be passed to embeddings
  - This pushed the burden of figuring out which property to extract
    from each entry onto downstream processes like the filters
  - This limited the filters to only work for asymmetric search, not for
    symmetric_ledger
- Fix
  - Use a consistent format for extracted entries
    {
      'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
      'raw'  : raw_entry_string_meant_to_be_passed_to_user
    }
- Result
  - Filters can now be applied across search types, and the specific
    field they should be applied on can be configured by each search
    type
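
A minimal sketch of the consistent entry format and a filter whose target field is chosen per search type. The names `make_entry`, `word_filter` and `filter_field` are hypothetical, and the whitespace cleanup is only a stand-in for the actual cleaning step:

```python
def make_entry(raw: str) -> dict:
    """Build an entry in the consistent {'embed', 'raw'} format.

    Assumption: the 'embed' string is just the raw string with
    whitespace collapsed; the real cleaning step may differ.
    """
    return {"embed": " ".join(raw.split()), "raw": raw}


def word_filter(entries: list, required_word: str, filter_field: str) -> list:
    """Keep entries whose chosen field contains the required word."""
    return [
        entry for entry in entries
        if required_word.lower() in entry[filter_field].lower()
    ]


entries = [make_entry("* Filed   taxes\n"), make_entry("* Went hiking\n")]

# Each search type picks the field its filters should look at
asymmetric_matches = word_filter(entries, "taxes", filter_field="raw")
symmetric_matches = word_filter(entries, "taxes", filter_field="embed")
```

Because every entry carries both fields, the same filter works for any search type and only the `filter_field` argument changes.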
- The all-MiniLM-L6-v2 model is more accurate
- The exact previous model isn't benchmarked. But based on the
performance of the closest benchmarked model, the new model
may be similar in speed and size
- On a very preliminary evaluation, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On Hugging Face it has far more downloads and likes than the msmarco model[2]
- On a very preliminary evaluation of the model
- It doubles the encoding speed of all entries (encoding time down from ~8 mins to 4 mins)
- It returned more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Avoids having to click the query input box
- Just open page, type whatever and hit enter to do image search
- For other search types select appropriate type from dropdown
- Use shr to render image response from html in result buffer
Earlier was using org-mode. But rendering HTML with shr seems cleaner
- Use Headings to Add highlights
- Use Random to Force fetch of Image. Similar to what was done for Web interface
- Remove trailing elisp brackets from response
- Show query match scores by image model for each image in results
- Metadata match scores were consistently higher than image match
scores by a factor of ~3x. This resulted in all
results coming from the metadata match with the query and none from the
image match with the query.
- Scaling the metadata match scores down by a scaling factor seems to
more consistently give a blend of results from both image and
metadata matches
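
A rough sketch of the score scaling described above. The function name `blend_scores`, the `metadata_scale` value of 1/3 and the choice to keep the higher of the two scores per image are all assumptions for illustration, not the actual implementation:

```python
def blend_scores(image_scores, metadata_scores, metadata_scale=1 / 3):
    """Scale metadata scores down, then keep the best score per image.

    Assumption: both inputs map image name -> similarity score, and
    metadata_scale (~1/3 here) offsets the ~3x higher metadata scores.
    """
    names = set(image_scores) | set(metadata_scores)
    blended = {}
    for name in names:
        image_score = image_scores.get(name, 0.0)
        metadata_score = metadata_scores.get(name, 0.0) * metadata_scale
        blended[name] = max(image_score, metadata_score)
    # Highest blended score first
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


image_scores = {"beach.jpg": 0.30, "city.jpg": 0.10}
metadata_scores = {"city.jpg": 0.60, "dog.jpg": 0.45}
ranked = blend_scores(image_scores, metadata_scores)
```

Without the scaling, both metadata matches would outrank the image match in this made-up data. With it, the image match leads and the metadata matches still appear.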
Adding a random, unused URL param at the end of the img.src string
fixes the issue. The browser now thinks it's a new image and doesn't
use the already cached image data. Earlier, because of the cache, it wasn't
even making the fetch call for the image
- Allow viewing image results returned by Semantic Search.
Until now there wasn't any interface within the app to view image
search results. For text results, we at least had the emacs interface
- This should help with debugging issues with image search too
For text the Swagger interface was good enough
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
directly in browser
- Return image, metadata scores for each image in response as well
This should help get a better sense of image scores along both
XMP metadata and whole image axis
Conda doesn't support using the same environment across platforms
We were able to get away with this till now because of manually
setting up the conda environment.yml
But it's more robust to just add conda environment YAML files for each
platform as necessary
Goal
--
Allow Limiting Search to Entries in Specified Date Range
Example Queries:
---
- _Traveled for work internationally dt>"2 years ago"_
Finds relevant notes since start of 2020
- _Learnt a cool new skill dt="last month"_
Finds relevant notes anytime in the last month
- _Filed my taxes dt>="Jan 1984" dt<="April 1984"_
Finds any tax related notes from 1st Jan 1984 to 30th April 1984
Details
--
- Parse natural language dates in query into date ranges
- Use the `dateparser` library to parse natural language dates. But tune results to return more natural date ranges
- Example: A user asking for entries from April requires looking for entries in the whole of April, not just on April 1st or April 30th
- Find all dates in entry (currently limited to YYYY-MM-DD format)
- Only perform semantic search on entries within date range specified in query by user
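
A minimal sketch of the date range filtering described above, using only the standard library. The natural language parsing done by `dateparser` is left out, and the helper names `month_range`, `dates_in_entry` and `entries_in_range` are hypothetical:

```python
import re
from datetime import date, datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def month_range(year: int, month: int):
    """Return the whole month as a half-open [start, end) date range."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def dates_in_entry(entry: str):
    """Find all valid YYYY-MM-DD dates mentioned in an entry."""
    found = []
    for match in DATE_PATTERN.findall(entry):
        try:
            found.append(datetime.strptime(match, "%Y-%m-%d").date())
        except ValueError:
            continue  # looks like a date but isn't one, e.g. 2020-13-45
    return found


def entries_in_range(entries, start, end):
    """Keep entries with at least one date in [start, end)."""
    return [
        entry for entry in entries
        if any(start <= found < end for found in dates_in_entry(entry))
    ]


entries = [
    "Filed my taxes on 1984-04-30",
    "Started tax prep 1984-05-01",
    "No date in this note",
]
start, end = month_range(1984, 4)
april_entries = entries_in_range(entries, start, end)
```

Asking for April yields the range from April 1st up to, but not including, May 1st, so an entry dated on the last day of April is kept while one dated May 1st is not.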
- The last word in headings was suffixed by \t and so couldn't be
filtered by
- User interacts with raw entries, so run explicit filters on raw entry
- For semantic search, using the filtered entry is still cleaner
- Fix date_filter date_in_entry within query range check
- Extracted_date_range is in [included_date, excluded_date) format
- But the check was testing for date_in_entry <= excluded_date
- Fixed it to test for date_in_entry < excluded_date
- Fix removal of date filter from query
- Add tests for date_filter
- Default to looking at dates from past, as most notes are from past
- Look for dates in the future for cases where it's obvious the query is for
dates in the future but dateparser doesn't parse it at all.
E.g parse('5 months from now') returns nothing
- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
'5 months' to dateparser.parse works as expected
Reason
--
This abstraction will simplify adding other pre-search filters. E.g A date-time filter
Capabilities
--
- Multiple filters can be applied on the query, entries etc before search
- The filters to apply are configured for each type in the search controller
Details
--
- Move `explicit_filters` function into separate module under `search_filter`
- Update signature of explicit filter to take and return `query`, `entries`, `embeddings`
- Use this `explicit_filter` function from `search_filters` module in
`search` method in controller
- The asymmetric query method now just applies the passed filters to the
`query`, `entries` and `embeddings` before semantic search is performed
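
A sketch of how filters sharing this signature can be chained before search. The two filters, the `apply_filters` helper and the `filters_by_type` mapping are hypothetical stand-ins, not the actual `explicit_filter` or controller code:

```python
from typing import Callable, List, Tuple

# A filter takes and returns (query, entries, embeddings)
SearchFilter = Callable[[str, List[str], List[list]], Tuple[str, List[str], List[list]]]


def drop_empty_entries(query, entries, embeddings):
    """Hypothetical filter: remove blank entries and their embeddings."""
    kept = [i for i, entry in enumerate(entries) if entry.strip()]
    return query, [entries[i] for i in kept], [embeddings[i] for i in kept]


def strip_query(query, entries, embeddings):
    """Hypothetical filter: tidy the query, leave entries untouched."""
    return query.strip(), entries, embeddings


def apply_filters(query, entries, embeddings, filters: List[SearchFilter]):
    """Run each configured filter in order before semantic search."""
    for search_filter in filters:
        query, entries, embeddings = search_filter(query, entries, embeddings)
    return query, entries, embeddings


# Filters are configured per search type, e.g. in the search controller
filters_by_type = {"notes": [strip_query, drop_empty_entries], "image": []}

query, entries, embeddings = apply_filters(
    "  filed taxes ",
    ["Filed my taxes", "   "],
    [[0.1, 0.2], [0.0, 0.0]],
    filters_by_type["notes"],
)
```

Since every filter takes and returns the same three values, adding a new pre-search filter only means appending another function to the list for a search type.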
Details
--
- The filters to apply are configured for each type in the search controller
- Multiple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
query, entries and embeddings before semantic search is performed
Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query
Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
## Issue
- Explicit filtering was being done after search by the bi-encoder
but before re-ranking by the cross-encoder
- This limited the quality of results being returned for queries with explicit filters.
The bi-encoder returned results which were going to be excluded.
So the burden of improving those limited results post filtering was on the
cross-encoder, by re-ranking the remaining results to best match the query
## Fix
- An entry and its embedding are at the same index in their respective lists,
so we know which entries map to which embedding tensors.
This lets us run the filter for blocked, required words before the bi-encoder search,
and limit the entries, embeddings being considered for the current query
## Result
- Semantic search by the bi-encoder returns the most relevant results
for the query, knowing that the results aren't going to be filtered out after.
So the cross-encoder shoulders less of the burden of improving the results
## Corollary
- This pre-filtering technique allows us to apply other explicit filters
on entries relevant for the current query, before calling search
- E.g limit search to entries within date/time specified in query
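
A simplified sketch of the pre-filtering idea. Plain lists and a dot product stand in for the embedding tensors and the bi-encoder search, and the names `prefilter` and `top_match` are hypothetical:

```python
def prefilter(entries, embeddings, required=(), blocked=()):
    """Filter entries and embeddings together, relying on shared indices."""
    kept = []
    for index, entry in enumerate(entries):
        words = set(entry.lower().split())
        has_required = all(word in words for word in required)
        has_blocked = any(word in words for word in blocked)
        if has_required and not has_blocked:
            kept.append(index)
    return [entries[i] for i in kept], [embeddings[i] for i in kept]


def top_match(query_embedding, entries, embeddings):
    """Stand-in for bi-encoder search: highest dot product wins."""
    scores = [
        sum(q * e for q, e in zip(query_embedding, embedding))
        for embedding in embeddings
    ]
    best = max(range(len(entries)), key=scores.__getitem__)
    return entries[best]


entries = ["work trip to berlin", "family trip to goa", "work notes on taxes"]
embeddings = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]

filtered_entries, filtered_embeddings = prefilter(
    entries, embeddings, required=["trip"], blocked=["work"]
)
best_entry = top_match([1.0, 0.0], filtered_entries, filtered_embeddings)
```

In this toy data the unfiltered search would pick the blocked "work trip" entry first. Filtering before the search means the top result is already one the user is allowed to see.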
- Issue
- Explicit filtering was earlier being done after search by bi-encoder
but before re-ranking by cross-encoder
- This was limiting the quality of results being returned, as the
bi-encoder returned results which were going to be excluded. So the
burden of improving those limited results post filtering was on the
cross-encoder, by re-ranking the remaining results based on the query
- Fix
- The embeddings corresponding to an entry are at the same index
in their respective lists. So we can run the filter for blocked,
required words before the search by the bi-encoder model, and limit
the entries, embeddings being considered for the current query
- Result
- Semantic search by the bi-encoder gets to return most relevant
results for the query, knowing that the results aren't going to be
filtered out after. So the cross-encoder shoulders less of the
burden of improving results
- Corollary
- This pre-filtering technique allows us to apply other explicit
filters on entries relevant for the current query
- E.g limit search for entries within date/time specified in query
- test_regenerate_with_valid_content failed when run after test_asymmetric_search
- test_asymmetric_search didn't clean up the temporary update it had made to the config
- This was resulting in regenerate looking for a file that didn't exist
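
One possible way to guard against this kind of leak between tests is to restore the config even when a test fails. This is only a sketch: the `temporary_config` helper, the dict-based config and the `input_file` key are assumptions, not the project's actual test setup:

```python
from contextlib import contextmanager


@contextmanager
def temporary_config(config: dict, **overrides):
    """Apply overrides to a config dict and restore it afterwards."""
    original = dict(config)
    config.update(overrides)
    try:
        yield config
    finally:
        # Runs even if the body raises, so later tests see the old config
        config.clear()
        config.update(original)


config = {"input_file": "notes.org"}
with temporary_config(config, input_file="tests/data/notes.org"):
    inside = config["input_file"]
after = config["input_file"]
```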