- Metadata match score were consistently giving higher scores by a
factor of ~3x wrt to image match score. This was resulting in all
results being from the metadata match with query and none from the
image match with query.
- Scaling the metadata match scores down by scaling factor seems to
give more consistently give a blend of results from both image and
metadata matches
Adding a random, unused url param at the end of the img.src string
fixes the issue. As the browser thinks it's a new image and doesn't
use the image data that's already cached because of which it wasn't
even making the fetch call for the image
- Allow viewing image results returned by Semantic Search.
Until now there wasn't any interface within the app to view image
search results. For text results, we at least had the emacs interface
- This should help with debugging issues with image search too
For text the Swagger interface was good enough
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
directly in browser
- Return image, metadata scores for each image in response as well
This should help get a better sense of image scores along both
XMP metadata and whole image axis
Conda doesn't support using the same environment across platforms
We were able to get away with this till now because of manually
setting up the conda environment.yml
But it's more robust to just add conda environment YAML files for each
platform as necessary
Goal
--
Allow Limiting Search to Entries in Specified Date Range
Example Queries:
---
- _Traveled for work internationally dt>"2 years ago"_
Finds relevant notes since start of 2020
- _Learnt a cool new skill dt="last month"_
Finds relevant notes anytime in the last month
- _Filed my taxes dt>="Jan 1984" dt<="April 1984"_
Any tax related notes between 1st Jan 1984 to 30th April of 1984
Details
--
- Parse natural language dates in query into date ranges
- Use `dateparser` library to parse natural language dates. But tune results to return more natural date ranges
- Example: A user asking for entries from April, requires looking for entries in the whole of April, not April 1st or April 30th
- Find all dates in entry (currently limited to YYYY-MM-DD format)
- Only perform semantic search on entries within date range specified in query by user
- With \t Last Word in Headings was suffixed by \t and so couldn't be
filtered by
- User interacts with raw entries, so run explicit filters on raw entry
- For semantic search using the filtered entry is cleaner, still
- Fix date_filter date_in_entry within query range check
- Extracted_date_range is in [included_date, excluded_date) format
- But check was checking for date_in_entry <= excluded_date
- Fixed it to do date_in_entry < excluded_date
- Fix removal of date filter from query
- Add tests for date_filter
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
dates in the future but dateparser's parse doesn't parse it at all.
E.g parse('5 months from now') returns nothing
- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
parse('5 months') to dateparser.parse works as expected
Reason
--
This abstraction will simplify adding other pre-search filters. E.g A date-time filter
Capabilities
--
- Multiple filters can be applied on the query, entries etc before search
- The filters to apply are configured for each type in the search controller
Details
--
- Move `explicit_filters` function into separate module under `search_filter`
- Update signature of explicit filter to take and return `query`, `entries`, `embeddings`
- Use this `explicit_filter` function from `search_filters` module in
`search` method in controller
- The asymmetric query method now just applies the passed filters to the
`query`, `entries` and `embeddings` before semantic search is performed
Details
--
- The filters to apply are configured for each type in the search controller
- Muliple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
query, entries and embeddings before semantic search is performed
Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query
Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
## Issue
- Explicit filtering was being done after search by the bi-encoder
but before re-ranking by the cross-encoder
- This limited the quality of results being returned for queries with explicit filters.
The bi-encoder returned results which were going to be excluded.
So the burden of improving those limited results post filtering was on the
cross-encoder, by re-ranking the remaining results to best match the query
## Fix
- Given that the entry and its embedding are at the same index in their respective lists.
We know which entries map to which embedding tensors.
So we can run the filter for blocked, required words before the bi-encoder search.
And limit entries, embeddings being considered for the current query
## Result
- Semantic search by the bi-encoder returns the most relevant results
for the query, knowing that the results aren't going to be filtered out after.
So the cross-encoder shoulders less of the burden of improving the results
## Corollary
- This pre-filtering technique allows us to apply other explicit filters
on entries relevant for the current query, before calling search
- E.g limit search to entries within date/time specified in query
- Issue
- Explicit filtering was earlier being done after search by bi-encoder
but before re-ranking by cross-encoder
- This was limiting the quality of results being returned. As the
bi-encoder returned results which were going to be excluded. So the
burden of improving those limited results post filtering was on the
cross-encoder by re-ranking the remaining results based on query
- Fix
- Given the embeddings corresponding to an entry are at the same index
in their respective lists. We can run the filter for blocked,
required words before the search by the bi-encoder model. And limit
entries, embeddings being considered for the current query
- Result
- Semantic search by the bi-encoder gets to return most relevant
results for the query, knowing that the results aren't going to be
filtered out after. So the cross-encoder shoulders less of the
burden of improving results
- Corollary
- This pre-filtering technique allows us to apply other explicit
filters on entries relevant for the current query
- E.g limit search for entries within date/time specified in query
- test_regenerate_with_valid_content failed when run after test_asymmetric_search
- test_asymmetric_search did't clean the temporary update to config it had made
- This was resulting in regenerate looking for a file that didn't exist
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
- The reload API adds the ability to separate out the loading of
embeddings from file without having to restart app or (re-)generate embeddings
- Before this the only way to load model from file was by restarting app
- The other way to reload the model embeddings by regenerating them
was to expensive for larger datasets
- This unlocks at least 1 use-case, where
- we regenerate model via an app instance running on a separate server and
- just reload the generated embeddings on the client device
- This allows us to offload the expensive embedding generation
compute to a background server while letting
- This avoids having to (re-)restart application on client device or
be forced to generate embeddings on the client device itself
- But it requires the model relevant files to be synced to the client device
This can be done with any file syncing application like Syncthing
- We can then call /regenerate on server and /reload client on a
regular schedule to keep our data up to date on semantic search
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
- Notes are being processed on a different machine and used on a different machine
- But the notes directory is in the same location relative to home on both the machines
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
- Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
for code readability