- Reason:
Allow natural search on markdown based notes, documentation,
websites etc
- Details:
- Create markdown processor to extract Markdown entries (identified by
Heading) into standard jsonl format required by text_search
- Update API, Configs to support interfacing with new markdown type
- Update Emacs, Web clients to support interfacing with new markdown
type via API
- Update Readme to mentiond markdown is also supported
Closes#35
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
- The logic for compiling a beancount entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows symmetric search to be generic and not be aware of
beancount specific properties that were extracted by the
beancount-to-jsonl processor layer
- Now symmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be location, transaction, chat etc, it doesn't have to care
- The logic for compiling an org-mode entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows asymmetric search to be generic and not be aware of
org-mode specific properties that were extracted by the org-to-jsonl
processor layer
- Now asymmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be mail, chat, markdown, org-mode etc, it doesn't have to care
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings
- The (new?) model seems to understand dates. So can give more
relevant entries if date in natural language mentioned in query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
will give different results, with the second prioritizing entries
mentioning any entries with closed, scheduled dates from 1984
- While it's true those strings are going to be used to generated
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
across machines.
- Use case:
A more powerful, always on machine actually computes the image embeddings regularly
The client machine just load these periodically to provide semantic search functionality
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
- return empty strings in such a scenario instead of ". "
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_use
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
- Avoids having to click the query input box
- Just open page, type whatever and hit enter to do image search
- For other search types select appropriate type from dropdown
- Use shr to render image response from html in result buffer
Earlier was using org-mode. But rendering HTML with shr seems cleaner
- Use Headings to Add highlights
- Use Random to Force fetch of Image. Similar to what was done for Web interface
- Remove trailing elisp brackets from response
- Show query match scores by image model for each image in results
- Metadata match score were consistently giving higher scores by a
factor of ~3x wrt to image match score. This was resulting in all
results being from the metadata match with query and none from the
image match with query.
- Scaling the metadata match scores down by scaling factor seems to
give more consistently give a blend of results from both image and
metadata matches
Adding a random, unused url param at the end of the img.src string
fixes the issue. As the browser thinks it's a new image and doesn't
use the image data that's already cached because of which it wasn't
even making the fetch call for the image