- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given its usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries' class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries' objects
instead of as dictionaries
- `text_search' and `text_to_jsonl' methods are annotated with
type hints for the new `Entries' type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update' method into `TextToJsonl' base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl' classes
- Avoid circular reference issues on importing `Entries' class
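The move from dictionary access to class style access can be sketched roughly as below. This is a minimal illustration, not the app's actual definition: the `Entry' name, its `compiled', `raw' and `file' fields, and the `from_dict' helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for the concrete entry type."""

    compiled: str               # text prepared for the embedding model
    raw: str                    # original text shown back to the user
    file: Optional[str] = None  # source file, when the processor sets it

    @classmethod
    def from_dict(cls, data: dict) -> "Entry":
        # Bridge for entries still loaded in the old dictionary form
        return cls(compiled=data["compiled"], raw=data["raw"], file=data.get("file"))


def raw_texts(entries: List[Entry]) -> List[str]:
    # Class style access (entry.raw) replaces dictionary access (entry["raw"])
    return [entry.raw for entry in entries]
```

With a concrete type, a misspelled field fails at attribute lookup or in a type checker rather than surfacing later as a missing dictionary key.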
- Both Text, Image Search were already returning a list of entry, score
- This change just concretizes that response format and exposes it in the API
documentation (i.e. OpenAPI, Swagger, ReDoc)
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl' scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl' processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
- For queries with only filters in them, short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
- Update existing code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves #84
- Let the specific text_to_jsonl method decide which of the
TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
methods in check
- What
- Hash the entries and compare to find new/updated entries
- Reuse embeddings encoded for existing entries
- Only encode embeddings for updated or new entries
- Merge the existing and new entries and embeddings to get the updated
entries, embeddings
- Why
- Most note text entries are expected to be unchanged across time,
so reusing their earlier encoded embeddings should
significantly speed up embeddings updates
- Previously we were regenerating embeddings for all entries,
even if they had existed in previous runs
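The hash, compare and reuse steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the app's implementation: entries are plain strings, `encode' is any hypothetical callable that turns a list of texts into a list of vectors, SHA-256 is an arbitrary choice of hash, and entries missing from the current run are simply dropped.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = List[float]


def entry_hash(entry: str) -> str:
    # Identical text always hashes to the same key
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def update_embeddings(
    current_entries: Sequence[str],
    previous_entries: Sequence[str],
    previous_embeddings: Sequence[Embedding],
    encode: Callable[[List[str]], List[Embedding]],
) -> Tuple[List[str], List[Embedding]]:
    # Index the embeddings from the previous run by entry hash
    cache: Dict[str, Embedding] = {
        entry_hash(entry): embedding
        for entry, embedding in zip(previous_entries, previous_embeddings)
    }
    # Only entries with an unseen hash are new or updated
    new_entries = [e for e in current_entries if entry_hash(e) not in cache]
    new_embeddings = encode(new_entries) if new_entries else []
    for entry, embedding in zip(new_entries, new_embeddings):
        cache[entry_hash(entry)] = embedding
    # Merge: every current entry now has an embedding, reused or fresh
    return list(current_entries), [cache[entry_hash(e)] for e in current_entries]
```

The encoder is only called with the entries whose hash was not seen before, which is where the expected speed-up comes from.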
- Previously we were failing if no valid entries while computing
embeddings. This was obscuring the actual issue of no valid entries
found in the specified content files
- Throwing an exception early with a clear message when no entries are found
should clarify the issue to be fixed
- See issue #83 for details
- Filter entries, embeddings by ids satisfying all filters in query
func, after each filter has returned entry ids satisfying their
individual acceptance criteria
- Previously each filter would return a filtered list of entries.
Each filter would be applied on entries filtered by previous filters.
This made the filtering order dependent
- Benefits
- Filters can be applied independent of their order of execution
- Precomputed indexes for each filter are not in danger of running
into index out of bound errors, as filters run on original entries
instead of on entries filtered by filters that have run before it
- Extract entries satisfying filter only once instead of doing
this for each filter
- Costs
- Each filter has to process all entries even if previous filters
may have already marked them as non-satisfactory
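A rough sketch of the order-independent approach: each filter returns the set of entry ids it accepts, and the query function intersects those sets before extracting entries and embeddings once. The `select_by_filters' name and the `IdFilter' signature are assumptions made for the illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A filter maps (query, entries) to the ids of the entries it accepts
IdFilter = Callable[[str, Sequence[str]], Set[int]]


def select_by_filters(
    query: str,
    entries: Sequence[str],
    embeddings: Sequence[List[float]],
    filters: Sequence[IdFilter],
) -> Tuple[List[str], List[List[float]]]:
    # Start with every entry id; each filter sees the original entries
    included: Set[int] = set(range(len(entries)))
    for entry_filter in filters:
        included &= entry_filter(query, entries)
    # Extract entries and embeddings once, after all filters have run
    kept = sorted(included)
    return [entries[i] for i in kept], [embeddings[i] for i in kept]
```

Because set intersection is commutative, running the filters in a different order yields the same result.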
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
entry in text_search, as not all text search content types have
file set in their jsonl converters
- Only the filter knows when entries, embeddings are to be manipulated.
So move the responsibility to deep copy before manipulating entries,
embeddings to the filters
- Create deep copy in filters. Avoids creating deep copy of entries,
embeddings when filter results are being loaded from cache etc
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
- verbose = 0 maps to logging.WARN
- verbose = 1 maps to logging.INFO
- verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
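The verbosity remap above could look roughly like this. The `verbosity_to_log_level' helper name is hypothetical; only the level constants come from the standard `logging' module.

```python
import logging


def verbosity_to_log_level(verbose: int) -> int:
    # verbose = 0 maps to WARNING (logging.WARN is an alias for it)
    if verbose <= 0:
        return logging.WARNING
    # verbose = 1 maps to INFO
    if verbose == 1:
        return logging.INFO
    # verbose >= 2 maps to DEBUG
    return logging.DEBUG
```

The level can then be set once at startup, e.g. via `logging.basicConfig(level=...)', so methods no longer need a verbose flag passed to them.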
- CLIP doesn't need full size images for generating embeddings with
decent search results. The sentence transformers docs use images
scaled to 640px width
- Benefits
- Normalize image sizes
- Increase image embeddings generation speed
- Decrease memory usage while generating embeddings from images
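As an illustration of the scaling step, the target size can be computed while preserving the aspect ratio. The `scaled_size' helper and the choice to leave narrower images untouched are assumptions of this sketch; the actual resize would be done by the imaging library, e.g. Pillow's `Image.resize'.

```python
from typing import Tuple


def scaled_size(width: int, height: int, max_width: int = 640) -> Tuple[int, int]:
    # Assumption: images already narrow enough are left untouched
    if width <= max_width:
        return width, height
    # Scale height by the same factor as width to keep the aspect ratio
    return max_width, max(1, round(height * max_width / width))
```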
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
required by looking for ML compute device in global state
- Pass device to load models onto from app state.
- SentenceTransformer models accept device to load models onto during initialization
- Pass device to load corpus embeddings onto from app state
- CLIP Image score and XMP metadata score are not combining well.
When combined they give nonsensical results. Enable only once
we figure out how best to combine the two.
- Show scores with higher precision for image search
- Image search scores seem to mostly be between 0.2 and 0.3 for some reason
- Higher precision scores make it easier to understand the quality
of returned results perceived by the model itself
- Improve search speed by ~10x
Tested on corpus of 125K lines, 12.5K entries
- Allow cross-encoder to re-rank results by setting &?r=true when querying /search API
- It's an optional param that defaults to False
- Earlier all results were re-ranked by cross-encoder
- Making this configurable allows for much faster results, if desired,
at the cost of lower accuracy
- Formalize filters into class with can_filter() and filter() methods
- Use can_filter() method to decide whether to apply filter and
create deep copies of entries and embeddings for it
- Improve search speed for queries with no filters
as deep copying entries, embeddings takes the most time
after cross-encoder scoring when calling the /search API.
Earlier we would create deep copies of entries, embeddings
even if the query did not contain any filter keywords
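A minimal sketch of the filter interface and the deep-copy fast path described above. Only the `can_filter()' and `filter()' method names come from this change; the `BaseFilter' and `RequiredWordFilter' classes, the `+word' query syntax and the `run_filters' helper are hypothetical.

```python
import copy
from abc import ABC, abstractmethod
from typing import List, Tuple

Embedding = List[float]


class BaseFilter(ABC):
    @abstractmethod
    def can_filter(self, query: str) -> bool:
        """Return True if the query contains terms this filter handles."""

    @abstractmethod
    def filter(
        self, query: str, entries: List[str], embeddings: List[Embedding]
    ) -> Tuple[str, List[str], List[Embedding]]:
        """Return the query without filter terms, plus the entries, embeddings kept."""


class RequiredWordFilter(BaseFilter):
    """Hypothetical filter: every +word term must appear in an entry."""

    @staticmethod
    def _is_filter_term(term: str) -> bool:
        return term.startswith("+") and len(term) > 1

    def can_filter(self, query):
        return any(self._is_filter_term(term) for term in query.split())

    def filter(self, query, entries, embeddings):
        terms = query.split()
        required = [t[1:].lower() for t in terms if self._is_filter_term(t)]
        remaining = " ".join(t for t in terms if not self._is_filter_term(t))
        keep = [i for i, e in enumerate(entries) if all(w in e.lower() for w in required)]
        return remaining, [entries[i] for i in keep], [embeddings[i] for i in keep]


def run_filters(query, entries, embeddings, filters):
    active = [f for f in filters if f.can_filter(query)]
    if not active:
        # Fast path: no filter keywords in the query, so no deep copies
        return query, entries, embeddings
    # Copy only when some filter is actually going to run
    entries, embeddings = copy.deepcopy(entries), copy.deepcopy(embeddings)
    for entry_filter in active:
        query, entries, embeddings = entry_filter.filter(query, entries, embeddings)
    return query, entries, embeddings
```

Queries without filter keywords get the original objects back unchanged, which skips the costly deep copy entirely.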
- The code for both the text search types was mostly the same.
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and makes it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
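A shared helper of this kind might look like the sketch below. It assumes an uncompressed, UTF-8 encoded file with one JSON object per line; the real helper's signature and any compression handling are not shown in this change.

```python
import json
from typing import Dict, List


def load_jsonl(path: str) -> List[Dict]:
    # Read one JSON object per line, skipping blank lines
    entries = []
    with open(path, "r", encoding="utf-8") as jsonl_file:
        for line in jsonl_file:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries
```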
- The logic for compiling a beancount entry (for later encoding) now
completely resides in the beancount-to-jsonl processor layer
- This allows symmetric search to be generic and not be aware of
beancount specific properties that were extracted by the
beancount-to-jsonl processor layer
- Now symmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be location, transaction, chat etc, it doesn't have to care
- The logic for compiling an org-mode entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows asymmetric search to be generic and not be aware of
org-mode specific properties that were extracted by the org-to-jsonl
processor layer
- Now asymmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be mail, chat, markdown, org-mode etc, it doesn't have to care
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings
- The (new?) model seems to understand dates. So it can give more
relevant entries if a date is mentioned in natural language in the query
- E.g. "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
will give different results, with the second prioritizing entries
with closed, scheduled dates from 1984
- While it's true those strings are going to be used to generate
embeddings, the more generic term allows them to be used elsewhere as
well
- Their main property is that they are processed, compiled for
usage by semantic search
- Unlike the 'raw' string which contains the external representation
of the data, as is
- Had already made some progress on this earlier by updating the image
search responses. But needed to update the text search responses to
use lowercase entry and score
- Update khoj.el to consume the updated json response keys for text
search
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
across machines.
- Use case:
A more powerful, always on machine actually computes the image embeddings regularly
The client machine just loads these periodically to provide semantic search functionality
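One way to get a stable order is to sort the glob results before processing, sketched below. The `list_images' helper, the `*.jpg' default pattern and sorting by file name are assumptions of this example.

```python
from pathlib import Path
from typing import List


def list_images(directory: str, pattern: str = "*.jpg") -> List[Path]:
    # glob makes no ordering guarantee, so sort by file name to get
    # the same image order on every machine
    return sorted(Path(directory).glob(pattern), key=lambda path: path.name)
```

With the same order everywhere, row i of an embeddings file computed on one machine refers to the same image on another.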
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
- return empty strings in such a scenario instead of ". "
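The short final batch can be handled by stepping through the items with slices, which never read past the end of the list. The `batched' helper below is a hypothetical sketch, not the code from this change.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    # Slicing past the end is safe, so the last batch may simply be shorter
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```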
Issue:
- Had different schema of extracted entries for symmetric_ledger vs asymmetric
- Entry extraction for asymmetric was dirty, relying on cryptic
indices to store raw entry vs cleaned entry meant to be passed to embeddings
- This was pushing the load of figuring out what property to extract
from each entry to downstream processes like the filters
- This limited the filters to only work for asymmetric search, not for
symmetric_ledger
- Fix
- Use consistent format for extracted entries
{
'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
'raw' : raw_entry_string_meant_to_be_passed_to_user
}
- Result
- Now filters can be applied across search types, and the specific
field they should be applied on can be configured by each search
type
- The all-MiniLM-L6-v2 model is more accurate
- The exact previous model isn't benchmarked, but based on the
performance of the closest model to it, the new model
may be similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results