- Use Set for Tags instead of dictionary with empty keys
- No need to store first tag separately
- Remove properties, methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-strings on separate lines
for code readability
- Now that the times line is excluded from the raw body of the node,
show it in repr so the user can see it for reference
- But the model doesn't need to see it, as it would only confuse
its embeddings
- Add links to property drawer
- This ensures results returned by semantic search contain these links
- This allows the user to jump to entry within original file for context
- The ID, file+heading based links are more robust for finding the relevant
entry in the original file than the line number based link,
as the user may edit the original files between embedding regenerations
Sentence Transformer MSMarco Model isn't date aware
So there is no use in adding scheduled, deadline dates to model embeddings for consideration
This reverts commit a2a08d1354.
- Introduce prompt for GPT to automatically extract user's search intent
- Expose new search api endpoint to use that to set SearchType being
passed to search API
- Currently meant as an experimental API to gauge usefulness,
extensibility. Evaluating for phone or voice use-cases
To improve prompt readability:
- Remove newline escape sequence and use actual newline directly
- This avoids one long line of text as the prompt
- Remove escaping of double quotes
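The readability change above can be illustrated with a small sketch. The prompt text here is made up for illustration and is not the actual prompt sent to GPT; it only shows that a triple-quoted string with real newlines and unescaped double quotes is equivalent to the escaped one-liner.

```python
# Hypothetical prompt text, for illustration only

# Before: one long line with \n escape sequences and escaped double quotes
escaped_prompt = "Extract the search type from the query.\nQuery: \"notes on gardening\"\nSearch type:"

# After: actual newlines, no escaping of double quotes
readable_prompt = '''Extract the search type from the query.
Query: "notes on gardening"
Search type:'''

# Both forms produce the same string
same = escaped_prompt == readable_prompt
```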
- Add search query to top of buffer as Beancount comment
- Remove trailing ) from response
- Separate entries by empty line
- Load beancount-mode in semantic search on ledger buffer
- Fix loading entries from jsonl in extract_entries method
- Only extract Title from jsonl of each entry
This is the only thing written to the jsonl for symmetric ledger
- This fixes the trailing escape seq in loaded entries
- Remove the need for semantic-search.el response reader to do pointless complicated cleanup
- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
Both methods were doing similar work
- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl
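A minimal sketch of a load_jsonl that handles both gzip and uncompressed jsonl, as described above. This is not the actual implementation; it assumes compression can be detected from a ".gz" file suffix and that each non-empty line holds one JSON object.

```python
import gzip
import json
from pathlib import Path


def load_jsonl(input_path):
    """Load entries from a jsonl file, gzip-compressed or uncompressed."""
    path = Path(input_path)
    # Assumption: a ".gz" suffix means the file is gzip-compressed
    opener = gzip.open if path.suffix == ".gz" else open
    entries = []
    with opener(path, "rt", encoding="utf-8") as jsonl_file:
        for line in jsonl_file:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries
```

With a shared loader like this, callers such as symmetric_ledger:extract_entries would only need to pick the fields they care about from each loaded entry.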
Conversation logs structure now has session info too instead of just chat info
Session info will allow loading past conversation summaries as context for AI in new conversations
{
  "session": [
    {
      "summary": <chat_session_summary>,
      "session-start": <session_start_index_in_chat_log>,
      "session-end": <session_end_index_in_chat_log>
    }],
  "chat": [
    {
      "intent": <intent-object>,
      "trigger-emotion": <emotion-triggered-by-message>,
      "by": <AI|Human>,
      "message": <chat_message>,
      "created": <message_created_date>
    }]
}
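A sketch of how this structure could be used to load past conversation summaries as context. All values below are made up, the shape of the intent object is hypothetical, and treating "session-end" as an inclusive index is an assumption.

```python
import json

# Hypothetical log following the structure above; all values are made up
conversation_log = {
    "session": [
        {"summary": "Talked about weekend plans", "session-start": 0, "session-end": 1}
    ],
    "chat": [
        {
            "intent": {"type": "remember"},  # hypothetical intent object
            "trigger-emotion": "calm",
            "by": "Human",
            "message": "I plan to go hiking on Saturday",
            "created": "2021-12-01 10:00:00",
        },
        {
            "intent": {"type": "acknowledge"},  # hypothetical intent object
            "trigger-emotion": "calm",
            "by": "AI",
            "message": "Noted. Enjoy the hike!",
            "created": "2021-12-01 10:00:05",
        },
    ],
}


def past_summaries(log):
    """Collect summaries of earlier sessions to use as context."""
    return [session["summary"] for session in log.get("session", [])]


def session_messages(log, session):
    """Return the chat messages covered by a session."""
    # Assumption: "session-end" is an inclusive index into the chat log
    start, end = session["session-start"], session["session-end"]
    return log["chat"][start : end + 1]


# The log round-trips through JSON unchanged
restored_log = json.loads(json.dumps(conversation_log))
```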
- Allow conversing with user using GPT's contextually aware, generative capability
- Extract metadata, user intent from user's messages using GPT's general understanding
Details
- Rename method query_* to query in search_types for standardization
- Wrapping Config code in classes simplified mocking test config
- Reduce args being passed to a function by passing them as a single
argument wrapped in a class
- Minimize setup in main.py:__main__. Put most of it into functions
These functions can be mocked if required in tests later too
Setup Flow:
CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel
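The setup flow above can be sketched roughly as below. The field names, config_from_args and setup_text_search are hypothetical names for illustration, not the actual code; the point is that each step takes one wrapped config object instead of many loose arguments, which also gives tests a single thing to mock.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextSearchConfig:
    # Hypothetical fields, for illustration only
    input_files: List[str]
    embeddings_file: str
    verbose: int = 0


@dataclass
class TextSearchModel:
    # Hypothetical container for what the search needs at query time
    config: TextSearchConfig
    entries: List[str] = field(default_factory=list)


def config_from_args(args: dict) -> TextSearchConfig:
    """CLI_Args|Config_YAML -> TextSearchConfig"""
    return TextSearchConfig(
        input_files=args["input_files"],
        embeddings_file=args["embeddings_file"],
        verbose=args.get("verbose", 0),
    )


def setup_text_search(config: TextSearchConfig) -> TextSearchModel:
    """TextSearchConfig -> TextSearchModel, using a single config argument."""
    # Placeholder for the real entry extraction and embedding computation
    entries = [f"entries from {path}" for path in config.input_files]
    return TextSearchModel(config=config, entries=entries)
```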
- Wrap Image, Music, Ledger search into the type of SearchModel they use
Similar to what was done for notes model by wrapping its config
into an AsymmetricSearchModel.
- Use the uber wrapper class to expose all type specific search models
- Wrap asymmetric search model parameters into AsymmetricSearchModel class
- Create wrapper for all search type models. Put notes search model into it
- Test notes search end-to-end from client API layer to results.
Use model built on test data
- Cleaner, more idiomatic usage of a global variable
- Simplifies mocking when testing client in pytest, as the setting is wrapped
in an object rather than being a simple type. So it is passed around by reference
- Use a SearchType to limit types that can be passed by user
- FastAPI automatically validates type passed in query param
- Available type options show up in Swagger UI, FastAPI docs
- Controller code looks neater than doing string comparisons for type
- Test invalid, valid search types via pytest
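A rough sketch of such a SearchType using only the standard library. The member names are assumptions based on the search types mentioned in these commits. When an Enum like this is used as the type of a query parameter, FastAPI performs the validation itself; parse_search_type is a hypothetical helper that only imitates that check here.

```python
from enum import Enum


class SearchType(str, Enum):
    # Member names and values are assumptions, for illustration only
    Notes = "notes"
    Ledger = "ledger"
    Music = "music"
    Image = "image"


def parse_search_type(value):
    """Return the matching SearchType, or None if the value is not allowed."""
    try:
        return SearchType(value)
    except ValueError:
        return None
```

Controller code can then compare against SearchType.Notes and friends instead of raw strings.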
- Break the compute embeddings method into separate methods:
compute_image_embeddings and compute_metadata_embeddings
- If image_metadata_embeddings isn't defined, do not use it to enhance
search results. Given image_metadata_embeddings wouldn't be defined
if use_xmp_metadata is False, we can avoid unnecessary addition of
args to query method
- Issue:
Process would get killed while encoding images
for consuming too much memory
- Fix:
- Encode images in batches and append to image_embeddings
- No need to use copy or deep_copy anymore with batch processing.
It would earlier throw a "too many open files" error
Other Changes:
- Use tqdm to see progress even when using batch
- See progress bar of encoding independent of verbosity (for now)
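The batching fix can be sketched as below. This is a simplified illustration, not the actual code: encode_batch is a hypothetical stand-in for the model's encode call, and the default batch size is arbitrary.

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def compute_image_embeddings(image_paths, encode_batch, batch_size=50):
    """Encode images batch by batch and append to image_embeddings.

    encode_batch is a hypothetical callable that takes a list of image
    paths and returns one embedding per image.
    """
    image_embeddings = []
    for batch in in_batches(image_paths, batch_size):
        # Only one batch of images is loaded and encoded at a time,
        # which bounds memory use and the number of open files
        image_embeddings += encode_batch(batch)
    return image_embeddings
```

The loop over in_batches(...) is also a natural place to wrap with tqdm to show encoding progress.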
- Details
- The CLIP model can represent images, text in the same vector space
- Enhance CLIP's image understanding by augmenting the plain image
with its text-based metadata.
Specifically with any subject, description XMP tags on the image
- Improve results by combining plain image similarity score with
metadata similarity scores for the highest ranked images
- Minor Fixes
- Convert verbose to integer from bool in image_search.
It's already passed as integer from the main program entrypoint
- Process images with ".jpeg" extensions too
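A rough sketch of combining the plain image similarity score with the metadata similarity score for the highest ranked images. The commit does not state how the scores are combined, so the simple sum, the top_k cut-off and the dict-based inputs are all assumptions. Passing metadata_scores=None mirrors the case where image_metadata_embeddings isn't defined.

```python
def combine_scores(image_scores, metadata_scores=None, top_k=3):
    """Rank images by image similarity, boosted by metadata similarity.

    image_scores, metadata_scores: dicts mapping image id -> similarity.
    """
    # Pick the highest ranked images by plain image similarity
    top_images = sorted(image_scores, key=image_scores.get, reverse=True)[:top_k]
    combined = {}
    for image in top_images:
        score = image_scores[image]
        if metadata_scores is not None:
            # Assumption: scores are combined by simple addition
            score += metadata_scores.get(image, 0.0)
        combined[image] = score
    # Re-rank the top images by their combined score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```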
- Previously:
The text the model was trained on was being used to
re-create a semblance of the original org-mode entry.
- Now:
- Store raw entry as another key:value in each entry json too
Only return actual raw org entries in results
But create embeddings like before
- Also add link to entry in file:<filename>::<line_number> form
in property drawer of returned results
This can be used to jump to the actual entry in its original file
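A small sketch of building such a link and rendering it in a property drawer. The property name LINE and both helper functions are hypothetical; only the file:<filename>::<line_number> form comes from the change described above.

```python
def file_link(filename, line_number):
    """Build a link in file:<filename>::<line_number> form."""
    return f"file:{filename}::{line_number}"


def property_drawer(properties):
    """Render a dict of properties as an org-mode property drawer."""
    lines = [":PROPERTIES:"]
    lines += [f":{key}: {value}" for key, value in properties.items()]
    lines.append(":END:")
    return "\n".join(lines)


# Hypothetical property name "LINE", for illustration only
drawer = property_drawer({"LINE": file_link("notes.org", 42)})
```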