- What
- Hash the entries and compare to find new/updated entries
- Reuse embeddings encoded for existing entries
- Only encode embeddings for updated or new entries
- Merge the existing and new entries and embeddings to get the updated
entries and embeddings (see the sketch below)
- Why
- Most note text entries are expected to remain unchanged
across time, so reusing their earlier encoded embeddings should
significantly speed up embedding updates
- Previously we were regenerating embeddings for all entries,
even if they were unchanged from previous runs
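A minimal sketch of the hash-based incremental update described above, assuming
a SentenceTransformer-style model exposing an encode(list_of_texts) method; the
function and variable names are illustrative, not the actual khoj implementation:

import hashlib

def update_embeddings(entries, previous_entries, previous_embeddings, model):
    "Re-encode only new or changed entries; reuse embeddings for unchanged ones"
    def entry_hash(entry):
        return hashlib.sha256(entry['compiled'].encode('utf-8')).hexdigest()

    # Map hash of each previously seen entry to its existing embedding
    previous_embedding_by_hash = {
        entry_hash(entry): embedding
        for entry, embedding in zip(previous_entries, previous_embeddings)
    }

    embeddings, to_encode = [], []
    for position, entry in enumerate(entries):
        existing_embedding = previous_embedding_by_hash.get(entry_hash(entry))
        embeddings.append(existing_embedding)
        if existing_embedding is None:
            to_encode.append(position)  # new or updated entry

    # Encode embeddings only for the new or updated entries
    if to_encode:
        fresh_embeddings = model.encode([entries[i]['compiled'] for i in to_encode])
        for position, embedding in zip(to_encode, fresh_embeddings):
            embeddings[position] = embedding

    return embeddings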
- Pass file associated with entries to the markdown, beancount to-jsonl converters
- Add File, Word, Date Filters to Ledger, Markdown Types
- Word, Date Filters were accidentally removed from the above types yesterday
- File Filter is the only filter that newly got added
- This will help filter queries to the org content type using the file filter
- Do not explicitly specify the items to extract from the json of each
entry in text_search, as not all text search content types set the
file field in their jsonl converters
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match Python logging framework levels (see the sketch below)
- verbose = 0 maps to logging.WARN
- verbose = 1 maps to logging.INFO
- verbose >=2 maps to logging.DEBUG
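A minimal sketch of this remapping; the helper name is illustrative:

import logging

def verbosity_to_log_level(verbose: int) -> int:
    "Map the app's verbose count to a Python logging level"
    if verbose <= 0:
        return logging.WARN
    elif verbose == 1:
        return logging.INFO
    else:  # verbose >= 2
        return logging.DEBUG

logging.basicConfig(level=verbosity_to_log_level(1))  # e.g. verbose = 1 maps to INFO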
- Minor clean-up of app: unused modules, conversation file opening
- Reason:
Allow natural language search on markdown-based notes, documentation,
websites etc.
- Details:
- Create markdown processor to extract Markdown entries (identified by
heading) into the standard jsonl format required by text_search (see
the sketch below)
- Update API, Configs to support interfacing with new markdown type
- Update Emacs, Web clients to support interfacing with new markdown
type via API
- Update Readme to mention that markdown is also supported
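A minimal sketch of splitting a markdown file into heading-delimited entries and
writing them to the jsonl format text_search expects; the function name, regex and
keys written per entry are illustrative assumptions, not the exact khoj code:

import json
import re

def markdown_file_to_jsonl(markdown_file: str, jsonl_file: str):
    "Extract one entry per markdown heading and serialize the entries to jsonl"
    with open(markdown_file, encoding='utf-8') as f:
        markdown = f.read()

    # Split before each heading line, keeping the heading with its body
    entries = [entry.strip()
               for entry in re.split(r'^(?=#+\s)', markdown, flags=re.MULTILINE)
               if entry.strip()]

    with open(jsonl_file, 'w', encoding='utf-8') as f:
        for entry in entries:
            f.write(json.dumps({'compiled': entry, 'raw': entry, 'file': markdown_file}) + '\n')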
Closes #35
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
- The logic for compiling a beancount entry (for later encoding) now
completely resides in the beancount-to-jsonl processor layer
- This allows symmetric search to be generic and not be aware of
beancount specific properties that were extracted by the
beancount-to-jsonl processor layer
- Now symmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry. What original text the
entry was compiled from is irrelevant to it. The original text
could be location, transaction, chat etc, it doesn't have to care
- The logic for compiling an org-mode entry (for later encoding) now
completely resides in the org-to-jsonl processor layer
- This allows asymmetric search to be generic and not be aware of
org-mode specific properties that were extracted by the org-to-jsonl
processor layer
- Now asymmetric search just expects the jsonl to (at least) have the
'compiled' and 'raw' keys for each entry (see the example below). What
original text the entry was compiled from is irrelevant to it. The
original text could be mail, chat, markdown, org-mode etc, it doesn't
have to care
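A minimal illustration of the per-entry contract both search types rely on: each
jsonl line only needs a 'compiled' key (text to encode) and a 'raw' key (text to
show); the sample content is made up for illustration:

import json

entry = {
    'compiled': 'Went Surfing with Friends Scheduled: 1984-06-01',
    'raw': '* Went Surfing with Friends\nSCHEDULED: <1984-06-01>',
}
jsonl_line = json.dumps(entry)  # one such line per entry in the jsonl file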
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings
- The (new?) model seems to understand dates, so it can return more
relevant entries if a date is mentioned in natural language in the query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
will give different results, with the second prioritizing entries
with closed or scheduled dates from 1984 (see the sketch below)
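A minimal sketch of folding an entry's scheduled and closed dates into the
compiled text that gets embedded; the function and parameter names are
illustrative assumptions, with dates assumed to be datetime.date values:

def compile_entry_with_dates(heading, body, scheduled=None, closed=None):
    "Include scheduled and closed dates in the text that gets encoded"
    compiled = f'{heading}\n{body}'
    if scheduled:
        compiled += f'\nScheduled: {scheduled.strftime("%Y-%m-%d")}'
    if closed:
        compiled += f'\nClosed: {closed.strftime("%Y-%m-%d")}'
    return compiled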
- This is still clunky but it should be committable
- General enough that it'll work even when a user's notes are not in the home directory
- While solving for the special case where:
- Notes are processed on one machine and used on another machine
- But the notes directory is in the same location relative to home on both machines (see the sketch below)
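A minimal sketch of the idea: store entry file paths relative to home so links
stay valid across machines with the same notes layout under home; a plain
illustration, not the exact khoj logic:

from pathlib import Path

def normalize_filename(filename: str) -> str:
    "Rewrite an absolute path under the current home directory to a ~ based path"
    path = Path(filename).expanduser().absolute()
    try:
        return str(Path('~') / path.relative_to(Path.home()))
    except ValueError:
        return str(path)  # path is outside home, keep it absolute

# e.g. '/home/alice/notes/todo.org' -> '~/notes/todo.org' on the indexing machine,
# which resolves correctly on any machine whose notes also live under ~/notes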
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
- Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-strings on separate lines
for code readability
- Now that the times line is excluded from the raw body of the node,
show it in the repr so the user can see it for reference
- But the model doesn't need to see it, and its embeddings would only
get confused by it
- Add links to property drawer
- This ensures results returned by semantic search contain these links
- This allows the user to jump to entry within original file for context
- The ID and file+heading based links are more robust for finding the
relevant entry in the original file than line number based links,
as the user may edit the original files between embedding regenerations
Sentence Transformer MSMarco Model isn't date aware
So there's no use adding scheduled, deadline dates to the model embeddings for consideration
This reverts commit a2a08d1354.
- Introduce prompt for GPT to automatically extract user's search intent
- Expose a new search API endpoint that uses it to set the SearchType
passed to the search API
- Currently meant as an experimental API to gauge usefulness and
extensibility. Evaluating it for phone or voice use-cases (see the
sketch below)
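A minimal sketch of such an intent-extraction call using the legacy
openai.Completion API available at the time; the prompt wording, engine and
function names are illustrative assumptions, not the actual khoj prompt:

import openai

def extract_search_type(query: str, api_key: str) -> str:
    "Ask GPT which content type (org, markdown, ledger, image) the query targets"
    openai.api_key = api_key
    prompt = (
        "Classify the search query into one of: org, markdown, ledger, image.\n\n"
        "Q: Where did I go surfing last year?\nA: org\n\n"
        f"Q: {query}\nA:"
    )
    response = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=5, temperature=0, stop="\n")
    return response["choices"][0]["text"].strip()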
To improve prompt readability:
- Remove newline escape sequences and use actual newlines directly
- This avoids the prompt being one long line of text
- Remove escaping of double quotes
- Fix loading entries from jsonl in extract_entries method
- Only extract Title from jsonl of each entry
This is the only thing written to the jsonl for symmetric ledger
- This fixes the trailing escape sequence in loaded entries
- Remove the need for semantic-search.el response reader to do pointless complicated cleanup
- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
Both methods were doing similar work
- Make load_jsonl handle loading entries from both gzip compressed and uncompressed jsonl files (see the sketch below)
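A minimal sketch of a load_jsonl helper that transparently reads gzip or plain
jsonl files; an illustration, not the exact khoj code:

import gzip
import json
from pathlib import Path

def load_jsonl(jsonl_file: str) -> list:
    "Load one dict per line from a .jsonl or .jsonl.gz file"
    open_file = gzip.open if Path(jsonl_file).suffix == '.gz' else open
    entries = []
    with open_file(jsonl_file, 'rt', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))
    return entries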
Conversation logs structure now has session info too, instead of just chat info
Session info will allow loading past conversation summaries as context for the AI
in new conversations (see the sketch after the structure below)
{
    "session": [
        {
            "summary": <chat_session_summary>,
            "session-start": <session_start_index_in_chat_log>,
            "session-end": <session_end_index_in_chat_log>
        }
    ],
    "chat": [
        {
            "intent": <intent-object>,
            "trigger-emotion": <emotion-triggered-by-message>,
            "by": <AI|Human>,
            "message": <chat_message>,
            "created": <message_created_date>
        }
    ]
}
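A minimal sketch of how past session summaries could be folded into the context
for a new conversation; key names follow the structure above, the function is an
illustrative assumption rather than the actual khoj implementation:

def get_conversation_context(conversation_log: dict, max_sessions: int = 3) -> str:
    "Concatenate summaries of the most recent chat sessions as context for the AI"
    summaries = [session["summary"] for session in conversation_log.get("session", [])]
    return "\n".join(summaries[-max_sessions:])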
- Allow conversing with the user using GPT's contextually aware, generative capability
- Extract metadata and user intent from the user's messages using GPT's general understanding