Commit graph

130 commits

Author SHA1 Message Date
Debanjum Singh Solanky
7640e2ab0c Wrap attempt to extract dates from entry in try/catch
- Not all YYYY-MM-DD strings in entry are necessarily dates
2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky
9de2097182 Fix date filter usage with multi word queries. Simplify date regex 2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky
dcb6fe479e Fix date_filter query, entry in query range check. Add tests for it
- Fix date_filter date_in_entry within query range check
  - Extracted_date_range is in [included_date, excluded_date) format
  - But check was checking for date_in_entry <= excluded_date
  - Fixed it to do date_in_entry < excluded_date

- Fix removal of date filter from query
- Add tests for date_filter
2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky
011f81fac5 Fix date_filter to handle non overlapping date ranges 2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky
70ac35b2a5 Compute Date Range to filter entries to, from Comparators, Dates in Query 2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky
e6db3e3d00 Prefer Dates From Future only when specific words in date string
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
  dates in the future but dateparser's parse doesn't parse it at all.
  E.g parse('5 months from now') returns nothing

- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
  parse('5 months') to dateparser.parse works as expected
2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky
4a201d52af Add, test date filter regex and date parsing to get natural date range 2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky
b54588717f Filter for entries with dates specified by user in query
- Create Date filter
  - Users can pass dates in YYYY-MM-DD format in their query
- Use it to filter asymmetric search to user specified dates
2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky
b82aef26bf Make filters to apply before semantic search configurable
Details
--
- The filters to apply are configured for each type in the search controller
- Muliple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
  query, entries and embeddings before semantic search is performed

Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky
c92789d20a Extract explicit pre-search filter function into a separate module
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query

Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky
6d7ab50113 Run Explicit Filter on Entries, Embeddings before Semantic Search for Query
- Issue
  - Explicit filtering was earlier being done after search by bi-encoder
    but before re-ranking by cross-encoder

  - This was limiting the quality of results being returned. As the
    bi-encoder returned results which were going to be excluded. So the
    burden of improving those limited results post filtering was on the
    cross-encoder by re-ranking the remaining results based on query

- Fix
  - Given the embeddings corresponding to an entry are at the same index
    in their respective lists. We can run the filter for blocked,
    required words before the search by the bi-encoder model. And limit
    entries, embeddings being considered for the current query

- Result
  - Semantic search by the bi-encoder gets to return most relevant
    results for the query, knowing that the results aren't going to be
    filtered out after. So the cross-encoder shoulders less of the
    burden of improving results

- Corollary
  - This pre-filtering technique allows us to apply other explicit
    filters on entries relevant for the current query
    - E.g limit search for entries within date/time specified in query
2022-07-12 18:25:42 +04:00
Debanjum Singh Solanky
7677465f23 Fix passing of device to setup method in /reload, /regenerate API
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky
eda4b65ddb Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU
- Move embeddings to CUDA GPU for compute, when available
- Normalize embeddings and Use Dot Product instead of Cosine
2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky
b89fc2f4ac Add /reload API to reload model embeddings and entries from file
- The reload API adds the ability to separate out the loading of
  embeddings from file without having to restart app or (re-)generate embeddings

- Before this the only way to load model from file was by restarting app
- The other way to reload the model embeddings by regenerating them
  was to expensive for larger datasets

- This unlocks at least 1 use-case, where
  - we regenerate model via an app instance running on a separate server and
  - just reload the generated embeddings on the client device

  - This allows us to offload the expensive embedding generation
    compute to a background server while letting

  - This avoids having to (re-)restart application on client device or
    be forced to generate embeddings on the client device itself

  - But it requires the model relevant files to be synced to the client device
    This can be done with any file syncing application like Syncthing

  - We can then call /regenerate on server and /reload client on a
    regular schedule to keep our data up to date on semantic search
2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
f5d6d1e752 Tiny style fix to separate functions by 2 newlines 2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
85fbe1c42b Normalize org notes path to be relative to home directory
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
  - Notes are being processed on a different machine and used on a different machine
  - But the notes directory is in the same location relative to home on both the machines
2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky
094eaf3fcc Fix minor bugs in OrgNode parser
- Bugs discovered from writing org-node tests
2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky
36495038dd Fix storing parsed CLOSED date in OrgNode
The CLOSED date was getting parsed but not stored
Adding setClosed at start also fixed the issue
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
1c5754bf95 Simplify storing Tags in OrgNode object
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
  - Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
  for code readability
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
51a43245d3 Escape square brackets in file+heading based org-mode links 2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky
04610f453a Include scheduled date, deadline date and close date in repr of org node
- Now that excluding the times line from the raw body of node,
  show it in repr so user can see it for reference

- But the model doesn't need to see it for it's embeddings to be
  confused by
2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky
367d7377df Ignore scheduled, closed, deadline time and logbook start, end in org node body
- Gives cleaner embeddings for semantic search
- Hopefully improves results and reduces size, compute
2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
b77ccadcba Make property key regex more strict. Property key has to be alphanumeric 2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
ac9d746444 Fix Tags extraction in Org Node parser
- Previous version required two tags at least to work, not sure why
- Fixed it to extract all tags, even if only one tag in heading
2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky
fb86be8cd9 Add ID, File+Heading based Links to Org-Mode Entries
- Add links to property drawer
- This ensures results returned by semantic search contain these links
- This allows the user to jump to entry within original file for context
- The ID, file+heading based links are more robust to find relevant
  entry in original file than the line no based link,
  as edits being done by user to original files between embedding regenerations
2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky
de23fc2051 Revert Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search
Sentence Transformer MSMarco Model isn't date aware
So no use of adding scheduled, deadline dates to model embeddings for consideration

This reverts commit a2a08d1354.
2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky
a2a08d1354 Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search 2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky
cfbd5c4ecc Update global model on regenerate via API 2022-06-17 00:49:06 +03:00
Debanjum Singh Solanky
c78bf84eef Introduce search api endpoint that auto infers search type intent
- Introduce prompt for GPT to automatically extract user's search intent
- Expose new search api endpoint to use that to set SearchType being
  passed to search API
- Currently meant as an experimental API to gauge usefulness,
  extendability. Evaluating for phone or voice use-case
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
8ef7917014 Fix json format passed in prompt to GPT 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
f57b7f65ea Wrap prompts for GPT in triple quotes to improve prompt readability
To prompt improve readability:
- Remove newline escape sequence and use actual newline directly
  - This avoids one long line of text as prompt and
- Remove escaping of double quotes
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1eba7b1c6f Use empty_escape_sequence constant to strip response text from gpt 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1c3a1420f8 Update asymmetric extract_entries method to handle uncompressed jsonl
This is similar to what was done for the symmetric extract_entries
method earlier
2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky
3d8a07f252 Extract empty line escape sequences var into constants file for reuse 2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky
bb5d0d8908 Improve Semantic Search Buffer Names in Emacs
- Allow multiple semantic searches buffers to exist simultaneously
  - Uniquify semantic search buffer namew
- Add query and search-type to semantic search buffer name for easier
  disambiguration, search and find appropriate
2022-02-26 18:30:14 -05:00
Debanjum Singh Solanky
b68558651b Improve Extraction of Beancount Entries
- Only extract entries starting with YYYY-MM-DD from Beancount
- Strip Trailing Escape Sequences from Entries
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
b3ac2dd730 Improve Results Rendered on Emacs from Semantic Search on Ledger
- Add search query to top of buffer as Beancount comment
- Remove trailing ) from response
- Separate entries by empty line
- Load beancount-mode in semantic search on ledger buffer
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
502c68d4f8 Remove trailling escape sequence in ledger search response entries
- Fix loading entries from jsonl in extract_entries method
  - Only extract Title from jsonl of each entry
    This is the only thing written to the jsonl for symmetric ledger
  - This fixes the trailing escape seq in loaded entries
  - Remove the need for semantic-search.el response reader to do pointless complicated cleanup

- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
  Both methods were doing similar work

- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
248aa632c0 Do not throw warning for beancount files with .beancount extension 2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
76cd63f4bd Fix count of processed jsonl entries shown to user by ledger processor
Count lines not chars
2022-02-26 17:46:06 -05:00
Saba
33bc62dc19 Fix type of use_xmp_metadata to be bool, rather than str 2022-01-24 21:53:26 -05:00
Debanjum Singh Solanky
179153dc5a Rename RawConfig Types for Consistency
- Naming convention - [ContentType][ConfigType]Config
  - Where [ConfigType] ~ Content, Search, Processor
  - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation

- Current Configs:
  - Content:
    - Org Notes
    - Org Music
    - Image
    - Ledger/Beancount

  - Search:
     - Asymmetric
     - Symmetric
     - Image

  - Processor:
    - Conversation
2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky
c64e0c2965 Load model from HuggingFace if model_directory unset in config YAML
- Do not save/load the model to/from disk when model_directory unset
in config.yml
- Add symmetric search default config to cli.py
2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
510faa1904 Save Image Search Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
934ec233b0 Add Search Config for Symmetric Model. Save Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
b63026d97c Save Asymmetric Search Model to Disk
- Improve application load time
- Remove dependence on internet to startup application and perform semantic search
2022-01-14 17:36:27 -05:00
Debanjum Singh Solanky
2e53fbc844 Fix the user intent extraction prompt for GPT. Clean up chatbot test 2022-01-12 10:36:01 -05:00
Debanjum Singh Solanky
ea28897cdd Remove deprecated conversation_history field from config 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
5a686b7be9 Add logs for chat bot in verbose mode 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
6dc2a99d35 Merge branch 'master' of github.com:debanjum/semantic-search into add-summarize-capability-to-chat-bot
- Fix openai_api_key being set in ConfigProcessorConfig
- Merge addition of config UI and config instantiation updates
2021-12-20 13:30:42 +05:30