Commit graph

70 commits

Author SHA1 Message Date
Debanjum Singh Solanky
4d5183063c Create images directory if it doesn't exist, to store image search results 2022-07-28 21:30:31 +04:00
Debanjum Singh Solanky
a12eaa4ce0 Move Khoj image results into a child images/ directory 2022-07-28 20:45:12 +04:00
Debanjum Singh Solanky
09727ac3be Make bi-encoder return fewer results to reduce cross-encoder latency 2022-07-27 07:26:02 +04:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by setting &?r=true when querying /search API
  - It's an optional param that defaults to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired
    but for lower accuracy
2022-07-26 22:56:36 +04:00
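A minimal sketch of the optional re-ranking described in this commit, assuming a two-stage search. The names `word_overlap`, `overlap_ratio` and `search` are hypothetical stand-ins, not the project's code: the first scorer plays the role of the fast bi-encoder, the second the slower cross-encoder, and the `rerank` argument mirrors the optional `r` query param.

```python
def word_overlap(query, entry):
    """Stand-in for a fast bi-encoder score: number of shared words."""
    return float(len(set(query.lower().split()) & set(entry.lower().split())))


def overlap_ratio(query, entry):
    """Stand-in for a slower cross-encoder score: share of entry words found in the query."""
    query_words = set(query.lower().split())
    entry_words = entry.lower().split()
    if not entry_words:
        return 0.0
    return sum(word in query_words for word in entry_words) / len(entry_words)


def search(query, entries, top_k=2, rerank=False):
    # Stage 1: always shortlist candidates with the cheap scorer
    hits = sorted(entries, key=lambda e: word_overlap(query, e), reverse=True)[:top_k]
    # Stage 2: only pay for the expensive scorer when the caller asks for it
    if rerank:
        hits = sorted(hits, key=lambda e: overlap_ratio(query, e), reverse=True)
    return hits
```

With `rerank=False` the shortlist is returned as is, which is where the speed-up comes from; with `rerank=True` the same shortlist is reordered by the second scorer.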
Debanjum Singh Solanky
b1e64fd4a8 Improve search speed. Only apply filter if filter keywords in query
- Formalize filters into class with can_filter() and filter() methods

- Use can_filter() method to decide whether to apply filter and
  create deep copies of entries and embeddings for it

- Improve search speed for queries with no filters
  as deep copying entries, embeddings takes the most time
  after cross-encoder scoring when calling the /search API

  Earlier we would create deep copies of entries, embeddings
  even if the query did not contain any filter keywords
2022-07-26 22:47:26 +04:00
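The filter interface described in this commit could look roughly like the sketch below. Only the `can_filter()` and `filter()` method names come from the commit message; the `WordFilter` class, the `+word` / `-word` query syntax and the `apply_filters` helper are assumptions made for illustration.

```python
import copy
import re


class WordFilter:
    """Hypothetical filter: +word requires a word, -word blocks it."""

    required_regex = re.compile(r"(?<!\S)\+(\w+)")
    blocked_regex = re.compile(r"(?<!\S)-(\w+)")

    def can_filter(self, query):
        # Cheap check: does the query contain any filter keywords at all?
        return bool(self.required_regex.search(query) or self.blocked_regex.search(query))

    def filter(self, query, entries, embeddings):
        required = {w.lower() for w in self.required_regex.findall(query)}
        blocked = {w.lower() for w in self.blocked_regex.findall(query)}
        # Strip the filter keywords so only the natural language query remains
        query = self.blocked_regex.sub("", self.required_regex.sub("", query))
        query = " ".join(query.split())
        keep = [
            i for i, entry in enumerate(entries)
            if required <= set(re.findall(r"\w+", entry.lower()))
            and not blocked & set(re.findall(r"\w+", entry.lower()))
        ]
        return query, [entries[i] for i in keep], [embeddings[i] for i in keep]


def apply_filters(query, entries, embeddings, filters):
    applicable = [f for f in filters if f.can_filter(query)]
    if not applicable:
        # Fast path: no filter keywords, so skip the deep copies entirely
        return query, entries, embeddings
    entries, embeddings = copy.deepcopy(entries), copy.deepcopy(embeddings)
    for f in applicable:
        query, entries, embeddings = f.filter(query, entries, embeddings)
    return query, entries, embeddings
```

A query without filter keywords gets the original lists back untouched, so the cost of copying is only paid when a filter will actually run.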
Debanjum Singh Solanky
f094c86204 Trace query response performance and display timings in verbose mode 2022-07-26 21:03:53 +04:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types was mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and makes it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
0917f1574d Consolidate jsonl helper methods in a single file under utils module 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
5aad297286 Reuse logic to extract entries across symmetric, asymmetric search
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types

Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky
e220ecc00b Generate compiled form of each transaction directly in the beancount processor
- The logic for compiling a beancount entry (for later encoding) now
  completely resides in the beancount-to-jsonl processor layer

- This allows symmetric search to be generic and not be aware of
  beancount specific properties that were extracted by the
  beancount-to-jsonl processor layer

- Now symmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be location, transaction, chat etc, it doesn't have to care
2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky
06cf425314 Generate compiled form of each entry directly in the org-mode processor
- The logic for compiling an org-mode entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows asymmetric search to be generic and not be aware of
  org-mode specific properties that were extracted by the org-to-jsonl
  processor layer

- Now asymmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be mail, chat, markdown, org-mode etc, it doesn't have to care
2022-07-21 02:08:02 +04:00
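A rough sketch of the contract this commit describes: the processor layer writes each entry with 'compiled' and 'raw' keys, and the search layer reads only 'compiled'. The function names and the exact shape of the compiled string are hypothetical; only the two key names come from the commit message.

```python
import json


def compile_entry(title, tags, body):
    """Processor layer: build both forms of an org-mode style entry."""
    tag_string = f" :{':'.join(tags)}:" if tags else ""
    raw = f"* {title}{tag_string}\n{body}"
    # Assumed compiled form: plain text without org-mode markup
    compiled = ". ".join(part for part in (title, " ".join(tags), body) if part)
    return {"compiled": compiled, "raw": raw}


def to_jsonl(entries):
    return "".join(json.dumps(entry) + "\n" for entry in entries)


def texts_to_encode(jsonl):
    """Search layer: needs only the 'compiled' key, whatever the source format was."""
    return [json.loads(line)["compiled"] for line in jsonl.splitlines() if line.strip()]
```

Because the search layer never inspects 'raw', the same code would work for entries compiled from mail, chat or markdown.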
Debanjum Singh Solanky
4ead79d272 Make Notes Search Natural Language Date Aware
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings

- The (new?) model seems to understand dates. So can give more
  relevant entries if date in natural language mentioned in query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
  will give different results, with the second prioritizing entries
  with closed, scheduled dates from 1984
2022-07-21 01:06:49 +04:00
Debanjum Singh Solanky
70e70d4b15 Rename 'embed' key to more generic 'compiled' for jsonl extracted results
- While it's true those strings are going to be used to generate
  embeddings, the more generic term allows them to be used elsewhere as
  well

- Their main property is that they are processed, compiled for
  usage by semantic search

- Unlike the 'raw' string which contains the external representation
  of the data, as is
2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky
c1369233db Consistently use "entry", "score" in json response for all search types
- Had already made some progress on this earlier by updating the image
  search responses. But needed to update the text search responses to
  use lowercase entry and score

- Update khoj.el to consume the updated json response keys for text
  search
2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky
d68a9dc445 Sort extracted images before computing their embeddings
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
  across machines.
- Use case:
  A more powerful, always on machine actually computes the image embeddings regularly
  The client machine just loads these periodically to provide semantic search functionality
2022-07-20 03:51:27 +04:00
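The stable ordering described in this commit can be sketched as below. The `list_images` and `demo` names and the default file patterns are assumptions for illustration; the point is that sorting the glob results makes index `i` refer to the same image on every machine.

```python
import pathlib
import tempfile


def list_images(directory, patterns=("*.jpg", "*.jpeg")):
    paths = []
    for pattern in patterns:
        paths.extend(pathlib.Path(directory).glob(pattern))
    # glob order depends on the OS and filesystem, so sort for a stable index
    return sorted(paths)


def demo():
    # Create a few empty files in arbitrary order and list them back
    with tempfile.TemporaryDirectory() as directory:
        for name in ("b.jpg", "c.jpeg", "a.jpg"):
            (pathlib.Path(directory) / name).touch()
        return [path.name for path in list_images(directory)]
```

Embeddings computed in this order on one machine line up with the image list built on another, provided both see the same files at the same paths.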
Debanjum Singh Solanky
c4c7f38b15 Fix extracting image names from multiple image directories 2022-07-20 03:40:49 +04:00
Debanjum Singh Solanky
bdc1b9f2bb Resolve edge case errors in encoding image metadata
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
  - return empty strings in such a scenario instead of ". "
2022-07-20 02:58:43 +04:00
Debanjum Singh Solanky
2a5445216c Image input directory not required by collate result as image_name already absolute path 2022-07-20 02:56:23 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
b673d26a12 Extract Entries in a standardized format across text search types
Issue:
 - Had different schema of extracted entries for symmetric_ledger vs asymmetric

 - Entry extraction for asymmetric was dirty, relying on cryptic
   indices to store raw entry vs cleaned entry meant to be passed to embeddings

 - This was pushing the load of figuring out what property to extract
   from each entry to downstream processes like the filters

 - This limited the filters to only work for asymmetric search, not for
   symmetric_ledger

- Fix
   - Use consistent format for extracted entries
     {
       'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
       'raw'  : raw_entry_string_meant_to_be_passed_to_use
     }

 - Result
   - Now filters can be applied across search types, and the specific
     field they should be applied on can be configured by each search
     type
2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 is more accurate
  - The exact previous model isn't benchmarked but based on the
    performance of the closest model to it. Seems like the new model
    may be similar in speed and size

- On very preliminary evaluation of the model, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
e0d8398b27 Normalize metadata match score to work better with image match score
- Metadata match scores were consistently higher by a factor of ~3x
  wrt image match scores. This was resulting in all results being
  from the metadata match with query and none from the image match
  with query.
- Scaling the metadata match scores down by a scaling factor seems to
  more consistently give a blend of results from both image and
  metadata matches
2022-07-16 03:39:33 +04:00
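To illustrate the effect of the scaling, here is a small sketch. The commit does not say how the two score lists are combined, so pooling both into one ranked list is an assumption, as are the `blend` function and its `metadata_scale` parameter.

```python
def blend(image_scores, metadata_scores, top_k, metadata_scale=1.0):
    """Pool image and metadata hits into one ranking, scaling metadata scores down."""
    pooled = [(score, name, "image") for name, score in image_scores.items()]
    pooled += [
        (score * metadata_scale, name, "metadata")
        for name, score in metadata_scores.items()
    ]
    pooled.sort(key=lambda hit: hit[0], reverse=True)
    return [(name, source) for _, name, source in pooled[:top_k]]
```

With metadata scores roughly three times larger than image scores, an unscaled ranking contains only metadata hits, while a scale of about 1/3 lets hits from both sources appear.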
Debanjum Singh Solanky
a3fc82817d Log and continue on image metadata encoding error due to Tensor size mismatch 2022-07-16 03:39:19 +04:00
Debanjum Singh Solanky
f26d0ddbbd Minor fix to asymmetric search when no entries returned 2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
4e27ae0577 Ease access to image result for given query by image_search
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
  directly in browser
- Return image, metadata scores for each image in response as well
  This should help get a better sense of image scores along both
  XMP metadata and whole image axis
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
0e979587e0 Add configurable filter support to Symmetric Ledger Search 2022-07-14 23:40:41 +04:00
Debanjum Singh Solanky
b82aef26bf Make filters to apply before semantic search configurable
Details
--
- The filters to apply are configured for each type in the search controller
- Multiple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
  query, entries and embeddings before semantic search is performed

Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky
c92789d20a Extract explicit pre-search filter function into a separate module
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query

Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky
6d7ab50113 Run Explicit Filter on Entries, Embeddings before Semantic Search for Query
- Issue
  - Explicit filtering was earlier being done after search by bi-encoder
    but before re-ranking by cross-encoder

  - This was limiting the quality of results being returned. As the
    bi-encoder returned results which were going to be excluded. So the
    burden of improving those limited results post filtering was on the
    cross-encoder by re-ranking the remaining results based on query

- Fix
  - Given the embeddings corresponding to an entry are at the same index
    in their respective lists. We can run the filter for blocked,
    required words before the search by the bi-encoder model. And limit
    entries, embeddings being considered for the current query

- Result
  - Semantic search by the bi-encoder gets to return most relevant
    results for the query, knowing that the results aren't going to be
    filtered out after. So the cross-encoder shoulders less of the
    burden of improving results

- Corollary
  - This pre-filtering technique allows us to apply other explicit
    filters on entries relevant for the current query
    - E.g limit search for entries within date/time specified in query
2022-07-12 18:25:42 +04:00
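The difference between filtering after and before the bi-encoder search can be sketched with toy data. All names here are hypothetical and a plain dot product stands in for the bi-encoder; the sketch relies on the point made in the commit, that `entries[i]` and `embeddings[i]` share an index.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_embedding, entries, embeddings, k):
    order = sorted(
        range(len(entries)),
        key=lambda i: dot(query_embedding, embeddings[i]),
        reverse=True,
    )
    return [entries[i] for i in order[:k]]


def post_filter_search(query_embedding, entries, embeddings, blocked_word, k):
    # Earlier behaviour: search first, then drop blocked hits from the short list
    hits = top_k(query_embedding, entries, embeddings, k)
    return [entry for entry in hits if blocked_word not in entry.split()]


def pre_filter_search(query_embedding, entries, embeddings, blocked_word, k):
    # New behaviour: drop blocked entries and their embeddings, then search
    keep = [i for i, entry in enumerate(entries) if blocked_word not in entry.split()]
    return top_k(
        query_embedding,
        [entries[i] for i in keep],
        [embeddings[i] for i in keep],
        k,
    )
```

When the top hits all contain a blocked word, post-filtering can leave nothing for the re-ranker, while pre-filtering still returns `k` allowed entries.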
Debanjum Singh Solanky
7677465f23 Fix passing of device to setup method in /reload, /regenerate API
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky
eda4b65ddb Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU
- Move embeddings to CUDA GPU for compute, when available
- Normalize embeddings and Use Dot Product instead of Cosine
2022-06-30 00:59:57 +04:00
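Why normalizing allows the swap from cosine similarity to dot product: for unit-length vectors the two are equal, so the norm computation can be done once up front instead of on every query. A small pure-Python check, with hypothetical helper names:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def scores_match(a, b):
    # Dot product of the normalized vectors equals cosine of the originals
    return math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
```

For example, `scores_match([3.0, 4.0], [4.0, 3.0])` holds, with both sides close to 0.96.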
Debanjum Singh Solanky
1c5754bf95 Simplify storing Tags in OrgNode object
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
  - Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
  for code readability
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
de23fc2051 Revert Add Scheduled, Deadline date to Model Embeddings for Date Aware Search
Sentence Transformer MSMarco Model isn't date aware
So no use of adding scheduled, deadline dates to model embeddings for consideration

This reverts commit a2a08d1354.
2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky
a2a08d1354 Add Scheduled, Deadline date to Model Embeddings for Date Aware Search 2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky
1c3a1420f8 Update asymmetric extract_entries method to handle uncompressed jsonl
This is similar to what was done for the symmetric extract_entries
method earlier
2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky
502c68d4f8 Remove trailing escape sequence in ledger search response entries
- Fix loading entries from jsonl in extract_entries method
  - Only extract Title from jsonl of each entry
    This is the only thing written to the jsonl for symmetric ledger
  - This fixes the trailing escape seq in loaded entries
  - Remove the need for semantic-search.el response reader to do pointless complicated cleanup

- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
  Both methods were doing similar work

- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl
2022-02-26 17:48:45 -05:00
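A minimal sketch of a loader that handles both gzip and uncompressed jsonl. Only the `load_jsonl` name comes from the commit; choosing the opener by file suffix is an assumption, since the commit does not say how compression is detected, and `demo` is a hypothetical helper.

```python
import gzip
import json
import pathlib
import tempfile


def load_jsonl(path):
    path = pathlib.Path(path)
    # Assumption: compressed files are recognised by their .gz suffix
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file if line.strip()]


def demo():
    # Write the same entries in both forms and load each back
    entries = [{"Title": "Groceries"}, {"Title": "Rent"}]
    text = "".join(json.dumps(entry) + "\n" for entry in entries)
    with tempfile.TemporaryDirectory() as directory:
        plain = pathlib.Path(directory) / "ledger.jsonl"
        compressed = pathlib.Path(directory) / "ledger.jsonl.gz"
        plain.write_text(text, encoding="utf-8")
        with gzip.open(compressed, "wt", encoding="utf-8") as gz_file:
            gz_file.write(text)
        return load_jsonl(plain), load_jsonl(compressed)
```

Both files yield the same list of entries, so callers need not know which form is on disk.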
Debanjum Singh Solanky
179153dc5a Rename RawConfig Types for Consistency
- Naming convention - [ContentType][ConfigType]Config
  - Where [ConfigType] ~ Content, Search, Processor
  - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation

- Current Configs:
  - Content:
    - Org Notes
    - Org Music
    - Image
    - Ledger/Beancount

  - Search:
     - Asymmetric
     - Symmetric
     - Image

  - Processor:
    - Conversation
2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky
510faa1904 Save Image Search Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
934ec233b0 Add Search Config for Symmetric Model. Save Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
b63026d97c Save Asymmetric Search Model to Disk
- Improve application load time
- Remove dependence on internet to startup application and perform semantic search
2022-01-14 17:36:27 -05:00
Saba
97a6dfaa1e Use default value False for verbose parameter, and small changes
Pass config as parameter to initialize_search, change name of API methods to handle config CRUD operations, and initialize config to FullConfig
2021-12-11 14:13:14 -05:00
Saba
9536358d34 Fix key error model_name issue by upgrade sentence-transformers version
Refer to https://github.com/UKPLab/sentence-transformers/issues/1241
Also use verbose flag passed through function parameters in image_search
2021-12-11 11:58:19 -05:00
Saba
ce7a751e6b Fix passing verbose flag down in symmetric_ledger.py 2021-12-11 11:36:32 -05:00
Saba
d65190c3ee Update unit tests, files after removing model suffix from config types 2021-12-09 08:50:38 -05:00
Saba
9b16cdbb41 Use past tense for verbose log 2021-12-04 11:45:44 -05:00
Saba
10e4065e05 Consolidate the search config models and pass verbose as a top level flag 2021-12-04 11:43:48 -05:00
Saba
5b80b87379 Streamline None checking in initialize_search 2021-11-28 12:05:04 -05:00
Saba
66183cc298 Working API request body parsing to /post config! 2021-11-28 11:12:26 -05:00
debanjum
46661b3057 Ensure top_k never more than total entries to run symmetric search on 2021-11-16 11:32:21 -08:00
debanjum
8c858d1a94 Reduce symmetric search results for cross-encoder to re-rank to improve search speed 2021-11-16 11:31:19 -08:00