1086 commits

Author SHA1 Message Date
Debanjum Singh Solanky
e8b7f06709 Ignore fewer files. Make gitignore more specific 2021-09-30 04:35:22 -07:00
Debanjum Singh Solanky
516f28b082 Merge branch 'master' of github.com:debanjum/semantic-search 2021-09-30 04:17:32 -07:00
Debanjum Singh Solanky
c200189385 Test notes search with explicit include, exclude filters 2021-09-30 04:13:40 -07:00
Debanjum Singh Solanky
d2905c4be6 Move tests out to project root. Use absolute import in project
tests/ directory in project root is more standard.
Just had to use absolute path for internal module imports to get it to
work
2021-09-30 04:12:14 -07:00
Debanjum Singh Solanky
58bb420f69 Fix image_metadata argument ordering bug. Add E2E image search test
- Image search test seems a little flaky
- Interchanged argument was causing inaccurate results earlier
2021-09-30 03:30:47 -07:00
Debanjum Singh Solanky
d5597442f4 Modularize Code. Wrap Search, Model Config in Classes. Add Tests
Details
  - Rename method query_* to query in search_types for standardization
  - Wrapping Config code in classes simplified mocking test config
  - Reduce args being passed to a function by passing them as a single
    argument wrapped in a class
  - Minimize setup in main.py:__main__. Put most of it into functions
    These functions can be mocked if required in tests later too

Setup Flow:
  CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel
2021-09-30 02:04:04 -07:00
Debanjum Singh Solanky
f4dd9cd117 Use type specific model for other search types too. Expose them via SearchModels
- Wrap Image, Music, Ledger search into the type of SearchModel they use
  Similar to what was done for notes model by wrapping its config
  into an AsymmetricSearchModel.

- Use the uber wrapper class to expose all type specific search models
2021-09-29 21:09:42 -07:00
Debanjum Singh Solanky
352d2930ee Use multiple threads to generate model embeddings. Other minor formatting 2021-09-29 20:47:58 -07:00
Debanjum Singh Solanky
e22e0b41e3 Wrap asymmetric search model into SearchModels. Test notes search end-to-end
- Wrap asymmetric search model parameters into AsymmetricSearchModel class
- Create wrapper for all search type models. Put notes search model into it
- Test notes search end-to-end from client API layer to results.
  Use model built on test data
2021-09-29 20:47:35 -07:00
Debanjum Singh Solanky
cde11a2331 Wrap search type enablement status in a search settings class
- Cleaner, more idiomatic usage of a global variable
- Simplifies mocking when testing client in pytest, as the setting is
  wrapped in an object rather than a simple type, so it is passed
  around by reference
2021-09-29 19:18:33 -07:00
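A minimal sketch of why wrapping the enablement flags in an object helps with mocking. The class name `SearchSettings`, its fields, and the helper function are hypothetical illustrations, not names taken from the repository:

```python
from dataclasses import dataclass


@dataclass
class SearchSettings:
    # Hypothetical settings holder; the field names are assumptions.
    notes_search_enabled: bool = False
    image_search_enabled: bool = False


# One module-level object that every caller shares by reference.
search_settings = SearchSettings()


def is_notes_search_enabled(settings: SearchSettings = search_settings) -> bool:
    # The flag is read from the shared object at call time, so a change
    # made anywhere (for example in a test) is visible here.
    return settings.notes_search_enabled


# A test can flip the flag on the shared object without rebinding a global.
search_settings.notes_search_enabled = True
```

With a bare module-level `bool`, code that imported the value would keep its own copy of the binding, so a test would have to patch every importer instead of mutating one object.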
Debanjum Singh Solanky
81ce0cacc3 Only allow supported search types to /search, /regenerate APIs
- Use a SearchType to limit types that can be passed by user
- FastAPI automatically validates type passed in query param
- Available type options show up in Swagger UI, FastAPI docs
- Controller code looks neater without string comparisons for type
- Test invalid, valid search types via pytest
2021-09-29 19:12:56 -07:00
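The idea of restricting search types to an enumeration can be sketched with the standard library alone. The member names below are assumptions based on the search types mentioned elsewhere in this log, and the FastAPI wiring (declaring the query parameter with this enum type) is left out:

```python
from enum import Enum


class SearchType(str, Enum):
    # Assumed members; the actual enum in the project may differ.
    notes = "notes"
    ledger = "ledger"
    music = "music"
    image = "image"


def parse_search_type(raw: str) -> SearchType:
    # Lookup by value raises ValueError for unsupported types, which
    # replaces chains of string comparisons in the controller.
    return SearchType(raw)
```

When a query parameter is annotated with an enum like this, FastAPI performs the equivalent validation itself and lists the allowed values in its generated docs.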
Debanjum Singh Solanky
5db08c5293 Set query as heading of notes search results in Emacs Org buffer 2021-09-29 13:30:15 -07:00
Debanjum Singh Solanky
150593c776 Update Readme. Acknowledge PyExifTool and Minor Fixes 2021-09-16 12:39:42 -07:00
Debanjum Singh Solanky
fdb60a8dcf Set Query as Heading of Image Search Results Emacs Buffer 2021-09-16 12:30:06 -07:00
Debanjum Singh Solanky
169ddcc8c6 Make Using XMP Metadata to Enhance Image Search Optional, Configurable
- Break the compute embeddings method into separate methods:
  compute_image_embeddings and compute_metadata_embeddings

- If image_metadata_embeddings isn't defined, do not use it to enhance
  search results. Given image_metadata_embeddings wouldn't be defined
  if use_xmp_metadata is False, we can avoid unnecessary addition of
  args to query method
2021-09-16 12:01:05 -07:00
Debanjum Singh Solanky
a4a23d7a72 Batch encode XMP metadata from images too for image_search 2021-09-16 11:11:36 -07:00
Debanjum Singh Solanky
3afe054312 Make image batch size to encode configurable via config.yml 2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky
41c328dae0 Batch encode images to keep memory consumption manageable
- Issue:
  Process would get killed while encoding images
  for consuming too much memory

- Fix:
  - Encode images in batches and append to image_embeddings
  - No need to use copy or deep_copy anymore with batch processing.
    It would earlier throw too many files open error

Other Changes:
  - Use tqdm to see progress even when using batch
  - See progress bar of encoding independent of verbosity (for now)
2021-09-16 10:15:54 -07:00
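A rough sketch of the batching fix described above, assuming a hypothetical `encode` callable that stands in for the model's encoder (it takes a list of image paths and returns one embedding per path). Function names and the default batch size are illustrative:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    image_paths: Sequence[str],
    encode: Callable[[Sequence[str]], List],
    batch_size: int = 50,
) -> List:
    # Only one batch is handed to the encoder at a time, which bounds
    # peak memory; results are appended to a single embeddings list.
    image_embeddings: List = []
    for batch in batched(image_paths, batch_size):
        image_embeddings.extend(encode(batch))
    return image_embeddings
```

Opening the image files inside the per-batch step, rather than all up front, is also what would avoid a "too many open files" error of the kind mentioned in the commit.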
Debanjum Singh Solanky
d8abbc0552 Use XMP metadata in images to improve image search
- Details
  - The CLIP model can represent images, text in the same vector space

  - Enhance CLIP's image understanding by augmenting the plain image
    with its text-based metadata.
    Specifically with any subject, description XMP tags on the image

  - Improve results by combining plain image similarity score with
    metadata similarity scores for the highest ranked images

- Minor Fixes
  - Convert verbose to integer from bool in image_search.
    It's already passed as integer from the main program entrypoint

  - Process images with ".jpeg" extensions too
2021-09-16 08:55:20 -07:00
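One way the score combination could look, as a sketch under stated assumptions: embeddings are plain lists of floats, similarity is cosine similarity, and the two scores are simply summed for the top-ranked images. The commit does not state the actual combining rule, so the sum and the function names here are assumptions:

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(
    query: Sequence[float],
    image_embeddings: List[Sequence[float]],
    metadata_embeddings: List[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    # Step 1: rank all images by plain image similarity.
    image_scores = [(i, cosine(query, emb)) for i, emb in enumerate(image_embeddings)]
    top = sorted(image_scores, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Step 2: for the highest ranked images only, add the metadata similarity.
    combined = [(i, score + cosine(query, metadata_embeddings[i])) for i, score in top]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)
```

Because CLIP places text and images in the same vector space, the same query embedding can be compared against both the image embedding and the embedding of its metadata text.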
Debanjum Singh Solanky
0e34c8f493 Allow semantic search on images from Emacs
Images are rendered inline in a temporary org-mode buffer
2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky
7d5514ecaa Allow user to override inferred search type with other valid options 2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky
3bdeeb1e19 Autoload main semantic-search function 2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky
f4bde75249 Decouple results shown to user and text the model is trained on
- Previously:
  The text the model was trained on was being used to
  re-create a semblance of the original org-mode entry.

- Now:
  - Store raw entry as another key:value in each entry json too
    Only return actual raw org entries in results
    But create embeddings like before
  - Also add link to entry in file:<filename>::<line_number> form
    in property drawer of returned results
    This can be used to jump to actual entry in its original file
2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky
7ee3007070 Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present 2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
0263d4d068 Enable semantic search for songs in org-music
Org-Music: https://github.com/debanjum/org-music
2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
fd7888f3d4 Resolve relative file paths to config YAML file in cli.py 2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky
fc531a1915 Resolve relative file paths to model embeddings in all search types 2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky
74faa34bee Update sample config to add minimal config for ledger, image search 2021-08-22 21:54:49 -07:00
Debanjum Singh Solanky
8dec58b12a Update Readme to state can now query beancount transactions, images 2021-08-22 21:50:27 -07:00
Debanjum Singh Solanky
4daeddbbda Enable Semantic Search on Images 2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky
fd217fe8b7 Enable Semantic Search for Beancount transactions 2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky
97263b8209 Move CLI into a separate module. Move CLI tests into a separate file 2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky
78a1f4ebb4 Use YAML file to allow user to configure application. Add tests
- YAML Config
  - Can specify all params[1] earlier being passed via cmd args in config YAML
  - Can now also configure sentence-transformer models to use etc for search
    - [1] Config params
       - org files
       - compressed entries file config path
       - embeddings file config path

  - Include sample_config.yaml
  - Include sample .org file from this repo's readmes

- CLI
  - Configuration Priority: Config via cmd > Config via YAML > Default Config
  - Test CLI, include test config.yml for the tests

- Set default type to None unless set via query param to API
  Run notes search if search_enabled, also if type is None (default)
  Prepares for running queries on all search types unless type
  specified in API query param

- Update Readme
2021-08-21 19:07:39 -07:00
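The configuration priority above (cmd > YAML > default) can be sketched with `collections.ChainMap`. The keys, default values, and function name are hypothetical, and the YAML file is assumed to be already parsed into a dictionary:

```python
from collections import ChainMap
from typing import Any, Dict, Optional

# Hypothetical defaults; the real application's keys may differ.
DEFAULTS: Dict[str, Any] = {"verbose": 0, "org_files": []}


def resolve_config(
    cli_args: Dict[str, Any],
    yaml_config: Dict[str, Any],
    defaults: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    defaults = DEFAULTS if defaults is None else defaults
    # Drop CLI options the user did not set so they cannot mask
    # values from lower-priority sources.
    cli_set = {key: value for key, value in cli_args.items() if value is not None}
    # ChainMap looks keys up left to right: CLI, then YAML, then defaults.
    return dict(ChainMap(cli_set, yaml_config, defaults))
```

For example, `resolve_config({"verbose": 2, "org_files": None}, {"org_files": ["notes.org"]})` takes `verbose` from the command line and `org_files` from the YAML config.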
Debanjum Singh Solanky
bafc86d583 Add helpers to merge dictionaries and get keys deep inside a dictionary 2021-08-21 18:27:50 -07:00
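A sketch of what such helpers might look like. The names `get_from_dict` and `merge_dicts` and their exact behaviour are assumptions for illustration, not the functions added in this commit:

```python
from typing import Any, Dict, Optional


def get_from_dict(dictionary: Dict[str, Any], *keys: str) -> Optional[Any]:
    # Walk nested dictionaries key by key; return None as soon as a
    # key is missing or an intermediate value is not a dictionary.
    current: Any = dictionary
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def merge_dicts(priority_dict: Dict[str, Any], default_dict: Dict[str, Any]) -> Dict[str, Any]:
    # Shallow merge: keep every key from priority_dict and fill in
    # keys that only exist in default_dict. Neither input is mutated.
    merged = dict(priority_dict)
    for key, value in default_dict.items():
        merged.setdefault(key, value)
    return merged
```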
Debanjum Singh Solanky
eddbc67358 Document how to install latest version in Readme 2021-08-17 18:27:10 -07:00
Debanjum Singh Solanky
252266b62a Pass type of item via regenerate API. Default type query param to None 2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky
ff7207a6bd Extract commandline arguments into separate testable method 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
a3a1100be9 Arrange modules in standardized ordering 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
569e30b1c8 Create a few basic tests 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
af9660f28e Move application files under src directory. Update Readmes
- Remove the command for calling the asymmetric search script directly.
  It no longer works when called directly due to internal package
  import issues
2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
c35c6fb0b3 Reuse asymmetric.setup & input validation from asymmetric & org_to_jsonl
Create asymmetric.setup method to
  - initialize model
  - generate compressed jsonl
  - compute embeddings

put input_files, input_file_filter validation in org_to_jsonl for
reuse in main.py, asymmetric.py
2021-08-17 00:45:40 -07:00
Debanjum Singh Solanky
02a84df37a Update state vars after regeneration. Minimize time app is in inconsistent state 2021-08-16 23:47:33 -07:00
Debanjum Singh Solanky
0509854e14 Replace README.md with README.org. Can be used as notes for testing 2021-08-16 20:00:05 -07:00
Debanjum Singh Solanky
79aff85fcb Update Readme. No separate SETUP step required. Simpler RUN step
- Setup now happens on first run of application
- Embeddings can now be regenerated without killing app by calling API
2021-08-16 19:24:04 -07:00
Debanjum Singh Solanky
95bf26a7f2 Set verbosity commandline parameters default value to 0 2021-08-16 19:16:29 -07:00
Debanjum Singh Solanky
04a9a6d62f Expose API endpoint to (re-)generate embeddings from latest notes
- Provides mechanism to update notes from within application
  - Instead of having to pass the same arguments multiple times
    Pass it once (or rely on defaults when possible) and let app keep
    state and location of intermediary files

- Allows user to not have to deal with the internals of the application
  - E.g. user doesn't have to specify the jsonl.gz or embeddings file path
    The app will still put those files in a default location
  - The user doesn't have to run the generation from the commandline
    as a separate step
2021-08-16 18:52:38 -07:00
Debanjum Singh Solanky
1c00c33e73 Improve debug output from org_to_jsonl.py script 2021-08-16 18:50:29 -07:00
Debanjum Singh Solanky
2a57156428 Fix org_to_jsonl. Use passed args not global variables in methods. Fix orgnode import 2021-08-16 17:37:44 -07:00
Debanjum Singh Solanky
66238004d8 Use verbosity level instead of bool across application
For consistent, more granular verbosity controls across app
Allows user to increase verbosity by passing -vvv flags to main.py
2021-08-16 17:15:41 -07:00
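A counted verbosity flag of this kind is commonly built with `argparse` and `action="count"`. The sketch below assumes the flag names `--verbose` / `-v`; the parser in the actual project may be set up differently:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Verbosity flag sketch")
    # Each repetition of -v increments the count; no flag leaves it at 0.
    parser.add_argument(
        "--verbose", "-v",
        action="count",
        default=0,
        help="Increase verbosity; repeat for more detail (e.g. -vvv)",
    )
    return parser


# Passing -vvv yields a verbosity level of 3.
args = build_parser().parse_args(["-vvv"])
```

Code elsewhere can then compare against the level, for example printing debug output only when `args.verbose > 1`, instead of checking a single boolean.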
Debanjum Singh Solanky
adbf157deb Remove usage of the closure to search_notes as it's not required 2021-08-16 16:52:48 -07:00