1571 commits

Author SHA1 Message Date
Debanjum Singh Solanky
a4a23d7a72 Batch encode XMP metadata from images too for image_search 2021-09-16 11:11:36 -07:00
Debanjum Singh Solanky
3afe054312 Make image batch size to encode configurable via config.yml 2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky
41c328dae0 Batch encode images to keep memory consumption manageable
- Issue:
  Process would get killed while encoding images
  for consuming too much memory

- Fix:
  - Encode images in batches and append to image_embeddings
  - No need to use copy or deep_copy anymore with batch processing.
    It would earlier throw too many files open error

Other Changes:
  - Use tqdm to see progress even when using batch
  - See progress bar of encoding independent of verbosity (for now)
2021-09-16 10:15:54 -07:00
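
The batching described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `batches`, `encode_in_batches`, and `fake_encode` are hypothetical names, and `fake_encode` stands in for the real model's encode call. The commit also mentions a tqdm progress bar, which is omitted here to keep the sketch dependency-free.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    image_paths: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 50,  # hypothetical default; the real value comes from config.yml
) -> List[List[float]]:
    """Encode images one batch at a time and append to a single list,
    so only one batch of images needs to be held in memory at once."""
    image_embeddings: List[List[float]] = []
    for batch in batches(image_paths, batch_size):
        image_embeddings.extend(encode_batch(batch))
    return image_embeddings


def fake_encode(paths: Sequence[str]) -> List[List[float]]:
    """Stand-in encoder (assumption): one 1-d 'embedding' per path."""
    return [[float(len(path))] for path in paths]


paths = [f"img_{i}.jpg" for i in range(7)]
embeddings = encode_in_batches(paths, fake_encode, batch_size=3)
```

With 7 paths and a batch size of 3, the encoder is called three times, on batches of 3, 3, and 1 images.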
Debanjum Singh Solanky
d8abbc0552 Use XMP metadata in images to improve image search
- Details
  - The CLIP model can represent images, text in the same vector space

  - Enhance CLIP's image understanding by augmenting the plain image
    with its text-based metadata.
    Specifically with any subject, description XMP tags on the image

  - Improve results by combining plain image similarity score with
    metadata similarity scores for the highest ranked images

- Minor Fixes
  - Convert verbose to integer from bool in image_search.
    It's already passed as integer from the main program entrypoint

  - Process images with ".jpeg" extensions too
2021-09-16 08:55:20 -07:00
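
One way to picture combining the image and metadata similarity scores from the commit above is the sketch below. The commit does not say how the scores are combined, so plain addition is an assumption here, as are the function names `cosine` and `combined_score`. The vectors are toy values, not CLIP embeddings.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combined_score(
    query: Sequence[float],
    image_vec: Sequence[float],
    metadata_vec: Optional[Sequence[float]] = None,
) -> float:
    """Image similarity, plus metadata similarity when the image has XMP text.
    Simple addition is an assumption made for illustration."""
    score = cosine(query, image_vec)
    if metadata_vec is not None:
        score += cosine(query, metadata_vec)
    return score


query = [1.0, 0.0]
no_metadata = combined_score(query, [1.0, 0.0])
matching_metadata = combined_score(query, [1.0, 0.0], [1.0, 0.0])
```

An image whose metadata also matches the query ranks above an otherwise identical image with no metadata.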
Debanjum Singh Solanky
0e34c8f493 Allow semantic search on images from Emacs
Images are rendered inline in a temporary org-mode buffer
2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky
7d5514ecaa Allow user to override inferred search type with other valid options 2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky
3bdeeb1e19 Autoload main semantic-search function 2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky
f4bde75249 Decouple results shown to user and text the model is trained on
- Previously:
  The text the model was trained on was being used to
  re-create a semblance of the original org-mode entry.

- Now:
  - Store raw entry as another key:value in each entry json too
    Only return actual raw org entries in results
    But create embeddings like before
  - Also add link to entry in file:<filename>::<line_number> form
    in property drawer of returned results
    This can be used to jump to the actual entry in its original file
2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky
7ee3007070 Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present 2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
0263d4d068 Enable semantic search for songs in org-music
Org-Music: https://github.com/debanjum/org-music
2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
fd7888f3d4 Resolve relative file paths to config YAML file in cli.py 2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky
fc531a1915 Resolve relative file paths to model embeddings in all search types 2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky
74faa34bee Update sample config to add minimal config for ledger, image search 2021-08-22 21:54:49 -07:00
Debanjum Singh Solanky
8dec58b12a Update Readme to state can now query beancount transactions, images 2021-08-22 21:50:27 -07:00
Debanjum Singh Solanky
4daeddbbda Enable Semantic Search on Images 2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky
fd217fe8b7 Enable Semantic Search for Beancount transactions 2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky
97263b8209 Move CLI into a separate module. Move CLI tests into a separate file 2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky
78a1f4ebb4 Use YAML file to allow user to configure application. Add tests
- YAML Config
  - Can specify all params[1] previously passed via cmd args in the config YAML
  - Can now also configure sentence-transformer models to use etc for search
    - [1] Config params
       - org files
       - compressed entries file config path
       - embeddings file config path

  - Include sample_config.yaml
  - Include sample .org file from this repo's readmes

- CLI
  - Configuration Priority: Config via cmd > Config via YAML > Default Config
  - Test CLI, include test config.yml for the tests

- Set default type to None unless set via query param to API
  Run notes search if search_enabled, also if type is None (default)
  Prepares for running queries on all search types unless type
  specified in API query param

- Update Readme
2021-08-21 19:07:39 -07:00
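
The configuration priority stated in the commit above (cmd > YAML > default) could be resolved along these lines. This is a sketch under assumptions: `resolve_config` and the keys shown are hypothetical, and the YAML file is represented by an already-parsed dict so the example needs no YAML library.

```python
from typing import Any, Dict


def resolve_config(
    defaults: Dict[str, Any],
    yaml_config: Dict[str, Any],
    cli_args: Dict[str, Any],
) -> Dict[str, Any]:
    """Layer config sources from lowest to highest priority.
    A value of None means 'not set' and never overrides a lower layer."""
    resolved = dict(defaults)
    for layer in (yaml_config, cli_args):
        resolved.update({key: value for key, value in layer.items() if value is not None})
    return resolved


# Hypothetical keys, for illustration only
defaults = {"verbose": 0, "org_files": None, "embeddings_file": "embeddings.pt"}
yaml_config = {"org_files": ["notes.org"], "verbose": 1}
cli_args = {"verbose": 2, "org_files": None}

config = resolve_config(defaults, yaml_config, cli_args)
```

Here `verbose` comes from the command line, `org_files` from the YAML config, and `embeddings_file` from the defaults.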
Debanjum Singh Solanky
bafc86d583 Add helpers to merge dictionaries and get keys deep inside a dictionary 2021-08-21 18:27:50 -07:00
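
Helpers of the kind this commit describes might look like the sketch below. The names `merge_dicts` and `get_from_dict`, their signatures, and the config keys are assumptions made for illustration and may differ from the repository's actual helpers.

```python
from typing import Any, Dict


def merge_dicts(priority_dict: Dict[str, Any], default_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with all keys of priority_dict, plus any keys
    that only exist in default_dict (shallow merge)."""
    merged = dict(priority_dict)
    for key, value in default_dict.items():
        merged.setdefault(key, value)
    return merged


def get_from_dict(dictionary: Dict[str, Any], *keys: str, default: Any = None) -> Any:
    """Walk nested dicts along keys; return default if any step is missing."""
    current: Any = dictionary
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical nested config, for illustration only
nested_config = {"search": {"image": {"batch_size": 50}}}
batch_size = get_from_dict(nested_config, "search", "image", "batch_size")
missing = get_from_dict(nested_config, "search", "ledger", "batch_size")
```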
Debanjum Singh Solanky
eddbc67358 Document how to install latest version in Readme 2021-08-17 18:27:10 -07:00
Debanjum Singh Solanky
252266b62a Pass type of item via regenerate API. Default type query param to None 2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky
ff7207a6bd Extract commandline arguments into separate testable method 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
a3a1100be9 Arrange modules in standardized ordering 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
569e30b1c8 Create a few basic tests 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
af9660f28e Move application files under src directory. Update Readmes
- Remove the command for calling the asymmetric search script directly.
  It no longer works when called directly due to internal package
  import issues
2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
c35c6fb0b3 Reuse asymmetric.setup & input validation from asymmetric & org_to_jsonl
Create asymmetric.setup method to
  - initialize model
  - generate compressed jsonl
  - compute embeddings

Put input_files, input_file_filter validation in org_to_jsonl for
reuse in main.py, asymmetric.py
2021-08-17 00:45:40 -07:00
Debanjum Singh Solanky
02a84df37a Update state vars after regeneration. Minimize time app is in inconsistent state 2021-08-16 23:47:33 -07:00
Debanjum Singh Solanky
0509854e14 Replace README.md with README.org. Can be used as notes for testing 2021-08-16 20:00:05 -07:00
Debanjum Singh Solanky
79aff85fcb Update Readme. No separate SETUP step required. Simpler RUN step
- Setup now happens on first run of application
- Embeddings can now be regenerated without killing app by calling API
2021-08-16 19:24:04 -07:00
Debanjum Singh Solanky
95bf26a7f2 Set verbosity commandline parameters default value to 0 2021-08-16 19:16:29 -07:00
Debanjum Singh Solanky
04a9a6d62f Expose API endpoint to (re-)generate embeddings from latest notes
- Provides mechanism to update notes from within application
  - Instead of having to pass the same arguments multiple times
    Pass it once (or rely on defaults when possible) and let app keep
    state and location of intermediary files

- Allows user to not have to deal with the internals of the application
  - E.g. the user doesn't have to specify the jsonl.gz or embeddings file path
    The app will still put those files in a default location
  - The user doesn't have to run the generation from the commandline
    as a separate step
2021-08-16 18:52:38 -07:00
Debanjum Singh Solanky
1c00c33e73 Improve debug output from org_to_jsonl.py script 2021-08-16 18:50:29 -07:00
Debanjum Singh Solanky
2a57156428 Fix org_to_jsonl. Use passed args not global variables in methods. Fix orgnode import 2021-08-16 17:37:44 -07:00
Debanjum Singh Solanky
66238004d8 Use verbosity level instead of bool across application
For consistent, more granular verbosity controls across app
Allows user to increase verbosity by passing -vvv flags to main.py
2021-08-16 17:15:41 -07:00
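
A verbosity level driven by repeated `-v` flags is commonly implemented with argparse's `count` action. The sketch below shows that standard technique; whether main.py uses exactly this form is an assumption, and the `log` helper is hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(description="Verbosity level sketch")
# Each -v increments the counter; default=0 means quiet when no flag is passed
parser.add_argument("-v", "--verbose", action="count", default=0)

quiet_args = parser.parse_args([])
loud_args = parser.parse_args(["-vvv"])


def log(message: str, level: int, verbosity: int) -> bool:
    """Print message only if verbosity is at least level; return whether it printed."""
    if verbosity >= level:
        print(message)
        return True
    return False


log("Loaded entries", level=1, verbosity=loud_args.verbose)
```

An integer level lets each call site decide how much detail it needs, which a single boolean cannot express.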
Debanjum Singh Solanky
adbf157deb Remove usage of the closure to search_notes as it's not required 2021-08-16 16:52:48 -07:00
Debanjum Singh Solanky
649e5d1327 Allow reuse of get_absolute_path, is_none_or_empty methods
- Move them to utils.helper.py for reuse
- Import those modules where required
- Delete duplicate methods defined in org_to_jsonl.py, asymmetric.py
2021-08-16 16:33:43 -07:00
Debanjum Singh Solanky
9703afb814 Rename search_types to search_type to standardize to singular naming
Using singular names for other directories in application already
 - processor instead of processors
 - interface instead of interfaces
2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky
19d6678eb1 Allow importing org-to-jsonl as module for reuse
To allow importing org-to-jsonl as module
  - Wrap code in __main__ into an org-to-jsonl method
  - Rename processor/org-mode to processor/org_mode
  - Add __init__.py to processor directory
2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky
5f8221f77e Remove unused verbose argument to collate_results method 2021-08-16 13:54:41 -07:00
Debanjum Singh Solanky
85bf15628d Use better cmdline argument names. Drop unneeded no-compress argument
Can infer to compress or not via the output_file suffix
2021-08-16 13:49:39 -07:00
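
Inferring compression from the output file suffix, as the commit above describes, could work roughly like this. The function name `write_jsonl` and the use of a `.gz` suffix check are assumptions made for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def write_jsonl(entries: Iterable[Dict[str, Any]], output_file: Path) -> Path:
    """Write one JSON object per line; gzip-compress only if the path ends in .gz."""
    output_file = Path(output_file)
    opener = gzip.open if output_file.suffix == ".gz" else open
    with opener(output_file, "wt", encoding="utf-8") as jsonl:
        for entry in entries:
            jsonl.write(json.dumps(entry) + "\n")
    return output_file


entries = [{"title": "Heading 1"}, {"title": "Heading 2"}]
with tempfile.TemporaryDirectory() as tmp:
    compressed = write_jsonl(entries, Path(tmp) / "entries.jsonl.gz")
    plain = write_jsonl(entries, Path(tmp) / "entries.jsonl")
    # gzip files start with the magic bytes 0x1f 0x8b
    is_gzip = compressed.read_bytes()[:2] == b"\x1f\x8b"
    plain_lines = plain.read_text(encoding="utf-8").splitlines()
```

Deriving the behaviour from the suffix removes the need for a separate no-compress flag.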
Debanjum Singh Solanky
d9f60c00bf Warn if any input files to org-to-json are potentially non org-mode files
That is, if the file paths in the input set don't end with .org
2021-08-16 13:49:39 -07:00
Debanjum Singh Solanky
3aa0c30fee Use absolute file path to open files in org-to-jsonl.py, asymmetric.py
Exit script if neither org_files, org_file_filter is present
2021-08-16 13:49:39 -07:00
Debanjum Singh Solanky
e773611558 Remove unused jsonl_file argument from convert_org_entries_to_jsonl 2021-08-16 13:49:35 -07:00
Debanjum Singh Solanky
8b29e272d3 Standardize interface, better default args for org-to-json.py script
- Remove non-standard, unnecessary argument for org-directory
  Pass the path to each file in the org-files and org-files-filter arguments directly
- Allow shorthand -i, -o for input files, output files
- Default to compress, unless user explicitly specifies not to
2021-08-16 11:29:08 -07:00
Debanjum Singh Solanky
7547e90745 Minor doc updates after merging emacs package with main repository 2021-08-16 02:02:26 -07:00
Debanjum Singh Solanky
ec157ea0ff Add Emacs interface to semantic-search directly to main repository
Too much overhead to maintain multiple repositories, especially when
the Emacs library for semantic-search is a single file.

Import Readme from the emacs-semantic-search repository too
2021-08-16 01:27:46 -07:00
Debanjum Singh Solanky
dcf7b2d04f Remove requirements.txt for now as virtualenv setup doesn't work
Haven't gotten it to work on Mac or Ubuntu. Remove to avoid confusion
for now. Application depends on miniconda for now
2021-08-16 00:15:10 -07:00
Debanjum Singh Solanky
3b81fafa3e Use updated path to MiniLM bi-encoder model on hugging-face 2021-08-15 23:57:22 -07:00
Debanjum Singh Solanky
4839153086 Acknowledge ML models used for search. Simplify path used in commands 2021-08-15 23:56:18 -07:00
Debanjum Singh Solanky
c58c1d96aa Change default install directory to current, fix open file code 2021-08-15 23:01:55 -07:00