sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 15:38:55 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	d2905c4be6	Move tests out to project root. Use absolute import in project tests/ directory in project root is more standard. Just had to use absolute path for internal module imports to get it to work	2021-09-30 04:12:14 -07:00
Debanjum Singh Solanky	58bb420f69	Fix image_metadata argument ordering bug. Add E2E image search test - Image search test seems a little flaky - Interchanged argument was causing inaccurate results earlier	2021-09-30 03:30:47 -07:00
Debanjum Singh Solanky	d5597442f4	Modularize Code. Wrap Search, Model Config in Classes. Add Tests Details - Rename method query_* to query in search_types for standardization - Wrapping Config code in classes simplified mocking test config - Reduce args being passed to a function by passing it as single argument wrapped in a class - Minimize setup in main.py:__main__. Put most of it into functions These functions can be mocked if required in tests later too Setup Flow: CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel	2021-09-30 02:04:04 -07:00
Debanjum Singh Solanky	f4dd9cd117	Use type specific model for other search types too. Expose them via SearchModels - Wrap Image, Music, Ledger search into the type of SearchModel they use Similar to what was done for notes model by wrapping its config into an AsymmetricSearchModel. - Use the uber wrapper class to expose all type specific search models	2021-09-29 21:09:42 -07:00
Debanjum Singh Solanky	352d2930ee	Use multiple threads to generate model embeddings. Other minor formatting	2021-09-29 20:47:58 -07:00
Debanjum Singh Solanky	e22e0b41e3	Wrap asymmetric search model into SearchModels. Test notes search end-to-end - Wrap asymmetric search model parameters into AsymmetricSearchModel class - Create wrapper for all search type models. Put notes search model into it - Test notes search end-to-end from client API layer to results. Use model built on test data	2021-09-29 20:47:35 -07:00
Debanjum Singh Solanky	cde11a2331	Wrap search type enablement status in a search settings class - Cleaner, more idiomatic usage of a global variable - Simplifies mocking when testing client in pytest as setting wrapped in object rather than a simple type. So passed around by reference	2021-09-29 19:18:33 -07:00
Debanjum Singh Solanky	81ce0cacc3	Only allow supported search types to /search, /regenerate APIs - Use a SearchType to limit types that can be passed by user - FastAPI automatically validates type passed in query param - Available type options show up in Swagger UI, FastAPI docs - controller code looks neater instead of doing string comparisons for type - Test invalid, valid search types via pytest	2021-09-29 19:12:56 -07:00
Debanjum Singh Solanky	150593c776	Update Readme. Acknowledge PyExifTool and Minor Fixes	2021-09-16 12:39:42 -07:00
Debanjum Singh Solanky	fdb60a8dcf	Set Query as Heading of Image Search Results Emacs Buffer	2021-09-16 12:30:06 -07:00
Debanjum Singh Solanky	169ddcc8c6	Make Using XMP Metadata to Enhance Image Search Optional, Configurable - Break the compute embeddings method into separate methods: compute_image_embeddings and compute_metadata_embeddings - If image_metadata_embeddings isn't defined, do not use it to enhance search results. Given image_metadata_embeddings wouldn't be defined if use_xmp_metadata is False, we can avoid unnecessary addition of args to query method	2021-09-16 12:01:05 -07:00
Debanjum Singh Solanky	a4a23d7a72	Batch encode XMP metadata from images too for image_search	2021-09-16 11:11:36 -07:00
Debanjum Singh Solanky	3afe054312	Make image batch size to encode configurable via config.yml	2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky	41c328dae0	Batch encode images to keep memory consumption manageable - Issue: Process would get killed while encoding images for consuming too much memory - Fix: - Encode images in batches and append to image_embeddings - No need to use copy or deep_copy anymore with batch processing. It would earlier throw too many files open error Other Changes: - Use tqdm to see progress even when using batch - See progress bar of encoding independent of verbosity (for now)	2021-09-16 10:15:54 -07:00
Debanjum Singh Solanky	d8abbc0552	Use XMP metadata in images to improve image search - Details - The CLIP model can represent images, text in the same vector space - Enhance CLIP's image understanding by augmenting the plain image with its text-based metadata. Specifically with any subject, description XMP tags on the image - Improve results by combining plain image similarity score with metadata similarity scores for the highest ranked images - Minor Fixes - Convert verbose to integer from bool in image_search. It's already passed as integer from the main program entrypoint - Process images with ".jpeg" extensions too	2021-09-16 08:55:20 -07:00
Debanjum Singh Solanky	0e34c8f493	Allow semantic search on images from Emacs Images are rendered inline a temporary org-mode buffer	2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky	7d5514ecaa	Allow user to override inferred search type with other valid options	2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky	3bdeeb1e19	Autoload main semantic-search function	2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky	f4bde75249	Decouple results shown to user and text the model is trained on - Previously: The text the model was trained on was being used to re-create a semblance of the original org-mode entry. - Now: - Store raw entry as another key:value in each entry json too Only return actual raw org entries in results But create embeddings like before - Also add link to entry in file:<filename>::<line_number> form in property drawer of returned results This can be used to jump to actual entry in its original file	2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky	7ee3007070	Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present	2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky	0263d4d068	Enable semantic search for songs in org-music Org-Music: https://github.com/debanjum/org-music	2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky	fd7888f3d4	Resolve relative file paths to config YAML file in cli.py	2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky	fc531a1915	Resolve relative file paths to model embeddings in all search types	2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky	74faa34bee	Update sample config to add minimal config for ledger, image search	2021-08-22 21:54:49 -07:00
Debanjum Singh Solanky	8dec58b12a	Update Readme to state can now query beancount transactions, images	2021-08-22 21:50:27 -07:00
Debanjum Singh Solanky	4daeddbbda	Enable Semantic Search on Images	2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky	fd217fe8b7	Enable Semantic Search for Beancount transactions	2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky	97263b8209	Move CLI into a separate module. Move CLI tests into a separate file	2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky	78a1f4ebb4	Use YAML file to allow user to configure application. Add tests - YAML Config - Can specify all params[1] earlier being passed via cmd args in config YAML - Can now also configure sentence-transformer models to use etc for search - [1] Config params - org files - compressed entries file config path - embeddings file config path - Include sample_config.yaml - Include sample .org file from this repo's readmes - CLI - Configuration Priority: Config via cmd > Config via YAML > Default Config - Test CLI, include test config.yml for the tests - Set default type to None unless set via query param to API Run notes search if search_enabled, also if type is None (default) Prepares for running queries on all search types unless type specified in API query param - Update Readme	2021-08-21 19:07:39 -07:00
Debanjum Singh Solanky	bafc86d583	Add helpers to merge dictionaries and get keys deep inside a dictionary	2021-08-21 18:27:50 -07:00
Debanjum Singh Solanky	eddbc67358	Document how to install latest version in Readme	2021-08-17 18:27:10 -07:00
Debanjum Singh Solanky	252266b62a	Pass type of item via regenerate API. Default type query param to None	2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky	ff7207a6bd	Extract commandline arguments into separate testable method	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	a3a1100be9	Arrange modules in standardized ordering	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	569e30b1c8	Create a few basic tests	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	af9660f28e	Move application files under src directory. Update Readmes - Remove the command for calling the asymmetric search script directly. It doesn't work anymore on calling directly due to internal package import issues	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	c35c6fb0b3	Reuse asymmetric.setup & input validation from asymmetric & org_to_jsonl Create asymmetric.setup method to - initialize model - generate compressed jsonl - compute embeddings put input_files, input_file_filter validation in org_to_jsonl for reuse in main.py, asymmetric.py	2021-08-17 00:45:40 -07:00
Debanjum Singh Solanky	02a84df37a	Update state vars after regeneration. Minimize time app in inconsistent state	2021-08-16 23:47:33 -07:00
Debanjum Singh Solanky	0509854e14	Replace README.md with README.org. Can be used as notes for testing	2021-08-16 20:00:05 -07:00
Debanjum Singh Solanky	79aff85fcb	Update Readme. No separate SETUP step required. Simpler RUN step - Setup now happens on first run of application - Embeddings can now be regenerated without killing app by calling API	2021-08-16 19:24:04 -07:00
Debanjum Singh Solanky	95bf26a7f2	Set verbosity commandline parameters default value to 0	2021-08-16 19:16:29 -07:00
Debanjum Singh Solanky	04a9a6d62f	Expose API endpoint to (re-)generate embeddings from latest notes - Provides mechanism to update notes from within application - Instead of having to pass the same arguments multiple times Pass it once (or rely on defaults when possible) and let app keep state and location of intermediary files - Allows user to not have to deal with the internals of the application - E.g user doesn't have to specify the jsonl.gz or embeddings file path The app will still put those files in a default location - The user doesn't have to run the generation from the commandline as a separate step	2021-08-16 18:52:38 -07:00
Debanjum Singh Solanky	1c00c33e73	Improve debug output from org_to_jsonl.py script	2021-08-16 18:50:29 -07:00
Debanjum Singh Solanky	2a57156428	Fix org_to_jsonl. Use passed args not global variables in methods. Fix orgnode import	2021-08-16 17:37:44 -07:00
Debanjum Singh Solanky	66238004d8	Use verbosity level instead of bool across application For consistent, more granular verbosity controls across app Allows user to increase verbosity by passing -vvv flags passed to main.py	2021-08-16 17:15:41 -07:00
Debanjum Singh Solanky	adbf157deb	Remove usage of the closure to search_notes as it's not required	2021-08-16 16:52:48 -07:00
Debanjum Singh Solanky	649e5d1327	Allow reuse of get_absolute_path, is_none_or_empty methods - Move them to utils.helper.py for reuse - Import those modules where required - Delete duplicate methods defined in org_to_jsonl.py, asymmetric.py	2021-08-16 16:33:43 -07:00
Debanjum Singh Solanky	9703afb814	Rename search_types to search_type to standardize to singular naming Using singular names for other directories in application already - processor instead of processors - interface instead of interfaces	2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky	19d6678eb1	Allow importing org-to-jsonl as module for reuse To allow importing org-to-jsonl as module - Wrap code in __main__ into a org-to-jsonl method - Rename processor/org-mode to processor/org_mode - Add __init__.py to processor directory	2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky	5f8221f77e	Remove unused verbose argument to collate_results method	2021-08-16 13:54:41 -07:00
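
Commits 78a1f4ebb4 and bafc86d583 describe a configuration priority (command line > YAML > defaults) and helpers to merge dictionaries and read deeply nested keys. The sketch below illustrates one way such a priority could be resolved. The function names `merge_dicts` and `get_from_dict`, the config keys, and the file names are all hypothetical and are not taken from the repository.

```python
from copy import deepcopy


def merge_dicts(priority: dict, default: dict) -> dict:
    """Return a new dict in which values from `priority` override `default`.

    Nested dicts are merged recursively; None values in `priority` are
    treated as "not set" and leave the default in place.
    """
    merged = deepcopy(default)
    for key, value in priority.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(value, merged[key])
        elif value is not None:
            merged[key] = deepcopy(value)
    return merged


def get_from_dict(dictionary: dict, *keys, default=None):
    """Walk nested keys; return `default` if any key on the path is missing."""
    current = dictionary
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical config layers, lowest priority first.
default_config = {"search-type": {"notes": {"embeddings-file": ".notes.pt", "verbose": 0}}}
yaml_config = {"search-type": {"notes": {"embeddings-file": "notes_embeddings.pt"}}}
cli_config = {"search-type": {"notes": {"verbose": 2}}}

# Priority: command line > YAML > defaults.
config = merge_dicts(cli_config, merge_dicts(yaml_config, default_config))
```

Applying the merge twice, innermost first, keeps each layer independent, so a test could swap in a different YAML layer without touching the defaults.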


82 commits
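
Commit 41c328dae0 describes encoding images in batches and appending the results to a running list so that memory use stays bounded. A minimal sketch of that pattern follows. The helpers `batched` and `encode_in_batches` and the stand-in encoder are assumptions for illustration only; the repository itself encodes with a CLIP model and shows progress with tqdm.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    paths: Sequence[str],
    encode: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 50,
) -> List[List[float]]:
    """Encode `paths` one batch at a time and collect the embeddings in order.

    Only one batch of inputs is handed to `encode` at a time, so peak
    memory depends on `batch_size` rather than on the full collection.
    """
    embeddings: List[List[float]] = []
    for batch in batched(paths, batch_size):
        embeddings.extend(encode(batch))
    return embeddings


def fake_encode(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in encoder (hypothetical): one 1-dimensional vector per path."""
    return [[float(len(path))] for path in batch]


image_paths = ["a.jpg", "bb.jpeg", "ccc.jpg", "dddd.jpg", "eeeee.jpeg"]
image_embeddings = encode_in_batches(image_paths, fake_encode, batch_size=2)
```

Making `batch_size` a parameter mirrors the later commit 3afe054312, which exposes the batch size through config.yml.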