sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	41c328dae0	Batch encode images to keep memory consumption manageable - Issue: Process would get killed while encoding images for consuming too much memory - Fix: - Encode images in batches and append to image_embeddings - No need to use copy or deep_copy anymore with batch processing. It would earlier throw too many files open error Other Changes: - Use tqdm to see progress even when using batch - See progress bar of encoding independent of verbosity (for now)	2021-09-16 10:15:54 -07:00
Debanjum Singh Solanky	d8abbc0552	Use XMP metadata in images to improve image search - Details - The CLIP model can represent images, text in the same vector space - Enhance CLIP's image understanding by augmenting the plain image with its text-based metadata. Specifically with any subject, description XMP tags on the image - Improve results by combining plain image similarity score with metadata similarity scores for the highest ranked images - Minor Fixes - Convert verbose to integer from bool in image_search. It's already passed as integer from the main program entrypoint - Process images with ".jpeg" extensions too	2021-09-16 08:55:20 -07:00
Debanjum Singh Solanky	0e34c8f493	Allow semantic search on images from Emacs Images are rendered inline a temporary org-mode buffer	2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky	7d5514ecaa	Allow user to override inferred search type with other valid options	2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky	3bdeeb1e19	Autoload main semantic-search function	2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky	f4bde75249	Decouple results shown to user and text the model is trained on - Previously: The text the model was trained on was being used to re-create a semblance of the original org-mode entry. - Now: - Store raw entry as another key:value in each entry json too Only return actual raw org entries in results But create embeddings like before - Also add link to entry in file:<filename>::<line_number> form in property drawer of returned results This can be used to jump to actual entry in its original file	2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky	7ee3007070	Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present	2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky	0263d4d068	Enable semantic search for songs in org-music Org-Music: https://github.com/debanjum/org-music	2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky	fd7888f3d4	Resolve relative file paths to config YAML file in cli.py	2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky	fc531a1915	Resolve relative file paths to model embeddings in all search types	2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky	74faa34bee	Update sample config to add minimal config for ledger, image search	2021-08-22 21:54:49 -07:00
Debanjum Singh Solanky	8dec58b12a	Update Readme to state can now query beancount transactions, images	2021-08-22 21:50:27 -07:00
Debanjum Singh Solanky	4daeddbbda	Enable Semantic Search on Images	2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky	fd217fe8b7	Enable Semantic Search for Beancount transactions	2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky	97263b8209	Move CLI into a separate module. Move CLI tests into a separate file	2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky	78a1f4ebb4	Use YAML file to allow user to configure application. Add tests - YAML Config - Can specify all params[1] earlier being passed via cmd args in config YAML - Can now also configure sentence-transformer models to use etc for search - [1] Config params - org files - compressed entries file config path - embeddings file config path - Include sample_config.yaml - Include sample .org file from this repo's readmes - CLI - Configuration Priority: Config via cmd > Config via YAML > Default Config - Test CLI, include test config.yml for the tests - Set default type to None unless set via query param to API Run notes search if search_enabled, also if type is None (default) Prepares for running queries on all search types unless type specified in API query param - Update Readme	2021-08-21 19:07:39 -07:00
Debanjum Singh Solanky	bafc86d583	Add helpers to merge dictionaries and get keys deep inside a dictionary	2021-08-21 18:27:50 -07:00
Debanjum Singh Solanky	eddbc67358	Document how to install latest version in Readme	2021-08-17 18:27:10 -07:00
Debanjum Singh Solanky	252266b62a	Pass type of item via regenerate API. Default type query param to None	2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky	ff7207a6bd	Extract commandline arguments into separate testable method	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	a3a1100be9	Arrange modules in standardized ordering	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	569e30b1c8	Create a few basic tests	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	af9660f28e	Move application files under src directory. Update Readmes - Remove the command for calling the asymmetric search script directly. It no longer works when called directly due to internal package import issues	2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky	c35c6fb0b3	Reuse asymmetric.setup & input validation from asymmetric & org_to_jsonl Create asymmetric.setup method to - initialize model - generate compressed jsonl - compute embeddings put input_files, input_file_filter validation in org_to_jsonl for reuse in main.py, asymmetric.py	2021-08-17 00:45:40 -07:00
Debanjum Singh Solanky	02a84df37a	Update state vars after regeneration. Minimize time app in inconsistent state	2021-08-16 23:47:33 -07:00
Debanjum Singh Solanky	0509854e14	Replace README.md with README.org. Can be used as notes for testing	2021-08-16 20:00:05 -07:00
Debanjum Singh Solanky	79aff85fcb	Update Readme. No separate SETUP step required. Simpler RUN step - Setup now happens on first run of application - Embeddings can now be regenerated without killing app by calling API	2021-08-16 19:24:04 -07:00
Debanjum Singh Solanky	95bf26a7f2	Set verbosity commandline parameters default value to 0	2021-08-16 19:16:29 -07:00
Debanjum Singh Solanky	04a9a6d62f	Expose API endpoint to (re-)generate embeddings from latest notes - Provides mechanism to update notes from within application - Instead of having to pass the same arguments multiple times Pass it once (or rely on defaults when possible) and let app keep state and location of intermediary files - Allows user to not have to deal with the internals of the application - E.g user doesn't have to specify the jsonl.gz or embeddings file path The app will still put those files in a default location - The user doesn't have to run the generation from the commandline as a separate step	2021-08-16 18:52:38 -07:00
Debanjum Singh Solanky	1c00c33e73	Improve debug output from org_to_jsonl.py script	2021-08-16 18:50:29 -07:00
Debanjum Singh Solanky	2a57156428	Fix org_to_jsonl. Use passed args not global variables in methods. Fix orgnode import	2021-08-16 17:37:44 -07:00
Debanjum Singh Solanky	66238004d8	Use verbosity level instead of bool across application For consistent, more granular verbosity controls across app Allows user to increase verbosity by passing -vvv flags to main.py	2021-08-16 17:15:41 -07:00
Debanjum Singh Solanky	adbf157deb	Remove usage of the closure to search_notes as it's not required	2021-08-16 16:52:48 -07:00
Debanjum Singh Solanky	649e5d1327	Allow reuse of get_absolute_path, is_none_or_empty methods - Move them to utils.helper.py for reuse - Import those modules where required - Delete duplicate methods defined in org_to_jsonl.py, asymmetric.py	2021-08-16 16:33:43 -07:00
Debanjum Singh Solanky	9703afb814	Rename search_types to search_type to standardize to singular naming Using singular names for other directories in application already - processor instead of processors - interface instead of interfaces	2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky	19d6678eb1	Allow importing org-to-jsonl as module for reuse To allow importing org-to-jsonl as module - Wrap code in __main__ into a org-to-jsonl method - Rename processor/org-mode to processor/org_mode - Add __init__.py to processor directory	2021-08-16 16:31:30 -07:00
Debanjum Singh Solanky	5f8221f77e	Remove unused verbose argument to collate_results method	2021-08-16 13:54:41 -07:00
Debanjum Singh Solanky	85bf15628d	Use better cmdline argument names. Drop unneeded no-compress argument Can infer to compress or not via the output_file suffix	2021-08-16 13:49:39 -07:00
Debanjum Singh Solanky	d9f60c00bf	Warn if any input files to org-to-json are potentially non org-mode files That is, if the file paths in the input set don't end with .org	2021-08-16 13:49:39 -07:00
Debanjum Singh Solanky	3aa0c30fee	Use absolute file path to open files in org-to-jsonl.py, asymmetric.py Exit script if neither org_files, org_file_filter is present	2021-08-16 13:49:39 -07:00
Debanjum Singh Solanky	e773611558	Remove unused jsonl_file argument from convert_org_entries_to_jsonl	2021-08-16 13:49:35 -07:00
Debanjum Singh Solanky	8b29e272d3	Standardize interface, better default args for org-to-json.py script - Remove non-standard, unnecessary argument for org-directory Pass path of each file in org-files and org-files-filter argument directly - Allow shorthand -i, -o for input files, output files - Default to compress, unless user explicitly specifies not to	2021-08-16 11:29:08 -07:00
Debanjum Singh Solanky	7547e90745	Minor doc updates after merging emacs package with main repository	2021-08-16 02:02:26 -07:00
Debanjum Singh Solanky	ec157ea0ff	Add Emacs interface to semantic-search directly to main repository Too much overhead to maintain multiple repositories, especially when the Emacs library for semantic-search is a single file. Import Readme from the emacs-semantic-search repository too	2021-08-16 01:27:46 -07:00
Debanjum Singh Solanky	dcf7b2d04f	Remove requirements.txt for now as virtualenv setup doesn't work Haven't gotten it to work on Mac or Ubuntu. Remove to avoid confusion for now. Application depends on miniconda for now	2021-08-16 00:15:10 -07:00
Debanjum Singh Solanky	3b81fafa3e	Use updated path to MiniLM bi-encoder model on hugging-face	2021-08-15 23:57:22 -07:00
Debanjum Singh Solanky	4839153086	Acknowledge ML models used for search. Simplify path used in commands	2021-08-15 23:56:18 -07:00
Debanjum Singh Solanky	c58c1d96aa	Change default install directory to current, fix open file code	2021-08-15 23:01:55 -07:00
Debanjum Singh Solanky	ae15e429b5	Reduce indentation from 4 to 2 in Readme.md. Prevent everything looking like code blocks due to 4 space indentations	2021-08-15 22:56:36 -07:00
Debanjum Singh Solanky	636b6195cc	Add Readme, License. Update .gitignore	2021-08-15 22:52:37 -07:00
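
Commits 78a1f4ebb4 and bafc86d583 above describe a configuration priority (cmd > YAML > default) and helpers to merge dictionaries and read keys deep inside a dictionary. The following is a minimal sketch of how such a priority could be resolved, not the repository's actual code. The names `merge_dicts`, `get_from_dict`, `resolve_config` and the config keys used are assumptions made for illustration, as is the choice to treat `None` as "not set".

```python
from copy import deepcopy


def merge_dicts(priority: dict, default: dict) -> dict:
    """Recursively merge two dicts; set values in `priority` win over `default`.

    Hypothetical helper: a value of None in `priority` is treated as unset,
    so the lower-priority value is kept.
    """
    merged = deepcopy(default)
    for key, value in priority.items():
        if value is None:
            continue  # unset, fall through to the lower-priority value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(value, merged[key])
        else:
            merged[key] = deepcopy(value)
    return merged


def get_from_dict(source: dict, *keys, default=None):
    """Walk nested dicts along `keys`; return `default` if any key is missing."""
    current = source
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def resolve_config(cmd_args: dict, yaml_config: dict, default_config: dict) -> dict:
    """Apply the priority: config via cmd > config via YAML > default config."""
    return merge_dicts(cmd_args, merge_dicts(yaml_config, default_config))


if __name__ == "__main__":
    # All keys and file names below are illustrative assumptions.
    defaults = {"notes": {"input-files": None, "embeddings-file": "default.pt"}, "verbose": 0}
    yaml_cfg = {"notes": {"input-files": ["notes.org"], "embeddings-file": "yaml.pt"}}
    cmd = {"notes": {"embeddings-file": "cmd.pt"}, "verbose": None}

    config = resolve_config(cmd, yaml_cfg, defaults)
    print(get_from_dict(config, "notes", "embeddings-file"))
    print(get_from_dict(config, "notes", "input-files"))
```

Merging the YAML config over the defaults first, then the command-line arguments over that result, keeps the precedence rule in one place; deep copies avoid mutating the caller's dictionaries.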

269 commits