sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-18 18:47:11 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	06afeec7e2	Hide stars of org entry results on Emacs to reduce visual clutter They've all been normalized to the same level and hence don't hold much data. So it's a good opportunity to reduce non-useful visual clutter	2022-08-05 05:27:57 +03:00
Saba	d1fe6353b5	Check whether processor_config exists during shutdown event	2022-08-04 21:57:36 -04:00
Debanjum Singh Solanky	4d4d2ff921	Ensure all org entries are unfolded in results buffer on Emacs	2022-08-05 04:54:29 +03:00
Debanjum Singh Solanky	49ef741d4b	Prevent Zoom on Input in Web Interface. Document Pip upgrade in Readme - Name /Reload API Controller Reload	2022-08-05 03:51:34 +03:00
Debanjum Singh Solanky	675e821d95	Make embeddings, jsonl paths absolute. Create directories if non-existent	2022-08-05 02:57:59 +03:00
Debanjum Singh Solanky	d5b43eb836	Use input filter in image search setup. Input filter wasn't used earlier	2022-08-05 02:40:03 +03:00
Debanjum Singh Solanky	ca5a8bd113	Make config file a positional argument, as it is required - Test invalid config file path throws. Remove redundant cli test - Simplify cli parser code - Do not need to explicitly check if args.config_file set. argparser checks for positional arguments automatically - Use standard semantics for cli args - All positional args are required. Non positional args are optional - Improve command line --help description	2022-08-05 01:09:40 +03:00
Debanjum Singh Solanky	1374065092	Mark all required fields for config. Throw if no input_* field specified - Add custom validator to throw if neither input_filter or input_<files|directories> are specified - Set field expecting paths to type Path - Now that default_config isn't used in code. We can update fields in rawconfig to specify whether they're required or not. This lets pydantic validate config file and throw appropriate error	2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky	f78d6ae754	Create khoj_sample file with all configurable fields in one place - Reason - Simplifies code. No merge_dict required - 1 place for user to see all configurables, defaults and required values - Details - Remove default_config from code. Set defaults in khoj_sample.yml itself - Keep fields required to be set by user as empty in khoj_sample to YAML - Set defaults for fields not requiring configuration by user	2022-08-05 01:08:33 +03:00
Debanjum Singh Solanky	3abf3e5ee0	Update merge_dicts to recursively merge the dictionaries Previously it was only merging dictionary at the first/top level	2022-08-04 22:46:20 +03:00
Debanjum Singh Solanky	61c26ba611	Only show large Khoj favicon on web interface - Do not want browsers to use the small, grainy favicons - Firefox for Android does use the bigger icon, when it's the only one available - Update svg to match the 144x144 ratio just for consistency	2022-08-04 14:33:29 +03:00
Debanjum Singh Solanky	1649fa644c	Autofocus on Query field in Web Interface. Improve time to query	2022-08-04 05:23:19 +03:00
Debanjum Singh Solanky	71fcb1087f	Add icons for web interface to render on more browsers and as PWA Safari, Firefox for Android etc don't support SVG Favicons yet	2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky	5b6b7ec123	Delete khoj network connections on incremental search teardown on Emacs interface Currently only get into this state when debug breakpoints on backend are keeping the connection open and user exits khoj search from Emacs Results in a number of open connections that slow khoj down.	2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky	555c1088cc	Cache queries in /search controller using LRU cache - Most concretely right now, it eliminates the re-rank latency hit on re-rank triggered on user hitting enter after re-rank is already done on user idle in the emacs interface - Improves search latency of (incremental) search	2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky	38df727ef4	Fix escape sequence usage in strings. Remove unneeded import of os Rename /config API method to config to match its purpose. UI is anyway too generic, and not what it is doing	2022-08-03 18:51:55 +03:00
Debanjum Singh Solanky	f642450ed9	Disable Incremental Search for Images on Web Bug introduced in commit `da118b3fed`	2022-08-03 11:52:51 +03:00
Debanjum Singh Solanky	b9e6273644	Include interfaces in pip package. Fix paths to web interface in app	2022-08-03 00:02:39 +03:00
Debanjum Singh Solanky	1b55462fb0	Convert search_filter, conversation dir to proper modules Add __init__.py files to their directories	2022-08-02 20:23:42 +03:00
Debanjum Singh Solanky	5108d45951	Wrap application startup steps into a method	2022-08-02 20:13:14 +03:00
Debanjum Singh Solanky	0ebfbb43ce	Nest org, md results at level 2 on Emacs interface. Improve readability - Makes it easier to fold/unfold, traverse and read results - This 2 level nesting is already being used on the web interface - Previously we were using the original nesting depth of the entry. This was aimed at providing more of the original context of the results. But currently this additional information does not provide as much, for the decreased legibility of the results	2022-08-01 04:01:18 +03:00
Debanjum Singh Solanky	1201bfddf3	Simplify name of config css from config-style.css to config.css	2022-08-01 01:34:00 +03:00
Debanjum Singh Solanky	075dba5d64	Use Khoj Title, Favicon in Config Page for Consistency	2022-08-01 01:27:14 +03:00
Debanjum Singh Solanky	56a4429f01	Move web interface to configure application into src/interface/web directory - Improve code layout by ensuring all web interface specific code under the src/interface/web directory - Rename config API to more specific /config instead of /ui - Rename config data GET, POST api to /config/data instead of /config	2022-08-01 00:53:42 +03:00
Debanjum	bb2ccec1ca	Populate type dropdown on the web interface with only enabled search types - Previously we were statically populating types dropdown field in the web interface with all available search types - This change populates the type dropdown field with only search types that are enabled/configured - It queries the `/config` backend API to see which of the available search types are configured	2022-08-01 00:20:45 +03:00
Debanjum Singh Solanky	8b6058c879	Fix instantiating type field with value from URL query parameter - Populate via `.then` after enabled search types in dropdown are populated - Call to `/config` API is async and will usually complete after the value of type field is set from url - So value of type field would earlier be overridden when search types dropdown is populated after the call to `/config` API completes	2022-08-01 00:04:50 +03:00
Debanjum Singh Solanky	be253bab39	Populate type dropdown with only enabled search types in web interface - Get /config API and check config for which available search types are populated. This gives us the list of enabled search types - Dynamically populate search type field with enabled search types only	2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky	0abd40aeb7	Only set query field when appropriate query param passed via URL - Setting query value to default option when query param wasn't passed via URL was overriding placeholder text in query field - We wanted placeholder text in field, not the query field to actually be populated by placeholder text - This clears field when user starts typing query into the query field, instead of them having to manually delete the default text populated	2022-07-31 22:29:23 +03:00
Debanjum Singh Solanky	17c38b526a	Default config for each search type to None - Setting up default compressed-jsonl, embeddings-file was only required for org search_type, while org-files and org-filter were allowed to be passed as command line argument - This avoided having to set compressed-jsonl and embeddings-file via command line argument as well for org search type - Now that all search types are only configurable via config file, we can default all search types to None. The default config for the rest of the search types wasn't being used anyway	2022-07-31 22:23:57 +03:00
Debanjum Singh Solanky	b83021a723	Improve code readability of merge_dicts helper method	2022-07-31 22:07:56 +03:00
Debanjum Singh Solanky	38aede68f2	Only configure org via config file for consistency across search types - Previously org-files were configurable via cmdline args, whereas none of the other search types are - This is an artifact of how the application grew - It can be removed for better consistency and equal preference given all search types	2022-07-31 22:02:03 +03:00
Saba	b55159f5bd	Fix URL for khoj.el quelpa setup instructions	2022-07-29 23:01:04 -04:00
Debanjum Singh Solanky	da118b3fed	Simplify incremental search function used in web interface Re-rank isn't passed to image search API in search function. So don't need to check type in incremental_search function too	2022-07-29 23:18:01 +04:00
Debanjum Singh Solanky	3079614981	Allow set up of search form via query params in web interface - Default search type to org, instead of images	2022-07-29 23:13:26 +04:00
Debanjum Singh Solanky	02ca2c05a1	Add Eagle Icon for Khoj to Web, Emacs Interfaces and Readme	2022-07-29 17:50:29 +04:00
Debanjum Singh Solanky	78314263a0	Add Table of Contents, Features, Performance Details to Readme	2022-07-29 17:08:17 +04:00
Debanjum Singh Solanky	ed181f47c9	Prettify rendering of org music results on Khoj web interface	2022-07-29 04:28:22 +04:00
Debanjum Singh Solanky	7e5291a38e	Make org result headings at same level. Improve spacing of results Having org-mode result headings change size based on their depth in the source document makes for a confusing UI experience. Improve font-size, line-spacing and margins of results to make delineation between entries, and differentiating between entry heading and its body easier to visually infer. Do not white-space: pre-line. Improves rendering of Markdown results	2022-07-29 01:55:46 +04:00
Debanjum Singh Solanky	4d5183063c	Create images directory if doesn't exist, to store image search results	2022-07-28 21:30:31 +04:00
Debanjum Singh Solanky	a9bc17a6b0	Prettify Render of Markdown Results in Web Interface	2022-07-28 20:56:37 +04:00
Debanjum Singh Solanky	a6ae74f52e	Move JS files like org.js into a separate assets/ directory	2022-07-28 20:46:48 +04:00
Debanjum Singh Solanky	a12eaa4ce0	Move Khoj image results into a child images/ directory	2022-07-28 20:45:12 +04:00
Debanjum	a71253e137	Support Incremental Search on Web Interface ## Support Incremental Search on Khoj Web Interface - Use default, fast path to query /search API while user is typing - Upgrade to cross-encoder re-ranked results once user hits enter on search box ## Improve Render of Org Results on Web Interface - We were previously just wrapping results from /search API into a pre formatted div field. This was not easy to read - Use [org.js](https://mooz.github.io/org-js/) to render results from Khoj `/search` API as proper HTML - Improve org.js to render all task states, stylize task tags and make org-mode results look more like original content Closes #42 #41	2022-07-28 09:31:57 -07:00
Debanjum Singh Solanky	e8029bf415	Extract and Highlight org-mode tags in HTML render of search results	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	c6c248df26	Improve styling of org-mode results to original alignment, line breaks	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	9f59897eeb	Highlight all org-mode task states in HTML. Not just TODO, DONE. - Make logic to extract, mark todo state in org.js more generic - Add default todo state styling to html	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	f040b3f65c	Stylize TODO/DONE states with CSS	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	581b6097c7	Clean Results. Remove TOC, Heading Number and Property Drawers	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	965a93a2f2	Add Basic HTML Rendering of Org-Mode Results	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	1da44d4dfe	Add Incremental Search to Khoj Web Interface	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	af1dd31401	Do not pass verbose argument to image_search.query() as not supported	2022-07-28 19:52:58 +04:00
Debanjum Singh Solanky	80ac10835c	Rerank results on normal minibuffer exit In current state: - Rerank results: - If user idles while entering query OR - exits normally - Do not rerank results: - If user exits abnormally, e.g via C-g from query	2022-07-28 03:37:16 +04:00
Debanjum Singh Solanky	1b759597df	Make incremental search more robust. Follow standard user expectations - Rename functions to more standard, descriptive names - Keep known, required code for incremental search - E.g Do not set buffer local flag in hooks on minibuffer setup - Only query when user in khoj minibuffer - Use active-minibuffer-window and track khoj minibuffer - (minibuffer-prompt) is not useful for our use-case here - (For now) Run re-rank only if user idle while querying - Do not run rerank on teardown/completion - The reranking lag (~2s) is annoying; hit enter, wait to see results - Also triggered when user exits abnormally, so C-g also results in rerank which is even more annoying - Emacs will still hang if re-ranking gets triggered on idle but that's better than always getting triggered. And better than not having mechanism to get results re-ranked via cross-encoder at all	2022-07-28 02:52:27 +04:00
Debanjum Singh Solanky	9a6eee31be	Make number of results to get from Khoj API customizable in khoj.el	2022-07-27 18:55:18 +04:00
Debanjum Singh Solanky	9302b45fe0	Use khoj-incremental as the main khoj func. Rename khoj to khoj-simple - Update khoj-simple to work cross-encoder re-ranked results like before - Increment major version as incremental search considered a breaking change and a major update to search capability	2022-07-27 18:18:17 +04:00
Debanjum Singh Solanky	09727ac3be	Make bi-encoder return fewer results to reduce cross-encoder latency	2022-07-27 07:26:02 +04:00
Debanjum Singh Solanky	9ab3edf6d6	Re-rank incremental search results using cross-encoder if user idle This provides a relatively smooth mechanism - to improve relevance of results on idle - while providing the rapid, incremental results while typing	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	ad242cafa7	Support querying all text search types in incremental search - Before incremental search was hard-coded to only query org	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	bfcb962cbe	Use post-command-hook to only query on user input - Hooking into after-change-functions results in system logs triggering query	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	0d49398954	Reuse code to query api, render results. Formalize method, arg names	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	fd1963d781	Implement Basic Incremental Search Interface in Emacs for Org Mode Notes	2022-07-27 03:05:00 +04:00
Debanjum Singh Solanky	3fa7d8f03a	Skeleton to allow incremental search on Khoj via Emacs	2022-07-27 02:48:27 +04:00
Debanjum Singh Solanky	1168244c92	Make cross-encoder re-rank results if query param set on /search API - Improve search speed by ~10x Tested on corpus of 125K lines, 12.5K entries - Allow cross-encoder to re-rank results by setting &?r=true when querying /search API - It's an optional param that defaults to False - Earlier all results were re-ranked by cross-encoder - Making this configurable allows for much faster results, if desired but for lower accuracy	2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky	b1e64fd4a8	Improve search speed. Only apply filter if filter keywords in query - Formalize filters into class with can_filter() and filter() methods - Use can_filter() method to decide whether to apply filter and create deep copies of entries and embeddings for it - Improve search speed for queries with no filters as deep copying entries, embeddings takes the most time after cross-encoder scoring when calling the /search API Earlier we would create deep copies of entries, embeddings even if the query did not contain any filter keywords	2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky	f094c86204	Trace query response performance and display timings in verbose mode	2022-07-26 21:03:53 +04:00
Debanjum Singh Solanky	65fea7681a	Rename notes search type to org search, now that markdown notes supported	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	4c24202e42	Update documentation. Simplify, reflect current capabilities	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	d4d7dbaca6	Support Natural Search on Markdown Files - Reason: Allow natural search on markdown based notes, documentation, websites etc - Details: - Create markdown processor to extract Markdown entries (identified by Heading) into standard jsonl format required by text_search - Update API, Configs to support interfacing with new markdown type - Update Emacs, Web clients to support interfacing with new markdown type via API - Update Readme to mention markdown is also supported Closes #35	2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky	0602d018c0	Merge Symmetric, Asymmetric Search Types into a single Text Search Type - The code for both the text search types were mostly the same It was earlier done this way for expedience while experimenting - The minor differences were reconciled and merged into a single text_search type - This simplifies the app and makes it easier to process other text types	2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky	0917f1574d	Consolidate jsonl helper methods in a single file under utils module	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	de726c4b6c	Minor fixes to unused installer utility script	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	5aad297286	Reuse logic to extract entries across symmetric, asymmetric search Now that the logic to compile entries is in the processor layer, the extract_entries method is standard across (text) search_types Extract the load_jsonl method as a utility helper method. Use it in (a)symmetric search types	2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky	e220ecc00b	Generate compiled form of each transaction directly in the beancount processor - The logic for compiling a beancount entry (for later encoding) now completely resides in the beancount-to-jsonl processor layer - This allows symmetric search to be generic and not be aware of beancount specific properties that were extracted by the beancount-to-jsonl processor layer - Now symmetric search just expects the jsonl to (at least) have the 'compiled' and 'raw' keys for each entry. What original text the entry was compiled from is irrelevant to it. The original text could be location, transaction, chat etc, it doesn't have to care	2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky	06cf425314	Generate compiled form of each entry directly in the org-mode processor - The logic for compiling an org-mode entry (for later encoding) now completely resides in the org-to-jsonl processor layer - This allows asymmetric search to be generic and not be aware of org-mode specific properties that were extracted by the org-to-jsonl processor layer - Now asymmetric search just expects the jsonl to (at least) have the 'compiled' and 'raw' keys for each entry. What original text the entry was compiled from is irrelevant to it. The original text could be mail, chat, markdown, org-mode etc, it doesn't have to care	2022-07-21 02:08:02 +04:00
Debanjum Singh Solanky	4ead79d272	Make Notes Search Natural Language Date Aware - Pass Scheduled, Closed Dates of Entries to Include in Embeddings - The (new?) model seems to understand dates. So can give more relevant entries if date in natural language mentioned in query - E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984" will give different results, with the second prioritizing entries mentioning any entries with closed, scheduled dates from 1984	2022-07-21 01:06:49 +04:00
Debanjum Singh Solanky	d50bfb5188	Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests	2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky	70e70d4b15	Rename 'embed' key to more generic 'compiled' for jsonl extracted results - While it's true those strings are going to be used to generate embeddings, the more generic term allows them to be used elsewhere as well - Their main property is that they are processed, compiled for usage by semantic search - Unlike the 'raw' string which contains the external representation of the data, as is	2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky	c1369233db	Consistently use "entry", "score" in json response for all search types - Had already made some progress on this earlier by updating the image search responses. But needed to update the text search responses to use lowercase entry and score - Update khoj.el to consume the updated json response keys for text search	2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky	d68a9dc445	Sort extracted images before computing their embeddings - Image order returned by glob is OS dependent - This prevented sharing image embeddings across machines running different OS - A stable sort order for processed images allows sharing embeddings across machines. - Use case: A more powerful, always on machine actually computes the image embeddings regularly. The client machine just loads these periodically to provide semantic search functionality	2022-07-20 03:51:27 +04:00
Debanjum Singh Solanky	c4c7f38b15	Fix extracting image names from multiple image directories	2022-07-20 03:40:49 +04:00
Debanjum Singh Solanky	bdc1b9f2bb	Resolve edge case errors in encoding image metadata - Handle case where current image batch smaller than batch_size - Handle case where no XMP metadata for current image - return empty strings in such a scenario instead of ". "	2022-07-20 02:58:43 +04:00
Debanjum Singh Solanky	2a5445216c	Image input directory not required by collate result as image_name already absolute path	2022-07-20 02:56:23 +04:00
Debanjum Singh Solanky	6c9ffdba57	Allow indexing multiple image directories for image search	2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky	70221bb038	Allow filtering transactions by date in symmetric ledger	2022-07-19 20:58:24 +04:00
Debanjum Singh Solanky	b673d26a12	Extract Entries in a standardized format across text search types Issue: - Had different schema of extracted entries for symmetric_ledger vs asymmetric - Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings - This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters - This limited the filters to only work for asymmetric search, not for symmetric_ledger - Fix - Use consistent format for extracted entries { 'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings, 'raw' : raw_entry_string_meant_to_be_passed_to_use } - Result - Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type	2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky	e66cd5bf59	Only extract transactions from Beancount - Earlier was extracting all entries starting with dates but the other type of entries like account open/close, asserts etc aren't useful for querying	2022-07-19 19:50:58 +04:00
Debanjum Singh Solanky	732b2d287f	Give the project a short, less generic name. Rename it to Khoj - Semantic Search was just a placeholder used to test the idea out Didn't want to get into naming at that point of time	2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky	989526ae54	Use a more accurate model for symmetric semantic search - The all-MiniLM-L6-v2 is more accurate - The exact previous model isn't benchmarked but based on the performance of the closest model to it, it seems like the new model may be similar in speed and size - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	4a90972e38	Use a better model for asymmetric semantic search - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1] - It has the right mix of model query speed, size and performance on benchmarks - On hugging face it has way more downloads and likes than the msmarco model[2] - On very preliminary evaluation of the model - It doubles the encoding speed of all entries (down from ~8min to 4mins) - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier) [1]: https://www.sbert.net/docs/pretrained_models.html [2]: https://huggingface.co/sentence-transformers	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	5e302dbcda	Fix using 1 column layout on small screens	2022-07-18 02:40:16 +04:00
Debanjum Singh Solanky	7d16b673b1	Use Single Column Layout for Small Screens on Web Interface	2022-07-18 02:08:52 +04:00
Debanjum Singh Solanky	31a221a76b	Auto focus cursor on query input box to simplify, speed interactions - Avoids having to click the query input box - Just open page, type whatever and hit enter to do image search - For other search types select appropriate type from dropdown	2022-07-16 19:39:15 +04:00
Debanjum Singh Solanky	06b0c720d6	Improve Rendering of Image Search Results in Emacs - Use shr to render image response from html in result buffer Earlier was using org-mode. But rendering HTML with shr seems cleaner - Use Headings to Add highlights - Use Random to Force fetch of Image. Similar to what was done for Web interface - Remove trailing elisp brackets from response - Show query match scores by image model for each image in results	2022-07-16 19:31:49 +04:00
Debanjum Singh Solanky	28ec9af589	Extract image URL location from response in elisp after API update	2022-07-16 18:43:55 +04:00
Debanjum Singh Solanky	47613cba1f	Improve Landing Page Look in General and Layout for Mobile - Ask for 6 Images to Fill Grid into 3x2 Layout - Submit Form on Hitting Enter	2022-07-16 16:55:13 +04:00
Debanjum Singh Solanky	cf207d6ebe	Add title, heading to the semantic search web interface	2022-07-16 03:44:29 +04:00
Debanjum Singh Solanky	e0d8398b27	Normalize metadata match score to work better with image match score - Metadata match scores were consistently higher by a factor of ~3x with respect to image match scores. This was resulting in all results being from the metadata match with query and none from the image match with query. - Scaling the metadata match scores down by a scaling factor seems to more consistently give a blend of results from both image and metadata matches	2022-07-16 03:39:33 +04:00
Debanjum Singh Solanky	a3fc82817d	Log and continue on image metadata encoding error due to Tensor size mismatch	2022-07-16 03:39:19 +04:00
Debanjum Singh Solanky	f26d0ddbbd	Minor fix to asymmetric search when no entries returned	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	ca3f93e641	Add button on web interface to regenerate embeddings of specified type	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	231cc91e14	Force reload of images every time user clicks search button Adding a random, unused url param at the end of the img.src string fixes the issue. As the browser thinks it's a new image and doesn't use the image data that's already cached because of which it wasn't even making the fetch call for the image	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	a6aef62a99	Create Basic Landing Page to Query Semantic Search and Render Results - Allow viewing image results returned by Semantic Search. Until now there wasn't any interface within the app to view image search results. For text results, we at least had the emacs interface - This should help with debugging issues with image search too For text the Swagger interface was good enough	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	4e27ae0577	Ease access to image result for given query by image_search - Copy images to accessible directory - Return URL paths to them to ease access - This is to be used in the web interface to render image results directly in browser - Return image, metadata scores for each image in response as well This should help get a better sense of image scores along both XMP metadata and whole image axis	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	801e59a20d	Allow explicit filters when querying Ledger transactions	2022-07-15 23:41:54 +04:00
Debanjum Singh Solanky	0e979587e0	Add configurable filter support to Symmetric Ledger Search	2022-07-14 23:40:41 +04:00
Debanjum Singh Solanky	85077bc1d1	Handle unparseable date range passed via date filter in query - Do not reuse the same list - Just create new list, so only parsed data is in it	2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky	a60de2c02b	Include date filter in asymmetric search on music as well	2022-07-14 22:37:17 +04:00
Debanjum Singh Solanky	c3b3e8959d	Put entry splitting regex in explicit filter into a variable for code readability	2022-07-14 22:00:10 +04:00
Debanjum Singh Solanky	3aac3c7d52	Run explicit filter on raw entry, add more terms to split entries by - With \t Last Word in Headings was suffixed by \t and so couldn't be filtered by - User interacts with raw entries, so run explicit filters on raw entry - For semantic search using the filtered entry is cleaner, still	2022-07-14 21:54:04 +04:00
Debanjum Singh Solanky	7640e2ab0c	Wrap attempt to extract dates from entry in try/catch - Not all YYYY-MM-DD strings in entry are necessarily dates	2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky	9de2097182	Fix date filter usage with multi word queries. Simplify date regex	2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	b54588717f	Filter for entries with dates specified by user in query - Create Date filter - Users can pass dates in YYYY-MM-DD format in their query - Use it to filter asymmetric search to user specified dates	2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky	b82aef26bf	Make filters to apply before semantic search configurable Details -- - The filters to apply are configured for each type in the search controller - Multiple filters can be applied on the query, entries etc before search - The asymmetric query method now just applies the passed filters to the query, entries and embeddings before semantic search is performed Reason -- This abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky	c92789d20a	Extract explicit pre-search filter function into a separate module Details -- - Move explicit_filters function into separate module under search_filter - Update signature of explicit filter to take and return query, entries, embeddings - Use this explicit_filter func from search_filters module in query Reason -- Abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky	6d7ab50113	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query - Issue - Explicit filtering was earlier being done after search by bi-encoder but before re-ranking by cross-encoder - This was limiting the quality of results being returned. As the bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder by re-ranking the remaining results based on query - Fix - Given the embeddings corresponding to an entry are at the same index in their respective lists. We can run the filter for blocked, required words before the search by the bi-encoder model. And limit entries, embeddings being considered for the current query - Result - Semantic search by the bi-encoder gets to return most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving results - Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query - E.g limit search for entries within date/time specified in query	2022-07-12 18:25:42 +04:00
Debanjum Singh Solanky	7677465f23	Fix passing of device to setup method in /reload, /regenerate API - Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API - Set default argument to torch.device('cpu') instead of 'cpu' to be more formal	2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	b89fc2f4ac	Add /reload API to reload model embeddings and entries from file - The reload API adds the ability to separate out the loading of embeddings from file without having to restart app or (re-)generate embeddings - Before this the only way to load model from file was by restarting app - The other way to reload the model embeddings by regenerating them was too expensive for larger datasets - This unlocks at least 1 use-case, where - we regenerate model via an app instance running on a separate server and - just reload the generated embeddings on the client device - This allows us to offload the expensive embedding generation compute to a background server while letting - This avoids having to (re-)restart application on client device or be forced to generate embeddings on the client device itself - But it requires the model relevant files to be synced to the client device This can be done with any file syncing application like Syncthing - We can then call /regenerate on server and /reload on client on a regular schedule to keep our data up to date on semantic search	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	f5d6d1e752	Tiny style fix to separate functions by 2 newlines	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be committable - General enough that it'll work even when a user's notes are not in the home directory - While solving for the special case where: - Notes are being processed on a different machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	094eaf3fcc	Fix minor bugs in OrgNode parser - Bugs discovered from writing org-node tests	2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky	36495038dd	Fix storing parsed CLOSED date in OrgNode The CLOSED date was getting parsed but not stored Adding setClosed at start also fixed the issue	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	1c5754bf95	Simplify storing Tags in OrgNode object - Use Set for Tags instead of dictionary with empty keys - No Need to store First Tag separately - Remove properties methods associated with storing first tag separately - Simplify extraction of tags string in org_to_jsonl - Split notes_string creation into multiple f-string in separate line for code readability	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	51a43245d3	Escape square brackets in file+heading based org-mode links	2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky	04610f453a	Include scheduled date, deadline date and close date in repr of org node - Now that the times line is excluded from the raw body of node, show it in repr so user can see it for reference - But the model doesn't need to see it for its embeddings to be confused by	2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky	367d7377df	Ignore scheduled, closed, deadline time and logbook start, end in org node body - Gives cleaner embeddings for semantic search - Hopefully improves results and reduces size, compute	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	b77ccadcba	Make property key regex more strict. Property key has to be alphanumeric	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	ac9d746444	Fix Tags extraction in Org Node parser - Previous version required two tags at least to work, not sure why - Fixed it to extract all tags, even if only one tag in heading	2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky	fb86be8cd9	Add ID, File+Heading based Links to Org-Mode Entries - Add links to property drawer - This ensures results returned by semantic search contain these links - This allows the user to jump to entry within original file for context - The ID, file+heading based links are more robust to find relevant entry in original file than the line no based link, as edits being done by user to original files between embedding regenerations	2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky	de23fc2051	Revert Add Scheduled, Deadline date to Model Embeddings for Date Aware Search Sentence Transformer MSMarco Model isn't date aware So no use of adding scheduled, deadline dates to model embeddings for consideration This reverts commit `a2a08d1354`.	2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky	a2a08d1354	Add Scheduled, Deadline date to Model Embeddings for Date Aware Search	2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky	cfbd5c4ecc	Update global model on regenerate via API	2022-06-17 00:49:06 +03:00
Debanjum Singh Solanky	c78bf84eef	Introduce search api endpoint that auto infers search type intent - Introduce prompt for GPT to automatically extract user's search intent - Expose new search api endpoint to use that to set SearchType being passed to search API - Currently meant as an experimental API to gauge usefulness, extendability. Evaluating for phone or voice use-case	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	8ef7917014	Fix json format passed in prompt to GPT	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	f57b7f65ea	Wrap prompts for GPT in triple quotes to improve prompt readability To improve prompt readability: - Remove newline escape sequence and use actual newline directly - This avoids one long line of text as prompt and - Remove escaping of double quotes	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1eba7b1c6f	Use empty_escape_sequence constant to strip response text from gpt	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1c3a1420f8	Update asymmetric extract_entries method to handle uncompressed jsonl This is similar to what was done for the symmetric extract_entries method earlier	2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky	3d8a07f252	Extract empty line escape sequences var into constants file for reuse	2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky	bb5d0d8908	Improve Semantic Search Buffer Names in Emacs - Allow multiple semantic searches buffers to exist simultaneously - Uniquify semantic search buffer names - Add query and search-type to semantic search buffer name for easier disambiguation, search and find appropriate	2022-02-26 18:30:14 -05:00
Debanjum Singh Solanky	b68558651b	Improve Extraction of Beancount Entries - Only extract entries starting with YYYY-MM-DD from Beancount - Strip Trailing Escape Sequences from Entries	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	b3ac2dd730	Improve Results Rendered on Emacs from Semantic Search on Ledger - Add search query to top of buffer as Beancount comment - Remove trailing ) from response - Separate entries by empty line - Load beancount-mode in semantic search on ledger buffer	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	502c68d4f8	Remove trailing escape sequence in ledger search response entries - Fix loading entries from jsonl in extract_entries method - Only extract Title from jsonl of each entry This is the only thing written to the jsonl for symmetric ledger - This fixes the trailing escape seq in loaded entries - Remove the need for semantic-search.el response reader to do pointless complicated cleanup - Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl Both methods were doing similar work - Make load_jsonl handle loading entries from both gzip and uncompressed jsonl	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	248aa632c0	Do not throw warning for beancount files with .beancount extension	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	76cd63f4bd	Fix count of processed jsonl entries shown to user by ledger processor Count lines not chars	2022-02-26 17:46:06 -05:00
Saba	33bc62dc19	Fix type of use_xmp_metadata to be bool, rather than str	2022-01-24 21:53:26 -05:00
Debanjum Singh Solanky	179153dc5a	Rename RawConfig Types for Consistency - Naming convention - [ContentType][ConfigType]Config - Where [ConfigType] ~ Content, Search, Processor - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation - Current Configs: - Content: - Org Notes - Org Music - Image - Ledger/Beancount - Search: - Asymmetric - Symmetric - Image - Processor: - Conversation	2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky	c64e0c2965	Load model from HuggingFace if model_directory unset in config YAML - Do not save/load the model to/from disk when model_directory unset in config.yml - Add symmetric search default config to cli.py	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	510faa1904	Save Image Search Model to Disk	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	934ec233b0	Add Search Config for Symmetric Model. Save Model to Disk	2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky	b63026d97c	Save Asymmetric Search Model to Disk - Improve application load time - Remove dependence on internet to startup application and perform semantic search	2022-01-14 17:36:27 -05:00
Debanjum Singh Solanky	2e53fbc844	Fix the user intent extraction prompt for GPT. Clean up chatbot test	2022-01-12 10:36:01 -05:00
Debanjum Singh Solanky	ea28897cdd	Remove deprecated conversation_history field from config	2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky	5a686b7be9	Add logs for chat bot in verbose mode	2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky	6dc2a99d35	Merge branch 'master' of github.com:debanjum/semantic-search into add-summarize-capability-to-chat-bot - Fix openai_api_key being set in ConfigProcessorConfig - Merge addition of config UI and config instantiation updates	2021-12-20 13:30:42 +05:30
Debanjum Singh Solanky	65da7daf1f	Load, Save Conversation Session Summaries to Log. s/chat_log/chat_session Conversation logs structure now has session info too instead of just chat info Session info will allow loading past conversation summaries as context for AI in new conversations { "session": [ { "summary": <chat_session_summary>, "session-start": <session_start_index_in_chat_log>, "session-end": <session_end_index_in_chat_log> }], "chat": [ { "intent": <intent-object> "trigger-emotion": <emotion-triggered-by-message> "by": <AI\|Human> "message": <chat_message> "created": <message_created_date> }] }	2021-12-15 10:17:07 +05:30
Saba	97a6dfaa1e	Use default value False for verbose parameter, and small changes Pass config as parameter to initialize_search, change name of API methods to handle config CRUD operations, and initialize config to FullConfig	2021-12-11 14:13:14 -05:00
Saba	9536358d34	Fix key error model_name issue by upgrading sentence-transformers version Refer to https://github.com/UKPLab/sentence-transformers/issues/1241 Also use verbose flag passed through function parameters in image_search	2021-12-11 11:58:19 -05:00
Saba	ce7a751e6b	Fix passing verbose flag down in symmetric_ledger.py	2021-12-11 11:36:32 -05:00
Saba	d65190c3ee	Update unit tests, files with removing model suffix to config types	2021-12-09 08:50:38 -05:00
Debanjum Singh Solanky	0ac1e5f372	Summarize chat logs and notes returned by semantic search via /chat API	2021-12-08 02:34:07 +05:30
Saba	76e9e9da2f	Update unit tests to use the new BaseModel types	2021-12-05 09:31:39 -05:00
Saba	9b16cdbb41	Use past tense for verbose log	2021-12-04 11:45:44 -05:00
Saba	10e4065e05	Consolidate the search config models and pass verbose as a top level flag	2021-12-04 11:43:48 -05:00
Saba	43e647835b	Append Model Suffixed to config models	2021-12-04 10:51:21 -05:00
Saba	e068968b35	Update imports for raw config models in config.py	2021-12-04 10:44:55 -05:00
Saba	4d6284b0af	Remove Test suffix from Config models	2021-12-04 10:44:13 -05:00
Saba	7fcc8d2cef	Add null check for processor config	2021-12-04 10:11:00 -05:00
Saba	7ca4fc3453	Resolve merge conflicts with updated processor conversation data model	2021-11-28 16:22:52 -05:00
Saba	87a6c2d716	Use parse_obj instead of parse_raw as incoming data is in dict	2021-11-28 14:34:32 -05:00
Saba	5d50487d83	Linting New line at end of config.html Remove debug print statement	2021-11-28 13:32:56 -05:00
Saba	6f466c8d99	Use global config and add a regenerate button to the config ui	2021-11-28 13:28:22 -05:00
Saba	34d1e4199c	Use alias generator when deserializing the config file	2021-11-28 13:05:48 -05:00
Saba	19b81e82f0	Write back to the raw config.yml file on update	2021-11-28 12:34:40 -05:00
Saba	8837b02de6	dump updated config to a yaml file	2021-11-28 12:26:07 -05:00
Saba	5b80b87379	Streamline None checking in initialize_search	2021-11-28 12:05:04 -05:00
Saba	bf8ae31e6a	Streamline None checking in initialize_search	2021-11-28 11:59:45 -05:00
Saba	da52433d89	Update to re-use the raw config base models in config.py as well	2021-11-28 11:57:33 -05:00
Saba	6292fe4481	Update to re-use the raw config base models in config.py as well	2021-11-28 11:57:13 -05:00
Saba	311c4b7e7b	Working API request body parsing to /post config!	2021-11-28 11:16:33 -05:00
Saba	66183cc298	Working API request body parsing to /post config!	2021-11-28 11:12:26 -05:00
Debanjum Singh Solanky	5cd920544d	Add GPT method to summarize notes and chat logs	2021-11-28 13:08:05 +05:30
Debanjum Singh Solanky	1785047ea6	Improve understand primer and load understand response as dict	2021-11-28 13:04:16 +05:30
Saba	64645c3ac1	Begin type checking/input validation effort	2021-11-27 21:47:56 -05:00
Saba	9a0264b7fc	Add a dummy POST config endpoint, integrate with editable UI	2021-11-27 20:36:03 -05:00
Saba	f3b03ea5b7	Make raw data reactive to changes	2021-11-27 19:17:15 -05:00
Debanjum Singh Solanky	67c3cd7372	Wire up GPT understand method to /chat API. Log conversation metadata too	2021-11-28 00:04:39 +05:30
Saba	3db06eee3f	Basic example of serving config as JSON and retrieving on button click	2021-11-27 10:49:33 -05:00
Saba	3d4471e107	Merge branch 'master' of github.com:debanjum/semantic-search into saba/configui	2021-11-27 08:52:48 -05:00
Debanjum Singh Solanky	ccfb97e1a7	Wire up minimal conversation processor. Expose it over /chat API endpoint Ensure conversation history persists across application restart	2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky	a99b4b3434	Make conversation processor configurable	2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky	d4e1120b22	Add GPT based conversation processor to understand intent and converse with user - Allow conversing with user using GPT's contextually aware, generative capability - Extract metadata, user intent from user's messages using GPT's general understanding	2021-11-27 18:12:01 +05:30
Saba	baee52648d	Set up basic ui page with no functionality	2021-11-26 14:51:11 -05:00
debanjum	46661b3057	Ensure top_k never more than total entries to run symmetric search on	2021-11-16 11:32:21 -08:00
debanjum	8c858d1a94	Reduce symmetric search results for cross-encoder to re-rank to improve search speed	2021-11-16 11:31:19 -08:00
Debanjum Singh Solanky	f3fd5ae978	Improve code comments. Do not import unused modules in asymmetric search	2021-11-17 00:58:31 +05:30
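
Several commits above describe the same search pipeline: run explicit filters on entries and embeddings before semantic search (6d7ab50113), normalize embeddings and score with a dot product instead of cosine (eda4b65ddb), and never let top_k exceed the number of entries (46661b3057). The following is a minimal sketch of that idea in plain Python, not the repository's code. The names `prefilter`, `normalize` and `search`, the substring-based word matching, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def prefilter(entries, embeddings, required=(), blocked=()):
    """Drop entries failing the word filters, keeping embeddings index-aligned.

    Hypothetical helper: matching is a simple lowercase substring check.
    """
    kept = [
        i for i, entry in enumerate(entries)
        if all(word in entry.lower() for word in required)
        and not any(word in entry.lower() for word in blocked)
    ]
    return [entries[i] for i in kept], [embeddings[i] for i in kept]


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_embedding, entries, embeddings, top_k=5):
    """Rank entries by dot product of normalized vectors, highest score first."""
    top_k = min(top_k, len(entries))  # never ask for more results than entries
    query = normalize(query_embedding)
    scored = [
        (sum(a * b for a, b in zip(query, normalize(emb))), entry)
        for entry, emb in zip(entries, embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: the embedding at index i belongs to the entry at index i.
entries = ["Buy milk today", "Read paper on search", "Search notes with khoj"]
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

kept_entries, kept_embeddings = prefilter(
    entries, embeddings, required=["search"], blocked=["paper"]
)
results = search([1.0, 1.0], kept_entries, kept_embeddings, top_k=10)
```

Filtering first means the ranking step only ever sees entries that can appear in the final results, which is the motivation given in 6d7ab50113. In a real setup the embeddings would be normalized once at load time rather than on every query.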


389 commits