sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2025-02-18 22:34:20 +00:00

Author	SHA1	Message	Date
Debanjum	6356feb637	Make filters applied before semantic search configurable Reason -- This abstraction will simplify adding other pre-search filters. E.g A date-time filter Capabilities -- - Multiple filters can be applied on the query, entries etc before search - The filters to apply are configured for each type in the search controller Details -- - Move `explicit_filters` function into separate module under `search_filter` - Update signature of explicit filter to take and return `query`, `entries`, `embeddings` - Use this `explicit_filter` function from `search_filters` module in `search` method in controller - The asymmetric query method now just applies the passed filters to the `query`, `entries` and `embeddings` before semantic search is performed	2022-07-13 05:53:02 -07:00
Debanjum Singh Solanky	b82aef26bf	Make filters to apply before semantic search configurable Details -- - The filters to apply are configured for each type in the search controller - Muliple filters can be applied on the query, entries etc before search - The asymmetric query method now just applies the passed filters to the query, entries and embeddings before semantic search is performed Reason -- This abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky	c92789d20a	Extract explicit pre-search filter function into a separate module Details -- - Move explicit_filters function into separate module under search_filter - Update signature of explicit filter to take and return query, entries, embeddings - Use this explicit_filter func from search_filters module in query Reason -- Abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:20:04 +04:00
Debanjum	589bfa9424	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query ## Issue - Explicit filtering was being done after search by the bi-encoder but before re-ranking by the cross-encoder - This limited the quality of results being returned for queries with explicit filters. The bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder, by re-ranking the remaining results to best match the query ## Fix - Given that the entry and its embedding are at the same index in their respective lists. We know which entries map to which embedding tensors. So we can run the filter for blocked, required words before the bi-encoder search. And limit entries, embeddings being considered for the current query ## Result - Semantic search by the bi-encoder returns the most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving the results ## Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query, before calling search - E.g limit search to entries within date/time specified in query	2022-07-12 13:12:22 -07:00
Debanjum Singh Solanky	741fca0e6b	Fix asymmetric search test to pass entries returned by query to collate_results	2022-07-12 18:48:49 +04:00
Debanjum Singh Solanky	6d7ab50113	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query - Issue - Explicit filtering was earlier being done after search by bi-encoder but before re-ranking by cross-encoder - This was limiting the quality of results being returned. As the bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder by re-ranking the remaining results based on query - Fix - Given the embeddings corresponding to an entry are at the same index in their respective lists. We can run the filter for blocked, required words before the search by the bi-encoder model. And limit entries, embeddings being considered for the current query - Result - Semantic search by the bi-encoder gets to return most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving results - Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query - E.g limit search for entries within date/time specified in query	2022-07-12 18:25:42 +04:00
sabaimran	36ef37e940	Fix formatting for pytest command Use org formatting rather than md.	2022-07-08 10:18:26 -04:00
sabaimran	d6945f4f6b	Merge pull request #29 from debanjum/saba/fix-docker-build Address Issues with Docker builds	2022-07-06 21:32:37 -04:00
Saba	2eb44c7a64	Correct syntax of memory limit in docker-compose.yml	2022-07-06 20:07:11 -04:00
Debanjum Singh Solanky	8bb9a49994	Cleanup Test Asymmetric Search to Fix Test - test_regenerate_with_valid_content failed when run after test_asymmetric_search - test_asymmetric_search did't clean the temporary update to config it had made - This was resulting in regenerate looking for a file that didn't exist	2022-07-07 01:25:31 +04:00
Saba	7bb35ccc7e	Run build on PR	2022-07-04 18:09:47 -04:00
Saba	07a56c4ab6	Add specific version for Python packages and downgrade miniconda Docker image to potentially fix build issues	2022-07-04 18:01:55 -04:00
Saba	0f88abd219	Allocate 8GB of memory to docker container. Adjust path to Dockerfile in Github action	2022-07-04 14:01:59 -04:00
Saba	092d0f2f21	Move Dockerfile to project root to avoid permissions issues. Allocate more memory to docker-compose to avoid OOM	2022-07-04 12:33:55 -04:00
Debanjum Singh Solanky	7677465f23	Fix passing of device to setup method in /reload, /regenerate API - Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API - Set default argument to torch.device('cpu') instead of 'cpu' to be more formal	2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	2f7ef08b11	Add Unit Tests to verify the Reload API functions as desired	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	b89fc2f4ac	Add /reload API to reload model embeddings and entries from file - The reload API adds the ability to separate out the loading of embeddings from file without having to restart app or (re-)generate embeddings - Before this the only way to load model from file was by restarting app - The other way to reload the model embeddings by regenerating them was to expensive for larger datasets - This unlocks at least 1 use-case, where - we regenerate model via an app instance running on a separate server and - just reload the generated embeddings on the client device - This allows us to offload the expensive embedding generation compute to a background server while letting - This avoids having to (re-)restart application on client device or be forced to generate embeddings on the client device itself - But it requires the model relevant files to be synced to the client device This can be done with any file syncing application like Syncthing - We can then call /regenerate on server and /reload client on a regular schedule to keep our data up to date on semantic search	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	f5d6d1e752	Tiny style fix to separate functions by 2 newlines	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	d64bed26f3	Make Docker ignore unnecessary files	2022-06-29 22:29:34 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be commitable - General enough that it'll work even when a users notes are not in the home directory - While solving for the special case where: - Notes are being processed on a different machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	094eaf3fcc	Fix minor bugs in OrgNode parser - Bugs discovered from writing org-node tests	2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky	f66192f2a7	Test OrgNode Parsing and Rendering	2022-06-17 19:13:11 +03:00
Debanjum Singh Solanky	36495038dd	Fix storing parsed CLOSED date in OrgNode The CLOSED date was getting parsed but not stored Adding setClosed at start also fixed the issue	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	1c5754bf95	Simplify storing Tags in OrgNode object - Use Set for Tags instead of dictionary with empty keys - No Need to store First Tag separately - Remove properties methods associated with storing first tag separately - Simplify extraction of tags string in org_to_jsonl - Split notes_string creation into multiple f-string in separate line for code readability	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	51a43245d3	Escape square brackets in file+heading based org-mode links	2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky	04610f453a	Include scheduled date, deadline date and close date in repr of org node - Now that excluding the times line from the raw body of node, show it in repr so user can see it for reference - But the model doesn't need to see it for it's embeddings to be confused by	2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky	367d7377df	Ignore scheduled, closed, deadline time and logbook start, end in org node body - Gives cleaner embeddings for semantic search - Hopefully improves results and reduces size, compute	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	b77ccadcba	Make property key regex more strict. Property key has to be alphanumeric	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	ac9d746444	Fix Tags extraction in Org Node parser - Previous version required two tags at least to work, not sure why - Fixed it to extract all tags, even if only one tag in heading	2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky	fb86be8cd9	Add ID, File+Heading based Links to Org-Mode Entries - Add links to property drawer - This ensures results returned by semantic search contain these links - This allows the user to jump to entry within original file for context - The ID, file+heading based links are more robust to find relevant entry in original file than the line no based link, as edits being done by user to original files between embedding regenerations	2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky	de23fc2051	Revert Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search Sentence Transformer MSMarco Model isn't date aware So no use of adding scheduled, deadline dates to model embeddings for consideration This reverts commit `a2a08d1354`.	2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky	a2a08d1354	Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search	2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky	cfbd5c4ecc	Update global model on regenerate via API	2022-06-17 00:49:06 +03:00
Debanjum	35117af322	Show Demo of Semantic Search in Readme Merge pull request #27 from debanjum/debanjum/add-demo	2022-05-14 01:32:18 -07:00
Debanjum Singh Solanky	2eab256af9	Delete markdown file. It helped upload the demo video to Github	2022-05-14 04:30:20 -04:00
Debanjum Singh Solanky	96c588b7bc	Add demo of semantic search to repository	2022-05-14 04:29:25 -04:00
Debanjum	19f8f85333	Show Demo of Semantic Search in Readme - Use Markdown file to help upload demo to Github - Use generated link from upload into Readme org file	2022-05-14 01:29:13 -07:00
Debanjum Singh Solanky	031d6bddb4	Delete markdown file. It helped upload the demo video to Github	2022-05-14 04:25:17 -04:00
Debanjum Singh Solanky	c78bf84eef	Introduce search api endpoint that auto infers search type intent - Introduce prompt for GPT to automatically extract user's search intent - Expose new search api endpoint to use that to set SearchType being passed to search API - Currently meant as an experimental API to gauge usefulness, extendability. Evaluating for phone or voice use-case	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	8ef7917014	Fix json format passed in prompt to GPT	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	f57b7f65ea	Wrap prompts for GPT in triple quotes to improve prompt readability To prompt improve readability: - Remove newline escape sequence and use actual newline directly - This avoids one long line of text as prompt and - Remove escaping of double quotes	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1eba7b1c6f	Use empty_escape_sequence constant to strip response text from gpt	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	37bfc956c9	Update Readme Local Development Section	2022-02-27 23:16:58 -05:00
Debanjum Singh Solanky	1c3a1420f8	Update asymmetric extract_entries method to handle uncompressed jsonl This is similar to what was done for the symmetric extract_entries method earlier	2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky	3d8a07f252	Extract empty line escape sequences var into constants file for reuse	2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky	624a3faf92	Update Readme. Improve Organization, Reduce Staleness	2022-02-26 19:04:49 -05:00
Debanjum Singh Solanky	bb5d0d8908	Improve Semantic Search Buffer Names in Emacs - Allow multiple semantic searches buffers to exist simultaneously - Uniquify semantic search buffer namew - Add query and search-type to semantic search buffer name for easier disambiguration, search and find appropriate	2022-02-26 18:30:14 -05:00
Debanjum	6a84ca965a	Merge pull request #25 from debanjum/users/debanjum/improve-semantic-search-on-ledger Improve Extraction and Rendering of Semantic Search on Ledger	2022-02-26 15:18:22 -08:00
Debanjum Singh Solanky	b68558651b	Improve Extraction of Beancount Entries - Only extract entries starting with YYYY-MM-DD from Beancount - Strip Trailing Escape Sequences from Entries	2022-02-26 17:48:45 -05:00

... 40 41 42 43 44 ...

2315 commits