sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2025-02-18 22:04:20 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	67e9366c0f	Minor style fix. Use consistent/standard dates for date_filter tests	2022-07-14 20:06:39 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	b54588717f	Filter for entries with dates specified by user in query - Create Date filter - Users can pass dates in YYYY-MM-DD format in their query - Use it to filter asymmetric search to user specified dates	2022-07-14 00:51:02 +04:00
Debanjum	6356feb637	Make filters applied before semantic search configurable Reason -- This abstraction will simplify adding other pre-search filters. E.g A date-time filter Capabilities -- - Multiple filters can be applied on the query, entries etc before search - The filters to apply are configured for each type in the search controller Details -- - Move `explicit_filters` function into separate module under `search_filter` - Update signature of explicit filter to take and return `query`, `entries`, `embeddings` - Use this `explicit_filter` function from `search_filters` module in `search` method in controller - The asymmetric query method now just applies the passed filters to the `query`, `entries` and `embeddings` before semantic search is performed	2022-07-13 05:53:02 -07:00
Debanjum Singh Solanky	b82aef26bf	Make filters to apply before semantic search configurable Details -- - The filters to apply are configured for each type in the search controller - Multiple filters can be applied on the query, entries etc before search - The asymmetric query method now just applies the passed filters to the query, entries and embeddings before semantic search is performed Reason -- This abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky	c92789d20a	Extract explicit pre-search filter function into a separate module Details -- - Move explicit_filters function into separate module under search_filter - Update signature of explicit filter to take and return query, entries, embeddings - Use this explicit_filter func from search_filters module in query Reason -- Abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:20:04 +04:00
Debanjum	589bfa9424	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query ## Issue - Explicit filtering was being done after search by the bi-encoder but before re-ranking by the cross-encoder - This limited the quality of results being returned for queries with explicit filters. The bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder, by re-ranking the remaining results to best match the query ## Fix - Given that the entry and its embedding are at the same index in their respective lists. We know which entries map to which embedding tensors. So we can run the filter for blocked, required words before the bi-encoder search. And limit entries, embeddings being considered for the current query ## Result - Semantic search by the bi-encoder returns the most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving the results ## Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query, before calling search - E.g limit search to entries within date/time specified in query	2022-07-12 13:12:22 -07:00
Debanjum Singh Solanky	741fca0e6b	Fix asymmetric search test to pass entries returned by query to collate_results	2022-07-12 18:48:49 +04:00
Debanjum Singh Solanky	6d7ab50113	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query - Issue - Explicit filtering was earlier being done after search by bi-encoder but before re-ranking by cross-encoder - This was limiting the quality of results being returned. As the bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder by re-ranking the remaining results based on query - Fix - Given the embeddings corresponding to an entry are at the same index in their respective lists. We can run the filter for blocked, required words before the search by the bi-encoder model. And limit entries, embeddings being considered for the current query - Result - Semantic search by the bi-encoder gets to return most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving results - Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query - E.g limit search for entries within date/time specified in query	2022-07-12 18:25:42 +04:00
sabaimran	36ef37e940	Fix formatting for pytest command Use org formatting rather than md.	2022-07-08 10:18:26 -04:00
sabaimran	d6945f4f6b	Merge pull request #29 from debanjum/saba/fix-docker-build Address Issues with Docker builds	2022-07-06 21:32:37 -04:00
Saba	2eb44c7a64	Correct syntax of memory limit in docker-compose.yml	2022-07-06 20:07:11 -04:00
Debanjum Singh Solanky	8bb9a49994	Cleanup Test Asymmetric Search to Fix Test - test_regenerate_with_valid_content failed when run after test_asymmetric_search - test_asymmetric_search didn't clean the temporary update to config it had made - This was resulting in regenerate looking for a file that didn't exist	2022-07-07 01:25:31 +04:00
Saba	7bb35ccc7e	Run build on PR	2022-07-04 18:09:47 -04:00
Saba	07a56c4ab6	Add specific version for Python packages and downgrade miniconda Docker image to potentially fix build issues	2022-07-04 18:01:55 -04:00
Saba	0f88abd219	Allocate 8GB of memory to docker container. Adjust path to Dockerfile in Github action	2022-07-04 14:01:59 -04:00
Saba	092d0f2f21	Move Dockerfile to project root to avoid permissions issues. Allocate more memory to docker-compose to avoid OOM	2022-07-04 12:33:55 -04:00
Debanjum Singh Solanky	7677465f23	Fix passing of device to setup method in /reload, /regenerate API - Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API - Set default argument to torch.device('cpu') instead of 'cpu' to be more formal	2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	2f7ef08b11	Add Unit Tests to verify the Reload API functions as desired	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	b89fc2f4ac	Add /reload API to reload model embeddings and entries from file - The reload API adds the ability to separate out the loading of embeddings from file without having to restart app or (re-)generate embeddings - Before this the only way to load model from file was by restarting app - The other way to reload the model embeddings by regenerating them was too expensive for larger datasets - This unlocks at least 1 use-case, where - we regenerate model via an app instance running on a separate server and - just reload the generated embeddings on the client device - This allows us to offload the expensive embedding generation compute to a background server while letting - This avoids having to (re-)restart application on client device or be forced to generate embeddings on the client device itself - But it requires the model relevant files to be synced to the client device This can be done with any file syncing application like Syncthing - We can then call /regenerate on server and /reload client on a regular schedule to keep our data up to date on semantic search	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	f5d6d1e752	Tiny style fix to separate functions by 2 newlines	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	d64bed26f3	Make Docker ignore unnecessary files	2022-06-29 22:29:34 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be commitable - General enough that it'll work even when a users notes are not in the home directory - While solving for the special case where: - Notes are being processed on a different machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	094eaf3fcc	Fix minor bugs in OrgNode parser - Bugs discovered from writing org-node tests	2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky	f66192f2a7	Test OrgNode Parsing and Rendering	2022-06-17 19:13:11 +03:00
Debanjum Singh Solanky	36495038dd	Fix storing parsed CLOSED date in OrgNode The CLOSED date was getting parsed but not stored Adding setClosed at start also fixed the issue	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	1c5754bf95	Simplify storing Tags in OrgNode object - Use Set for Tags instead of dictionary with empty keys - No Need to store First Tag separately - Remove properties methods associated with storing first tag separately - Simplify extraction of tags string in org_to_jsonl - Split notes_string creation into multiple f-string in separate line for code readability	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	51a43245d3	Escape square brackets in file+heading based org-mode links	2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky	04610f453a	Include scheduled date, deadline date and close date in repr of org node - Now that the times line is excluded from the raw body of node, show it in repr so user can see it for reference - But the model doesn't need to see it for its embeddings to be confused by	2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky	367d7377df	Ignore scheduled, closed, deadline time and logbook start, end in org node body - Gives cleaner embeddings for semantic search - Hopefully improves results and reduces size, compute	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	b77ccadcba	Make property key regex more strict. Property key has to be alphanumeric	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	ac9d746444	Fix Tags extraction in Org Node parser - Previous version required two tags at least to work, not sure why - Fixed it to extract all tags, even if only one tag in heading	2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky	fb86be8cd9	Add ID, File+Heading based Links to Org-Mode Entries - Add links to property drawer - This ensures results returned by semantic search contain these links - This allows the user to jump to entry within original file for context - The ID, file+heading based links are more robust to find relevant entry in original file than the line no based link, as edits being done by user to original files between embedding regenerations	2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky	de23fc2051	Revert Add Scheduled, Deadline date to Model Embeddings for Date Aware Search Sentence Transformer MSMarco Model isn't date aware So no use of adding scheduled, deadline dates to model embeddings for consideration This reverts commit `a2a08d1354`.	2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky	a2a08d1354	Add Scheduled, Deadline date to Model Embeddings for Date Aware Search	2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky	cfbd5c4ecc	Update global model on regenerate via API	2022-06-17 00:49:06 +03:00
Debanjum	35117af322	Show Demo of Semantic Search in Readme Merge pull request #27 from debanjum/debanjum/add-demo	2022-05-14 01:32:18 -07:00
Debanjum Singh Solanky	2eab256af9	Delete markdown file. It helped upload the demo video to Github	2022-05-14 04:30:20 -04:00
Debanjum Singh Solanky	96c588b7bc	Add demo of semantic search to repository	2022-05-14 04:29:25 -04:00
Debanjum	19f8f85333	Show Demo of Semantic Search in Readme - Use Markdown file to help upload demo to Github - Use generated link from upload into Readme org file	2022-05-14 01:29:13 -07:00
Debanjum Singh Solanky	031d6bddb4	Delete markdown file. It helped upload the demo video to Github	2022-05-14 04:25:17 -04:00
Debanjum Singh Solanky	c78bf84eef	Introduce search api endpoint that auto infers search type intent - Introduce prompt for GPT to automatically extract user's search intent - Expose new search api endpoint to use that to set SearchType being passed to search API - Currently meant as an experimental API to gauge usefulness, extendability. Evaluating for phone or voice use-case	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	8ef7917014	Fix json format passed in prompt to GPT	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	f57b7f65ea	Wrap prompts for GPT in triple quotes to improve prompt readability To prompt improve readability: - Remove newline escape sequence and use actual newline directly - This avoids one long line of text as prompt and - Remove escaping of double quotes	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1eba7b1c6f	Use empty_escape_sequence constant to strip response text from gpt	2022-02-27 23:17:49 -05:00

472 commits
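
Two ideas from the messages above can be sketched together: the pre-search filtering from 589bfa9424 (an entry and its embedding share an index, so one list of kept indices filters both) and the half-open [included_date, excluded_date) check from dcb6fe479e. This is a minimal illustration, not the repository's code. The function names, the (date, text) entry shape and the plain-list embeddings are all assumptions made for the example.

```python
from datetime import date
from typing import List, Sequence, Tuple

Entry = Tuple[date, str]  # hypothetical entry shape: (date found in entry, entry text)


def in_date_range(entry_date: date, included_start: date, excluded_end: date) -> bool:
    # Half-open range [included_start, excluded_end): the end date itself is excluded.
    return included_start <= entry_date < excluded_end


def prefilter_by_date(
    entries: Sequence[Entry],
    embeddings: Sequence[Sequence[float]],
    included_start: date,
    excluded_end: date,
) -> Tuple[List[Entry], List[Sequence[float]]]:
    # Entry i and embedding i describe the same note, so a single list of
    # kept indices filters both sequences and keeps them aligned.
    kept = [
        i
        for i, (entry_date, _) in enumerate(entries)
        if in_date_range(entry_date, included_start, excluded_end)
    ]
    return [entries[i] for i in kept], [embeddings[i] for i in kept]


entries = [
    (date(2022, 6, 30), "note from June 30"),
    (date(2022, 7, 1), "note from July 1"),
    (date(2022, 8, 1), "note from August 1"),
]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

# Restrict the search space to July 2022 before any semantic search runs.
july_entries, july_embeddings = prefilter_by_date(
    entries, embeddings, date(2022, 7, 1), date(2022, 8, 1)
)
```

Only the filtered entries and embeddings would then be handed to the bi-encoder search, so results outside the requested range never take up a slot in the ranking.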