sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2025-02-18 22:44:19 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	231cc91e14	Force reload of images every time user clicks search button Adding a random, unused url param at the end of the img.src string fixes the issue. As the browser thinks it's a new image and doesn't use the image data that's already cached because of which it wasn't even making the fetch call for the image	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	a6aef62a99	Create Basic Landing Page to Query Semantic Search and Render Results - Allow viewing image results returned by Semantic Search. Until now there wasn't any interface within the app to view image search results. For text results, we at least had the emacs interface - This should help with debugging issues with image search too For text the Swagger interface was good enough	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	4e27ae0577	Ease access to image result for given query by image_search - Copy images to accessible directory - Return URL paths to them to ease access - This is to be used in the web interface to render image results directly in browser - Return image, metadata scores for each image in response as well This should help get a better sense of image scores along both XMP metadata and whole image axis	2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky	801e59a20d	Allow explicit filters when querying Ledger transactions	2022-07-15 23:41:54 +04:00
Debanjum Singh Solanky	0e979587e0	Add configurable filter support to Symmetric Ledger Search	2022-07-14 23:40:41 +04:00
Debanjum Singh Solanky	85077bc1d1	Handle unparseable date range passed via date filter in query - Do not reuse the same list - Just create new list, so only parsed data is in it	2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky	a60de2c02b	Include date filter in asymmetric search on music as well	2022-07-14 22:37:17 +04:00
Debanjum Singh Solanky	c3b3e8959d	Put entry splitting regex in explicit filter into a variable for code readability	2022-07-14 22:00:10 +04:00
Debanjum Singh Solanky	3aac3c7d52	Run explicit filter on raw entry, add more terms to split entries by - With \t Last Word in Headings was suffixed by \t and so couldn't be filtered by - User interacts with raw entries, so run explicit filters on raw entry - For semantic search using the filtered entry is cleaner, still	2022-07-14 21:54:04 +04:00
Debanjum Singh Solanky	7640e2ab0c	Wrap attempt to extract dates from entry in try/catch - Not all YYYY-MM-DD strings in entry are necessarily dates	2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky	9de2097182	Fix date filter usage with multi word queries. Simplify date regex	2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	b54588717f	Filter for entries with dates specified by user in query - Create Date filter - Users can pass dates in YYYY-MM-DD format in their query - Use it to filter asymmetric search to user specified dates	2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky	b82aef26bf	Make filters to apply before semantic search configurable Details -- - The filters to apply are configured for each type in the search controller - Multiple filters can be applied on the query, entries etc before search - The asymmetric query method now just applies the passed filters to the query, entries and embeddings before semantic search is performed Reason -- This abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky	c92789d20a	Extract explicit pre-search filter function into a separate module Details -- - Move explicit_filters function into separate module under search_filter - Update signature of explicit filter to take and return query, entries, embeddings - Use this explicit_filter func from search_filters module in query Reason -- Abstraction will simplify adding other pre-search filters. E.g datetime filter	2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky	6d7ab50113	Run Explicit Filter on Entries, Embeddings before Semantic Search for Query - Issue - Explicit filtering was earlier being done after search by bi-encoder but before re-ranking by cross-encoder - This was limiting the quality of results being returned. As the bi-encoder returned results which were going to be excluded. So the burden of improving those limited results post filtering was on the cross-encoder by re-ranking the remaining results based on query - Fix - Given the embeddings corresponding to an entry are at the same index in their respective lists. We can run the filter for blocked, required words before the search by the bi-encoder model. And limit entries, embeddings being considered for the current query - Result - Semantic search by the bi-encoder gets to return most relevant results for the query, knowing that the results aren't going to be filtered out after. So the cross-encoder shoulders less of the burden of improving results - Corollary - This pre-filtering technique allows us to apply other explicit filters on entries relevant for the current query - E.g limit search for entries within date/time specified in query	2022-07-12 18:25:42 +04:00
Debanjum Singh Solanky	7677465f23	Fix passing of device to setup method in /reload, /regenerate API - Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API - Set default argument to torch.device('cpu') instead of 'cpu' to be more formal	2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	b89fc2f4ac	Add /reload API to reload model embeddings and entries from file - The reload API adds the ability to separate out the loading of embeddings from file without having to restart app or (re-)generate embeddings - Before this the only way to load model from file was by restarting app - The other way to reload the model embeddings by regenerating them was too expensive for larger datasets - This unlocks at least 1 use-case, where - we regenerate model via an app instance running on a separate server and - just reload the generated embeddings on the client device - This allows us to offload the expensive embedding generation compute to a background server while letting - This avoids having to (re-)restart application on client device or be forced to generate embeddings on the client device itself - But it requires the model relevant files to be synced to the client device This can be done with any file syncing application like Syncthing - We can then call /regenerate on server and /reload client on a regular schedule to keep our data up to date on semantic search	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	f5d6d1e752	Tiny style fix to separate functions by 2 newlines	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be committable - General enough that it'll work even when a user's notes are not in the home directory - While solving for the special case where: - Notes are being processed on a different machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	094eaf3fcc	Fix minor bugs in OrgNode parser - Bugs discovered from writing org-node tests	2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky	36495038dd	Fix storing parsed CLOSED date in OrgNode The CLOSED date was getting parsed but not stored Adding setClosed at start also fixed the issue	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	1c5754bf95	Simplify storing Tags in OrgNode object - Use Set for Tags instead of dictionary with empty keys - No Need to store First Tag separately - Remove properties methods associated with storing first tag separately - Simplify extraction of tags string in org_to_jsonl - Split notes_string creation into multiple f-string in separate line for code readability	2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky	51a43245d3	Escape square brackets in file+heading based org-mode links	2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky	04610f453a	Include scheduled date, deadline date and close date in repr of org node - Now that excluding the times line from the raw body of node, show it in repr so user can see it for reference - But the model doesn't need to see it for its embeddings to be confused by	2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky	367d7377df	Ignore scheduled, closed, deadline time and logbook start, end in org node body - Gives cleaner embeddings for semantic search - Hopefully improves results and reduces size, compute	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	b77ccadcba	Make property key regex more strict. Property key has to be alphanumeric	2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky	ac9d746444	Fix Tags extraction in Org Node parser - Previous version required two tags at least to work, not sure why - Fixed it to extract all tags, even if only one tag in heading	2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky	fb86be8cd9	Add ID, File+Heading based Links to Org-Mode Entries - Add links to property drawer - This ensures results returned by semantic search contain these links - This allows the user to jump to entry within original file for context - The ID, file+heading based links are more robust to find relevant entry in original file than the line no based link, as edits being done by user to original files between embedding regenerations	2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky	de23fc2051	Revert Add Scheduled, Deadline date to Model Embeddings for Date Aware Search Sentence Transformer MSMarco Model isn't date aware So no use of adding scheduled, deadline dates to model embeddings for consideration This reverts commit `a2a08d1354`.	2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky	a2a08d1354	Add Scheduled, Deadline date to Model Embeddings for Date Aware Search	2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky	cfbd5c4ecc	Update global model on regenerate via API	2022-06-17 00:49:06 +03:00
Debanjum Singh Solanky	c78bf84eef	Introduce search api endpoint that auto infers search type intent - Introduce prompt for GPT to automatically extract user's search intent - Expose new search api endpoint to use that to set SearchType being passed to search API - Currently meant as an experimental API to gauge usefulness, extendability. Evaluating for phone or voice use-case	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	8ef7917014	Fix json format passed in prompt to GPT	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	f57b7f65ea	Wrap prompts for GPT in triple quotes to improve prompt readability To improve prompt readability: - Remove newline escape sequence and use actual newline directly - This avoids one long line of text as prompt and - Remove escaping of double quotes	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1eba7b1c6f	Use empty_escape_sequence constant to strip response text from gpt	2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky	1c3a1420f8	Update asymmetric extract_entries method to handle uncompressed jsonl This is similar to what was done for the symmetric extract_entries method earlier	2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky	3d8a07f252	Extract empty line escape sequences var into constants file for reuse	2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky	bb5d0d8908	Improve Semantic Search Buffer Names in Emacs - Allow multiple semantic searches buffers to exist simultaneously - Uniquify semantic search buffer names - Add query and search-type to semantic search buffer name for easier disambiguation, search and find appropriate	2022-02-26 18:30:14 -05:00
Debanjum Singh Solanky	b68558651b	Improve Extraction of Beancount Entries - Only extract entries starting with YYYY-MM-DD from Beancount - Strip Trailing Escape Sequences from Entries	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	b3ac2dd730	Improve Results Rendered on Emacs from Semantic Search on Ledger - Add search query to top of buffer as Beancount comment - Remove trailing ) from response - Separate entries by empty line - Load beancount-mode in semantic search on ledger buffer	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	502c68d4f8	Remove trailing escape sequence in ledger search response entries - Fix loading entries from jsonl in extract_entries method - Only extract Title from jsonl of each entry This is the only thing written to the jsonl for symmetric ledger - This fixes the trailing escape seq in loaded entries - Remove the need for semantic-search.el response reader to do pointless complicated cleanup - Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl Both methods were doing similar work - Make load_jsonl handle loading entries from both gzip and uncompressed jsonl	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	248aa632c0	Do not throw warning for beancount files with .beancount extension	2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky	76cd63f4bd	Fix count of processed jsonl entries shown to user by ledger processor Count lines not chars	2022-02-26 17:46:06 -05:00
Saba	33bc62dc19	Fix type of use_xmp_metadata to be bool, rather than str	2022-01-24 21:53:26 -05:00

239 commits
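
Note on dcb6fe479e: the commit message describes the extracted date range as half-open, [included_date, excluded_date), with the fix changing the upper-bound comparison from <= to <. A minimal Python sketch of that check follows. The function name date_in_range and its parameter names are hypothetical, chosen for illustration only; they are not taken from the khoj source.

```python
from datetime import date


def date_in_range(entry_date: date, included_date: date, excluded_date: date) -> bool:
    """Return True if entry_date falls in the half-open range
    [included_date, excluded_date).

    Hypothetical helper for illustration; not the project's actual function.
    """
    # Lower bound is inclusive, upper bound is exclusive.
    # Using <= on the upper bound would wrongly admit excluded_date itself.
    return included_date <= entry_date < excluded_date


# Example: a query for "July 2022" mapped to [2022-07-01, 2022-08-01)
start, end = date(2022, 7, 1), date(2022, 8, 1)
in_july = date_in_range(date(2022, 7, 14), start, end)
on_boundary = date_in_range(date(2022, 8, 1), start, end)
```

With a half-open range, adjacent ranges such as consecutive months never both match the same date, which is presumably why the exclusive upper bound matters here.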