sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-23 12:48:09 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	f040b3f65c	Stylize TODO/DONE states with CSS	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	581b6097c7	Clean Results. Remove TOC, Heading Number and Property Drawers	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	965a93a2f2	Add Basic HTML Rendering of Org-Mode Results	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	1da44d4dfe	Add Incremental Search to Khoj Web Interface	2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky	af1dd31401	Do not pass verbose argument to image_search.query() as not supported	2022-07-28 19:52:58 +04:00
Debanjum Singh Solanky	80ac10835c	Rerank results on normal minibuffer exit In current state: - Rerank results: - If user idles while entering query OR - exits normally - Do not rerank results: - If user exits abnormally, e.g via C-g from query	2022-07-28 03:37:16 +04:00
Debanjum Singh Solanky	1b759597df	Make incremental search more robust. Follow standard user expectations - Rename functions to more standard, descriptive names - Keep known, required code for incremental search - E.g Do not set buffer local flag in hooks on minibuffer setup - Only query when user in khoj minibuffer - Use active-minibuffer-window and track khoj minibuffer - (minibuffer-prompt) is not useful for our use-case here - (For now) Run re-rank only if user idle while querying - Do not run rerank on teardown/completion - The reranking lag (~2s) is annoying; hit enter, wait to see results - Also triggered when user exits abnormally, so C-g also results in rerank which is even more annoying - Emacs will still hang if re-ranking gets triggered on idle but that's better than always getting triggered. And better than not having mechanism to get results re-ranked via cross-encoder at all	2022-07-28 02:52:27 +04:00
Debanjum Singh Solanky	9a6eee31be	Make number of results to get from Khoj API customizable in khoj.el	2022-07-27 18:55:18 +04:00
Debanjum Singh Solanky	9302b45fe0	Use khoj-incremental as the main khoj func. Rename khoj to khoj-simple - Update khoj-simple to work with cross-encoder re-ranked results like before - Increment major version as incremental search considered a breaking change and a major update to search capability	2022-07-27 18:18:17 +04:00
Debanjum Singh Solanky	09727ac3be	Make bi-encoder return fewer results to reduce cross-encoder latency	2022-07-27 07:26:02 +04:00
Debanjum Singh Solanky	9ab3edf6d6	Re-rank incremental search results using cross-encoder if user idle This provides a relatively smooth mechanism - to improve relevance of results on idle - while providing the rapid, incremental results while typing	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	ad242cafa7	Support querying all text search types in incremental search - Before incremental search was hard-coded to only query org	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	bfcb962cbe	Use post-command-hook to only query on user input - Hooking into after-change-functions results in system logs triggering query	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	0d49398954	Reuse code to query api, render results. Formalize method, arg names	2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky	fd1963d781	Implement Basic Incremental Search Interface in Emacs for Org Mode Notes	2022-07-27 03:05:00 +04:00
Debanjum Singh Solanky	3fa7d8f03a	Skeleton to allow incremental search on Khoj via Emacs	2022-07-27 02:48:27 +04:00
Debanjum Singh Solanky	1168244c92	Make cross-encoder re-rank results if query param set on /search API - Improve search speed by ~10x. Tested on corpus of 125K lines, 12.5K entries - Allow cross-encoder to re-rank results by setting &?r=true when querying /search API - It's an optional param that defaults to False - Earlier all results were re-ranked by cross-encoder - Making this configurable allows for much faster results, if desired, but at lower accuracy	2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky	b1e64fd4a8	Improve search speed. Only apply filter if filter keywords in query - Formalize filters into class with can_filter() and filter() methods - Use can_filter() method to decide whether to apply filter and create deep copies of entries and embeddings for it - Improve search speed for queries with no filters as deep copying entries, embeddings takes the most time after cross-encoder scoring when calling the /search API. Earlier we would create deep copies of entries, embeddings even if the query did not contain any filter keywords	2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky	f094c86204	Trace query response performance and display timings in verbose mode	2022-07-26 21:03:53 +04:00
Debanjum	d8efcd559f	Add Feature Section to Readme - Make Architecture a top-level section - Minor improvement to Configure section	2022-07-25 15:43:27 -07:00
Debanjum Singh Solanky	f953b20415	Add Khoj Architecture Diagram in Docs. Show it in the Project Readme	2022-07-26 02:09:51 +04:00
Debanjum Singh Solanky	674d933282	Improve Khoj Intro text. Move Run Unit Test Section under Development Heading	2022-07-26 02:06:44 +04:00
Debanjum Singh Solanky	3728583e08	Update Readme. Add section for using Khoj via Web interface	2022-07-22 04:02:03 +04:00
Debanjum Singh Solanky	65fea7681a	Rename notes search type to org search, now that markdown notes supported	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	1f4b5ac112	Create test markdown files. Use them in sample config, docker-compose	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	4c24202e42	Update documentation. Simplify, reflect current capabilities	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	d4d7dbaca6	Support Natural Search on Markdown Files - Reason: Allow natural search on markdown based notes, documentation, websites etc - Details: - Create markdown processor to extract Markdown entries (identified by Heading) into standard jsonl format required by text_search - Update API, Configs to support interfacing with new markdown type - Update Emacs, Web clients to support interfacing with new markdown type via API - Update Readme to mention markdown is also supported Closes #35	2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky	0602d018c0	Merge Symmetric, Asymmetric Search Types into a single Text Search Type - The code for both the text search types was mostly the same. It was earlier done this way for expedience while experimenting - The minor differences were reconciled and merged into a single text_search type - This simplifies the app and makes it easier to process other text types	2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky	0917f1574d	Consolidate jsonl helper methods in a single file under utils module	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	de726c4b6c	Minor fixes to unused installer utility script	2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky	5aad297286	Reuse logic to extract entries across symmetric, asymmetric search. Now that the logic to compile entries is in the processor layer, the extract_entries method is standard across (text) search_types. Extract the load_jsonl method as a utility helper method. Use it in (a)symmetric search types	2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky	e220ecc00b	Generate compiled form of each transaction directly in the beancount processor - The logic for compiling a beancount entry (for later encoding) now completely resides in the beancount-to-jsonl processor layer - This allows symmetric search to be generic and not be aware of beancount specific properties that were extracted by the beancount-to-jsonl processor layer - Now symmetric search just expects the jsonl to (at least) have the 'compiled' and 'raw' keys for each entry. What original text the entry was compiled from is irrelevant to it. The original text could be location, transaction, chat etc, it doesn't have to care	2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky	06cf425314	Generate compiled form of each entry directly in the org-mode processor - The logic for compiling an org-mode entry (for later encoding) now completely resides in the org-to-jsonl processor layer - This allows asymmetric search to be generic and not be aware of org-mode specific properties that were extracted by the org-to-jsonl processor layer - Now asymmetric search just expects the jsonl to (at least) have the 'compiled' and 'raw' keys for each entry. What original text the entry was compiled from is irrelevant to it. The original text could be mail, chat, markdown, org-mode etc, it doesn't have to care	2022-07-21 02:08:02 +04:00
Debanjum Singh Solanky	4ead79d272	Make Notes Search Natural Language Date Aware - Pass Scheduled, Closed Dates of Entries to Include in Embeddings - The (new?) model seems to understand dates. So can give more relevant entries if date in natural language mentioned in query - E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984" will give different results, with the second prioritizing entries mentioning any entries with closed, scheduled dates from 1984	2022-07-21 01:06:49 +04:00
Debanjum Singh Solanky	d50bfb5188	Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests	2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky	70e70d4b15	Rename 'embed' key to more generic 'compiled' for jsonl extracted results - While it's true those strings are going to be used to generate embeddings, the more generic term allows them to be used elsewhere as well - Their main property is that they are processed, compiled for usage by semantic search - Unlike the 'raw' string which contains the external representation of the data, as is	2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky	c1369233db	Consistently use "entry", "score" in json response for all search types - Had already made some progress on this earlier by updating the image search responses. But needed to update the text search responses to use lowercase entry and score - Update khoj.el to consume the updated json response keys for text search	2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky	d68a9dc445	Sort extracted images before computing their embeddings - Image order returned by glob is OS dependent - This prevented sharing image embeddings across machines running different OS - A stable sort order for processed images allows sharing embeddings across machines. - Use case: A more powerful, always on machine actually computes the image embeddings regularly. The client machine just loads these periodically to provide semantic search functionality	2022-07-20 03:51:27 +04:00
Debanjum Singh Solanky	c4c7f38b15	Fix extracting image names from multiple image directories	2022-07-20 03:40:49 +04:00
Debanjum Singh Solanky	c9ff97451b	Fix tests to handle updated response types by API	2022-07-20 03:01:56 +04:00
Debanjum Singh Solanky	bdc1b9f2bb	Resolve edge case errors in encoding image metadata - Handle case where current image batch smaller than batch_size - Handle case where no XMP metadata for current image - return empty strings in such a scenario instead of ". "	2022-07-20 02:58:43 +04:00
Debanjum Singh Solanky	2a5445216c	Image input directory not required by collate result as image_name already absolute path	2022-07-20 02:56:23 +04:00
Debanjum Singh Solanky	6c9ffdba57	Allow indexing multiple image directories for image search	2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky	68ee88cebc	Fix image search tests after update to API response for image search types - Look for 'entry' key in response json instead of 'Entry' - Expect image where id = alphanumeric order of image name	2022-07-20 01:37:01 +04:00
Debanjum Singh Solanky	70221bb038	Allow filtering transactions by date in symmetric ledger	2022-07-19 20:58:24 +04:00
Debanjum Singh Solanky	b673d26a12	Extract Entries in a standardized format across text search types Issue: - Had different schema of extracted entries for symmetric_ledger vs asymmetric - Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings - This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters - This limited the filters to only work for asymmetric search, not for symmetric_ledger - Fix - Use consistent format for extracted entries { 'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings, 'raw' : raw_entry_string_meant_to_be_passed_to_use } - Result - Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type	2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky	e66cd5bf59	Only extract transactions from Beancount - Earlier was extracting all entries starting with dates but the other type of entries like account open/close, asserts etc aren't useful for querying	2022-07-19 19:50:58 +04:00
Debanjum Singh Solanky	732b2d287f	Give the project a short, less generic name. Rename it to Khoj - Semantic Search was just a placeholder used to test the idea out Didn't want to get into naming at that point of time	2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky	989526ae54	Use a more accurate model for symmetric semantic search - The all-MiniLM-L6-v2 is more accurate - The exact previous model isn't benchmarked but based on the performance of the closest model to it, the new model may be similar in speed and size - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	4a90972e38	Use a better model for asymmetric semantic search - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1] - It has the right mix of model query speed, size and performance on benchmarks - On hugging face it has way more downloads and likes than the msmarco model[2] - On very preliminary evaluation of the model - It doubles the encoding speed of all entries (down from ~8min to 4mins) - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier) [1]: https://www.sbert.net/docs/pretrained_models.html [2]: https://huggingface.co/sentence-transformers	2022-07-18 20:27:26 +04:00

4147 commits