sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-25 08:25:07 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	237123d18c	Fix tests for the conversation processor - Use latest davinci model for tests - Wrap prompt in triple quotes to improve legibility - `understand' method returns dictionary instead of string. Fix its test - Fix prompt for new model to pass `chat_with_history' test	2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky	918af5e6f8	Make OpenAI conversation model configurable via khoj.yml - Default to using `text-davinci-003' if conversation model not explicitly configured by user. Stop using the older `davinci' and `davinci-instruct' models - Use `model' instead of `engine' as parameter. Usage of `engine' parameter in OpenAI API is deprecated	2023-01-09 00:17:51 -03:00
Debanjum Singh Solanky	7e05389776	Quote all values passed to input-filter fields in sample yaml files	2023-01-08 22:40:18 -03:00
Debanjum Singh Solanky	0440f3fd57	Add encoder-type field to the search-type sections in khoj_sample.yml	2023-01-08 22:07:13 -03:00
Debanjum Singh Solanky	8b8e202ab3	Set input-filter to list in khoj_docker.yml and khoj_sample.yml `input-filter' was converted to a list a while back but the sample khoj configs were not updated to reflect this. This change fixes that	2023-01-08 21:08:00 -03:00
Debanjum Singh Solanky	74e779f8d0	Fix /beta/chat API to use Entry class instead of old dictionary pattern Search returns response of type SearchResponse instead of a dict now	2023-01-08 15:28:26 -03:00
Debanjum Singh Solanky	f2436039a0	Improve readability of GPT prompt strings in conversation processor	2023-01-08 15:27:41 -03:00
Debanjum	1c091e509b	Make Encoder Type Configurable. Allow using OpenAI Model for Search - `2fe37a0` Make type of encoder to use for embeddings configurable via `khoj.yml' - Previously `encoder_type' was set in the setup code of search_type - All encoders were of type `SentenceTransformer' - All cross_encoders were of type `CrossEncoder' - Now the `encoder_type' can be configured via the new `encoder_type' field in `TextSearchConfig' under `search_type` in `khoj.yml' - All the specified `encoder-type' class needs is an `encode' method that takes entries and returns embedding vectors - `826f9dc` Drop long words from compiled entries to be within max token limit of models Long words (>500 characters) provide less useful context to models. Dropping very long words allows models to create better embeddings by passing more of the useful context from the entry to the model - `c0ae8ee` Allow using OpenAI models for search in Khoj To use OpenAI models for search in Khoj, in `~/.khoj/khoj.yml' 1. Set `encoder' to name of an OpenAI model. E.g text-embedding-ada-002 2. Set `encoder-type' to src.utils.models.OpenAI 3. Set `model-directory` to null, as this is an online model and cannot be stored on the file system	2023-01-08 11:10:25 -03:00
Debanjum Singh Solanky	6119005838	Improve comments, exceptions, typing and init of OpenAI model code	2023-01-08 00:36:18 -03:00
Debanjum Singh Solanky	c0ae8eee99	Allow using OpenAI models for search in Khoj - Init processor before search to instantiate `openai_api_key' from `khoj.yml'. The key is used to configure search with openai models - To use OpenAI models for search in Khoj - Set `encoder' to name of an OpenAI model. E.g text-embedding-ada-002 - Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI' - Set `model-directory' to `null', as online model cannot be stored on disk	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	826f9dc054	Drop long words from compiled entries to be within max token limit of models Long words (>500 characters) provide less useful context to models. Dropping very long words allows models to create better embeddings by passing more of the useful context from the entry to the model	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	6a30a13326	Only create model directory if the optional field is set in SearchConfig	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	2fe37a090f	Make type of encoder to use for embeddings configurable via khoj.yml - Previously `model_type' was set in the setup of each `search_type' - All encoders were of type `SentenceTransformer' - All cross_encoders were of type `CrossEncoder' - Now `encoder-type' can be configured via the new `encoder_type' field in `TextSearchConfig' under `search-type` in `khoj.yml`. - All the specified `encoder-type' class needs is an `encode' method that takes entries and returns embedding vectors	2023-01-07 23:09:12 -03:00
Debanjum Singh Solanky	fa92adcf0d	Add Visualization of Codebase to Readme under Development Section Source from Github vNext Repo Visualizer at https://githubnext.com/projects/repo-visualization/	2023-01-05 20:11:56 -03:00
Debanjum Singh Solanky	8c7ffd7aee	Add Readme doc to fix failure to build tokenizer dependency	2023-01-05 20:11:56 -03:00
Debanjum Singh Solanky	d55d7d53dc	Fix GPU usage by Khoj on Macs to speed up search and indexing - Ensure all tensors are on MPS device before doing operations across them - Background - GPU is used by default for Khoj on MacOS now - Needed PyTorch > 1.13.0 on Macs to use GPU, which we do now - MPS should speed up search and indexing on MacOS	2023-01-05 15:39:09 -03:00
Debanjum Singh Solanky	7380518f24	Upgrade PyTorch, Pillow version to resolve Dependabot Security Advisories This also enables GPU usage by Khoj on MacOS as MPS support is now in PyTorch mainline	2023-01-05 15:39:09 -03:00
Debanjum	abd035e2fa	Merge PR #112 to fix quote usage in khoj.el docstring from suliveevil/master Fix usage warning for unescaped single quote in `khoj.el' docstring. Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs	2023-01-05 13:24:11 -03:00
Debanjum Singh Solanky	1dc1472c55	In publish workflow, make twine upload verbose to troubleshoot	2023-01-05 12:56:46 -03:00
Debanjum Singh Solanky	e792523849	Bump version in metadata packages for khoj, khoj.el and obsidian plugin	2023-01-05 12:50:27 -03:00
suliveevil	b2812b409f	fix docstring usage warning ⛔ Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting) ⛔ Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting) ⛔ Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting) ⛔ Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)	2023-01-05 16:47:38 +08:00
Debanjum Singh Solanky	3d1199540c	Update the publish workflow to also run on any tag push	2023-01-04 20:47:23 -03:00
Debanjum Singh Solanky	4842daca5f	Run releases workflow on pushing any tag. 'v' prefix not required Obsidian for some reason cannot pick up plugin assets from releases made with prefixed tags	2023-01-04 20:27:56 -03:00
Debanjum Singh Solanky	47015ee6cc	Fold Demo video descriptions, analysis by default in main Readme	2023-01-04 20:13:43 -03:00
Debanjum Singh Solanky	da17ff6ac8	Add Upgrade instructions for Khoj.el Readme. Fix version of khoj.el	2023-01-04 20:06:39 -03:00
Debanjum	65917eb5c9	Create Obsidian plugin for Khoj ### Plugin Features - Search Obsidian notes using Khoj Provide Natural language search on your (markdown) notes in Obsidian Vault - Show search results as rendered Markdown Improve legibility of the results - Jump to selected note from search result in Khoj search modal Simplify seeing result within its original note context - Automatically configure khoj to index markdown files in current vault Reduce khoj setup steps for plugin users by using reasonable defaults - Code updates the markdown config in `khoj.yml` and triggers index update - It can be configured by user in khoj plugin settings, if required - Add Demo and detailed Readme for the Obsidian plugin Ease setup and usage. Give context about capabilities ### Miscellaneous - (Try) Keep a mono repo until the Khoj project is mature enough to reduce maintenance burden ### Commits Details - `0e39e0f` Add details about the Khoj Obsidian plugin to the main Readme - `cd8b918` Add `manifest.json`, `versions.json` of Obsidian plugin to project root - `66ccd0c` Create Obsidian plugin for Khoj	2023-01-04 20:02:42 -03:00
Debanjum Singh Solanky	3dd69f7505	Add Upgrade instructions for Obsidian, Emacs to main Readme	2023-01-04 19:50:26 -03:00
Debanjum Singh Solanky	0e39e0ff71	Add details about the Khoj Obsidian plugin to the main Readme - Add Khoj in Obsidian Demo - Update Interfaces Screenshot to include Obsidian Plugin Screenshot - Update .gitignore to ignore obsidian plugin ignorelist Section the .gitignore for better readability - Update the Setup, Usage instructions to include information about the Obsidian plugin	2023-01-04 18:42:53 -03:00
Debanjum Singh Solanky	cd8b918a55	Add manifest.json, versions.json of Obsidian plugin to project root - Obsidian provides limited support for plugins in larger repositories. Currently, it does not have a way to specify the directory of a plugin, so it expects the plugin's `manifest.json' and `versions.json' to be at project root - While this unnecessarily litters the codebase, it is the (current) required tradeoff for keeping the core plugins in a mono repo	2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky	66ccd0c970	Create Obsidian plugin for Khoj - Features - Search using Khoj from within the Obsidian app Allow Natural language search on your (markdown) notes in Obsidian Vault - Show search results as rendered (instead of raw) Markdown Improve legibility of the results - Jump to selected note from search result in Khoj search modal Simplify seeing result within its original note context - Automatically configure khoj to index markdown files in current vault Reduce khoj setup steps for plugin users by using reasonable defaults - Code updates the markdown config in khoj.yml and triggers index update - It can be configured by user in khoj plugin settings, if required - Add Demo and detailed Readme for the Obsidian plugin Ease setup and usage. Give context about capabilities - Miscellaneous - Trying to keep a mono repo until the Khoj project is mature enough to reduce maintenance burden	2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky	e5ef7789fc	Add screenshot of Khoj as PWA on Android Homescreen to Readme	2023-01-04 15:47:08 -03:00
Debanjum Singh Solanky	feddb6ce62	Add start_url to khoj webmanifest to show Khoj as PWA on Chrome	2023-01-04 13:37:56 -03:00
Debanjum Singh Solanky	5ca60a2df7	Add How to Access Khoj on Mobile instructions to Readme	2023-01-04 13:37:40 -03:00
Debanjum Singh Solanky	3dee1aed9e	Create /config/data/default API endpoint to serve default khoj config This can ease configuring khoj from the different interfaces - Don't need to know all the (default) config used by khoj. - Just get default config by calling the above API endpoint. - Then modify desired portions and call POST /api/config/data to configure khoj.	2023-01-03 21:52:34 -03:00
Debanjum Singh Solanky	ce945f7a90	Configure processors too on calling /update API - Previously only search was being reconfigured - But Processors are configured on app start too - Match that behavior on calling /update API	2023-01-03 21:51:02 -03:00
Debanjum Singh Solanky	9d31988f42	Allow starting khoj in non-GUI mode without config file instantiated - Start khoj server (in non-GUI mode) without needing config file already instantiated. - But throw warning to configure khoj to use it - This allows plugins to configure the app via the /config/data APIs - To be used by the Khoj obsidian plugin to configure markdown content in khoj	2023-01-03 21:36:59 -03:00
Debanjum Singh Solanky	52664dd96c	Allow recursive glob pattern (**) to add files to search index - Simplify configuring files to index For Obsidian/Org-Roam type systems with lots of small files in khoj.yml using `input-filter'	2023-01-03 01:32:58 -03:00
Debanjum Singh Solanky	152e5f1661	Return the file of each search result in response - Useful for enabling jump to note functionality in interfaces - It will be used in the Khoj plugin for Obsidian	2023-01-03 01:25:34 -03:00
Debanjum	fe1398401d	Automatically update search index hourly - `c535953` Update index automatically in non GUI mode too - `701d92e` Lock the index before updating it via API or Scheduler - `3b0783a` Automate updating embeddings, search index on an hourly schedule Resolves #106	2023-01-02 00:37:59 +00:00
Debanjum Singh Solanky	c535953915	Update index automatically in non GUI mode too - Poll scheduler every minute using threading.Timer - Use 60 seconds polling interval to avoid fork bombing - Schedule next via the same poll scheduler - Allow clean program interrupt by running scheduler in daemon mode	2023-01-01 21:03:19 -03:00
Debanjum Singh Solanky	701d92e17b	Lock the index before updating it via API or Scheduler - There are 3 paths to updating/setting the index (stored in state.model) - App start - API - Scheduler - Put all updates to the index behind a lock, as multiple update paths could (potentially) run at the same time (via API or Scheduler)	2023-01-01 17:09:36 -03:00
Debanjum Singh Solanky	3b0783aab9	Automate updating embeddings, search index on an hourly schedule - Use the schedule pypi package - Use QTimer to poll schedule.run_pending() regularly for jobs to run	2023-01-01 17:09:36 -03:00
Debanjum Singh Solanky	a58c243bc0	Document using Word, Date and File Query Filter in Readme	2022-12-26 16:12:49 -03:00
Debanjum	06c25682c9	Split text entries by max tokens supported by ML models ### Background There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector. For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated) ### Issue Until now entries exceeding max token size would silently get truncated during embedding generation. So the truncated portion of the entries would be ignored when matching queries with entries This would degrade the quality of the results ### Fix - `e057c8e` Add method to split entries by specified max tokens limit - Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL - `b283650` Deduplicate results for user query by raw text before returning results ### Results - The quality of the search results should improve - Relevant, long entries should show up in results more often	2022-12-26 18:23:43 +00:00
Debanjum Singh Solanky	17fa123b4e	Split entries by max tokens while converting Beancount entries To JSONL	2022-12-26 15:14:32 -03:00
Debanjum Singh Solanky	f209e30a3b	Split entries by max tokens while converting Markdown entries To JSONL	2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky	24676f95d8	Fix comments, use minimal test case, regenerate test index, merge debug logs - Remove property drawer from test entry for max_words splitting test - Property drawer is not required for the test - Keep minimal test case to reduce chance for confusion	2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky	b283650991	Deduplicate results for user query by raw text before returning results - Required because entries are now split by the max_word count supported by the ML models - This would now result in potentially duplicate hits, entries being returned to user - Do deduplication after ranking to get the top ranked deduplicated results	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	53cd2e5605	Regenerate initial model in asymmetric reload test to reduce flakiness - Fix logger message when converting org node to entries - Remove unused import from conftest	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	c79919bd68	Split entries by max tokens while converting Org entries To JSONL - Test usage of entry splitting by max tokens in text search	2022-12-25 21:36:00 -03:00
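
The entry-splitting commits above (`c79919b`, `f209e30`, `17fa123`) and the follow-up deduplication commit (`b283650`) describe two steps: split each entry into chunks no longer than the model's max token (word) limit, then deduplicate hits by raw text after ranking. A minimal Python sketch of that idea follows. It is not the code from those commits: the function names, the `(raw_text, score)` hit shape, and the assumption that a higher score ranks better are all hypothetical, and whitespace-separated words stand in for model tokens.

```python
from typing import Iterable


def split_by_max_words(text: str, max_words: int = 256) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Hypothetical helper: words approximate tokens, as in the commit message.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def deduplicate_ranked(hits: Iterable[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rank hits, then keep only the best-ranked hit for each raw entry text.

    Assumption: each hit is (raw_text, score) and a higher score is better.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    seen: set[str] = set()
    unique = []
    for raw_text, score in ranked:
        if raw_text not in seen:  # first occurrence is the top-ranked one
            seen.add(raw_text)
            unique.append((raw_text, score))
    return unique
```

Deduplicating after ranking, as `b283650` notes, means that when several chunks of one entry match a query, the entry appears once at the position of its best-matching chunk.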


1558 commits
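
The scheduler commits listed above (`3b0783a`, `c535953`, `701d92e`) describe polling for pending jobs with a self-rescheduling `threading.Timer` run as a daemon, and putting every index update behind a lock. A minimal standard-library sketch of that pattern follows. The names `index_lock`, `state`, `update_index` and `poll` are hypothetical, and the `run_pending` callable stands in for the `schedule.run_pending()` call the commit message mentions; this is an illustration under those assumptions, not the project's implementation.

```python
import threading
from typing import Callable, Optional

# Hypothetical shared state: a counter stands in for the real search index.
index_lock = threading.Lock()
state = {"index_version": 0}


def update_index() -> None:
    """Update the index while holding the lock, so that updates from the
    API and from the scheduler cannot interleave."""
    with index_lock:
        state["index_version"] += 1


def poll(
    interval_seconds: float,
    run_pending: Callable[[], None],
    stop_event: threading.Event,
) -> Optional[threading.Timer]:
    """Run pending jobs once, then arm a daemon timer for the next poll.

    Returns the armed timer, or None if stop_event was already set.
    """
    if stop_event.is_set():
        return None
    run_pending()
    timer = threading.Timer(
        interval_seconds, poll, args=(interval_seconds, run_pending, stop_event)
    )
    timer.daemon = True  # daemon thread: does not block a clean program exit
    timer.start()
    return timer
```

Calling `poll(60.0, update_index, threading.Event())` would run one update immediately and schedule the next poll 60 seconds later, matching the polling interval named in `c535953`.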