sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	e3cd8b4150	Only index files returned by input-filter globs in fs_syncer Ignore .org, .pdf etc. suffixed directories under `input-filter' from being evaluated as files. Explicitly filter results by input-filter globs to only index files, not directory for each text type Add test to prevent regression Closes #448	2023-10-17 23:32:10 -07:00
Debanjum Singh Solanky	51363d280d	Do not configure khoj server for pull based indexing from khoj.el Do not make khoj server pull update index on Obsidian plugin load. Index is updated on push from plugin instead now.	2023-10-17 21:47:19 -07:00
Debanjum Singh Solanky	d9d133dfb9	Read text files as utf-8, instead of default os locale On Windows, the default locale isn't utf8. Khoj had regressed to reading files in OS specified locale encoding, e.g. cp1252, cp949 etc. It now explicitly uses utf8 encoding to read text files for indexing Resolves #495, resolves #472	2023-10-17 21:47:19 -07:00
Debanjum	3d4576ae38	Fix encoding binary files for sync from the Desktop, Obsidian client (#506) - Fix encoding binary files like PDFs for sync from Desktop client - Fix encoding binary files like PDFs for sync from Obsidian client	2023-10-17 15:37:22 -07:00
Debanjum Singh Solanky	c8293998d9	Fix encoding binary files like PDFs for sync from Obsidian client Use readBinary to read binary files like PDFs instead of read	2023-10-17 15:08:30 -07:00
sabaimran	ba60c869c9	Fix encoding binary files like PDFs for sync from Desktop client Use readFileSync, Buffer to pass appropriately formatted binary data	2023-10-17 15:08:23 -07:00
Andrew Spott	3d7381446d	Changed globbing. Now doesn't clobber a user's glob if they want to a… (#496) * Changed globbing. Now doesn't clobber a user's glob if they want to add it, but will (if just given a directory), add a recursive glob. Note: python's glob engine doesn't support `{}` globbing, a future option is to warn if that is included. * Fix typo in globformat variable * Use older glob pattern for plaintext files --------- Co-authored-by: Saba <narmiabas@gmail.com>	2023-10-17 11:26:06 -07:00
sabaimran	2646c8554d	Provide a default value to offline_chat configuration of the conversation processor	2023-10-17 10:35:22 -07:00
Debanjum Singh Solanky	b8976426eb	Update offline chat model config schema used by Emacs, Obsidian clients The server uses a new schema for the conversation config. The Emacs, Obsidian clients need to use this schema to update the conversation config	2023-10-17 07:01:35 -07:00
Debanjum	ecc6fbfeb2	Push Files to Index from Emacs, Obsidian & Desktop Clients using Multi-Part Forms (#499) ### Overview - Add ability to push data to index from the Emacs, Obsidian client - Switch to standard mechanism of syncing files via HTTP multi-part/form. Previously we were streaming the data as JSON - Benefits of new mechanism - No manual parsing of files to send or receive on clients or server is required as most have in-built mechanisms to send multi-part/form requests - The whole response is not required to be kept in memory to parse content as JSON. As individual files arrive they're automatically pushed to disk to conserve memory if required - Binary files don't need to be encoded on client and decoded on server ### Code Details #### Major - Use multi-part form to receive files to index on server - Use multi-part form to send files to index on desktop client - Send files to index on server from the khoj.el emacs client - Send content for indexing on server at a regular interval from khoj.el - Send files to index on server from the khoj obsidian client - Update tests to test multi-part/form method of pushing files to index #### Minor - Put indexer API endpoint under /api path segment - Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method - Improve emoji, message on content index updated via logger - Don't call khoj server on khoj.el load, only once khoj invoked explicitly by user - Improve indexing of binary files - Let fs_syncer pass PDF files directly as binary before indexing - Use encoding of each file set in indexer request to read file - Add CORS policy to khoj server. Allow requests from khoj apps, obsidian & localhost - Update indexer API endpoint URL to `index/update` from `indexer/batch` Resolves #471 #243	2023-10-17 06:05:15 -07:00
Debanjum Singh Solanky	7b1c62ba53	Mark test_get_configured_types_via_api unit test as flaky It passes locally on running individually but fails when run in parallel on local or CI	2023-10-17 05:56:00 -07:00
Debanjum Singh Solanky	6a4f1b2188	Add more client, request details in logs by index/update API endpoint	2023-10-17 05:43:29 -07:00
Debanjum Singh Solanky	5efae1ad55	Update indexer API endpoint query params for force, content type New URL query params, `force' and `t' match name of query parameter in existing Khoj API endpoints Update Desktop, Obsidian and Emacs client to call using these new API query params. Set `client' query param from each client for telemetry visibility	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	84654ffc5d	Update indexer API endpoint URL to index/update from indexer/batch New URL follows action oriented endpoint naming convention used for other Khoj API endpoints Update desktop, obsidian and emacs client to call this new API endpoint	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	e347823ff4	Log telemetry for index updates via push to API endpoint	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	05be6bd877	Clicking Update Index in Obsidian settings should push files to index Use the indexer/batch API endpoint to regenerate content index rather than the previous pull based content indexing API endpoint	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	13a3122bf3	Stop configuring server to pull files to index from Obsidian client Obsidian client now pushes vault files to index instead	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	99a2c934a3	Add CORS policy to allow requests from khoj apps, obsidian & localhost Using fetch from Khoj Obsidian plugin was failing due to cross-origin request and method: no-cors didn't allow passing x-api-key custom header. And using Obsidian's request with multi-part/form-data wasn't possible either.	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	541cd59a49	Let fs_syncer pass PDF files directly as binary before indexing No need to do unneeded base64 encoding/decoding to pass pdf contents for indexing from fs_syncer to pdf_to_jsonl	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	d27dc71dfe	Use encoding of each file set in indexer request to read file Get encoding type from multi-part/form-request body for each file Read text files as utf-8 and pdfs, images as binary	2023-10-17 04:58:12 -07:00
Debanjum Singh Solanky	8e627a5809	Pass any files to be deleted to indexer API via Khoj Obsidian plugin - Keep state of previously synced files to identify files to be deleted - Last synced files stored in settings for persistence of this data across Obsidian reboots	2023-10-17 03:34:49 -07:00
Debanjum Singh Solanky	f2e293a149	Push Vault files to index to Khoj server using Khoj Obsidian plugin Use the multi-part/form-data request to sync Markdown, PDF files in vault to index on khoj server Run scheduled job to push updates to vault for indexing every 1 hour	2023-10-17 03:05:30 -07:00
Debanjum Singh Solanky	6baaaaf91a	Test request body of multi-part form to update content index from khoj.el	2023-10-16 23:54:32 -07:00
Debanjum Singh Solanky	79b3f8273a	Make khoj.el send files to be deleted from index to server	2023-10-16 23:53:02 -07:00
Debanjum Singh Solanky	5dc399b32e	Document system requirements to run offline chat Closes #375	2023-10-16 19:39:06 -07:00
Debanjum Singh Solanky	f64fa06e22	Initialize the Khoj Transient menu on first run instead of load This prevents Khoj from polling the Khoj server until explicitly invoked via `khoj' entrypoint function. Previously it'd make a request to the khoj server every time Emacs or khoj.el was loaded Closes #243	2023-10-16 19:11:46 -07:00
Debanjum	b4949f7f0b	Improve Offline Chat Model Experience (#494) - Make offline chat model user configurable. Use the `filename` of any [GPT4All supported model](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models.json) - Run GPT4All Chat Model on GPU, when available via [GPT4All Vulkan support](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan) - Use default Llama 2 supported by GPT4All - Make `tokenizer` and `max-prompt-size` of chat model user configurable. E.g. when using chat models not in [this pre-defined list](https://github.com/khoj-ai/khoj/blob/master/src/khoj/processor/conversation/utils.py) that support larger context window or a different tokenizer. Closes #406, #418	2023-10-16 17:44:49 -07:00
Debanjum Singh Solanky	644c3b787f	Scale no. of chat history messages to use as context with max_prompt_size Previously lookback turns was set to a static 2. But now that we support more chat models, their prompt size vary considerably. Make lookback_turns proportional to max_prompt_size. The truncate_messages can remove messages if they exceed max_prompt_size later This lets Khoj pass more of the chat history as context for models with larger context window	2023-10-16 17:22:28 -07:00
Debanjum Singh Solanky	90e1d9e3d6	Pin gpt4all to 1.0.12 as next version will introduce breaking changes	2023-10-16 10:57:16 -07:00
Debanjum Singh Solanky	1a9023d396	Update Chat Actor test to not incept with prior world knowledge	2023-10-15 17:22:44 -07:00
Debanjum Singh Solanky	df1d74a879	Use max_prompt_size, tokenizer from config for chat model context stuffing	2023-10-15 16:52:53 -07:00
Debanjum Singh Solanky	116595b351	Use chat_model specified in new offline_chat section of config - Dedupe offline_chat_model variable. Only reference offline chat model stored under offline_chat. Delete the previous chat_model field under GPT4AllProcessorConfig - Set offline chat model to use via config/offline_chat API endpoint	2023-10-15 16:37:49 -07:00
Debanjum Singh Solanky	feb4f17e3d	Update chat config schema. Make max_prompt, chat tokenizer configurable This provides flexibility to use non 1st party supported chat models - Create migration script to update khoj.yml config - Put `enable_offline_chat' under new `offline-chat' section Referring code needs to be updated to accommodate this change - Move `offline_chat_model' to `chat-model' under new `offline-chat' section - Put chat `tokenizer` under new `offline-chat' section - Put `max_prompt' under existing `conversation' section As `max_prompt' size affects both openai and offline chat models	2023-10-15 16:35:11 -07:00
Debanjum Singh Solanky	247e75595c	Use AutoTokenizer to support more tokenizers	2023-10-14 16:54:52 -07:00
Saba	ff2dbadc9d	Use computed plaintext_content to set file content rather than calling f.read again	2023-10-14 13:28:34 -07:00
Debanjum Singh Solanky	1ad8b150e8	Add default tokenizer, max_prompt as fallback for non-default offline chat models Pass user configured chat model as argument to use by converse_offline The proper fix for this would allow users to configure the max_prompt and tokenizer to use (while supplying default ones, if none provided) For now, this is a reasonable start.	2023-10-13 22:48:56 -07:00
Debanjum Singh Solanky	56bd69d5af	Improve Llama v2 extract questions actor and associated prompt - Format extract questions prompt format with newlines and whitespaces - Make llama v2 extract questions prompt consistent - Remove empty questions extracted by offline extract_questions actor - Update implicit qs extraction unit test for offline search actor	2023-10-13 22:48:56 -07:00
sabaimran	09bb3686cc	Strip the incoming query from the slash conversation command (#500) * Strip the incoming query from the slash conversation command before passing it to the model or for search * Return q when content index not loaded * Remove -n 4 from pytest ini configuration to isolate test failures	2023-10-13 21:11:23 -07:00
Debanjum Singh Solanky	96c0b21285	Sync desktop app package.json with other Khoj clients metadata - Make `bump_version.sh' script set version for the Khoj desktop app too - Sync Khoj desktop app authors, license, description and version with the other interfaces and server - Update description in packages metadata to match project subtitle on Github	2023-10-13 20:43:55 -07:00
sabaimran	80fb56b8a5	Sync desktop app package version with the other releases	2023-10-13 19:23:00 -07:00
Debanjum Singh Solanky	b669aa2395	Clean and fix the content indexing code in the Emacs client - Pass payloads as unibyte. This was causing the request to fail for files with unicode characters - Suppress messages with file content on index updates - Fix rendering response from server on index update API call - Extract code to populate body of index update HTTP request with files	2023-10-13 18:00:37 -07:00
Debanjum Singh Solanky	bea196aa30	Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method Previously global state of `url-request-method' would affect the kind of request made to api/config/data API endpoint as it wasn't being explicitly set before calling the API endpoint This was done with the assumption that the default value of GET for url-request-method wouldn't change globally But in some cases, experientially, it can get changed. This was resulting in khoj.el load failing as POST request was being made instead which would throw error	2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky	292f0420ad	Send content for indexing on server at a regular interval from khoj.el - Allow indexing frequency to be configurable by user - Ensure there is only one khoj indexing timer running	2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky	bed3aff059	Update tests to test multi-part/form method of pushing files to index Instead of using the previous method to push data as json payload of POST request pass it as files to upload via the multi-part/form to the batch indexer API endpoint	2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky	fc99431754	Send files to index on server from the khoj.el emacs client - Add elisp variable to set API key to engage with the Khoj server - Use multi-part form to POST the files to index to the indexer API endpoint on the khoj server	2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky	68018ef397	Use multi-part form to send files to index on desktop client - Add typing for variables in for loop and other minor formatting clean-up - Assume utf8 encoding for text files and binary for image, pdf files	2023-10-12 20:58:49 -07:00
Debanjum Singh Solanky	7190b3811d	Remove all filter terms in user query from defiltered_query Previously only the last filter's terms were getting effectively applied as the `filter.defilter' operation was being done on `user_query' but was updating the `defiltered_query'	2023-10-12 20:56:17 -07:00
Debanjum Singh Solanky	72f8fde7ef	Run pytests in parallel on multiple CPU cores using pytest-xdist for speed	2023-10-12 20:56:17 -07:00
Debanjum Singh Solanky	60e9a61647	Use multi-part form to receive files to index on server - This uses existing HTTP affordance to process files - Better handling of binary file formats as removes need to url encode/decode - Less memory utilization than streaming json as files get automatically written to disk once memory utilization exceeds preset limits - No manual parsing of raw files streams required	2023-10-11 23:58:23 -07:00
Debanjum Singh Solanky	9ba173bc2d	Improve emoji, message on content index updated via logger Use mailbox closed with flag down once content index completed. Use standard, existing logger messages in new indexer messages, when files to index sent by clients	2023-10-11 17:12:03 -07:00
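
The input-filter behaviour described in commits 3d7381446d and e3cd8b4150 (add a recursive glob when only a directory is given, leave a user-supplied glob alone, and index only files rather than directories whose names end in `.org`, `.pdf` and so on) can be sketched as follows. This is a minimal illustration, not the code in fs_syncer: the function names `normalize_input_filter`, `files_matching` and `demo` are hypothetical, and the sketch assumes Python's standard `glob` module with `recursive=True`.

```python
import glob
import os
import tempfile
from typing import List


def normalize_input_filter(path_or_glob: str, suffix: str) -> str:
    """Hypothetical helper: expand a bare directory into a recursive glob.

    A value that is not an existing directory is assumed to be a glob the
    user wrote on purpose, so it is returned unchanged.
    """
    if os.path.isdir(path_or_glob):
        return os.path.join(path_or_glob, "**", f"*{suffix}")
    return path_or_glob


def files_matching(input_filter: str, suffix: str) -> List[str]:
    """Hypothetical helper: return only regular files matched by the filter."""
    pattern = normalize_input_filter(input_filter, suffix)
    matches = glob.glob(pattern, recursive=True)
    # A directory named e.g. "notes.org" also matches "*.org"; drop it here.
    return sorted(path for path in matches if os.path.isfile(path))


def demo() -> List[str]:
    """Build a small tree containing a '.org'-suffixed directory and list matches."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "notes.org"))  # a directory, not a file
        for relative in ("b.org", os.path.join("notes.org", "a.org")):
            with open(os.path.join(root, relative), "w", encoding="utf-8") as f:
                f.write("* heading\n")
        return sorted(os.path.basename(p) for p in files_matching(root, ".org"))
```

Calling `demo()` lists the two `.org` files and skips the `notes.org` directory itself. As the commit message notes, Python's glob engine does not support `{}` patterns, so a filter such as `*.{org,md}` would need separate handling.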

1683 commits