sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
sabaimran	1a1d9c7257	Merge branch 'master' of github.com:khoj-ai/khoj into features/big-upgrade-chat-ux	2024-07-27 14:18:05 +05:30
sabaimran	44d34f9090	Update the unit test for the subscribed user	2024-07-26 19:59:01 +05:30
sabaimran	377f7668c5	Merge pull request #858 from khoj-ai/use-sse-instead-of-websocket Use Single HTTP API for Robust, Generalizable Chat Streaming	2024-07-26 07:11:54 -07:00
Debanjum Singh Solanky	54b4203683	Update chat API client tests to mix testing of batch and streaming mode	2024-07-23 17:56:03 +05:30
Debanjum Singh Solanky	469a1cb6a2	Move API endpoints under /api/configure/content/ to /api/content/ Pull out /api/configure/content API endpoints into /api/content to allow for more logical organization of API path hierarchy This should make the url more succinct and API request intent more understandable by using existing HTTP method semantics along with the path. The /configure URL path segment was either - redundant (e.g. POST /configure/notion) or - incorrect (e.g. GET /configure/files) Some examples of naming improvements: - GET /configure/types -> GET /content/types - GET /configure/files -> GET /content/files - DELETE /configure/files -> DELETE /content/files This should also align and merge better with the content indexing API triggered via PUT, PATCH /content Refactor Flow 1. Rename /api/configure/types -> /api/content/types 2. Rename /api/configure -> /api 3. Move /api/content to api_content from under api_config	2024-07-19 05:40:34 +05:30
Debanjum Singh Solanky	bba4e0b529	Accept file deletion requests by clients during sync - Remove unused full_corpus boolean. The full_corpus=False code path wasn't being used (except in a test) - The full_corpus=True code path used was ignoring file deletion requests sent by clients during sync. Unclear why this was done - Added unit test to prevent regression and show file deletion by clients during sync not ignored now	2024-07-19 04:53:01 +05:30
Debanjum Singh Solanky	5923b6d89e	Split /api/v1/index/update into /api/content PUT, PATCH API endpoints - This utilizes PUT, PATCH HTTP method semantics to remove need for the "regenerate" query param and "/update" url suffix - This should make the url more succinct and API request intent more understandable by using existing HTTP method semantics	2024-07-19 01:45:53 +05:30
Debanjum Singh Solanky	e9f86e320b	Fix and improve offline chat actor, director tests - Use updated references schema with compiled key - Enable director tests that are now expected to pass and that do pass (with Gemma 2 at least)	2024-07-18 03:43:09 +05:30
Debanjum Singh Solanky	de15a7a3fc	Rename API path /api/config to /api/configure - Update clients calling /api/config to call /api/configure instead	2024-07-16 16:13:27 +05:30
Debanjum Singh Solanky	21fe1a917b	Support syncing, searching images from Obsidian plugin	2024-07-11 16:22:31 +05:30
Debanjum Singh Solanky	010486fb36	Split current section once by heading to resolve org-mode indexing bug - Split once by heading (=first_non_empty) to extract current section body Otherwise child headings with same prefix as current heading will cause the section split to go into infinite loop - Also add check to prevent getting into recursive loop while trying to split entry into sub sections	2024-07-06 19:35:59 +05:30
Debanjum Singh Solanky	d5ceff2691	Update tests and documentation with Jina reader API usage and info Update offline, openai chat actor, director tests to not require Serper to run the online command tests Update documentation for self-hosted online search to mention no setup is required by default. But improvements can be made by using Serper.dev or Olostep	2024-07-02 17:19:09 +05:30
Raghav Tirumale	8eccd8a5e4	Support Indexing Images via OCR (#823) - Added support for uploading .jpeg, .jpg, and .png files to Khoj from Web, Desktop app - Updating indexer to generate raw text and entries using RapidOCR - Details * added support for indexing images via ocr * fixed pyproject.toml * Update src/khoj/processor/content/images/image_to_entries.py Co-authored-by: Debanjum <debanjum@gmail.com> * Update src/khoj/processor/content/images/image_to_entries.py Co-authored-by: Debanjum <debanjum@gmail.com> * removed redundant try except blocks * updated desktop js file to support image formats * added tests for jpg and png * Fix processing for image to entries files * Update unit tests with working image indexer * Change png test from version verification to open-cv verification --------- Co-authored-by: Debanjum <debanjum@gmail.com> Co-authored-by: sabaimran <narmiabas@gmail.com>	2024-07-01 06:00:00 -07:00
Debanjum Singh Solanky	732332a3c5	Spell fix s/e.g/e.g./ across code, tests and docs	2024-06-24 15:24:45 +05:30
Debanjum Singh Solanky	22f6db0a6b	Upgrade RapidOCR and enable for Python 3.12. Fix PDF OCR test	2024-06-22 16:01:55 +05:30
Raghav Tirumale	bd3b590153	Support Indexing Docx Files (#801) * Add support for indexing docx files and associated unit tests --------- Co-authored-by: sabaimran <narmiabas@gmail.com>	2024-06-20 11:18:01 +05:30
Raghav Tirumale	d4e5c95711	Add Ability to Summarize Documents (#800) * Uses entire file text and summarizer model to generate document summary. * Uses the contents of the user's query to create a tailored summary. * Integrates with File Filters #788 for a better UX.	2024-06-18 19:31:07 +05:30
Debanjum	6afbd8032e	Improve Intermediate Steps in Formulating Chat Response (#799) # Major - Disambiguate Text output mode to disambiguate from Default data source lookup - Fix showing headings in intermediate step in generating chat response - Remove "Path" prefix from org ancestor heading in compiled entry # Minor - Fix OpenAI chat actor, director unit tests	2024-06-09 07:55:01 +05:30
Debanjum Singh Solanky	f440ddbe1d	Fix openai chat actor, director tests - Update test ChatModelOptions setup since update to its schema - Fix stale function calls using their updated signatures	2024-06-09 07:24:47 +05:30
Debanjum Singh Solanky	5f2442450c	Update truncation test to reduce flakiness in cloud tests Removed dependency on faker, factory for the truncation tests as that seems to be the point of flakiness	2024-06-07 19:42:48 +05:30
Debanjum Singh Solanky	18f7e6e7ed	Remove "Path" prefix from org ancestor heading in compiled entry	2024-06-06 16:51:26 +05:30
Debanjum Singh Solanky	22289a0002	Improve task scheduling by using json mode and agent scratchpad - The task scheduling actor was having trouble calculating the timezone. Giving the actor a scratchpad to improve correctness by thinking step by step - Add more examples to reduce chances of the inferred query looping to create another reminder instead of running the query and sharing results with user - Improve task scheduling chat actor test with more tests and by ensuring unexpected words not present in response	2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky	7f5981594c	Only notify when scheduled task results satisfy user's requirements There's a difference between running a scheduled task and notifying the user about the results of running the scheduled task. Decide to notify the user only when the results of running the scheduled task satisfy the user's requirements. Use sync version of send_message_to_model_wrapper for scheduled tasks	2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky	c28d7d3414	Add basic chat actor test to infer scheduled queries	2024-05-01 08:28:59 +05:30
Debanjum	17a06f152c	Support Llama 3 and Improve Offline Chat Actors (#724) - Add support for Llama 3 in Khoj offline mode - Make chat actors generate valid json with more local models - Fix offline chat actor tests	2024-04-25 14:00:56 +05:30
Debanjum Singh Solanky	ec41482324	Upgrade default cross-encoder to mixedbread ai's mxbai-rerank-xsmall Previous cross-encoder model was a few years old, newer models should have improved in quality. Model size increases by 50% compared to previous for better performance, at least on benchmarks	2024-04-24 09:50:09 +05:30
Debanjum Singh Solanky	f2db8d7d99	Fix offline chat actor tests Do not check for original q in extracted questions. Since this was removed in a previous commit	2024-04-24 09:40:00 +05:30
sabaimran	60658a8037	Get rid of enable flag for the offline chat processor config - Default, assume that offline chat is enabled if there is an offline chat model option configured	2024-04-23 23:08:29 +05:30
sabaimran	6de4a4873a	Fix image-related client unit test	2024-04-17 13:28:48 +05:30
sabaimran	3132430737	Add tests for the db lock	2024-04-17 13:22:41 +05:30
sabaimran	d11354f9c8	Remove additional references to image content config	2024-04-17 13:00:50 +05:30
sabaimran	87b9a93fa1	Update assertion line to match new logic	2024-04-12 13:09:19 +05:30
sabaimran	e58bd0e485	Remove mbox file from list of files expected to be included	2024-04-12 12:55:22 +05:30
Debanjum Singh Solanky	8291b898ca	Standardize structure of text to entries to match other entry processors Add process_single_plaintext_file func etc with similar signatures as org_to_entries and markdown_to_entries processors The standardization makes modifications, abstractions easier to create	2024-04-09 20:19:40 +05:30
Debanjum	11ce3e2268	Update Text Chunking Strategy to Improve Search Context (#645) ## Major - Parse markdown, org parent entries as single entry if fit within max tokens - Parse a file as single entry if it fits with max token limits - Add parent heading ancestry to extracted markdown entries for context - Chunk text in preference order of para, sentence, word, character ## Minor - Create wrapper function to get entries from org, md, pdf & text files - Remove unused Entry to Jsonl converter from text to entry class, tests - Dedupe code by using single func to process an org file into entries Resolves #620	2024-04-08 13:56:38 +05:30
Debanjum Singh Solanky	9239c2c2ed	Update drop large words test to ensure newlines considered word boundary Prevent regression to #620	2024-04-08 13:38:08 +05:30
sabaimran	f57f9f672d	Address Notion, Image tech debt in indexing code path (#687) * Add support for using OAuth2.0 in the Notion integration * Add notion to the admin page * Remove unnecessary content_index and image search/setup references * Trigger background job to start indexing Notion after user configures it * Add a log line when a new Notion integration is set up * Fix references to the configure_content methods	2024-04-05 12:10:03 +05:30
Debanjum Singh Solanky	29c1c18042	Increase search distance to get relevant content for chat post indexer update More content indexed per entry would result in an overall scores lowering effect. Increase default search distance threshold to counter that - Details - Fix expected results post indexing updates - Fix search with max distance post indexing updates - Minor - Remove openai chat actor test for after: operator as it's not expected anymore	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	ad4fa4b2f4	Fix adding file path instead of stem to markdown entries	2024-04-04 02:41:55 +05:30
sabaimran	720139c3c1	Fix all unit tests for test_text_search	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44b3247869	Update logical splitting of org-mode text into entries - Major - Do not split org file, entry if it fits within the max token limits - Recurse down org file entries, one heading level at a time until reach leaf node or the current parent tree fits context window - Update `process_single_org_file' func logic to do this recursion - Convert extracted org nodes with children into entries - Previously org node to entry code just had to handle leaf entries - Now it receives a list of org node trees - Only add ancestor path to root org-node of each tree - Indent each entry trees headings by +1 level from base level (=2) - Minor - Stop timing org-node parsing vs org-node to entry conversion Just time the wrapping function for org-mode entry extraction This standardizes what is being timed across md, org etc. - Move try/catch to `extract_org_nodes' from `parse_single_org_file' func to standardize this also across md, org	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44eab74888	Dedupe code by using single func to process an org file into entries Add type hints to orgnode and org-to-entries packages	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	db2581459f	Parse markdown parent entries as single entry if fit within max tokens These changes improve context available to the search model. Specifically this should improve entry context from short knowledge trees, that is knowledge bases with sparse, short heading/entry trees Previously we'd always split markdown files by headings, even if a parent entry was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to the search model to select appropriate entries for a query, especially from short entry trees Revert back to using regex to parse through markdown file instead of using MarkdownHeaderTextSplitter. It was easier to implement the logical split using regexes rather than bend MarkdownHeaderTextSplitter to implement it. - DFS traverse the markdown knowledge tree, prefix ancestry to each entry	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	982ac1859c	Parse markdown file as single entry if it fits with max token limits These changes improve entry context available to the search model Specifically this should improve entry context from short knowledge trees, that is knowledge bases with small files Previously we split all markdown files by their headings, even if the file was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to select the appropriate entries for a given query for the search model, especially from short knowledge trees	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	d8f01876e5	Add parent heading ancestry to extracted markdown entries for context Improve, update the markdown to entries extractor tests	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	86575b2946	Chunk text in preference order of para, sentence, word, character - Previous simplistic chunking strategy of splitting text by space didn't capture notes with newlines, no spaces. For e.g. in #620 - New strategy will try to chunk the text at more natural points like paragraph, sentence, word first. If none of those work it'll split at character to fit within max token limit - Drop long words while preserving original delimiters Resolves #620	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	a627f56a64	Remove unused Entry to Jsonl converter from text to entry class, tests This was earlier used when the index was plaintext jsonl file. Now that documents are indexed in a DB this func is not required. Simplify org,md,pdf,plaintext to entries tests by removing the entry to jsonl conversion step	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	28105ee027	Create wrapper function to get entries from org, md, pdf & text files - Convert extract_org_entries function to actually extract org entries Previously it was extracting intermediary org-node objects instead Now it extracts the org-node objects from files and converts them into entries - Create separate, new function to extract_org_nodes from files - Similarly create wrapper funcs for md, pdf, plaintext to entries - Update org, md, pdf, plaintext to entries tests to use the new simplified wrapper function to extract org entries	2024-04-04 02:41:55 +05:30
Debanjum	215ab6e66a	Extract More Dates from entries to improve Date Filter (#683) - Overview - Extract more structured date variants (e.g. with dot(.) & slash(/) separators, 2-digit year) - Extract some natural, partial dates as well from entries - Capability Add ability to extract the following additional date forms: - Natural Dates: 21st April 2000, February 29 2024 - Partial Natural Dates: March 24, Mar 2024 - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20 Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters - Performance Using regexes is MUCH faster than using the `dateparser' python library It's a little crude but gives acceptable performance for large datasets	2024-04-02 16:14:53 +05:30
Debanjum Singh Solanky	4228965c9b	Handle msg truncation when question is larger than max prompt size Notice and truncate the question itself at this point	2024-03-31 15:50:06 +05:30


379 commits