sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	22f6db0a6b	Upgrade RapidOCR and enable for Python 3.12. Fix PDF OCR test	2024-06-22 16:01:55 +05:30
Raghav Tirumale	bd3b590153	Support Indexing Docx Files (#801) * Add support for indexing docx files and associated unit tests --------- Co-authored-by: sabaimran <narmiabas@gmail.com>	2024-06-20 11:18:01 +05:30
Raghav Tirumale	d4e5c95711	Add Ability to Summarize Documents (#800) * Uses entire file text and summarizer model to generate document summary. * Uses the contents of the user's query to create a tailored summary. * Integrates with File Filters #788 for a better UX.	2024-06-18 19:31:07 +05:30
Debanjum	6afbd8032e	Improve Intermediate Steps in Formulating Chat Response (#799) # Major - Disambiguate Text output mode to disambiguate from Default data source lookup - Fix showing headings in intermediate step in generating chat response - Remove "Path" prefix from org ancestor heading in compiled entry # Minor - Fix OpenAI chat actor, director unit tests	2024-06-09 07:55:01 +05:30
Debanjum Singh Solanky	f440ddbe1d	Fix openai chat actor, director tests - Update test ChatModelOptions setup since update to its schema - Fix stale function calls using their updated signatures	2024-06-09 07:24:47 +05:30
Debanjum Singh Solanky	5f2442450c	Update truncation test to reduce flakiness in cloud tests Removed dependency on faker, factory for the truncation tests as that seems to be the point of flakiness	2024-06-07 19:42:48 +05:30
Debanjum Singh Solanky	18f7e6e7ed	Remove "Path" prefix from org ancestor heading in compiled entry	2024-06-06 16:51:26 +05:30
Debanjum Singh Solanky	22289a0002	Improve task scheduling by using json mode and agent scratchpad - The task scheduling actor was having trouble calculating the timezone. Giving the actor a scratchpad to improve correctness by thinking step by step - Add more examples to reduce chances of the inferred query looping to create another reminder instead of running the query and sharing results with user - Improve task scheduling chat actor test with more tests and by ensuring unexpected words not present in response	2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky	7f5981594c	Only notify when scheduled task results satisfy user's requirements There's a difference between running a scheduled task and notifying the user about the results of running the scheduled task. Decide to notify the user only when the results of running the scheduled task satisfy the user's requirements. Use sync version of send_message_to_model_wrapper for scheduled tasks	2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky	c28d7d3414	Add basic chat actor test to infer scheduled queries	2024-05-01 08:28:59 +05:30
Debanjum	17a06f152c	Support Llama 3 and Improve Offline Chat Actors (#724) - Add support for Llama 3 in Khoj offline mode - Make chat actors generate valid json with more local models - Fix offline chat actor tests	2024-04-25 14:00:56 +05:30
Debanjum Singh Solanky	ec41482324	Upgrade default cross-encoder to mixedbread ai's mxbai-rerank-xsmall Previous cross-encoder model was a few years old, newer models should have improved in quality. Model size increases by 50% compared to previous for better performance, at least on benchmarks	2024-04-24 09:50:09 +05:30
Debanjum Singh Solanky	f2db8d7d99	Fix offline chat actor tests Do not check for original q in extracted questions. Since this was removed in a previous commit	2024-04-24 09:40:00 +05:30
sabaimran	60658a8037	Get rid of enable flag for the offline chat processor config - Default, assume that offline chat is enabled if there is an offline chat model option configured	2024-04-23 23:08:29 +05:30
sabaimran	6de4a4873a	Fix image-related client unit test	2024-04-17 13:28:48 +05:30
sabaimran	3132430737	Add tests for the db lock	2024-04-17 13:22:41 +05:30
sabaimran	d11354f9c8	Remove additional references to image content config	2024-04-17 13:00:50 +05:30
sabaimran	87b9a93fa1	Update assertion line to match new logic	2024-04-12 13:09:19 +05:30
sabaimran	e58bd0e485	Remove mbox file from list of files expected to be included	2024-04-12 12:55:22 +05:30
Debanjum Singh Solanky	8291b898ca	Standardize structure of text to entries to match other entry processors Add process_single_plaintext_file func etc with similar signatures as org_to_entries and markdown_to_entries processors The standardization makes modifications, abstractions easier to create	2024-04-09 20:19:40 +05:30
Debanjum	11ce3e2268	Update Text Chunking Strategy to Improve Search Context (#645) ## Major - Parse markdown, org parent entries as single entry if fit within max tokens - Parse a file as single entry if it fits with max token limits - Add parent heading ancestry to extracted markdown entries for context - Chunk text in preference order of para, sentence, word, character ## Minor - Create wrapper function to get entries from org, md, pdf & text files - Remove unused Entry to Jsonl converter from text to entry class, tests - Dedupe code by using single func to process an org file into entries Resolves #620	2024-04-08 13:56:38 +05:30
Debanjum Singh Solanky	9239c2c2ed	Update drop large words test to ensure newlines considered word boundary Prevent regression to #620	2024-04-08 13:38:08 +05:30
sabaimran	f57f9f672d	Address Notion, Image tech debt in indexing code path (#687) * Add support for using OAuth2.0 in the Notion integration * Add notion to the admin page * Remove unnecessary content_index and image search/setup references * Trigger background job to start indexing Notion after user configures it * Add a log line when a new Notion integration is setup * Fix references to the configure_content methods	2024-04-05 12:10:03 +05:30
Debanjum Singh Solanky	29c1c18042	Increase search distance to get relevant content for chat post indexer update More content indexed per entry would result in an overall scores lowering effect. Increase default search distance threshold to counter that - Details - Fix expected results post indexing updates - Fix search with max distance post indexing updates - Minor - Remove openai chat actor test for after: operator as it's not expected anymore	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	ad4fa4b2f4	Fix adding file path instead of stem to markdown entries	2024-04-04 02:41:55 +05:30
sabaimran	720139c3c1	Fix all unit tests for test_text_search	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44b3247869	Update logical splitting of org-mode text into entries - Major - Do not split org file, entry if it fits within the max token limits - Recurse down org file entries, one heading level at a time until reach leaf node or the current parent tree fits context window - Update `process_single_org_file' func logic to do this recursion - Convert extracted org nodes with children into entries - Previously org node to entry code just had to handle leaf entries - Now it receives a list of org node trees - Only add ancestor path to root org-node of each tree - Indent each entry tree's headings by +1 level from base level (=2) - Minor - Stop timing org-node parsing vs org-node to entry conversion Just time the wrapping function for org-mode entry extraction This standardizes what is being timed across md, org etc. - Move try/catch to `extract_org_nodes' from `parse_single_org_file' func to standardize this also across md, org	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44eab74888	Dedupe code by using single func to process an org file into entries Add type hints to orgnode and org-to-entries packages	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	db2581459f	Parse markdown parent entries as single entry if fit within max tokens These changes improve context available to the search model. Specifically this should improve entry context from short knowledge trees, that is knowledge bases with sparse, short heading/entry trees Previously we'd always split markdown files by headings, even if a parent entry was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to the search model to select appropriate entries for a query, especially from short entry trees Revert back to using regex to parse through markdown file instead of using MarkdownHeaderTextSplitter. It was easier to implement the logical split using regexes rather than bend MarkdownHeaderTextSplitter to implement it. - DFS traverse the markdown knowledge tree, prefix ancestry to each entry	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	982ac1859c	Parse markdown file as single entry if it fits with max token limits These changes improve entry context available to the search model Specifically this should improve entry context from short knowledge trees, that is knowledge bases with small files Previously we split all markdown files by their headings, even if the file was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to select the appropriate entries for a given query for the search model, especially from short knowledge trees	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	d8f01876e5	Add parent heading ancestry to extracted markdown entries for context Improve, update the markdown to entries extractor tests	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	86575b2946	Chunk text in preference order of para, sentence, word, character - Previous simplistic chunking strategy of splitting text by space didn't capture notes with newlines, no spaces. For e.g in #620 - New strategy will try to chunk the text at more natural points like paragraph, sentence, word first. If none of those work it'll split at character to fit within max token limit - Drop long words while preserving original delimiters Resolves #620	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	a627f56a64	Remove unused Entry to Jsonl converter from text to entry class, tests This was earlier used when the index was plaintext jsonl file. Now that documents are indexed in a DB this func is not required. Simplify org,md,pdf,plaintext to entries tests by removing the entry to jsonl conversion step	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	28105ee027	Create wrapper function to get entries from org, md, pdf & text files - Convert extract_org_entries function to actually extract org entries Previously it was extracting intermediary org-node objects instead Now it extracts the org-node objects from files and converts them into entries - Create separate, new function to extract_org_nodes from files - Similarly create wrapper funcs for md, pdf, plaintext to entries - Update org, md, pdf, plaintext to entries tests to use the new simplified wrapper function to extract org entries	2024-04-04 02:41:55 +05:30
Debanjum	215ab6e66a	Extract More Dates from entries to improve Date Filter (#683) - Overview - Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year) - Extract some natural, partial dates as well from entries - Capability Add ability to extract the following additional date forms: - Natural Dates: 21st April 2000, February 29 2024 - Partial Natural Dates: March 24, Mar 2024 - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20 Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters - Performance Using regexes is MUCH faster than using the `dateparser' python library It's a little crude but gives acceptable performance for large datasets	2024-04-02 16:14:53 +05:30
Debanjum Singh Solanky	4228965c9b	Handle msg truncation when question is larger than max prompt size Notice and truncate the question itself at this point	2024-03-31 15:50:06 +05:30
Debanjum Singh Solanky	886d49e3a4	Merge branch 'master' into migrate-to-llama-cpp-for-offline-chat	2024-03-31 00:59:20 +05:30
Debanjum Singh Solanky	7923903d21	Improve date filter regexes to extract structured, natural, partial dates - Much faster than using dateparser - It took 2x-4x for improved regex to extract 1-15% more dates - Whereas it took 33x to 100x for dateparser to extract 65% - 400% more dates - Improve date extractor tests to test deduping dates, natural, structured date extraction from content - Extract some natural, partial dates and more structured dates Using regex is much faster than using dateparser. It's a little crude but should pay off in performance. Supports dates of form: - (Day-of-Month) Month\|AbbreviatedMonth Year\|2DigitYear - Month\|AbbreviatedMonth (Day-of-Month) Year\|2DigitYear	2024-03-30 00:07:19 +05:30
Debanjum Singh Solanky	104eeea274	Extract natural language and locale specific dates in content Previously we just extracted dates in YYYY-MM-DD format from content for date filtering during search. Use dateparser to extract dates across locales and natural language This should improve notes returned as context when chat searches knowledge base with date filters Fallback to regex for date parsing from content if dateparser fails - Limit natural date extractor capabilities to improve performance - Assume language is English Language detection otherwise takes a REALLY long time - Do not extract unix timestamps, timezone - This isn't required, as just using date and approximating dates as UTC	2024-03-30 00:06:56 +05:30
Debanjum Singh Solanky	8ca39a436c	Use llama.cpp for offline chat models - Benefits of moving to llama-cpp-python from gpt4all: - Support for all GGUF format chat models - Support for AMD, Nvidia, Mac, Vulkan GPU machines (instead of just Vulkan, Mac) - Supports models with more capabilities like tools, schema enforcement, speculative decoding, image gen etc. - Upgrade default chat model, prompt size, tokenizer for new supported chat models - Load offline chat model when present on disk without requiring internet - Load model onto GPU if not disabled and device has GPU - Load model onto CPU if loading model onto GPU fails - Create helper function to check and load model from disk, when model glob is present on disk. `Llama.from_pretrained' needs internet to get repo info from HuggingFace. This isn't required, if the model is already downloaded Didn't find any existing HF or llama.cpp method that looked for model glob on disk without internet	2024-03-26 22:33:01 +05:30
sabaimran	fdf78525b4	Part 2: Add web UI updates for basic agent interactions (#675) * Initial pass at backend changes to support agents - Add a db model for Agents, attaching them to conversations - When an agent is added to a conversation, override the system prompt to tweak the instructions - Agents can be configured with prompt modification, model specification, a profile picture, and other things - Admin-configured models will not be editable by individual users - Add unit tests to verify agent behavior. Unit tests demonstrate imperfect adherence to prompt specifications * Customize default behaviors for conversations without agents or with default agents * Add a new web client route for viewing all agents * Use agent_id for getting correct agent * Add web UI views for agents - Add a page to view all agents - Add slugs to manage agents - Add a view to view single agent - Display active agent when in chat window - Fix post-login redirect issue * Fix agent view * Spruce up the 404 page and improve the overall layout for agents pages * Create chat actor for directly reading webpages based on user message - Add prompt for the read webpages chat actor to extract, infer webpage links - Make chat actor infer or extract webpage to read directly from user message - Rename previous read_webpage function to more narrow read_webpage_at_url function * Rename agents_page -> agent_page * Fix unit test for adding the filename to the compiled markdown entry * Fix layout of agent, agents pages * Merge migrations * Let the name, slug of the default agent be Khoj, khoj * Fix chat-related unit tests * Add webpage chat command for read web pages requested by user Update auto chat command inference prompt to show example of when to use webpage chat command (i.e when url is directly provided in link) * Support webpage command in chat API - Fallback to use webpage when SERPER not setup and online command was attempted - Do not stop responding if can't retrieve online results. Try to respond without the online context * Test select webpage as data source and extract web urls chat actors * Tweak prompts to extract information from webpages, online results - Show more of the truncated messages for debugging context - Update Khoj personality prompt to encourage it to remember its capabilities * Rename extract_content online results field to webpages * Parallelize simple webpage read and extractor Similar to what is being done with search_online with olostep * Pass multiple webpages with their urls in online results context Previously even if MAX_WEBPAGES_TO_READ was > 1, only 1 extracted content would ever be passed. URL of the extracted webpage content wasn't passed to clients in online results context. This limited them from being rendered * Render webpage read in chat response references on Web, Desktop apps * Time chat actor responses & chat api request start for perf analysis * Increase the keep alive timeout in the main application for testing * Do not pipe access/error logs to separate files. Flow to stdout/stderr * [Temp] Reduce to 1 gunicorn worker * Change prod docker image to use jammy, rather than nvidia base image * Use Khoj icon when Khoj web is installed on iOS as a PWA * Make slug required for agents * Simplify calling logic and prevent agent access for unauthenticated users * Standardize to use personality over tuning in agent nomenclature * Make filtering logic more stringent for accessible agents and remove unused method: * Format chat message query --------- Co-authored-by: Debanjum Singh Solanky <debanjum@gmail.com>	2024-03-26 18:13:24 +05:30
Debanjum	586654e2af	Allow directly reading web pages, even when SERP not enabled (#676) ### Overview Khoj can now read websites directly without needing to go through the search step first ### Details - Parallelize simple webpage read and extractor - Rename extract_content online results field to web pages - Tweak prompts to extract information from webpages, online results - Test select webpage as data source and extract web urls chat actors - Render webpage read in chat response references on Web, Desktop apps - Pass multiple webpages with their urls in online results context - Support webpage command in chat API - Add webpage chat command for read web pages requested by user - Create chat actor for directly reading webpages based on user message	2024-03-24 16:25:25 +05:30
Debanjum Singh Solanky	85c62efca1	Test select webpage as data source and extract web urls chat actors	2024-03-24 15:46:29 +05:30
sabaimran	8abc8ded82	Part 1: Server-side changes to support agents integrated with Conversations (#671) * Initial pass at backend changes to support agents - Add a db model for Agents, attaching them to conversations - When an agent is added to a conversation, override the system prompt to tweak the instructions - Agents can be configured with prompt modification, model specification, a profile picture, and other things - Admin-configured models will not be editable by individual users - Add unit tests to verify agent behavior. Unit tests demonstrate imperfect adherence to prompt specifications * Customize default behaviors for conversations without agents or with default agents * Use agent_id for getting correct agent * Merge migrations * Simplify some variable definitions, add additional security checks for agents * Rename agent.tuning -> agent.personality	2024-03-23 22:09:38 +05:30
Debanjum Singh Solanky	ecddf98430	Handle truncation when single long non-system chat message Previously was assuming the system prompt is being always passed as the first message. So expected there to be at least 2 messages in logs. This broke chat actors querying with single long non system message. A more robust way to extract system prompt is via the message role instead	2024-03-15 15:58:39 +05:30
Debanjum Singh Solanky	6118d1ff57	Create chat actor for directly reading webpages based on user message - Add prompt for the read webpages chat actor to extract, infer webpage links - Make chat actor infer or extract webpage to read directly from user message - Rename previous read_webpage function to more narrow read_webpage_at_url function	2024-03-14 14:58:37 +05:30
Debanjum Singh Solanky	dd883dc53a	Dedupe query in notes prompt. Improve OAI chat actor, director tests - Remove stale tests - Improve tests to pass across gpt-3.5 and gpt-4-turbo - The haiku creation director was failing because of duplicate query in instantiated prompt	2024-03-14 01:22:33 +05:30
Debanjum Singh Solanky	70b04d16c0	Test data source, output mode selector, web search query chat actors	2024-03-14 01:22:33 +05:30
Debanjum Singh Solanky	88f096977b	Read webpages directly when Olostep proxy not setup This is useful for self-hosted, individual user, low traffic setups where a proxy service is not required	2024-03-11 18:41:02 +05:30
Debanjum Singh Solanky	ca2f962e95	Read, extract information from web pages in parallel to lower response time - Time reading webpage, extract info from webpage steps for perf analysis - Deduplicate webpages to read gathered across separate google searches - Use aiohttp to make API requests non-blocking, pair with asyncio to parallelize all the online search webpage read and extract calls	2024-03-11 18:41:02 +05:30
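
Commit 86575b2946 above describes chunking text "in preference order of para, sentence, word, character". The following is only a rough sketch of that idea, not the code from the commit: the names `SEPARATORS` and `chunk_text` are hypothetical, size is measured in characters rather than model tokens, and packed pieces are rejoined with a single space (the commit mentions preserving original delimiters, which this sketch does not do).

```python
import re

# Hypothetical boundary patterns, tried in order: paragraph, sentence, word.
SEPARATORS = [r"\n\n+", r"(?<=[.!?])\s+", r"\s+"]


def chunk_text(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    preferring the most natural boundary that works."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if level == len(SEPARATORS):
        # Last resort: no natural boundary left, so cut at fixed character offsets.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    chunks: list[str] = []
    current = ""
    for piece in re.split(SEPARATORS[level], text):
        # Pieces that are still too large are split at the next, finer boundary.
        for part in chunk_text(piece, max_chars, level + 1):
            candidate = f"{current} {part}" if current else part
            if len(candidate) <= max_chars:
                # Greedily pack adjacent parts into the current chunk.
                current = candidate
            else:
                chunks.append(current)
                current = part
    if current:
        chunks.append(current)
    return chunks
```

Because whitespace of any kind counts as a word boundary here, a note made only of newline-separated words (the situation the commit attributes to #620) is still split at words instead of being treated as one oversized token.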

365 commits