sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-27 17:35:07 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	14fbf594b2	Support using Python 3.12 with Khoj - RapidOCR for indexing image PDFs doesn't currently support python 3.12. It's an optional dependency anyway, so only install it if python < 3.12 - Run unit tests with python version 3.12 as well Resolves #522	2024-04-07 11:23:44 +05:30
sabaimran	86c831f7e2	Add a link to the data sources portion in the clients documentation	2024-04-07 09:32:58 +05:30
sabaimran	351fb31a34	Add webpage search to socket codepath, add a feature page for online search	2024-04-07 09:23:29 +05:30
Debanjum Singh Solanky	4be4c53222	Release Khoj version 1.9.0	2024-04-05 17:13:58 +05:30
sabaimran	54db0152b9	Add link to the khoj cloud service for connection to Notion	2024-04-05 15:41:43 +05:30
sabaimran	81f1450c1c	Update yarn.lock to sync with package.json for documentation	2024-04-05 15:36:23 +05:30
sabaimran	d22fd6dfe3	Get rid of unnecessary package-lock.json file	2024-04-05 15:34:02 +05:30
sabaimran	7d7ce92e46	Add updated information in docs about the Notion integration	2024-04-05 15:31:43 +05:30
sabaimran	2aedd3c819	Increase freq. of telemetry upload to every 5 minutes	2024-04-05 14:13:47 +05:30
sabaimran	3b1234d084	Await the calls to the db in the notion.py file	2024-04-05 13:58:14 +05:30
sabaimran	19c10b1418	Upgrade the package versions used in yarn.lock for the documentation project	2024-04-05 13:25:41 +05:30
sabaimran	00a67e9524	Add additional log lines when configuring the Notion settings for a user in the callback	2024-04-05 13:19:24 +05:30
sabaimran	d23f7da8e3	Handle the case where a previous search model isn't set when updating the model	2024-04-05 13:18:51 +05:30
sabaimran	f57f9f672d	Address Notion, Image tech debt in indexing code path (#687) * Add support for using OAuth2.0 in the Notion integration * Add notion to the admin page * Remove unnecessary content_index and image search/setup references * Trigger background job to start indexing Notion after user configures it * Add a log line when a new Notion integration is setup * Fix references to the configure_content methods	2024-04-05 12:10:03 +05:30
sabaimran	69dee75c34	Update the readme for accuracy, updated demos	2024-04-04 10:57:24 +05:30
sabaimran	a60321b68e	Push khoj to include inline references when possible	2024-04-04 10:31:13 +05:30
sabaimran	5bdcb4e69c	Wait for location data to be returned before setting up the socket connection	2024-04-04 10:31:13 +05:30
Debanjum Singh Solanky	00f599ea78	Fix passing flags to re.split to break org, md content by heading level `re.MULTILINE' should be passed to the `flags' argument, not the `max_splits' argument of the `re.split' func This was messing up the indexing by only allowing a maximum of re.MULTILINE splits. Fixing this improves the search quality to previous state	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	32ac0622ff	Extract dates from compiled text entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	29c1c18042	Increase search distance to get relevant content for chat post indexer update More content indexed per entry would result in an overall scores lowering effect. Increase default search distance threshold to counter that - Details - Fix expected results post indexing updates - Fix search with max distance post indexing updates - Minor - Remove openai chat actor test for after: operator as it's not expected anymore	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	ad4fa4b2f4	Fix adding file path instead of stem to markdown entries	2024-04-04 02:41:55 +05:30
sabaimran	720139c3c1	Fix all unit tests for test_text_search	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44b3247869	Update logical splitting of org-mode text into entries - Major - Do not split org file, entry if it fits within the max token limits - Recurse down org file entries, one heading level at a time until reach leaf node or the current parent tree fits context window - Update `process_single_org_file' func logic to do this recursion - Convert extracted org nodes with children into entries - Previously org node to entry code just had to handle leaf entries - Now it receives a list of org node trees - Only add ancestor path to root org-node of each tree - Indent each entry trees headings by +1 level from base level (=2) - Minor - Stop timing org-node parsing vs org-node to entry conversion Just time the wrapping function for org-mode entry extraction This standardizes what is being timed across at md, org etc. - Move try/catch to `extract_org_nodes' from `parse_single_org_file' func to standardize this also across md, org	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	eaa27ca841	Only add spaces after heading if any tags in orgnode raw entry repr	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	2ea8a832a0	Log error when fail to index md file. Fix, improve typing in md_to_entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44eab74888	Dedupe code by using single func to process an org file into entries Add type hints to orgnode and org-to-entries packages	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	db2581459f	Parse markdown parent entries as single entry if fit within max tokens These changes improve context available to the search model. Specifically this should improve entry context from short knowledge trees, that is knowledge bases with sparse, short heading/entry trees Previously we'd always split markdown files by headings, even if a parent entry was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to the search model to select appropriate entries for a query, especially from short entry trees Revert back to using regex to parse through markdown file instead of using MarkdownHeaderTextSplitter. It was easier to implement the logical split using regexes rather than bend MarkdownHeaderTextSplitter to implement it. - DFS traverse the markdown knowledge tree, prefix ancestry to each entry	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	982ac1859c	Parse markdown file as single entry if it fits with max token limits These changes improve entry context available to the search model Specifically this should improve entry context from short knowledge trees, that is knowledge bases with small files Previously we split all markdown files by their headings, even if the file was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to select the appropriate entries for a given query for the search model, especially from short knowledge trees	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	d8f01876e5	Add parent heading ancestry to extracted markdown entries for context Improve, update the markdown to entries extractor tests	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	86575b2946	Chunk text in preference order of para, sentence, word, character - Previous simplistic chunking strategy of splitting text by space didn't capture notes with newlines, no spaces. For example, see #620 - New strategy will try to chunk the text at more natural points like paragraph, sentence, word first. If none of those work it'll split at character to fit within max token limit - Drop long words while preserving original delimiters Resolves #620	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	a627f56a64	Remove unused Entry to Jsonl converter from text to entry class, tests This was earlier used when the index was plaintext jsonl file. Now that documents are indexed in a DB this func is not required. Simplify org,md,pdf,plaintext to entries tests by removing the entry to jsonl conversion step	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	28105ee027	Create wrapper function to get entries from org, md, pdf & text files - Convert extract_org_entries function to actually extract org entries Previously it was extracting intermediary org-node objects instead Now it extracts the org-node objects from files and converts them into entries - Create separate, new function to extract_org_nodes from files - Similarly create wrapper funcs for md, pdf, plaintext to entries - Update org, md, pdf, plaintext to entries tests to use the new simplified wrapper function to extract org entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	f01a12b1d2	Improve styling of chat sessions side panel - Move green server connected dot to the bottom. Show status when disconnected from server - Move "New conversation" button to right of the "Conversation" title - Center alignment of the new conversation and connection status buttons	2024-04-04 01:43:26 +05:30
sabaimran	dd1e5e145a	Use List[Any] for typing	2024-04-03 21:46:41 +05:30
sabaimran	b8087c4c8e	Add typing to empty list variables in github_to_entries	2024-04-03 21:41:36 +05:30
sabaimran	d036fdfc26	If tree is not in the contents, then just return empty files list	2024-04-03 17:55:25 +05:30
Debanjum Singh Solanky	f915b2bd14	Fix passing model_name param to chatml formatter for online chat	2024-04-03 17:21:43 +05:30
sabaimran	6aa88761b8	Skip creating the default agent if there's no default conversation config	2024-04-03 17:21:01 +05:30
sabaimran	9c42c8be6b	Merge pull request #679 from khoj-ai/features/chat-socket-streaming Add a websocket for streaming from the chat UI	2024-04-03 04:43:31 -07:00
sabaimran	b4f71e06b3	Add timeout after 10 minutes of inactivity on socket	2024-04-02 22:12:27 +05:30
sabaimran	f48426623d	resolve merge conflict in chat.html	2024-04-02 17:29:48 +05:30
sabaimran	bf1187f465	Use new online/websearch logic and add agent to chat_metadata	2024-04-02 17:20:38 +05:30
sabaimran	867e1007d1	Remove superfluous newline	2024-04-02 17:20:08 +05:30
sabaimran	228ad68042	Merge with origin/master	2024-04-02 17:02:21 +05:30
sabaimran	776550d5ce	Add a migration for updating the default chat model, update for existing users	2024-04-02 17:01:31 +05:30
sabaimran	47fc7e1ce6	Rebase with master	2024-04-02 16:16:06 +05:30
Debanjum	215ab6e66a	Extract More Dates from entries to improve Date Filter (#683) - Overview - Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year) - Extract some natural, partial dates as well from entries - Capability Add ability to extract the following additional date forms: - Natural Dates: 21st April 2000, February 29 2024 - Partial Natural Dates: March 24, Mar 2024 - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20 Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters - Performance Using regexes is MUCH faster than using the `dateparser' python library It's a little crude but gives acceptable performance for large datasets	2024-04-02 16:14:53 +05:30
Debanjum	3c3e48b18c	Migrate to Llama.cpp for Offline Chat (#680) ## Benefits - Support all GGUF format chat models - Support more GPUs like AMD, Nvidia, Mac, Vulkan (previously just Vulkan, Mac) - Support more capabilities like larger context window, schema enforcement, speculative decoding etc. ## Changes ### Major - Use llama.cpp for offline chat models - Support larger context window - Automatically apply appropriate chat template. So offline chat models not using llama2 format are now supported - Use better default offline chat model, NousResearch/Hermes-2-Pro-Mistral-7B - Enable extract queries actor to improve notes search with offline chat - Update documentation to use llama.cpp for offline chat in Khoj ### Minor - Migrate to use NousResearch's Hermes-2-Pro 7B as default offline chat model in khoj.yml - Rename GPT4AllChatProcessor to OfflineChatProcessor Config, Model - Only add location to image prompt generator when location known	2024-04-02 15:49:42 +05:30
Debanjum Singh Solanky	7afee2d55c	Let offline chat model set context window. Improve, fix prompts	2024-03-31 16:19:35 +05:30
Debanjum Singh Solanky	4228965c9b	Handle msg truncation when question is larger than max prompt size Notice and truncate the question itself at this point	2024-03-31 15:50:06 +05:30
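
Note on commit 00f599ea78 (re.split flags): the third positional parameter of Python's `re.split` is `maxsplit`, so a flag passed there is read as a split count and never applied as a flag. The sketch below reproduces that class of bug on a made-up heading pattern and sample text; both are assumptions for illustration, not the pattern Khoj uses.

```python
import re

# Hypothetical sample: three top-level org-style headings.
text = "* Heading 1\nbody 1\n* Heading 2\nbody 2\n* Heading 3\nbody 3\n"
pattern = r"^\* "  # illustrative pattern, not taken from the Khoj source

# Bug: re.MULTILINE lands in the maxsplit slot (signature is
# re.split(pattern, string, maxsplit=0, flags=0)), so it caps the number
# of splits at its integer value and "^" only matches at the very start.
buggy = re.split(pattern, text, re.MULTILINE)

# Fix: pass the flag by keyword so "^" matches at the start of every line.
fixed = re.split(pattern, text, flags=re.MULTILINE)

print(len(buggy), len(fixed))
```

With the flag in the wrong slot the text is split only at its first character, so everything after the first heading stays in one piece.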
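
Note on commit 86575b2946 (chunking order): the message describes trying paragraph, sentence, and word boundaries before falling back to characters. A minimal sketch of that preference order follows. It assumes character count as a stand-in for token count, and the function name `chunk_text` and the separator list are hypothetical; it does not implement the "drop long words" step and is not Khoj's implementation.

```python
def chunk_text(text: str, max_len: int,
               separators: tuple = ("\n\n", ". ", " ")) -> list:
    """Split text into chunks of at most max_len characters.

    Separators are tried from coarsest (paragraph) to finest (word);
    if none applies, the text is cut at fixed character offsets.
    Delimiters stay attached, so joining the chunks restores the text.
    """
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        # Re-attach the delimiter to every piece except the last one.
        pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
        chunks, current = [], ""
        for piece in pieces:
            if len(current) + len(piece) <= max_len:
                current += piece  # greedily pack pieces into one chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_len:
                current = piece
            else:
                # Piece is still too long: retry with finer separators.
                chunks.extend(chunk_text(piece, max_len, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator present: fall back to a plain character split.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]


sample = ("First para. It has two sentences.\n\n"
          "Second para is here.\n\n"
          "supercalifragilisticexpialidocious")
chunks = chunk_text(sample, max_len=20)
```

Keeping the delimiters attached means a note made only of newlines, with no spaces, is still split at its paragraph breaks rather than treated as one oversized word.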
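
Note on commit 215ab6e66a (date extraction): the message lists structured forms such as 20/12/24, 20.12.2024 and 2024/12/20 alongside the ISO YYYY-MM-DD form. The sketch below covers only those structured forms with two regexes. The patterns, the `extract_dates` name, the day-first reading of ambiguous dates, and the mapping of 2-digit years to 20xx are all assumptions for illustration, not Khoj's actual patterns.

```python
import re
from datetime import date

# Year-first: 2024-12-20, 2024/12/20, 2024.12.20 (same separator twice).
YMD = re.compile(r"\b(?P<y>\d{4})(?P<sep>[-/.])(?P<m>\d{1,2})(?P=sep)(?P<d>\d{1,2})\b")
# Day-first: 20/12/24, 20.12.2024 (2- or 4-digit year).
DMY = re.compile(r"\b(?P<d>\d{1,2})(?P<sep>[-/.])(?P<m>\d{1,2})(?P=sep)(?P<y>\d{4}|\d{2})\b")


def extract_dates(text: str) -> list:
    """Return the sorted, de-duplicated valid dates found in text."""
    found = set()
    for pattern in (YMD, DMY):
        for match in pattern.finditer(text):
            year, month, day = (int(match[k]) for k in ("y", "m", "d"))
            if year < 100:
                year += 2000  # assumption: 2-digit years mean 20xx
            try:
                found.add(date(year, month, day))
            except ValueError:
                continue  # skip impossible dates such as 31/02/2024
    return sorted(found)


dates = extract_dates(
    "Met on 20/12/24, again 20.12.2024 and 2024/12/20; due 2024-04-02, not 31/02/2024"
)
```

Validating each match through `datetime.date` keeps the regexes simple while still rejecting matches that are not real calendar dates.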


2688 commits