sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-18 02:27:10 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	67b1178aec	Remove debug logs generated while compiling org-mode entries	2024-04-08 13:01:24 +05:30
Debanjum Singh Solanky	00f599ea78	Fix passing flags to re.split to break org, md content by heading level `re.MULTILINE' should be passed to the `flags' argument, not the `max_splits' argument of the `re.split' func This was messing up the indexing by only allowing a maximum of re.MULTILINE splits. Fixing this improves the search quality to previous state	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	32ac0622ff	Extract dates from compiled text entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	29c1c18042	Increase search distance to get relevant content for chat post indexer update More content indexed per entry would result in an overall scores lowering effect. Increase default search distance threshold to counter that - Details - Fix expected results post indexing updates - Fix search with max distance post indexing updates - Minor - Remove openai chat actor test for after: operator as it's not expected anymore	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	ad4fa4b2f4	Fix adding file path instead of stem to markdown entries	2024-04-04 02:41:55 +05:30
sabaimran	720139c3c1	Fix all unit tests for test_text_search	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44b3247869	Update logical splitting of org-mode text into entries - Major - Do not split org file, entry if it fits within the max token limits - Recurse down org file entries, one heading level at a time until reach leaf node or the current parent tree fits context window - Update `process_single_org_file' func logic to do this recursion - Convert extracted org nodes with children into entries - Previously org node to entry code just had to handle leaf entries - Now it receives a list of org node trees - Only add ancestor path to root org-node of each tree - Indent each entry trees headings by +1 level from base level (=2) - Minor - Stop timing org-node parsing vs org-node to entry conversion Just time the wrapping function for org-mode entry extraction This standardizes what is being timed across at md, org etc. - Move try/catch to `extract_org_nodes' from `parse_single_org_file' func to standardize this also across md, org	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	eaa27ca841	Only add spaces after heading if any tags in orgnode raw entry repr	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	2ea8a832a0	Log error when fail to index md file. Fix, improve typing in md_to_entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	44eab74888	Dedupe code by using single func to process an org file into entries Add type hints to orgnode and org-to-entries packages	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	db2581459f	Parse markdown parent entries as single entry if fit within max tokens These changes improve context available to the search model. Specifically this should improve entry context from short knowledge trees, that is knowledge bases with sparse, short heading/entry trees Previously we'd always split markdown files by headings, even if a parent entry was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to the search model to select appropriate entries for a query, especially from short entry trees Revert back to using regex to parse through markdown file instead of using MarkdownHeaderTextSplitter. It was easier to implement the logical split using regexes rather than bend MarkdownHeaderTextSplitter to implement it. - DFS traverse the markdown knowledge tree, prefix ancestry to each entry	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	982ac1859c	Parse markdown file as single entry if it fits with max token limits These changes improve entry context available to the search model Specifically this should improve entry context from short knowledge trees, that is knowledge bases with small files Previously we split all markdown files by their headings, even if the file was small enough to fit entirely within the max token limits of the search model. This used to reduce the context available to select the appropriate entries for a given query for the search model, especially from short knowledge trees	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	d8f01876e5	Add parent heading ancestry to extracted markdown entries for context Improve, update the markdown to entries extractor tests	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	86575b2946	Chunk text in preference order of para, sentence, word, character - Previous simplistic chunking strategy of splitting text by space didn't capture notes with newlines, no spaces. For e.g in #620 - New strategy will try chunk the text at more natural points like paragraph, sentence, word first. If none of those work it'll split at character to fit within max token limit - Drop long words while preserving original delimiters Resolves #620	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	a627f56a64	Remove unused Entry to Jsonl converter from text to entry class, tests This was earlier used when the index was plaintext jsonl file. Now that documents are indexed in a DB this func is not required. Simplify org,md,pdf,plaintext to entries tests by removing the entry to jsonl conversion step	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	28105ee027	Create wrapper function to get entries from org, md, pdf & text files - Convert extract_org_entries function to actually extract org entries Previously it was extracting intermediary org-node objects instead Now it extracts the org-node objects from files and converts them into entries - Create separate, new function to extract_org_nodes from files - Similarly create wrapper funcs for md, pdf, plaintext to entries - Update org, md, pdf, plaintext to entries tests to use the new simplified wrapper function to extract org entries	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	f01a12b1d2	Improve styling of chat sessions side panel - Move green server connected dot to the bottom. Show status when disconnected from server - Move "New conversation" button to right of the "Conversation" title - Center alignment of the new conversation and connection status buttons	2024-04-04 01:43:26 +05:30
sabaimran	dd1e5e145a	Use List[Any] for typing	2024-04-03 21:46:41 +05:30
sabaimran	b8087c4c8e	Add typing to empty list variables in github_to_entries	2024-04-03 21:41:36 +05:30
sabaimran	d036fdfc26	If tree is not in the contents, then just return empty files list	2024-04-03 17:55:25 +05:30
Debanjum Singh Solanky	f915b2bd14	Fix passing model_name param to chatml formatter for online chat	2024-04-03 17:21:43 +05:30
sabaimran	6aa88761b8	Skip creating the default agent if there's no default conversation config	2024-04-03 17:21:01 +05:30
sabaimran	9c42c8be6b	Merge pull request #679 from khoj-ai/features/chat-socket-streaming Add a websocket for streaming from the chat UI	2024-04-03 04:43:31 -07:00
sabaimran	b4f71e06b3	Add timeout after 10 minutes of inactivity on socket	2024-04-02 22:12:27 +05:30
sabaimran	f48426623d	resolve merge conflict in chat.html	2024-04-02 17:29:48 +05:30
sabaimran	bf1187f465	Use new online/websearch logic and add agent to chat_metadata	2024-04-02 17:20:38 +05:30
sabaimran	867e1007d1	Remove superfluous newline	2024-04-02 17:20:08 +05:30
sabaimran	228ad68042	Merge with origin/master	2024-04-02 17:02:21 +05:30
sabaimran	776550d5ce	Add a migration for updating the default chat model, update for existing users	2024-04-02 17:01:31 +05:30
sabaimran	47fc7e1ce6	Rebase with master	2024-04-02 16:16:06 +05:30
Debanjum	215ab6e66a	Extract More Dates from entries to improve Date Filter (#683) - Overview - Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year) - Extract some natural, partial dates as well from entries - Capability Add ability to extract the following additional date forms: - Natural Dates: 21st April 2000, February 29 2024 - Partial Natural Dates: March 24, Mar 2024 - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20 Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters - Performance Using regexes is MUCH faster than using the `dateparser' python library It's a little crude but gives acceptable performance for large datasets	2024-04-02 16:14:53 +05:30
Debanjum	3c3e48b18c	Migrate to Llama.cpp for Offline Chat (#680) ## Benefits - Support all GGUF format chat models - Support more GPUs like AMD, Nvidia, Mac, Vulkan (previously just Vulkan, Mac) - Support more capabilities like larger context window, schema enforcement, speculative decoding etc. ## Changes ### Major - Use llama.cpp for offline chat models - Support larger context window - Automatically apply appropriate chat template. So offline chat models not using llama2 format are now supported - Use better default offline chat model, NousResearch/Hermes-2-Pro-Mistral-7B - Enable extract queries actor to improve notes search with offline chat - Update documentation to use llama.cpp for offline chat in Khoj ### Minor - Migrate to use NousResearch's Hermes-2-Pro 7B as default offline chat model in khoj.yml - Rename GPT4AllChatProcessor to OfflineChatProcessor Config, Model - Only add location to image prompt generator when location known	2024-04-02 15:49:42 +05:30
Debanjum Singh Solanky	7afee2d55c	Let offline chat model set context window. Improve, fix prompts	2024-03-31 16:19:35 +05:30
Debanjum Singh Solanky	4228965c9b	Handle msg truncation when question is larger than max prompt size Notice and truncate the question itself at this point	2024-03-31 15:50:06 +05:30
Debanjum Singh Solanky	c6487f2e48	Fix docs showing how to setup llama-cpp with Khoj	2024-03-31 15:36:40 +05:30
Debanjum Singh Solanky	886d49e3a4	Merge branch 'master' into migrate-to-llama-cpp-for-offline-chat	2024-03-31 00:59:20 +05:30
Debanjum Singh Solanky	4f65dde201	Release Khoj version 1.8.0	2024-03-31 00:06:15 +05:30
sabaimran	c0e78fd56d	Fix broken get-started documentation links	2024-03-30 15:05:12 +05:30
sabaimran	dd2a3f712b	Add more demo videos, images, add feature sections	2024-03-30 14:48:46 +05:30
sabaimran	4cb91a042e	Add an agents feature page, and clarification around custom domains	2024-03-30 14:20:46 +05:30
sabaimran	928f273bbe	Configure production setup for moving to single worker model	2024-03-30 10:35:55 +05:30
Debanjum Singh Solanky	7923903d21	Improve date filter regexes to extract structured, natural, partial dates - Much faster than using dateparser - It took 2x-4x for improved regex to extract 1-15% more dates - Whereas it took 33x to 100x for dateparser to extract 65% - 400% more dates - Improve date extractor tests to test deduping dates, natural, structured date extraction from content - Extract some natural, partial dates and more structured dates Using regex is much faster than using dateparser. It's a little crude but should pay off in performance. Supports dates of form: - (Day-of-Month) Month|AbbreviatedMonth Year|2DigitYear - Month|AbbreviatedMonth (Day-of-Month) Year|2DigitYear	2024-03-30 00:07:19 +05:30
Debanjum Singh Solanky	104eeea274	Extract natural language and locale specific dates in content Previously we just extracted dates in YYYY-MM-DD format from content for date filterings during search. Use dateparser to extract dates across locales and natural language This should improve notes returned as context when chat searches knowledge base with date filters Fallback to regex for date parsing from content if dateparser fails - Limit natural date extractor capabilities to improve performance - Assume language is english Language detection otherwise takes a REALLY long time - Do not extract unix timestamps, timezone - This isn't required, as just using date and approximating dates as UTC	2024-03-30 00:06:56 +05:30
Debanjum Singh Solanky	90c5b3c410	Update stale Khoj pypi package metadata Use latest License, Intended Audience and Dev Status	2024-03-29 00:06:55 +05:30
sabaimran	1195f843a3	Remove forward slash from the root agents endpoint	2024-03-28 23:06:55 +05:30
Debanjum Singh Solanky	a374288cea	Use OIDC TrustedPublisher to publish khoj python package to PyPi	2024-03-28 22:58:36 +05:30
sabaimran	3417164ec2	Bump gunicorn workers up to 8	2024-03-28 22:34:13 +05:30
sabaimran	a1729b9b9e	Add telemetry for agents used in conversation, increase image width in agents page	2024-03-28 22:18:11 +05:30
sabaimran	d503b3e867	Use Personality vernacular in agent page - When setting up the default agent, configure every conversation that doesn't have an agent to use the Khoj agent - Fix reverse migration for the locale removal migration	2024-03-28 15:07:02 +05:30
sabaimran	e59de8c9b1	Constrain width/size of agent image in agents view	2024-03-28 13:32:11 +05:30
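
Note on commit 00f599ea78: the pitfall it describes comes from the signature `re.split(pattern, string, maxsplit=0, flags=0)`, where the third positional argument is the split limit, not the flags. The sketch below is not Khoj's code; the sample text and heading pattern are made up for illustration. The buggy call is written in keyword form so it is clear which parameter receives the value.

```python
import re

# Hypothetical org-style sample with three top-level headings.
text = "* Heading 1\nbody 1\n* Heading 2\nbody 2\n* Heading 3\nbody 3\n"
pattern = r"^\* "

# Buggy form: re.split(pattern, text, re.MULTILINE)
# The flag lands in `maxsplit`, so it is equivalent to the call below:
# at most int(re.MULTILINE) splits, and no MULTILINE behaviour at all.
buggy = re.split(pattern, text, maxsplit=int(re.MULTILINE))

# Fixed form: pass the flag by keyword so `^` matches at every line start.
fixed = re.split(pattern, text, flags=re.MULTILINE)

print(int(re.MULTILINE))  # the accidental split limit
print(len(buggy), len(fixed))
```

Without the flag, `^` only matches at the very start of the string, so the buggy call never splits on the later headings; with `flags=re.MULTILINE` each heading starts its own piece.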


2472 commits
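
Note on commit 86575b2946: the chunking order it describes (paragraph, then sentence, then word, then character) can be sketched as a recursive splitter. This is a minimal illustration, not the project's implementation. The function name `chunk_text`, the separator list, and the use of character counts in place of token counts are all assumptions. Unlike the commit, this sketch does not drop long words; it hard-splits them so the original text can be reassembled.

```python
from typing import List

# Assumed separators, tried in order: paragraph, sentence, word, character.
SEPARATORS = ["\n\n", ". ", " ", ""]


def chunk_text(text: str, max_len: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_len characters,
    preferring the coarsest separator that works."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard split at character boundaries.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the delimiter so joining the chunks restores the input.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > max_len:
            # Piece is too big on its own: flush, then try finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(chunk_text(piece, max_len, rest))
        elif len(current) + len(piece) <= max_len:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First para. Second sentence here.\n\nNolongspacesinthisverylongword end"
chunks = chunk_text(sample, 16)
print(chunks)
```

Keeping the delimiters attached to each piece means `"".join(chunks)` reproduces the input, which makes the splitter easy to check.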