sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-27 17:35:07 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	5673bd5b96	Keep original formatting in compiled text entry strings - Explicitly split entry string by space during split by max_tokens - Prevent formatting of compiled entry from being lost - The formatting itself contains useful information No point in dropping the formatting unnecessarily, even if (say) the current search models don't account for it (yet)	2023-03-30 14:02:46 +07:00
Debanjum Singh Solanky	a2ab68a7a2	Include filename of markdown entries for search indexing Append originating filename to compiled string of each entry for better search quality by providing more context to model Update markdown_to_jsonl tests to ensure filename being added Resolves #142	2023-03-30 13:51:36 +07:00
Debanjum Singh Solanky	7e36f421f9	Truncate message logs to below max supported prompt size by model - Use tiktoken to count tokens for chat models - Make conversation turns to add to prompt configurable via method argument to generate_chatml_messages_with_context method	2023-03-25 05:13:56 +07:00
Debanjum Singh Solanky	508b2176b7	Update Chat API, Logs, Interfaces to store, use references as list - Remove the need to split by magic string in emacs and chat interfaces - Move compiling references into string as context for GPT to GPT layer - Update setup in tests to use new style of setting references - Name first argument to converse as more appropriate "references"	2023-03-24 22:10:11 +07:00
Debanjum	b351cfb8a0	Add Search Actor to Improve Querying Notes for Khoj Chat Merge pull request #189 from debanjum/add-search-actor-to-improve-notes-lookup-for-chat ### Introduce Search Actor Search actor infers Search Queries from user's message - Capabilities - Use previous messages to add context to current search queries[^1] This improves quality of responses in multi-turn conversations. - Deconstruct user's message into multiple search queries to lookup notes[^2] - Use relative date awareness to add date filters to search queries[^3] - Chat Director now does the following: 1. [NEW] Use Search Actor to generate search queries from user's message 2. Retrieve relevant notes from Knowledge Base using the Search queries 3. Pass retrieved relevant notes to Chat Actor to respond to user ### Add Chat Quality Tests - Test Search Actor capabilities - Mark Chat Director Tests for Relative Date, Multiple Search Queries as Expected Pass ### Give More Search Results as Context to Chat Actor - Loosen search results score threshold to work better for searches with date filters - Pass more search results (up to 5 from 2) as context to Chat Actor to improve inference [^1]: Multi-Turn Example Q: "When did I go to Mars?" Search: "When did I go to Mars?" A: "You went to Mars in the future" Q: "How was that experience?" Search: "How my Mars experience?" This gives better context for the Chat actor to respond [^2]: Deconstruct Example: Is Alpha older than Beta? => What is Alpha's age? & When was Beta born? [^3]: Date Example: Convert user messages containing relative dates like last month, yesterday to date filters on specific dates like dt>="2023-03-01"	2023-03-18 18:02:12 -06:00
Debanjum Singh Solanky	08f5fb315f	Add answers to context for Search Actor to generate relevant queries Update Search Actor prompt with answers, more precise primer and two more examples for context Mark the 3 chat quality tests using answer as context to generate queries as expected to pass. Verify that the 3 tests pass now, unlike before when the Search Actor did not have the answers for context	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	f09bdd515b	Expect Chat Director can extract relative dates using new Search Actor	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	36c7389b46	Test Search Actor generating search query from Chat History	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	2600cc9d4d	Test Search Actor extracting relative dates & multiple questions	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	d0f14d3f85	Test usage of = in date filter queries	2023-03-16 14:52:59 -06:00
Debanjum Singh Solanky	dfb277ee37	Set skipif at module level if OpenAI API key not set for chat tests - Remove stale message_to_prompt test It is too broad, reduces maintainability. Remove as it doesn't really need its own test right now - Setting skipif at module level for chat actor, director tests reduces code duplication as earlier was using decorator on each chat test	2023-03-16 12:23:52 -06:00
Debanjum Singh Solanky	4e15b4e411	Create test notes dataset for chat testing Combine hand-written custom notes and PG essays with personal content to bulk up notes count Delete old documentation markdown as not a representative dataset for application (which is more tuned for personal notes)	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	1b4d562700	Test Chat Director Capabilities: Answer from notes, chat history etc - Chat directors are broad agents. - Chat directors orchestrate narrow actor agents to synthesize final response for the user - Agents are Prompts + ML Model - Test Chat Director Capabilities 1. [X] Answer from retrieved notes 2. [X] Answer from chat history 3. [X] Answer general questions 4. [X] Carry out multi-turn conversation 5. [X] Say don't know when answer not in provided context 6. [X] Answers that require current date awareness This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality 7. [X] Date-aware aggregation across multiple different notes This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality 8. [X] Ask clarification questions if no unambiguous answer in provided context 9. [X] Retrieve answer from chat history beyond lookback window This test is expected to fail as the chat director is not capable of searching chat history yet. But the test allows assessing chat quality 10. [X] Retrieve context for answer using multiple independent searches on knowledge base This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	b6d63137f1	Setup Pytest fixture for conversation processor to test chat API - Index markdown test data as knowledge base. As easier to get good markdown content (vs org) - Setup markdown_content_config, processor_config and chat_client to test chat API	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	3f719c9e17	Rename Chat Model+Prompt tests to chat actor tests	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	7526a50dd4	Extract conversation processor utility funcs from gpt.py into utils.py	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	7c4d546039	Configure tests to mark chat quality tests & filter unhelpful warnings - Mark chat quality tests, register custom mark for chat quality - Filter unhelpful deprecation warnings from within dateparser library - Error if tests use unregistered marks	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	c1128a1ad8	Test Chat Actor Capabilities; ability to answer from notes, chat logs etc - Chat actors are narrow agents (prompt + ML model) Chat actors are different from the Chat director, who orchestrates the narrow actor agents to synthesize final response to the user - Test Chat Actor Capabilities 1. Answer from retrieved notes 2. Answer from chat history 3. Answer general questions 4. Carry out multi-turn conversation 5. Say don't know when answer not in provided context 6. Answers that require current date awareness 7. Date-aware aggregation across multiple different notes 8. Ask clarification questions if no unambiguous answer in provided context This test is expected to fail as the chat is not capable of doing this consistently yet. But having the test allows assessing chat quality - Use Openai API Key from OPENAI_API_KEY environment variable - Gitignore .env file, python virtualenv directory Put OpenAI API Key in .env file to run chatbot tests via vscode The .env file is default location for importing env vars	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	9306cd901a	Clean up chat tests to work with updated chat methods in gpt.py	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	b6cdc5c7cb	Do not expose answer API as a chat type in chat web interface or API Answer does not rely on past conversations, just the knowledge base. It is meant for one off interactions, like search rather than a continuing conversation like chat For now it is only exposed via API. Later it will be exposed in the interfaces as well Remove ability to select different chat types from the chat web interface as there is only a single chat type Stop appending answers to the conversation logs	2023-03-05 18:21:59 -06:00
Debanjum Singh Solanky	211e460398	Output date filter from cache log at debug level. Remove unused imports Other logs not directly useful to user have already been converted to debug log levels in `1ae4016`. Just forgot to convert this log line too	2023-03-02 15:41:32 -06:00
Debanjum Singh Solanky	c823f46d89	Test error on missing fields in ContentConfig pulled from Khoj.yml Resolves #9	2023-03-02 15:35:39 -06:00
Debanjum Singh Solanky	fe03ba3dce	Index intro text before headings in org files - Text before headings was not being indexed due to buggy orgnode parsing logic - Resolved indexing intro text from files with and without headings in them - Ensure intro text node has heading set to all title lines collected from the file Resolves #165	2023-03-01 12:11:33 -06:00
Debanjum Singh Solanky	2bed4c3b50	Fix configuring search types & /config/types API when no plugin configured - Test /config/types API when no plugin configured, only plugin configured and no content configured scenarios - Do not throw null reference exception while configuring search types when no plugin configured - Do not throw null reference exception on calling /config/types API when no plugin configured Resolves bug introduced by #173	2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky	b09350c052	Fix to return only enabled content types via the new config/types API - Previously was returning all core content types even if they had not been setup - Add test to validate only configured content types are returned by the api/config/types API endpoint	2023-02-28 22:08:26 -06:00
Debanjum Singh Solanky	ede6eb6879	Re-enable testing search and update API with image content type It may have been disabled due to issues with image search earlier	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	88a9eadfba	Use client pytest fixture to test API with plugin type configured	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	ab501a56c9	Create pytest fixture to configure app with plugin, search types	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	f944408e69	Update content_config pytest fixture to index plugin content	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	68bd5d9ebc	Configure API routes after set up search types while configuring server Configure app routes after configuring server. Import API routers after search type is dynamically populated. Allow API to recognize the dynamically populated plugin search types as valid type query param. Enable searching for plugin type content.	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	55a032e8c4	Add processor to index entries from jsonl files for plugins - Read, merge entries from input jsonl files and filters - Mark new, modified entries for update	2023-02-24 02:54:12 -06:00
Debanjum Singh Solanky	fcbbe8c759	Read content plugin configs from Khoj config YAML Configure external text content plugins via the Khoj YAML Reuse existing TextContentConfig definition for external text content plugins	2023-02-23 23:57:32 -06:00
Debanjum Singh Solanky	47569da38e	Fix usage of "\" in orgnode test string to resolve DeprecationWarning	2023-02-17 17:15:44 -06:00
Debanjum Singh Solanky	051f0e3fb5	Add, configure and run pre-commit locally and in test workflow	2023-02-17 13:31:36 -06:00
Debanjum Singh Solanky	5e83baab21	Use Black to format Khoj server code and tests	2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky	af6d65a909	Create tagged Docker image on new tag/release	2023-02-14 20:04:06 -06:00
Debanjum Singh Solanky	bc7477ea3e	Move Emacs, Obsidian plugin code out from under src/khoj directory - What - The Emacs and Obsidian interfaces stay in their original directories under src/ - src/khoj now only contains code meant for pypi packaging - Benefits - This avoids having to update khoj MELPA, Obsidian plugin config as the Emacs, Obsidian code is under their original directories - It separates the code in src/khoj meant for python packaging from code for external interfaces like Emacs and Obsidian	2023-02-14 15:44:22 -06:00
Debanjum Singh Solanky	25a749ca1d	Use the src/ layout to fix packaging Khoj for PyPi - Why The khoj pypi packages should be installed in `khoj' directory. Previously it was being installed into `src' directory, which is a generic top level directory name that is discouraged from being used - Changes - move src/* to src/khoj/* - update `setup.py' to `find_packages' in `src' instead of project root - rename imports to form `from khoj.*' in complete project - update `constants.web_directory' path to use `khoj' directory - rename root logger to `khoj' in `main.py' - fix image_search tests to use the newly renamed `khoj' logger - update config, docs, workflows to reference new path `src/khoj'	2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky	6908b6eed3	Truncate image queries below max tokens length supported by ML model This would previously return the infamous tensor size mismatch error Verify this error is not raised since adding the query truncation logic	2023-01-21 14:11:00 -03:00
Debanjum Singh Solanky	3d9ed91e42	Search by image at path only if query of form "file:/path/to/image" Previously no query syntax helpers, like the "file:" prefix, were used before checking if query contains file path. This made query to image search brittle to misinterpretation and pointless checking Add test to verify search by image at file works as expected	2023-01-21 14:06:56 -03:00
Debanjum Singh Solanky	7b4f78776c	Fix extracting Markdown Entries with Top Level Headings - Previously top level headings would get stripped of the space between heading text and the prefix # symbols. That is, `# Top Level Heading' would get converted to `#Top Level Heading' - This would mess up their rendering as a heading in search results - Add unit tests to text_to_jsonl processors to prevent regression	2023-01-17 13:06:28 -03:00
Debanjum Singh Solanky	d40076fcd6	Deduplicate test code, make teardown more robust using pytest fixtures	2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky	237123d18c	Fix tests for the conversation processor - Use latest davinci model for tests - Wrap prompt in triple quotes to improve legibility - `understand' method returns dictionary instead of string. Fix its test - Fix prompt for new model to pass `chat_with_history' test	2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky	826f9dc054	Drop long words from compiled entries to be within max token limit of models Long words (>500 characters) provide less useful context to models. Dropping very long words allows models to create better embeddings by passing more of the useful context from the entry to the model	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	24676f95d8	Fix comments, use minimal test case, regenerate test index, merge debug logs - Remove property drawer from test entry for max_words splitting test - Property drawer is not required for the test - Keep minimal test case to reduce chance for confusion	2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky	53cd2e5605	Regenerate initial model in asymmetric reload test to reduce flakiness - Fix logger message when converting org node to entries - Remove unused import from conftest	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	c79919bd68	Split entries by max tokens while converting Org entries To JSONL - Test usage of the entry splitting by max tokens in text search	2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky	e057c8e208	Add method to split entries by specified max tokens limit - Issue ML Models truncate entries exceeding some max token limit. This lowers the quality of search results - Fix Split entries by max tokens before indexing. This should improve searching for content in longer entries. - Miscellaneous - Test method to split entries by max tokens	2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky	d292bdcc11	Do not version API. Premature given current state of the codebase - Reason - All clients that currently consume the API are part of Khoj - Any breaking API changes will be fixed in clients immediately - So decoupling client from API is not required - This removes the burden of maintaining multiple versions of the API	2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky	2c548133f3	Remove unused imports, `embeddings' variable from text search tests	2022-10-08 12:06:05 +03:00


149 commits
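
Commits e057c8e208, 826f9dc054 and 5673bd5b96 together describe splitting entries by a max token limit, dropping words longer than 500 characters, and splitting explicitly on spaces so formatting is kept. The following is a minimal sketch of that idea, not the code in the repository: the function name `split_entry_by_max_tokens`, its parameters and the default `max_tokens=256` are assumptions for illustration, and a "token" here is simply a space-separated word.

```python
def split_entry_by_max_tokens(
    text: str, max_tokens: int = 256, max_word_length: int = 500
) -> list[str]:
    """Split text into chunks of at most max_tokens space-separated words.

    Hypothetical helper: name, signature and default max_tokens are
    assumptions, not taken from the Khoj codebase.
    """
    # Split on the space character only (not on all whitespace), so newlines
    # and tabs stay attached to their words and the formatting is preserved.
    words = [word for word in text.split(" ") if len(word) <= max_word_length]

    # Rejoin consecutive runs of max_tokens words into chunks.
    return [
        " ".join(words[start : start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
```

Calling `split_entry_by_max_tokens("a b\nc d e", max_tokens=2)` keeps the newline inside the first chunk, which `str.split()` with no argument would have discarded.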