sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
Saba	751edfefe5	Add separate unit test for github. Will only run if a PAT token is set	2023-06-13 16:55:58 -07:00
Saba	3a61919344	Fix failing unit tests by hard-coding model presence of expected search types	2023-06-13 16:32:47 -07:00
Saba	019d3732de	Rename orgmode_search to org_search	2023-06-13 16:06:54 -07:00
Saba	5d5ebcbf7c	Rename truncate messages method and update unit tests to simplify assertion logic	2023-06-06 23:25:43 -07:00
Saba	7119ed0849	Run pre-commit script	2023-06-05 19:29:23 -07:00
Saba	948ba6ddca	Remove unused logger	2023-06-05 19:01:03 -07:00
Saba	f65ff9815d	Move message truncation logic into a separate function. Add unit tests with factory boy.	2023-06-05 18:58:29 -07:00
Debanjum Singh Solanky	acd14a5e41	Wire up PDF to jsonl processor to Khoj server layer (API, config) - Specify PDF content to index via khoj.yml - Index PDF content on app start, reconfigure - Expose PDF as a search type via API	2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky	d63194c3a9	Create tests for PDF to JSONL processor	2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky	131b8407b5	Allow Khoj Chat to respond to general queries not in reference notes - Khoj chat will now respond to general queries if: 1. no relevant reference notes available or 2. when explicitly induced by prefixing the chat message with "@general" - Previously Khoj Chat would often refuse to respond to general queries not answerable from reference notes or chat history - Make chat quality tests more robust - Add more equivalent chat response options refusing to answer - Force haiku writing to not give any preamble, just the haiku	2023-05-12 18:42:40 +08:00
Debanjum Singh Solanky	cc75f986b2	Test text search index only updates on changes to text content	2023-05-12 17:37:34 +08:00
Debanjum Singh Solanky	02aeee60aa	Set filename as top heading of org entries for better search context Previously filename was only being appended to markdown entries. Test filename getting prepended to compiled entry as heading	2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky	5de04621b5	Set filename as top heading of md entries for better search context Previously filename was appended to the end of the compiled entry. This didn't provide appropriate structured context Test filename getting prepended as heading to compiled entry	2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky	0e3fb59e09	Entries with no md headings should not get heading prefix prepended Files with no headings would previously get their entry be prefixed with a markdown heading prefix (#)	2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky	45a991d75c	Prepend entry heading to all compiled org snippets to improve search context All compiled snippets split by max tokens (apart from first) do not get the heading as context. This limits search context required to retrieve these continuation entries	2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky	5673bd5b96	Keep original formatting in compiled text entry strings - Explicitly split entry string by space during split by max_tokens - Prevent formatting of compiled entry from being lost - The formatting itself contains useful information No point in dropping the formatting unnecessarily, even if (say) the current search models don't account for it (yet)	2023-03-30 14:02:46 +07:00
Debanjum Singh Solanky	a2ab68a7a2	Include filename of markdown entries for search indexing Append originating filename to compiled string of each entry for better search quality by providing more context to model Update markdown_to_jsonl tests to ensure filename being added Resolves #142	2023-03-30 13:51:36 +07:00
Debanjum Singh Solanky	7e36f421f9	Truncate message logs to below max supported prompt size by model - Use tiktoken to count tokens for chat models - Make conversation turns to add to prompt configurable via method argument to generate_chatml_messages_with_context method	2023-03-25 05:13:56 +07:00
Debanjum Singh Solanky	508b2176b7	Update Chat API, Logs, Interfaces to store, use references as list - Remove the need to split by magic string in emacs and chat interfaces - Move compiling references into string as context for GPT to GPT layer - Update setup in tests to use new style of setting references - Name first argument to converse as more appropriate "references"	2023-03-24 22:10:11 +07:00
Debanjum	b351cfb8a0	Add Search Actor to Improve Querying Notes for Khoj Chat Merge pull request #189 from debanjum/add-search-actor-to-improve-notes-lookup-for-chat ### Introduce Search Actor Search actor infers Search Queries from user's message - Capabilities - Use previous messages to add context to current search queries[^1] This improves quality of responses in multi-turn conversations. - Deconstruct user's message into multiple search queries to lookup notes[^2] - Use relative date awareness to add date filters to search queries[^3] - Chat Director now does the following: 1. [NEW] Use Search Actor to generate search queries from user's message 2. Retrieve relevant notes from Knowledge Base using the Search queries 3. Pass retrieved relevant notes to Chat Actor to respond to user ### Add Chat Quality Tests - Test Search Actor capabilities - Mark Chat Director Tests for Relative Date, Multiple Search Queries as Expected Pass ### Give More Search Results as Context to Chat Actor - Loosen search results score threshold to work better for searches with date filters - Pass more search results (up to 5 from 2) as context to Chat Actor to improve inference [^1]: Multi-Turn Example Q: "When did I go to Mars?" Search: "When did I go to Mars?" A: "You went to Mars in the future" Q: "How was that experience?" Search: "How my Mars experience?" This gives better context for the Chat actor to respond [^2]: Deconstruct Example: Is Alpha older than Beta? => What is Alpha's age? & When was Beta born? [^3]: Date Example: Convert user messages containing relative dates like last month, yesterday to date filters on specific dates like dt>="2023-03-01"	2023-03-18 18:02:12 -06:00
Debanjum Singh Solanky	08f5fb315f	Add answers to context for Search Actor to generate relevant queries Update Search Actor prompt with answers, more precise primer and two more examples for context Mark the 3 chat quality tests using answer as context to generate queries as expected to pass. Verify that the 3 tests pass now, unlike before when the Search Actor did not have the answers for context	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	f09bdd515b	Expect Chat Director can extract relative dates using new Search Actor	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	36c7389b46	Test Search Actor generating search query from Chat History	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	2600cc9d4d	Test Search Actor extracting relative dates & multiple questions	2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky	d0f14d3f85	Test usage of = in date filter queries	2023-03-16 14:52:59 -06:00
Debanjum Singh Solanky	dfb277ee37	Set skipif at module level if OpenAI API key not set for chat tests - Remove stale message_to_prompt test It is too broad, reduces maintainability. Remove as it doesn't really need its own test right now - Setting skipif at module level for chat actor, director tests reduces code duplication as earlier was using decorator on each chat test	2023-03-16 12:23:52 -06:00
Debanjum Singh Solanky	4e15b4e411	Create test notes dataset for chat testing Combine hand-written custom notes and PG essays with personal content to bulk up notes count Delete old documentation markdown as not a representative dataset for application (which is more tuned for personal notes)	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	1b4d562700	Test Chat Director Capabilities: Answer from notes, chat history etc - Chat directors are broad agents. - Chat directors orchestrate narrow actor agents to synthesize final response for the user - Agents are Prompts + ML Model - Test Chat Director Capabilities 1. [X] Answer from retrieved notes 2. [X] Answer from chat history 3. [X] Answer general questions 4. [X] Carry out multi-turn conversation 5. [X] Say don't know when answer not in provided context 6. [X] Answers that require current date awareness This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality 7. [X] Date-aware aggregation across multiple different notes This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality 8. [X] Ask clarification questions if no unambiguous answer in provided context 9. [X] Retrieve answer from chat history beyond lookback window This test is expected to fail as the chat director is not capable of searching chat history yet. But the test allows assessing chat quality 10. [X] Retrieve context for answer using multiple independent searches on knowledge base This test is expected to fail as the chat is not capable of doing this without the Search actor. But the test allows assessing chat quality	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	b6d63137f1	Setup Pytest fixture for conversation processor to test chat API - Index markdown test data as knowledge base. As easier to get good markdown content (vs org) - Setup markdown_content_config, processor_config and chat_client to test chat API	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	3f719c9e17	Rename Chat Model+Prompt tests to chat actor tests	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	7526a50dd4	Extract conversation processor utility funcs from gpt.py into utils.py	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	7c4d546039	Configure tests to mark chat quality tests & filter unhelpful warnings - Mark chat quality tests, register custom mark for chat quality - Filter unhelpful deprecation warnings from within dateparser library - Error if tests use unregistered marks	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	c1128a1ad8	Test Chat Actor Capabilities; ability to answer from notes, chat logs etc - Chat actors are narrow agents (prompt + ML model) Chat actors are different from the Chat director, which orchestrates the narrow actor agents to synthesize final response to the user - Test Chat Actor Capabilities 1. Answer from retrieved notes 2. Answer from chat history 3. Answer general questions 4. Carry out multi-turn conversation 5. Say don't know when answer not in provided context 6. Answers that require current date awareness 7. Date-aware aggregation across multiple different notes 8. Ask clarification questions if no unambiguous answer in provided context This test is expected to fail as the chat is not capable of doing this consistently yet. But having the test allows assessing chat quality - Use OpenAI API Key from OPENAI_API_KEY environment variable - Gitignore .env file, python virtualenv directory Put OpenAI API Key in .env file to run chatbot tests via vscode The .env file is default location for importing env vars	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	9306cd901a	Clean up chat tests to work with updated chat methods in gpt.py	2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky	b6cdc5c7cb	Do not expose answer API as a chat type in chat web interface or API Answer does not rely on past conversations, just the knowledge base. It is meant for one off interactions, like search rather than a continuing conversation like chat For now it is only exposed via API. Later it will be exposed in the interfaces as well Remove ability to select different chat types from the chat web interface as there is only a single chat type Stop appending answers to the conversation logs	2023-03-05 18:21:59 -06:00
Debanjum Singh Solanky	211e460398	Output date filter from cache log at debug level. Remove unused imports Other logs not directly useful to user have already been converted to debug log levels in `1ae4016`. Just forgot to convert this log line too	2023-03-02 15:41:32 -06:00
Debanjum Singh Solanky	c823f46d89	Test error on missing fields in ContentConfig pulled from Khoj.yml Resolves #9	2023-03-02 15:35:39 -06:00
Debanjum Singh Solanky	fe03ba3dce	Index intro text before headings in org files - Text before headings was not being indexed due to buggy orgnode parsing logic - Resolved indexing intro text from files with and without headings in them - Ensure intro text node has heading set to all title lines collected from the file Resolves #165	2023-03-01 12:11:33 -06:00
Debanjum Singh Solanky	2bed4c3b50	Fix configuring search types & /config/types API when no plugin configured - Test /config/types API when no plugin configured, only plugin configured and no content configured scenarios - Do not throw null reference exception while configuring search types when no plugin configured - Do not throw null reference exception on calling /config/types API when no plugin configured Resolves bug introduced by #173	2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky	b09350c052	Fix to return only enabled content types via the new config/types API - Previously it was returning all core content types even if they had not been set up - Add test to validate only configured content types are returned by the api/config/types API endpoint	2023-02-28 22:08:26 -06:00
Debanjum Singh Solanky	ede6eb6879	Re-enable testing search and update API with image content type It may have been disabled due to issues with image search earlier	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	88a9eadfba	Use client pytest fixture to test API with plugin type configured	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	ab501a56c9	Create pytest fixture to configure app with plugin, search types	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	f944408e69	Update content_config pytest fixture to index plugin content	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	68bd5d9ebc	Configure API routes after set up search types while configuring server Configure app routes after configuring server. Import API routers after search type is dynamically populated. Allow API to recognize the dynamically populated plugin search types as valid type query param. Enable searching for plugin type content.	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	55a032e8c4	Add processor to index entries from jsonl files for plugins - Read, merge entries from input jsonl files and filters - Mark new, modified entries for update	2023-02-24 02:54:12 -06:00
Debanjum Singh Solanky	fcbbe8c759	Read content plugin configs from Khoj config YAML Configure external text content plugins via the Khoj YAML Reuse existing TextContentConfig definition for external text content plugins	2023-02-23 23:57:32 -06:00
Debanjum Singh Solanky	47569da38e	Fix usage of "\" in orgnode test string to resolve DeprecationWarning	2023-02-17 17:15:44 -06:00
Debanjum Singh Solanky	051f0e3fb5	Add, configure and run pre-commit locally and in test workflow	2023-02-17 13:31:36 -06:00
Debanjum Singh Solanky	5e83baab21	Use Black to format Khoj server code and tests	2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky	af6d65a909	Create tagged Docker image on new tag/release	2023-02-14 20:04:06 -06:00
Debanjum Singh Solanky	bc7477ea3e	Move Emacs, Obsidian plugin code out from under src/khoj directory - What - The Emacs and Obsidian interfaces stay in their original directories under src/ - src/khoj now only contains code meant for pypi packaging - Benefits - This avoids having to update khoj MELPA, Obsidian plugin config as the Emacs, Obsidian code is under their original directories - It separates the code in src/khoj meant for python packaging from code for external interfaces like Emacs and Obsidian	2023-02-14 15:44:22 -06:00
Debanjum Singh Solanky	25a749ca1d	Use the src/ layout to fix packaging Khoj for PyPI - Why The khoj PyPI packages should be installed in `khoj' directory. Previously it was being installed into `src' directory, which is a generic top level directory name that is discouraged from being used - Changes - move src/* to src/khoj/* - update `setup.py' to `find_packages' in `src' instead of project root - rename imports to form `from khoj.*' in complete project - update `constants.web_directory' path to use `khoj' directory - rename root logger to `khoj' in `main.py' - fix image_search tests to use the newly renamed `khoj' logger - update config, docs, workflows to reference new path `src/khoj'	2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky	6908b6eed3	Truncate image queries below max tokens length supported by ML model This would previously return the infamous tensor size mismatch error Verify this error is not raised since adding the query truncation logic	2023-01-21 14:11:00 -03:00
Debanjum Singh Solanky	3d9ed91e42	Search by image at path only if query of form "file:/path/to/image" Previously no query syntax helpers, like the "file:" prefix, were used before checking if query contains file path. This made query to image search brittle to misinterpretation and pointless checking Add test to verify search by image at file works as expected	2023-01-21 14:06:56 -03:00
Debanjum Singh Solanky	7b4f78776c	Fix extracting Markdown Entries with Top Level Headings - Previously top level headings would get stripped of the space between heading text and the prefix # symbols. That is, `# Top Level Heading' would get converted to `#Top Level Heading' - This would mess up their rendering as a heading in search results - Add unit tests to text_to_jsonl processors to prevent regression	2023-01-17 13:06:28 -03:00
Debanjum Singh Solanky	d40076fcd6	Deduplicate test code, make teardown more robust using pytest fixtures	2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky	237123d18c	Fix tests for the conversation processor - Use latest davinci model for tests - Wrap prompt in triple quotes to improve legibility - `understand' method returns dictionary instead of string. Fix its test - Fix prompt for new model to pass `chat_with_history' test	2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky	826f9dc054	Drop long words from compiled entries to be within max token limit of models Long words (>500 characters) provide less useful context to models. Dropping very long words allows models to create better embeddings by passing more of the useful context from the entry to the model	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	24676f95d8	Fix comments, use minimal test case, regenerate test index, merge debug logs - Remove property drawer from test entry for max_words splitting test - Property drawer is not required for the test - Keep minimal test case to reduce chance for confusion	2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky	53cd2e5605	Regenerate initial model in asymmetric reload test to reduce flakiness - Fix logger message when converting org node to entries - Remove unused import from conftest	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	c79919bd68	Split entries by max tokens while converting Org entries To JSONL - Test usage of the entry splitting by max tokens in text search	2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky	e057c8e208	Add method to split entries by specified max tokens limit - Issue ML Models truncate entries exceeding some max token limit. This lowers the quality of search results - Fix Split entries by max tokens before indexing. This should improve searching for content in longer entries. - Miscellaneous - Test method to split entries by max tokens	2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky	d292bdcc11	Do not version API. Premature given current state of the codebase - Reason - All clients that currently consume the API are part of Khoj - Any breaking API changes will be fixed in clients immediately - So decoupling client from API is not required - This removes the burden of maintaining multiple versions of the API	2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky	2c548133f3	Remove unused imports, `embeddings' variable from text search tests	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	7e9298f315	Use new Text Entry class to track text entries in Intermediate Format - Context - The app maintains all text content in a standard, intermediate format - The intermediate format was loaded, passed around as a dictionary for easier, faster updates to the intermediate format schema initially - The intermediate format is reasonably stable now, given its usage by all 3 text content types currently implemented - Changes - Concretize text entries into `Entries' class instead of using dictionaries - Code is updated to load, pass around entries as `Entries' objects instead of as dictionaries - `text_search' and `text_to_jsonl' methods are annotated with type hints for the new `Entries' type - Code and Tests referencing entries are updated to use class style access patterns instead of the previous dictionary access patterns - Move `mark_entries_for_update' method into `TextToJsonl' base class - This is a more natural location for the method as it is only (to be) used by `text_to_jsonl' classes - Avoid circular reference issues on importing `Entries' class	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	e42a38e825	Version Khoj API, Update frontends, tests and docs to reflect it - Split router.py into v1.0, beta and frontend (no-prefix) api modules under new router package. Version tag in main.py via prefix - Update frontends to use the versioned api endpoints - Update tests to work with versioned api endpoints - Update docs to mention, reference only versioned api endpoints	2022-09-28 20:08:38 +03:00
Debanjum Singh Solanky	02d944030f	Use Base TextToJsonl class to standardize <text>_to_jsonl processors - Start standardizing implementation of the `text_to_jsonl' processors - `text_to_jsonl' scripts already had a shared structure - This change starts to codify that implicit structure - Benefits - Ease adding more `text_to_jsonl' processors - Allow merging shared functionality - Help with type hinting - Drawbacks - Lower agility to change. But this was already an implicit issue as the text_to_jsonl processors got more deeply wired into the app	2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky	bf1ae038cb	Get XMP metadata from image using Pillow. Remove ExifTool dependency - Pillow already supports reading XMP metadata from Images - Removes need to maintain my fork of unmaintained PyExiftool - This also removes dependency on system Exiftool package for XMP metadata extraction - Add test to verify XMP metadata extracted from test images - Remove references to Exiftool from Documentation	2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky	1bfe9c4ef2	Handle filter only queries. Short-circuit and return filtered results - For queries with only filters in them short-circuit and return filtered results. No need to run semantic search, re-ranking. - Add client test for filter only query and quote query in client tests	2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky	536f03af8f	Process text content files in sorted order for stable indexing - Image search already uses a sorted list of images to process - Prevents index of entries from desyncing when entries, embeddings are generated by a separate server/app instance	2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky	a701ad08b9	Support multiple input-filters to configure content to index via khoj.yml - Update existing code, tests to process input-filters as list instead of str - Test `text_to_jsonl' get files methods to work with combination of `input-files' and `input-filters' Resolves #84	2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky	9d369ae4df	Fix OrgNode render of entries with property drawers and empty body - Issue - Indent regex was previously catching escape sequences like newlines - This was resulting in entries with only escape sequences in body to be prepended to property drawers etc during rendering - Fix - Update indent regex to only look for spaces in each line - Only render body when body contains non-escape characters - Create test to prevent this regression from silently resurfacing	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	253c9eae9a	Set index_heading_entries field in config to index entries with no body - Previously heading entries were not indexed to maintain search quality - But given that there are use-cases for indexing entries with no body - Add a configurable `index_heading_entries' field to index heading entries - This `TextContentConfig' field is currently only used for OrgMode content	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	1d3b3d5f39	Convert field get/set methods in OrgNode class to @property - Use more descriptive variable names in OrgNode parser and class - Convert OrgNode fields to private/protected, use property methods to get/set them	2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky	e951ba37ad	Raise exception when org file not found - No need to catch the IOError in OrgNode	2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky	9b2845de06	Add basic tests for beancount to jsonl conversion	2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky	d3267554ae	Add basic tests for markdown to jsonl conversion	2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky	ebd5039bd1	Merge branch 'master' into support-incremental-updates-of-embeddings	2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky	ed8d432fdd	Clean-up generated file after image search test run - Clean-up unused imports in test files	2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky	899bfc5c3e	Test incremental update triggered on calling text_search.setup - Previously updates to index required explicitly setting `regenerate=True` - Now incremental update check is made every time `text_search.setup` is called - Test if index automatically updates when `text_search.setup` is called with new content even with `regenerate=False`	2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky	c17a0fd05b	Do not store word filters index to file. Not necessary for now - It's more of a hassle to not let word filter go stale on entry updates - Generating index on 120K lines of notes takes 1s. Loading from file takes 0.2s. For less content load time difference will be even smaller - Let go of startup time improvement for simplicity for now	2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky	b9a6e80629	Make OrgNode tags stable sorted to find new entries for incremental updates - Having Tags as sets was returning them in a different order every time - This resulted in spuriously identifying existing entries as new because their tags ordering changed - Converting tags to list fixes the issue and identifies updated new entries for incremental update correctly	2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky	2f7a6af56a	Support incremental update of org-mode entries and embeddings - What - Hash the entries and compare to find new/updated entries - Reuse embeddings encoded for existing entries - Only encode embeddings for updated or new entries - Merge the existing and new entries and embeddings to get the updated entries, embeddings - Why - Given most note text entries are expected to be unchanged across time. Reusing their earlier encoded embeddings should significantly speed up embeddings updates - Previously we were regenerating embeddings for all entries, even if they had existed in previous runs	2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky	976397bd82	Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings	2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky	2b58218b56	Reuse search models across sessions. Merge unused pytest fixtures - Remove unused model_dir pytest fixture. It was only being used by the content_config fixture, not by any tests - Reuse existing search models downloaded to khoj directory. Downloading search models for each pytest sessions seems excessive and slows down tests quite a bit	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	11917c6ddd	Do not normalize absolute filenames for creating links in OrgNode	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	d6bd7bf3e1	Fix initializing OrgNode level to string to parse org files - Parsed `level` argument passed to OrgNode during init is expected to be a string, not an integer - This was resulting in app failure only when parsing org files with no headings, like in issue #83, as level is set to string of `*`s the moment a heading is found in the current file	2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky	d835467f2c	Throw exception if no valid entries found in specified content files - Previously we were failing if no valid entries while computing embeddings. This was obscuring the actual issue of no valid entries found in the specified content files - Throwing an exception early with clear message when no entries found should clarify the issue to be fixed - See issue #83 for details	2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky	31503e7afd	Do not pass embeddings as argument to filter.apply method	2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky	965bd052f1	Make search filters return entry ids satisfying filter - Filter entries, embeddings by ids satisfying all filters in query func, after each filter has returned entry ids satisfying their individual acceptance criteria - Previously each filter would return a filtered list of entries. Each filter would be applied on entries filtered by previous filters. This made the filtering order dependent - Benefits - Filters can be applied independent of their order of execution - Precomputed indexes for each filter are not in danger of running into index out of bound errors, as filters run on original entries instead of on entries filtered by filters that have run before it - Extract entries satisfying filter only once instead of doing this for each filter - Costs - Each filter has to process all entries even if previous filters may have already marked them as non-satisfactory	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7606724dbc	Add file of each entry to entry dict in org_to_jsonl converter - This will help filter query to org content type using file filter - Do not explicitly specify items being extracted from json of each entry in text_search as all text search content types do not have file being set in jsonl converters	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	f634399f23	Convert simple file filters with no path separator into regex - Specify just file name to get all notes associated with file at path - E.g `query` with `file:"file1.org"` will return `entry1` if `entry1` is in `file1.org` at `~/notes/file1.org` - Test - Test converting simple file name filter to regex for path match - Test file filter with space in file name	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	092b9e329d	Setup Filters when configuring Text Search for each Search Type - Allows enabling different filters for different Text Search Types - Use FileFilter in Text Search on Org Files	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	1f9fd28b34	Create File Filter to filter files to query. Add tests for file filter	2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky	f930324350	Rename explicit filter to word filter to be more specific	2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky	8f3326c8d4	Create LRU helper class for caching	2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky	cdcee89ae5	Wrap words in quotes to trigger explicit filter from query - Do not run the more expensive explicit filter until the word to be filtered is completed by user. This requires an end sequence marker to identify end of explicit word filter to trigger filtering - Space isn't a good enough delimiter as the explicit filter could be at the end of the query in which case no space	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	858d86075b	Use regexes to check if any explicit filters in query. Test can_filter	2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky	b7d259b1ec	Test Explicit Include, Exclude Filters	2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky	30c3eb372a	Update Tests to Configure Filters and Setup Text Search	2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky	ea4fdd9134	Fix logic to ignore notes with no body. Add tests to prevent regression - Notes with empty newlines in body were not being ignored - Add regression tests to avoid above regression in org_to_jsonl conversion	2022-08-21 19:41:40 +03:00
Debanjum Singh Solanky	5e107eedc0	Rename test_asymmetric_search to now more appropriate test_text_search	2022-08-21 18:36:14 +03:00
Debanjum Singh Solanky	972523e8a9	Re-enable tests for image search Verify if recent fixes resolve test flakiness	2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky	82d2891765	Do not pass ML compute `device` around as argument to search funcs - It is a non-user configurable, app state that is set on app start - Reduce passing unneeded arguments around. Just set device where required by looking for ML compute device in global state	2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky	fd952e7273	Fix CLI tests as config_file path made absolute during CLI parsing	2022-08-12 01:47:52 +03:00
Debanjum Singh Solanky	fc48ee62ad	Update CLI tests since config_file arg has become optional (again)	2022-08-11 22:27:11 +03:00
Debanjum Singh Solanky	a748acfeeb	Merge branch 'master' of github.com:debanjum/khoj into create-native-gui Conflicts: - src/main.py - router functions have moved to router - move logic to handle null query perf timer variables into router.py - set main.py to current branch, not master	2022-08-11 21:09:42 +03:00
Debanjum Singh Solanky	a02d9db457	Test Task State Extraction in OrgNode Tests	2022-08-10 13:48:18 +03:00
Debanjum Singh Solanky	7b04978f52	Put global state variables into separate state module - Variables storing app, device state aren't constants. Do not mix with actual constants like empty_escape_sequence, web_directory	2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky	bc423d8f76	Disable image search in tests. Import global state from constants module - Upstream issues causing load of image search model to fail. Disable tests related to image search for now	2022-08-06 02:47:52 +03:00
Debanjum Singh Solanky	ca5a8bd113	Make config file a positional argument, as it is required - Test invalid config file path throws. Remove redundant cli test - Simplify cli parser code - Do not need to explicitly check if args.config_file set. argparser checks for positional arguments automatically - Use standard semantics for cli args - All positional args are required. Non positional args are optional - Improve command line --help description	2022-08-05 01:09:40 +03:00
Debanjum Singh Solanky	1374065092	Mark all required fields for config. Throw if no input_* field specified - Add custom validator to throw if neither input_filter or input_<files|directories> are specified - Set field expecting paths to type Path - Now that default_config isn't used in code, we can update fields in rawconfig to specify whether they're required or not. This lets pydantic validate config file and throw appropriate error	2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky	4788143aa6	Set clip model name in conftest to sentence-transformers/clip as well	2022-08-04 22:54:39 +03:00
Debanjum Singh Solanky	f50f343f73	Rename org-mode test data directory to more specific org/ from notes/	2022-08-04 22:29:57 +03:00
Debanjum Singh Solanky	a4eb55dd00	Rename khoj config yml file to follow more specific khoj*.yml pattern - That is, sample_config.yml is renamed to khoj_sample.yml - This makes the application config filename less generic, more easily identifiable with the application - Update docs, app accordingly	2022-08-03 12:06:55 +03:00
Debanjum Singh Solanky	7d7259bd92	Remove tests that validate configuring org using commandline arguments	2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky	a12eaa4ce0	Move Khoj image results into a child images/ directory	2022-07-28 20:45:12 +04:00
Debanjum Singh Solanky	1168244c92	Make cross-encoder re-rank results if query param set on /search API - Improve search speed by ~10x. Tested on corpus of 125K lines, 12.5K entries - Allow cross-encoder to re-rank results by setting &?r=true when querying /search API - It's an optional param that defaults to False - Earlier all results were re-ranked by cross-encoder - Making this configurable allows for much faster results, if desired, at the cost of lower accuracy	2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky	b1e64fd4a8	Improve search speed. Only apply filter if filter keywords in query - Formalize filters into class with can_filter() and filter() methods - Use can_filter() method to decide whether to apply filter and create deep copies of entries and embeddings for it - Improve search speed for queries with no filters as deep copying entries, embeddings takes the most time after cross-encodes scoring when calling the /search API Earlier we would create deep copies of entries, embeddings even if the query did not contain any filter keywords	2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky	65fea7681a	Rename notes search type to org search, now that markdown notes supported	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	1f4b5ac112	Create test markdown files. Use them in sample config, docker-compose	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	0602d018c0	Merge Symmetric, Asymmetric Search Types into a single Text Search Type - The code for both the text search types was mostly the same. It was earlier done this way for expedience while experimenting - The minor differences were reconciled and merged into a single text_search type - This simplifies the app and makes it easier to process other text types	2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky	d50bfb5188	Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests	2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky	70e70d4b15	Rename 'embed' key to more generic 'compiled' for jsonl extracted results - While it's true those strings are going to be used to generate embeddings, the more generic term allows them to be used elsewhere as well - Their main property is that they are processed, compiled for usage by semantic search - Unlike the 'raw' string which contains the external representation of the data, as is	2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky	c1369233db	Consistently use "entry", "score" in json response for all search types - Had already made some progress on this earlier by updating the image search responses. But needed to update the text search responses to use lowercase entry and score - Update khoj.el to consume the updated json response keys for text search	2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky	c9ff97451b	Fix tests to handle updated response types by API	2022-07-20 03:01:56 +04:00
Debanjum Singh Solanky	6c9ffdba57	Allow indexing multiple image directories for image search	2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky	68ee88cebc	Fix image search tests after update to API response for image search types - Look for 'entry' key in response json instead of 'Entry' - Expect image where id = alphanumeric order of image name	2022-07-20 01:37:01 +04:00
Debanjum Singh Solanky	b673d26a12	Extract Entries in a standardized format across text search types Issue: - Had different schema of extracted entries for symmetric_ledger vs asymmetric - Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings - This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters - This limited the filters to only work for asymmetric search, not for symmetric_ledger - Fix - Use consistent format for extracted entries { 'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings, 'raw' : raw_entry_string_meant_to_be_passed_to_use } - Result - Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type	2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky	732b2d287f	Give the project a short, less generic name. Rename it to Khoj - Semantic Search was just a placeholder used to test the idea out Didn't want to get into naming at that point of time	2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky	989526ae54	Use a more accurate model for symmetric semantic search - The all-MiniLM-L6-v2 is more accurate - The exact previous model isn't benchmarked but based on the performance of the closest model to it, the new model may be similar in speed and size - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	4a90972e38	Use a better model for asymmetric semantic search - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1] - It has the right mix of model query speed, size and performance on benchmarks - On hugging face it has way more downloads and likes than the msmarco model[2] - On very preliminary evaluation of the model - It doubles the encoding speed of all entries (down from ~8min to 4mins) - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier) [1]: https://www.sbert.net/docs/pretrained_models.html [2]: https://huggingface.co/sentence-transformers	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	85077bc1d1	Handle unparseable date range passed via date filter in query - Do not reuse the same list - Just create new list, so only parsed data is in it	2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky	9de2097182	Fix date filter usage with multi word queries. Simplify date regex	2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky	67e9366c0f	Minor style fix. Use consistent/standard dates for date_filter tests	2022-07-14 20:06:39 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	741fca0e6b	Fix asymmetric search test to pass entries returned by query to collate_results	2022-07-12 18:48:49 +04:00
Debanjum Singh Solanky	8bb9a49994	Cleanup Test Asymmetric Search to Fix Test - test_regenerate_with_valid_content failed when run after test_asymmetric_search - test_asymmetric_search didn't clean the temporary update to config it had made - This was resulting in regenerate looking for a file that didn't exist	2022-07-07 01:25:31 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	2f7ef08b11	Add Unit Tests to verify the Reload API functions as desired	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be commitable - General enough that it'll work even when a users notes are not in the home directory - While solving for the special case where: - Notes are being processed on a different machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	f66192f2a7	Test OrgNode Parsing and Rendering	2022-06-17 19:13:11 +03:00
Debanjum Singh Solanky	79c2224eaa	Improve test data organization and update corresponding conftests - Put test data for each content type into separate directories - Makes config.yml for docker and local host consistent - Prepending tests to /data in sample_config.yml makes application run on local host using test data - Allows mounting separate volume for each content type in docker-compose - Ignore gitignore to only add tests content, not generated models or embeddings	2022-01-29 02:03:17 -05:00
Debanjum Singh Solanky	179153dc5a	Rename RawConfig Types for Consistency - Naming convention - [ContentType][ConfigType]Config - Where [ConfigType] ~ Content, Search, Processor - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation - Current Configs: - Content: - Org Notes - Org Music - Image - Ledger/Beancount - Search: - Asymmetric - Symmetric - Image - Processor: - Conversation	2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky	ed144f7984	Setup Search with Search_Config to Fix Tests - Rename pytest fixture search_config to more appropriate content_config - Create search_config pytest fixture - Use search_config where search being setup, used in tests	2022-01-14 20:13:14 -05:00

264 commits