sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-27 17:35:07 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	1374065092	Mark all required fields for config. Throw if no input_* field specified - Add custom validator to throw if neither input_filter nor input_<files|directories> is specified - Set fields expecting paths to type Path - Now that default_config isn't used in code, we can update fields in rawconfig to specify whether they're required or not. This lets pydantic validate the config file and throw an appropriate error	2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky	4788143aa6	Set clip model name in conftest to sentence-transformers/clip as well	2022-08-04 22:54:39 +03:00
Debanjum Singh Solanky	f50f343f73	Rename org-mode test data directory to more specific org/ from notes/	2022-08-04 22:29:57 +03:00
Debanjum Singh Solanky	a4eb55dd00	Rename khoj config yml file to follow more specific khoj*.yml pattern - That is, sample_config.yml is renamed to khoj_sample.yml - This makes the application config filename less generic, more easily identifiable with the application - Update docs, app accordingly	2022-08-03 12:06:55 +03:00
Debanjum Singh Solanky	7d7259bd92	Remove tests that validate configuring org using commandline arguments	2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky	a12eaa4ce0	Move Khoj image results into a child images/ directory	2022-07-28 20:45:12 +04:00
Debanjum Singh Solanky	1168244c92	Make cross-encoder re-rank results if query param set on /search API - Improve search speed by ~10x. Tested on corpus of 125K lines, 12.5K entries - Allow cross-encoder to re-rank results by setting &?r=true when querying /search API - It's an optional param that defaults to False - Earlier all results were re-ranked by cross-encoder - Making this configurable allows for much faster results, if desired, at the cost of lower accuracy	2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky	b1e64fd4a8	Improve search speed. Only apply filter if filter keywords in query - Formalize filters into class with can_filter() and filter() methods - Use can_filter() method to decide whether to apply filter and create deep copies of entries and embeddings for it - Improve search speed for queries with no filters, as deep copying entries, embeddings takes the most time after cross-encoder scoring when calling the /search API. Earlier we would create deep copies of entries, embeddings even if the query did not contain any filter keywords	2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky	65fea7681a	Rename notes search type to org search, now that markdown notes supported	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	1f4b5ac112	Create test markdown files. Use them in sample config, docker-compose	2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky	0602d018c0	Merge Symmetric, Asymmetric Search Types into a single Text Search Type - The code for both the text search types was mostly the same. It was earlier done this way for expedience while experimenting - The minor differences were reconciled and merged into a single text_search type - This simplifies the app and makes it easier to process other text types	2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky	d50bfb5188	Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests	2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky	70e70d4b15	Rename 'embed' key to more generic 'compiled' for jsonl extracted results - While it's true those strings are going to be used to generate embeddings, the more generic term allows them to be used elsewhere as well - Their main property is that they are processed, compiled for usage by semantic search - Unlike the 'raw' string which contains the external representation of the data, as is	2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky	c1369233db	Consistently use "entry", "score" in json response for all search types - Had already made some progress on this earlier by updating the image search responses. But needed to update the text search responses to use lowercase entry and score - Update khoj.el to consume the updated json response keys for text search	2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky	c9ff97451b	Fix tests to handle updated response types by API	2022-07-20 03:01:56 +04:00
Debanjum Singh Solanky	6c9ffdba57	Allow indexing multiple image directories for image search	2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky	68ee88cebc	Fix image search tests after update to API response for image search types - Look for 'entry' key in response json instead of 'Entry' - Expect image where id = alphanumeric order of image name	2022-07-20 01:37:01 +04:00
Debanjum Singh Solanky	b673d26a12	Extract Entries in a standardized format across text search types Issue: - Had different schema of extracted entries for symmetric_ledger vs asymmetric - Entry extraction for asymmetric was dirty, relying on cryptic indices to store raw entry vs cleaned entry meant to be passed to embeddings - This was pushing the load of figuring out what property to extract from each entry to downstream processes like the filters - This limited the filters to only work for asymmetric search, not for symmetric_ledger - Fix - Use consistent format for extracted entries { 'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings, 'raw' : raw_entry_string_meant_to_be_passed_to_use } - Result - Now filters can be applied across search types, and the specific field they should be applied on can be configured by each search type	2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky	732b2d287f	Give the project a short, less generic name. Rename it to Khoj - Semantic Search was just a placeholder used to test the idea out. Didn't want to get into naming at that point of time	2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky	989526ae54	Use a more accurate model for symmetric semantic search - The all-MiniLM-L6-v2 is more accurate - The exact previous model isn't benchmarked but based on the performance of the closest model to it, it seems like the new model may be similar in speed and size - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	4a90972e38	Use a better model for asymmetric semantic search - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1] - It has the right mix of model query speed, size and performance on benchmarks - On hugging face it has way more downloads and likes than the msmarco model[2] - On very preliminary evaluation of the model - It doubles the encoding speed of all entries (down from ~8min to 4mins) - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier) [1]: https://www.sbert.net/docs/pretrained_models.html [2]: https://huggingface.co/sentence-transformers	2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky	85077bc1d1	Handle unparseable date range passed via date filter in query - Do not reuse the same list - Just create new list, so only parsed data is in it	2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky	9de2097182	Fix date filter usage with multi word queries. Simplify date regex	2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky	67e9366c0f	Minor style fix. Use consistent/standard dates for date_filter tests	2022-07-14 20:06:39 +04:00
Debanjum Singh Solanky	dcb6fe479e	Fix date_filter query, entry in query range check. Add tests for it - Fix date_filter date_in_entry within query range check - Extracted_date_range is in [included_date, excluded_date) format - But check was checking for date_in_entry <= excluded_date - Fixed it to do date_in_entry < excluded_date - Fix removal of date filter from query - Add tests for date_filter	2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky	011f81fac5	Fix date_filter to handle non overlapping date ranges	2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky	70ac35b2a5	Compute Date Range to filter entries to, from Comparators, Dates in Query	2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky	e6db3e3d00	Prefer Dates From Future only when specific words in date string - Default to looking at dates from past, as most notes are from past - Look for dates in future for cases where it's obvious query is for dates in the future but dateparser's parse doesn't parse it at all. E.g. parse('5 months from now') returns nothing - Setting PREFER_DATES_FROM_FUTURE in this case and passing just parse('5 months') to dateparser.parse works as expected	2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky	4a201d52af	Add, test date filter regex and date parsing to get natural date range	2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky	741fca0e6b	Fix asymmetric search test to pass entries returned by query to collate_results	2022-07-12 18:48:49 +04:00
Debanjum Singh Solanky	8bb9a49994	Cleanup Test Asymmetric Search to Fix Test - test_regenerate_with_valid_content failed when run after test_asymmetric_search - test_asymmetric_search didn't clean the temporary update to config it had made - This was resulting in regenerate looking for a file that didn't exist	2022-07-07 01:25:31 +04:00
Debanjum Singh Solanky	eda4b65ddb	Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU - Move embeddings to CUDA GPU for compute, when available - Normalize embeddings and Use Dot Product instead of Cosine	2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky	2f7ef08b11	Add Unit Tests to verify the Reload API functions as desired	2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky	85fbe1c42b	Normalize org notes path to be relative to home directory - This is still clunky but it should be committable - General enough that it'll work even when a user's notes are not in the home directory - While solving for the special case where: - Notes are being processed on one machine and used on a different machine - But the notes directory is in the same location relative to home on both the machines	2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky	f66192f2a7	Test OrgNode Parsing and Rendering	2022-06-17 19:13:11 +03:00
Debanjum Singh Solanky	79c2224eaa	Improve test data organization and update corresponding conftests - Put test data for each content type into separate directories - Makes config.yml for docker and local host consistent - Prepending tests to /data in sample_config.yml makes application run on local host using test data - Allows mounting separate volume for each content type in docker-compose - Ignore gitignore to only add tests content, not generated models or embeddings	2022-01-29 02:03:17 -05:00
Debanjum Singh Solanky	179153dc5a	Rename RawConfig Types for Consistency - Naming convention - [ContentType][ConfigType]Config - Where [ConfigType] ~ Content, Search, Processor - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation - Current Configs: - Content: - Org Notes - Org Music - Image - Ledger/Beancount - Search: - Asymmetric - Symmetric - Image - Processor: - Conversation	2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky	ed144f7984	Setup Search with Search_Config to Fix Tests - Rename pytest fixture search_config to more appropriate content_config - Create search_config pytest fixture - Use search_config where search being setup, used in tests	2022-01-14 20:13:14 -05:00
Debanjum Singh Solanky	2e53fbc844	Fix the user intent extraction prompt for GPT. Clean up chatbot test	2022-01-12 10:36:01 -05:00
Debanjum	ef911aa6be	Skip Flaky Image Search Test. Image search doesn't always return expected image path. Should resolve remaining issues with failing cloud test. See #11	2021-12-12 02:15:20 -08:00
Saba	ba8dc9ed5f	Update the search_config instantiated for tests in conftest	2021-12-11 14:14:31 -05:00
Saba	d65190c3ee	Update unit tests, files with removing model suffix to config types	2021-12-09 08:50:38 -05:00
Saba	76e9e9da2f	Update unit tests to use the new BaseModel types	2021-12-05 09:31:39 -05:00
Saba	c4cd4b57f1	Update types used in conftest.py	2021-12-04 12:02:19 -05:00
Debanjum Singh Solanky	882e0f81b4	Skip running the inconsistent image search test	2021-11-27 18:38:44 +05:30
Debanjum Singh Solanky	d4e1120b22	Add GPT based conversation processor to understand intent and converse with user - Allow conversing with user using GPT's contextually aware, generative capability - Extract metadata, user intent from user's messages using GPT's general understanding	2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky	9dcffe3e8e	Rename test_main to test_client. It only contains client specific tests	2021-10-02 20:32:53 -07:00
Debanjum Singh Solanky	7e0d9bafa7	Split test_main into client and search type specific test files - Move search config fixture to conftests.py to be shared across tests - Move image search type specific tests to test_image_search.py file - Move, create asymmetric search type specific tests in new file	2021-10-02 20:28:33 -07:00
Debanjum Singh Solanky	da33e9e743	Create test directory with model data to reuse for pytest session - Use pytest fixture with session scope - Instantiate test directory with model data to reuse for tests	2021-10-02 19:46:29 -07:00
Debanjum Singh Solanky	866ccb5cd3	Add all configurables to sample_config. Add test music, ledger data - Get ledger sample from github.com/debanjum/company-ledger - Get music sample from github.com/debanjum/org-music	2021-10-02 16:11:27 -07:00
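
Commit b1e64fd4a8 describes filters as classes with can_filter() and filter() methods, with deep copies made only when a filter applies to the query. A minimal sketch of that idea follows. It is not Khoj's actual code: the `ExcludeWordFilter` class, its `-word` query syntax, and the `apply_filters` helper are hypothetical names chosen for illustration.

```python
import copy
import re


class ExcludeWordFilter:
    """Hypothetical filter: a query term like -word drops entries containing that word."""

    term = re.compile(r"(?:^|\s)-(\w+)")

    def can_filter(self, query):
        # Cheap check that lets callers skip filtering (and copying) entirely.
        return self.term.search(query) is not None

    def filter(self, query, entries):
        blocked = {word.lower() for word in self.term.findall(query)}
        cleaned_query = self.term.sub("", query).strip()
        kept = [
            entry
            for entry in entries
            if not blocked & set(re.findall(r"\w+", entry["raw"].lower()))
        ]
        return cleaned_query, kept


def apply_filters(query, entries, filters):
    """Deep copy entries only when at least one filter applies to the query."""
    applicable = [f for f in filters if f.can_filter(query)]
    if not applicable:
        return query, entries  # fast path: no copy, no filtering
    entries = copy.deepcopy(entries)
    for f in applicable:
        query, entries = f.filter(query, entries)
    return query, entries
```

Queries without filter keywords return the original list untouched, which is where the commit message says the time was previously being spent.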


52 commits
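
Commit dcb6fe479e fixes the date filter so that the extracted range is treated as [included_date, excluded_date). A minimal sketch of that half-open check is below. The function name `date_in_query_range` is an assumption made for illustration and is not taken from the Khoj source.

```python
from datetime import date


def date_in_query_range(date_in_entry, included_date, excluded_date):
    """Return True if date_in_entry falls in [included_date, excluded_date)."""
    # The start of the range is included; the end is excluded (strict <, not <=).
    return included_date <= date_in_entry < excluded_date
```

For example, a query range covering July 2022 would be [2022-07-01, 2022-08-01): an entry dated 2022-07-31 matches, while one dated 2022-08-01 does not.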