sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-29 07:38:09 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	6a8fd9bf33	Reorder embeddings search arguments based on argument importance	2024-10-10 04:45:00 -07:00
Debanjum Singh Solanky	bba4e0b529	Accept file deletion requests by clients during sync - Remove unused full_corpus boolean. The full_corpus=False code path wasn't being used (except in a test) - The full_corpus=True code path used was ignoring file deletion requests sent by clients during sync. Unclear why this was done - Added unit test to prevent regression and show file deletion by clients during sync not ignored now	2024-07-19 04:53:01 +05:30
sabaimran	720139c3c1	Fix all unit tests for test_text_search	2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky	86575b2946	Chunk text in preference order of para, sentence, word, character - Previous simplistic chunking strategy of splitting text by space didn't capture notes with newlines, no spaces. For example, in #620 - New strategy will try to chunk the text at more natural points like paragraph, sentence, word first. If none of those work it'll split at character to fit within max token limit - Drop long words while preserving original delimiters Resolves #620	2024-04-04 02:41:55 +05:30
sabaimran	79913d4c17	Add isort to the pre-commit configuration and apply it to the whole project (#595) * Apply isort to the entire repository * Fix missing import issues in text_to_entries * Fix imports in migration files	2023-12-28 18:04:02 +05:30
sabaimran	1e2af083f0	Rename the data_sources module to content	2023-11-21 22:11:32 -08:00
sabaimran	2bb989e9d8	Resolve merge conflicts and fix some import ordering	2023-11-21 12:30:43 -08:00
sabaimran	a474c31e02	Move the django app into the src/khoj folder for better organization and functionality - Our pypi package currently does not work because the django app and associated database is not included. To remedy this issue, move the app into the src/khoj folder. This has the added benefit of improved organization of the codebase, as all server related code is now in a single folder - Update associated file paths and system references	2023-11-21 10:56:04 -08:00
sabaimran	b8e6883a81	Merge branch 'master' of github.com:khoj-ai/khoj into features/internet-enabled-search	2023-11-19 16:20:08 -08:00
Debanjum Singh Solanky	33ad9b8e64	Update text search test since indexing ancestor hierarchy added	2023-11-17 15:26:55 -08:00
sabaimran	ec06d2c446	Move data indexer files into a separate folder under processor. Update assoc UTs	2023-11-16 17:19:55 -08:00
Debanjum Singh Solanky	ddb07def0d	Test search uses ancestor headings as context for improved results - Update test data to add deeper outline hierarchy for testing hierarchy as context - Update collateral tests that need count of entries updated, deleted asserts to be updated	2023-11-16 03:05:19 -08:00
Debanjum Singh Solanky	9ab327a2b6	Store the data source of each entry in database This will be useful for updating, deleting entries by their data source. Data source can be one of Computer, Github or Notion for now Store each file/entries source in database	2023-11-07 02:18:48 -08:00
Debanjum Singh Solanky	f212cc7174	Arrange remaining text search tests in arrange, act, assert order	2023-11-05 02:04:52 -08:00
Debanjum Singh Solanky	022017dd0f	Fix text search tests to test updated indexing log messages	2023-11-05 02:04:52 -08:00
Debanjum Singh Solanky	d92a2d03a7	Rename Files, Classes from X_To_JSONL to more appropriate X_To_Entries These content processors are converting content into entries in DB instead of entries in JSONL file	2023-11-01 14:51:33 -07:00
Debanjum Singh Solanky	bcbee05a9e	Rename DbModels Embeddings, EmbeddingsAdapter to Entry, EntryAdapter Improves readability as name has closer match to underlying constructs - Entry is any atomic item indexed by Khoj. This can be an org-mode entry, a markdown section, a PDF or Notion page etc. - Embeddings are semantic vectors generated by the search ML model that encodes for meaning contained in an entry's text. - An "Entry" contains "Embeddings" vectors but also other metadata about the entry like filename etc.	2023-10-31 18:50:54 -07:00
sabaimran	54a387326c	[Multi-User Part 6]: Address small bugs and upstream PR comments (#518) - `08654163cb`: Add better parsing for XML files - `f3acfac7fb`: Add a try/catch around the dateparser in order to avoid internal server errors in app - `7d43cd62c0`: Chunk embeddings generation in order to avoid large memory load - `e02d751eb3`: Addresses comments from PR #498 - `a3f393edb4`: Addresses comments from PR #503 - `66eb078286`: Addresses comments from PR #511 - Address various items in https://github.com/khoj-ai/khoj/issues/527	2023-10-31 17:59:53 -07:00
sabaimran	5f3f6b7c61	[Multi-User Part 5]: Add a production Docker file and use a gunicorn configuration with it (#514) - Add a productionized setup for the Khoj server using `gunicorn` with multiple workers for handling requests - Add a new Dockerfile meant for production config at `ghcr.io/khoj-ai/khoj:prod`; the existing Docker config should remain the same	2023-10-26 13:15:31 -07:00
sabaimran	4b6ec248a6	[Multi-User Part 3]: Separate chat sessions based on authenticated users (#511) - Add a data model which allows us to store Conversations with users. This does a minimal lift over the current setup, where the underlying data is stored in a JSON file. This maintains parity with that configuration. - There does _seem_ to be some regression in chat quality, which is most likely attributable to search results. This will help us with #275. It should become much easier to maintain multiple Conversations in a given table in the backend now. We will have to do some thinking on the UI.	2023-10-26 11:37:41 -07:00
sabaimran	216acf545f	[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account (#498) - Partition configuration for indexing local data based on user accounts - Store indexed data in an underlying postgres db using the `pgvector` extension - Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time - Apply filters using SQL queries - Start removing many server-level configuration settings - Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB. - Update the Docker image and docker-compose.yml to work with the new application design	2023-10-26 09:42:29 -07:00
Debanjum Singh Solanky	e3cd8b4150	Only index files returned by input-filter globs in fs_syncer Ignore .org, .pdf etc. suffixed directories under `input-filter' from being evaluated as files. Explicitly filter results by input-filter globs to only index files, not directory for each text type Add test to prevent regression Closes #448	2023-10-17 23:32:10 -07:00
Debanjum Singh Solanky	d9d133dfb9	Read text files as utf-8, instead of default os locale On Windows, the default locale isn't utf8. Khoj had regressed to reading files in OS specified locale encoding, e.g. cp1252, cp949 etc. It now explicitly uses utf8 encoding to read text files for indexing Resolves #495, resolves #472	2023-10-17 21:47:19 -07:00
sabaimran	76562f4250	Add front-end Electron application for Khoj local file syncing (#473) * Initial version - setup a file-push architecture for generating embeddings with Khoj * Use state.host and state.port for configuring the URL for the indexer * Fix parsing of PDF files * Read markdown files from streamed data and update unit tests * On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system * Init: refactor indexer/batch endpoint to support a generic file ingestion format * Add features to better support indexing from files sent by the desktop client * Initial commit with Electron application - Adds electron app * Add import for pymupdf, remove import for pypdf * Allow user to configure khoj host URL * Remove search type configuration from index.html * Use v1 path for current indexer routes	2023-09-06 12:04:18 -07:00
sabaimran	4854258047	Move to a push-first model for retrieving embeddings from local files (#457) * Initial version - setup a file-push architecture for generating embeddings with Khoj * Update unit tests to fix with new application design * Allow configure server to be called without regenerating the index; this no longer works because the API for indexing files is not up in time for the server to send a request * Use state.host and state.port for configuring the URL for the indexer * On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system	2023-08-31 12:55:17 -07:00
sabaimran	0ea901c7c1	Allow indexing to continue even if there's an issue parsing a particular org file (#430) * Allow indexing to continue even if there's an issue parsing a particular org file * Use approximation in pytorch comparison in text_search UT, skip additional file parser errors for org files * Change error of expected failure	2023-08-14 07:56:33 -07:00
Debanjum Singh Solanky	9b1048caf7	Remove asymmetric from name of remaining text search tests Asymmetric search is the only search type used now in khoj.el. So making distinction between symmetric and asymmetric search isn't necessary anymore	2023-07-28 15:33:22 -07:00
Debanjum Singh Solanky	ef6a0044f4	Drop embeddings of deleted text entries from index Previously the deleted embeddings would continue to be in the index, even after the entry was deleted	2023-07-16 03:47:05 -07:00
Debanjum Singh Solanky	c73feebf25	Test index embeddings are stable on incremental update & no norm Ensure order of new embedding insertion on incremental update does not affect the order and value of existing embeddings when normalization is turned off	2023-07-16 02:22:28 -07:00
Debanjum Singh Solanky	1482fd4d4d	Test index is stable sorted on incremental update with new entry Ensure order of new embedding, entry insertion on incremental update is stable	2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky	b02323ade6	Improve name of text search test functions Asymmetric was older name used to differentiate between symmetric, asymmetric search. Now that text search just uses asymmetric search stick to simpler name	2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky	7669b85da6	Test index is stable sorted on regenerate with new entry	2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky	88d1a29a84	Test index is stable for duplicate entries across regenerate, update - Current incorrect behavior: All entries with duplicate compiled form are kept on regenerate but on update only the last of the duplicated entries is kept This divergent behavior is not ideal to prevent index corruption across reconfigure and update	2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky	da98b92dd4	Create helper function to test value, order of entries & embeddings This helper should be used to observe if the current embeddings are stable sorted on regenerate and incremental update of index in text search tests	2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky	b9fb656657	Update Tests to setup both content_index, search_models before testing This is required by the updated structure of Khoj setup - Add content_config pytest fixture, pass bi_encoder from search_models.[text\|image]_search	2023-07-14 01:29:48 -07:00
Debanjum Singh Solanky	56ce97ef9e	Use async/await in tests for query method of text and image search The text, image search query method has become async. So async/await is required to get results correctly in tests etc	2023-06-28 22:07:02 -07:00
Debanjum Singh Solanky	595cc5b0f5	Use printer icon for PDF logs. Only split lines if file at web link in web interface	2023-06-18 02:26:03 -07:00
Saba	751edfefe5	Add separate unit test for github. Will only run if a PAT token is set	2023-06-13 16:55:58 -07:00
Debanjum Singh Solanky	cc75f986b2	Test text search index only updates on changes to text content	2023-05-12 17:37:34 +08:00
Debanjum Singh Solanky	211e460398	Output date filter from cache log at debug level. Remove unused imports Other logs not directly useful to user have already been converted to debug log levels in `1ae4016`. Just forgot to convert this log line too	2023-03-02 15:41:32 -06:00
Debanjum Singh Solanky	5e83baab21	Use Black to format Khoj server code and tests	2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky	25a749ca1d	Use the src/ layout to fix packaging Khoj for PyPi - Why The khoj pypi packages should be installed in `khoj' directory. Previously it was being installed into `src' directory, which is a generic top level directory name that is discouraged from being used - Changes - move src/* to src/khoj/* - update `setup.py' to `find_packages' in `src' instead of project root - rename imports to form `from khoj.*' in complete project - update `constants.web_directory' path to use `khoj' directory - rename root logger to `khoj' in `main.py' - fix image_search tests to use the newly renamed `khoj' logger - update config, docs, workflows to reference new path `src/khoj'	2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky	d40076fcd6	Deduplicate test code, make teardown more robust using pytest fixtures	2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky	24676f95d8	Fix comments, use minimal test case, regenerate test index, merge debug logs - Remove property drawer from test entry for max_words splitting test - Property drawer is not required for the test - Keep minimal test case to reduce chance for confusion	2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky	53cd2e5605	Regenerate initial model in asymmetric reload test to reduce flakiness - Fix logger message when converting org node to entries - Remove unused import from conftest	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	c79919bd68	Split entries by max tokens while converting Org entries To JSONL - Test usage of the entry splitting by max tokens in text search	2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky	7e9298f315	Use new Text Entry class to track text entries in Intermediate Format - Context - The app maintains all text content in a standard, intermediate format - The intermediate format was loaded, passed around as a dictionary for easier, faster updates to the intermediate format schema initially - The intermediate format is reasonably stable now, given its usage by all 3 text content types currently implemented - Changes - Concretize text entries into `Entries' class instead of using dictionaries - Code is updated to load, pass around entries as `Entries' objects instead of as dictionaries - `text_search' and `text_to_jsonl' methods are annotated with type hints for the new `Entries' type - Code and Tests referencing entries are updated to use class style access patterns instead of the previous dictionary access patterns - Move `mark_entries_for_update' method into `TextToJsonl' base class - This is a more natural location for the method as it is only (to be) used by `text_to_jsonl' classes - Avoid circular reference issues on importing `Entries' class	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	02d944030f	Use Base TextToJsonl class to standardize <text>_to_jsonl processors - Start standardizing implementation of the `text_to_jsonl' processors - `text_to_jsonl' scripts already had a shared structure - This change starts to codify that implicit structure - Benefits - Ease adding more `text_to_jsonl' processors - Allow merging shared functionality - Help with type hinting - Drawbacks - Lower agility to change. But this was already an implicit issue as the text_to_jsonl processors got more deeply wired into the app	2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky	a701ad08b9	Support multiple input-filters to configure content to index via khoj.yml - Update existing code, tests to process input-filters as list instead of str - Test `text_to_jsonl' get files methods to work with combination of `input-files' and `input-filters' Resolves #84	2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky	e951ba37ad	Raise exception when org file not found - No need to catch the IOError in OrgNode	2022-09-11 01:09:24 +03:00

57 commits