sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-18 02:27:10 +00:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	af6d65a909	Create tagged Docker image on new tag/release	2023-02-14 20:04:06 -06:00
Debanjum Singh Solanky	bc7477ea3e	Move Emacs, Obsidian plugin code out from under src/khoj directory - What - The Emacs and Obsidian interfaces stay in their original directories under src/ - src/khoj now only contains code meant for pypi packaging - Benefits - This avoids having to update khoj MELPA, Obsidian plugin config as the Emacs, Obsidian code is under their original directories - It separates the code in src/khoj meant for python packaging from code for external interfaces like Emacs and Obsidian	2023-02-14 15:44:22 -06:00
Debanjum Singh Solanky	25a749ca1d	Use the src/ layout to fix packaging Khoj for PyPi - Why The khoj pypi packages should be installed in `khoj' directory. Previously it was being installed into `src' directory, which is a generic top level directory name that is discouraged from being used - Changes - move src/* to src/khoj/* - update `setup.py' to `find_packages' in `src' instead of project root - rename imports to form `from khoj.*' in complete project - update `constants.web_directory' path to use `khoj' directory - rename root logger to `khoj' in `main.py' - fix image_search tests to use the newly renamed `khoj' logger - update config, docs, workflows to reference new path `src/khoj'	2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky	6908b6eed3	Truncate image queries below max tokens length supported by ML model This would previously return the infamous tensor size mismatch error Verify this error is not raised since adding the query truncation logic	2023-01-21 14:11:00 -03:00
Debanjum Singh Solanky	3d9ed91e42	Search by image at path only if query of form "file:/path/to/image" Previously no query syntax helpers, like the "file:" prefix, were used before checking if query contains file path. This made query to image search brittle to misinterpretation and pointless checking Add test to verify search by image at file works as expected	2023-01-21 14:06:56 -03:00
Debanjum Singh Solanky	7b4f78776c	Fix extracting Markdown Entries with Top Level Headings - Previously top level headings would get stripped of the space between heading text and the prefix # symbols. That is, `# Top Level Heading' would get converted to `#Top Level Heading' - This would mess up their rendering as a heading in search results - Add unit tests to text_to_jsonl processors to prevent regression	2023-01-17 13:06:28 -03:00
Debanjum Singh Solanky	d40076fcd6	Deduplicate test code, make teardown more robust using pytest fixtures	2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky	237123d18c	Fix tests for the conversation processor - Use latest davinci model for tests - Wrap prompt in triple quotes to improve legibility - `understand' method returns dictionary instead of string. Fix its test - Fix prompt for new model to pass `chat_with_history' test	2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky	826f9dc054	Drop long words from compiled entries to be within max token limit of models Long words (>500 characters) provide less useful context to models. Dropping very long words allows models to create better embeddings by passing more of the useful context from the entry to the model	2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky	24676f95d8	Fix comments, use minimal test case, regenerate test index, merge debug logs - Remove property drawer from test entry for max_words splitting test - Property drawer is not required for the test - Keep minimal test case to reduce chance for confusion	2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky	53cd2e5605	Regenerate initial model in asymmetric reload test to reduce flakiness - Fix logger message when converting org node to entries - Remove unused import from conftest	2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky	c79919bd68	Split entries by max tokens while converting Org entries To JSONL - Test usage of the entry splitting by max tokens in text search	2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky	e057c8e208	Add method to split entries by specified max tokens limit - Issue ML Models truncate entries exceeding some max token limit. This lowers the quality of search results - Fix Split entries by max tokens before indexing. This should improve searching for content in longer entries. - Miscellaneous - Test method to split entries by max tokens	2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky	d292bdcc11	Do not version API. Premature given current state of the codebase - Reason - All clients that currently consume the API are part of Khoj - Any breaking API changes will be fixed in clients immediately - So decoupling client from API is not required - This removes the burden of maintaining multiple versions of the API	2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky	2c548133f3	Remove unused imports, `embeddings' variable from text search tests	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	7e9298f315	Use new Text Entry class to track text entries in Intermediate Format - Context - The app maintains all text content in a standard, intermediate format - The intermediate format was loaded, passed around as a dictionary for easier, faster updates to the intermediate format schema initially - The intermediate format is reasonably stable now, given its usage by all 3 text content types currently implemented - Changes - Concretize text entries into `Entries' class instead of using dictionaries - Code is updated to load, pass around entries as `Entries' objects instead of as dictionaries - `text_search' and `text_to_jsonl' methods are annotated with type hints for the new `Entries' type - Code and Tests referencing entries are updated to use class style access patterns instead of the previous dictionary access patterns - Move `mark_entries_for_update' method into `TextToJsonl' base class - This is a more natural location for the method as it is only (to be) used by `text_to_jsonl' classes - Avoid circular reference issues on importing `Entries' class	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	e42a38e825	Version Khoj API, Update frontends, tests and docs to reflect it - Split router.py into v1.0, beta and frontend (no-prefix) api modules under new router package. Version tag in main.py via prefix - Update frontends to use the versioned api endpoints - Update tests to work with versioned api endpoints - Update docs to mention, reference only versioned api endpoints	2022-09-28 20:08:38 +03:00
Debanjum Singh Solanky	02d944030f	Use Base TextToJsonl class to standardize <text>_to_jsonl processors - Start standardizing implementation of the `text_to_jsonl' processors - `text_to_jsonl' scripts already had a shared structure - This change starts to codify that implicit structure - Benefits - Ease adding more `text_to_jsonl' processors - Allow merging shared functionality - Help with type hinting - Drawbacks - Lower agility to change. But this was already an implicit issue as the text_to_jsonl processors got more deeply wired into the app	2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky	bf1ae038cb	Get XMP metadata from image using Pillow. Remove ExifTool dependency - Pillow already supports reading XMP metadata from Images - Removes need to maintain my fork of unmaintained PyExiftool - This also removes dependency on system Exiftool package for XMP metadata extraction - Add test to verify XMP metadata extracted from test images - Remove references to Exiftool from Documentation	2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky	1bfe9c4ef2	Handle filter only queries. Short-circuit and return filtered results - For queries with only filters in them short-circuit and return filtered results. No need to run semantic search, re-ranking. - Add client test for filter only query and quote query in client tests	2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky	536f03af8f	Process text content files in sorted order for stable indexing - Image search already uses a sorted list of images to process - Prevents index of entries to desync when entries, embeddings generated by a separate server/app instance	2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky	a701ad08b9	Support multiple input-filters to configure content to index via khoj.yml - Update existing code, tests to process input-filters as list instead of str - Test `text_to_jsonl' get files methods to work with combination of `input-files' and `input-filters' Resolves #84	2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky	9d369ae4df	Fix OrgNode render of entries with property drawers and empty body - Issue - Indent regex was previously catching escape sequences like newlines - This was resulting in entries with only escape sequences in body to be prepended to property drawers etc during rendering - Fix - Update indent regex to only look for spaces in each line - Only render body when body contains non-escape characters - Create test to prevent this regression from silently resurfacing	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	253c9eae9a	Set index_heading_entries field in config to index entries with no body - Previously heading entries were not indexed to maintain search quality - But given that there are use-cases for indexing entries with no body - Add a configurable `index_heading_entries' field to index heading entries - This `TextContentConfig' field is currently only used for OrgMode content	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	1d3b3d5f39	Convert field get/set methods in OrgNode class to @property - Use more descriptive variable names in OrgNode parser and class - Convert OrgNode fields to private/protected, use property methods to get/set them	2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky	e951ba37ad	Raise exception when org file not found - No need to catch the IOError in OrgNode	2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky	9b2845de06	Add basic tests for beancount to jsonl conversion	2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky	d3267554ae	Add basic tests for markdown to jsonl conversion	2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky	ebd5039bd1	Merge branch 'master' into support-incremental-updates-of-embeddings	2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky	ed8d432fdd	Clean-up generated file after image search test run - Clean-up unused imports in test files	2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky	899bfc5c3e	Test incremental update triggered on calling text_search.setup - Previously updates to index required explicitly setting `regenerate=True` - Now incremental update check is made every time on `text_search.setup` - Test if index automatically updates when calling `text_search.setup` with new content even with `regenerate=False`	2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky	c17a0fd05b	Do not store word filters index to file. Not necessary for now - It's more of a hassle to not let word filter go stale on entry updates - Generating index on 120K lines of notes takes 1s. Loading from file takes 0.2s. For less content load time difference will be even smaller - Let go of startup time improvement for simplicity for now	2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky	b9a6e80629	Make OrgNode tags stable sorted to find new entries for incremental updates - Having Tags as sets was returning them in a different order every time - This resulted in spuriously identifying existing entries as new because their tags ordering changed - Converting tags to list fixes the issue and identifies updated new entries for incremental update correctly	2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky	2f7a6af56a	Support incremental update of org-mode entries and embeddings - What - Hash the entries and compare to find new/updated entries - Reuse embeddings encoded for existing entries - Only encode embeddings for updated or new entries - Merge the existing and new entries and embeddings to get the updated entries, embeddings - Why - Given most note text entries are expected to be unchanged across time. Reusing their earlier encoded embeddings should significantly speed up embeddings updates - Previously we were regenerating embeddings for all entries, even if they had existed in previous runs	2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky	976397bd82	Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings	2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky	2b58218b56	Reuse search models across sessions. Merge unused pytest fixtures - Remove unused model_dir pytest fixture. It was only being used by the content_config fixture, not by any tests - Reuse existing search models downloaded to khoj directory. Downloading search models for each pytest sessions seems excessive and slows down tests quite a bit	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	11917c6ddd	Do not normalize absolute filenames for creating links in OrgNode	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	d6bd7bf3e1	Fix initializing OrgNode level to string to parse org files - Parsed `level` argument passed to OrgNode during init is expected to be a string, not an integer - This was resulting in app failure only when parsing org files with no headings, like in issue #83, as level is set to string of `*`s the moment a heading is found in the current file	2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky	d835467f2c	Throw exception if no valid entries found in specified content files - Previously we were failing if no valid entries while computing embeddings. This was obscuring the actual issue of no valid entries found in the specified content files - Throwing an exception early with clear message when no entries found should clarify the issue to be fixed - See issue #83 for details	2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky	31503e7afd	Do not pass embeddings as argument to filter.apply method	2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky	965bd052f1	Make search filters return entry ids satisfying filter - Filter entries, embeddings by ids satisfying all filters in query func, after each filter has returned entry ids satisfying their individual acceptance criteria - Previously each filter would return a filtered list of entries. Each filter would be applied on entries filtered by previous filters. This made the filtering order dependent - Benefits - Filters can be applied independent of their order of execution - Precomputed indexes for each filter is not in danger of running into index out of bound errors, as filters run on original entries instead of on entries filtered by filters that have run before it - Extract entries satisfying filter only once instead of doing this for each filter - Costs - Each filter has to process all entries even if previous filters may have already marked them as non-satisfactory	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7606724dbc	Add file of each entry to entry dict in org_to_jsonl converter - This will help filter query to org content type using file filter - Do not explicitly specify items being extracted from json of each entry in text_search as all text search content types do not have file being set in jsonl converters	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	f634399f23	Convert simple file filters with no path separator into regex - Specify just file name to get all notes associated with file at path - E.g `query` with `file:"file1.org"` will return `entry1` if `entry1` is in `file1.org` at `~/notes/file1.org` - Test - Test converting simple file name filter to regex for path match - Test file filter with space in file name	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	092b9e329d	Setup Filters when configuring Text Search for each Search Type - Allows enabling different filters for different Text Search Types - Use FileFilter in Text Search on Org Files	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	1f9fd28b34	Create File Filter to filter files to query. Add tests for file filter	2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky	f930324350	Rename explicit filter to word filter to be more specific	2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky	8f3326c8d4	Create LRU helper class for caching	2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky	cdcee89ae5	Wrap words in quotes to trigger explicit filter from query - Do not run the more expensive explicit filter until the word to be filtered is completed by user. This requires an end sequence marker to identify end of explicit word filter to trigger filtering - Space isn't a good enough delimiter as the explicit filter could be at the end of the query in which case no space	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	858d86075b	Use regexes to check if any explicit filters in query. Test can_filter	2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky	b7d259b1ec	Test Explicit Include, Exclude Filters	2022-09-03 23:41:43 +03:00
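
The incremental update described in commit 2f7a6af56a (hash the entries, reuse embeddings of unchanged entries, encode only new or updated ones, then merge) could look roughly like the sketch below. It is an illustration only and not the code from the repository: `entry_hash`, `incremental_update`, and the `encode` callback are hypothetical names, entries are assumed to be plain strings, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def entry_hash(entry: str) -> str:
    """Stable content hash used to detect unchanged entries (assumed: SHA-256)."""
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def incremental_update(
    previous_entries: Sequence[str],
    previous_embeddings: Sequence[Vector],
    current_entries: Sequence[str],
    encode: Callable[[List[str]], List[Vector]],
) -> Tuple[List[str], List[Vector]]:
    """Return current entries with embeddings, encoding only new or changed entries."""
    # Map the hash of each previously indexed entry to its stored embedding.
    cache = {
        entry_hash(entry): vector
        for entry, vector in zip(previous_entries, previous_embeddings)
    }

    # Collect entries whose hash was not seen before, skipping repeats.
    new_entries: List[str] = []
    seen = set(cache)
    for entry in current_entries:
        digest = entry_hash(entry)
        if digest not in seen:
            seen.add(digest)
            new_entries.append(entry)

    # Run the (expensive) model only on the new or updated entries.
    if new_entries:
        for entry, vector in zip(new_entries, encode(new_entries)):
            cache[entry_hash(entry)] = vector

    # Merge: embeddings follow the order of the current entries; entries that
    # were deleted from the content simply drop out.
    entries = list(current_entries)
    return entries, [cache[entry_hash(entry)] for entry in entries]
```

Because entries are matched by content hash, any nondeterminism in how an entry is rendered (such as the unordered tags fixed in b9a6e80629) would make an unchanged entry look new and trigger a needless re-encode.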


114 commits