sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-25 08:25:07 +01:00

Author	SHA1	Message	Date
Debanjum	3d86d763c5	Support Multiple Input Filters to Configure Content to Index - `536f03a` Process text content files in sorted order for stable indexing - `a701ad0` Support multiple input-filters to configure content to index via `khoj.yml` Resolves #84	2022-09-12 08:19:52 +00:00
Debanjum Singh Solanky	536f03af8f	Process text content files in sorted order for stable indexing - Image search already uses a sorted list of images to process - Prevents the index of entries from desyncing when entries, embeddings are generated by a separate server/app instance	2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky	a701ad08b9	Support multiple input-filters to configure content to index via khoj.yml - Update existing code, tests to process input-filters as list instead of str - Test `text_to_jsonl' get files methods to work with combination of `input-files' and `input-filters' Resolves #84	2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky	940c8fac8c	Use app LRU, not functools LRU decorator, to cache search results in router - Provides more control to invalidate cache on update to entries, embeddings - Allows logging when results are being returned from cache etc - FastAPI, Swagger API docs look better as the `search' controller not wrapped in generically named function when using functools LRU decorator	2022-09-12 09:38:48 +03:00
Debanjum Singh Solanky	c6fa09d8fc	Fix querying with include word filter from web interface - Not encoding the `query' string before querying the backend API with it was causing the "+" prefix for include word filter to be lost	2022-09-12 09:27:02 +03:00
Debanjum Singh Solanky	1502fbc9e9	Add index_heading_entries flag to default and sample khoj configs	2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky	7216cdff58	Add Date, Word filter for Org-Music content	2022-09-11 17:29:34 +03:00
Debanjum	182fbbd8df	Allow Indexing Heading Entries. Improve Org, TextToJsonl Parser ### Summary - Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body ### Main Changes #### Make Indexing Org-Mode Entries with Empty Body Configurable - `253c9ea` Set `index_heading_entries` field in `khoj.yml` to index entries with no body ### Fix, Improve OrgNode, TextToJsonl Parser - `9d369ae` Fix `OrgNode` render of entries with property drawers and empty body - `1d3b3d5` Convert field get/set methods in `OrgNode` class to `@property` - `db37e38` Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks - `b4878d7` Extract entries from scratch when regenerate requested - `52e3dd9` Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods - `e951ba3` Raise exception when org file not found Resolves #87	2022-09-11 13:46:11 +00:00
Debanjum Singh Solanky	9d369ae4df	Fix OrgNode render of entries with property drawers and empty body - Issue - Indent regex was previously catching escape sequences like newlines - This was resulting in entries with only escape sequences in body to be prepended to property drawers etc during rendering - Fix - Update indent regex to only look for spaces in each line - Only render body when body contains non-escape characters - Create test to prevent this regression from silently resurfacing	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	253c9eae9a	Set index_heading_entries field in config to index entries with no body - Previously heading entries were not indexed to maintain search quality - But given that there are use-cases for indexing entries with no body - Add a configurable `index_heading_entries' field to index heading entries - This `TextContentConfig' field is currently only used for OrgMode content	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	1d3b3d5f39	Convert field get/set methods in OrgNode class to @property - Use more descriptive variable names in OrgNode parser and class - Convert OrgNode fields to private/protected, use property methods to get/set them	2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky	db37e38df7	Create OrgNode hasBody method. Use it in org_to_jsonl checks	2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky	b4878d76ea	Extract entries from scratch when regenerate requested - Do not rely on previously extracted entries to find new entries in regenerate scenario	2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky	52e3dd9835	Pass the whole TextContentConfig as argument to text_to_jsonl methods - Let the specific text_to_jsonl method decide which of the TextContentConfig fields it needs to convert <text> type to jsonl - This simplifies extending TextContentConfig for a specific type without modifying all text_to_jsonl methods - It keeps the number of args being passed to the `text_to_jsonl' methods in check	2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky	e951ba37ad	Raise exception when org file not found - No need to catch the IOError in OrgNode	2022-09-11 01:09:24 +03:00
Debanjum	c415af32d5	Support Incremental Update of Entries, Embeddings for OrgMode, Markdown, Beancount Content ### Major Changes - `030fab9` Support incremental update of Markdown entries, embeddings - `91aac83` Support incremental update of Beancount transactions, embeddings - `2f7a6af` Support incremental update of Org-Mode entries, embeddings - Encode embeddings for updated or new entries - Reuse embeddings encoded for existing entries earlier - Merge the existing and new entries and embeddings to get the updated entries, embeddings - `91d11cc` Only hash compiled entry to identify new/updated entries to update - `b9a6e80` Make OrgNode tags stable sorted to find new entries for incremental updates ### Minor Changes - `c17a0fd` Do not store word filters index to file. Not necessary for now - `4eb84c7` Log performance metrics for jsonl conversion - `2e1bbe0` Fix stripping empty escape sequences from strings ### Why - Encoding embeddings is the slowest step to index content - Previously we regenerated embeddings for all entries, even if they existed in previous runs - Reusing previously generated embeddings should significantly speed up index updates, given most user generated content can be expected to be unchanged across time Resolves #36	2022-09-10 21:38:05 +00:00
Debanjum Singh Solanky	9b2845de06	Add basic tests for beancount to jsonl conversion	2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky	d3267554ae	Add basic tests for markdown to jsonl conversion	2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky	2e1bbe0cac	Fix stripping empty escape sequences from strings - Fix log message on jsonl write	2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky	a7cf6c8458	Use dictionary instead of list to track entry to file maps	2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky	3e1323971b	Stack function calls in jsonl converters to avoid unneeded variables	2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky	4eb84c7f51	Log performance metrics for beancount, markdown to jsonl conversion	2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky	ebd5039bd1	Merge branch 'master' into support-incremental-updates-of-embeddings	2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky	ed8d432fdd	Clean-up generated file after image search test run - Clean-up unused imports in test files	2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky	030fab9bb2	Support incremental update of Markdown entries, embeddings	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	91aac83c6a	Support incremental update of Beancount transactions, embeddings	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	cfaf7aa6f4	Update Indexing Performance Section in Readme	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	b01b4d7daa	Extract logic to mark entries for embeddings update into helper function - This could be re-used by other text_to_jsonl converters like markdown, beancount	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	f97308bef2	Fix log message on writing JSONL data to file	2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky	899bfc5c3e	Test incremental update triggered on calling text_search.setup - Previously updates to index required explicitly setting `regenerate=True` - Now the incremental update check is made every time `text_search.setup` is called - Test if index automatically updates when calling `text_search.setup` with new content even with `regenerate=False`	2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky	c17a0fd05b	Do not store word filters index to file. Not necessary for now - It's more of a hassle to not let word filter go stale on entry updates - Generating index on 120K lines of notes takes 1s. Loading from file takes 0.2s. For less content load time difference will be even smaller - Let go of startup time improvement for simplicity for now	2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky	91d11ccb49	Only hash compiled entry to identify new/updated entries to update - Comparing compiled entries is the appropriately narrow target to identify entries that need to encode their embedding vectors, given we pass the compiled form of the entry to the model for encoding - Hashing the whole entry along with its raw form was resulting in a bunch of entries being marked for update, as LINE: <entry_line_no> is a string added to each entry's raw format - This results in an update to a single entry causing all entries below it in the file to be marked for update (as all their line numbers have changed) - Log performance metrics for steps to convert org entries to jsonl	2022-09-10 21:01:44 +03:00
Debanjum Singh Solanky	b9a6e80629	Make OrgNode tags stable sorted to find new entries for incremental updates - Having Tags as sets was returning them in a different order every time - This resulted in spuriously identifying existing entries as new because their tags ordering changed - Converting tags to list fixes the issue and identifies updated new entries for incremental update correctly	2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky	2f7a6af56a	Support incremental update of org-mode entries and embeddings - What - Hash the entries and compare to find new/updated entries - Reuse embeddings encoded for existing entries - Only encode embeddings for updated or new entries - Merge the existing and new entries and embeddings to get the updated entries, embeddings - Why - Given most note text entries are expected to be unchanged across time. Reusing their earlier encoded embeddings should significantly speed up embeddings updates - Previously we were regenerating embeddings for all entries, even if they had existed in previous runs	2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky	ec675d27d3	Suppress non-actionable HuggingFace FutureWarning shown on app start	2022-09-10 16:43:14 +03:00
Debanjum Singh Solanky	1ac6a71ff0	Add --version flag to show installed version of khoj	2022-09-10 16:40:19 +03:00
Debanjum	372dcd2dbc	Handle Empty Org Files or Org Files with No Headings ### Main Changes - `bf01a4f` Use filename or "#+TITLE" as heading for 0th level content in org files - `d6bd7bf` Fix initializing `OrgNode` `level` to string to parse org files with no headings - `d835467` Throw exception if no valid entries found in specified content files ### Miscellaneous Improvements - `7df39e5` Reuse search models across `pytest` sessions. Merge unused pytest fixtures - `2dc0588` Do not normalize absolute filenames for entry links in `OrgNode` - `e00bb53` Init word filter dictionary with default value as set to simplify code Resolves #83	2022-09-10 12:42:07 +00:00
Debanjum Singh Solanky	976397bd82	Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings	2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky	2b58218b56	Reuse search models across sessions. Merge unused pytest fixtures - Remove unused model_dir pytest fixture. It was only being used by the content_config fixture, not by any tests - Reuse existing search models downloaded to khoj directory. Downloading search models for each pytest sessions seems excessive and slows down tests quite a bit	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	11917c6ddd	Do not normalize absolute filenames for creating links in OrgNode	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	07b98d35f1	Use filename or #+TITLE as heading for 0th level content in org files - Set LINE, SOURCE link properties in property drawer correctly for content which falls under no heading - See Issue #83 for more details	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	d6bd7bf3e1	Fix initializing OrgNode level to string to parse org files - Parsed `level` argument passed to OrgNode during init is expected to be a string, not an integer - This was resulting in app failure only when parsing org files with no headings, like in issue #83, as level is set to string of `*`s the moment a heading is found in the current file	2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky	d835467f2c	Throw exception if no valid entries found in specified content files - Previously we were failing while computing embeddings if there were no valid entries. This was obscuring the actual issue of no valid entries found in the specified content files - Throwing an exception early with a clear message when no entries are found should clarify the issue to be fixed - See issue #83 for details	2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky	e00bb53336	Init word filter dictionary with default value as set to simplify code	2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky	4d776d9c7a	Bump khoj version to 0.1.9	2022-09-09 07:50:15 +03:00
Debanjum	b58b7d7483	Create App Directory, Fix Initialization GUI on First Run - `588f598` Pass empty list of `input_files` to `FileBrowser` on first run - `3ddffdf` Create config directory before setting up logging to file under it Resolves #78 Resolves #79 Resolves #80	2022-09-09 04:40:22 +00:00
Debanjum Singh Solanky	588f598949	Pass empty list of `input_files' to FileBrowser on first run - Default config has `input_files' set to None - This was being passed to `FileBrowser' on Initialization - But `FileBrowser' expects `content_files' of list type, not None - This resulted in an unexpected NoneType failure	2022-09-09 07:26:40 +03:00
Debanjum Singh Solanky	3ddffdfba4	Create config directory before setting up logging to file under it - The logging to file code expects the config directory to already be setup - But parent directory of config file was being set up later in code - This resulted in app start failing with ~/.khoj dir does not exist error	2022-09-09 07:21:42 +03:00
Debanjum	79894efc7a	Resolve GUI Issues in Docker Build - `17354aa` Install `pyqt` system package in Docker image to get qt dependencies - `5d3aeba` Do not start GUI when Khoj started from Docker - `26ff66f` (Re-)Enable image search via Docker image as image search issues fixed Resolves #76	2022-09-08 07:55:06 +00:00
Debanjum Singh Solanky	26ff66f38b	(Re-)Enable image search via Docker image as image search issues fixed	2022-09-08 10:42:34 +03:00


1466 commits