sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-29 10:23:02 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	2c548133f3	Remove unused imports, `embeddings' variable from text search tests	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	7e9298f315	Use new Text Entry class to track text entries in Intermediate Format - Context - The app maintains all text content in a standard, intermediate format - The intermediate format was loaded, passed around as a dictionary for easier, faster updates to the intermediate format schema initially - The intermediate format is reasonably stable now, given its usage by all 3 text content types currently implemented - Changes - Concretize text entries into `Entries' class instead of using dictionaries - Code is updated to load, pass around entries as `Entries' objects instead of as dictionaries - `text_search' and `text_to_jsonl' methods are annotated with type hints for the new `Entries' type - Code and Tests referencing entries are updated to use class style access patterns instead of the previous dictionary access patterns - Move `mark_entries_for_update' method into `TextToJsonl' base class - This is a more natural location for the method as it is only (to be) used by `text_to_jsonl' classes - Avoid circular reference issues on importing `Entries' class	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	99754970ab	Type the /search API response to better document the response schema - Both Text, Image Search were already giving list of entry, score - This change just concretizes this change and exposes this in the API documentation (i.e OpenAPI, Swagger, Redocs)	2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky	0521ea10d6	Put image score breakdown under `additional' field in search response - Update web, emacs interfaces to consume the scores from new schema	2022-10-08 12:06:01 +03:00
Debanjum Singh Solanky	e42a38e825	Version Khoj API, Update frontends, tests and docs to reflect it - Split router.py into v1.0, beta and frontend (no-prefix) api modules under new router package. Version tag in main.py via prefix - Update frontends to use the versioned api endpoints - Update tests to work with versioned api endpoints - Update docs to mention, reference only versioned api endpoints	2022-09-28 20:08:38 +03:00
Debanjum Singh Solanky	ee65a4f2c7	Merge /reload, /regenerate into single /update API endpoint - Pass force=true to /update API to force regenerating index from scratch - Otherwise calls to the /update API endpoint will result in an incremental update to index	2022-09-16 00:53:19 +03:00
Debanjum Singh Solanky	02d944030f	Use Base TextToJsonl class to standardize <text>_to_jsonl processors - Start standardizing implementation of the `text_to_jsonl' processors - `text_to_jsonl' scripts already had a shared structure - This change starts to codify that implicit structure - Benefits - Ease adding more `text_to_jsonl' processors - Allow merging shared functionality - Help with type hinting - Drawbacks - Lower agility to change. But this was already an implicit issue as the text_to_jsonl processors got more deeply wired into the app	2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky	c16ae9e344	Ignore "Legacy way to download model" warning for upstream dependency	2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky	3169e3b78e	Use ellipsis instead of pass in base filter abstract methods for aesthetic	2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky	bf1ae038cb	Get XMP metadata from image using Pillow. Remove ExifTool dependency - Pillow already supports reading XMP metadata from Images - Removes need to maintain my fork of unmaintained PyExiftool - This also removes dependency on system Exiftool package for XMP metadata extraction - Add test to verify XMP metadata extracted from test images - Remove references to Exiftool from Documentation	2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky	8f57a62675	Remove unused imports. Fix typing and indentation - Typing issues discovered using `mypy'. Fixed manually - Unused imports discovered and fixed using `autoflake' - Fix indentation in `org_to_jsonl' manually	2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky	be57c711fd	Revert OrgNode.hasTag func to method instead of property as accepts argument	2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky	0109c7bd91	Disable ability to call <text>_to_jsonl, <type>_search packages directly - This code is de-synced with expected args by above scripts - Better to remove unused capability that needlessly increases maintenance burden	2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky	1680a617da	Reflect updates to query and results count in URL - Simplify tracking khoj query history, saving/sharing links - Do not execute search when query only contains whitespace - Prevents error when trying to process results of empty query	2022-09-13 23:39:24 +03:00
Debanjum Singh Solanky	34314e859a	Call /reload instead of /regenerate API to update index from web interface - As `/reload` updates index incrementally, it's relatively quick - This makes the `/reload` endpoint a better default to expose via the web interface than the `/regenerate` endpoint	2022-09-12 23:39:10 +03:00
Debanjum Singh Solanky	13b5d5082f	Create input field to set results count on the web interface Resolves #96	2022-09-12 23:24:46 +03:00
Debanjum Singh Solanky	0ce0c00090	Bump khoj version to 0.1.10	2022-09-12 23:03:22 +03:00
Debanjum Singh Solanky	1bfe9c4ef2	Handle filter only queries. Short-circuit and return filtered results - For queries with only filters in them short-circuit and return filtered results. No need to run semantic search, re-ranking. - Add client test for filter only query and quote query in client tests	2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky	afc84de234	Make word filter regex explicit. Allow hyphen in word filters Helps with #88	2022-09-12 17:05:29 +03:00
Debanjum	3d86d763c5	Support Multiple Input Filters to Configure Content to Index - `536f03a` Process text content files in sorted order for stable indexing - `a701ad0` Support multiple input-filters to configure content to index via `khoj.yml` Resolves #84	2022-09-12 08:19:52 +00:00
Debanjum Singh Solanky	536f03af8f	Process text content files in sorted order for stable indexing - Image search already uses a sorted list of images to process - Prevents index of entries to desync when entries, embeddings generated by a separate server/app instance	2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky	a701ad08b9	Support multiple input-filters to configure content to index via khoj.yml - Update existing code, tests to process input-filters as list instead of str - Test `text_to_jsonl' get files methods to work with combination of `input-files' and `input-filters' Resolves #84	2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky	940c8fac8c	Use app LRU, not functools LRU decorator, to cache search results in router - Provides more control to invalidate cache on update to entries, embeddings - Allows logging when results are being returned from cache etc - FastAPI, Swagger API docs look better as the `search' controller not wrapped in generically named function when using functools LRU decorator	2022-09-12 09:38:48 +03:00
Debanjum Singh Solanky	c6fa09d8fc	Fix querying with include word filter from web interface - Not encoding the `query' string before querying the backend API with it was causing the "+" prefix for include word filter to be lost	2022-09-12 09:27:02 +03:00
Debanjum Singh Solanky	1502fbc9e9	Add index_heading_entries flag to default and sample khoj configs	2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky	7216cdff58	Add Date, Word filter for Org-Music content	2022-09-11 17:29:34 +03:00
Debanjum	182fbbd8df	Allow Indexing Heading Entries. Improve Org, TextToJsonl Parser ### Summary - Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body ### Main Changes #### Make Indexing Org-Mode Entries with Empty Body Configurable - `253c9ea` Set `index_heading_entries` field in `khoj.yml` to index entries with no body ### Fix, Improve OrgNode, TextToJsonl Parser - `9d369ae` Fix `OrgNode` render of entries with property drawers and empty body - `1d3b3d5` Convert field get/set methods in `OrgNode` class to `@property` - `db37e38` Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks - `b4878d7` Extract entries from scratch when regenerate requested - `52e3dd9` Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods - `e951ba3` Raise exception when org file not found Resolves #87	2022-09-11 13:46:11 +00:00
Debanjum Singh Solanky	9d369ae4df	Fix OrgNode render of entries with property drawers and empty body - Issue - Indent regex was previously catching escape sequences like newlines - This was resulting in entries with only escape sequences in body to be prepended to property drawers etc during rendering - Fix - Update indent regex to only look for spaces in each line - Only render body when body contains non-escape characters - Create test to prevent this regression from silently resurfacing	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	253c9eae9a	Set index_heading_entries field in config to index entries with no body - Previously heading entries were not indexed to maintain search quality - But given that there are use-cases for indexing entries with no body - Add a configurable `index_heading_entries' field to index heading entries - This `TextContentConfig' field is currently only used for OrgMode content	2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky	1d3b3d5f39	Convert field get/set methods in OrgNode class to @property - Use more descriptive variable names in OrgNode parser and class - Convert OrgNode fields to private/protected, use property methods to get/set them	2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky	db37e38df7	Create OrgNode hasBody method. Use it in org_to_jsonl checks	2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky	b4878d76ea	Extract entries from scratch when regenerate requested - Do not rely on previously extracted entries to find new entries in regenerate scenario	2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky	52e3dd9835	Pass the whole TextContentConfig as argument to text_to_jsonl methods - Let the specific text_to_jsonl method decide which of the TextContentConfig fields it needs to convert <text> type to jsonl - This simplifies extending TextContentConfig for a specific type without modifying all text_to_jsonl methods - It keeps the number of args being passed to the `text_to_jsonl' methods in check	2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky	e951ba37ad	Raise exception when org file not found - No need to catch the IOError in OrgNode	2022-09-11 01:09:24 +03:00
Debanjum	c415af32d5	Support Incremental Update of Entries, Embeddings for OrgMode, Markdown, Beancount Content ### Major Changes - `030fab9` Support incremental update of Markdown entries, embeddings - `91aac83` Support incremental update of Beancount transactions, embeddings - `2f7a6af` Support incremental update of Org-Mode entries, embeddings - Encode embeddings for updated or new entries - Reuse embeddings encoded for existing entries earlier - Merge the existing and new entries and embeddings to get the updated entries, embeddings - `91d11cc` Only hash compiled entry to identify new/updated entries to update - `b9a6e80` Make OrgNode tags stable sorted to find new entries for incremental updates ### Minor Changes - `c17a0fd` Do not store word filters index to file. Not necessary for now - `4eb84c7` Log performance metrics for jsonl conversion - `2e1bbe0` Fix stripping empty escape sequences from strings ### Why - Encoding embeddings is the slowest step to index content - Previously we regenerated embeddings for all entries, even if they existed in previous runs - Reusing previously generated embeddings should significantly speed up index updates, given most user generated content can be expected to be unchanged across time Resolves #36	2022-09-10 21:38:05 +00:00
Debanjum Singh Solanky	9b2845de06	Add basic tests for beancount to jsonl conversion	2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky	d3267554ae	Add basic tests for markdown to jsonl conversion	2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky	2e1bbe0cac	Fix stripping empty escape sequences from strings - Fix log message on jsonl write	2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky	a7cf6c8458	Use dictionary instead of list to track entry to file maps	2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky	3e1323971b	Stack function calls in jsonl converters to avoid unneeded variables	2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky	4eb84c7f51	Log performance metrics for beancount, markdown to jsonl conversion	2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky	ebd5039bd1	Merge branch 'master' into support-incremental-updates-of-embeddings	2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky	ed8d432fdd	Clean-up generated file after image search test run - Clean-up unused imports in test files	2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky	030fab9bb2	Support incremental update of Markdown entries, embeddings	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	91aac83c6a	Support incremental update of Beancount transactions, embeddings	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	cfaf7aa6f4	Update Indexing Performance Section in Readme	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	b01b4d7daa	Extract logic to mark entries for embeddings update into helper function - This could be re-used by other text_to_jsonl converters like markdown, beancount	2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky	f97308bef2	Fix log message on writing JSONL data to file	2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky	899bfc5c3e	Test incremental update triggered on calling text_search.setup - Previously updates to index required explicitly setting `regenerate=True` - Now an incremental update check is made every time `text_search.setup` is called - Test if index automatically updates when calling `text_search.setup` with new content even with `regenerate=False`	2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky	c17a0fd05b	Do not store word filters index to file. Not necessary for now - It's more of a hassle to not let word filter go stale on entry updates - Generating index on 120K lines of notes takes 1s. Loading from file takes 0.2s. For less content load time difference will be even smaller - Let go of startup time improvement for simplicity for now	2022-09-10 21:01:54 +03:00
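
The incremental update described in c415af32d5 (hash each compiled entry, reuse embeddings for entries seen before, encode only new or changed ones, then merge) can be sketched roughly as follows. This is a minimal illustration and not the code in the repository: the names `entry_hash` and `incremental_update`, the choice of SHA-256, and the `encode` callback are all assumptions made for the example.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def entry_hash(compiled_entry: str) -> str:
    """Hash only the compiled entry text (hypothetical helper)."""
    return hashlib.sha256(compiled_entry.encode("utf-8")).hexdigest()


def incremental_update(
    current_entries: Sequence[str],
    previous_entries: Sequence[str],
    previous_embeddings: Sequence[Vector],
    encode: Callable[[List[str]], List[Vector]],
) -> Tuple[List[Vector], List[str]]:
    """Return embeddings aligned with current_entries, plus the entries that were encoded.

    `encode` stands in for the (slow) embedding model and is an assumption of this sketch.
    """
    # Map hash of each previously indexed entry to its existing embedding
    cached = {
        entry_hash(entry): embedding
        for entry, embedding in zip(previous_entries, previous_embeddings)
    }

    # Only entries whose hash was not seen before need to be encoded
    new_entries = [e for e in current_entries if entry_hash(e) not in cached]
    new_embeddings = encode(new_entries) if new_entries else []

    # Merge reused and freshly encoded embeddings, keeping the current entry order
    lookup = dict(cached)
    lookup.update(zip((entry_hash(e) for e in new_entries), new_embeddings))
    return [lookup[entry_hash(e)] for e in current_entries], new_entries
```

Entries removed from the source simply drop out of the result, since the returned embeddings are rebuilt from the current entries only.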

735 commits