- 536f03a Process text content files in sorted order for stable indexing
- a701ad0 Support multiple input-filters to configure content to index via `khoj.yml`
Resolves#84
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
generated by a separate server/app instance
- Update existings code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves#84
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller not
wrapped in generically named function when using functools LRU decorator
### Summary
- Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body
### Main Changes
#### Make Indexing Org-Mode Entries with Empty Body Configurable
- 253c9ea Set `index_heading_entries` field in `khoj.yml` to index entries with no body
### Fix, Improve OrgNode, TextToJsonl Parser
- 9d369ae Fix `OrgNode` render of entries with property drawers and empty body
- 1d3b3d5 Convert field get/set methods in `OrgNode` class to `@property`
- db37e38 Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks
- b4878d7 Extract entries from scratch when regenerate requested
- 52e3dd9 Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods
- e951ba3 Raise exception when org file not found
Resolves#87
- Issue
- Indent regex was previously catching escape sequences like newlines
- This was resulting in entries with only escape sequences in body to
be prepended to property drawers etc during rendering
- Fix
- Update indent regex to only look for spaces in each line
- Only render body when body contains non-escape characters
- Create test to prevent this regression from silently resurfacing
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
- Let the specific text_to_jsonl method decide which of the
TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
methods in check
### Major Changes
- 030fab9 Support incremental update of **Markdown** entries, embeddings
- 91aac83 Support incremental update of **Beancount** transactions, embeddings
- 2f7a6af Support incremental update of **Org-Mode** entries, embeddings
- Encode embeddings for updated or new entries
- Reuse embeddings encoded for existing entries earlier
- Merge the existing and new entries and embeddings to get the updated entries, embeddings
- 91d11cc Only hash compiled entry to identify new/updated entries to update
- b9a6e80 Make OrgNode tags stable sorted to find new entries for incremental updates
### Minor Changes
- c17a0fd Do not store word filters index to file. Not necessary for now
- 4eb84c7 Log performance metrics for jsonl conversion
- 2e1bbe0 Fix striping empty escape sequences from strings
### Why
- Encoding embeddings is the slowest step to index content
- Previously we regenerated embeddings for all entries, even if they existed in previous runs
- Reusing previously generated embeddings should significantly speed up index updates,
given most user generated content can be expected to be unchanged across time
Resolves#36
- Previously updates to index required explicitly setting `regenerate=True`
- Now incremental update check made everytime on `text_search.setup` now
- Test if index automatically updates when call `text_search.setup`
with new content even with `regenerate=False`
- It's more of a hassle to not let word filter go stale on entry
updates
- Generating index on 120K lines of notes takes 1s. Loading from file
takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
- Comparing compiled entries is the appropriately narrow target to
identify entries that need to encode their embedding vectors. Given we
pass the compiled form of the entry to the model for encoding
- Hashing the whole entry along with it's raw form was resulting in a
bunch of entries being marked for updated as LINE: <entry_line_no>
is a string added to each entries raw format.
- This results in an update to a single entry resulting in all entries
below it in the file being marked for update (as all their line
numbers have changed)
- Log performance metrics for steps to convert org entries to jsonl
- Having Tags as sets was returning them in a different order
everytime
- This resulted in spuriously identifying existing entries as new
because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
entries for incremental update correctly
- What
- Hash the entries and compare to find new/updated entries
- Reuse embeddings encoded for existing entries
- Only encode embeddings for updated or new entries
- Merge the existing and new entries and embeddings to get the updated
entries, embeddings
- Why
- Given most note text entries are expected to be unchanged
across time. Reusing their earlier encoded embeddings should
significantly speed up embeddings updates
- Previously we were regenerating embeddings for all entries,
even if they had existed in previous runs
### Main Changes
- bf01a4f Use filename or "#+TITLE" as heading for 0th level content in org files
- d6bd7bf Fix initializing `OrgNode` `level` to string to parse org files with no headings
- d835467 Throw exception if no valid entries found in specified content files
### Miscellaneous Improvements
- 7df39e5 Reuse search models across `pytest` sessions. Merge unused pytest fixtures
- 2dc0588 Do not normalize absolute filenames for entry links in `OrgNode`
- e00bb53 Init word filter dictionary with default value as set to simplify code
Resolves#83
- Remove unused model_dir pytest fixture. It was only being used by
the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
Downloading search models for each pytest sessions seems excessive and
slows down tests quite a bit
- Parsed `level` argument passed to OrgNode during init is expected to
be a string, not an integer
- This was resulting in app failure only when parsing org files with
no headings, like in issue #83, as level is set to string of `*`s
the moment a heading is found in the current file
- Previously we were failing if no valid entries while computing
embeddings. This was obscuring the actual issue of no valid entries
found in the specified content files
- Throwing an exception early with clear message when no entries found
should make clarify the issue to be fixed
- See issue #83 for details
- 588f598 Pass empty list of `input_files` to `FileBrowser` on first run
- 3ddffdf Create config directory before setting up logging to file under it
Resolves#78Resolves#79Resolves#80
- Default config has `input_files' set to None
- This was being passed to `FileBrowser' on Initialization
- But `FileBrowser' expects `content_files' of list type, not None
- This resulted in an unexpected NoneType failure
- The logging to file code expects the config directory to already be setup
- But parent directory of config file was being set up later in code
- This resulted in app start failing with ~/.khoj dir does not exist error
- 17354aa Install `pyqt` system package in Docker image to get qt dependencies
- 5d3aeba Do not start GUI when Khoj started from Docker
- 26ff66f (Re-)Enable image search via Docker image as image search issues fixed
Resolves#76