- **Improve API Endpoints**
- ee65a4f Merge /reload, /regenerate into single /update API endpoint
- 9975497 Type the /search API response to better document the response schema
- 0521ea1 Put image score breakdown under `additional` field in search response
- **Formalize Intermediary Format to Index Text Content**
- 7e9298f Use new Text `Entry` class to track text entries in Intermediate Format
- 02d9440 Use Base `TextToJsonl` class to standardize `<text>_to_jsonl` processors
- **Modularize API router code**
- e42a38e Split router code into `web_client`, `api`, `api_beta` routers. Version Khoj API
- d292bdc Remove API versioning. Premature given current state of the codebase
- **Miscellaneous**
- c467df8 Setup `mypy` for static type checking
- 2c54813 Remove unused imports, `embeddings` variable from text search tests
- bf1ae038cb Get XMP metadata from image using `Pillow`. Remove `ExifTool` dependency
- Pillow library is already used in Khoj and it can extract XMP Metadata from Images
- Reduce unmaintained dependencies by using Pillow instead of Exiftool
- Pillow is much better maintained than my fork of the Exiftool python package
- c16ae9e344 Ignore *"Legacy way to download model"* warning for upstream dependency
- Reason
- All clients that currently consume the API are part of Khoj
- Any breaking API changes will be fixed in clients immediately
- So decoupling client from API is not required
- This removes the burden of maintaining multiple versions of the API
- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given its usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries` class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries` objects
instead of as dictionaries
- `text_search` and `text_to_jsonl` methods are annotated with
type hints for the new `Entries` type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update` method into `TextToJsonl` base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl` classes
- Avoid circular reference issues on importing `Entries` class
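The move from dictionary access to class style access can be pictured with a minimal sketch. The class is named `Entry` here, and its fields (`raw`, `compiled`, `file`) and helper methods are assumptions for illustration, not the actual Khoj definitions.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Entry:
    """Hypothetical text entry in the intermediate format."""
    raw: str                    # original text of the entry
    compiled: str               # text prepared for embedding/search
    file: Optional[str] = None  # source file, if known

    @classmethod
    def from_dict(cls, data: dict) -> "Entry":
        # Build an entry from the earlier dictionary representation
        return cls(raw=data["raw"], compiled=data["compiled"], file=data.get("file"))

    def to_json(self) -> str:
        # Serialize to one JSON line, as would be written to a jsonl file
        return json.dumps(asdict(self), ensure_ascii=False)


entry = Entry.from_dict({"raw": "* Heading\nBody", "compiled": "Heading. Body"})
print(entry.compiled)  # attribute access replaces entry["compiled"]
```

A typed class like this lets `mypy` catch misspelled field names that a dictionary lookup would only reveal at runtime.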
- Both Text, Image Search were already returning a list of entry, score
- This change just formalizes that structure and exposes it in the API
documentation (i.e. OpenAPI, Swagger, ReDoc)
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mention, reference only versioned api endpoints
In my installation, it appears that `url-request-method` is sometimes set
globally to POST. Need to explicitly set it to ensure that GET is always
used as intended.
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
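As a rough illustration of the two call styles, the sketch below builds the request URL for either case. The `build_update_url` helper, the host and the port are hypothetical; only the `/update` path and the `force=true` parameter come from the notes above.

```python
from urllib.parse import urlencode


def build_update_url(base_url: str, force: bool = False) -> str:
    """Return the /update URL; base_url and this helper are hypothetical."""
    url = f"{base_url.rstrip('/')}/update"
    if force:
        # force=true asks the server to regenerate the index from scratch
        url += "?" + urlencode({"force": "true"})
    return url


print(build_update_url("http://localhost:8000"))              # incremental update
print(build_update_url("http://localhost:8000", force=True))  # full regeneration
```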
- Start standardizing implementation of the `text_to_jsonl` processors
- `text_to_jsonl` scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl` processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
- To support dispatch, set the image label based on the branch name
- Master build should still be tagged with latest to get benefit of the standard production Docker label
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search when query only contains whitespace
- Prevents error when trying to process results of an empty query
- As `/reload` updates index incrementally, it's relatively quick
- This makes `/reload` a better default endpoint to expose
via the web interface than the `/regenerate` endpoint
- For queries with only filters in them, short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
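The short-circuit for filter-only queries can be sketched as follows. The `+"word"` filter syntax, the `search` signature and the `semantic_search` callback are assumptions made for this example, not the actual Khoj search code.

```python
import re
from typing import Callable, List

# Assumed filter syntax for illustration: +"word" keeps entries containing word
FILTER_PATTERN = re.compile(r'\+"([^"]+)"')


def search(
    query: str,
    entries: List[str],
    semantic_search: Callable[[str, List[str]], List[str]],
) -> List[str]:
    required_words = FILTER_PATTERN.findall(query)
    filtered = [e for e in entries if all(w.lower() in e.lower() for w in required_words)]

    # Whatever remains after removing the filter terms is the semantic query
    remaining_query = FILTER_PATTERN.sub("", query).strip()
    if not remaining_query:
        # Filter-only query: skip semantic search and re-ranking
        return filtered
    return semantic_search(remaining_query, filtered)


calls = []


def fake_semantic_search(query, candidates):
    # Stand-in for the expensive semantic search step; records each call
    calls.append(query)
    return candidates[:1]


notes = ["Buy milk today", "Plan trip to Lisbon", "Milk prices rose"]
print(search('+"milk"', notes, fake_semantic_search))  # semantic search is not called
```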
- 536f03a Process text content files in sorted order for stable indexing
- a701ad0 Support multiple input-filters to configure content to index via `khoj.yml`
Resolves #84
- Image search already uses a sorted list of images to process
- Prevents the index of entries from desyncing when entries, embeddings
are generated by a separate server/app instance
- Update existing code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl` get files methods to work with combination of
`input-files` and `input-filters`
Resolves #84
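One way to combine `input-files` with a list of `input-filters`, and to return the result in sorted order for stable indexing, is sketched below. The `get_files` helper and its behaviour (glob patterns, de-duplication) are assumptions for illustration and may differ from the actual `text_to_jsonl` implementation.

```python
import glob
import os
import tempfile
from typing import List, Optional


def get_files(
    input_files: Optional[List[str]] = None,
    input_filters: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: combine explicit files with files matched by glob filters."""
    files = set(input_files or [])
    for pattern in input_filters or []:
        files.update(glob.glob(pattern))
    # Sorted order keeps entry indexing stable across runs
    return sorted(files)


with tempfile.TemporaryDirectory() as root:
    for name in ["b.org", "a.org", "notes.md"]:
        open(os.path.join(root, name), "w").close()
    found = get_files(
        input_files=[os.path.join(root, "notes.md")],
        input_filters=[os.path.join(root, "*.org"), os.path.join(root, "a*")],
    )
    names = [os.path.basename(f) for f in found]

print(names)
```

Files matched by more than one filter appear once, since the matches are collected in a set before sorting.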
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search` controller is not
wrapped in a generically named function, as it was when using the functools LRU decorator
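A minimal sketch of an explicit cache that offers these benefits is shown below. The `LRUCache` class, the `search` controller signature and the `update_index` hook are hypothetical names for illustration; the actual Khoj cache may be structured differently.

```python
import logging
from collections import OrderedDict
from typing import Callable, Hashable, List

logger = logging.getLogger(__name__)


class LRUCache:
    """Small least-recently-used cache that can be cleared explicitly."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key: Hashable):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: Hashable, value) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def clear(self) -> None:
        self._store.clear()


query_cache = LRUCache(capacity=2)


def search(query: str, run_search: Callable[[str], List[str]]) -> List[str]:
    # Hypothetical controller: the cache is consulted inside the function body,
    # so the function itself is not wrapped by a decorator
    cached = query_cache.get(query)
    if cached is not None:
        logger.info("Returning results for %r from cache", query)
        return cached
    results = run_search(query)
    query_cache.put(query, results)
    return results


def update_index() -> None:
    # Invalidate cached results whenever entries, embeddings change
    query_cache.clear()


calls = []


def fake_search(query):
    # Stand-in for the real search; records each invocation
    calls.append(query)
    return [query.upper()]


search("notes", fake_search)
search("notes", fake_search)
print(len(calls))  # the second call is served from the cache
```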
### Summary
- Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body
### Main Changes
#### Make Indexing Org-Mode Entries with Empty Body Configurable
- 253c9ea Set `index_heading_entries` field in `khoj.yml` to index entries with no body
#### Fix, Improve OrgNode, TextToJsonl Parser
- 9d369ae Fix `OrgNode` render of entries with property drawers and empty body
- 1d3b3d5 Convert field get/set methods in `OrgNode` class to `@property`
- db37e38 Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks
- b4878d7 Extract entries from scratch when regenerate requested
- 52e3dd9 Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods
- e951ba3 Raise exception when org file not found
Resolves #87
- Issue
- Indent regex was previously catching escape sequences like newlines
- This resulted in entries with only escape sequences in body being
prepended to property drawers etc. during rendering
- Fix
- Update indent regex to only look for spaces in each line
- Only render body when body contains non-escape characters
- Create test to prevent this regression from silently resurfacing
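The difference between the two indent patterns can be seen in a small sketch. Both regexes and the `strip_indent` and `has_body` helpers are assumptions for illustration; the actual patterns in `OrgNode` may differ.

```python
import re

# Assumed patterns for illustration; the actual regexes in OrgNode may differ
old_indent_regex = re.compile(r"^\s*")  # \s also matches \n, \t, \r
new_indent_regex = re.compile(r"^ *")   # matches leading spaces only


def strip_indent(line: str, indent_regex: re.Pattern) -> str:
    # Remove leading indentation as defined by the given regex
    return indent_regex.sub("", line)


def has_body(body: str) -> bool:
    # A body made up only of whitespace/escape characters counts as empty
    return body.strip() != ""


line = "\n  text"
print(repr(strip_indent(line, old_indent_regex)))  # newline is swallowed as indent
print(repr(strip_indent(line, new_indent_regex)))  # newline is left untouched
```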
- Previously heading entries were not indexed to maintain search quality
- But there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries` field to index heading entries
- This `TextContentConfig` field is currently only used for OrgMode content
- Let the specific `text_to_jsonl` method decide which of the
`TextContentConfig` fields it needs to convert `<text>` type to jsonl
- This simplifies extending `TextContentConfig` for a specific type without
modifying all `text_to_jsonl` methods
- It keeps the number of args being passed to the `text_to_jsonl`
methods in check
### Major Changes
- 030fab9 Support incremental update of **Markdown** entries, embeddings
- 91aac83 Support incremental update of **Beancount** transactions, embeddings
- 2f7a6af Support incremental update of **Org-Mode** entries, embeddings
- Encode embeddings for updated or new entries
- Reuse embeddings encoded for existing entries earlier
- Merge the existing and new entries and embeddings to get the updated entries, embeddings
- 91d11cc Only hash compiled entry to identify new/updated entries to update
- b9a6e80 Make OrgNode tags stable sorted to find new entries for incremental updates
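The incremental update described above can be sketched as follows: hash each compiled entry, encode only the entries whose hash is new, and merge them with the reused embeddings. The function names, the choice of SHA-256 and the `encode` callback are assumptions for illustration, not the actual Khoj implementation.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def entry_hash(compiled: str) -> str:
    # Hash only the compiled form of the entry (hash function is an assumption)
    return hashlib.sha256(compiled.encode("utf-8")).hexdigest()


def incremental_update(
    current_entries: List[str],
    previous: Dict[str, List[float]],
    encode: Callable[[List[str]], List[List[float]]],
) -> Tuple[List[str], List[List[float]], int]:
    """Return entries, aligned embeddings, and the number of newly encoded entries.

    `previous` maps the hash of each previously indexed entry to its embedding.
    """
    hashes = [entry_hash(e) for e in current_entries]
    new_entries = [e for e, h in zip(current_entries, hashes) if h not in previous]

    # Encode only the new or updated entries
    new_embeddings = {}
    if new_entries:
        for entry, embedding in zip(new_entries, encode(new_entries)):
            new_embeddings[entry_hash(entry)] = embedding

    # Merge reused and newly encoded embeddings, in the order of current entries
    merged = [previous[h] if h in previous else new_embeddings[h] for h in hashes]
    return current_entries, merged, len(new_entries)


def fake_encode(texts):
    # Stand-in for the embedding model: a one-dimensional "embedding" per text
    return [[float(len(t))] for t in texts]


old = {entry_hash("alpha"): [0.1]}
entries, embeddings, encoded = incremental_update(["alpha", "beta"], old, fake_encode)
print(embeddings, encoded)
```

Entries that no longer exist simply drop out, because the merged list is built from the current entries only.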
### Minor Changes
- c17a0fd Do not store word filters index to file. Not necessary for now
- 4eb84c7 Log performance metrics for jsonl conversion
- 2e1bbe0 Fix stripping empty escape sequences from strings
### Why
- Encoding embeddings is the slowest step to index content
- Previously we regenerated embeddings for all entries, even if they existed in previous runs
- Reusing previously generated embeddings should significantly speed up index updates,
given most user generated content can be expected to be unchanged across time
Resolves #36