- Remove property drawer from test entry for max_words splitting test
- Property drawer is not required for the test
- Keep minimal test case to reduce chance for confusion
- Required because entries are now split by the max_word count supported
by the ML models
- This would now result in potentially duplicate hits, entries being
returned to user
- Do deduplication after ranking to get the top ranked deduplicated
results
- The instructions suggest installing khoj-assistant via pip install.
This installs the latest tagged/release version of khoj
- To match that version user should install khoj.el from MELPA stable
instead of MELPA
- Issue
ML Models truncate entries exceeding some max token limit.
This lowers the quality of search results
- Fix
Split entries by max tokens before indexing.
This should improve searching for content in longer entries.
- Miscellaneous
- Test method to split entries by max tokens
Update readme to ask user to install khoj.el from MELPA when a
pre-release version of the main khoj app is installed. Else install
khoj.el from MELPA Stable
- Reason
- All clients that currently consume the API are part of Khoj
- Any breaking API changes will be fixed in clients immediately
- So decoupling client from API is not required
- This removes the burden of maintaining muliple versions of the API
- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given it's usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries' class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries' objects
instead of as dictionaries
- `text_search' and `text_to_jsonl' methods are annotated with
type hints for the new `Entries' type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update' method into `TextToJsonl' base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl' classes
- Avoid circular reference issues on importing `Entries' class
- Both Text, Image Search were already giving list of entry, score
- This change just concretizes this change and exposes this in the API
documentation (i.e OpenAPI, Swagger, Redocs)
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
In my installation, it appears that `url-request-method` is sometimes set
globally to POST. Need to explicitly set it to ensure that GET is always
used as intended.
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl; scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl; processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
- Prevents error when try process results of empty query
- As `/reload` updates index incrementally, it's relatively quick
- This makes exposing `/reload` endpoint a better default to expose
via the web interface than `the /regenerate' endpoint
- For queries with only filters in them short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
generated by a separate server/app instance
- Update existings code, tests to process input-filters as list
instead of str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves#84
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller not
wrapped in generically named function when using functools LRU decorator
- Issue
- Indent regex was previously catching escape sequences like newlines
- This was resulting in entries with only escape sequences in body to
be prepended to property drawers etc during rendering
- Fix
- Update indent regex to only look for spaces in each line
- Only render body when body contains non-escape characters
- Create test to prevent this regression from silently resurfacing
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content