Commit graph

3497 commits

Author SHA1 Message Date
Debanjum Singh Solanky
a9cfd8b800 Extract hash func for incremental text indexing into separate method 2022-10-26 13:56:58 +05:30
Debanjum Singh Solanky
0de2ff9c97 Add __init__.py to routers directory to register it as a package 2022-10-25 20:40:40 +05:30
Debanjum Singh Solanky
55d2fea9be Move Custom Formatter class for logger to util.helper module from main.py 2022-10-20 00:32:24 +05:30
Debanjum
5ed17ccbd7
Modularize, Improve API. Formalize Intermediate Text Content Format. Add Type Checking
- **Improve API Endpoints**
  - ee65a4f Merge /reload, /regenerate into single /update API endpoint
  - 9975497 Type the /search API response to better document the response schema
  - 0521ea1 Put image score breakdown under `additional` field in search response
- **Formalize Intermediary Format to Index Text Content**
  - 7e9298f Use new Text `Entry` class to track text entries in Intermediate Format
  - 02d9440 Use Base `TextToJsonl` class to standardize `<text>_to_jsonl` processors
- **Modularize API router code**
  - e42a38e Split router code into `web_client`, `api`, `api_beta` routers. Version Khoj API
  - d292bdc Remove API versioning. Premature given current state of the codebase
- **Miscellaneous**
  - c467df8 Setup `mypy` for static type checking
  - 2c54813 Remove unused imports, `embeddings` variable from text search tests
2022-10-19 11:23:04 +00:00
Debanjum Singh Solanky
1c40f97114 Merge branch 'master' of github.com:debanjum/khoj into modularize-api-and-increase-typing
- Conflicts:
  - src/interface/emacs/khoj.el
    Use our update to `config-url', use their `url-request-method'
2022-10-19 16:46:53 +05:30
Debanjum Singh Solanky
e1b5a87920 Rename Frontend Router to Web Client. Fix logger usage in routers
- Use logger in api_beta router instead of print statements
- Remove unused logger in web client router
2022-10-19 16:36:48 +05:30
Debanjum
4abd51cb04
Merge pull request #99 from telotortium/method
Explicitly set `url-request-method' to GET in khoj.el
2022-10-19 10:31:37 +00:00
Debanjum
74f32eedb8
Remove Exiftool Dependency. Ignore legacy model warning on app start
- bf1ae038cb Get XMP metadata from image using `Pillow`. Remove `ExifTool` dependency
  - Pillow library is already used in Khoj and it can extract XMP Metadata from Images
  - Reduce unmaintained dependencies by using Pillow instead of Exiftool
  - Pillow is much better maintained than my fork of the Exiftool python package
- c16ae9e344 Ignore *"Legacy way to download model"* warning for upstream dependency
2022-10-08 14:57:58 +00:00
Debanjum Singh Solanky
c467df8fa3 Setup `mypy' for static type checking 2022-10-08 17:33:13 +03:00
Debanjum Singh Solanky
d292bdcc11 Do not version API. Premature given current state of the codebase
- Reason
  - All clients that currently consume the API are part of Khoj
  - Any breaking API changes will be fixed in clients immediately
  - So decoupling client from API is not required
  - This removes the burden of maintaining muliple versions of the API
2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky
2c548133f3 Remove unused imports, `embeddings' variable from text search tests 2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given it's usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
99754970ab Type the /search API response to better document the response schema
- Both Text, Image Search were already giving list of entry, score
- This change just concretizes this change and exposes this in the API
  documentation (i.e OpenAPI, Swagger, Redocs)
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
0521ea10d6 Put image score breakdown under `additional' field in search response
- Update web, emacs interfaces to consume the scores from new schema
2022-10-08 12:06:01 +03:00
Debanjum Singh Solanky
e42a38e825 Version Khoj API, Update frontends, tests and docs to reflect it
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
  under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
2022-09-28 20:08:38 +03:00
Robert Irelan
d25e1d8e86
fix: explicitly set url-request-method
In my installation, it appears that `url-request-method` is sometimes set
globally to POST.  Need to explicitly set it to ensure that GET is always
used as intended.
2022-09-19 15:46:46 -04:00
Debanjum Singh Solanky
ee65a4f2c7 Merge /reload, /regenerate into single /update API endpoint
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
2022-09-16 00:53:19 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
c16ae9e344 Ignore "Legacy way to download model" warning for upstream dependency 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
3169e3b78e Use ellipsis instead of pass in base filter abstract methods for aesthetic 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
bf1ae038cb Get XMP metadata from image using Pillow. Remove ExifTool dependency
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
  - This also removes dependency on system Exiftool package for
    XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
2022-09-16 00:48:45 +03:00
Saba
a53094ec92 Add workflow dispatch support in build.yml
- To support dispatch, set the image label based on the branch name
- Master build should still be tagged with latest to get benefit of the standard production Docker label
2022-09-15 20:28:41 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
be57c711fd Revert OrgNode.hasTag func to method instead of property as accepts argument 2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capabilitity that needlessly increases
  maintainance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
1680a617da Reflect updates to query and results count in URL
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
  - Prevents error when try process results of empty query
2022-09-13 23:39:24 +03:00
Debanjum Singh Solanky
34314e859a Call /reload instead of /regenerate API to update index from web interface
- As `/reload` updates index incrementally, it's relatively quick
- This makes exposing `/reload` endpoint a better default to expose
  via the web interface than `the /regenerate' endpoint
2022-09-12 23:39:10 +03:00
Debanjum Singh Solanky
13b5d5082f Create input field to set results count on the web interface
Resolves #96
2022-09-12 23:24:46 +03:00
Debanjum Singh Solanky
0ce0c00090 Bump khoj version to 0.1.10 2022-09-12 23:03:22 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky
afc84de234 Make word filter regex explicit. Allow hyphen in word filters
Helps with #88
2022-09-12 17:05:29 +03:00
Debanjum
3d86d763c5
Support Multiple Input Filters to Configure Content to Index
- 536f03a Process text content files in sorted order for stable indexing
- a701ad0 Support multiple input-filters to configure content to index via `khoj.yml`

Resolves #84
2022-09-12 08:19:52 +00:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
940c8fac8c Use app LRU, not functools LRU decorator, to cache search results in router
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller not
  wrapped in generically named function when using functools LRU decorator
2022-09-12 09:38:48 +03:00
Debanjum Singh Solanky
c6fa09d8fc Fix querying with include word filter from web interface
- Not encoding the `query' string before querying the backend API with
  it was causing the "+" prefix for include word filter to be lost
2022-09-12 09:27:02 +03:00
Debanjum Singh Solanky
1502fbc9e9 Add index_heading_entries flag to default and sample khoj configs 2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky
7216cdff58 Add Date, Word filter for Org-Music content 2022-09-11 17:29:34 +03:00
Debanjum
182fbbd8df
Allow Indexing Heading Entries. Improve Org, TextToJsonl Parser
### Summary
- Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body

### Main Changes
#### Make Indexing Org-Mode Entries with Empty Body Configurable
- 253c9ea Set `index_heading_entries` field in `khoj.yml` to index entries with no body

### Fix, Improve OrgNode, TextToJsonl Parser
- 9d369ae Fix `OrgNode` render of entries with property drawers and empty body
- 1d3b3d5 Convert field get/set methods in `OrgNode` class to `@property`
- db37e38 Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks
- b4878d7 Extract entries from scratch when regenerate requested
- 52e3dd9 Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods
- e951ba3 Raise exception when org file not found

Resolves #87
2022-09-11 13:46:11 +00:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky
db37e38df7 Create OrgNode hasBody method. Use it in org_to_jsonl checks 2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
b4878d76ea Extract entries from scratch when regenerate requested
- Do not rely on previously extracted entries to find new entries in
regenerate scenario
2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum
c415af32d5
Support Incremental Update of Entries, Embeddings for OrgMode, Markdown, Beancount Content
### Major Changes
  - 030fab9 Support incremental update of **Markdown** entries, embeddings
  - 91aac83 Support incremental update of **Beancount** transactions, embeddings
  - 2f7a6af Support incremental update of **Org-Mode** entries, embeddings
    - Encode embeddings for updated or new entries
    - Reuse embeddings encoded for existing entries earlier
    - Merge the existing and new entries and embeddings to get the updated entries, embeddings
  - 91d11cc Only hash compiled entry to identify new/updated entries to update
  - b9a6e80 Make OrgNode tags stable sorted to find new entries for incremental updates

### Minor Changes
  - c17a0fd Do not store word filters index to file. Not necessary for now
  - 4eb84c7 Log performance metrics for jsonl conversion
  - 2e1bbe0 Fix striping empty escape sequences from strings

### Why
  - Encoding embeddings is the slowest step to index content
  - Previously we regenerated embeddings for all entries, even if they existed in previous runs
  - Reusing previously generated embeddings should significantly speed up index updates,
    given most user generated content can be expected to be unchanged across time

Resolves #36
2022-09-10 21:38:05 +00:00
Debanjum Singh Solanky
9b2845de06 Add basic tests for beancount to jsonl conversion 2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky
d3267554ae Add basic tests for markdown to jsonl conversion 2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix striping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00