Commit graph

3980 commits

Author SHA1 Message Date
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl' scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl' processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
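A minimal sketch of the kind of shared structure commit 02d944030f describes. Only the `TextToJsonl` class name comes from the commit title; the method names, the `config` dictionary and the `DemoToJsonl` subclass are assumptions for illustration, not the code in the repository.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class TextToJsonl(ABC):
    """Hypothetical base class for <text>_to_jsonl processors."""

    def __init__(self, config: Dict):
        self.config = config

    @abstractmethod
    def extract_entries(self) -> List[Dict]:
        """Each processor parses its own text format into entries."""
        ...

    def to_jsonl(self) -> str:
        # Shared functionality can live once in the base class
        return "".join(json.dumps(entry) + "\n" for entry in self.extract_entries())


class DemoToJsonl(TextToJsonl):
    """Illustrative processor: every configured line becomes one entry."""

    def extract_entries(self) -> List[Dict]:
        return [{"compiled": line, "raw": line} for line in self.config["lines"]]
```

A subclass only has to implement `extract_entries`; instantiating the base class directly raises `TypeError`, which makes the previously implicit structure explicit.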
Debanjum Singh Solanky
c16ae9e344 Ignore "Legacy way to download model" warning for upstream dependency 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
3169e3b78e Use ellipsis instead of pass in base filter abstract methods for aesthetics 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
bf1ae038cb Get XMP metadata from image using Pillow. Remove ExifTool dependency
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
  - This also removes dependency on system Exiftool package for
    XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
2022-09-16 00:48:45 +03:00
Saba
a53094ec92 Add workflow dispatch support in build.yml
- To support dispatch, set the image label based on the branch name
- Master build should still be tagged with latest to get benefit of the standard production Docker label
2022-09-15 20:28:41 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
be57c711fd Revert OrgNode.hasTag func to method instead of property as it accepts an argument 2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capability that needlessly increases
  maintenance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
1680a617da Reflect updates to query and results count in URL
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
  - Prevents error when trying to process results of an empty query
2022-09-13 23:39:24 +03:00
Debanjum Singh Solanky
34314e859a Call /reload instead of /regenerate API to update index from web interface
- As `/reload` updates index incrementally, it's relatively quick
- This makes the `/reload` endpoint a better default to expose
  via the web interface than the `/regenerate` endpoint
2022-09-12 23:39:10 +03:00
Debanjum Singh Solanky
13b5d5082f Create input field to set results count on the web interface
Resolves #96
2022-09-12 23:24:46 +03:00
Debanjum Singh Solanky
0ce0c00090 Bump khoj version to 0.1.10 2022-09-12 23:03:22 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
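The short-circuit in commit 1bfe9c4ef2 could be sketched as below. The `+"word"` / `-"word"` filter syntax, the `search` signature and the `semantic_search` callback are assumptions made for this example, not the actual router code.

```python
import re

# Assumed filter syntax for illustration: +"word" to require, -"word" to exclude
WORD_FILTER = re.compile(r'([+-])"([\w-]+)"')


def search(query, entries, semantic_search):
    filters = WORD_FILTER.findall(query)       # list of (sign, word) tuples
    text = WORD_FILTER.sub("", query).strip()  # query text left after removing filters
    required = {word for sign, word in filters if sign == "+"}
    blocked = {word for sign, word in filters if sign == "-"}

    candidates = [
        entry for entry in entries
        if required <= set(entry.split()) and not blocked & set(entry.split())
    ]

    # Filter-only query: skip semantic search and re-ranking entirely
    if not text:
        return candidates
    return semantic_search(text, candidates)
```

With a query such as `+"milk" -"sell"` the function returns the filtered entries without ever calling `semantic_search`.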
Debanjum Singh Solanky
afc84de234 Make word filter regex explicit. Allow hyphen in word filters
Helps with #88
2022-09-12 17:05:29 +03:00
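An explicit word filter pattern that also accepts hyphens might look like the following. The exact regex and the `+"word"` / `-"word"` syntax are assumptions for illustration; the commit itself only states that the regex was made explicit and that hyphens are now allowed.

```python
import re

# Explicit character class: letters, digits, underscore and hyphen
REQUIRED_WORD = re.compile(r'\+"([a-zA-Z0-9_-]+)"')
BLOCKED_WORD = re.compile(r'-"([a-zA-Z0-9_-]+)"')


def parse_word_filters(query):
    """Return the sets of required and blocked words found in the query."""
    return set(REQUIRED_WORD.findall(query)), set(BLOCKED_WORD.findall(query))
```

For example, `parse_word_filters('notes +"org-mode" -"to-do"')` picks up `org-mode` as required and `to-do` as blocked, which a pattern without the hyphen in its character class would miss.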
Debanjum
3d86d763c5
Support Multiple Input Filters to Configure Content to Index
- 536f03a Process text content files in sorted order for stable indexing
- a701ad0 Support multiple input-filters to configure content to index via `khoj.yml`

Resolves #84
2022-09-12 08:19:52 +00:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents the index of entries from desyncing when entries, embeddings
  are generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existing code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
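A minimal sketch of resolving files from a combination of `input-files` and a list of `input-filters`, processed in sorted order as in commit 536f03af8f. The `get_files` name and signature are hypothetical; treating each filter as a glob pattern is an assumption.

```python
import glob
import os


def get_files(input_files=None, input_filters=None):
    """Collect explicitly listed files plus every match of each filter."""
    files = set(input_files or [])
    for input_filter in input_filters or []:
        # Assumption: each input-filter is a glob pattern, ~ is allowed
        files.update(glob.glob(os.path.expanduser(input_filter)))
    # Sorted output keeps entry order stable across runs and instances
    return sorted(files)
```

Using a set avoids indexing a file twice when it is both listed explicitly and matched by a filter.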
Debanjum Singh Solanky
940c8fac8c Use app LRU, not functools LRU decorator, to cache search results in router
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller is not
  wrapped in a generically named function, as it was when using the functools LRU decorator
2022-09-12 09:38:48 +03:00
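An application-level LRU cache of the kind commit 940c8fac8c refers to could be sketched as follows. The `SearchCache` class and its method names are assumptions for illustration; the app's own LRU implementation may differ.

```python
from collections import OrderedDict


class SearchCache:
    """Small LRU cache the application owns and can invalidate itself."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def clear(self):
        # Call this when entries or embeddings are updated
        self._store.clear()
```

Because the cache is a plain object rather than a decorator, the router can log cache hits and call `clear()` after an index update.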
Debanjum Singh Solanky
c6fa09d8fc Fix querying with include word filter from web interface
- Not encoding the `query' string before querying the backend API with
  it was causing the "+" prefix for include word filter to be lost
2022-09-12 09:27:02 +03:00
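The web interface fix lives in the front-end code; the Python sketch below only illustrates the same effect with the standard library. An unencoded "+" in a query string is decoded as a space, so the include prefix disappears unless the query is percent-encoded first. The `q` parameter name and the `+"khoj"` filter syntax are assumptions.

```python
from urllib.parse import parse_qs, quote


def decode_query(query_string):
    """Decode the q parameter the way a server-side query parser would."""
    return parse_qs(query_string)["q"][0]


query = '+"khoj" notes'
unencoded = "q=" + query        # "+" is read back as a space
encoded = "q=" + quote(query)   # "+" is sent as %2B and survives
```

Decoding `unencoded` loses the leading "+", while decoding `encoded` returns the original query.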
Debanjum Singh Solanky
1502fbc9e9 Add index_heading_entries flag to default and sample khoj configs 2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky
7216cdff58 Add Date, Word filter for Org-Music content 2022-09-11 17:29:34 +03:00
Debanjum
182fbbd8df
Allow Indexing Heading Entries. Improve Org, TextToJsonl Parser
### Summary
- Set `index_heading_entries` field in `~/.khoj/khoj.yml` to `true` to index entries with empty body

### Main Changes
#### Make Indexing Org-Mode Entries with Empty Body Configurable
- 253c9ea Set `index_heading_entries` field in `khoj.yml` to index entries with no body

### Fix, Improve OrgNode, TextToJsonl Parser
- 9d369ae Fix `OrgNode` render of entries with property drawers and empty body
- 1d3b3d5 Convert field get/set methods in `OrgNode` class to `@property`
- db37e38 Create `OrgNode` `hasBody` method. Use it in `org_to_jsonl` checks
- b4878d7 Extract entries from scratch when regenerate requested
- 52e3dd9 Pass the whole `TextContentConfig` as argument to `text_to_jsonl` methods
- e951ba3 Raise exception when org file not found

Resolves #87
2022-09-11 13:46:11 +00:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
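The effect of the `index_heading_entries` flag could be sketched like this. The entry dictionaries and the `entries_to_index` helper are hypothetical; only the flag name comes from the commit.

```python
def entries_to_index(entries, index_heading_entries=False):
    """Keep entries with a body; keep heading-only entries only if enabled."""
    return [
        entry for entry in entries
        if index_heading_entries or entry["body"].strip()
    ]
```

With the default of `False`, heading-only entries are dropped as before; setting the flag to `True` keeps them.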
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
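A minimal sketch of the get/set to `@property` conversion. The fields shown are illustrative and this is not the actual `OrgNode` class; `hasTag` is kept as a regular method because a property cannot accept an argument.

```python
class OrgNode:
    """Illustrative node with protected fields exposed via properties."""

    def __init__(self, heading, tags=None):
        self._heading = heading
        self._tags = list(tags or [])

    @property
    def heading(self):
        return self._heading

    @heading.setter
    def heading(self, new_heading):
        self._heading = new_heading

    @property
    def tags(self):
        return self._tags

    def hasTag(self, tag):
        # Takes an argument, so it has to remain a method
        return tag in self._tags
```

Callers read and assign `node.heading` like a plain attribute while the class keeps control over the underlying field.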
Debanjum Singh Solanky
db37e38df7 Create OrgNode hasBody method. Use it in org_to_jsonl checks 2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
b4878d76ea Extract entries from scratch when regenerate requested
- Do not rely on previously extracted entries to find new entries in
  regenerate scenario
2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum
c415af32d5
Support Incremental Update of Entries, Embeddings for OrgMode, Markdown, Beancount Content
### Major Changes
  - 030fab9 Support incremental update of **Markdown** entries, embeddings
  - 91aac83 Support incremental update of **Beancount** transactions, embeddings
  - 2f7a6af Support incremental update of **Org-Mode** entries, embeddings
    - Encode embeddings for updated or new entries
    - Reuse embeddings encoded for existing entries earlier
    - Merge the existing and new entries and embeddings to get the updated entries, embeddings
  - 91d11cc Only hash compiled entry to identify new/updated entries to update
  - b9a6e80 Make OrgNode tags stable sorted to find new entries for incremental updates

### Minor Changes
  - c17a0fd Do not store word filters index to file. Not necessary for now
  - 4eb84c7 Log performance metrics for jsonl conversion
  - 2e1bbe0 Fix stripping empty escape sequences from strings

### Why
  - Encoding embeddings is the slowest step to index content
  - Previously we regenerated embeddings for all entries, even if they existed in previous runs
  - Reusing previously generated embeddings should significantly speed up index updates,
    given most user generated content can be expected to be unchanged across time

Resolves #36
2022-09-10 21:38:05 +00:00
Debanjum Singh Solanky
9b2845de06 Add basic tests for beancount to jsonl conversion 2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky
d3267554ae Add basic tests for markdown to jsonl conversion 2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix stripping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky
a7cf6c8458 Use dictionary instead of list to track entry to file maps 2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky
3e1323971b Stack function calls in jsonl converters to avoid unneeded variables 2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky
4eb84c7f51 Log performance metrics for beancount, markdown to jsonl conversion 2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
ed8d432fdd Clean-up generated file after image search test run
- Clean-up unused imports in test files
2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky
030fab9bb2 Support incremental update of Markdown entries, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
91aac83c6a Support incremental update of Beancount transactions, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
cfaf7aa6f4 Update Indexing Performance Section in Readme 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
b01b4d7daa Extract logic to mark entries for embeddings update into helper function
- This could be re-used by other text_to_jsonl converters like
  markdown, beancount
2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
f97308bef2 Fix log message on writing JSONL data to file 2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky
899bfc5c3e Test incremental update triggered on calling text_search.setup
- Previously updates to index required explicitly setting `regenerate=True`
- Now the incremental update check is made every time `text_search.setup` is called
- Test if index automatically updates when calling `text_search.setup`
  with new content even with `regenerate=False`
2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
91d11ccb49 Only hash compiled entry to identify new/updated entries to update
- Comparing compiled entries is the appropriately narrow target to
  identify entries that need to encode their embedding vectors, given we
  pass the compiled form of the entry to the model for encoding

- Hashing the whole entry along with its raw form was resulting in a
  bunch of entries being marked for update, as LINE: <entry_line_no>
  is a string added to each entry's raw format.

- This results in an update to a single entry resulting in all entries
  below it in the file being marked for update (as all their line
  numbers have changed)

- Log performance metrics for steps to convert org entries to jsonl
2022-09-10 21:01:44 +03:00
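The difference between hashing the whole entry and hashing only its compiled form can be sketched as below. The entry layout, the helper names and the choice of MD5 are assumptions for illustration.

```python
import hashlib
import json


def hash_compiled(entry):
    """Hash only the text that is passed to the model for encoding."""
    return hashlib.md5(entry["compiled"].encode("utf-8")).hexdigest()


def hash_whole(entry):
    """Hash the full entry, including the raw form with its line number."""
    return hashlib.md5(json.dumps(entry, sort_keys=True).encode("utf-8")).hexdigest()


# Same entry before and after two lines were added above it in the file
before = {"compiled": "Buy milk", "raw": "LINE: 10\n* Buy milk"}
after = {"compiled": "Buy milk", "raw": "LINE: 12\n* Buy milk"}
```

Here `hash_compiled` gives the same value for `before` and `after`, so the entry keeps its embedding, while `hash_whole` changes only because the line number moved.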
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  every time
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
  entries for incremental update correctly
2022-09-10 20:59:52 +03:00
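A small sketch of why tag order matters for change detection. The `compile_entry` helper and its output format are hypothetical, and sorting the tags is one assumed way to get a stable order.

```python
def compile_entry(heading, tags):
    """Render a heading with its tags in a stable, sorted order."""
    stable_tags = sorted(set(tags))
    return f"{heading} :{':'.join(stable_tags)}:" if stable_tags else heading
```

Rendering from a sorted list means the compiled text, and therefore its hash, does not change between runs just because a set was iterated in a different order.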
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time, reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
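The steps listed under "What" can be sketched as follows. The function names, the dictionary keyed by entry hash and the `encode` callback are assumptions for illustration, not the actual implementation.

```python
import hashlib


def entry_key(compiled_entry):
    return hashlib.md5(compiled_entry.encode("utf-8")).hexdigest()


def incremental_update(entries, previous_embeddings, encode):
    """Return hash -> embedding for current entries, encoding only new ones.

    previous_embeddings maps entry hash to the embedding from an earlier run;
    encode takes a list of entries and returns one embedding per entry.
    """
    keys = [entry_key(entry) for entry in entries]

    # Only new or updated entries need to go through the slow encoder
    new_entries = [e for k, e in zip(keys, entries) if k not in previous_embeddings]
    new_vectors = encode(new_entries) if new_entries else []
    new_by_key = {entry_key(e): v for e, v in zip(new_entries, new_vectors)}

    # Merge reused and fresh embeddings; deleted entries drop out
    return {
        k: previous_embeddings[k] if k in previous_embeddings else new_by_key[k]
        for k in keys
    }
```

On a second run with mostly unchanged entries, `encode` only receives the entries whose hashes were not seen before.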
Debanjum Singh Solanky
ec675d27d3 Suppress non-actionable HuggingFace FutureWarning shown on app start 2022-09-10 16:43:14 +03:00
Debanjum Singh Solanky
1ac6a71ff0 Add --version flag to show installed version of khoj 2022-09-10 16:40:19 +03:00
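One common way to add such a flag is argparse's built-in `version` action. Whether khoj's CLI is wired exactly this way is an assumption; `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser(version):
    parser = argparse.ArgumentParser(prog="khoj")
    # The "version" action prints the version string and exits
    parser.add_argument("--version", action="version", version=f"%(prog)s {version}")
    return parser
```

Calling `build_parser("0.1.10").parse_args(["--version"])` prints the program name with the version and exits with status 0.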