Commit graph

941 commits

Author SHA1 Message Date
Debanjum Singh Solanky
91aac83c6a Support incremental update of Beancount transactions, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
cfaf7aa6f4 Update Indexing Performance Section in Readme 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
b01b4d7daa Extract logic to mark entries for embeddings update into helper function
- This could be re-used by other text_to_jsonl converters like
  markdown, beancount
2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
f97308bef2 Fix log message on writing JSONL data to file 2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky
899bfc5c3e Test incremental update triggered on calling text_search.setup
- Previously updates to index required explicitly setting `regenerate=True`
- Now incremental update check made everytime on `text_search.setup` now
- Test if index automatically updates when call `text_search.setup`
  with new content even with `regenerate=False`
2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
91d11ccb49 Only hash compiled entry to identify new/updated entries to update
- Comparing compiled entries is the appropriately narrow target to
  identify entries that need to encode their embedding vectors. Given we
  pass the compiled form of the entry to the model for encoding

- Hashing the whole entry along with it's raw form was resulting in a
  bunch of entries being marked for updated as LINE: <entry_line_no>
  is a string added to each entries raw format.

- This results in an update to a single entry resulting in all entries
  below it in the file being marked for update (as all their line
  numbers have changed)

- Log performance metrics for steps to convert org entries to jsonl
2022-09-10 21:01:44 +03:00
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  everytime
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
  entries for incremental update correctly
2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time. Reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky
ec675d27d3 Suppress non-actionable HuggingFace FutureWarning shown on app start 2022-09-10 16:43:14 +03:00
Debanjum Singh Solanky
1ac6a71ff0 Add --version flag to show installed version of khoj 2022-09-10 16:40:19 +03:00
Debanjum
372dcd2dbc
Handle Empty Org Files or Org Files with No Headings
### Main Changes
- bf01a4f Use filename or "#+TITLE" as heading for 0th level content in org files
- d6bd7bf Fix initializing `OrgNode` `level` to string to parse org files with no headings
- d835467 Throw exception if no valid entries found in specified content files

### Miscellaneous Improvements
- 7df39e5 Reuse search models across `pytest` sessions. Merge unused pytest fixtures
- 2dc0588 Do not normalize absolute filenames for entry links in `OrgNode`
- e00bb53 Init word filter dictionary with default value as set to simplify code

Resolves #83
2022-09-10 12:42:07 +00:00
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
2b58218b56 Reuse search models across sessions. Merge unused pytest fixtures
- Remove unused model_dir pytest fixture. It was only being used by
  the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
  Downloading search models for each pytest sessions seems excessive and
  slows down tests quite a bit
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
07b98d35f1 Use filename or #+TITLE as heading for 0th level content in org files
- Set LINE, SOURCE link properties in property drawer correctly for
  content which falls under no heading
- See Issue #83 for more details
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
d835467f2c Throw exception if no valid entries found in specified content files
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with clear message when no entries found
  should make clarify the issue to be fixed
- See issue #83 for details
2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky
e00bb53336 Init word filter dictionary with default value as set to simplify code 2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky
4d776d9c7a Bump khoj version to 0.1.9 2022-09-09 07:50:15 +03:00
Debanjum
b58b7d7483
Create App Directory, Fix Initialization GUI on First Run
- 588f598 Pass empty list of `input_files` to `FileBrowser` on first run
- 3ddffdf Create config directory before setting up logging to file under it

Resolves #78
Resolves #79
Resolves #80
2022-09-09 04:40:22 +00:00
Debanjum Singh Solanky
588f598949 Pass empty list of `input_files' to FileBrowser on first run
- Default config has `input_files' set to None
- This was being passed to `FileBrowser' on Initialization
- But `FileBrowser' expects `content_files' of list type, not None
- This resulted in an unexpected NoneType failure
2022-09-09 07:26:40 +03:00
Debanjum Singh Solanky
3ddffdfba4 Create config directory before setting up logging to file under it
- The logging to file code expects the config directory to already be setup
- But parent directory of config file was being set up later in code
- This resulted in app start failing with ~/.khoj dir does not exist error
2022-09-09 07:21:42 +03:00
Debanjum
79894efc7a
Resolve GUI Issues in Docker Build
- 17354aa Install `pyqt` system package in Docker image to get qt dependencies
- 5d3aeba Do not start GUI when Khoj started from Docker
- 26ff66f (Re-)Enable image search via Docker image as image search issues fixed

Resolves #76
2022-09-08 07:55:06 +00:00
Debanjum Singh Solanky
26ff66f38b (Re-)Enable image search via Docker image as image search issues fixed 2022-09-08 10:42:34 +03:00
Debanjum Singh Solanky
17354aaffd Install pyqt system package in Docker image to get qt dependencies
Otherwise app start fails with pyqt package import related errors.
See #76 for bug
2022-09-08 10:39:11 +03:00
Debanjum Singh Solanky
5d3aeba22f Use --no-gui flag on starting Khoj from docker-compose
As the GUI wouldn't work when run from a docker container
2022-09-08 10:37:39 +03:00
Debanjum Singh Solanky
e4d40e4d4d Update setup.py version, Readme. Remove faulty release badge for now 2022-09-07 14:51:03 +03:00
Debanjum Singh Solanky
35d81de1a1 Update khoj version to 0.1.7 in setup.py
This should have been done right after the 0.1.6 release. To allow
pre-release versions for 0.1.7 published to pypi from master to be
installable. Currently their being published as 0.1.6 pre-release
versions instead
2022-09-07 13:38:15 +03:00
Debanjum Singh Solanky
762607fc9f Log processed entries by org_to_jsonl only if verbosity > 2
Output too verbose for even debug mode logging. So gated behind -vvv
2022-09-06 23:03:29 +03:00
Debanjum Singh Solanky
490157cafa Setup File Filter for Markdown and Ledger content types
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
  - Word, Date Filters were accidently removed from the above types yesterday
  - File Filter is the only filter that newly got added
2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky
94cf3e97f3 Log app logs to file for posthoc debugging and performance analysis 2022-09-06 14:51:48 +03:00
Debanjum
0a78cd5477
Create File Filter. Improve, Consolidate Filter Code
### General Filter Improvements
- e441874 Create Abstract Base Class for all filters to inherit from
- 965bd05 Make search filters return entry ids satisfying filter
- 092b9e3 Setup Filters when configuring Text Search for each Search Type
- 31503e7 Do not pass embeddings in argument to `filter.apply` method as unused

### Create File Filter
- 7606724 Add file associated with each entry to entry dict in `org_to_jsonl` converter
- 1f9fd28 Create File Filter to filter files specified in query
- 7dd20d7 Pre-compute file to entry map in speed up file based filter
- 7e083d3 Cache results for file filters passed in query for faster filtering
- 2890b4c Simplify extracting entries satisfying file filter

### Miscellaneous
- f930324 Rename `explicit filter` to more appropriate name `word filter`
- 3707a4c Improve date filter perf. Precompute date to entry map, Cache results
2022-09-05 15:29:55 +00:00
Debanjum Singh Solanky
3707a4cdd4 Improve date filter perf. Precompute date to entry map, Cache results
- Precompute date to entry map
- Cache results for faster recall
- Log preformance timers in date filter
2022-09-05 18:21:29 +03:00
Debanjum Singh Solanky
31503e7afd Do not pass embeddings as argument to filter.apply method 2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky
965bd052f1 Make search filters return entry ids satisfying filter
- Filter entries, embeddings by ids satisfying all filters in query
  func, after each filter has returned entry ids satisfying their
  individual acceptance criteria

- Previously each filter would return a filtered list of entries.
  Each filter would be applied on entries filtered by previous filters.
  This made the filtering order dependent

- Benefits
  - Filters can be applied independent of their order of execution
  - Precomputed indexes for each filter is not in danger of running
    into index out of bound errors, as filters run on original entries
    instead of on entries filtered by filters that have run before it
  - Extract entries satisfying filter only once instead of doing
    this for each filter

- Costs
  - Each filter has to process all entries even if previous filters
    may have already marked them as non-satisfactory
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7dd20d764c Pre-compute file to entry map in file filter to mark ids to include faster 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
2890b4cd44 Simplify extracting entries satisfying file filter 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search as all text search content types do not have
  file being set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7e083d3e96 Cache results for file filters passed in query for faster filtering 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f634399f23 Convert simple file filters with no path separator into regex
- Specify just file name to get all notes associated with file at path
- E.g `query` with `file:"file1.org"` will return `entry1`
  if `entry1` is in `file1.org` at `~/notes/file.org`

- Test
  - Test converting simple file name filter to regex for path match
  - Test file filter with space in file name
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
1f9fd28b34 Create File Filter to filter files to query. Add tests for file filter 2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky
e4418746f2 Create Abstract Base Class for Filters. Make Word, Date Filter Child of BaseFilter 2022-09-04 18:48:16 +03:00
Debanjum Singh Solanky
c9f6200007 Ignore pytest_cache directory from git using .gitignore 2022-09-04 17:19:22 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum
d153d420fc
Improve Latency of Explicit Filter
### Goal
  - Improve explicit filter latency to work better with incremental search

### Reasons for High Explicit Filter Latency
  - Deleting entries to be excluded from existing list of entries, embeddings
  - Explicit filtering on partial words during incremental search
  - Creating word set for all entries on the fly during query
  - Deep copying of entries, embeddings before applying filter

### Improvement Details
  - **Major**
    - 191a656 Use word to entry map, list comprehension to speed up explicit filter
      - Use list comprehension and `torch.index_select` methods
        - to speed selection of entries, embedding tensors satisfying filter
        - avoid deep copy and direct manipulation of entries, embeddings
      - Use word to entry map and set operations to mark entries 
        satisfying inclusion, exclusion filters
    - c7de57b Pre-compute entry word sets to improve explicit filter query performance
    - 3308e68 Cache explicitly filtered entries, embeddings by required, blocked words
    - cdcee89 Wrap explicit filter words in quotes to trigger filter
      - E.g `+"word_to_include"` instead of `+word_to_include`
      - Signals explicit filter term completed
      - Prevents latency due to incremental search with explicit filtering on partial terms
  - **Minor**
    - 28d3dc1 Deep copy entries, embeddings in filters. Defer till actual filtering
    - 8d9f507 Load entries_by_word_set from file only once on first load of explicit filter
    - 546fad5 Use regex to check for and extract include, exclude filter words from query
    - b7d259b Test Explicit Include, Exclude Filters
    
### Results
  - Improve exclude word filter latency from **20s+ to 0.02s** on 120K line notes corpus
2022-09-04 13:55:17 +00:00
Debanjum Singh Solanky
6087862521 Use LRU helper class for explicit filter cache 2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky
191a656ed7 Use word to entry map, list comprehension to speed up explicit filter
- Code Changes
  - Use list comprehension and `torch.index_select' methods
    - to speed selection of entries, embedding tensors satisfying filter
    - avoid deep copy of entries, embeddings
    - avoid updating existing lists (of entries, embeddings)

  - Use word to entry map and set operations to mark entries satisfying
    inclusion, exclusion filters

- Results
  - Speed up explicit filtering by two orders of magnitude
  - Improve consistency of speed up across inclusion and exclusion filtering
2022-09-04 15:22:35 +03:00