- Parsed `level` argument passed to OrgNode during init is expected to
be a string, not an integer
- This was resulting in app failure only when parsing org files with
no headings, like in issue #83, as level is set to string of `*`s
the moment a heading is found in the current file
- Previously the app failed while computing embeddings if there were
no valid entries. This obscured the actual issue: no valid entries
were found in the specified content files
- Throwing an exception early with a clear message when no entries are
found should clarify the issue to be fixed
- See issue #83 for details
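
A minimal sketch of the fail-early check described above. The function name `compute_embeddings`, the `encode` callable and the error message are hypothetical, used only for illustration; they are not the app's actual API.

```python
def compute_embeddings(entries, encode):
    """Encode entries, failing early with a clear message if there are none.

    `encode` is a hypothetical stand-in for the embedding model call.
    """
    if not entries:
        # Surface the real problem instead of an obscure failure further down
        raise ValueError(
            "No valid entries found in specified content files. "
            "Check the input files and filters in your config."
        )
    return [encode(entry) for entry in entries]
```

Raising before the encoder is called keeps the error close to its cause, so the message points at the content files rather than at the embedding step.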
- 588f598 Pass empty list of `input_files` to `FileBrowser` on first run
- 3ddffdf Create config directory before setting up logging to file under it
Resolves #78, Resolves #79, Resolves #80
- Default config has `input_files` set to None
- This was being passed to `FileBrowser` on initialization
- But `FileBrowser` expects `content_files` of list type, not None
- This resulted in an unexpected NoneType failure
- The logging to file code expects the config directory to already be setup
- But parent directory of config file was being set up later in code
- This resulted in app start failing with ~/.khoj dir does not exist error
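
The ordering fix for logging can be sketched as follows. This is an illustration under assumptions: the function name `setup_file_logging`, the logger name `khoj_sketch` and the log file name `khoj.log` are hypothetical and not taken from the app.

```python
import logging
from pathlib import Path


def setup_file_logging(config_file: Path) -> Path:
    """Create the config directory first, then attach a file log handler."""
    # Create the parent directory of the config file before logging needs it
    config_file.parent.mkdir(parents=True, exist_ok=True)

    # Hypothetical log file name, placed under the config directory
    log_file = config_file.parent / "khoj.log"
    handler = logging.FileHandler(log_file)

    # Hypothetical logger name used only for this sketch
    logger = logging.getLogger("khoj_sketch")
    logger.addHandler(handler)
    return log_file
```

`mkdir(parents=True, exist_ok=True)` makes the call safe on both first run and later runs, since it neither fails on an existing directory nor on missing parents.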
- 17354aa Install `pyqt` system package in Docker image to get qt dependencies
- 5d3aeba Do not start GUI when Khoj started from Docker
- 26ff66f (Re-)Enable image search via Docker image as image search issues fixed
Resolves #76
This should have been done right after the 0.1.6 release, to allow
pre-release versions for 0.1.7 published to pypi from master to be
installable. Currently they're being published as 0.1.6 pre-release
versions instead
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
- Word, Date Filters were accidentally removed from the above types yesterday
- File Filter is the only filter that newly got added
### General Filter Improvements
- e441874 Create Abstract Base Class for all filters to inherit from
- 965bd05 Make search filters return entry ids satisfying filter
- 092b9e3 Setup Filters when configuring Text Search for each Search Type
- 31503e7 Do not pass embeddings in argument to `filter.apply` method as unused
### Create File Filter
- 7606724 Add file associated with each entry to entry dict in `org_to_jsonl` converter
- 1f9fd28 Create File Filter to filter files specified in query
- 7dd20d7 Pre-compute file to entry map to speed up file based filter
- 7e083d3 Cache results for file filters passed in query for faster filtering
- 2890b4c Simplify extracting entries satisfying file filter
### Miscellaneous
- f930324 Rename `explicit filter` to more appropriate name `word filter`
- 3707a4c Improve date filter perf. Precompute date to entry map, Cache results
- Filter entries, embeddings by ids satisfying all filters in query
func, after each filter has returned entry ids satisfying their
individual acceptance criteria
- Previously each filter would return a filtered list of entries.
Each filter would be applied on entries filtered by previous filters.
This made the filtering order dependent
- Benefits
- Filters can be applied independent of their order of execution
- Precomputed indexes for each filter are not in danger of running
into index out of bound errors, as filters run on original entries
instead of on entries filtered by filters that have run before them
- Extract entries satisfying filter only once instead of doing
this for each filter
- Costs
- Each filter has to process all entries even if previous filters
may have already marked them as non-satisfactory
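
The order-independent filtering described above can be sketched roughly as below. This is a simplified illustration, not the app's query function: filters are modeled as plain callables returning a set of entry ids, and embeddings as a plain list rather than a tensor. All names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical filter signature: (query, entries) -> ids of entries satisfying it
Filter = Callable[[str, List[Dict]], Set[int]]


def apply_filters(
    query: str, entries: List[Dict], embeddings: List, filters: List[Filter]
) -> Tuple[List[Dict], List]:
    """Intersect ids returned by each filter, then extract entries only once."""
    included_ids = set(range(len(entries)))
    for filter_fn in filters:
        # Every filter runs on the original entries, so its precomputed
        # indexes stay valid regardless of which filters ran before it
        included_ids &= filter_fn(query, entries)

    ids = sorted(included_ids)
    return [entries[i] for i in ids], [embeddings[i] for i in ids]
```

Because set intersection is commutative, the result is the same for any ordering of `filters`. The cost noted above is visible here too: each filter scans all entries, even those another filter has already rejected.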
- This will help filter queries to the org content type using the file filter
- Do not explicitly specify items being extracted from json of each
entry in text_search, as not all text search content types have
file set in their jsonl converters
- Specify just file name to get all notes associated with file at path
- E.g. `query` with `file:"file1.org"` will return `entry1`
if `entry1` is in `file1.org` at `~/notes/file1.org`
- Test
- Test converting simple file name filter to regex for path match
- Test file filter with space in file name
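
One way to turn a bare file name into a path-matching regex is sketched below. It assumes the `file:"..."` query syntax shown above and uses the standard library's `fnmatch.translate`. The function names and the choice to prefix `*/` are assumptions for illustration and may differ from the actual file filter.

```python
import fnmatch
import re

# Assumed query syntax, per the example above: file:"<name or path pattern>"
FILE_FILTER_REGEX = re.compile(r'file:"(.+?)"')


def extract_file_filters(query: str) -> list:
    """Return the file filter terms found in the query."""
    return FILE_FILTER_REGEX.findall(query)


def file_filter_to_regex(file_filter: str) -> re.Pattern:
    """Convert a file filter term into a regex matched against full file paths."""
    # A bare file name (no path separator) should match that file under any directory
    if "/" not in file_filter:
        file_filter = f"*/{file_filter}"
    return re.compile(fnmatch.translate(file_filter))
```

For instance, `file_filter_to_regex("file1.org").match("~/notes/file1.org")` succeeds, while a path ending in `file2.org` does not match. Quoting the filter term is what lets file names containing spaces be captured as a single term.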
### Goal
- Improve explicit filter latency to work better with incremental search
### Reasons for High Explicit Filter Latency
- Deleting entries to be excluded from existing list of entries, embeddings
- Explicit filtering on partial words during incremental search
- Creating word set for all entries on the fly during query
- Deep copying of entries, embeddings before applying filter
### Improvement Details
- **Major**
- 191a656 Use word to entry map, list comprehension to speed up explicit filter
- Use list comprehension and `torch.index_select` methods
- to speed selection of entries, embedding tensors satisfying filter
- avoid deep copy and direct manipulation of entries, embeddings
- Use word to entry map and set operations to mark entries
satisfying inclusion, exclusion filters
- c7de57b Pre-compute entry word sets to improve explicit filter query performance
- 3308e68 Cache explicitly filtered entries, embeddings by required, blocked words
- cdcee89 Wrap explicit filter words in quotes to trigger filter
- E.g `+"word_to_include"` instead of `+word_to_include`
- Signals explicit filter term completed
- Prevents latency due to incremental search with explicit filtering on partial terms
- **Minor**
- 28d3dc1 Deep copy entries, embeddings in filters. Defer till actual filtering
- 8d9f507 Load entries_by_word_set from file only once on first load of explicit filter
- 546fad5 Use regex to check for and extract include, exclude filter words from query
- b7d259b Test Explicit Include, Exclude Filters
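
A rough sketch of regex-based extraction of quoted include and exclude words, following the `+"word_to_include"` syntax above. The `-"word"` form for exclusion, the regex patterns and the function name are assumptions for illustration.

```python
import re

# Only fully quoted terms match, so a half-typed term like +"wor does not trigger filtering
REQUIRED_REGEX = re.compile(r'\+"([^"]+)"')
BLOCKED_REGEX = re.compile(r'-"([^"]+)"')


def parse_word_filters(query: str):
    """Split a query into required words, blocked words and the remaining query text."""
    required = {word.lower() for word in REQUIRED_REGEX.findall(query)}
    blocked = {word.lower() for word in BLOCKED_REGEX.findall(query)}

    # Strip the filter terms so only the free-text part is left for search
    remaining = BLOCKED_REGEX.sub("", REQUIRED_REGEX.sub("", query))
    return required, blocked, " ".join(remaining.split())
```

The closing quote acts as the end marker: until the user types it, both sets stay empty and the expensive filter path can be skipped during incremental search.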
### Results
- Improve exclude word filter latency from **20s+ to 0.02s** on 120K line notes corpus
- Code Changes
- Use list comprehension and `torch.index_select` methods
- to speed selection of entries, embedding tensors satisfying filter
- avoid deep copy of entries, embeddings
- avoid updating existing lists (of entries, embeddings)
- Use word to entry map and set operations to mark entries satisfying
inclusion, exclusion filters
- Results
- Speed up explicit filtering by two orders of magnitude
- Improve consistency of speed up across inclusion and exclusion filtering
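
The word to entry map and set operations mentioned above could look roughly like this. It is a sketch under assumptions: entries are plain strings, words are split on whitespace, and the function names are hypothetical.

```python
from collections import defaultdict


def build_word_to_entry_map(entries):
    """Precompute a map from each word to the ids of entries containing it."""
    word_to_entries = defaultdict(set)
    for entry_id, entry in enumerate(entries):
        for word in entry.lower().split():
            word_to_entries[word].add(entry_id)
    return word_to_entries


def filter_entry_ids(word_to_entries, num_entries, required, blocked):
    """Return ids of entries containing all required words and no blocked words."""
    included = set(range(num_entries))
    for word in required:
        included &= word_to_entries.get(word, set())
    for word in blocked:
        included -= word_to_entries.get(word, set())
    return included
```

Given the resulting ids, entries can be selected with a list comprehension such as `[entries[i] for i in sorted(ids)]`. For an embeddings tensor, the matching rows can be picked with `torch.index_select(embeddings, 0, torch.tensor(sorted(ids)))`, which builds a new tensor instead of deleting rows from the existing one.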
- Only the filter knows when entries, embeddings are to be manipulated.
So move the responsibility to deep copy before manipulating entries,
embeddings to the filters
- Create deep copy in filters. Avoids creating deep copy of entries,
embeddings when filter results are being loaded from cache etc
- Do not run the more expensive explicit filter until the word to be
filtered is completed by user. This requires an end sequence marker
to identify end of explicit word filter to trigger filtering
- Space isn't a good enough delimiter as the explicit filter could be
at the end of the query, in which case there is no trailing space
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
- verbose = 0 maps to logging.WARN
- verbose = 1 maps to logging.INFO
- verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
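
The verbosity remap listed above can be expressed as a small helper. The function name is hypothetical, and treating negative values like 0 is an assumption.

```python
import logging


def verbosity_to_log_level(verbose: int) -> int:
    """Map the app's verbose count to a Python logging level."""
    if verbose <= 0:
        return logging.WARNING  # logging.WARN is an alias for this level
    if verbose == 1:
        return logging.INFO
    return logging.DEBUG  # verbose >= 2
```

Setting the level once, for example with `logging.basicConfig(level=verbosity_to_log_level(verbose))`, removes the need to pass a verbose flag between methods.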
- This also pushes the updated URL state to history
- Allows jumping back to the web interface after clicking on an image
and having the type set to image search
- Previously type would get reset to the default search type on
jumping back
- CLIP doesn't need full size images for generating embeddings with
decent search results. The sentence transformers docs use images
scaled to 640px width
- Benefits
- Normalize image sizes
- Increase image embeddings generation speed
- Decrease memory usage while generating embeddings from images
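
As an illustration of the scaling step, the helper below computes a target size that caps width at 640px while preserving aspect ratio. The function name and the choice to leave narrower images unchanged are assumptions, not confirmed by the source.

```python
def scaled_size(width: int, height: int, max_width: int = 640):
    """Return (width, height) scaled down so width is at most max_width."""
    if width <= max_width:
        # Assumption: images already narrow enough are left as they are
        return width, height
    scale = max_width / width
    # Keep at least 1px of height for extremely wide images
    return max_width, max(1, round(height * scale))
```

The resulting size could then be passed to an image library's resize call before the image is handed to the model for embedding.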