Commit graph

93 commits

Author SHA1 Message Date
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existing code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
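A minimal sketch of how a list of input-filters might be combined with explicit input-files, assuming each filter is a glob pattern. `collect_files` is a hypothetical helper for illustration, not the project's function.

```python
import glob


def collect_files(input_files=None, input_filters=None):
    """Return the sorted union of explicit files and files matching any glob filter."""
    # Hypothetical helper for illustration; not the project's actual function.
    matched = set(input_files or [])
    for input_filter in input_filters or []:  # each filter is assumed to be a glob pattern
        matched.update(glob.glob(input_filter))
    return sorted(matched)
```

Using a set means a file that is both listed explicitly and matched by a filter is only returned once.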
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
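The get/set to @property conversion can be sketched roughly as below. `Node` is a hypothetical stand-in with two assumed fields; it is not the real OrgNode class.

```python
class Node:
    """Hypothetical stand-in for a parsed heading; not the real OrgNode class."""

    def __init__(self, heading, tags=None):
        # Fields are kept protected; callers go through the properties below
        self._heading = heading
        self._tags = list(tags or [])

    @property
    def heading(self):
        return self._heading

    @heading.setter
    def heading(self, new_heading):
        self._heading = new_heading

    @property
    def tags(self):
        return self._tags

    @tags.setter
    def tags(self, new_tags):
        # Store a copy so later changes to the caller's list do not leak in
        self._tags = list(new_tags)
```

Callers then read and write `node.heading` like a plain attribute instead of calling separate getter and setter methods.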
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky
9b2845de06 Add basic tests for beancount to jsonl conversion 2022-09-11 00:16:02 +03:00
Debanjum Singh Solanky
d3267554ae Add basic tests for markdown to jsonl conversion 2022-09-11 00:15:27 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
ed8d432fdd Clean-up generated file after image search test run
- Clean-up unused imports in test files
2022-09-10 21:43:31 +03:00
Debanjum Singh Solanky
899bfc5c3e Test incremental update triggered on calling text_search.setup
- Previously updates to index required explicitly setting `regenerate=True`
- Now the incremental update check is made every time `text_search.setup` is called
- Test if index automatically updates when calling `text_search.setup`
  with new content even with `regenerate=False`
2022-09-10 21:02:27 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  every time
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and correctly identifies
  updated or new entries for incremental update
2022-09-10 20:59:52 +03:00
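A small sketch of keeping tags in a stable order, assuming tags arrive in org-mode's `:tag1:tag2:` form. `parse_tags` is a hypothetical helper. A set of strings has no guaranteed iteration order between runs, so rendering an entry from a set can change its text (and any hash of it) even when the entry did not change.

```python
def parse_tags(tag_string):
    """Parse ':a:b:a:' into a de-duplicated list that keeps first-seen order."""
    tags = [tag for tag in tag_string.strip().split(":") if tag]
    # dict preserves insertion order, so this de-duplicates without reordering
    return list(dict.fromkeys(tags))
```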
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Most note text entries are expected to be unchanged across time,
    so reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
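The hash-and-reuse idea described above can be sketched as follows. This is a simplified illustration under assumptions: entries are plain strings, embeddings are stored in a dict keyed by entry hash, and `encode` is any caller-supplied function. All names are hypothetical.

```python
import hashlib


def hash_entry(entry):
    """Stable content hash used to recognise an unchanged entry."""
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def update_embeddings(entries, previous, encode):
    """Reuse embeddings for unchanged entries; encode only new or updated ones.

    previous maps entry hash -> embedding from the last run.
    Returns (updated mapping, list of entries that had to be encoded).
    """
    updated, encoded = {}, []
    for entry in entries:
        key = hash_entry(entry)
        if key in updated:
            continue  # duplicate entry in this run, already handled
        if key in previous:
            updated[key] = previous[key]  # unchanged: reuse earlier embedding
        else:
            updated[key] = encode(entry)  # new or edited: encode now
            encoded.append(entry)
    return updated, encoded
```

Entries that no longer exist are simply absent from the returned mapping, so deletions are handled without extra bookkeeping.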
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
2b58218b56 Reuse search models across sessions. Merge unused pytest fixtures
- Remove unused model_dir pytest fixture. It was only being used by
  the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
  Downloading search models for each pytest session seems excessive and
  slows down tests quite a bit
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
d835467f2c Throw exception if no valid entries found in specified content files
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with a clear message when no entries are found
  should clarify the issue to be fixed
- See issue #83 for details
2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky
31503e7afd Do not pass embeddings as argument to filter.apply method 2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky
965bd052f1 Make search filters return entry ids satisfying filter
- Filter entries, embeddings by ids satisfying all filters in query
  func, after each filter has returned entry ids satisfying their
  individual acceptance criteria

- Previously each filter would return a filtered list of entries.
  Each filter would be applied on entries filtered by previous filters.
  This made the filtering order dependent

- Benefits
  - Filters can be applied independent of their order of execution
  - Precomputed indexes for each filter are not in danger of running
    into index out of bound errors, as filters run on original entries
    instead of on entries filtered by filters that have run before it
  - Extract entries satisfying filter only once instead of doing
    this for each filter

- Costs
  - Each filter has to process all entries even if previous filters
    may have already marked them as non-satisfactory
2022-09-05 15:21:40 +03:00
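A rough sketch of order-independent filtering: each filter looks at the original entries and returns the set of ids it accepts, and the ids are intersected once at the end. The two filter factories and the entry dict shape (`text`, `file`) are assumptions for illustration only.

```python
def ids_with_word(word):
    """Hypothetical filter: ids of entries whose text contains the word."""
    def apply(entries):
        return {i for i, e in enumerate(entries) if word in e["text"].lower().split()}
    return apply


def ids_in_file(filename):
    """Hypothetical filter: ids of entries that come from the given file."""
    def apply(entries):
        return {i for i, e in enumerate(entries) if e["file"].endswith(filename)}
    return apply


def apply_filters(entries, embeddings, filters):
    """Keep only entries (and their embeddings) accepted by every filter."""
    included = set(range(len(entries)))
    for entry_filter in filters:
        # Every filter runs on the original entries, so ids always stay valid
        included &= entry_filter(entries)
    ids = sorted(included)
    return [entries[i] for i in ids], [embeddings[i] for i in ids]
```

Because set intersection is commutative, running the filters in any order yields the same result, at the cost of each filter scanning all entries.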
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search as all text search content types do not have
  file being set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f634399f23 Convert simple file filters with no path separator into regex
- Specify just file name to get all notes associated with file at path
- E.g. `query` with `file:"file1.org"` will return `entry1`
  if `entry1` is in `file1.org` at `~/notes/file1.org`

- Test
  - Test converting simple file name filter to regex for path match
  - Test file filter with space in file name
2022-09-05 15:21:40 +03:00
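One way to turn a bare file name into a path-matching regex is sketched below using the standard library's `fnmatch.translate`. This is an assumption about the approach, not the project's code, and `to_path_regex` is a hypothetical name.

```python
import fnmatch
import re


def to_path_regex(file_filter):
    """Compile a file filter into a regex matched against full file paths."""
    if "/" in file_filter:
        pattern = file_filter  # already contains a path separator: use as given
    else:
        pattern = "*/" + file_filter  # bare name: match that file under any directory
    return re.compile(fnmatch.translate(pattern))
```

With this sketch, `to_path_regex("file1.org")` matches `~/notes/file1.org` but not `~/notes/otherfile1.org`, and file names containing spaces need no special handling.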
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
1f9fd28b34 Create File Filter to filter files to query. Add tests for file filter 2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
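For reference, a small least-recently-used cache can be built on `collections.OrderedDict`. This is a generic sketch, not the helper class from the commit; the class name, method names and default capacity are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used one."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest item

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)
```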
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query in which case no space
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
858d86075b Use regexes to check if any explicit filters in query. Test can_filter 2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky
b7d259b1ec Test Explicit Include, Exclude Filters 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
30c3eb372a Update Tests to Configure Filters and Setup Text Search 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum Singh Solanky
5e107eedc0 Rename test_asymmetric_search to now more appropriate test_text_search 2022-08-21 18:36:14 +03:00
Debanjum Singh Solanky
972523e8a9 Re-enable tests for image search
Verify if recent fixes resolve test flakiness
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
fd952e7273 Fix CLI tests as config_file path made absolute during CLI parsing 2022-08-12 01:47:52 +03:00
Debanjum Singh Solanky
fc48ee62ad Update CLI tests since config_file arg has become optional (again) 2022-08-11 22:27:11 +03:00
Debanjum Singh Solanky
a748acfeeb Merge branch 'master' of github.com:debanjum/khoj into create-native-gui
Conflicts:
- src/main.py
  - router functions have moved to router
  - move logic to handle null query perf timer variables into
    router.py
  - set main.py to current branch, not master
2022-08-11 21:09:42 +03:00
Debanjum Singh Solanky
a02d9db457 Test Task State Extraction in OrgNode Tests 2022-08-10 13:48:18 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
bc423d8f76 Disable image search in tests. Import global state from constants module
- Upstream issues causing load of image search model to fail.
  Disable tests related to image search for now
2022-08-06 02:47:52 +03:00
Debanjum Singh Solanky
ca5a8bd113 Make config file a positional argument, as it is required
- Test invalid config file path throws. Remove redundant cli test

- Simplify cli parser code
  - Do not need to explicitly check if args.config_file set.
    argparser checks for positional arguments automatically

- Use standard semantics for cli args
  - All positional args are required. Non positional args are optional

- Improve command line --help description
2022-08-05 01:09:40 +03:00
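The positional-argument behaviour relied on here can be sketched with `argparse`. The `--verbose` flag and the function name `parse_cli` are hypothetical and included only to contrast optional and positional arguments.

```python
import argparse
from pathlib import Path


def parse_cli(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a CLI with a required config file")
    # Positional arguments are required by argparse, so no manual check is needed
    parser.add_argument("config_file", type=Path, help="Path to the YAML config file")
    # Hypothetical optional flag, included only to contrast with the positional argument
    parser.add_argument("--verbose", action="store_true", help="Print more detail")
    return parser.parse_args(argv)
```

Calling `parse_cli([])` makes argparse print a usage error and exit, which is the check the commit no longer performs by hand.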
Debanjum Singh Solanky
1374065092 Mark all required fields for config. Throw if no input_* field specified
- Add custom validator to throw if neither input_filter nor
  input_<files|directories> are specified

- Set field expecting paths to type Path

- Now that default_config isn't used in code, we can update
  fields in rawconfig to specify whether they're required or not.
  This lets pydantic validate config file and throw appropriate error
2022-08-05 01:08:48 +03:00
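The "throw if no input_* field is specified" rule can be illustrated with a standard-library dataclass. Note this swaps in `dataclasses` for pydantic to stay dependency-free; the class name and the `embeddings_file` field are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class ContentConfigSketch:
    """Hypothetical config: one required path plus at least one input source."""

    embeddings_file: Path  # required: no default value
    input_files: Optional[List[Path]] = None
    input_filter: Optional[str] = None

    def __post_init__(self):
        # Stand-in for a custom validator: reject configs with no input source
        if not self.input_files and not self.input_filter:
            raise ValueError("Specify at least one of input_files or input_filter")
```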
Debanjum Singh Solanky
4788143aa6 Set clip model name in conftest to sentence-transformers/clip as well 2022-08-04 22:54:39 +03:00
Debanjum Singh Solanky
f50f343f73 Rename org-mode test data directory to more specific org/ from notes/ 2022-08-04 22:29:57 +03:00
Debanjum Singh Solanky
a4eb55dd00 Rename khoj config yml file to follow more specific khoj*.yml pattern
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic,
  more easily identifiable with the application
- Update docs, app accordingly
2022-08-03 12:06:55 +03:00
Debanjum Singh Solanky
7d7259bd92 Remove tests that validate configuring org using commandline arguments 2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky
a12eaa4ce0 Move Khoj image results into a child images/ directory 2022-07-28 20:45:12 +04:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by setting &?r=true when querying /search API
  - It's an optional param that defaults to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired
    but for lower accuracy
2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky
b1e64fd4a8 Improve search speed. Only apply filter if filter keywords in query
- Formalize filters into class with can_filter() and filter() methods

- Use can_filter() method to decide whether to apply filter and
  create deep copies of entries and embeddings for it

- Improve search speed for queries with no filters
  as deep copying entries, embeddings takes the most time
  after cross-encoder scoring when calling the /search API

  Earlier we would create deep copies of entries, embeddings
  even if the query did not contain any filter keywords
2022-07-26 22:47:26 +04:00
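A simplified sketch of the `can_filter()` / `filter()` split, where the expensive deep copy only happens when the query actually contains filter keywords. The `+word` / `-word` syntax, the `WordFilter` class and the `prepare` function are assumptions for illustration; entries are plain strings here.

```python
import copy
import re


class WordFilter:
    # Hypothetical syntax: +word requires a word, -word excludes it
    include = re.compile(r"(?:^|\s)\+(\w+)")
    exclude = re.compile(r"(?:^|\s)-(\w+)")

    def can_filter(self, query):
        return bool(self.include.search(query) or self.exclude.search(query))

    def filter(self, query, entries):
        required = {word.lower() for word in self.include.findall(query)}
        blocked = {word.lower() for word in self.exclude.findall(query)}
        kept = []
        for entry in entries:
            words = set(entry.lower().split())
            if required <= words and not (blocked & words):
                kept.append(entry)
        # Strip the filter terms so only the free-text query remains
        stripped = self.exclude.sub("", self.include.sub("", query)).strip()
        return stripped, kept


def prepare(query, entries, filters):
    applicable = [f for f in filters if f.can_filter(query)]
    if not applicable:
        return query, entries  # no filter keywords: skip the deep copy entirely
    working = copy.deepcopy(entries)
    for entry_filter in applicable:
        query, working = entry_filter.filter(query, working)
    return query, working
```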
Debanjum Singh Solanky
65fea7681a Rename notes search type to org search, now that markdown notes supported 2022-07-21 22:09:44 +04:00