Commit graph

80 commits

Author SHA1 Message Date
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
2b58218b56 Reuse search models across sessions. Merge unused pytest fixtures
- Remove unused model_dir pytest fixture. It was only being used by
  the content_config fixture, not by any tests
- Reuse existing search models downloaded to khoj directory.
  Downloading search models for each pytest sessions seems excessive and
  slows down tests quite a bit
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
d835467f2c Throw exception if no valid entries found in specified content files
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with clear message when no entries found
  should make clarify the issue to be fixed
- See issue #83 for details
2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky
31503e7afd Do not pass embeddings as argument to filter.apply method 2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky
965bd052f1 Make search filters return entry ids satisfying filter
- Filter entries, embeddings by ids satisfying all filters in query
  func, after each filter has returned entry ids satisfying their
  individual acceptance criteria

- Previously each filter would return a filtered list of entries.
  Each filter would be applied on entries filtered by previous filters.
  This made the filtering order dependent

- Benefits
  - Filters can be applied independent of their order of execution
  - Precomputed indexes for each filter is not in danger of running
    into index out of bound errors, as filters run on original entries
    instead of on entries filtered by filters that have run before it
  - Extract entries satisfying filter only once instead of doing
    this for each filter

- Costs
  - Each filter has to process all entries even if previous filters
    may have already marked them as non-satisfactory
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search as all text search content types do not have
  file being set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f634399f23 Convert simple file filters with no path separator into regex
- Specify just file name to get all notes associated with file at path
- E.g `query` with `file:"file1.org"` will return `entry1`
  if `entry1` is in `file1.org` at `~/notes/file.org`

- Test
  - Test converting simple file name filter to regex for path match
  - Test file filter with space in file name
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
1f9fd28b34 Create File Filter to filter files to query. Add tests for file filter 2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query in which case no space
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
858d86075b Use regexes to check if any explicit filters in query. Test can_filter 2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky
b7d259b1ec Test Explicit Include, Exclude Filters 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
30c3eb372a Update Tests to Configure Filters and Setup Text Search 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum Singh Solanky
5e107eedc0 Rename test_asymmetric_search to now more appropriate test_text_search 2022-08-21 18:36:14 +03:00
Debanjum Singh Solanky
972523e8a9 Re-enable tests for image search
Verify if recent fixes resolve test flakiness
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
fd952e7273 Fix CLI tests as config_file path made absolute during CLI parsing 2022-08-12 01:47:52 +03:00
Debanjum Singh Solanky
fc48ee62ad Update CLI tests since config_file arg has become optional (again) 2022-08-11 22:27:11 +03:00
Debanjum Singh Solanky
a748acfeeb Merge branch 'master' of github.com:debanjum/khoj into create-native-gui
Conflicts:
- src/main.py
  - router functions have moved to router
  - move logic to handle null query perf timer variables into
    router.py
  - set main.py to current branch, not master
2022-08-11 21:09:42 +03:00
Debanjum Singh Solanky
a02d9db457 Test Task State Extraction in OrgNode Tests 2022-08-10 13:48:18 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
bc423d8f76 Disable image search in tests. Import global state from constants module
- Upstream issues causing load of image search model to fail.
  Disable tests related to image search for now
2022-08-06 02:47:52 +03:00
Debanjum Singh Solanky
ca5a8bd113 Make config file a positional argument, as it is required
- Test invalid config file path throws. Remove redundant cli test

- Simplify cli parser code
  - Do not need to explicitly check if args.config_file set.
    argparser checks for positional arguments automatically

- Use standard semantics for cli args
  - All positional args are required. Non positional args are optional

- Improve command line --help description
2022-08-05 01:09:40 +03:00
Debanjum Singh Solanky
1374065092 Mark all required fields for config. Throw if no input_* field specified
- Add custom validator to throw if neither input_filter or
  input_<files|directories> are specified

- Set field expecting paths to type Path

- Now that default_config isn't used in code. We can update
  fields in rawconfig to specify whether they're required or not.
  This lets pydantic validate config file and throw appropriate error
2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky
4788143aa6 Set clip model name in conftest to sentence-tranformers/clip as well 2022-08-04 22:54:39 +03:00
Debanjum Singh Solanky
f50f343f73 Rename org-mode test data directory to more specific org/ from notes/ 2022-08-04 22:29:57 +03:00
Debanjum Singh Solanky
a4eb55dd00 Rename khoj config yml file to follow more specific khoj*.yml pattern
- That is, sample_config.yml is renamed to khoj_sample.yml
- This makes the application config filename less generic,
  more easily identifiable with the application
- Update docs, app accordingly
2022-08-03 12:06:55 +03:00
Debanjum Singh Solanky
7d7259bd92 Remove tests that validate configuring org using commandline arguments 2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky
a12eaa4ce0 Move Khoj image results into a child images/ directory 2022-07-28 20:45:12 +04:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by settings &?r=true when querying /search API
  - It's an optional param that default to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired
    but for lower accuracy
2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky
b1e64fd4a8 Improve search speed. Only apply filter if filter keywords in query
- Formalize filters into class with can_filter() and filter() methods

- Use can_filter() method to decide whether to apply filter and
  create deep copies of entries and embeddings for it

- Improve search speed for queries with no filters
  as deep copying entries, embeddings takes the most time
  after cross-encodes scoring when calling the /search API

  Earlier we would create deep copies of entries, embeddings
  even if the query did not contain any filter keywords
2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky
65fea7681a Rename notes search type to org search, now that markdown notes supported 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
1f4b5ac112 Create test markdown files. Use them in sample config, docker-compose 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types were mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and making it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
d50bfb5188 Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests 2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky
70e70d4b15 Rename 'embed' key to more generic 'compiled' for jsonl extracted results
- While it's true those strings are going to be used to generated
  embeddings, the more generic term allows them to be used elsewhere as
  well

- Their main property is that they are processed, compiled for
  usage by semantic search

- Unlike the 'raw' string which contains the external representation
  of the data, as is
2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky
c1369233db Consistently use "entry", "score" in json response for all search types
- Had already made some progress on this earlier by updating the image
  search responses. But needed to update the text search responses to
  use lowercase entry and score

- Update khoj.el to consume the updated json response keys for text
  search
2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky
c9ff97451b Fix tests to handle updated response types by API 2022-07-20 03:01:56 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
68ee88cebc Fix image search tests after update to API response for image search types
- Look for 'entry' key in response json instead of 'Entry'
- Expect image where id = alphanumeric order of image name
2022-07-20 01:37:01 +04:00
Debanjum Singh Solanky
b673d26a12 Extract Entries in a standardized format across text search types
Issue:
 - Had different schema of extracted entries for symmetric_ledger vs asymmetric

 - Entry extraction for asymmetric was dirty, relying on cryptic
   indices to store raw entry vs cleaned entry meant to be passed to embeddings

 - This was pushing the load of figuring out what property to extract
   from each entry to downstream processes like the filters

 - This limited the filters to only work for asymmetric search, not for
   symmetric_ledger

- Fix
   - Use consistent format for extracted entries
     {
       'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
       'raw'  : raw_entry_string_meant_to_be_passed_to_use
     }

 - Result
   - Now filters can be applied across search types, and the specific
     field they should be applied on can be configured by each search
     type
2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky
732b2d287f Give the project a short, less generic name. Rename it to Khoj
- Semantic Search was just a placeholder used to test the idea out
  Didn't want to get into naming at that point of time
2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 is more accurate
  - The exact previous model isn't benchmarked but based on the
    performance of the closest model to it. Seems like the new model
    maybe similar in speed and size

- On very preliminary evaluation of the model, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
4a90972e38 Use a better model for asymmetric semantic search
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
  - It doubles the encoding speed of all entries (down from ~8min to 4mins)
  - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)

[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
85077bc1d1 Handle unparseable date range passed via date filter in query
- Do not reuse the same list
- Just create new list, so only parsed data is in it
2022-07-14 22:47:23 +04:00