Commit graph

1695 commits

Author SHA1 Message Date
Debanjum
d153d420fc
Improve Latency of Explicit Filter
### Goal
  - Improve explicit filter latency to work better with incremental search

### Reasons for High Explicit Filter Latency
  - Deleting entries to be excluded from existing list of entries, embeddings
  - Explicit filtering on partial words during incremental search
  - Creating word set for all entries on the fly during query
  - Deep copying of entries, embeddings before applying filter

### Improvement Details
  - **Major**
    - 191a656 Use word to entry map, list comprehension to speed up explicit filter
      - Use list comprehension and `torch.index_select` methods
        - to speed selection of entries, embedding tensors satisfying filter
        - avoid deep copy and direct manipulation of entries, embeddings
      - Use word to entry map and set operations to mark entries 
        satisfying inclusion, exclusion filters
    - c7de57b Pre-compute entry word sets to improve explicit filter query performance
    - 3308e68 Cache explicitly filtered entries, embeddings by required, blocked words
    - cdcee89 Wrap explicit filter words in quotes to trigger filter
      - E.g `+"word_to_include"` instead of `+word_to_include`
      - Signals explicit filter term completed
      - Prevents latency due to incremental search with explicit filtering on partial terms
  - **Minor**
    - 28d3dc1 Deep copy entries, embeddings in filters. Defer till actual filtering
    - 8d9f507 Load entries_by_word_set from file only once on first load of explicit filter
    - 546fad5 Use regex to check for and extract include, exclude filter words from query
    - b7d259b Test Explicit Include, Exclude Filters
    
### Results
  - Improve exclude word filter latency from **20s+ to 0.02s** on 120K line notes corpus
2022-09-04 13:55:17 +00:00
Debanjum Singh Solanky
6087862521 Use LRU helper class for explicit filter cache 2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky
191a656ed7 Use word to entry map, list comprehension to speed up explicit filter
- Code Changes
  - Use list comprehension and `torch.index_select' methods
    - to speed selection of entries, embedding tensors satisfying filter
    - avoid deep copy of entries, embeddings
    - avoid updating existing lists (of entries, embeddings)

  - Use word to entry map and set operations to mark entries satisfying
    inclusion, exclusion filters

- Results
  - Speed up explicit filtering by two orders of magnitude
  - Improve consistency of speed up across inclusion and exclusion filtering
2022-09-04 15:22:35 +03:00
Debanjum Singh Solanky
28d3dc1434 Deep copy entries, embeddings in filters. Defer till actual filtering
- Only the filter knows when entries, embeddings are to be manipulated.
  So move the responsibility to deep copy before manipulating entries,
  embeddings to the filters

- Create deep copy in filters. Avoids creating deep copy of entries,
  embeddings when filter results are being loaded from cache etc
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
3308e68edf Cache explicitly filtered entries, embeddings by required, blocked words 2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query in which case no space
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
8d9f507df3 Load entries_by_word_set from file only once on first load of explicit filter 2022-09-04 00:37:37 +03:00
Debanjum Singh Solanky
858d86075b Use regexes to check if any explicit filters in query. Test can_filter 2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky
546fad570d Use regex to extract include, exclude filter words from query 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
b7d259b1ec Test Explicit Include, Exclude Filters 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
ffb8e3988e Use Python Logging Framework to Time Performance of Explicit Filter 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
30c3eb372a Update Tests to Configure Filters and Setup Text Search 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
c7de57b8ea Pre-compute entry word sets to improve explicit filter query performance 2022-09-03 16:16:31 +03:00
Debanjum Singh Solanky
094bd18e57 Use python standard logging framework for app logs
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
  - verbose = 0 maps to logging.WARN
  - verbose = 1 maps to logging.INFO
  - verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky
d0531c3064 Update URL QueryParam when Type set in Dropdown on Web Interface
- This also pushes the updated URL state to history
- Allows jumping back to the web interface after clicking on an image
  and having the type set to image search
- Previously type would get reset to the default search type on
  jumping back
2022-08-28 12:22:22 +03:00
Debanjum Singh Solanky
2eae32d743 Time, Log Image Search Performance 2022-08-28 00:28:46 +03:00
Debanjum Singh Solanky
c3ca99841b Scale down images to generate image embeddings faster, with less memory
- CLIP doesn't need full size images for generating embeddings with
  decent search results. The sentence transformers docs use images
  scaled to 640px width

- Benefits
  - Normalize image sizes
  - Increase image embeddings generation speed
  - Decrease memory usage while generating embeddings from images
2022-08-24 14:09:02 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum Singh Solanky
5e107eedc0 Rename test_asymmetric_search to now more appropriate test_text_search 2022-08-21 18:36:14 +03:00
Debanjum
144986ebfd
Fix, Improve Desktop GUI Splash Screen and Main Window
- 5e6625a Fix file browser to not add empty line when no file/dir selected
- 8098b8c Bring main window to Top when open from System Tray
- 1c122a8 Place window near top so buttons are not hidden by OS bottom bar
- dfe2546 Set Khoj Icon on Main Desktop Window
- 1b1f8f9 Move Splash screen text below icon. Set the text color to black
- 450f644 Fix path to remove shared libraries when packaging the Windows app
2022-08-20 23:19:01 +00:00
Debanjum
d6f624dc75
Use MPS, CUDA to GPU Accelerate Query Performance
- Load Models and Embeddings onto GPU if available
- Use MPS for GPU acceleration when available
  - Note: Support for [MPS](https://developer.apple.com/metal/) in Pytorch is currently in v1.13.0 nightly builds. See [Announcement](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/)
  - Users will have to wait for PyTorch MPS support to land in stable builds
  - Until then code can be tuned and tested for GPU acceleration on newer Macs
- Re-enable Tests for Image Search
2022-08-20 23:16:44 +00:00
Debanjum Singh Solanky
5e6625ac68 Fix file browser to not add empty line when no file/dir selected
- When no file selected in file browser an empty line/entry gets added
  to input entries list
- Bug got introduced due to insufficient update on change to add
  instead of insert
- Update is_none_or_empty helper method to also check for empty string
2022-08-21 02:03:28 +03:00
Debanjum Singh Solanky
8098b8c3a8 Bring Configure Window to Top when Opened from System Tray
- Previously the window could get hidden behind other app windows when
  user clicked configure from the system tray
2022-08-20 23:38:43 +03:00
Debanjum Singh Solanky
1c122a8a91 Place window near top so buttons are not hidden by OS bottom bar 2022-08-20 22:38:06 +03:00
Debanjum Singh Solanky
dfe2546c04 Set Khoj Icon on Main Desktop Window 2022-08-20 20:36:15 +03:00
Debanjum Singh Solanky
1b1f8f9272 Move Splash screen text below icon. Set the text color to black
- It is currently on top of the splash screen icon
- Ballpark pixels to move such that text positioned below icon
- Test later to verify if text is positioned fine now
2022-08-20 20:32:31 +03:00
Debanjum Singh Solanky
450f6441e2 Fix path to remove shared libraries when packaging the Windows app 2022-08-20 20:29:33 +03:00
Debanjum Singh Solanky
e6abe76875 Upgrade torch, torchvision package versions 2022-08-20 14:45:43 +03:00
Debanjum Singh Solanky
972523e8a9 Re-enable tests for image search
Verify if recent fixes resolve test flakiness
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
acc9091260 Use MPS on Apple Mac M1 to GPU accelerate Encode, Query Performance
- Note: Support for MPS in Pytorch is currently in v1.13.0 nightly builds
  - Users will have to wait for PyTorch MPS support to land in stable builds
- Until then the code can be tweaked and tested to make use of the GPU
  acceleration on newer Macs
2022-08-20 14:44:06 +03:00
Debanjum Singh Solanky
7de9c58a1c Load models, corpus embeddings onto GPU device for text search, if available
- Pass device to load models onto from app state.
  - SentenceTransformer models accept device to load models onto during initialization
- Pass device to load corpus embeddings onto from app state
2022-08-20 14:04:18 +03:00
Debanjum Singh Solanky
7fe3e844d2 Fix setup of Reproducible Build environment in publish workflow
- Note: Reproducible builds have not been validated.
  This is just preliminary work to get there.
  Further testing and fixes maybe required
2022-08-19 21:00:12 +03:00
Debanjum Singh Solanky
dc8dcc94a6 Bump Khoj.el package version to 0.1.6 2022-08-19 20:48:42 +03:00
Debanjum
a7b4d58865
Fix Image Search and Improve Desktop App
### Fix Image Search
  - Do not use XMP metadata by default for image search
    - It seems to be buggy currently. The returned results do not make sense with XMP metadata enabled

### Fix Image Search using Desktop App
  - Fix configuring Image Search via Desktop GUI
    - Set `input-directories`, instead of unused `input-files` for `content-type.image` in `khoj.yml`
  - Fix running Image Search via Desktop apps. 
    - Previously the transformers wasn't getting packaged into the app by pyinstaller
    - This is required by image search to run. So the desktop apps would fail to start when image search was enabled
    - Resolves #68
  - Append selected files, directories via "Add" button in Desktop GUI
    - This allows selecting multiple files, directories using Desktop GUI
    - Previously selecting multiple image directories had to be entered manually

### Improve Desktop App
  - Show Splash Screen to Desktop on App Initialization
    - The app takes a while to load during first run 
    - A splash screen signals that app is loading and not being unresponsive
    - Note: _Pyinstaller only supports splash screens on Windows, Linux. Not on Macs._
  - Add Khoj icon to the Windows, Linux app. Windows expects a `.ico` icon type
  - Only exclude `libtorch_{cuda, cpu, python}` on Linux machine
    - Seems those libraries are being used on Mac (and maybe Windows). 
    - Linux is where the app size benefits from removing these is maximum anyway
  - Fix PyInstaller Warnings on App Start
    - The warning show up as annoying error popups on Windows
2022-08-19 17:37:09 +00:00
Debanjum Singh Solanky
b9a54c03ee Add transformers package into Khoj app to run image search 2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
ffbf15eff8 Add helper function to identify when app running as pyinstaller app
Useful for when want the app to behave differently in pyinstaller app
scenario with frozen python. And in development scenarios
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
6c5c1c33c1 Turn off Tokenizers Parallelism. Khoj doesn't support it right now
- Forking and multiprocess are problemantic in frozen python
  scenarios. This will cause issues when running App packaged by
  pyinstaller
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
d4072974d7 Use of XMP metadata in Khoj Image Search is broken. Disable by default
- CLIP Image score and XMP metadata score are not combining well.
  When combined they give non sensical results. Enable only once
  figure how best to combine the two.

- Show scores with higher precision for image search
  - Image search scores seem to be mostly be between 0.2 - 0.3 for some reason
  - Higher precision scores make it easier to understand the quality
    of returned results perceived by the model itself
2022-08-19 19:17:28 +03:00
Debanjum Singh Solanky
7c4417126c Append files, directories selected by user to config in Desktop GUI
- Allows adding multiple image directories via GUI
- Allow adding multiple files in different directories via GUI
- Previously users couldn't add multiple directories via GUI
  They'd have to manually append to input field if multiple files, directories
- To clear/overwrite is much easier.
  The user can just select text to delete in input area
2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
00ddcfdac8 Use .ico icon when packaging for Windows (and Linux) using Pynstaller 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
812838da54 Only exclude libtorch_cuda, libtorch_cpu under torch/lib on Linux
- On Mac excluding the .dylib version of these files throws errors
  Not sure why it didn't throw during testing. Maybe the libs were cached?

- Tested on Linux again. It still seems to be passing with the above
  libs excluded. So going to keep those excluded for now. Unless
  further testing reveals those libs are really required for app
2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
6ddcbe2e75 Remove files that triggered warnings during app start 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
60dacf3f2c Show splash screen on app start. Only supported on Windows, Linux 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
0079c13bf7 Set input-directories in config for image search type on Desktop GUI
- Issue
  Fix configuring image search from Desktop GUI. It was broken before.
  The Desktop GUI was updating input-files field under content-type > image.
  This field is not used for image search. So image search couldn't be
  configured from the Desktop GUI

- Fix
  - Set input-directories when field of search type image is set from GUI
  - Otherwise set input-files field in config
2022-08-18 18:29:55 +03:00
Debanjum Singh Solanky
082fe937b9 Reduce Windows App Size by Removing Unused Libraries under Torch/Lib
Tighten the duplicate library removal code in Khoj.spec
2022-08-18 11:28:28 +03:00
Debanjum
b78ee317ae
Reduce Debian, Mac App Size. Remove unused libraries under Torch/Lib
## Changes
- On **Debian**
  - `libtorch_cuda.so` (1Gb) and `libtorch_cpu.so` (700Mb) are large shared libs
  - They are available at package root and under `torch/lib` directory in the package
  - We remove the unused, duplicate libraries from under `torch/lib` as only the top level libraries are used
- On **Mac**
  - Remove `libtorch_{cpu,python}.dylib` under `torch/lib` directory from the Mac app.

## App Size Reduction
  - **Debian amd64** app size by **42%** from **1.6Gb to 920Mb**  
  - **Mac arm64** app by **15%** from **190Mb to 160Mb**
  - **Mac amd64** app by **33%** from **340Mb to 230Mb**

## Reference
- [Release Workflow Run](https://github.com/debanjum/khoj/actions/runs/2878104171) after changes
- [Release Workflow Run](https://github.com/debanjum/khoj/actions/runs/2869906116) before changes
2022-08-17 20:48:56 +00:00
Debanjum Singh Solanky
d25ddb93f7 Fix missing closing bracket from SOURCE_DATE_EPOCH def in release.yml 2022-08-17 23:17:27 +03:00
Debanjum Singh Solanky
9ee02b0804 Add --noconfirm in call to pyinstaller from Github release workflow
Added just for safety, workflow works fine without it too
2022-08-17 23:04:26 +03:00