510 commits

Author SHA1 Message Date
Debanjum Singh Solanky
513c86c6a1 Set index file paths relative to current or default path on khoj backend
We need the index file paths to make sense on the khoj backend server

Making the index path on the backend relative to the current vault
directory on the frontend ignores the fact that the frontend may be on
a different machine than the khoj backend server

Using unique index name per vault allows switching vaults without
overwriting indices of other vaults created on khoj backend when khoj
obsidian plugin is loaded on opening a different vault
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
4407e23c19 Only index current vault on Khoj. Remove plugin setting to configure it
- Overview
  Limits Khoj to a single vault at a time. This is
  automatically configured to the most recently opened vault.

  Once directory filters are supported on the backend, the plugin will be
  updated to index multiple vaults but search only the current vault from
  the current vault's khoj obsidian plugin

- Code Details
 - Remove setting to configure Vault directory from Khoj Obsidian plugin
 - Automatically configure Khoj to index only current Vault.
 - Overwrites any previous vaults that were intended to be indexed by
   Khoj backend
 - Force update of index after configuring vault

- Why
  It's not helpful for now and can lead to more problems, confusion.
  Once directory filters
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
86a1e43605 Return HTTP Exception on /api/update API call failure
- Previously the backend was just throwing a backend error.
  The frontend calling the /update API wasn't getting notified
- Now the frontend can react appropriately and make the issue
  visible to the user
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
5af2b68e2b Update plugin notifications for errors and success
- Only show notification on plugin load and failure.
- In settings page, set current backend status at top of pane instead
  of showing notification
  Notice bubbles cluttered the UI while typing updates to settings
- Show notification once index updated via settings pane button click
  There was no notification on index updated, which usually takes time
  on the backend
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
853192932a setCTA on Khoj Obsidian plugin button. Minor cleanup of space, tabs 2023-01-10 23:36:02 -03:00
Debanjum Singh Solanky
da49ea272c Add placeholder text to modal in Khoj Obsidian plugin 2023-01-10 22:50:11 -03:00
Debanjum Singh Solanky
580f4aca23 Add hints to Modal for available Keybindings 2023-01-10 22:03:47 -03:00
Debanjum Singh Solanky
b52cd85c76 Allow Reranking Results using Keybinding from Khoj Search Modal 2023-01-10 21:59:38 -03:00
Debanjum Singh Solanky
7991ab7a86 Add button in Obsidian plugin settings to force re-indexing your vault 2023-01-10 19:49:12 -03:00
Debanjum Singh Solanky
f046a95f3d Track connectedToBackend as a setting. Use it across obsidian plugin
- Display warning at top of khoj obsidian plugin settings
- Make search command available only if connected to backend
- Show warning notice on clicking khoj search ribbon button

- Call saveData after configureKhojBackend to ensure the
  connectedToBackend setting is saved after being (potentially) updated
  in the configureKhojBackend function
2023-01-10 17:28:47 -03:00
Debanjum Singh Solanky
768e874185 Load obsidian plugin even if fail to connect to backend but show warning
- Previously the plugin would not load if it could not connect to the Khoj backend
  - Silently failing to load with no reason provided is not helpful
- Load plugin to allow user to fix the Khoj URL in their plugin setting
- Show reason for khoj plugin not working. More helpful than failing silently
2023-01-10 17:20:02 -03:00
Debanjum Singh Solanky
aa22d83172 Create and use a context manager to time code
Use the timer context manager in all places where code was being timed

- Benefits
  - Deduplicate timing code scattered across codebase.
  - Provides single place to manage perf timing code
  - Use consistent timing log patterns
2023-01-09 19:48:16 -03:00
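A minimal sketch of the kind of timing context manager commit aa22d83172 describes. The name `timer`, its parameters and the log format are assumptions for illustration, not taken from the khoj source.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timer(message: str, log: logging.Logger = logger):
    """Log how long the wrapped block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so timing is always logged
        elapsed = time.perf_counter() - start
        log.debug("%s: %.3f seconds", message, elapsed)


# Every call site uses the same pattern instead of hand-rolled start/end code
with timer("Sum a million integers"):
    total = sum(range(1_000_000))
```

Keeping the measurement in one helper means the log format and clock choice can be changed in a single place.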
Debanjum Singh Solanky
93f39dbd43 Add typing to text_search. Reformat code to set existing_embedding 2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
db7483329c Only import type hint packages for type checking. Avoids circular imports
Use annotations from the __future__ package to avoid having to quote
type hints. This import will not be required after Python 3.11
2023-01-09 19:47:27 -03:00
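The pattern described in db7483329c can be sketched as below. `decimal.Decimal` stands in for a project module that would otherwise cause a circular import; the function `double` is hypothetical.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers such as mypy, never at runtime,
    # so this import cannot take part in a circular import.
    from decimal import Decimal  # stand-in for a project module


def double(value: Decimal) -> Decimal:
    # With the __future__ import, annotations are stored as strings,
    # so `Decimal` does not need to be defined at runtime or quoted.
    return value * 2


result = double(3)
```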
Debanjum Singh Solanky
e5254a8e56 Create BaseEncoder class. Make OpenAI encoder its child. Use for typing
- Set type of all bi_encoders to BaseEncoder

- Make load_model return type Union of CrossEncoder and BaseEncoder
2023-01-09 19:47:27 -03:00
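One way to picture the base class from e5254a8e56 is the sketch below. Only the `BaseEncoder` name and the `encode` method come from the commit messages; the method signature and `ToyEncoder` are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class BaseEncoder(ABC):
    """Common interface every bi-encoder is expected to satisfy."""

    @abstractmethod
    def encode(self, entries: list[str], **kwargs) -> list[list[float]]:
        """Return one embedding vector per entry."""


class ToyEncoder(BaseEncoder):
    """Hypothetical stand-in for a concrete encoder such as the OpenAI one."""

    def encode(self, entries: list[str], **kwargs) -> list[list[float]]:
        # Fake 2-dimensional "embedding": character count and word count
        return [[float(len(entry)), float(len(entry.split()))] for entry in entries]


def embed(encoder: BaseEncoder, entries: list[str]) -> list[list[float]]:
    # Callers are typed against the base class, not a concrete encoder
    return encoder.encode(entries)


vectors = embed(ToyEncoder(), ["hello world", "khoj"])
```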
Debanjum Singh Solanky
cf7400759b Remove unused render_results method from text and image search
It's a relic from when khoj was being used as a python module
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
afcfc3cd62 Split text_search.query logic into separate methods for modularity
The query method had become too big.

Extract out filter, score, sort and deduplicate logic used by
text_search.query into separate methods.

This should improve readability of code.
2023-01-09 19:47:27 -03:00
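The split described in afcfc3cd62 might look roughly like the following. All function names, the `Hit` type and the toy scoring rule are hypothetical; the real khoj code scores entries with embeddings.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    score: float


def apply_filters(entries: list[str], excluded_word: str) -> list[str]:
    # Keep only entries that do not contain the excluded word
    return [entry for entry in entries if excluded_word not in entry.lower()]


def score_entries(query: str, entries: list[str]) -> list[Hit]:
    # Toy relevance score: number of query words found in the entry
    words = query.lower().split()
    return [Hit(entry, float(sum(word in entry.lower() for word in words))) for entry in entries]


def sort_hits(hits: list[Hit]) -> list[Hit]:
    return sorted(hits, key=lambda hit: hit.score, reverse=True)


def deduplicate(hits: list[Hit]) -> list[Hit]:
    # Keep the first (highest ranked) hit for each distinct text
    seen: set[str] = set()
    unique = []
    for hit in hits:
        if hit.text not in seen:
            seen.add(hit.text)
            unique.append(hit)
    return unique


def query(raw_query: str, entries: list[str], excluded_word: str) -> list[Hit]:
    # The top-level method now only wires the four steps together
    filtered = apply_filters(entries, excluded_word)
    return deduplicate(sort_hits(score_entries(raw_query, filtered)))


results = query(
    "index notes",
    ["Index my notes", "Buy milk", "Index my notes", "Private notes"],
    excluded_word="private",
)
```

Each step can then be tested and typed on its own.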
Debanjum Singh Solanky
8dc6ee8b6c Pass `model' arg to extract_search_type method from beta search API
Issue caught by mypy
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
8498903641 Fix, add typing to Filter and TextSearchModel classes
- Changes
  - Fix method signatures of BaseFilter subclasses.
    Else typing information isn't translating to them
  - Explicitly pass `entries: list[Entry]' as arg to `load' method
  - Fix type of `raw_entries' arg to `apply' method
    to list[Entry] from list[str]
  - Rename `raw_entries' arg to `apply' method to `entries'
  - Fix `raw_query' arg used in `apply' method of subclasses to `query'
  - Set type of entries, corpus_embeddings in TextSearchModel

- Verification
  Ran `mypy --config-file .mypy.ini src' to verify typing
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
eace7c6215 Use torch.tensor as torch.Tensor cannot create tensor on MPS device
- `torch.Tensor' is apparently a legacy tensor constructor
- Using that to create tensor on MPS devices throws error:
  RuntimeError: legacy constructor expects device type: cpu but device type: mps was passed
- `torch.tensor' can handle creating tensors on Mac GPU (MPS) fine
2023-01-09 19:47:19 -03:00
Debanjum Singh Solanky
9def3f8c6f Add exception handling to beta APIs, in case OpenAI API call fails 2023-01-09 01:27:06 -03:00
Debanjum Singh Solanky
7b164de021 Add beta API to summarize top search result using an OpenAI model
This is unlike the more general chat API that combines summarization
of top search result and conversing with the OpenAI model

This should give faster summary results, as no intent categorization
API call is required
2023-01-09 01:25:59 -03:00
Debanjum Singh Solanky
d36da46f7b Truncate prompt to not exceed OpenAI prompt limit
Truncate prompt containing the top retrieved entry to 500 words to
avoid triggering the max_token limit error
2023-01-09 00:51:46 -03:00
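A small sketch of the word-based truncation from d36da46f7b. The helper name `truncate_to_words` is an assumption; only the 500 word limit comes from the commit message.

```python
def truncate_to_words(text: str, max_words: int = 500) -> str:
    """Keep at most `max_words` whitespace-separated words of `text`."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


long_entry = " ".join(f"word{i}" for i in range(1200))
prompt_entry = truncate_to_words(long_entry)
```

Words are only a rough proxy for model tokens, so a word limit leaves some headroom rather than guaranteeing an exact token count.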
Debanjum Singh Solanky
237123d18c Fix tests for the conversation processor
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibility
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky
918af5e6f8 Make OpenAI conversation model configurable via khoj.yml
- Default to using `text-davinci-003' if conversation model not
  explicitly configured by user. Stop using the older `davinci' and
  `davinci-instruct' models

- Use `model' instead of `engine' as parameter.
  Usage of `engine' parameter in OpenAI API is deprecated
2023-01-09 00:17:51 -03:00
Debanjum Singh Solanky
74e779f8d0 Fix /beta/chat API to use Entry class instead of old dictionary pattern
Search returns response of type SearchResponse instead of a dict now
2023-01-08 15:28:26 -03:00
Debanjum Singh Solanky
f2436039a0 Improve readability of GPT prompt strings in conversation processor 2023-01-08 15:27:41 -03:00
Debanjum Singh Solanky
6119005838 Improve comments, exceptions, typing and init of OpenAI model code 2023-01-08 00:36:18 -03:00
Debanjum Singh Solanky
c0ae8eee99 Allow using OpenAI models for search in Khoj
- Init processor before search to instantiate `openai_api_key'
  from `khoj.yml'. The key is used to configure search with openai models

- To use OpenAI models for search in Khoj
  - Set `encoder' to name of an OpenAI model. E.g. `text-embedding-ada-002'
  - Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI'
  - Set `model-directory' to `null', as online model cannot be stored on disk
2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
826f9dc054 Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.

Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
2023-01-07 23:13:56 -03:00
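The filtering in 826f9dc054 can be illustrated as follows. The helper name is hypothetical, and this sketch normalizes whitespace between words, which the actual implementation may not do.

```python
def drop_long_words(text: str, max_word_length: int = 500) -> str:
    """Remove words longer than `max_word_length` characters."""
    # Note: split() and join() collapse runs of whitespace into single spaces
    return " ".join(word for word in text.split() if len(word) <= max_word_length)


entry = "Meeting notes " + "x" * 600 + " follow up tomorrow"
cleaned = drop_long_words(entry)
```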
Debanjum Singh Solanky
6a30a13326 Only create model directory if the optional field is set in SearchConfig 2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
2fe37a090f Make type of encoder to use for embeddings configurable via khoj.yml
- Previously `model_type' was set in the setup of each `search_type'
  - All encoders were of type `SentenceTransformer'
  - All cross_encoders were of type `CrossEncoder'

- Now `encoder-type' can be configured via the new `encoder_type' field
  in `TextSearchConfig' under `search-type' in `khoj.yml'.

- All the specified `encoder-type' class needs is an `encode' method
  that takes entries and returns embedding vectors
2023-01-07 23:09:12 -03:00
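A configurable class path such as the `encoder-type' value could be resolved as sketched below. This is one common approach, not necessarily how khoj does it; `resolve_class` is a hypothetical helper and a standard library class stands in for an encoder class.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import `package.module.ClassName` and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in path; a real config would name a class with an `encode' method
encoder_class = resolve_class("collections.OrderedDict")
instance = encoder_class()
```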
Debanjum Singh Solanky
d55d7d53dc Fix GPU usage by Khoj on Macs to speed up search and indexing
- Ensure all tensors are on MPS device before doing operations across them

- Background
  - GPU is used by default for Khoj on MacOS now
    - Needed PyTorch > 1.13.0 on Macs to use GPU, which we do now
  - MPS should speed up search and indexing on MacOS
2023-01-05 15:39:09 -03:00
Debanjum
abd035e2fa Merge PR #112 to fix quote usage in khoj.el docstring from suliveevil/master
Fix usage warning for unescaped single quote in `khoj.el' docstring. 
Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs
2023-01-05 13:24:11 -03:00
Debanjum Singh Solanky
e792523849 Bump version in metadata packages for khoj, khoj.el and obsidian plugin 2023-01-05 12:50:27 -03:00
suliveevil
b2812b409f fix docstring usage warning
 Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
2023-01-05 16:47:38 +08:00
Debanjum Singh Solanky
47015ee6cc Fold Demo video descriptions, analysis by default in main Readme 2023-01-04 20:13:43 -03:00
Debanjum Singh Solanky
da17ff6ac8 Add Upgrade instructions for Khoj.el Readme. Fix version of khoj.el 2023-01-04 20:06:39 -03:00
Debanjum Singh Solanky
66ccd0c970 Create Obsidian plugin for Khoj
- Features
  - Search using Khoj from within the Obsidian app
    Allow Natural language search on your (markdown) notes in Obsidian Vault

  - Show search results as rendered (instead of raw) Markdown
    Improve legibility of the results

  - Jump to selected note from search result in Khoj search modal
    Simplify seeing result within its original note context

  - Automatically configure khoj to index markdown files in current vault
    Reduce khoj setup steps for plugin users by using reasonable defaults

    - Code updates the markdown config in khoj.yml and triggers index update
    - It can be configured by user in khoj plugin settings, if required

  - Add Demo and detailed Readme for the Obsidian plugin
    Ease setup and usage. Give context about capabilities

- Miscellaneous
  - Trying to keep a mono repo, to reduce maintenance burden,
    until the Khoj project is mature enough
2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky
feddb6ce62 Add start_url to khoj webmanifest to show Khoj as PWA on Chrome 2023-01-04 13:37:56 -03:00
Debanjum Singh Solanky
3dee1aed9e Create /config/data/default API endpoint to serve default khoj config
This can ease configuring khoj from the different interfaces

- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
  configure khoj.
2023-01-03 21:52:34 -03:00
Debanjum Singh Solanky
ce945f7a90 Configure processors too on calling /update API
- Previously only search was being reconfigured
- But Processors are configured on app start too
- Match that behavior on calling /update API
2023-01-03 21:51:02 -03:00
Debanjum Singh Solanky
9d31988f42 Allow starting khoj in non-GUI mode without config file instantiated
- Start khoj server (in non-GUI mode) without needing config file
  already instantiated.
  - But throw warning to configure khoj to use it
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
  in khoj
2023-01-03 21:36:59 -03:00
Debanjum Singh Solanky
52664dd96c Allow recursive glob pattern (**) to add files to search index
- Simplify configuring files to index in khoj.yml using `input-filter'
  for Obsidian/Org-Roam type systems with lots of small files
2023-01-03 01:32:58 -03:00
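The effect of a recursive `**` pattern, as in 52664dd96c, can be seen with Python's standard `glob` module. The vault layout below is made up for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as vault:
    # Build a small nested vault with three markdown notes and one org file
    os.makedirs(os.path.join(vault, "projects", "khoj"))
    for parts in [("inbox.md",), ("projects", "plan.md"), ("projects", "khoj", "ideas.md"), ("projects", "todo.org")]:
        with open(os.path.join(vault, *parts), "w") as note:
            note.write("# note\n")

    # With recursive=True, `**` matches zero or more nested directories
    pattern = os.path.join(vault, "**", "*.md")
    matched = sorted(os.path.basename(path) for path in glob.glob(pattern, recursive=True))
```

A single pattern therefore covers notes at any depth, instead of one pattern per directory level.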
Debanjum Singh Solanky
152e5f1661 Return the file of each search result in response
- Useful for enabling jump to note functionality in interfaces
- It will be used in the Khoj plugin for Obsidian
2023-01-03 01:25:34 -03:00
Debanjum Singh Solanky
c535953915 Update index automatically in non GUI mode too
- Poll scheduler every minute using threading.Timer
  - Use 60 seconds polling interval to avoid fork bombing
- Schedule next via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
2023-01-01 21:03:19 -03:00
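A self-rescheduling poller in the spirit of c535953915 might look like this. The function name and the stop event are assumptions added for the sketch; the demo uses a 50 ms interval where the commit uses 60 seconds.

```python
import threading
import time


def start_polling(run_pending, interval: float, stop: threading.Event) -> None:
    """Run pending jobs once, then re-arm a timer for the next poll."""
    if stop.is_set():
        return
    run_pending()
    timer = threading.Timer(interval, start_polling, args=(run_pending, interval, stop))
    timer.daemon = True  # a daemon thread does not block interpreter shutdown
    timer.start()


# Demo: record the time of each poll for a short while, then stop
calls: list[float] = []
stop = threading.Event()
start_polling(lambda: calls.append(time.monotonic()), interval=0.05, stop=stop)
time.sleep(0.3)
stop.set()
```

Because each poll arms exactly one new timer, at most one pending timer exists at a time regardless of the interval.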
Debanjum Singh Solanky
701d92e17b Lock the index before updating it via API or Scheduler
- There are 3 paths to updating/setting the index (stored in state.model)
  - App start
  - API
  - Scheduler

- Put all updates to the index behind a lock, as multiple update paths
  could (potentially) run at the same time (via API or Scheduler)
2023-01-01 17:09:36 -03:00
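The locking described in 701d92e17b can be sketched with a plain `threading.Lock`. The `index` dictionary and `update_index` function are hypothetical stand-ins for `state.model` and its update paths.

```python
import threading

index_lock = threading.Lock()
index: dict[str, list[str]] = {"entries": []}


def update_index(source: str, new_entries: list[str]) -> None:
    # Only one caller (API request or scheduled job) rebuilds at a time
    with index_lock:
        rebuilt = list(index["entries"])
        rebuilt.extend(f"{source}: {entry}" for entry in new_entries)
        index["entries"] = rebuilt


threads = [
    threading.Thread(target=update_index, args=("api", ["note a", "note b"])),
    threading.Thread(target=update_index, args=("scheduler", ["note c"])),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Without the lock, two concurrent read-modify-write updates could each start from the same old index and one set of entries would be lost.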
Debanjum Singh Solanky
3b0783aab9 Automate updating embeddings, search index on an hourly schedule
- Use the schedule pypi package
- Use QTimer to poll schedule.run_pending() regularly for jobs to run
2023-01-01 17:09:36 -03:00
Debanjum
06c25682c9 Split text entries by max tokens supported by ML models
### Background
There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated)

### Issue
Until now, entries exceeding max token size would silently get truncated during embedding generation.
So the truncated portion of the entries would be ignored when matching queries with entries.
This would degrade the quality of the results.

### Fix
- e057c8e Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- b283650 Deduplicate results for user query by raw text before returning results

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
2022-12-26 18:23:43 +00:00
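The splitting step from 06c25682c9 can be approximated as below, treating whitespace-separated words as tokens as the message above does. The helper name is an assumption, and the real implementation may preserve entry metadata that this sketch ignores.

```python
def split_by_max_tokens(entry: str, max_tokens: int = 256) -> list[str]:
    """Split `entry` into chunks of at most `max_tokens` words."""
    words = entry.split()
    return [" ".join(words[start:start + max_tokens]) for start in range(0, len(words), max_tokens)]


long_entry = " ".join(f"w{i}" for i in range(600))
chunks = split_by_max_tokens(long_entry)
chunk_sizes = [len(chunk.split()) for chunk in chunks]
```

Every chunk then fits within the model's input limit, so no part of a long entry is silently dropped when embeddings are generated.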
Debanjum Singh Solanky
17fa123b4e Split entries by max tokens while converting Beancount entries To JSONL 2022-12-26 15:14:32 -03:00