Commit graph

2210 commits

Author SHA1 Message Date
Debanjum Singh Solanky
7b164de021 Add beta API to summarize top search result using an OpenAI model
This is unlike the more general chat API that combines summarization
of top search result and conversing with the OpenAI model

This should give faster summary results, as no intent categorization
API call is required
2023-01-09 01:25:59 -03:00
Debanjum Singh Solanky
d36da46f7b Truncate prompt to not exceed OpenAI prompt limit
Truncate prompt containing the top retrieved entry to 500 words to
avoid triggering the max_token limit error
2023-01-09 00:51:46 -03:00
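
The truncation described in d36da46f7b could look roughly like the sketch below. The function name `truncate_to_max_words` and the whitespace-based word split are assumptions for illustration, not the code from the commit.

```python
def truncate_to_max_words(text: str, max_words: int = 500) -> str:
    """Keep at most `max_words` whitespace-separated words of `text`.

    Hypothetical helper: the 500 word default mirrors the limit named in
    the commit message; the real implementation may count words differently.
    """
    words = text.split()
    if len(words) <= max_words:
        # Short enough already: return the text untouched
        return text
    # Note: joining on a single space normalizes whitespace in the kept part
    return " ".join(words[:max_words])
```

Counting words is only a cheap proxy for the model's token count, so a word limit needs enough headroom to stay under the actual token limit.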
Debanjum Singh Solanky
237123d18c Fix tests for the conversation processor
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibility
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky
918af5e6f8 Make OpenAI conversation model configurable via khoj.yml
- Default to using `text-davinci-003' if conversation model not
  explicitly configured by user. Stop using the older `davinci' and
  `davinci-instruct' models

- Use `model' instead of `engine' as parameter.
  Usage of `engine' parameter in OpenAI API is deprecated
2023-01-09 00:17:51 -03:00
Debanjum Singh Solanky
7e05389776 Quote all values passed to input-filter fields in sample yaml files 2023-01-08 22:40:18 -03:00
Debanjum Singh Solanky
0440f3fd57 Add encoder-type field to the search-type sections in khoj_sample.yml 2023-01-08 22:07:13 -03:00
Debanjum Singh Solanky
8b8e202ab3 Set input-filter to list in khoj_docker.yml and khoj_sample.yml
`input-filter' was converted to a list a while back but the sample
khoj configs were not updated to reflect this. This change fixes that
2023-01-08 21:08:00 -03:00
Debanjum Singh Solanky
74e779f8d0 Fix /beta/chat API to use Entry class instead of old dictionary pattern
Search returns response of type SearchResponse instead of a dict now
2023-01-08 15:28:26 -03:00
Debanjum Singh Solanky
f2436039a0 Improve readability of GPT prompt strings in conversation processor 2023-01-08 15:27:41 -03:00
Debanjum
1c091e509b
Make Encoder Type Configurable. Allow using OpenAI Model for Search
- 2fe37a0 Make type of encoder to use for embeddings configurable via `khoj.yml'
  - Previously `encoder_type' was set in the setup code of search_type
    - All *encoders* were of type `SentenceTransformer'
    - All *cross_encoders* were of type `CrossEncoder'
  - Now the `encoder_type' can be configured via the new `encoder_type' field 
    in `TextSearchConfig' under `search_type` in `khoj.yml'
  - All the specified `encoder-type' class needs is an `encode' method
    that takes entries and returns embedding vectors
  
- 826f9dc Drop long words from compiled entries to be within max token limit of models
  Long words (>500 characters) provide less useful context to models.
   
  Dropping very long words allows models to create better embeddings by
  passing more of the useful context from the entry to the model

- c0ae8ee Allow using OpenAI models for search in Khoj
  To use OpenAI models for search in Khoj, in `~/.khoj/khoj.yml'
  1. Set `encoder' to name of an OpenAI model. E.g *text-embedding-ada-002*
  2. Set `encoder-type' to *src.utils.models.OpenAI*
  3. Set `model-directory` to *null*, as this is an online model and
     cannot be stored on the file system
2023-01-08 11:10:25 -03:00
Debanjum Singh Solanky
6119005838 Improve comments, exceptions, typing and init of OpenAI model code 2023-01-08 00:36:18 -03:00
Debanjum Singh Solanky
c0ae8eee99 Allow using OpenAI models for search in Khoj
- Init processor before search to instantiate `openai_api_key'
  from `khoj.yml'. The key is used to configure search with openai models

- To use OpenAI models for search in Khoj
  - Set `encoder' to name of an OpenAI model. E.g text-embedding-ada-002
  - Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI'
  - Set `model-directory' to `null', as online model cannot be stored on disk
2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
826f9dc054 Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.

Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
2023-01-07 23:13:56 -03:00
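
A minimal sketch of the filtering that 826f9dc054 describes, assuming words are separated by whitespace. The name `drop_long_words` and its parameter are hypothetical.

```python
def drop_long_words(text: str, max_word_length: int = 500) -> str:
    """Remove words longer than `max_word_length` characters from `text`.

    Hypothetical helper: the 500 character default follows the threshold
    named in the commit message (words of more than 500 characters are dropped).
    """
    kept_words = [word for word in text.split() if len(word) <= max_word_length]
    return " ".join(kept_words)
```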
Debanjum Singh Solanky
6a30a13326 Only create model directory if the optional field is set in SearchConfig 2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
2fe37a090f Make type of encoder to use for embeddings configurable via khoj.yml
- Previously `model_type' was set in the setup of each `search_type'
  - All encoders were of type `SentenceTransformer'
  - All cross_encoders were of type `CrossEncoder'

- Now `encoder-type' can be configured via the new `encoder_type' field
  in `TextSearchConfig' under `search-type` in `khoj.yml`.

- All the specified `encoder-type' class needs is an `encode' method
  that takes entries and returns embedding vectors
2023-01-07 23:09:12 -03:00
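
The only requirement stated in 2fe37a090f is an `encode' method that takes entries and returns embedding vectors. The sketch below shows that contract with a toy class. `Encoder`, `ToyLengthEncoder` and `compute_embeddings` are hypothetical names, and the "embeddings" are simple length features rather than real model output.

```python
from typing import List, Protocol, Sequence


class Encoder(Protocol):
    """Structural type: any class with a matching `encode` method qualifies."""

    def encode(self, entries: Sequence[str]) -> List[List[float]]: ...


class ToyLengthEncoder:
    """Hypothetical encoder mapping each entry to [char_count, word_count]."""

    def __init__(self, model_name: str = "toy"):
        self.model_name = model_name

    def encode(self, entries: Sequence[str]) -> List[List[float]]:
        return [[float(len(entry)), float(len(entry.split()))] for entry in entries]


def compute_embeddings(encoder: Encoder, entries: Sequence[str]) -> List[List[float]]:
    # The caller depends only on `encode`, not on the concrete encoder class
    return encoder.encode(entries)
```

Because the caller relies only on `encode`, an encoder class can be swapped through configuration without changing the code that computes embeddings.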
Debanjum Singh Solanky
fa92adcf0d Add Visualization of Codebase to Readme under Development Section
Source from Github vNext Repo Visualizer at
https://githubnext.com/projects/repo-visualization/
2023-01-05 20:11:56 -03:00
Debanjum Singh Solanky
8c7ffd7aee Add Readme doc to fix failure to build tokenizer dependency 2023-01-05 20:11:56 -03:00
Debanjum Singh Solanky
d55d7d53dc Fix GPU usage by Khoj on Macs to speed up search and indexing
- Ensure all tensors are on MPS device before doing operations across them

- Background
  - GPU is used by default for Khoj on MacOS now
    - Needed PyTorch > 1.13.0 on Macs to use GPU, which we do now
  - MPS should speed up search and indexing on MacOS
2023-01-05 15:39:09 -03:00
Debanjum Singh Solanky
7380518f24 Upgrade PyTorch, Pillow version to resolve Dependabot Security Advisories
This also enables GPU usage by Khoj on MacOS as MPS support is now in
PyTorch mainline
2023-01-05 15:39:09 -03:00
Debanjum
abd035e2fa
Merge PR #112 to fix quote usage in khoj.el docstring from suliveevil/master
Fix usage warning for unescaped single quote in `khoj.el' docstring. 
Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs
2023-01-05 13:24:11 -03:00
Debanjum Singh Solanky
1dc1472c55 In publish workflow, make twine upload verbose to troubleshoot 2023-01-05 12:56:46 -03:00
Debanjum Singh Solanky
e792523849 Bump version in metadata packages for khoj, khoj.el and obsidian plugin 2023-01-05 12:50:27 -03:00
suliveevil
b2812b409f
fix docstring usage warning
 Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
2023-01-05 16:47:38 +08:00
Debanjum Singh Solanky
3d1199540c Update the publish workflow to also run on any tag push 2023-01-04 20:47:23 -03:00
Debanjum Singh Solanky
4842daca5f Run releases workflow on pushing any tag. 'v' prefix not required
Obsidian for some reason cannot pick up plugin assets from releases
made with prefixed tags
2023-01-04 20:27:56 -03:00
Debanjum Singh Solanky
47015ee6cc Fold Demo video descriptions, analysis by default in main Readme 2023-01-04 20:13:43 -03:00
Debanjum Singh Solanky
da17ff6ac8 Add Upgrade instructions for Khoj.el Readme. Fix version of khoj.el 2023-01-04 20:06:39 -03:00
Debanjum
65917eb5c9
Create Obsidian plugin for Khoj
### Plugin Features
  - Search Obsidian notes using Khoj
    *Provide Natural language search on your (markdown) notes in Obsidian Vault*

  - Show search results as rendered Markdown
    *Improve legibility of the results*

  - Jump to selected note from search result in Khoj search modal
    *Simplify seeing result within its original note context*

  - Automatically configure khoj to index markdown files in current vault
    *Reduce khoj setup steps for plugin users by using reasonable defaults*

    - Code updates the markdown config in `khoj.yml` and triggers index update
    - It can be configured by user in khoj plugin settings, if required

  - Add Demo and detailed Readme for the Obsidian plugin
    *Ease setup and usage. Give context about capabilities*

### Miscellaneous
  - (Try) Keep a mono repo until the Khoj project is mature enough
    to reduce maintenance burden

### Commits Details
  - 0e39e0f Add details about the Khoj Obsidian plugin to the main Readme
  - cd8b918 Add `manifest.json`, `versions.json` of Obsidian plugin to project root
  - 66ccd0c Create Obsidian plugin for Khoj
2023-01-04 20:02:42 -03:00
Debanjum Singh Solanky
3dd69f7505 Add Upgrade instructions for Obsidian, Emacs to main Readme 2023-01-04 19:50:26 -03:00
Debanjum Singh Solanky
0e39e0ff71 Add details about the Khoj Obsidian plugin to the main Readme
- Add Khoj in Obsidian Demo

- Update Interfaces Screenshot to include Obsidian Plugin Screenshot

- Update .gitignore to ignore obsidian plugin ignorelist
  Section the .gitignore for better readability

- Update the Setup, Usage instructions to include information about
  the Obsidian plugin
2023-01-04 18:42:53 -03:00
Debanjum Singh Solanky
cd8b918a55 Add manifest.json, versions.json of Obsidian plugin to project root
- Obsidian provides limited support for plugins in larger repositories.
  Currently, it does not have a way to specify the directory of a plugin
  So it expects the plugin's `manifest.json' and `versions.json' to be at
  project root

- While this unnecessarily litters the codebase, it is the (current)
  required tradeoff for keeping the core plugins in a mono repo
2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky
66ccd0c970 Create Obsidian plugin for Khoj
- Features
  - Search using Khoj from within the Obsidian app
    Allow Natural language search on your (markdown) notes in Obsidian Vault

  - Show search results as rendered (instead of raw) Markdown
    Improve legibility of the results

  - Jump to selected note from search result in Khoj search modal
    Simplify seeing result within its original note context

  - Automatically configure khoj to index markdown files in current vault
    Reduce khoj setup steps for plugin users by using reasonable defaults

    - Code updates the markdown config in khoj.yml and triggers index update
    - It can be configured by user in khoj plugin settings, if required

  - Add Demo and detailed Readme for the Obsidian plugin
    Ease setup and usage. Give context about capabilities

- Miscellaneous
  - Try to keep a mono repo until the Khoj project is mature enough
    to reduce maintenance burden
2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky
e5ef7789fc Add screenshot of Khoj as PWA on Android Homescreen to Readme 2023-01-04 15:47:08 -03:00
Debanjum Singh Solanky
feddb6ce62 Add start_url to khoj webmanifest to show Khoj as PWA on Chrome 2023-01-04 13:37:56 -03:00
Debanjum Singh Solanky
5ca60a2df7 Add How to Access Khoj on Mobile instructions to Readme 2023-01-04 13:37:40 -03:00
Debanjum Singh Solanky
3dee1aed9e Create /config/data/default API endpoint to serve default khoj config
This can ease configuring khoj from the different interfaces

- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
  configure khoj.
2023-01-03 21:52:34 -03:00
Debanjum Singh Solanky
ce945f7a90 Configure processors too on calling /update API
- Previously only search was being reconfigured
- But Processors are configured on app start too
- Match that behavior on calling /update API
2023-01-03 21:51:02 -03:00
Debanjum Singh Solanky
9d31988f42 Allow starting khoj in non-GUI mode without config file instantiated
- Start khoj server (in non-GUI mode) without needing config file
  already instantiated.
  - But throw warning to configure khoj to use it
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
  in khoj
2023-01-03 21:36:59 -03:00
Debanjum Singh Solanky
52664dd96c Allow recursive glob pattern (**) to add files to search index
- Simplify configuring files to index via `input-filter' in khoj.yml for
  Obsidian/Org-Roam type systems with lots of small files
2023-01-03 01:32:58 -03:00
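
A sketch of how recursive `**` patterns can be expanded in Python with the standard library `glob` module. Whether 52664dd96c uses `glob` this way is an assumption, and `resolve_input_filters` is a hypothetical name.

```python
import glob
import os
from typing import Iterable, List


def resolve_input_filters(input_filters: Iterable[str]) -> List[str]:
    """Expand glob patterns, including recursive `**`, into a sorted file list."""
    files = set()
    for pattern in input_filters:
        # recursive=True lets `**` match zero or more nested directories
        files.update(glob.glob(os.path.expanduser(pattern), recursive=True))
    return sorted(files)
```

For example, a single pattern such as `~/notes/**/*.md` (an illustrative path) would cover every markdown file under the notes directory, at any depth.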
Debanjum Singh Solanky
152e5f1661 Return the file of each search result in response
- Useful for enabling jump to note functionality in interfaces
- It will be used in the Khoj plugin for Obsidian
2023-01-03 01:25:34 -03:00
Debanjum
fe1398401d
Automatically update search index hourly
- c535953 Update index automatically in non GUI mode too
- 701d92e Lock the index before updating it via API or Scheduler
- 3b0783a Automate updating embeddings, search index on an hourly schedule

Resolves #106
2023-01-02 00:37:59 +00:00
Debanjum Singh Solanky
c535953915 Update index automatically in non GUI mode too
- Poll scheduler every minute using threading.Timer
  - Use 60 seconds polling interval to avoid fork bombing
- Schedule next via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
2023-01-01 21:03:19 -03:00
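
The polling approach in c535953915 can be sketched as a function that re-arms a daemon `threading.Timer` for itself on every run. `poll_scheduler`, its parameters and the optional `stop_event` are assumptions for illustration. In the commit, the polled job would be the scheduler's run-pending call.

```python
import threading
from typing import Callable, Optional


def poll_scheduler(
    run_pending: Callable[[], None],
    interval_seconds: float = 60.0,
    stop_event: Optional[threading.Event] = None,
) -> Optional[threading.Timer]:
    """Run `run_pending` now and schedule the next poll via this same function."""
    if stop_event is not None and stop_event.is_set():
        # Hypothetical extension: allow the polling chain to be ended explicitly
        return None
    timer = threading.Timer(
        interval_seconds, poll_scheduler, args=(run_pending, interval_seconds, stop_event)
    )
    # Daemon timers do not keep the process alive, so Ctrl-C exits cleanly
    timer.daemon = True
    timer.start()
    run_pending()
    return timer
```

Each call creates exactly one follow-up timer, so the number of live timer threads stays constant instead of growing with every poll.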
Debanjum Singh Solanky
701d92e17b Lock the index before updating it via API or Scheduler
- There are 3 paths to updating/setting the index (stored in state.model)
  - App start
  - API
  - Scheduler

- Put all updates to the index behind a lock, as multiple update paths
  could (potentially) run at the same time (via API or Scheduler)
2023-01-01 17:09:36 -03:00
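
A minimal sketch of guarding index updates with a lock, as 701d92e17b describes. `SearchState`, `index_lock` and `update_index` are hypothetical names. Only the idea of a shared `state.model` comes from the commit message.

```python
import threading
from typing import Any, Callable


class SearchState:
    """Hypothetical stand-in for the shared application state."""

    def __init__(self) -> None:
        self.model: Any = None
        self.index_lock = threading.Lock()


state = SearchState()


def update_index(build_index: Callable[[], Any]) -> Any:
    """Rebuild the index while holding the lock, so only one update runs at a time."""
    with state.index_lock:
        state.model = build_index()
        return state.model
```

Both the API handler and the scheduler job would call `update_index`, so whichever arrives second waits for the first update to finish.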
Debanjum Singh Solanky
3b0783aab9 Automate updating embeddings, search index on an hourly schedule
- Use the schedule pypi package
- Use QTimer to poll schedule.run_pending() regularly for jobs to run
2023-01-01 17:09:36 -03:00
Debanjum Singh Solanky
a58c243bc0 Document using Word, Date and File Query Filter in Readme 2022-12-26 16:12:49 -03:00
Debanjum
06c25682c9
Split text entries by max tokens supported by ML models
### Background
There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated)

### Issue
Until now, entries exceeding the max token size would silently get truncated during embedding generation.
So the truncated portion of the entries would be ignored when matching queries with entries.
This would degrade the quality of the results.

### Fix
- e057c8e Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- b283650 Deduplicate results for user query by raw text before returning results

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
2022-12-26 18:23:43 +00:00
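
The splitting step can be sketched as chunking an entry into runs of at most `max_tokens` words. This assumes whitespace-separated words stand in for tokens, as the background above does. `split_entry_by_max_tokens` is a hypothetical name, not the method added in e057c8e.

```python
from typing import List


def split_entry_by_max_tokens(entry: str, max_tokens: int = 256) -> List[str]:
    """Split `entry` into consecutive chunks of at most `max_tokens` words."""
    words = entry.split()
    return [
        " ".join(words[start : start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
```

Each chunk then gets its own embedding, so the tail of a long entry stays searchable instead of being silently truncated.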
Debanjum Singh Solanky
17fa123b4e Split entries by max tokens while converting Beancount entries To JSONL 2022-12-26 15:14:32 -03:00
Debanjum Singh Solanky
f209e30a3b Split entries by max tokens while converting Markdown entries To JSONL 2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky
24676f95d8 Fix comments, use minimal test case, regenerate test index, merge debug logs
- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky
b283650991 Deduplicate results for user query by raw text before returning results
- Required because entries are now split by the max_word count supported
  by the ML models
- This can now result in duplicate hits (entries) being returned to
  the user
- Do deduplication after ranking to get the top ranked deduplicated
  results
2022-12-25 21:36:15 -03:00
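
Deduplicating after ranking, as in b283650991, can be sketched as keeping the first (highest ranked) hit for each raw text. The dictionary keys `raw` and `score` are assumptions about the result shape, and `deduplicate_ranked_results` is a hypothetical name.

```python
from typing import Any, Dict, List


def deduplicate_ranked_results(ranked_results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first hit per raw entry text, preserving rank order.

    Expects `ranked_results` sorted best-first, so the surviving hit for
    each raw text is its top ranked one.
    """
    seen_raw_texts = set()
    unique_results = []
    for result in ranked_results:
        raw_text = result["raw"]  # assumed field holding the unsplit entry text
        if raw_text not in seen_raw_texts:
            seen_raw_texts.add(raw_text)
            unique_results.append(result)
    return unique_results
```

Running this before ranking could instead keep a lower scoring chunk of an entry, which is why the commit orders deduplication after ranking.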