- Changes
- Fix method signatures of BaseFilter subclasses.
Otherwise typing information does not propagate to them
- Explicitly pass `entries: list[Entry]' as arg to `load' method
- Change type of `raw_entries' arg to `apply' method
from list[str] to list[Entry]
- Rename `raw_entries' arg to `apply' method to `entries'
- Fix `raw_query' arg used in `apply' method of subclasses to `query'
- Set type of entries, corpus_embeddings in TextSearchModel
- Verification
Ran `mypy --config-file .mypy.ini src' to verify typing
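
A minimal sketch of what the aligned signatures could look like, so that a type checker can carry the base class annotations over to subclasses. The `Entry' dataclass, the `KeywordFilter' subclass and the return type of `apply' are hypothetical stand-ins, not the project's actual definitions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical stand-in for the project's Entry type
    raw: str


class BaseFilter(ABC):
    @abstractmethod
    def load(self, entries: list[Entry]) -> None:
        """Prepare the filter for the given entries."""

    @abstractmethod
    def apply(self, query: str, entries: list[Entry]) -> tuple[str, set[int]]:
        """Return the query and the indices of entries that pass the filter.

        The return type is an assumption made for this sketch.
        """


class KeywordFilter(BaseFilter):
    """Hypothetical subclass: keeps entries that contain a fixed keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()
        self.matching: set[int] = set()

    # Same parameter names and types as the base class, so mypy can
    # check calls made through a BaseFilter reference
    def load(self, entries: list[Entry]) -> None:
        self.matching = {
            index
            for index, entry in enumerate(entries)
            if self.keyword in entry.raw.lower()
        }

    def apply(self, query: str, entries: list[Entry]) -> tuple[str, set[int]]:
        self.load(entries)
        return query, set(self.matching)
```

If a subclass kept the old names (`raw_query', `raw_entries'), mypy would flag keyword-argument calls made through the base class, which is the mismatch this change removes.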
- `torch.Tensor' is apparently a legacy tensor constructor
- Using that to create tensor on MPS devices throws error:
RuntimeError: legacy constructor expects device type: cpu but device type: mps was passed
- `torch.tensor' can handle creating tensors on Mac GPU (MPS) fine
This is unlike the more general chat API that combines summarization
of top search result and conversing with the OpenAI model
This should give faster summary results, as no intent categorization
API call is required
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibility
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
- Default to using `text-davinci-003' if the conversation model is not
explicitly configured by the user. Stop using the older `davinci' and
`davinci-instruct' models
- Use `model' instead of `engine' as parameter.
Usage of `engine' parameter in OpenAI API is deprecated
- 2fe37a0 Make type of encoder to use for embeddings configurable via `khoj.yml'
- Previously `encoder_type' was set in the setup code of search_type
- All *encoders* were of type `SentenceTransformer'
- All *cross_encoders* were of type `CrossEncoder'
- Now the `encoder_type' can be configured via the new `encoder_type' field
in `TextSearchConfig' under `search_type' in `khoj.yml'
- All the specified `encoder-type' class needs is an `encode' method
that takes entries and returns embedding vectors
- 826f9dc Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.
Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
- c0ae8ee Allow using OpenAI models for search in Khoj
To use OpenAI models for search in Khoj, in `~/.khoj/khoj.yml'
1. Set `encoder' to name of an OpenAI model. E.g. *text-embedding-ada-002*
2. Set `encoder-type' to *src.utils.models.OpenAI*
3. Set `model-directory' to *null*, as this is an online model and
cannot be stored on the file system
- Init processor before search to instantiate `openai_api_key'
from `khoj.yml'. The key is used to configure search with openai models
- To use OpenAI models for search in Khoj
- Set `encoder' to name of an OpenAI model. E.g. text-embedding-ada-002
- Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI'
- Set `model-directory' to `null', as online model cannot be stored on disk
Long words (>500 characters) provide less useful context to models.
Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
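
A minimal sketch of the idea, assuming words are whitespace-delimited and that the 500 character threshold from the text is inclusive. The function name and its default parameter are hypothetical, not the project's actual helper.

```python
def drop_long_words(text: str, max_word_length: int = 500) -> str:
    """Remove whitespace-delimited words longer than max_word_length characters.

    Note: this sketch also collapses runs of whitespace into single spaces.
    """
    return " ".join(word for word in text.split() if len(word) <= max_word_length)
```

Dropping the word outright, rather than truncating it, keeps the token budget for the surrounding words that carry the useful context.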
- Previously `model_type' was set in the setup of each `search_type'
- All encoders were of type `SentenceTransformer'
- All cross_encoders were of type `CrossEncoder'
- Now `encoder-type' can be configured via the new `encoder_type' field
in `TextSearchConfig' under `search-type' in `khoj.yml'.
- All the specified `encoder-type' class needs is an `encode' method
that takes entries and returns embedding vectors
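
The duck-typed contract described above can be sketched as follows. The `Encoder' protocol, the toy `CharCountEncoder' and the `compute_embeddings' helper are hypothetical names used for illustration only; real encoders would return model embeddings rather than hand-made features.

```python
from __future__ import annotations

from typing import Protocol


class Encoder(Protocol):
    # The only requirement: an `encode` method that maps entries to vectors
    def encode(self, entries: list[str]) -> list[list[float]]: ...


class CharCountEncoder:
    """Toy encoder: embeds each entry as [length, vowel count]."""

    def encode(self, entries: list[str]) -> list[list[float]]:
        return [
            [float(len(entry)), float(sum(char in "aeiou" for char in entry.lower()))]
            for entry in entries
        ]


def compute_embeddings(encoder: Encoder, entries: list[str]) -> list[list[float]]:
    # Works with any class exposing `encode`, whatever its base class
    return encoder.encode(entries)
```

Because the contract is structural, a class wrapping an online API and a class wrapping a local model can be swapped via configuration without sharing a base class.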
- Ensure all tensors are on MPS device before doing operations across them
- Background
- GPU is used by default for Khoj on MacOS now
- Using the GPU on Macs needs PyTorch > 1.13.0, which Khoj now uses
- MPS should speed up search and indexing on MacOS
Fix usage warning for unescaped single quote in `khoj.el' docstring.
Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs
⛔ Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
### Plugin Features
- Search Obsidian notes using Khoj
*Provide Natural language search on your (markdown) notes in Obsidian Vault*
- Show search results as rendered Markdown
*Improve legibility of the results*
- Jump to selected note from search result in Khoj search modal
*Simplify seeing result within its original note context*
- Automatically configure khoj to index markdown files in current vault
*Reduce khoj setup steps for plugin users by using reasonable defaults*
- Code updates the markdown config in `khoj.yml` and triggers index update
- It can be configured by user in khoj plugin settings, if required
- Add Demo and detailed Readme for the Obsidian plugin
*Ease setup and usage. Give context about capabilities*
### Miscellaneous
- (Try to) keep a mono repo, to reduce maintenance burden, until the
Khoj project is mature enough
### Commit Details
- 0e39e0f Add details about the Khoj Obsidian plugin to the main Readme
- cd8b918 Add `manifest.json`, `versions.json` of Obsidian plugin to project root
- 66ccd0c Create Obsidian plugin for Khoj
- Add Khoj in Obsidian Demo
- Update Interfaces Screenshot to include Obsidian Plugin Screenshot
- Update .gitignore to ignore obsidian plugin ignorelist
Section the .gitignore for better readability
- Update the Setup, Usage instructions to include information about
the Obsidian plugin
- Obsidian provides limited support for plugins in larger repositories.
Currently, it does not have a way to specify the directory of a plugin,
so it expects the plugin's `manifest.json' and `versions.json' to be at
project root
- While this unnecessarily litters the codebase, it is the (current)
required tradeoff for keeping the core plugins in a mono repo
- Features
- Search using Khoj from within the Obsidian app
Allow Natural language search on your (markdown) notes in Obsidian Vault
- Show search results as rendered (instead of raw) Markdown
Improve legibility of the results
- Jump to selected note from search result in Khoj search modal
Simplify seeing result within its original note context
- Automatically configure khoj to index markdown files in current vault
Reduce khoj setup steps for plugin users by using reasonable defaults
- Code updates the markdown config in khoj.yml and triggers index update
- It can be configured by user in khoj plugin settings, if required
- Add Demo and detailed Readme for the Obsidian plugin
Ease setup and usage. Give context about capabilities
- Miscellaneous
- Try to keep a mono repo, to reduce maintenance burden, until the
Khoj project is mature enough
This can ease configuring khoj from the different interfaces
- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
configure khoj.
- Start khoj server (in non-GUI mode) without needing config file
already instantiated.
- But warn that khoj must be configured before it can be used
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
in khoj
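
The get-defaults, modify, then POST workflow described above could look roughly like this from a client. This is a sketch under assumptions: the base URL, the JSON request body and the config keys in the usage note are hypothetical. Only the `POST /api/config/data' path comes from the text, and the default config is taken as an input because the endpoint that returns it is not named here.

```python
import copy
import json
from urllib.request import Request

KHOJ_URL = "http://localhost:8000"  # assumption: address of a local khoj server


def merge_config(default: dict, overrides: dict) -> dict:
    """Return a copy of `default` with `overrides` applied recursively."""
    merged = copy.deepcopy(default)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def build_config_request(config: dict, base_url: str = KHOJ_URL) -> Request:
    """Build (but do not send) the POST request that would configure khoj."""
    body = json.dumps(config).encode("utf-8")  # assumption: JSON request body
    return Request(
        f"{base_url}/api/config/data",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A client would fetch the defaults, call something like `merge_config(defaults, {"content-type": {"markdown": {"input-filter": ["/vault/*.md"]}}})' (hypothetical keys), and send the resulting request with `urllib.request.urlopen'.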
- c535953 Update index automatically in non GUI mode too
- 701d92e Lock the index before updating it via API or Scheduler
- 3b0783a Automate updating embeddings, search index on an hourly schedule
Resolves #106
- Poll scheduler every minute using threading.Timer
- Use 60 seconds polling interval to avoid fork bombing
- Schedule the next poll via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
- There are 3 paths to updating/setting the index (stored in state.model)
- App start
- API
- Scheduler
- Put all updates to the index behind a lock, as multiple update paths
could (potentially) run at the same time (via API or Scheduler)
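
A minimal sketch of the two ideas above: a self-rescheduling `threading.Timer' running as a daemon, and a lock that serializes index updates. The `index' dict stands in for `state.model', and `stop_event' is a hypothetical addition to make the sketch easy to stop. For simplicity the job runs on every poll, whereas the text describes polling every minute for an hourly job.

```python
import threading
from typing import Callable

index_lock = threading.Lock()
index = {"version": 0}  # hypothetical stand-in for the index in state.model


def update_index() -> None:
    # Every update path (app start, API, scheduler) takes the same lock,
    # so two updates never modify the index at the same time
    with index_lock:
        index["version"] += 1


def start_poll_scheduler(
    interval_seconds: float,
    job: Callable[[], None],
    stop_event: threading.Event,
) -> None:
    def poll() -> None:
        if stop_event.is_set():
            return
        job()
        # Schedule the next poll via the same poll scheduler
        start_poll_scheduler(interval_seconds, job, stop_event)

    timer = threading.Timer(interval_seconds, poll)
    timer.daemon = True  # do not block a clean program interrupt
    timer.start()
```

Calling `start_poll_scheduler(60, update_index, threading.Event())' would mirror the 60 second polling interval mentioned above.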