An AI personal assistant for your digital brain
Supported Plugins
Table of Contents
- Features
- Demos
- Architecture
- Setup
- Use
- Upgrade
- Uninstall
- Troubleshoot
- Advanced Usage
- Miscellaneous
- Performance
- Development
- Credits
Features
- Search
- Local: Your personal data stays local. All search and indexing is done on your machine, unlike chat, which requires access to GPT.
- Incremental: Incremental search for a fast, search-as-you-type experience
- Chat
- Faster answers: Find answers faster, smoother than search. No need to manually scan through your notes to find answers.
- Iterative discovery: Iteratively explore and (re-)discover your notes
- Assisted creativity: Smoothly weave across answer retrieval and content generation
- General
- Natural: Advanced natural language understanding using Transformer based ML Models
- Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- Multiple Sources: Index your Org-mode and Markdown notes, PDF files, Github repositories, and Photos
- Multiple Interfaces: Interact from your Web Browser, Emacs or Obsidian
Demos
Khoj in Obsidian
https://github.com/khoj-ai/khoj/assets/6413477/3e33d8ea-25bb-46c8-a3bf-c92f78d0f56b
Description
- Install Khoj via pip and start Khoj backend in a terminal (Run khoj)
python -m pip install khoj-assistant
khoj
- Install Khoj plugin via Community Plugins settings pane on Obsidian app
- Check the new Khoj plugin settings
- Let Khoj backend index the markdown, pdf, Github markdown files in the current Vault
- Open Khoj plugin on Obsidian via Search button on Left Pane
- Search "Announce plugin to folks" in the Obsidian Plugin docs
- Jump to the search result
Khoj in Emacs, Browser
https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4
Description
- Install Khoj via pip
- Start Khoj app
- Add this readme and khoj.el readme as org-mode for Khoj to index
- Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
- Top result is what we are looking for, the section to Install Khoj.el on Emacs
Analysis
- The results do not have any words used in the query
- Based on the top result it seems the re-ranking model understands that Emacs is an editor?
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once user hits enter
Interfaces
Architecture
Setup
These are the general setup instructions for Khoj.
- Make sure python and pip are installed on your machine
- Check the Khoj.el Readme to setup Khoj with Emacs
  It's simpler as it can skip the server install, run and configure steps below.
- Check the Khoj Obsidian Readme to setup Khoj with Obsidian
  It's simpler as it can skip the configure step below.
1. Install
- On Linux/MacOS
python -m pip install khoj-assistant
- On Windows
py -m pip install khoj-assistant
2. Run
khoj
Note: To start Khoj automatically in the background use Task scheduler on Windows or Cron on Mac, Linux (e.g. with @reboot khoj)
3. Configure
- Enable content types and point to files to search in the First Run Screen that pops up on app start
- Click Configure and wait. The app will download ML models and index the content for search
4. Install Interface Plugins
Khoj exposes a web interface by default.
The optional steps below allow using Khoj from within an existing application like Obsidian or Emacs.
Use
Khoj Search
- Khoj via Obsidian
- Click the Khoj search icon 🔎 on the Ribbon or Search for Khoj: Search in the Command Palette
- Khoj via Emacs
  - Run M-x khoj <user-query>
- Khoj via Web
- Open http://localhost:8000/ directly
- Khoj via API
- See the Khoj FastAPI Swagger Docs, ReDocs
Query Filters
Use structured query syntax to filter the natural language search results
- Word Filter: Get entries that include/exclude a specified term
  - Entries that contain term_to_include: +"term_to_include"
  - Entries that do not contain term_to_exclude: -"term_to_exclude"
- Date Filter: Get entries containing dates in YYYY-MM-DD format from specified date (range)
  - Entries from April 1st 1984: dt:"1984-04-01"
  - Entries after March 31st 1984: dt>="1984-04-01"
  - Entries before April 2nd 1984: dt<="1984-04-01"
- File Filter: Get entries from a specified file
  - Entries from incoming.org file: file:"incoming.org"
- Combined Example
what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"
- Adds all filters to the natural language query. It should return entries
- from the file 1984.org
- containing dates from the year 1984
- excluding words "big" and "brother"
- that best match the natural language query "what is the meaning of life?"
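To make the filter syntax above concrete, here is a minimal Python sketch that separates the filter terms from the natural language part of a query. It is not Khoj's own parser: the names FILTER_PATTERNS and split_query are hypothetical, and the regular expressions are only an assumption based on the examples in this section.

```python
import re

# Hypothetical patterns, inferred from the filter examples in this README.
# They are an illustration only and are not taken from the Khoj codebase.
FILTER_PATTERNS = {
    "file": re.compile(r'file:"[^"]+"'),
    "date": re.compile(r'dt(?::|>=|<=|>|<)"[^"]+"'),
    "include": re.compile(r'(?<!\S)\+"[^"]+"'),
    "exclude": re.compile(r'(?<!\S)-"[^"]+"'),
}


def split_query(query):
    """Return (natural_language_query, filters_by_kind) for a raw query string."""
    filters = {}
    remainder = query
    for kind, pattern in FILTER_PATTERNS.items():
        # Collect every filter term of this kind, then strip it from the query
        filters[kind] = pattern.findall(remainder)
        remainder = pattern.sub("", remainder)
    # Collapse the whitespace left behind by the removed filter terms
    return " ".join(remainder.split()), filters


raw = 'what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"'
natural, filters = split_query(raw)
print(natural)
print(filters)
```

Splitting the query this way shows that the filters narrow the candidate entries, while only the remaining text is matched semantically.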
Khoj Chat
Overview
- Creates a personal assistant for you to inquire and engage with your notes
- Uses ChatGPT and Khoj search
- Supports multi-turn conversations with the relevant notes for context
- Shows reference notes used to generate a response
- Note: Your query and top notes from khoj search will be sent to OpenAI for processing
Setup
Use
Demo
Details
- Your query is used to retrieve the most relevant notes, if any, using Khoj search
- These notes, the last few messages and associated metadata are passed to ChatGPT along with your query for a response
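The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Khoj's actual prompt: the function name build_chat_context, the max_history parameter and the text layout are all hypothetical.

```python
def build_chat_context(query, notes, history, max_history=2):
    """Combine retrieved notes, recent messages and the query into one text block.

    Hypothetical helper: the layout below is invented for illustration and
    does not reproduce the prompt Khoj sends to ChatGPT.
    """
    # Keep only the last few messages of the conversation
    recent = history[-max_history:] if max_history > 0 else []
    lines = ["Notes:"]
    lines += [f"- {note}" for note in notes]
    lines.append("Conversation:")
    lines += [f"{speaker}: {text}" for speaker, text in recent]
    lines.append(f"Question: {query}")
    return "\n".join(lines)


history = [("you", "hi"), ("khoj", "hello"), ("you", "what can you do?")]
context = build_chat_context(
    "Which files are indexed?",
    notes=["Khoj indexes org-mode files"],
    history=history,
)
print(context)
```

Everything in the returned text is what would leave your machine, which is why the note above about data sent to OpenAI matters.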
Upgrade
Upgrade Khoj Server
pip install --upgrade khoj-assistant
Note: To upgrade to the latest pre-release version of the khoj server, run the command below
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant
Upgrade Khoj on Emacs
- Use your Emacs Package Manager to Upgrade
- See khoj.el readme for details
Upgrade Khoj on Obsidian
- Upgrade via the Community plugins tab on the settings pane in the Obsidian app
- See the khoj plugin readme for details
Uninstall
- (Optional) Hit Ctrl-C in the terminal running the khoj server to stop it
- Delete the khoj directory in your home folder (i.e. ~/.khoj on Linux, Mac or C:\Users\<your-username>\.khoj on Windows)
- Uninstall the khoj server with pip uninstall khoj-assistant
- (Optional) Uninstall khoj.el or the khoj obsidian plugin in the standard way on Emacs, Obsidian
Troubleshoot
Install fails while building Tokenizer dependency
- Details: pip install khoj-assistant fails while building the tokenizers dependency. Complains about Rust.
- Fix: Install Rust to build the tokenizers package. For example on Mac run:
brew install rustup
rustup-init
source ~/.cargo/env
- Refer: Issue with Fix for more details
Search starts giving wonky results
- Fix: Open /api/update?force=true [1] in browser to regenerate index from scratch
- Note: This is a fix for when you perceive the search results have degraded, not for when you think they've always given wonky results
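If you prefer to trigger the rebuild from a script rather than the browser, a minimal sketch follows. It assumes the default Khoj url from the footnote and the /api/update?force=true endpoint named above; the helper names update_url and regenerate_index and the timeout value are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def update_url(base_url="http://localhost:8000", force=True):
    """Build the index update URL for a Khoj server at base_url."""
    query = urlencode({"force": str(force).lower()})
    return f"{base_url.rstrip('/')}/api/update?{query}"


def regenerate_index(base_url="http://localhost:8000", timeout=600):
    """Request a full re-index and return the HTTP status code.

    Needs a running Khoj server. The generous timeout is an assumption,
    since regenerating the index from scratch can take minutes.
    """
    with urlopen(update_url(base_url), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; call regenerate_index() once the server is running
    print(update_url())
```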
Khoj in Docker errors out with "Killed" in error message
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
Khoj errors out complaining about Tensors mismatch or null
- Mitigation: Disable image search using the desktop GUI
Advanced Usage
Access Khoj on Mobile
- Setup Khoj on your personal server. This can be any always-on machine, e.g. an old computer, RaspberryPi(?) etc
- Install Tailscale on your personal server and phone
- Open the Khoj web interface of the server from your phone browser.
  It should be http://tailscale-ip-of-server:8000 or http://name-of-server:8000 if you've setup MagicDNS
- Click the Add to Homescreen button
- Enjoy exploring your notes, documents and images from your phone!
Use OpenAI Models for Search
Setup
- Set encoder-type, encoder and model-directory under asymmetric and/or symmetric search-type in your khoj.yml [2]:
  asymmetric:
-   encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+   encoder: text-embedding-ada-002
+   encoder-type: khoj.utils.models.OpenAI
    cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
-   encoder-type: sentence_transformers.SentenceTransformer
-   model_directory: "~/.khoj/search/asymmetric/"
+   model-directory: null
- Setup your OpenAI API key in Khoj
- Restart Khoj server to generate embeddings. It will take longer than with offline models.
Warnings
This configuration uses an online model
- It will send all notes to OpenAI to generate embeddings
- All queries will be sent to OpenAI when you search with Khoj
- You will be charged by OpenAI based on the total tokens processed
- It requires an active internet connection to search and index
Search across Different Languages
To search for notes in multiple, different languages, you can use a multi-lingual model.
For example, the paraphrase-multilingual-MiniLM-L12-v2 supports 50+ languages, has good search quality and speed. To use it:
- Manually update search-type > asymmetric > encoder to paraphrase-multilingual-MiniLM-L12-v2 in your ~/.khoj/khoj.yml file for now. See diff of khoj.yml below for illustration:
  asymmetric:
-   encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+   encoder: "paraphrase-multilingual-MiniLM-L12-v2"
    cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    model_directory: "~/.khoj/search/asymmetric/"
- Regenerate your content index. For example, by opening <khoj-url>/api/update?force=true
Miscellaneous
Set your OpenAI API key in Khoj
If you want, Khoj can be configured to use OpenAI for search and chat.
Add your OpenAI API key to Khoj by using either of the two options below:
- Open your Khoj settings, add your OpenAI API key, and click Save. Then go to your Khoj settings and click Configure. This will refresh Khoj with your OpenAI API key.
- Set openai-api-key field under processor.conversation section in your khoj.yml [2] to your OpenAI API key and restart khoj:
  processor:
    conversation:
-     openai-api-key: # "YOUR_OPENAI_API_KEY"
+     openai-api-key: sk-aaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhh
      model: "text-davinci-003"
      conversation-logfile: "~/.khoj/processor/conversation/conversation_logs.json"
Warning: This will enable Khoj to send your query and note(s) to OpenAI for processing
GPT API
- The chat, answer and search API endpoints use OpenAI API
- They are disabled by default
- To use them:
- Setup your OpenAI API key in Khoj
- Interact with them from the Khoj Swagger docs [1]
Index Github Repository for Search, Chat
The Khoj Github plugin can index issues, commit messages and markdown, org-mode and PDF files from any repositories you have access to. This allows you to chat or search with these repositories. Get answers, resolve issues or just explore a repo with the help of your AI personal assistant.
See the Khoj FAQ for a demo of Khoj search and chat. It makes the Khoj github repo available for exploring.
Note: Khoj will ignore code files in the repository for now as the default AI model used works best with natural language text, not code.
Setup Khoj Github plugin
- Get a personal access token (PAT) with repo and read:org scopes in the classic flow.
- Configure Khoj settings to include the owner and repo_name. The owner will be the organization name if the repo is in an organization. The repo_name will be the name of the repository. Optionally, you can also supply a branch name. If no branch name is supplied, the master branch will be used.
Performance
Query performance
- Semantic search using the bi-encoder is fairly fast at <50 ms
- Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to tradeoff speed for accuracy of results
- Filters in query (e.g. by file, word or date) usually add <20ms to query latency
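The role of top_k in this tradeoff can be illustrated with a toy two-stage search. This is a sketch only: the word-overlap scorers below are hypothetical stand-ins for the bi-encoder and cross-encoder models, and two_stage_search is not a Khoj function.

```python
def words(text):
    return set(text.lower().split())


def overlap_count(query, entry):
    # Cheap stand-in for the bi-encoder: number of shared words
    return len(words(query) & words(entry))


def jaccard(query, entry):
    # Stand-in for the slower cross-encoder: shared words over all words
    q, e = words(query), words(entry)
    return len(q & e) / len(q | e) if q | e else 0.0


def two_stage_search(query, entries, fast_score, slow_score, top_k=15):
    # Stage 1: score every entry with the fast scorer, keep the top_k best
    candidates = sorted(entries, key=lambda e: fast_score(query, e), reverse=True)[:top_k]
    # Stage 2: rerank only those candidates with the slow scorer
    return sorted(candidates, key=lambda e: slow_score(query, e), reverse=True)


notes = [
    "install khoj on emacs",
    "setup your editor",
    "emacs is an editor",
    "grocery list",
]
results = two_stage_search("setup editor", notes, overlap_count, jaccard, top_k=2)
print(results)
```

The slow scorer runs top_k times per query, so a smaller top_k lowers reranking latency but gives the reranker fewer candidates to promote.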
Indexing performance
- Indexing is more strongly impacted by the size of the source data
- Indexing 100K+ line corpus of notes takes about 10 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Note: It should only take this long on the first run as the index is incrementally updated
Miscellaneous
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search, indexing on a GPU has not been tested yet
Development
Visualize Codebase
Setup
Using Pip
1. Install
# Get Khoj Code
git clone https://github.com/khoj-ai/khoj && cd khoj
# Create, Activate Virtual Environment
python3 -m venv .venv && source .venv/bin/activate
# Install Khoj for Development
pip install -e '.[dev]'
2. Run
- Start Khoj
khoj -vv
- Configure Khoj
  - Via the Settings UI: Add files, directories to index in the Khoj settings UI once Khoj has started up. Once you've saved all your settings, click Configure.
  - Manually:
    - Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
    - Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
      - Set input-directories field in image content-type section
    - Delete content-type and processor sub-section(s) irrelevant for your use-case
    - Restart khoj
Note: After configuration, wait for khoj to load the ML model, generate embeddings and expose the API to query notes, images, documents etc specified in config YAML
Using Docker
1. Clone
git clone https://github.com/khoj-ai/khoj && cd khoj
2. Configure
- Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes, PDFs and Github repositories
- Optional: Edit application configuration in khoj_docker.yml
3. Run
docker-compose up -d
Note: The first run will take time. Let it run, it's mostly not hung, just generating embeddings
4. Upgrade
docker-compose build --pull
Using Conda
1. Install Dependencies
2. Install Khoj
git clone https://github.com/khoj-ai/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6 # As conda does not support pyqt6 yet
3. Configure
- Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
- Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
  - Set input-directories field in image content-type section
- Delete content-type, processor sub-sections irrelevant for your use-case
4. Run
python3 -m src.khoj.main -vv
This loads the ML model, generates embeddings and exposes the API to query notes, images, documents etc specified in config YAML
5. Upgrade
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
Validate
Before Making Changes
- Install Git Hooks for Validation
pre-commit install -t pre-push -t pre-commit
- This ensures standard code formatting fixes and other checks run automatically on every commit and push
- Note 1: If pre-commit didn't already get installed, install it via pip install pre-commit
- Note 2: To run the pre-commit changes manually, use pre-commit run --hook-stage manual --all before creating PR
Before Creating PR
- Run Tests. If you get an error complaining about a missing fast_tokenizer_file, follow the solution in this Github issue.
pytest
- Run MyPy to check types
mypy --config-file pyproject.toml
After Creating PR
- Automated validation workflows run for every PR. Ensure any issues seen by them are fixed
- Test the python package created for a PR
  - Download and extract the zipped .whl artifact generated from the pypi workflow run for the PR.
  - Install (in your virtualenv) with pip install /path/to/download*.whl
  - Start and use the application to see if it works fine
Credits
- Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
- OpenAI CLIP Model for Image Search. See SBert Documentation
- Charles Cave for OrgNode Parser
- Org.js to render Org-mode results on the Web interface
- Markdown-it to render Markdown results on the Web interface
[1] Default Khoj url @ http://localhost:8000
[2] Default Khoj config file @ ~/.khoj/khoj.yml