mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 15:38:55 +01:00

Debanjum Singh Solanky 41ac1e24c9 Add docs for a pre-emptive setup of Khoj for later offline usage

Closes #151

2023-07-05 20:48:51 -07:00

An AI personal assistant for your digital brain

Supported Plugins

Features
Demos
Architecture
Setup
- Install
- Run
- Configure
- Install Plugins
Use
- Khoj Search
- Khoj Chat
Upgrade
Uninstall
Troubleshoot
Advanced Usage
Miscellaneous
- Setup OpenAI API key in Khoj
- GPT API
Performance
Development
Credits

Features

Search
- Local: Your personal data stays local. All search and indexing is done on your machine, unlike chat, which requires access to GPT.
- Incremental: Incremental search for a fast, search-as-you-type experience
Chat
- Faster answers: Find answers faster, smoother than search. No need to manually scan through your notes to find answers.
- Iterative discovery: Iteratively explore and (re-)discover your notes
- Assisted creativity: Smoothly weave across answer retrieval and content generation
General
- Natural: Advanced natural language understanding using Transformer-based ML models
- Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- Multiple Sources: Index your Org-mode and Markdown notes, PDF files, Github repositories, and Photos
- Multiple Interfaces: Interact from your Web Browser, Emacs or Obsidian

Demos

Khoj in Obsidian

https://github.com/khoj-ai/khoj/assets/6413477/3e33d8ea-25bb-46c8-a3bf-c92f78d0f56b

Description

Install Khoj via pip and start Khoj backend in a terminal (Run khoj)
```
python -m pip install khoj-assistant
khoj
```
Install Khoj plugin via Community Plugins settings pane on Obsidian app
- Check the new Khoj plugin settings
- Let Khoj backend index the markdown, pdf, Github markdown files in the current Vault
- Open Khoj plugin on Obsidian via Search button on Left Pane
- Search "Announce plugin to folks" in the Obsidian Plugin docs
- Jump to the search result

Khoj in Emacs, Browser

https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4

Description

Install Khoj via pip
Start Khoj app
Add this readme and khoj.el readme as org-mode for Khoj to index
Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
Top result is what we are looking for, the section to Install Khoj.el on Emacs

Analysis

The results do not have any words used in the query
- Based on the top result it seems the re-ranking model understands that Emacs is an editor?
The results incrementally update as the query is entered
The results are re-ranked, for better accuracy, once user hits enter

Interfaces

Architecture

Setup

These are the general setup instructions for Khoj.

Make sure python and pip are installed on your machine
Check the Khoj.el Readme to set up Khoj with Emacs
- It's simpler as it can skip the server install, run and configure steps below.
Check the Khoj Obsidian Readme to set up Khoj with Obsidian
- It's simpler as it can skip the configure step below.

1. Install

On Linux/MacOS
```
python -m pip install khoj-assistant
```
On Windows
```
py -m pip install khoj-assistant
```

2. Run

khoj

Note: To start Khoj automatically in the background, use Task Scheduler on Windows or cron on Mac and Linux (e.g. with @reboot khoj)

3. Configure

Enable content types and point to files to search in the First Run Screen that pops up on app start
Click Configure and wait. The app will download ML models and index the content for search

4. Install Interface Plugins

Khoj exposes a web interface by default.
The optional steps below allow using Khoj from within an existing application like Obsidian or Emacs.

Khoj Obsidian:
Install the Khoj Obsidian plugin
Khoj Emacs:
Install khoj.el

Use

Khoj Search

Khoj via Obsidian
- Click the Khoj search icon 🔎 on the Ribbon or Search for Khoj: Search in the Command Palette
Khoj via Emacs
- Run M-x khoj <user-query>
Khoj via Web
- Open http://localhost:8000/ directly
Khoj via API
- See the Khoj FastAPI Swagger Docs, ReDocs

Query Filters

Use structured query syntax to filter the natural language search results

Word Filter: Get entries that include/exclude a specified term
- Entries that contain term_to_include: +"term_to_include"
- Entries that do not contain term_to_exclude: -"term_to_exclude"
Date Filter: Get entries containing dates in YYYY-MM-DD format from specified date (range)
- Entries from April 1st 1984: dt:"1984-04-01"
- Entries after March 31st 1984: dt>="1984-04-01"
- Entries before April 2nd 1984: dt<="1984-04-01"
File Filter: Get entries from a specified file
- Entries from incoming.org file: file:"incoming.org"
Combined Example
- what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"
- Adds all filters to the natural language query. It should return entries
  - from the file 1984.org
  - containing dates from the year 1984
  - excluding words "big" and "brother"
  - that best match the natural language query "what is the meaning of life?"
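
To make the filter syntax concrete, here is a minimal sketch of how such a query could be split into its filters and the remaining natural language text. It is not Khoj's actual parser: the `parse_query` function and the regular expressions are assumptions made purely for illustration.

```python
import re

# Hypothetical patterns for the filter syntax described above (not Khoj's own code)
WORD_FILTER = re.compile(r'([+-])"([^"]+)"')
DATE_FILTER = re.compile(r'dt(>=|<=|>|<|:)"(\d{4}-\d{2}-\d{2})"')
FILE_FILTER = re.compile(r'file:"([^"]+)"')


def parse_query(query: str) -> dict:
    """Split a query into word, date and file filters plus the remaining text."""
    words = WORD_FILTER.findall(query)
    include = [term for sign, term in words if sign == "+"]
    exclude = [term for sign, term in words if sign == "-"]
    dates = DATE_FILTER.findall(query)
    files = FILE_FILTER.findall(query)

    # Strip every filter so only the natural language part is left
    text = query
    for pattern in (FILE_FILTER, DATE_FILTER, WORD_FILTER):
        text = pattern.sub("", text)

    return {
        "text": " ".join(text.split()),
        "include": include,
        "exclude": exclude,
        "dates": dates,
        "files": files,
    }


parsed = parse_query(
    'what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"'
)
print(parsed["text"])     # the natural language part of the query
print(parsed["files"])    # file filters
print(parsed["dates"])    # (operator, date) pairs
print(parsed["exclude"])  # excluded words
```

In this sketch the filters only narrow down the candidate entries. Ranking the remaining entries against the natural language text is a separate step.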

Khoj Chat

Overview

Creates a personal assistant for you to inquire and engage with your notes
Uses ChatGPT and Khoj search
Supports multi-turn conversations with the relevant notes for context
Shows reference notes used to generate a response
Note: Your query and top notes from khoj search will be sent to OpenAI for processing

Setup

Setup your OpenAI API key in Khoj

Use

Open /chat¹
Type your queries and see Khoj's responses, drawn from your notes

Demo

Details

Your query is used to retrieve the most relevant notes, if any, using Khoj search
These notes, the last few messages and associated metadata are passed to ChatGPT along with your query for a response
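
The flow described above can be sketched roughly as follows. This is an illustrative assumption, not Khoj's actual prompt: the `build_prompt` function, its `max_turns` parameter and the prompt layout are all hypothetical.

```python
def build_prompt(query: str, notes: list, history: list, max_turns: int = 2) -> str:
    """Combine retrieved notes, recent conversation turns and the query into one prompt."""
    # Keep only the last few (user message, assistant reply) turns for context
    recent = history[-max_turns:] if max_turns > 0 else []

    lines = ["Notes:"]
    lines += [f"- {note}" for note in notes]
    lines.append("Conversation:")
    for user_message, reply in recent:
        lines.append(f"User: {user_message}")
        lines.append(f"Assistant: {reply}")
    # The new query always comes last
    lines.append(f"User: {query}")
    return "\n".join(lines)


history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
prompt = build_prompt("what next?", ["note A", "note B"], history, max_turns=2)
print(prompt)
```

Limiting the history to the last few turns keeps the prompt short. The retrieved notes are what ground the response in your own content.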

Upgrade

Upgrade Khoj Server

pip install --upgrade khoj-assistant

Note: To upgrade to the latest pre-release version of the khoj server, run the command below

# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant

Upgrade Khoj on Emacs

Use your Emacs Package Manager to Upgrade
See khoj.el readme for details

Upgrade Khoj on Obsidian

Upgrade via the Community plugins tab on the settings pane in the Obsidian app
See the khoj plugin readme for details

Uninstall

(Optional) Hit Ctrl-C in the terminal running the khoj server to stop it
Delete the khoj directory in your home folder (i.e. ~/.khoj on Linux, Mac or C:\Users\<your-username>\.khoj on Windows)
Uninstall the khoj server with pip uninstall khoj-assistant
(Optional) Uninstall khoj.el or the khoj obsidian plugin in the standard way on Emacs, Obsidian

Troubleshoot

Install fails while building Tokenizer dependency

Details: pip install khoj-assistant fails while building the tokenizers dependency. Complains about Rust.
Fix: Install Rust to build the tokenizers package. For example on Mac run:
```
brew install rustup
rustup-init
source ~/.cargo/env
```
Refer: Issue with Fix for more details

Search starts giving wonky results

Fix: Open /api/update?force=true¹ in browser to regenerate index from scratch
Note: This is a fix for when you perceive the search results have degraded. Not if you think they've always given wonky results
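
If you prefer to trigger the re-index from a script instead of the browser, a minimal sketch could look like the one below. The helper names `update_url` and `regenerate_index` and the timeout value are assumptions. Only the /api/update?force=true path and the default URL come from this readme.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

KHOJ_URL = "http://localhost:8000"  # Default Khoj URL; change it if your server runs elsewhere


def update_url(base_url: str = KHOJ_URL, force: bool = True) -> str:
    """Build the URL of the index update endpoint mentioned above."""
    query = urlencode({"force": str(force).lower()})
    return urljoin(base_url, "/api/update") + "?" + query


def regenerate_index(base_url: str = KHOJ_URL, timeout: float = 600) -> int:
    """Send the same GET request a browser would and return the HTTP status code."""
    with urlopen(update_url(base_url), timeout=timeout) as response:
        return response.status


print(update_url())
# Call regenerate_index() while the Khoj server is running to rebuild the index
```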

Khoj in Docker errors out with "Killed" in error message

Fix: Increase RAM available to Docker Containers in Docker Settings
Refer: StackOverflow Solution, Configure Resources on Docker for Mac

Khoj errors out complaining about Tensors mismatch or null

Mitigation: Disable image search using the desktop GUI

Advanced Usage

Access Khoj on Mobile

Setup Khoj on your personal server. This can be any always-on machine, e.g. an old computer, Raspberry Pi(?) etc.
Install Tailscale on your personal server and phone
Open the Khoj web interface of the server from your phone browser.
It should be http://tailscale-ip-of-server:8000 or http://name-of-server:8000 if you've setup MagicDNS
Click the Add to Homescreen button
Enjoy exploring your notes, documents and images from your phone!

Use OpenAI Models for Search

Setup

Set encoder-type, encoder and model-directory under asymmetric and/or symmetric search-type in your khoj.yml²:

   asymmetric:
-    encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+    encoder: text-embedding-ada-002
+    encoder-type: khoj.utils.models.OpenAI
     cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
-    encoder-type: sentence_transformers.SentenceTransformer
-    model_directory: "~/.khoj/search/asymmetric/"
+    model-directory: null

Setup your OpenAI API key in Khoj
Restart Khoj server to generate embeddings. It will take longer than with offline models.

Warnings

This configuration uses an online model

It will send all notes to OpenAI to generate embeddings
All queries will be sent to OpenAI when you search with Khoj
You will be charged by OpenAI based on the total tokens processed
It requires an active internet connection to search and index

Search across Different Languages

To search for notes in multiple, different languages, you can use a multi-lingual model.
For example, the paraphrase-multilingual-MiniLM-L12-v2 model supports 50+ languages and offers good search quality and speed. To use it:

Manually update search-type > asymmetric > encoder to paraphrase-multilingual-MiniLM-L12-v2 in your ~/.khoj/khoj.yml file for now. See diff of khoj.yml below for illustration:

   asymmetric:
-    encoder: "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+    encoder: "paraphrase-multilingual-MiniLM-L12-v2"
     cross-encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2"
     model_directory: "~/.khoj/search/asymmetric/"

Regenerate your content index. For example, by opening <khoj-url>/api/update?force=true

Bootstrap Khoj Search for Offline Usage later

You can bootstrap Khoj pre-emptively to run on machines that do not have internet access. An example use-case would be to run Khoj on an air-gapped machine. Note: Only search can currently run in fully offline mode, not chat.

With Internet
1. Manually download the asymmetric text, symmetric text and image search models from HuggingFace
2. Pip install khoj (and dependencies) in an associated virtualenv. E.g. python -m venv .venv && source .venv/bin/activate && pip install khoj-assistant
Without Internet
1. Copy each of the search models into their respective folders, asymmetric, symmetric and image under the ~/.khoj/search/ directory on the air-gapped machine
2. Copy the khoj virtual environment directory onto the air-gapped machine, activate the environment and start khoj as normal. E.g. source .venv/bin/activate && khoj
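
The model copy step on the air-gapped machine could be scripted along these lines. This is a sketch under assumptions: the `install_models` helper is hypothetical, and it assumes you transferred the downloaded models in a directory with one sub-folder per search type (asymmetric, symmetric, image).

```python
import shutil
from pathlib import Path
from typing import Optional

MODEL_TYPES = ("asymmetric", "symmetric", "image")


def install_models(download_dir: Path, khoj_home: Optional[Path] = None) -> list:
    """Copy <download_dir>/<type>/ into <khoj_home>/search/<type>/ for each search type."""
    khoj_home = khoj_home or Path.home() / ".khoj"
    copied = []
    for model_type in MODEL_TYPES:
        source = download_dir / model_type
        if not source.is_dir():
            continue  # Skip search types whose model was not downloaded
        target = khoj_home / "search" / model_type
        # Creates missing parent folders and merges into an existing target folder
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

For example, `install_models(Path("/media/usb/khoj-models"))` would populate ~/.khoj/search/ from a USB drive. The drive path here is only a placeholder.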

Miscellaneous

Set your OpenAI API key in Khoj

If you want, Khoj can be configured to use OpenAI for search and chat.
Add your OpenAI API key to Khoj by using either of the two options below:

Open your Khoj settings, add your OpenAI API key, and click Save. Then go to your Khoj settings and click Configure. This will refresh Khoj with your OpenAI API key.

Set openai-api-key field under processor.conversation section in your khoj.yml² to your OpenAI API key and restart khoj:

processor:
  conversation:
-    openai-api-key: # "YOUR_OPENAI_API_KEY"
+    openai-api-key: sk-aaaaaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhhhhhhhhhhhh
    model: "text-davinci-003"
    conversation-logfile: "~/.khoj/processor/conversation/conversation_logs.json"

Warning: This will enable Khoj to send your query and note(s) to OpenAI for processing

GPT API

The chat, answer and search API endpoints use OpenAI API
They are disabled by default
To use them:
1. Setup your OpenAI API key in Khoj
2. Interact with them from the Khoj Swagger docs¹

Index Github Repository for Search, Chat

The Khoj Github plugin can index issues, commit messages and markdown, org-mode and PDF files from any repositories you have access to. This allows you to chat or search with these repositories. Get answers, resolve issues or just explore a repo with the help of your AI personal assistant.

See the Khoj FAQ for a demo of Khoj search and chat. It makes the Khoj github repo available for exploring.

Note: Khoj will ignore code files in the repository for now as the default AI model used works best with natural language text, not code.

Setup Khoj Github plugin

Get a personal access token (PAT) with repo and read:org scopes in the classic flow.
Configure Khoj settings to include the owner and repo_name. The owner will be the organization name if the repo is in an organization. The repo_name will be the name of the repository. Optionally, you can also supply a branch name. If no branch name is supplied, the master branch will be used.

Performance

Query performance

Semantic search using the bi-encoder is fairly fast at <50 ms
Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed for accuracy of results
Filters in query (e.g. by file, word or date) usually add <20ms to query latency
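
The speed versus accuracy trade-off controlled by top_k can be illustrated with a toy two-stage pipeline. The sketch below is not Khoj's implementation and uses no ML models: `cheap_score` (plain word overlap) merely stands in for the bi-encoder and `costly_score` for the cross-encoder. Both scorers and the `search` function are assumptions made for illustration.

```python
def cheap_score(query: str, entry: str) -> float:
    """Stand-in for the bi-encoder: fraction of query words found in the entry."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    entry_words = set(entry.lower().split())
    return len(query_words & entry_words) / len(query_words)


def costly_score(query: str, entry: str) -> float:
    """Stand-in for the cross-encoder: word overlap, penalized by entry length."""
    return cheap_score(query, entry) / max(len(entry.split()), 1)


def search(query: str, entries: list, top_k: int = 15, rerank: bool = True) -> list:
    # Stage 1: score every entry with the fast scorer and keep the best top_k
    candidates = sorted(entries, key=lambda e: cheap_score(query, e), reverse=True)[:top_k]
    if not rerank:
        return candidates
    # Stage 2: run the slow scorer on the top_k candidates only
    return sorted(candidates, key=lambda e: costly_score(query, e), reverse=True)


notes = [
    "install khoj on emacs using the package manager and then configure it",
    "install khoj on emacs",
    "grocery list for the week",
]
print(search("install khoj emacs", notes, top_k=2))
```

A larger top_k hands more candidates to the slow second stage, which can surface better results at the cost of latency. A smaller top_k does the opposite.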

Indexing performance

Indexing is more strongly impacted by the size of the source data
Indexing 100K+ line corpus of notes takes about 10 minutes
Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
Note: It should only take this long on the first run as the index is incrementally updated
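
To show why later runs are faster, here is one way an incremental update could work. This is an assumed scheme, not a description of Khoj's internals: entries are keyed by a content hash, and only entries missing from the index are embedded again. The `entry_key` and `update_index` helpers and the `embed` callback are hypothetical.

```python
import hashlib


def entry_key(entry: str) -> str:
    """Identify an entry by a hash of its content."""
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def update_index(index: dict, entries: list, embed) -> int:
    """Embed only new entries, drop deleted ones. Returns the number of entries embedded."""
    current = {entry_key(entry): entry for entry in entries}

    # Drop entries that no longer exist in the source data
    for key in list(index):
        if key not in current:
            del index[key]

    # Embed only the entries that are not in the index yet
    new_keys = [key for key in current if key not in index]
    for key in new_keys:
        index[key] = embed(current[key])
    return len(new_keys)


index = {}
first_run = update_index(index, ["note one", "note two"], embed=len)
second_run = update_index(index, ["note one", "note two", "note three"], embed=len)
print(first_run, second_run)
```

Here `len` is a placeholder for the expensive embedding call. The second run only pays that cost for the one new entry.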

Miscellaneous

Testing done on a Mac M1 and a >100K line corpus of notes
Search, indexing on a GPU has not been tested yet

Development

Visualize Codebase

Interactive Visualization

Setup

Using Pip

1. Install

# Get Khoj Code
git clone https://github.com/khoj-ai/khoj && cd khoj

# Create, Activate Virtual Environment
python3 -m venv .venv && source .venv/bin/activate

# Install Khoj for Development
pip install -e .[dev]

2. Run

Start Khoj
```
khoj -vv
```
Configure Khoj
- Via the Settings UI: Add files and directories to index in the Khoj settings UI once Khoj has started up. Once you've saved all your settings, click Configure.
- Manually:
  - Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  - Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    - Set input-directories field in image content-type section
  - Delete content-type and processor sub-section(s) irrelevant for your use-case
  - Restart khoj

Note: After configuration, wait for khoj to load the ML models, generate embeddings and expose the API to query the notes, images, documents etc. specified in the config YAML

Using Docker

1. Clone

git clone https://github.com/khoj-ai/khoj && cd khoj

2. Configure

Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes, PDFs and Github repositories
Optional: Edit application configuration in khoj_docker.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run. It's most likely not hung, just generating embeddings

4. Upgrade

docker-compose build --pull

Using Conda

1. Install Dependencies

Install Conda

2. Install Khoj

git clone https://github.com/khoj-ai/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6  # As conda does not support pyqt6 yet

3. Configure

Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
- Set input-directories field in image content-type section
Delete content-type, processor sub-sections irrelevant for your use-case

4. Run

python3 -m src.khoj.main -vv

This loads the ML models, generates embeddings and exposes the API to query the notes, images, documents etc. specified in the config YAML

5. Upgrade

cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Validate

Before Making Changes

Install Git Hooks for Validation
```
pre-commit install -t pre-push -t pre-commit
```
- This ensures standard code formatting fixes and other checks run automatically on every commit and push
- Note 1: If pre-commit didn't already get installed, install it via pip install pre-commit
- Note 2: To run the pre-commit changes manually, use pre-commit run --hook-stage manual --all before creating PR

Before Creating PR

Run Tests. If you get an error complaining about a missing fast_tokenizer_file, follow the solution in this Github issue.
```
pytest
```
Run MyPy to check types
```
mypy --config-file pyproject.toml
```

After Creating PR

Automated validation workflows run for every PR.

Ensure any issues seen by them are fixed
Test the python package created for a PR
1. Download and extract the zipped .whl artifact generated from the pypi workflow run for the PR.
2. Install (in your virtualenv) with pip install /path/to/download*.whl
3. Start and use the application to see if it works fine

Credits

Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
OpenAI CLIP Model for Image Search. See SBert Documentation
Charles Cave for OrgNode Parser
Org.js to render Org-mode results on the Web interface
Markdown-it to render Markdown results on the Web interface

¹ Default Khoj URL: http://localhost:8000
² Default Khoj config file: ~/.khoj/khoj.yml
