# Khoj 🦅
[![build](https://github.com/debanjum/khoj/actions/workflows/build.yml/badge.svg)](https://github.com/debanjum/khoj/actions/workflows/build.yml)
[![test](https://github.com/debanjum/khoj/actions/workflows/test.yml/badge.svg)](https://github.com/debanjum/khoj/actions/workflows/test.yml)
[![publish](https://github.com/debanjum/khoj/actions/workflows/publish.yml/badge.svg)](https://github.com/debanjum/khoj/actions/workflows/publish.yml)

*A natural language search engine for your personal notes, transactions and images*
## Table of Contents
- [Features](#Features)
- [Demo](#Demo)
  - [Description](#Description)
  - [Analysis](#Analysis)
  - [Interfaces](#Interfaces)
- [Architecture](#Architecture)
- [Setup](#Setup)
  - [Install](#1-Install)
  - [Start App](#2-Start-App)
  - [Configure](#3-Configure)
- [Use](#Use)
- [Upgrade](#Upgrade)
- [Troubleshoot](#Troubleshoot)
- [Miscellaneous](#Miscellaneous)
- [Performance](#Performance)
  - [Query Performance](#Query-performance)
  - [Indexing Performance](#Indexing-performance)
  - [Miscellaneous](#Miscellaneous-1)
- [Development](#Development)
  - [Setup](#Setup-1)
    - [Using Pip](#Using-Pip)
    - [Using Docker](#Using-Docker)
    - [Using Conda](#Using-Conda)
  - [Test](#Test)
- [Credits](#Credits)

## Features
- **Natural**: Advanced natural language understanding using Transformer-based ML models
- **Local**: Your personal data stays local. All search and indexing are done on your machine[\*](https://github.com/debanjum/khoj#miscellaneous)
- **Incremental**: Incremental search for a fast, search-as-you-type experience
- **Pluggable**: Modular architecture makes it easy to plug in new data sources, frontends and ML models
- **Multiple Sources**: Search your Org-mode and Markdown notes, Beancount transactions and Photos
- **Multiple Interfaces**: Search using a [Web Browser](./src/interface/web/index.html), [Emacs](./src/interface/emacs/khoj.el) or the [API](http://localhost:8000/docs)

## Demo
https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4

### Description
- Install Khoj via pip
- Start the Khoj app
- Add this readme and the [khoj.el readme](https://github.com/debanjum/khoj/tree/master/src/interface/emacs) as org-mode for Khoj to index
- Search "*Setup editor*" on the Web and Emacs. Re-rank the results for better accuracy
- The top result is what we are looking for: the [section to install khoj.el on Emacs](https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation)

### Analysis
- The results do not contain any of the words used in the query
  - *Based on the top result it seems the re-ranking model understands that Emacs is an editor?*
- The results incrementally update as the query is entered
- The results are re-ranked, for better accuracy, once the user hits enter

### Interfaces
![](https://github.com/debanjum/khoj/blob/master/docs/interfaces.png)
## Architecture
![](https://github.com/debanjum/khoj/blob/master/docs/khoj_architecture.png)
## Setup
### 1. Install
```shell
pip install khoj-assistant
```

### 2. Start App
```shell
khoj
```

### 3. Configure
1. Enable content types and point to the files to search in the First Run Screen that pops up on app start
2. Click configure and wait. The app will load the ML model, generate embeddings and expose the search API

## Use
- **Khoj via Web**
  - Open <http://localhost:8000/> via the desktop interface or directly
- **Khoj via Emacs**
  - [Install](https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation) [khoj.el](./src/interface/emacs/khoj.el)
  - Run `M-x khoj <user-query>`
- **Khoj via API**
  - See the Khoj FastAPI [Swagger Docs](http://localhost:8000/docs), [ReDocs](http://localhost:8000/redocs)
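
As a rough illustration of querying the API from a script, the sketch below builds a search URL against the local server. The `/search` path and the `q`, `t` and `n` parameter names are assumptions made for this example, not confirmed by this readme; check the Swagger Docs of your running instance for the actual schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

KHOJ_URL = "http://localhost:8000"  # default address used throughout this readme


def build_search_url(query, content_type="org", results=5, base_url=KHOJ_URL):
    """Build a search URL for a locally running Khoj server.

    ASSUMPTION: the `/search` path and the `q`, `t`, `n` parameter names are
    illustrative only. Verify them against http://localhost:8000/docs.
    """
    params = urlencode({"q": query, "t": content_type, "n": results})
    return f"{base_url}/search?{params}"


def search(query, **kwargs):
    """Send the query and decode the JSON response (needs a running server)."""
    with urlopen(build_search_url(query, **kwargs)) as response:
        return json.load(response)


# Building the URL works without a server; calling search() does not
print(build_search_url("Setup editor"))
```

Only `build_search_url` runs offline. Call `search("Setup editor")` once the app is up and the endpoint details have been confirmed.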
## Upgrade
```shell
pip install --upgrade khoj-assistant
```
## Troubleshoot
- Symptom: Errors out complaining about Tensors mismatch, null etc
  - Mitigation: Disable `image` search using the desktop GUI
- Symptom: Errors out with "Killed" in error message in Docker
  - Fix: Increase RAM available to Docker Containers in Docker Settings
  - Refer: [StackOverflow Solution](https://stackoverflow.com/a/50770267), [Configure Resources on Docker for Mac](https://docs.docker.com/desktop/mac/#resources)
## Miscellaneous
- The beta [chat](http://localhost:8000/beta/chat) and [search](http://localhost:8000/beta/search) API endpoints use the [OpenAI API](https://openai.com/api/)
  - They are disabled by default
  - To use them, add your `openai-api-key` via the app configure screen
  - Warning: *If you use the above beta APIs, your query and top result(s) will be sent to OpenAI for processing*

## Performance
### Query performance
- Semantic search using the bi-encoder is fairly fast at <50 ms
- Reranking using the cross-encoder is slower at <2s on 15 results. Tweak `top_k` to trade off speed against accuracy of results
- Filters in query (e.g. by file, word or date) usually add <20ms to query latency
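
To make the `top_k` tradeoff concrete, here is a toy sketch of the two-stage retrieve and re-rank pattern. It is not Khoj's implementation: the hand-written vectors stand in for bi-encoder embeddings, and `rerank_score` is a hypothetical placeholder for a cross-encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_and_rerank(query_vec, entries, rerank_score, top_k=15):
    """entries: list of (text, embedding) pairs.

    Stage 1 scores every entry with a cheap vector similarity and keeps top_k.
    Stage 2 runs the expensive scorer on those top_k candidates only, so a
    larger top_k costs more time but gives the re-ranker more to work with.
    """
    candidates = sorted(
        entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
    )[:top_k]
    return sorted(candidates, key=lambda e: rerank_score(e[0]), reverse=True)


# Toy data: made-up 2-d "embeddings" and a keyword check as a fake re-ranker
entries = [
    ("configure editor plugins", [1.0, 0.2]),
    ("install khoj.el on emacs", [0.9, 0.1]),
    ("beancount transactions", [0.1, 0.9]),
]
query_vec = [1.0, 0.2]
fake_rerank = lambda text: 1.0 if "emacs" in text else 0.0

wide = [t for t, _ in retrieve_and_rerank(query_vec, entries, fake_rerank, top_k=2)]
narrow = [t for t, _ in retrieve_and_rerank(query_vec, entries, fake_rerank, top_k=1)]
print(wide)
print(narrow)
```

With `top_k=2` the re-ranker can promote the Emacs entry, while `top_k=1` never shows it that candidate. That is the speed versus accuracy tradeoff in miniature.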
### Indexing performance
- Indexing is more strongly impacted by the size of the source data
- Indexing a 100K+ line corpus of notes takes about 10 minutes
- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
- Note: *It should only take this long on the first run* as the index is incrementally updated
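
The idea behind an incrementally updated index can be sketched as follows. This is an illustration of the general technique, not Khoj's actual code: entries are keyed by a content hash, and `embed` is a hypothetical stand-in for the expensive model call.

```python
import hashlib


def entry_key(text):
    """Stable key for an entry, derived from its content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def update_index(index, entries, embed):
    """Return (new_index, number_of_entries_embedded).

    index maps entry_key -> embedding. Entries already present are reused,
    entries no longer in the source are dropped, and only new or changed
    entries are passed to the (expensive) embed function.
    """
    new_index = {}
    embedded = 0
    for text in entries:
        key = entry_key(text)
        if key in new_index:
            continue  # duplicate entry in this run
        if key in index:
            new_index[key] = index[key]  # unchanged: reuse the old embedding
        else:
            new_index[key] = embed(text)  # new or edited: embed it
            embedded += 1
    return new_index, embedded


fake_embed = lambda text: [float(len(text))]  # placeholder for a real model

index, first_run = update_index({}, ["note a", "note b"], fake_embed)
index, second_run = update_index(index, ["note a", "note b", "note c"], fake_embed)
print(first_run, second_run)
```

The first run embeds everything. The second run embeds only the one added note, which is why later runs are much faster than the first.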
### Miscellaneous
- Testing done on a Mac M1 and a >100K line corpus of notes
- Search and indexing on a GPU have not been tested yet
## Development
### Setup
#### Using Pip
##### 1. Install
```shell
git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
```
##### 2. Configure
- Copy `config/khoj_sample.yml` to `~/.khoj/khoj.yml`
- Set `input-files` or `input-filter` in each relevant `content-type` section of `~/.khoj/khoj.yml`
  - Set the `input-directories` field in the `image` `content-type` section
- Delete `content-type` and `processor` sub-section(s) irrelevant for your use-case
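
The rules above can be expressed as a small sanity check. This is a sketch only: it assumes the YAML has already been parsed into a dictionary, and the `org` and `markdown` section names and the file paths are made up for illustration. Only the field names come from the steps above.

```python
def check_content_types(config):
    """Return a list of problems found in the `content-type` section.

    Follows the steps above: the `image` section needs `input-directories`,
    every other section needs `input-files` or `input-filter`.
    """
    problems = []
    for name, section in config.get("content-type", {}).items():
        section = section or {}  # an empty YAML section parses as None
        if name == "image":
            if not section.get("input-directories"):
                problems.append(f"{name}: set input-directories")
        elif not (section.get("input-files") or section.get("input-filter")):
            problems.append(f"{name}: set input-files or input-filter")
    return problems


# Hypothetical parsed config; section names and paths are examples only
config = {
    "content-type": {
        "org": {"input-files": ["~/notes/todo.org"], "input-filter": None},
        "markdown": {"input-files": None, "input-filter": None},
        "image": {"input-directories": ["~/photos"]},
    }
}
problems = check_content_types(config)
print(problems)
```

Here the `markdown` section is flagged because it sets neither field. Either fill it in or delete it, as the last step suggests.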

##### 3. Run
```shell
khoj -vv
```
This loads the ML model, generates embeddings and exposes the API to query the notes, images, transactions etc. specified in the config YAML

##### 4. Upgrade
```shell
# To Upgrade To Latest Stable Release
# Maps to the latest tagged version of khoj on master branch
pip install --upgrade khoj-assistant

# To Upgrade To Latest Pre-Release
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant

# To Upgrade To Specific Development Release.
# Useful to test, review a PR.
# Note: khoj-assistant is published to test PyPi on creating a PR
pip install -i https://test.pypi.org/simple/ khoj-assistant==0.1.5.dev57166025766
```
#### Using Docker
##### 1. Clone

```shell
git clone https://github.com/debanjum/khoj && cd khoj
```

##### 2. Configure

- **Required**: Update [docker-compose.yml](./docker-compose.yml) to mount your images, (org-mode or markdown) notes and beancount directories
- **Optional**: Edit application configuration in [khoj_docker.yml](./config/khoj_docker.yml)

##### 3. Run

```shell
docker-compose up -d
```

*Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings*

##### 4. Upgrade
```shell
docker-compose build --pull
```
#### Using Conda
##### 1. Install Dependencies
- [Install Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) \[Required\]
- Install Exiftool \[Optional\]
  ```shell
  sudo apt -y install libimage-exiftool-perl
  ```

##### 2. Install Khoj
```shell
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
```

##### 3. Configure
- Copy `config/khoj_sample.yml` to `~/.khoj/khoj.yml`
- Set `input-files` or `input-filter` in each relevant `content-type` section of `~/.khoj/khoj.yml`
  - Set the `input-directories` field in the `image` `content-type` section
- Delete `content-type` and `processor` sub-section(s) irrelevant for your use-case

##### 4. Run
```shell
python3 -m src.main -vv
```
This loads the ML model, generates embeddings and exposes the API to query the notes, images, transactions etc. specified in the config YAML

##### 5. Upgrade
```shell
cd khoj
git pull origin master
# `conda deactivate` takes no environment name
conda deactivate
conda env update -f config/environment.yml
conda activate khoj
```
### Test
```shell
pytest
```
## Credits
- [Multi-QA MiniLM Model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1), [All MiniLM Model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for Text Search. See [SBert Documentation](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
- [OpenAI CLIP Model](https://github.com/openai/CLIP) for Image Search. See [SBert Documentation](https://www.sbert.net/examples/applications/image-search/README.html)
- Charles Cave for [OrgNode Parser](http://members.optusnet.com.au/~charles57/GTD/orgnode.html)
- [Org.js](https://mooz.github.io/org-js/) to render Org-mode results on the Web interface
- [Markdown-it](https://github.com/markdown-it/markdown-it) to render Markdown results on the Web interface
- Sven Marnach for [PyExifTool](https://github.com/smarnach/pyexiftool/blob/master/exiftool.py)