sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Mirror of khoj from Github


Readme.md

Khoj 🦅

A natural language search engine for your personal notes, transactions and images

Features
Demo
- Description
- Analysis
Architecture
Setup
- Clone
- Configure
- Run
Use
Upgrade
Troubleshoot
Miscellaneous
Development Setup
Performance
Credits

Features

Natural: Advanced Natural language understanding using Transformer based ML Models
Local: Your personal data stays local. All search, indexing is done on your machine*
Incremental: Incremental search for a fast, search-as-you-type experience
Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
Multiple Interfaces: Search using a Web Browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4

Description

User searches for "Setup editor"
The demo looks for the most relevant section in this readme and the khoj.el readme
Top result is what we are looking for, the section to Install Khoj.el on Emacs

Analysis

The results do not have any words used in the query
- Based on the top result it seems the re-ranking model understands that Emacs is an editor?
The results incrementally update as the query is entered
The results are re-ranked, for better accuracy, once user is idle
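
The idle-triggered re-ranking described above can be sketched as a small scheduler. This is only an illustration of the idea, not Khoj's actual frontend code: the class name, the method names and the 0.5 second threshold are all hypothetical.

```python
class IdleRerankScheduler:
    """Decide when to run the slower re-ranking step.

    Hypothetical sketch: the fast incremental search runs on every
    keystroke, re-ranking runs once the user has been idle long enough.
    """

    def __init__(self, idle_seconds=0.5):
        self.idle_seconds = idle_seconds  # assumed threshold, not from Khoj
        self.last_keystroke = None
        self.query = ""
        self.reranked = False

    def on_keystroke(self, query, now):
        """Record the latest query; the caller runs the fast search right away."""
        self.query = query
        self.last_keystroke = now
        self.reranked = False

    def should_rerank(self, now):
        """Return True once per query, after the idle threshold has elapsed."""
        if self.last_keystroke is None or self.reranked:
            return False
        if now - self.last_keystroke >= self.idle_seconds:
            self.reranked = True
            return True
        return False


scheduler = IdleRerankScheduler(idle_seconds=0.5)
scheduler.on_keystroke("Setup", now=0.0)
scheduler.on_keystroke("Setup editor", now=0.5)
print(scheduler.should_rerank(now=0.75))  # still typing recently
print(scheduler.should_rerank(now=1.0))   # idle long enough
```

Passing the clock in as `now` keeps the sketch deterministic; a real frontend would more likely use a timer that is reset on each keystroke.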

Architecture

Setup

1. Clone

git clone https://github.com/debanjum/khoj && cd khoj

2. Configure

Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
Optional: Edit application configuration in khoj_sample.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

Use

Khoj via Web
- Go to http://localhost:8000/ or open index.html in your browser
Khoj via Emacs
- Install khoj.el
- Run M-x khoj <user-query>
Khoj via API
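
As a rough sketch of querying the API from a script, the snippet below only builds a search URL. The base URL comes from the Web section above; the `/search` path and the `q`, `n` and `t` parameter names are assumptions rather than something this readme states, so check them against your running server before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # same host and port as the web interface


def build_search_url(query, content_type=None, max_results=5, base_url=BASE_URL):
    """Build a search URL for a locally running Khoj server.

    Assumption: the server exposes /search with q (query), n (result count)
    and t (content type) parameters. These names are not confirmed here.
    """
    params = {"q": query, "n": max_results}
    if content_type is not None:
        params["t"] = content_type
    return f"{base_url}/search?{urlencode(params)}"


url = build_search_url("Setup editor", content_type="org")
print(url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, once the server is up.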

Upgrade

docker-compose build --pull

Troubleshoot

Symptom: Errors out with "Killed" in error message
- Fix: Increase RAM available to Docker Containers in Docker Settings
- Refer: StackOverflow Solution, Configure Resources on Docker for Mac
Symptom: Errors out complaining about Tensors mismatch, null etc
- Mitigation: Delete content-type > image section from khoj_sample.yml

Miscellaneous

The experimental chat API endpoint uses the OpenAI API
- It is disabled by default
- To use it add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

Using Pip

Install Dependencies
1. Python3, Pip [Required]
2. Virtualenv [Optional]
3. Install Exiftool [Optional]
```
sudo apt-get -y install libimage-exiftool-perl
```

Install Khoj

virtualenv -p python3 .venv && source .venv/bin/activate # Optional
pip install khoj-assistant

Configure
- Configure files/directories to search in content-type section of khoj_sample.yml
- To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
  - Example: replace /data/notes/*.org with tests/data/notes/*.org
Run: Load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML
```
khoj -c=config/khoj_sample.yml -vv
```
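
The path substitution from the Configure step above can also be scripted instead of edited by hand. The following is a minimal sketch that works on the config as plain text; the function name and the `input-filter` key in the sample line are hypothetical and used only for illustration.

```python
import re

# Match "/data/" only where it starts a path, so an already rewritten
# "tests/data/" is left untouched and the rewrite is safe to repeat.
_DATA_PREFIX = re.compile(r"(?<![\w.])/data/")


def use_test_data(config_text):
    """Point every /data/ path in the config text at tests/data/ instead."""
    return _DATA_PREFIX.sub("tests/data/", config_text)


# Hypothetical config line; the real key names in khoj_sample.yml may differ
sample = "input-filter: /data/notes/*.org"
print(use_test_data(sample))
```

To apply it, read `config/khoj_sample.yml` as text, pass the contents through `use_test_data` and write the result back, ideally to a copy so the sample stays intact.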

Using Conda

Install Dependencies
1. Install Python3 [Required]
2. Install Conda [Required]
3. Install Exiftool [Optional]
```
sudo apt-get -y install libimage-exiftool-perl
```

Install Khoj

git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj

Configure
- Configure files/directories to search in content-type section of khoj_sample.yml
- To run application on test data, update file paths containing /data/ to tests/data/ in khoj_sample.yml
  - Example: replace /data/notes/*.org with tests/data/notes/*.org
Run: Load the ML model, generate embeddings and expose the API to query the notes, images, transactions etc. specified in the config YAML
```
python3 -m src.main -c=config/khoj_sample.yml -vv
```

Upgrade On Local Machine

Using Pip

pip install --upgrade khoj-assistant

Using Conda

cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Run Unit Tests

pytest

Performance

Query performance

Semantic search using the bi-encoder is fairly fast at <5 ms
Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed against accuracy of results
Applying explicit filters is very slow currently at ~6s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc
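
The two-stage query flow behind these numbers can be sketched in plain Python. This is a toy illustration and not Khoj's implementation: hand-written vectors stand in for bi-encoder embeddings, and `toy_pair_score` is a hypothetical stand-in for a cross-encoder model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, entries, top_k):
    """Stage 1 (bi-encoder stand-in): cheap similarity over precomputed vectors."""
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, score_pair):
    """Stage 2 (cross-encoder stand-in): score each (query, text) pair jointly."""
    return sorted(candidates, key=lambda e: score_pair(query, e["text"]), reverse=True)


def toy_pair_score(query, text):
    # Hypothetical scorer; a real cross-encoder would run a model here
    return 1.0 if "Emacs" in text else 0.0


# Toy entries with made-up two-dimensional "embeddings"
entries = [
    {"text": "Configure docker-compose.yml", "vector": [0.95, 0.05]},
    {"text": "Install Khoj.el on Emacs", "vector": [0.8, 0.2]},
    {"text": "Indexing performance", "vector": [0.1, 0.9]},
]

candidates = retrieve([1.0, 0.0], entries, top_k=2)
results = rerank("Setup editor", candidates, toy_pair_score)
print([entry["text"] for entry in results])
```

Only the `top_k` candidates from the first stage reach the slower second stage, which is why a larger `top_k` costs time but gives the re-ranker more chances to surface the best entry.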

Indexing performance

Indexing is more strongly impacted by the size of the source data
Indexing 100K+ line corpus of notes takes 6 minutes
Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run

Miscellaneous

Testing done on a Mac M1 and a >100K line corpus of notes
Search, indexing on a GPU has not been tested yet

Credits

Multi-QA MiniLM Model, All MiniLM Model for Text Search. See SBert Documentation
OpenAI CLIP Model for Image Search. See SBert Documentation
Charles Cave for OrgNode Parser
Org.js to render Org-mode results on the Web interface
Markdown-it to render Markdown results on the Web interface
Sven Marnach for PyExifTool
