Mirror of khoj from GitHub

Khoj 🦅

A natural language search engine for your personal notes, transactions and images

Features

  • Natural: Advanced natural language understanding using Transformer-based ML models
  • Local: Your personal data stays local. All search and indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
  • Multiple Interfaces: Search using a Web Browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/184735169-92c78bf1-d827-4663-9087-a1ea194b8f4b.mp4

Description

  • Install Khoj via pip
  • Start Khoj app
  • Add this readme and the khoj.el readme, both in org-mode format, for Khoj to index
  • Search "Setup editor" on the Web and Emacs. Re-rank the results for better accuracy
  • The top result is what we are looking for: the section on installing khoj.el on Emacs

Analysis

  • The results do not contain any of the words used in the query
    • Based on the top result it seems the re-ranking model understands that Emacs is an editor?
  • The results incrementally update as the query is entered
  • The results are re-ranked, for better accuracy, once the user hits Enter

Interfaces

Architecture

Setup

1. Install

pip install khoj-assistant

2. Start App

khoj

3. Configure

  1. Enable content types and point to files to search in the First Run Screen that pops up on app start
  2. Click Configure and wait. The app will download ML models and index the content for search

Use

Interfaces

Query Filters

Use structured query syntax to filter the natural language search results

  • Word Filter: Get entries that include/exclude a specified term
    • Entries that contain term_to_include: +"term_to_include"
    • Entries that contain term_to_exclude: -"term_to_exclude"
  • Date Filter: Get entries containing dates in YYYY-MM-DD format from a specified date (range)
    • Entries from April 1st 1984: dt:"1984-04-01"
    • Entries after March 31st 1984: dt>="1984-04-01"
    • Entries before April 2nd 1984: dt<="1984-04-01"
  • File Filter: Get entries from a specified file
    • Entries from incoming.org file: file:"incoming.org"
  • Combined Example
    • what is the meaning of life? file:"1984.org" dt>="1984-01-01" dt<="1985-01-01" -"big" -"brother"
    • Adds all filters to the natural language query. It should return entries
      • from the file 1984.org
      • containing dates from the year 1984
      • excluding words "big" and "brother"
      • that best match the natural language query "what is the meaning of life?"
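
To make the filter syntax concrete, here is a minimal sketch of how such a query could be split into its filters and the remaining natural language text. It is an illustration only, not Khoj's actual parser: `ParsedQuery` and `parse_query` are hypothetical names, and the regular expressions cover just the forms shown above (`+"…"`, `-"…"`, `file:"…"`, and `dt` with `:`, `>=` or `<=`).

```python
import re
from dataclasses import dataclass, field
from datetime import date

# Patterns for the filter forms listed in this readme (assumed, simplified).
WORD_FILTER = re.compile(r'([+-])"([^"]+)"')
FILE_FILTER = re.compile(r'file:"([^"]+)"')
DATE_FILTER = re.compile(r'dt(>=|<=|:)"(\d{4})-(\d{2})-(\d{2})"')


@dataclass
class ParsedQuery:
    """Hypothetical container for a filtered query; not a Khoj type."""
    text: str = ""
    include_words: list = field(default_factory=list)
    exclude_words: list = field(default_factory=list)
    files: list = field(default_factory=list)
    date_filters: list = field(default_factory=list)  # (operator, date) pairs


def parse_query(raw: str) -> ParsedQuery:
    parsed = ParsedQuery()
    # Collect word filters: "+" means include, "-" means exclude.
    for sign, word in WORD_FILTER.findall(raw):
        target = parsed.include_words if sign == "+" else parsed.exclude_words
        target.append(word)
    # Collect file and date filters.
    parsed.files = FILE_FILTER.findall(raw)
    for operator, year, month, day in DATE_FILTER.findall(raw):
        parsed.date_filters.append((operator, date(int(year), int(month), int(day))))
    # Whatever is left after removing the filters is the natural language query.
    remainder = raw
    for pattern in (WORD_FILTER, FILE_FILTER, DATE_FILTER):
        remainder = pattern.sub("", remainder)
    parsed.text = " ".join(remainder.split())
    return parsed
```

Calling `parse_query` on the combined example separates the natural language question from one file filter, two date bounds and two excluded words.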

Upgrade

pip install --upgrade khoj-assistant

Troubleshoot

  • Symptom: Errors out complaining about Tensors mismatch, null etc
    • Mitigation: Disable image search using the desktop GUI
  • Symptom: Errors out with "Killed" in error message in Docker

Advanced Usage

Access Khoj on Mobile

  1. Set up Khoj on your personal server. This can be any always-on machine, e.g. an old computer, Raspberry Pi(?) etc.
  2. Install Tailscale on your personal server and phone
  3. Open the Khoj web interface of the server from your phone browser. It should be http://tailscale-url-of-server:8000 or http://name-of-server:8000 if you've set up MagicDNS
  4. Click the Install/Add to Homescreen button
  5. Enjoy exploring your notes, transactions and images from your phone!

Miscellaneous

  • The beta chat and search API endpoints use the OpenAI API
    • They are disabled by default
    • To use them, add your openai-api-key via the app configure screen
    • Warning: If you use the above beta APIs, your query and top result(s) will be sent to OpenAI for processing

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast at <50 ms
  • Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed for accuracy of results
  • Filters in query (e.g. by file, word or date) usually add <20ms to query latency
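
The two-stage flow behind these numbers (a fast bi-encoder pass over every entry, then a slower cross-encoder pass over only the best top_k) can be sketched in plain Python. This is a toy illustration, not Khoj's code: `cosine`, `search` and the `rerank_score` callable are hypothetical names, the embeddings are assumed to be precomputed lists of floats, and `rerank_score` merely stands in for a real cross-encoder model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, query_vec, entries, rerank_score, top_k=15):
    """Return entry texts, best first.

    entries: list of (text, embedding) pairs with precomputed embeddings.
    rerank_score: callable(query, text) -> float, a stand-in for a cross-encoder.
    """
    # Stage 1: cheap vector similarity against every entry, keep the best top_k.
    candidates = sorted(
        entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
    )[:top_k]
    # Stage 2: the expensive scorer runs only on the top_k candidates.
    reranked = sorted(
        candidates, key=lambda entry: rerank_score(query, entry[0]), reverse=True
    )
    return [text for text, _ in reranked]
```

Because the second stage is called once per candidate, a smaller top_k means fewer expensive scoring calls, at the risk of dropping a relevant entry before it can be re-ranked.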

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing a 100K+ line corpus of notes takes about 10 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Note: It should only take this long on the first run as the index is incrementally updated
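
One way an incrementally updated index can avoid repeating the slow first run is to key each entry by a hash of its content and compute embeddings only for entries it has not seen before. The sketch below assumes that approach; the readme does not say how Khoj implements it. `update_index` and the `embed` callable are hypothetical names.

```python
import hashlib


def update_index(index, entries, embed):
    """Return (new_index, embedded_count).

    index: dict mapping content hash -> embedding from the previous run.
    entries: iterable of entry strings currently present in the source files.
    embed: callable(text) -> embedding; the slow step we want to avoid repeating.
    """
    new_index = {}
    embedded_count = 0
    for entry in entries:
        key = hashlib.sha256(entry.encode("utf-8")).hexdigest()
        if key in new_index:
            continue  # duplicate entry within this run
        if key in index:
            new_index[key] = index[key]  # unchanged entry: reuse old embedding
        else:
            new_index[key] = embed(entry)  # new or edited entry: embed it
            embedded_count += 1
    # Entries missing from `entries` are not copied over, so deletions drop out.
    return new_index, embedded_count
```

In this scheme an edited entry hashes to a new key, so it is re-embedded while its old embedding is discarded along with deleted entries.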

Miscellaneous

  • Testing done on a Mac M1 and a >100K line corpus of notes
  • Search and indexing on a GPU have not been tested yet

Development

Setup

Using Pip

1. Install
git clone https://github.com/debanjum/khoj && cd khoj
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
2. Configure
  • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    • Set input-directories field in image content-type section
  • Delete content-type and processor sub-section(s) irrelevant for your use-case
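
The copy step can also be scripted. The sketch below is a convenience under stated assumptions, not part of Khoj: `install_sample_config` is a hypothetical helper that copies the sample config into place only when no config exists yet, so an already edited ~/.khoj/khoj.yml is never overwritten.

```python
import shutil
from pathlib import Path


def install_sample_config(sample: Path, target: Path) -> bool:
    """Copy `sample` to `target` unless `target` already exists.

    Returns True if the file was copied, False if an existing config was kept.
    """
    if target.exists():
        return False
    # Create ~/.khoj (or whichever parent directory) if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(sample, target)
    return True
```

From the repository root it could be called as `install_sample_config(Path("config/khoj_sample.yml"), Path.home() / ".khoj" / "khoj.yml")`, after which the input-files, input-filter and input-directories fields still need to be edited by hand.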
3. Run
khoj -vv

This loads the ML models, generates embeddings and exposes an API to query the notes, images, transactions etc. specified in the config YAML

4. Upgrade
# To Upgrade To Latest Stable Release
# Maps to the latest tagged version of khoj on master branch
pip install --upgrade khoj-assistant

# To Upgrade To Latest Pre-Release
# Maps to the latest commit on the master branch
pip install --upgrade --pre khoj-assistant

# To Upgrade To Specific Development Release.
# Useful to test, review a PR.
# Note: khoj-assistant is published to Test PyPI on creating a PR
pip install -i https://test.pypi.org/simple/ khoj-assistant==0.1.5.dev57166025766

Using Docker

1. Clone
git clone https://github.com/debanjum/khoj && cd khoj
2. Configure
  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in khoj_docker.yml
3. Run
docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

4. Upgrade
docker-compose build --pull

Using Conda

1. Install Dependencies
2. Install Khoj
git clone https://github.com/debanjum/khoj && cd khoj
conda env create -f config/environment.yml
conda activate khoj
python3 -m pip install pyqt6  # As conda does not support pyqt6 yet
3. Configure
  • Copy the config/khoj_sample.yml to ~/.khoj/khoj.yml
  • Set input-files or input-filter in each relevant content-type section of ~/.khoj/khoj.yml
    • Set input-directories field in image content-type section
  • Delete content-type, processor sub-sections irrelevant for your use-case
4. Run
python3 -m src.main -vv

This loads the ML models, generates embeddings and exposes an API to query the notes, images, transactions etc. specified in the config YAML

5. Upgrade
cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Test

pytest

Credits