Mirror of khoj from Github

Khoj 🦅

A natural language search engine for your personal notes, transactions and images

Features

  • Natural: Advanced natural language understanding using Transformer-based ML models
  • Local: Your personal data stays local. All search and indexing is done on your machine*
  • Incremental: Incremental search for a fast, search-as-you-type experience
  • Pluggable: Modular architecture makes it relatively easy to plug in new data sources, frontends and ML models
  • Multiple Sources: Search your Org-mode and Markdown notes, Beancount transactions and Photos
  • Multiple Interfaces: Search using a Web Browser, Emacs or the API

Demo

https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4

Description

Analysis

  • The results do not have any words used in the query
    • Based on the top result it seems the re-ranking model understands that Emacs is an editor?
  • The results incrementally update as the query is entered
  • The results are re-ranked, for better accuracy, once the user is idle
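
The behaviour described above (quick results on every keystroke, a slower re-rank once typing pauses) can be sketched roughly as follows. This is a minimal illustration, not Khoj's actual implementation: the `IncrementalSearch` class, the `idle_seconds` threshold and the two toy functions are hypothetical names, and the toy functions use plain substring matching as stand-ins for the real semantic models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class IncrementalSearch:
    """Fast results per keystroke; a slower re-rank once the user is idle."""
    fast_search: Callable[[str], List[str]]
    rerank: Callable[[str, List[str]], List[str]]
    idle_seconds: float = 0.5  # hypothetical idle threshold
    query: str = ""
    results: List[str] = field(default_factory=list)
    last_keystroke: Optional[float] = None
    reranked: bool = False

    def on_keystroke(self, query: str, now: float) -> List[str]:
        # Every keystroke triggers only the cheap search.
        self.query = query
        self.results = self.fast_search(query)
        self.last_keystroke = now
        self.reranked = False
        return self.results

    def on_tick(self, now: float) -> List[str]:
        # Called periodically; re-ranks at most once per pause in typing.
        idle = (
            self.last_keystroke is not None
            and now - self.last_keystroke >= self.idle_seconds
        )
        if idle and not self.reranked:
            self.results = self.rerank(self.query, self.results)
            self.reranked = True
        return self.results


# Toy stand-ins (NOT semantic search): substring match, then sort by length.
notes = ["Emacs keybindings", "Grocery list", "Vim versus Emacs"]


def toy_fast_search(query: str) -> List[str]:
    return [note for note in notes if query.lower() in note.lower()]


def toy_rerank(query: str, hits: List[str]) -> List[str]:
    return sorted(hits, key=len)


session = IncrementalSearch(fast_search=toy_fast_search, rerank=toy_rerank)
print(session.on_keystroke("emacs", now=0.0))  # fast results while typing
print(session.on_tick(now=0.2))                # still typing: unchanged
print(session.on_tick(now=0.6))                # idle: re-ranked results
```

Passing timestamps in explicitly keeps the sketch deterministic; a real frontend would more likely use a timer or debounce callback.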

Architecture

Setup

1. Clone

git clone https://github.com/debanjum/khoj && cd khoj

2. Configure

  • Required: Update docker-compose.yml to mount your images, (org-mode or markdown) notes and beancount directories
  • Optional: Edit application configuration in sample_config.yml

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings

Use

Upgrade

docker-compose build --pull

Troubleshooting

  • Symptom: Errors out with "Killed" in error message
  • Symptom: Errors out complaining about Tensor mismatch, null values, etc.
    • Mitigation: Delete content-type > image section from docker_sample_config.yml

Miscellaneous

  • The experimental chat API endpoint uses the OpenAI API
    • It is disabled by default
    • To use it add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

  1. Install Dependencies

    1. Install Python3 [Required]
    2. Install Conda [Required]
    3. Install Exiftool [Optional]
      sudo apt-get -y install libimage-exiftool-perl
      
  2. Install Khoj

    git clone https://github.com/debanjum/khoj && cd khoj
    conda env create -f config/environment.yml
    conda activate khoj
    
  3. Configure

    • Configure files/directories to search in content-type section of sample_config.yml
    • To run application on test data, update file paths containing /data/ to tests/data/ in sample_config.yml
      • Example: replace /data/notes/*.org with tests/data/notes/*.org
  4. Run

    Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML

    python3 -m src.main -c=config/sample_config.yml -vv
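
The path change from the Configure step (pointing /data/ paths at tests/data/) can also be scripted instead of edited by hand. The snippet below is only a sketch: the `use_test_data` helper and the `input-filter` key in the sample text are hypothetical, and the real layout of sample_config.yml may differ.

```python
import re

# Match "/data/" only where it starts a path, so a path that already reads
# "tests/data/..." is left untouched if the helper is run twice.
_DATA_PREFIX = re.compile(r"(?<![\w./-])/data/")


def use_test_data(config_text: str) -> str:
    """Return config text with absolute /data/ paths pointed at tests/data/."""
    return _DATA_PREFIX.sub("tests/data/", config_text)


# Hypothetical config fragment; the key name is an assumption.
sample = "input-filter: /data/notes/*.org\n"
print(use_test_data(sample))
```

To apply it to the real file, read config/sample_config.yml, pass its text through `use_test_data` and write the result to a copy, so the original sample config stays intact.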
    

Upgrade On Local Machine

cd khoj
git pull origin master
conda deactivate
conda env update -f config/environment.yml
conda activate khoj

Run Unit Tests

pytest

Performance

Query performance

  • Semantic search using the bi-encoder is fairly fast at <5 ms
  • Reranking using the cross-encoder is slower at <2s on 15 results. Tweak top_k to trade off speed against accuracy of results
  • Applying explicit filters is currently very slow at ~6s because the filters are rudimentary. Considerable speed-ups can be achieved using indexes, etc.
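
To make the role of top_k concrete, here is a rough sketch of the two-stage flow described above: a cheap bi-encoder style similarity over precomputed embeddings picks top_k candidates, and only those are passed to the expensive cross-encoder style scorer. It is an illustration under assumptions, not Khoj's code: the `search` function, the hand-written vectors and the word-overlap scorer are all hypothetical stand-ins for real model outputs.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(
    query: str,
    query_vec: Sequence[float],
    entries: List[str],
    entry_vecs: List[Sequence[float]],
    cross_score: Callable[[str, str], float],
    top_k: int = 15,
) -> List[str]:
    # Stage 1 (bi-encoder style): compare the query embedding against
    # precomputed entry embeddings and keep the top_k best matches.
    by_similarity = sorted(
        range(len(entries)),
        key=lambda i: dot(query_vec, entry_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    # Stage 2 (cross-encoder style): score each (query, entry) pair.
    # This is the slow part, so its cost grows with top_k.
    reranked = sorted(
        candidates,
        key=lambda i: cross_score(query, entries[i]),
        reverse=True,
    )
    return [entries[i] for i in reranked]


# Toy data: hand-written vectors stand in for model embeddings.
entries = ["emacs is an editor", "buy milk", "vim tips"]
entry_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
pair_calls = []


def toy_cross_score(query: str, entry: str) -> float:
    # Word overlap stands in for a real cross-encoder score.
    pair_calls.append(entry)
    return float(len(set(query.split()) & set(entry.split())))


print(search("vim tips", [1.0, 0.0], entries, entry_vecs, toy_cross_score, top_k=2))
print(len(pair_calls))  # number of expensive pair scores computed
```

A smaller top_k means fewer pair scores and a faster response, at the risk of dropping a relevant entry before the reranker ever sees it.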

Indexing performance

  • Indexing is more strongly impacted by the size of the source data
  • Indexing a 100K+ line corpus of notes takes 6 minutes
  • Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM
  • Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run

Miscellaneous

  • Testing was done on a Mac M1 with a >100K line corpus of notes
  • Search and indexing on a GPU have not been tested yet

Acknowledgments