diff --git a/README.org b/README.org deleted file mode 100644 index f4c11d07..00000000 --- a/README.org +++ /dev/null @@ -1,171 +0,0 @@ -[[https://github.com/debanjum/khoj/actions/workflows/test.yml/badge.svg]] [[https://github.com/debanjum/khoj/actions/workflows/build.yml/badge.svg]] - -* Khoj - /A natural language search engine for your personal notes, transactions and images/ - -** Table of Contents - - [[https://github.com/debanjum/khoj#Features][Features]] - - [[https://github.com/debanjum/khoj#Demo][Demo]] - - [[https://github.com/debanjum/khoj#Description][Description]] - - [[https://github.com/debanjum/khoj#Analysis][Analysis]] - - [[https://github.com/debanjum/khoj#Architecture][Architecture]] - - [[https://github.com/debanjum/khoj#Setup][Setup]] - - [[https://github.com/debanjum/khoj#Clone][Clone]] - - [[https://github.com/debanjum/khoj#Configure][Configure]] - - [[https://github.com/debanjum/khoj#Run][Run]] - - [[https://github.com/debanjum/khoj#Use][Use]] - - [[https://github.com/debanjum/khoj#Upgrade][Upgrade]] - - [[https://github.com/debanjum/khoj#Troubleshooting][Troubleshooting]] - - [[https://github.com/debanjum/khoj#Miscellaneous][Miscellaneous]] - - [[https://github.com/debanjum/khoj#Development-setup][Development Setup]] - - [[https://github.com/debanjum/khoj#Setup-on-local-machine][Setup on Local Machine]] - - [[https://github.com/debanjum/khoj#Upgrade-on-local-machine][Upgrade on Local Machine]] - - [[https://github.com/debanjum/khoj#Run-unit-tests][Run Unit Tests]] - - [[https://github.com/debanjum/khoj#Performance][Performance]] - - [[https://github.com/debanjum/khoj#Query-performance][Query Performance]] - - [[https://github.com/debanjum/khoj#Indexing-performance][Indexing Performance]] - - [[https://github.com/debanjum/khoj#Miscellaneous-1][Miscellaneous]] - - [[https://github.com/debanjum/khoj#Acknowledgments][Acknowledgments]] - -** Features - - *Natural*: Advanced Natural language understanding using Transformer based ML Models - - 
*Local*: Your personal data stays local. All search, indexing is done on your machine[[https://github.com/debanjum/khoj#miscellaneous][*]] - - *Incremental*: Incremental search for a fast, search-as-you-type experience - - *Pluggable*: Modular architecture makes it relatively easy to plug in new data sources, frontends and ML models - - *Multiple Sources*: Search your Org-mode and Markdown notes, Beancount transactions and Photos - - *Multiple Interfaces*: Search using a [[./src/interface/web/index.html][Web Browser]], [[./src/interface/emacs/khoj.el][Emacs]] or the [[http://localhost:8000/docs][API]] - -** Demo - https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4 - -*** Description - - User searches for "/Setup editor/" - - The demo looks for the most relevant section in this readme and the [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs][khoj.el readme]] - - Top result is what we are looking for, the [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation][section to Install Khoj.el on Emacs]] - -*** Analysis - - The results do not have any words used in the query - - /Based on the top result it seems the re-ranking model understands that Emacs is an editor?/ - - The results incrementally update as the query is entered - - The results are re-ranked, for better accuracy, once user is idle - -** Architecture - [[https://github.com/debanjum/khoj/blob/master/docs/khoj_architecture.png]] - -** Setup - -*** 1. Clone - #+begin_src shell - git clone https://github.com/debanjum/khoj && cd khoj - #+end_src - -*** 2. Configure - - *Required*: Update [[./docker-compose.yml][docker-compose.yml]] to mount your images, (org-mode or markdown) notes and beancount directories - - *Optional*: Edit application configuration in [[./config/sample_config.yml][sample_config.yml]] - -*** 3. Run - #+begin_src shell - docker-compose up -d - #+end_src - - /Note: The first run will take time. 
Let it run, it's mostly not hung, just generating embeddings/ - -** Use - - - *Khoj via Web* - - Go to [[http://localhost:8000/]] or open [[./src/interface/web/index.html][index.html]] in your browser - - - *Khoj via Emacs* - - [[https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation][Install]] [[./src/interface/emacs/khoj.el][khoj.el]] - - Run ~M-x khoj ~ - - - *Khoj via API* - - See [[http://localhost:8000/docs][Khoj FastAPI Docs]] - - [[http://localhost:8000/search?q=%22what%20is%20the%20meaning%20of%20life%22][Query]] - - [[http://localhost:8000/regenerate?t=ledger][Regenerate Embeddings]] - - [[https://localhost:8000/ui][Configure Application]] - -** Upgrade - #+begin_src shell - docker-compose build --pull - #+end_src - -** Troubleshooting - - Symptom: Errors out with "Killed" in error message - - Fix: Increase RAM available to Docker Containers in Docker Settings - - Refer: [[https://stackoverflow.com/a/50770267][StackOverflow Solution]], [[https://docs.docker.com/desktop/mac/#resources][Configure Resources on Docker for Mac]] - - Symptom: Errors out complaining about Tensors mismatch, null etc - - Mitigation: Delete content-type > image section from docker_sample_config.yml - -** Miscellaneous - - The experimental [[localhost:8000/chat][chat]] API endpoint uses the [[https://openai.com/api/][OpenAI API]] - - It is disabled by default - - To use it add your ~openai-api-key~ to config.yml - -** Development Setup -*** Setup on Local Machine - -**** 1. Install Dependencies - 1. Install Python3 [Required] - 2. [[https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html][Install Conda]] [Required] - 3. Install Exiftool [Optional] - #+begin_src shell - sudo apt-get -y install libimage-exiftool-perl - #+end_src - -**** 2. Install Khoj - #+begin_src shell - git clone https://github.com/debanjum/khoj && cd khoj - conda env create -f config/environment.yml - conda activate khoj - #+end_src - -**** 3. 
Configure - - Configure files/directories to search in ~content-type~ section of ~sample_config.yml~ - - To run application on test data, update file paths containing ~/data/~ to ~tests/data/~ in ~sample_config.yml~ - - Example replace ~/data/notes/*.org~ with ~tests/data/notes/*.org~ - -**** 4. Run - Load ML model, generate embeddings and expose API to query notes, images, transactions etc specified in config YAML - - #+begin_src shell - python3 -m src.main -c=config/sample_config.yml -vv - #+end_src - -*** Upgrade On Local Machine - #+begin_src shell - cd khoj - git pull origin master - conda deactivate khoj - conda env update -f config/environment.yml - conda activate khoj - #+end_src - -*** Run Unit Tests - #+begin_src shell - pytest - #+end_src - -** Performance -*** Query performance - - Semantic search using the bi-encoder is fairly fast at <5 ms - - Reranking using the cross-encoder is slower at <2s on 15 results. Tweak ~top_k~ to tradeoff speed for accuracy of results. - - Applying explicit filters is very slow currently at ~6s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc. - -*** Indexing performance - - Indexing is more strongly impacted by the size of the source data - - Indexing 100K+ line corpus of notes takes 6 minutes - - Indexing 4000+ images takes about 15 minutes and more than 8Gb of RAM - - Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run - -*** Miscellaneous - - Testing done on a Mac M1 and a >100K line corpus of notes - - Search, indexing on a GPU has not been tested yet - -** Acknowledgments - - [[https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1][Multi-QA MiniLM Model]], [[https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2][All MiniLM Model]] for Text Search. 
See [[https://www.sbert.net/examples/applications/retrieve_rerank/README.html][SBert Documentation]] - - [[https://github.com/openai/CLIP][OpenAI CLIP Model]] for Image Search. See [[https://www.sbert.net/examples/applications/image-search/README.html][SBert Documentation]] - - Charles Cave for [[http://members.optusnet.com.au/~charles57/GTD/orgnode.html][OrgNode Parser]] - - [[https://mooz.github.io/org-js/][Org.js]] to render Org-mode results on the Web interface - - [[https://github.com/markdown-it/markdown-it][Markdown-it]] to render Markdown results on the Web interface - - Sven Marnach for [[https://github.com/smarnach/pyexiftool/blob/master/exiftool.py][PyExifTool]] \ No newline at end of file diff --git a/Readme.md b/Readme.md new file mode 100644 index 00000000..b91a5a94 --- /dev/null +++ b/Readme.md @@ -0,0 +1,191 @@ +![](https://github.com/debanjum/khoj/actions/workflows/test.yml/badge.svg) +![](https://github.com/debanjum/khoj/actions/workflows/build.yml/badge.svg) + +# Khoj + +*A natural language search engine for your personal notes, transactions and images* + +## Table of Contents + +- [Features](#Features) +- [Demo](#Demo) + - [Description](#Description) + - [Analysis](#Analysis) +- [Architecture](#Architecture) +- [Setup](#Setup) + - [Clone](#Clone) + - [Configure](#Configure) + - [Run](#Run) +- [Use](#Use) +- [Upgrade](#Upgrade) +- [Troubleshooting](#Troubleshooting) +- [Miscellaneous](#Miscellaneous) +- [Development Setup](#Development-setup) + - [Setup on Local Machine](#Setup-on-local-machine) + - [Upgrade on Local Machine](#Upgrade-on-local-machine) + - [Run Unit Tests](#Run-unit-tests) +- [Performance](#Performance) + - [Query Performance](#Query-performance) + - [Indexing Performance](#Indexing-performance) + - [Miscellaneous](#Miscellaneous-1) +- [Acknowledgments](#Acknowledgments) + +## Features + +- **Natural**: Advanced Natural language understanding using Transformer based ML Models +- **Local**: Your personal data stays local. 
All search and indexing is done on your machine[\*](https://github.com/debanjum/khoj#miscellaneous) +- **Incremental**: Incremental search for a fast, search-as-you-type experience +- **Pluggable**: Modular architecture makes it relatively easy to plug in new data sources, frontends and ML models +- **Multiple Sources**: Search your Org-mode and Markdown notes, Beancount transactions and Photos +- **Multiple Interfaces**: Search using a [Web Browser](./src/interface/web/index.html), [Emacs](./src/interface/emacs/khoj.el) or the [API](http://localhost:8000/docs) + +## Demo + +https://user-images.githubusercontent.com/6413477/181664862-31565b0a-0e64-47e1-a79a-599dfc486c74.mp4 + +### Description + +- User searches for \"*Setup editor*\" +- The demo looks for the most relevant section in this readme and the [khoj.el readme](https://github.com/debanjum/khoj/tree/master/src/interface/emacs) +- Top result is what we are looking for, the [section to Install Khoj.el on Emacs](https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation) + +### Analysis + +- The results do not contain any of the words used in the query + - *Based on the top result it seems the re-ranking model understands that Emacs is an editor?* +- The results incrementally update as the query is entered +- The results are re-ranked, for better accuracy, once the user is idle + +## Architecture + +![](https://github.com/debanjum/khoj/blob/master/docs/khoj_architecture.png) + +## Setup + +### 1. Clone + +``` shell +git clone https://github.com/debanjum/khoj && cd khoj +``` + +### 2. Configure + +- **Required**: Update [docker-compose.yml](./docker-compose.yml) to mount your images, (org-mode or markdown) notes and beancount directories +- **Optional**: Edit application configuration in [sample_config.yml](./config/sample_config.yml) + +### 3. Run + +``` shell +docker-compose up -d +``` + +*Note: The first run will take time. 
Let it run, it\'s most likely not hung, just generating embeddings* + +## Use + +- **Khoj via Web** + - Go to http://localhost:8000/ or open [index.html](./src/interface/web/index.html) in your browser +- **Khoj via Emacs** + - [Install](https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation) [khoj.el](./src/interface/emacs/khoj.el) + - Run `M-x khoj ` +- **Khoj via API** + - See [Khoj FastAPI Docs](http://localhost:8000/docs) + - [Query](http://localhost:8000/search?q=%22what%20is%20the%20meaning%20of%20life%22) + - [Regenerate Embeddings](http://localhost:8000/regenerate?t=ledger) + - [Configure Application](http://localhost:8000/ui) + +## Upgrade + +``` shell +docker-compose build --pull +``` + +## Troubleshooting + +- Symptom: Errors out with \"Killed\" in error message + - Fix: Increase RAM available to Docker Containers in Docker Settings + - Refer: [StackOverflow Solution](https://stackoverflow.com/a/50770267), [Configure Resources on Docker for Mac](https://docs.docker.com/desktop/mac/#resources) +- Symptom: Errors out complaining about Tensors mismatch, null, etc. + - Mitigation: Delete content-type > image section from `docker_sample_config.yml` + +## Miscellaneous + +- The experimental [chat](http://localhost:8000/chat) API endpoint uses the [OpenAI API](https://openai.com/api/) + - It is disabled by default + - To use it, add your `openai-api-key` to `config.yml` + +## Development Setup + +### Setup on Local Machine + +1. Install Dependencies + 1. Install Python3 \[Required\] + 2. [Install Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) \[Required\] + 3. Install Exiftool \[Optional\] + ``` shell + sudo apt-get -y install libimage-exiftool-perl + ``` + +2. Install Khoj + ``` shell + git clone https://github.com/debanjum/khoj && cd khoj + conda env create -f config/environment.yml + conda activate khoj + ``` + +3. 
Configure + - Configure files/directories to search in `content-type` section of `sample_config.yml` + - To run application on test data, update file paths containing `/data/` to `tests/data/` in `sample_config.yml` + - Example: replace `/data/notes/*.org` with `tests/data/notes/*.org` + +4. Run + Load ML model, generate embeddings and expose API to query notes, images, transactions, etc. specified in the config YAML + + ``` shell + python3 -m src.main -c=config/sample_config.yml -vv + ``` + +### Upgrade on Local Machine + +``` shell +cd khoj +git pull origin master +conda deactivate +conda env update -f config/environment.yml +conda activate khoj +``` + +### Run Unit Tests + +``` shell +pytest +``` + +## Performance + +### Query Performance + +- Semantic search using the bi-encoder is fairly fast at \<5 ms +- Reranking using the cross-encoder is slower at \<2 s on 15 results. Tweak `top_k` to trade off speed for accuracy of results +- Applying explicit filters is currently very slow at \~6 s. This is because the filters are rudimentary. Considerable speed-ups can be achieved using indexes etc. + +### Indexing Performance + +- Indexing is more strongly impacted by the size of the source data +- Indexing a 100K+ line corpus of notes takes 6 minutes +- Indexing 4000+ images takes about 15 minutes and more than 8 GB of RAM +- Once https://github.com/debanjum/khoj/issues/36 is implemented, it should only take this long on first run + +### Miscellaneous + +- Testing done on a Mac M1 and a \>100K line corpus of notes +- Search and indexing on a GPU have not been tested yet + +## Acknowledgments + +- [Multi-QA MiniLM Model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1), [All MiniLM Model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for Text Search. See [SBert Documentation](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) +- [OpenAI CLIP Model](https://github.com/openai/CLIP) for Image Search. 
See [SBert Documentation](https://www.sbert.net/examples/applications/image-search/README.html) +- Charles Cave for [OrgNode Parser](http://members.optusnet.com.au/~charles57/GTD/orgnode.html) +- [Org.js](https://mooz.github.io/org-js/) to render Org-mode results on the Web interface +- [Markdown-it](https://github.com/markdown-it/markdown-it) to render Markdown results on the Web interface +- Sven Marnach for [PyExifTool](https://github.com/smarnach/pyexiftool/blob/master/exiftool.py) \ No newline at end of file
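
The "Khoj via API" entry in the new Readme only links to a pre-encoded query URL. As a minimal sketch of how such a URL could be built from Python, the snippet below percent-encodes a query for the `/search?q=` endpoint shown in the Readme. The helper names `build_search_url` and `search` are hypothetical (they are not part of Khoj), and the assumption that the endpoint returns JSON comes from it being a FastAPI service; the Readme itself does not describe the response format.

```python
import json
import urllib.parse
import urllib.request

# Default address used throughout the Readme.
KHOJ_URL = "http://localhost:8000"


def build_search_url(query: str, base_url: str = KHOJ_URL) -> str:
    """Return the /search URL for a query (hypothetical helper, not part of Khoj).

    quote_via=urllib.parse.quote encodes spaces as %20, matching the
    example link in the Readme (the default, quote_plus, would emit '+').
    """
    params = urllib.parse.urlencode({"q": query}, quote_via=urllib.parse.quote)
    return f"{base_url}/search?{params}"


def search(query: str, base_url: str = KHOJ_URL):
    """Send the query to a running Khoj server.

    Assumption: the response body is JSON; check the FastAPI docs page
    at /docs on your own instance before relying on its shape.
    """
    with urllib.request.urlopen(build_search_url(query, base_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the URL, so it runs without a server.
    print(build_search_url('"what is the meaning of life"'))
```

Run as a script, this prints the same link the Readme uses for its "Query" example. Calling `search(...)` additionally requires the server from the Run step to be up.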