diff --git a/config/sample_config.yml b/config/sample_config.yml
index 7f5809c1..077c8564 100644
--- a/config/sample_config.yml
+++ b/config/sample_config.yml
@@ -8,6 +8,12 @@ content-type:
     compressed-jsonl: "/data/embeddings/notes.jsonl.gz"
     embeddings-file: "/data/embeddings/note_embeddings.pt"
 
+  markdown:
+    input-files: null
+    input-filter: "/data/markdown/*.md"
+    compressed-jsonl: "/data/embeddings/markdown.jsonl.gz"
+    embeddings-file: "/data/embeddings/markdown_embeddings.pt"
+
   ledger:
     input-files: null
     input-filter: /data/ledger/*.beancount
diff --git a/docker-compose.yml b/docker-compose.yml
index fbf2a6b8..3ea99981 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -24,6 +24,7 @@ services:
       - ./tests/data/images/:/data/images/
       - ./tests/data/ledger/:/data/ledger/
       - ./tests/data/music/:/data/music/
+      - ./tests/data/markdown/:/data/markdown/
       # Embeddings and models are populated after the first run
       # You can set these volumes to point to empty directories on host
       - ./tests/data/embeddings/:/data/embeddings/
diff --git a/tests/data/markdown/interface_emacs_readme.md b/tests/data/markdown/interface_emacs_readme.md
new file mode 100644
index 00000000..c61abf80
--- /dev/null
+++ b/tests/data/markdown/interface_emacs_readme.md
@@ -0,0 +1,69 @@
+# Emacs Khoj
+
+*An Emacs interface for [Khoj](https://github.com/debanjum/khoj)*
+
+## Requirements
+
+- Install and Run [Khoj](https://github.com/debanjum/khoj)
+
+## Installation
+
+- Direct Install
+    - Put `khoj.el` in your Emacs load path. For e.g \~/.emacs.d/lisp
+
+    - Load via `use-package` in your \~/.emacs.d/init.el or .emacs
+      file by adding below snippet
+
+      ``` elisp
+      ;; Khoj Package
+      (use-package khoj
+        :load-path "~/.emacs.d/lisp/khoj.el"
+        :bind ("C-c s" . 'khoj))
+      ```
+- With [straight.el](https://github.com/raxod502/straight.el)
+    - Add below snippet to your \~/.emacs.d/init.el or .emacs config
+      file and execute it.
+
+      ``` elisp
+      ;; Khoj Package for Semantic Search
+      (use-package khoj
+        :after org
+        :straight (khoj :type git :host github :repo "debanjum/khoj" :files (:defaults "src/interface/emacs/khoj.el"))
+        :bind ("C-c s" . 'khoj))
+      ```
+- With [Quelpa](https://github.com/quelpa/quelpa#installation)
+    - Ensure [Quelpa](https://github.com/quelpa/quelpa#installation),
+      [quelpa-use-package](https://github.com/quelpa/quelpa-use-package#installation)
+      are installed
+
+    - Add below snippet to your \~/.emacs.d/init.el or .emacs config
+      file and execute it.
+
+      ``` elisp
+      ;; Khoj Package
+      (use-package khoj
+        :after org
+        :quelpa (khoj :fetcher url :url "https://raw.githubusercontent.com/debanjum/khoj/master/interface/emacs/khoj.el")
+        :bind ("C-c s" . 'khoj))
+      ```
+
+## Usage
+
+1. Open Query Interface on Client
+
+    - In Emacs: Call `khoj` using keybinding `C-c s` or `M-x khoj`
+    - On Web: Open
+
+2. Query in Natural Language
+
+    e.g \"What is the meaning of life?\" \"What are my life goals?\"
+
+    **Note: It takes about 4s on a Mac M1 and a \>100K line corpus of
+    notes**
+
+3. (Optional) Narrow down results further
+
+    Include/Exclude specific words or date range from results by
+    updating query with below query format
+
+    e.g \`What is the meaning of life? -god +none dt:\"last week\"\`
diff --git a/tests/data/markdown/main_readme.md b/tests/data/markdown/main_readme.md
new file mode 100644
index 00000000..682515aa
--- /dev/null
+++ b/tests/data/markdown/main_readme.md
@@ -0,0 +1,153 @@
+![](https://github.com/debanjum/khoj/actions/workflows/test.yml/badge.svg)
+![](https://github.com/debanjum/khoj/actions/workflows/build.yml/badge.svg)
+
+# Khoj
+
+*Allow natural language search on user content like notes, images,
+transactions using transformer ML models*
+
+User can interface with Khoj via [Web](./src/interface/web/index.html),
+[Emacs](./src/interface/emacs/khoj.el) or the API. All search is done
+locally[\*](https://github.com/debanjum/khoj#miscellaneous)
+
+## Demo
+
+
+
+## Setup
+
+### 1. Clone
+
+``` shell
+git clone https://github.com/debanjum/khoj && cd khoj
+```
+
+### 2. Configure
+
+- \[Required\] Update [docker-compose.yml](./docker-compose.yml) to
+  mount your images, (org-mode or markdown) notes and beancount
+  directories
+- \[Optional\] Edit application configuration in
+  [sample~config~.yml](./config/sample_config.yml)
+
+### 3. Run
+
+``` shell
+docker-compose up -d
+```
+
+*Note: The first run will take time. Let it run, it\'s mostly not hung,
+just generating embeddings*
+
+## Use
+
+- **Khoj via API**
+    - See [Khoj API Docs](http://localhost:8000/docs)
+    - [Query](http://localhost:8000/search?q=%22what%20is%20the%20meaning%20of%20life%22)
+    - [Regenerate
+      Embeddings](http://localhost:8000/regenerate?t=ledger)
+    - [Configure Application](https://localhost:8000/ui)
+- **Khoj via Emacs**
+    - [Install](https://github.com/debanjum/khoj/tree/master/src/interface/emacs#installation)
+      [khoj.el](./src/interface/emacs/khoj.el)
+    - Run `M-x khoj `
+
+## Run Unit tests
+
+``` shell
+pytest
+```
+
+## Upgrade
+
+``` shell
+docker-compose build --pull
+```
+
+## Troubleshooting
+
+- Symptom: Errors out with \"Killed\" in error message
+    - Fix: Increase RAM available to Docker Containers in Docker
+      Settings
+    - Refer: [StackOverflow
+      Solution](https://stackoverflow.com/a/50770267), [Configure
+      Resources on Docker for
+      Mac](https://docs.docker.com/desktop/mac/#resources)
+- Symptom: Errors out complaining about Tensors mismatch, null etc
+    - Mitigation: Delete content-type \> image section from
+      docker~sampleconfig~.yml
+
+## Miscellaneous
+
+- The experimental [chat](localhost:8000/chat) API endpoint uses the
+  [OpenAI API](https://openai.com/api/)
+    - It is disabled by default
+    - To use it add your `openai-api-key` to config.yml
+
+## Development Setup
+
+### Setup on Local Machine
+
+1. 1\. Install Dependencies
+
+    1. Install Python3 \[Required\]
+
+    2. [Install
+       Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
+       \[Required\]
+
+    3. Install Exiftool \[Optional\]
+
+       ``` shell
+       sudo apt-get -y install libimage-exiftool-perl
+       ```
+
+2. 2\. Install Khoj
+
+    ``` shell
+    git clone https://github.com/debanjum/khoj && cd khoj
+    conda env create -f config/environment.yml
+    conda activate khoj
+    ```
+
+3. 3\. Configure
+
+    - Configure files/directories to search in `content-type` section
+      of `sample_config.yml`
+    - To run application on test data, update file paths containing
+      `/data/` to `tests/data/` in `sample_config.yml`
+        - Example replace `/data/notes/*.org` with
+          `tests/data/notes/*.org`
+
+4. 4\. Run
+
+    Load ML model, generate embeddings and expose API to query notes,
+    images, transactions etc specified in config YAML
+
+    ``` shell
+    python3 -m src.main -c=config/sample_config.yml -vv
+    ```
+
+### Upgrade On Local Machine
+
+``` shell
+cd khoj
+git pull origin master
+conda deactivate khoj
+conda env update -f config/environment.yml
+conda activate khoj
+```
+
+## Acknowledgments
+
+- [Multi-QA MiniLM
+  Model](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1)
+  for Asymmetric Text Search. See [SBert
+  Documentation](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
+- [OpenAI CLIP Model](https://github.com/openai/CLIP) for Image
+  Search. See [SBert
+  Documentation](https://www.sbert.net/examples/applications/image-search/README.html)
+- Charles Cave for [OrgNode
+  Parser](http://members.optusnet.com.au/~charles57/GTD/orgnode.html)
+- Sven Marnach for
+  [PyExifTool](https://github.com/smarnach/pyexiftool/blob/master/exiftool.py)
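
Note on the `markdown` section added to `config/sample_config.yml`: it pairs an `input-filter` glob with a `compressed-jsonl` output path. As a rough illustration of what such a pair implies (this is not taken from the Khoj source, and may differ from how Khoj actually processes Markdown), the sketch below globs Markdown files, splits each on headings, and writes one JSON object per entry to a gzip-compressed JSONL file. The function names `split_markdown_entries` and `markdown_to_jsonl`, and the `file`/`entry` keys, are hypothetical.

```python
import glob
import gzip
import json
import re
from pathlib import Path

# Assumption for this sketch: a line starting with 1-6 '#' characters followed
# by whitespace begins a new entry. Fenced code blocks are not special-cased,
# so a '# comment' at column 0 inside a fence would also start an entry.
HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)


def split_markdown_entries(text):
    """Split Markdown text into entries, one per heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    entries = []
    # Keep any text before the first heading as its own entry.
    preamble = text[:starts[0]].strip()
    if preamble:
        entries.append(preamble)
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        entries.append(text[begin:end].strip())
    return entries


def markdown_to_jsonl(input_filter, compressed_jsonl):
    """Write entries from files matching input_filter to a gzipped JSONL file.

    Returns the number of entries written.
    """
    count = 0
    with gzip.open(compressed_jsonl, "wt", encoding="utf-8") as out:
        for path in sorted(glob.glob(input_filter)):
            text = Path(path).read_text(encoding="utf-8")
            for entry in split_markdown_entries(text):
                out.write(json.dumps({"file": path, "entry": entry}) + "\n")
                count += 1
    return count
```

With the paths from the new config section, a call would look like `markdown_to_jsonl("/data/markdown/*.md", "/data/embeddings/markdown.jsonl.gz")`. The `embeddings-file` path is left out of the sketch because producing embeddings depends on the model code, which this diff does not show.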