Mirror of khoj from GitHub
Debanjum Singh Solanky b673d26a12 Extract Entries in a standardized format across text search types
- Issue
   - Had different schema of extracted entries for symmetric_ledger vs asymmetric

   - Entry extraction for asymmetric was dirty, relying on cryptic
     indices to store raw entry vs cleaned entry meant to be passed to embeddings

   - This was pushing the load of figuring out what property to extract
     from each entry to downstream processes like the filters

   - This limited the filters to only work for asymmetric search, not for
     symmetric_ledger

- Fix
   - Use consistent format for extracted entries
     {
       'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
       'raw'  : raw_entry_string_meant_to_be_passed_to_use
     }

- Result
   - Now filters can be applied across search types, and the specific
     field they should be applied on can be configured by each search
     type
2022-07-19 20:52:25 +04:00
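
The entry format described in the commit message above can be illustrated with a minimal sketch. This is not the code from the commit: the function names `extract_entries`, `exclude_filter` and `clean_org_heading`, and the sample entries, are hypothetical. Only the `'embed'` / `'raw'` keys and the idea of a filter applied on a configurable field come from the commit message.

```python
from typing import Callable, Dict, List

Entry = Dict[str, str]


def extract_entries(raw_entries: List[str], clean: Callable[[str], str]) -> List[Entry]:
    """Wrap every raw entry in the same {'embed', 'raw'} shape,
    whatever search type it came from."""
    return [{"embed": clean(raw), "raw": raw} for raw in raw_entries]


def exclude_filter(entries: List[Entry], term: str, field: str) -> List[Entry]:
    """Drop entries whose chosen field contains the term (case-insensitive).
    The field is passed in, so each search type can pick 'embed' or 'raw'."""
    return [entry for entry in entries if term.lower() not in entry[field].lower()]


def clean_org_heading(raw: str) -> str:
    # Hypothetical cleaner: strip leading org-mode stars and spaces.
    return raw.lstrip("* ").strip()


# Hypothetical sample data for two search types.
notes = extract_entries(["* TODO Buy milk", "** Call the bank"], clean_org_heading)
ledger = extract_entries(["2022-07-19 Coffee  $4.00"], str.strip)

# The same filter works on both, with the field chosen per search type.
notes_without_todo = exclude_filter(notes, "todo", field="embed")
ledger_without_coffee = exclude_filter(ledger, "coffee", field="raw")
```

Because every entry has the same two keys, the filter does not need to know which search type produced it.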
.github/workflows Run build on PR 2022-07-04 18:09:47 -04:00
config Give the project a short, less generic name. Rename it to Khoj 2022-07-19 18:26:16 +04:00
src Extract Entries in a standardized format across text search types 2022-07-19 20:52:25 +04:00
tests Extract Entries in a standardized format across text search types 2022-07-19 20:52:25 +04:00
views Fix input text behavior for null/empty value fields 2021-12-04 10:45:48 -05:00
.dockerignore Make Docker ignore unnecessary files 2022-06-29 22:29:34 +04:00
.gitignore Create Basic Landing Page to Query Semantic Search and Render Results 2022-07-16 03:36:19 +04:00
demo.mp4 Add demo of semantic search to repository 2022-05-14 04:29:25 -04:00
docker-compose.yml Correct syntax of memory limit in docker-compose.yml 2022-07-06 20:07:11 -04:00
Dockerfile Give the project a short, less generic name. Rename it to Khoj 2022-07-19 18:26:16 +04:00
LICENSE Add Readme, License. Update .gitignore 2021-08-15 22:52:37 -07:00
README.org Give the project a short, less generic name. Rename it to Khoj 2022-07-19 18:26:16 +04:00


Khoj

Allow natural language search on user content like notes, images, and transactions using transformer ML models.

Users can interface with Khoj via the API or Emacs. All search is done locally*

Setup

1. Clone

  git clone https://github.com/debanjum/khoj && cd khoj

2. Configure

3. Run

docker-compose up -d

Note: The first run will take time. Let it run; it is most likely not hung, just generating embeddings.

Use

Run Unit tests

pytest

Upgrade

  docker-compose build --pull

Troubleshooting

  • Symptom: Errors out with "Killed" in error message

  • Symptom: Errors out complaining about Tensors mismatch, null, etc.

    • Mitigation: Delete content-type > image section from docker_sample_config.yml

Miscellaneous

  • The experimental chat API endpoint uses the OpenAI API

    • It is disabled by default
    • To use it, add your openai-api-key to config.yml

Development Setup

Setup on Local Machine

1. Install Dependencies
  1. Install Python3 [Required]
  2. Install Conda [Required]
  3. Install Exiftool [Optional]

    sudo apt-get -y install libimage-exiftool-perl
2. Install Khoj

    git clone https://github.com/debanjum/khoj && cd khoj
    conda env create -f config/environment.yml
    conda activate khoj
3. Configure
  • Configure files/directories to search in content-type section of sample_config.yml
  • To run application on test data, update file paths containing /data/ to tests/data/ in sample_config.yml

    • Example: replace /data/notes/*.org with tests/data/notes/*.org
4. Run

Load the ML model, generate embeddings, and expose the API to query the notes, images, transactions, etc. specified in the config YAML

python3 -m src.main -c=config/sample_config.yml -vv
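
The path change described in step 3 (pointing /data/ paths at tests/data/) can be done by hand, or scripted along the lines of the sketch below. This is a rough helper and not part of the project: the function name `use_test_data` and the `input-filter` key in the sample line are assumptions made for illustration. It rewrites only paths that begin with /data/, so running it twice does not produce tests/tests/data/.

```python
import re

# Match "/data/" only where it starts a path, that is, when it is not
# preceded by a word character, dot, slash or hyphen. This leaves paths
# that already read "tests/data/..." untouched.
_DATA_PREFIX = re.compile(r"(?<![\w./-])/data/")


def use_test_data(config_text: str) -> str:
    """Return config text with leading /data/ paths pointed at tests/data/."""
    return _DATA_PREFIX.sub("tests/data/", config_text)


# Hypothetical config line; the key name is an assumption.
sample_line = "input-filter: /data/notes/*.org"
rewritten_line = use_test_data(sample_line)
```

To apply it to the real file, read config/sample_config.yml as text, pass it through `use_test_data`, and write the result back or to a copy.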

Upgrade On Local Machine

  cd khoj
  git pull origin master
  conda deactivate
  conda env update -f config/environment.yml
  conda activate khoj

Acknowledgments