Commit graph

207 commits

Author SHA1 Message Date
sabaimran
7b907add77
Add support for indexing plaintext files (#420)
* Add support for indexing plaintext files
- Adds backend support for parsing plaintext files generically (.html, .txt, .xml, .csv, .md)
- Add equivalent frontend views for setting up plaintext file indexing
- Update config, rawconfig, default config, search API, setup endpoints
* Add a nifty plaintext file icon to configure plaintext files in the Web UI
* Use generic glob path for plaintext files. Skip indexing files that aren't in whitelist
2023-08-09 15:44:40 -07:00
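
The whitelist check described in this commit can be pictured roughly as follows. This is a minimal sketch, not the code from the commit: the names `PLAINTEXT_SUFFIXES` and `select_plaintext_files` are hypothetical, and the suffix set is simply copied from the extensions listed in the commit message.

```python
from pathlib import Path

# Hypothetical whitelist, copied from the extensions named in the commit message.
PLAINTEXT_SUFFIXES = {".html", ".txt", ".xml", ".csv", ".md"}


def select_plaintext_files(paths):
    """Split paths into (indexable, skipped) based on their file suffix."""
    indexable, skipped = [], []
    for path in map(Path, paths):
        # Compare case-insensitively so "index.HTML" is treated like "index.html".
        if path.suffix.lower() in PLAINTEXT_SUFFIXES:
            indexable.append(path)
        else:
            skipped.append(path)
    return indexable, skipped


candidates = ["notes/todo.txt", "site/index.HTML", "data/photo.png", "README.md"]
indexable, skipped = select_plaintext_files(candidates)
```

In practice the candidate paths would likely come from expanding a generic glob pattern (for example with `glob.glob(pattern, recursive=True)`) before being filtered this way.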
Debanjum Singh Solanky
95acb1583d Update local Chat Actor and Director tests expected to fail 2023-08-01 20:52:00 -07:00
Debanjum Singh Solanky
c2b7a14ed5 Fix context, response size for Llama 2 to stay within max token limits
Create regression test to ensure it does not throw the prompt size
exceeded context window error
2023-08-01 20:52:00 -07:00
sabaimran
d8fa967b43 Update chat actor unit tests for greater accuracy and benchmarking 2023-08-01 12:24:43 -07:00
sabaimran
48363ec861 Add additional check for chat_messages length in UT 2023-08-01 09:25:52 -07:00
sabaimran
90efc2ea7a Update comments and add explanations 2023-08-01 09:24:03 -07:00
sabaimran
e55e9a7b67 Fix unit tests and truncation logic 2023-07-31 21:37:59 -07:00
sabaimran
95c7b07c20 Make the fake message longer 2023-07-31 20:55:19 -07:00
sabaimran
8dd5756ce9 Add new director tests for the offline chat model with llama v2 2023-07-31 20:24:52 -07:00
sabaimran
5ccb01343e
Add Offline chat to Obsidian (#359)
* Add support for configuring/using offline chat from within Obsidian
* Fix type checking for search type
* If Github is not configured, /update call should fail
* Fix regenerate tests same as the update ones
* Update help text for offline chat in obsidian
* Update relevant description for Khoj settings in Obsidian
* Simplify configuration logic and use smarter defaults
2023-07-28 18:47:56 -07:00
Debanjum Singh Solanky
9b1048caf7 Remove asymmetric from name of remaining text search tests
Asymmetric search is the only search type used now in khoj.el. So
making a distinction between symmetric and asymmetric search
isn't necessary anymore
2023-07-28 15:33:22 -07:00
sabaimran
124d97c26d
Replace Falcon 🦅 model with Llama V2 🦙 for offline chat (#352)
* Working example with LlamaV2 running locally on my machine

- Download from huggingface
- Plug in to GPT4All
- Update prompts to fit the llama format

* Add appropriate prompts, in the llama format, for extracting questions from a query

* Rename Falcon to Llama and make some improvements to the extract_questions flow

* Do further tuning to extract question prompts and unit tests

* Disable extracting questions dynamically from Llama, as results are still unreliable
2023-07-27 20:51:20 -07:00
Debanjum Singh Solanky
da3f4dc7e4 Fix test config to run OpenAI Chat Actor, Director tests
OpenAI conversation processor schema had updated but conftest hadn't
been updated to reflect the same.

Update conftest setup of conversation processor to fix this
2023-07-27 11:30:04 -07:00
sabaimran
8b2af0b5ef
Add support for our first Local LLM 🤖🏠 (#330)
* Add support for gpt4all's falcon model as an additional conversation processor
- Update the UI pages to allow the user to point to the new endpoints for GPT
- Update the internal schemas to support both GPT4 models and OpenAI
- Add unit tests benchmarking some of the Falcon performance
* Add exc_info to include stack trace in error logs for text processors
* Pull shared functions into utils.py to be used across gpt4 and gpt
* Add migration for new processor conversation schema
* Skip GPT4All actor tests due to typing issues
* Fix Obsidian processor configuration in auto-configure flow
* Rename enable_local_llm to enable_offline_chat
2023-07-26 16:27:08 -07:00
Debanjum Singh Solanky
5bb42e56a8 Fix formatting of khoj test config and unused references in conftests 2023-07-22 00:29:26 -07:00
Debanjum Singh Solanky
d078e7b1f6 Clean up search type usage in khoj server, tests and Readme 2023-07-18 19:57:55 -07:00
Debanjum Singh Solanky
ef6a0044f4 Drop embeddings of deleted text entries from index
Previously the deleted embeddings would continue to be in the index,
even after the entry was deleted
2023-07-16 03:47:05 -07:00
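
One way to picture the fix is an incremental update that rebuilds the index from the entries that still exist. The sketch below is an illustration under stated assumptions, not the repository's implementation: `update_index` and `embed` are hypothetical names, entries are assumed to be unique strings, and embeddings are plain lists kept in parallel with the entries.

```python
def update_index(entries, embeddings, current_entries, embed):
    """Incrementally update parallel lists of entries and embeddings.

    Entries missing from current_entries are dropped together with their
    embeddings. Surviving entries keep their original order and reuse their
    stored embeddings. New entries are appended and embedded with `embed`.
    """
    existing = dict(zip(entries, embeddings))
    current = set(current_entries)

    # Surviving entries, in the order they already had in the index.
    kept = [entry for entry in entries if entry in current]
    # New entries, de-duplicated while preserving first-seen order.
    new = list(dict.fromkeys(e for e in current_entries if e not in existing))

    updated_entries = kept + new
    updated_embeddings = [existing[e] for e in kept] + [embed(e) for e in new]
    return updated_entries, updated_embeddings


# Toy embedding function, used only for this illustration.
def toy_embed(text):
    return [float(ord(text[0]))]


entries, embeddings = update_index(
    ["a", "b", "c"], [[1.0], [2.0], [3.0]], ["c", "a", "d"], toy_embed
)
```

Here the deleted entry `"b"` and its embedding both disappear, while `"a"` and `"c"` keep their stored embeddings and relative order.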
Debanjum Singh Solanky
c73feebf25 Test index embeddings are stable on incremental update & no norm
Ensure order of new embedding insertion on incremental update
does not affect the order and value of existing embeddings when
normalization is turned off
2023-07-16 02:22:28 -07:00
Debanjum Singh Solanky
1482fd4d4d Test index is stable sorted on incremental update with new entry
Ensure order of new embedding, entry insertion on incremental update
is stable
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
b02323ade6 Improve name of text search test functions
Asymmetric was the older name used to differentiate between symmetric
and asymmetric search.

Now that text search just uses asymmetric search, stick to the simpler name
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
7669b85da6 Test index is stable sorted on regenerate with new entry 2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
6e70b914c2 Remove unused dump_jsonl method
The entries index is stored in gzipped jsonl files for each content type
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
88d1a29a84 Test index is stable for duplicate entries across regenerate, update
- Current incorrect behavior:
  All entries with duplicate compiled form are kept on regenerate
  but on update only the last of the duplicated entries is kept

This divergent behavior is not ideal to prevent index corruption
across reconfigure and update
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
da98b92dd4 Create helper function to test value, order of entries & embeddings
This helper should be used to observe if the current embeddings are
stable sorted on regenerate and incremental update of index in text
search tests
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
b9fb656657 Update Tests to setup both content_index, search_models before testing
This is required by the updated structure of Khoj setup

- Add content_config pytest fixture, pass bi_encoder from
  search_models.[text|image]_search
2023-07-14 01:29:48 -07:00
Debanjum Singh Solanky
f664a74e77 Update Khoj server to run on non standard port, 42110 instead of 8000
Resolves #295
2023-07-10 21:27:58 -07:00
sabaimran
4c135ea316
Make streaming optional for the /chat endpoint (#287)
* Update the /chat endpoint to conditionally support streaming

- If streams are enabled, return the threadgenerator as it does currently
- If stream is disabled, return a JSON response with the response/compiled references separated out
- Correspondingly, update the chat.html UI to use the streamed API, as well as Obsidian
- Rename chat/init/ to chat/history

* Update khoj.el to use the /history endpoint

- Update corresponding unit tests to use stream=true

* Remove & from call to /chat for obsidian

* Abstract functions out into a helpers.py file and clean up some of the error-catching
2023-07-09 10:12:09 -07:00
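
The conditional behaviour described in this commit can be sketched without any web framework. Everything named here is an assumption made for illustration: `chat_response`, the `stream` flag as a plain function argument, and the JSON field names `response` and `context` are not taken from the Khoj code.

```python
import json


def chat_response(answer_chunks, references, stream=False):
    """Return either a chunk generator or a single JSON document.

    With stream=True the caller receives the answer piece by piece.
    Otherwise the full answer and its references are returned separately
    inside one JSON string.
    """
    if stream:
        return (chunk for chunk in answer_chunks)
    return json.dumps({"response": "".join(answer_chunks), "context": references})


streamed = list(chat_response(["Hel", "lo"], ["notes.md"], stream=True))
complete = json.loads(chat_response(["Hel", "lo"], ["notes.md"]))
```

Keeping both modes behind one entry point lets clients that cannot consume a stream still receive the answer and its references in a structured form.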
Debanjum Singh Solanky
171ce19e1f Update date filter to allow quoting values in single quotes 2023-07-07 17:13:47 -07:00
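
A pattern that accepts either quote style could look like the sketch below. It assumes the filter is written as `dt`, a comparison operator, and a quoted value; that syntax and the names `DATE_FILTER` and `extract_date_filters` are assumptions for illustration, not the regex from the commit.

```python
import re

# Hypothetical pattern: "dt", an operator, then a value wrapped in matching
# single or double quotes. The backreference (?P=quote) forces the closing
# quote to be the same character as the opening one.
DATE_FILTER = re.compile(
    r"""dt(?P<op>[:><=]{1,2})(?P<quote>["'])(?P<value>.*?)(?P=quote)"""
)


def extract_date_filters(query):
    """Return (operator, value) pairs for every date filter in the query."""
    return [(m.group("op"), m.group("value")) for m in DATE_FILTER.finditer(query)]


filters = extract_date_filters("notes dt>='2023-07-01' dt<\"2023-08-01\"")
```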
Debanjum Singh Solanky
11f0a9f196 Fix chat tests since streaming. Pass args correctly to chat methods
- Fix testing gpt converse method after it started streaming responses
- Pass stop in model_kwargs dictionary and api key in openai_api_key
  parameter to chat completion methods. This should resolve the arg
  warning thrown by OpenAI module
2023-07-07 15:23:44 -07:00
Debanjum Singh Solanky
48870d9170 Fix parsing questions generated by extract_questions actor into list
The previous json parsing was failing to handle questions with date
filters

Fix the chat actor tests to run without throwing error with freezegun
complaining about importing transformers.local_llama model

Remove quote escapes from date filter examples provided to
extract_questions actor
2023-07-07 15:18:55 -07:00
Debanjum Singh Solanky
0f993b332e Drop support for Ledger as a separate content type
Khoj will soon get a generic text indexing content type. This along
with a file filter should suffice for searching through Ledger
transactions, if required.

Having a specific content type for niche use-case like ledger isn't
useful. Removing unused content types will reduce khoj code to manage.
2023-07-02 16:57:49 -07:00
Debanjum Singh Solanky
c9db5321e7 Remove unused org-music as an indexable content type from Khoj
Org-music was just a custom content type that worked with org-music.
It was mostly only useful for me.

Cleaning up that code will reduce number of content types for khoj to
manage.
2023-07-02 16:21:21 -07:00
sabaimran
36537606da Update unit test and preserve prior operational ordering in main.py 2023-07-01 20:02:35 -07:00
sabaimran
f0f6390366 Make --no-gui the default behavior of Khoj and update corresponding documentation 2023-07-01 19:07:59 -07:00
sabaimran
6edc32f2f4 Accept current changes to include issues in rendering flow 2023-06-29 12:25:29 -07:00
sabaimran
e6053951f0 In chat conftest fixtures, use *.markdown rather than *.md 2023-06-29 11:53:47 -07:00
sabaimran
601b738135 Bonus: Rename all md files to markdown for cleanliness 2023-06-29 11:27:47 -07:00
Debanjum Singh Solanky
56ce97ef9e Use async/await in tests for query method of text and image search
The text, image search query method has become async. So async/await
is required to get results correctly in tests etc
2023-06-28 22:07:02 -07:00
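
The consequence for tests is that a coroutine has to be awaited, or driven by an event loop, before its results exist. A minimal sketch, where `query` is a hypothetical stand-in for an async search method and the test uses only the standard library:

```python
import asyncio


async def query(raw_query):
    """Hypothetical stand-in for an async search query method."""
    await asyncio.sleep(0)  # yield control once, as real I/O would
    return [f"result for {raw_query}"]


def test_query_returns_results():
    # Calling query() alone only creates a coroutine object.
    # asyncio.run drives it to completion and returns its result.
    results = asyncio.run(query("khoj"))
    assert results == ["result for khoj"]
```

Test suites that use an async-aware plugin could instead declare the test itself as `async def` and `await` the call directly.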
Debanjum Singh Solanky
f516d127c8 Update client tests to expect "all" as a valid new content type 2023-06-28 22:07:02 -07:00
sabaimran
2697c7a186 Update org tests to use new method, update Github configuration in tests 2023-06-27 15:04:48 -07:00
Debanjum Singh Solanky
69d4fa6525 Rename project links across repo from debanjum/khoj to khoj-ai/khoj 2023-06-21 00:13:21 -07:00
Debanjum Singh Solanky
595cc5b0f5 Use printer icon for PDF logs. Only split lines if file at web link in web interface 2023-06-18 02:26:03 -07:00
Saba
07ade2262a Set default value of pat_token in conftest.py to be empty string 2023-06-13 17:03:03 -07:00
Saba
751edfefe5 Add separate unit test for github. Will only run if a PAT token is set 2023-06-13 16:55:58 -07:00
Saba
3a61919344 Fix failing unit tests by hard-coding model presence of expected search types 2023-06-13 16:32:47 -07:00
Saba
019d3732de Rename orgmode_search to org_search 2023-06-13 16:06:54 -07:00
Saba
5d5ebcbf7c Rename truncate messages method and update unit tests to simplify assertion logic 2023-06-06 23:25:43 -07:00
Saba
7119ed0849 Run pre-commit script 2023-06-05 19:29:23 -07:00
Saba
948ba6ddca Remove unused logger 2023-06-05 19:01:03 -07:00
Saba
f65ff9815d Move message truncation logic into a separate function. Add unit tests with factory boy. 2023-06-05 18:58:29 -07:00
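
A standalone truncation function of the kind this commit describes might look like the following. It is a sketch under simplifying assumptions: `truncate_messages` is a hypothetical name, messages are plain strings, and tokens are approximated by whitespace-separated words rather than a real model tokenizer.

```python
def truncate_messages(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined size fits in max_tokens.

    Messages are ordered oldest first. The default count_tokens is a crude
    word count used only for illustration.
    """
    kept, used = [], 0
    # Walk from newest to oldest so recent context is preferred.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    # Restore chronological order before returning.
    return list(reversed(kept))


history = ["one two three", "four five", "six"]
trimmed = truncate_messages(history, max_tokens=3)
```

Isolating the logic in one pure function like this makes it straightforward to unit test with generated messages of known length.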