Commit graph

291 commits

Author SHA1 Message Date
sabaimran
60c23d9e3a Add online search chat director tests 2023-11-21 23:08:36 -08:00
sabaimran
c652a7fd2d Move text_to_entries under the new content folder 2023-11-21 22:25:17 -08:00
sabaimran
1e2af083f0 Rename the data_sources module to content 2023-11-21 22:11:32 -08:00
sabaimran
2bb989e9d8 Resolve merge conflicts and fix some import ordering 2023-11-21 12:30:43 -08:00
sabaimran
a474c31e02 Move the django app into the src/khoj folder for better organization and functionality
- Our pypi package currently does not work because the django app and associated database is not included. To remedy this issue, move the app into the src/khoj folder. This has the added benefit of improved organization of the codebase, as all server related code is now in a single folder
- Update associated file paths and system references
2023-11-21 10:56:04 -08:00
sabaimran
b8e6883a81 Merge branch 'master' of github.com:khoj-ai/khoj into features/internet-enabled-search 2023-11-19 16:20:08 -08:00
sabaimran
4def8cce36
Merge pull request #541 from asim-shrestha/patch-1
Add test separators
2023-11-19 14:14:34 -08:00
Debanjum
71799add0b
Index Parent Headings of Org-Mode Entries to Improve Search Context (#548)
### Overview
The parent hierarchy of org-mode entries can store important context. 
This change updates OrgNode to track parent headings for each org entry and adds the parent outline for each entry to the index

### Details
- Test search uses ancestor headings as context for improved results
- Add ancestor headings of each org-mode entry to their compiled form
- Track ancestor headings for each org-mode entry in org-node parser

Resolves #85
2023-11-19 13:18:19 -08:00
sabaimran
e398a76779 Fix test word filter 2023-11-19 13:14:58 -08:00
sabaimran
33a9304428 Resolve merge conflicts 2023-11-19 12:57:55 -08:00
sabaimran
ef5e9d66c1 Resolve merge conflicts in dependency imports 2023-11-19 11:42:20 -08:00
sabaimran
f688529150 Update the default configuration for the AppConfig 2023-11-17 19:26:31 -08:00
Debanjum Singh Solanky
ca87b4ede9 Wrap common API query parameters into shared class to deduplicate code
- Upgrade FastAPI to >= latest version. Required upgrade of FastAPI.
  Earlier version didn't support wrapping common query params in class

- Use per fixture app instead of a global FastAPI app in conftest

- Upgrade minimum required Django version

- Fix no notes chat director test with updated no notes message
  No notes message was updated in commit 118f1143
2023-11-17 18:43:49 -08:00
Debanjum Singh Solanky
33ad9b8e64 Update text search test since indexing ancestor hierarchy added 2023-11-17 15:26:55 -08:00
Debanjum Singh Solanky
55785d50c3 Use title, when present, as root ancestor of entries instead of file path 2023-11-17 15:03:27 -08:00
sabaimran
ec06d2c446 Move data indexer files into a separate folder under processor. Update assoc UTs 2023-11-16 17:19:55 -08:00
Debanjum Singh Solanky
ddb07def0d Test search uses ancestor headings as context for improved results
- Update test data to add deeper outline hierarchy for testing
  hierarchy as context
- Update collateral tests that need count of entries updated, deleted
  asserts to be updated
2023-11-16 03:05:19 -08:00
Debanjum Singh Solanky
74403e3536 Add ancestor headings of each org-mode entry to their compiled form
Resolves #85
2023-11-16 02:54:41 -08:00
Debanjum Singh Solanky
305c25ae1a Track ancestor headings for each org-mode entry in org-node parser 2023-11-16 02:39:14 -08:00
Debanjum Singh Solanky
922983bd53 Set max cos distance to 0.18. Test search API query with max distance 2023-11-15 20:26:21 -08:00
Debanjum Singh Solanky
08a057bdd5 Rename SearchModel to SearchModelConfig DB model, Require Cross-Encoder 2023-11-15 17:31:50 -08:00
Debanjum Singh Solanky
5a6ab9cc85 Fix failing client tests 2023-11-15 00:17:44 -08:00
Debanjum Singh Solanky
8f200cf53f Remove unused parameter from configure_search_type method 2023-11-14 19:09:35 -08:00
Debanjum Singh Solanky
4af194d74b Make search model configurable on server
- Expose ability to modify search model via Django admin interface
- Previously the bi_encoder and cross_encoder models to use were set
  in code
- Now it's user configurable but with a default config generated by
  default
2023-11-14 19:09:35 -08:00
Debanjum Singh Solanky
e98141f4c3 Subscribe default user to standard plan with a far away renewal date
Self hosted users in anonymous mode have all capabilities unlocked
2023-11-14 16:31:39 -08:00
Asim Shrestha
0bfc094e18
Add test separators 2023-11-11 17:08:58 -08:00
Debanjum Singh Solanky
c4364b9100 Weaken asking follow-up qs and q&a mode in notes prompt to OpenAI models
- Notes prompt doesn't need to be so tuned to question answering. User
could just want to talk about life. The notes need to be used to
response to those, not necessarily only retrieve answers from notes

- System and notes prompts were forcing asking follow-up questions a
  little too much. Reduce strength of follow-up question asking
2023-11-10 23:36:43 -08:00
sabaimran
e2e96f9aa4 Add default settings to let new users be subscribed on trial
- Add the default user to a subscription trial
- Update associated unit tests
2023-11-10 22:38:28 -08:00
Debanjum Singh Solanky
c9c0ba67c6 Fix chat_client configurations for OpenAI chat director tests 2023-11-10 17:29:23 -08:00
sabaimran
262a8574d1 Add a test to verify that a user without data sucessfully returns a respones to the /search endpoint 2023-11-10 14:00:58 -08:00
Debanjum Singh Solanky
404d47f1a1 Bubble up content indexing errors to notify user on client apps 2023-11-07 05:28:13 -08:00
Debanjum Singh Solanky
9ab327a2b6 Store the data source of each entry in database
This will be useful for updating, deleting entries by their data
source. Data source can be one of Computer, Github or Notion for now

Store each file/entries source in database
2023-11-07 02:18:48 -08:00
Debanjum Singh Solanky
a08b152358 Improve log messages in text_entries and memory leak unit test 2023-11-06 19:27:31 -08:00
Debanjum
38f24a037d
Improve Indexing Text Entries (#535)
Major
- Ensure search results logic consistent across migration to DB, multi-user
- Manually verified search results for sample queries look the same across migration
 - Flatten indexing code for better indexing progress tracking and code readability

Minor
- a4f407f Test memory leak on MPS device when generating vector embeddings
- ef24485 Improve Khoj with DB setup instructions in the Django app readme (for now)
- f212cc7 Arrange remaining text search tests in arrange, act, assert order
- 022017d Fix text search tests to test updated indexing log messages
2023-11-06 16:01:53 -08:00
sabaimran
3d6e8d53fe Try adding dependencies for libgl in order to run OCR in github action unit tests 2023-11-05 15:09:40 -08:00
sabaimran
fdd727712f Rename test files from x_to_jsonl to x_to_entries 2023-11-05 14:33:07 -08:00
Debanjum Singh Solanky
a4f407f595 Test memory leak on MPS device when generating vector embeddings
Slope threshold of 2.0 determined qualitatively on local Mac device
Minor unused import and clean-up
2023-11-05 03:48:54 -08:00
Debanjum Singh Solanky
f212cc7174 Arrange remaining text search tests in arrange, act, assert order 2023-11-05 02:04:52 -08:00
Debanjum Singh Solanky
022017dd0f Fix text search tests to test updated indexing log messages 2023-11-05 02:04:52 -08:00
sabaimran
d1d210605e Merge branch 'features/multi-user-support-khoj' of github.com:khoj-ai/khoj into features/multi-user-support-khoj 2023-11-04 14:29:34 -07:00
sabaimran
3678aa5614 Add tests to validate expected behaviors in the multi-user scenario 2023-11-04 14:29:30 -07:00
Debanjum Singh Solanky
2f1756cc15 Do not use icon for each file, folder to index in desktop app.
Other minor fixes based on PR feedback
2023-11-04 00:13:10 -07:00
Debanjum Singh Solanky
345856e7be Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user-support-khoj
Merge changes to use latest GPT4All with GPU, GGUF model support into
khoj multi-user support rearchitecture branch
2023-11-02 22:44:25 -07:00
sabaimran
fe6720fa06
[Multi-User Part 8]: Make conversation processor settings server-wide (#529)
- Rather than having each individual user configure their conversation settings, allow the server admin to configure the OpenAI API key or offline model once, and let all the users re-use that code.
- To configure the settings, the admin should go to the `django/admin` page and configure the relevant chat settings. To create an admin, run `python3 src/manage.py createsuperuser` and enter in the details. For simplicity, the email and username should match.
- Remove deprecated/unnecessary endpoints and views for configuring per-user chat settings
2023-11-02 10:43:27 -07:00
Debanjum Singh Solanky
d92a2d03a7 Rename Files, Classes from X_To_JSONL to more appropriate X_To_Entries
These content processors are converting content into entries in DB
instead of entries in JSONL file
2023-11-01 14:51:33 -07:00
Debanjum Singh Solanky
87e6b1eab9 Rename TextEmbeddings to TextEntries for improved readability
Improves readability as name has closer match to underlying
constructs
2023-10-31 18:55:59 -07:00
Debanjum Singh Solanky
bcbee05a9e Rename DbModels Embeddings, EmbeddingsAdapter to Entry, EntryAdapter
Improves readability as name has closer match to underlying
constructs

- Entry is any atomic item indexed by Khoj. This can be an org-mode
  entry, a markdown section, a PDF or Notion page etc.

- Embeddings are semantic vectors generated by the search ML model
  that encodes for meaning contained in an entries text.

- An "Entry" contains "Embeddings" vectors but also other metadata
  about the entry like filename etc.
2023-10-31 18:50:54 -07:00
sabaimran
54a387326c
[Multi-User Part 6]: Address small bugs and upstream PR comments (#518)
- 08654163cb: Add better parsing for XML files
- f3acfac7fb: Add a try/catch around the dateparser in order to avoid internal server errors in app
- 7d43cd62c0: Chunk embeddings generation in order to avoid large memory load
- e02d751eb3: Addresses comments from PR #498 
- a3f393edb4: Addresses comments from PR #503 
- 66eb078286: Addresses comments from PR #511 
- Address various items in https://github.com/khoj-ai/khoj/issues/527
2023-10-31 17:59:53 -07:00
sabaimran
5f3f6b7c61
[Multi-User Part 5]: Add a production Docker file and use a gunicorn configuration with it (#514)
- Add a productionized setup for the Khoj server using `gunicorn` with multiple workers for handling requests
- Add a new Dockerfile meant for production config at `ghcr.io/khoj-ai/khoj:prod`; the existing Docker config should remain the same
2023-10-26 13:15:31 -07:00
Debanjum
9acc722f7f
[Multi-User Part 4]: Authenticate using API Tokens (#513)
###  New
- Use API keys to authenticate from Desktop, Obsidian, Emacs clients
- Create API, UI on web app config page to CRUD API Keys
- Create user API keys table and functions to CRUD them in Database

### 🧪 Improve
- Default to better search model, [gte-small](https://huggingface.co/thenlper/gte-small), to improve search quality
- Only load chat model to GPU if enough space, throw error on load failure
- Show encoding progress, truncate headings to max chars supported
- Add instruction to create db in Django DB setup Readme

### ⚙️ Fix
- Fix error handling when configure offline chat via Web UI
- Do not warn in anon mode about Google OAuth env vars not being set
- Fix path to load static files when server started from project root
2023-10-26 12:33:03 -07:00