sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-30 10:53:02 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	d74f8e03d3	Pass max context length to fix using updated GPT4All.list_gpu method It's signature was updated in GPT4All 2.1.0 pypi release. Resolves #610	2024-01-16 12:23:45 +05:30
sabaimran	5ff9df9d4c	Add support per user for configuring the preferred search model from the config page - Honor this setting across the relevant places where embeddings are used - Convert the VectorField object to have None for dimensions in order to make the search model easily configurable	2023-12-20 13:25:43 +05:30
Debanjum Singh Solanky	7009793170	Migrate to OpenAI Python library >= 1.0	2023-12-03 18:16:00 -05:00
Debanjum Singh Solanky	de5aa5c32e	Update pillow, aiohttp dependencies	2023-11-28 19:55:43 -08:00
Debanjum Singh Solanky	4636390f7f	Transcribe speech to text offline with Whisper - Allow server admin to configure offline speech to text model during initialization - Use offline speech to text model to transcribe audio from clients - Set offline whisper as default speech to text model as no setup api key reqd	2023-11-26 05:55:11 -08:00
Debanjum Singh Solanky	19e042037a	Run isort with black profile to avoid conflicts between the two	2023-11-21 12:52:07 -08:00
Debanjum Singh Solanky	4e98acbca7	Update minimum pydantic version to one with model_validate function	2023-11-20 14:52:37 -08:00
Debanjum Singh Solanky	ca87b4ede9	Wrap common API query parameters into shared class to deduplicate code - Upgrade FastAPI to >= latest version. Required upgrade of FastAPI. Earlier version didn't support wrapping common query params in class - Use per fixture app instead of a global FastAPI app in conftest - Upgrade minimum required Django version - Fix no notes chat director test with updated no notes message No notes message was updated in commit `118f1143`	2023-11-17 18:43:49 -08:00
sabaimran	d06b2cf24b	Downgrade pyproject.toml to avert depedency conflict	2023-11-15 10:47:54 -08:00
Debanjum Singh Solanky	9c6e7bdea2	Upgrade server, desktop app dependencies to resolve CVE bugs	2023-11-15 01:47:53 -08:00
Debanjum Singh Solanky	9aaf475c8a	Create API webhook, endpoints for subscription payments using Stripe - Add fields to mark users as subscribed to a specific plan and subscription renewal date in DB - Add ability to unsubscribe a user using their email address - Expose webhook for stripe to callback confirming payment	2023-11-07 10:20:51 -08:00
Debanjum Singh Solanky	9f47fc8e34	Upgrade langchain version since adding support for OCR-ing PDFs	2023-11-06 21:58:33 -08:00
Debanjum	38f24a037d	Improve Indexing Text Entries (#535 ) Major - Ensure search results logic consistent across migration to DB, multi-user - Manually verified search results for sample queries look the same across migration - Flatten indexing code for better indexing progress tracking and code readability Minor - `a4f407f` Test memory leak on MPS device when generating vector embeddings - `ef24485` Improve Khoj with DB setup instructions in the Django app readme (for now) - `f212cc7` Arrange remaining text search tests in arrange, act, assert order - `022017d` Fix text search tests to test updated indexing log messages	2023-11-06 16:01:53 -08:00
Debanjum Singh Solanky	a4f407f595	Test memory leak on MPS device when generating vector embeddings Slope threshold of 2.0 determined qualitatively on local Mac device Minor unused import and clean-up	2023-11-05 03:48:54 -08:00
sabaimran	b5972e9311	Use OCR to extract image text in PDFs	2023-11-04 17:15:28 -07:00
Debanjum Singh Solanky	6fae6fb2a4	Merge branch 'features/multi-user-support-khoj' into improve-client-app-theming	2023-11-03 04:58:41 -07:00
sabaimran	fb6ebd19fc	Fix refactor bugs, CSRF token issues for use in production (#531 ) Fix refactor bugs, CSRF token issues for use in production * Add flags for samesite settings to enable django admin login * Include tzdata to dependencies to work around python package issues in linux * Use DJANGO_DEBUG flag correctly * Fix naming of entry field when creating EntryDate objects * Correctly retrieve openai config settings * Fix datefilter with embeddings name for field	2023-11-02 23:02:38 -07:00
Debanjum Singh Solanky	345856e7be	Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user-support-khoj Merge changes to use latest GPT4All with GPU, GGUF model support into khoj multi-user support rearchitecture branch	2023-11-02 22:44:25 -07:00
sabaimran	54a387326c	[Multi-User Part 6]: Address small bugs and upstream PR comments (#518 ) - `08654163cb`: Add better parsing for XML files - `f3acfac7fb`: Add a try/catch around the dateparser in order to avoid internal server errors in app - `7d43cd62c0`: Chunk embeddings generation in order to avoid large memory load - `e02d751eb3`: Addresses comments from PR #498 - `a3f393edb4`: Addresses comments from PR #503 - `66eb078286`: Addresses comments from PR #511 - Address various items in https://github.com/khoj-ai/khoj/issues/527	2023-10-31 17:59:53 -07:00
sabaimran	5f3f6b7c61	[Multi-User Part 5]: Add a production Docker file and use a gunicorn configuration with it (#514 ) - Add a productionized setup for the Khoj server using `gunicorn` with multiple workers for handling requests - Add a new Dockerfile meant for production config at `ghcr.io/khoj-ai/khoj:prod`; the existing Docker config should remain the same	2023-10-26 13:15:31 -07:00
sabaimran	a8a82d274a	[Multi-User Part 2]: Add login pages and gate access to application behind login wall (#503 ) - Make most routes conditional on authentication if anonymous mode is not enabled. If anonymous mode is enabled, it scaffolds a default user and uses that for all application interactions. - Add a basic login page and add routes for redirecting the user if logged in	2023-10-26 10:17:29 -07:00
sabaimran	216acf545f	[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account (#498 ) - Partition configuration for indexing local data based on user accounts - Store indexed data in an underlying postgres db using the `pgvector` extension - Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time - Apply filters using SQL queries - Start removing many server-level configuration settings - Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB. - Update the Docker image and docker-compose.yml to work with the new application design	2023-10-26 09:42:29 -07:00
Debanjum Singh Solanky	0f1ebcae18	Upgrade to latest GPT4All. Use Mistral as default offline chat model GPT4all now supports gguf llama.cpp chat models. Latest GPT4All (+mistral) performs much at least 3x faster. On Macbook Pro at ~10s response start time vs 30s-120s earlier. Mistral is also a better chat model, although it hallucinates more than llama-2	2023-10-22 19:04:23 -07:00
sabaimran	6dc0df3afb	Pin pytorch version to 2.0.1 in order to avoid exit code 139 in Docker container (#512 )	2023-10-20 14:10:21 -07:00
sabaimran	963cd165eb	Resolve merge conflicts	2023-10-19 14:39:05 -07:00
Debanjum	ecc6fbfeb2	Push Files to Index from Emacs, Obsidian & Desktop Clients using Multi-Part Forms (#499 ) ### Overview - Add ability to push data to index from the Emacs, Obsidian client - Switch to standard mechanism of syncing files via HTTP multi-part/form. Previously we were streaming the data as JSON - Benefits of new mechanism - No manual parsing of files to send or receive on clients or server is required as most have in-built mechanisms to send multi-part/form requests - The whole response is not required to be kept in memory to parse content as JSON. As individual files arrive they're automatically pushed to disk to conserve memory if required - Binary files don't need to be encoded on client and decoded on server ### Code Details ### Major - Use multi-part form to receive files to index on server - Use multi-part form to send files to index on desktop client - Send files to index on server from the khoj.el emacs client - Send content for indexing on server at a regular interval from khoj.el - Send files to index on server from the khoj obsidian client - Update tests to test multi-part/form method of pushing files to index #### Minor - Put indexer API endpoint under /api path segment - Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method - Improve emoji, message on content index updated via logger - Don't call khoj server on khoj.el load, only once khoj invoked explicitly by user - Improve indexing of binary files - Let fs_syncer pass PDF files directly as binary before indexing - Use encoding of each file set in indexer request to read file - Add CORS policy to khoj server. Allow requests from khoj apps, obsidian & localhost - Update indexer API endpoint URL to` index/update` from `indexer/batch` Resolves #471 #243	2023-10-17 06:05:15 -07:00
sabaimran	c125995d94	[Multi-User]: Part 0 - Add support for logging in with Google (#487 ) * Add concept of user authentication to the request session via GoogleUser	2023-10-14 19:39:13 -07:00
sabaimran	09bb3686cc	Strip the incoming query from the slash conversation command (#500 ) * Strip the incoming query from the slash conversation command before passing it to the model or for search * Return q when content index not loaded * Remove -n 4 from pytest ini configuration to isolate test failures	2023-10-13 21:11:23 -07:00
Debanjum Singh Solanky	96c0b21285	Sync desktop app package.json with other Khoj clients metadata - Make `bump_version.sh' script set version for the Khoj desktop app too - Sync Khoj desktop app authors, license, description and version with the other interfaces and server - Update description in packages metadata to match project subtitle on Github	2023-10-13 20:43:55 -07:00
Debanjum Singh Solanky	72f8fde7ef	Run pytests in parallel on multiple CPU cores using pytest-xdist for speed	2023-10-12 20:56:17 -07:00
Debanjum Singh Solanky	60e9a61647	Use multi-part form to receive files to index on server - This uses existing HTTP affordance to process files - Better handling of binary file formats as removes need to url encode/decode - Less memory utilization than streaming json as files get automatically written to disk once memory utilization exceeds preset limits - No manual parsing of raw files streams required	2023-10-11 23:58:23 -07:00
Debanjum Singh Solanky	148e8f468f	Restrict openai package version below 1.0.0 to avoid breaking changes	2023-10-09 19:30:58 -07:00
sabaimran	2dd15e9f63	Resolve issues with GPT4All and fix prompt for yesterday extract questions date filter (#483 ) - GPT4All integration had ceased working with 0.1.7 specification. Update to use 1.0.12. At a later date, we should also use first party support for llama v2 via gpt4all - Update the system prompt for the extract_questions flow to add start and end date to the yesterday date filter example. - Update all setup data in conftest.py to use new client-server indexing pattern	2023-09-18 14:41:26 -07:00
sabaimran	343854752c	Improve docker builds for local hosting (#476 ) * Remove GPT4All dependency in pyproject.toml and use multiplatform builds in the dockerization setup in GH actions * Move configure_search method into indexer * Add conditional installation for gpt4all * Add hint to go to localhost:42110 in the docs. Addresses #477	2023-09-08 17:07:26 -07:00
sabaimran	dccfae3853	Remove PySide dependency and deprecate desktop builds (#475 ) * Remove PySide, gui option from code * Remove pyside 6 dependency from code * Remove workflows which build desktop applications * Update unit tests and update line in documentation * Remove additional references to pyinstaller, gui * Add uninstall steps to normal uninstall instructions	2023-09-07 11:36:27 -07:00
sabaimran	76562f4250	Add front-end Electron application for Khoj local file syncing (#473 ) * Initial version - setup a file-push architecture for generating embeddings with Khoj * Use state.host and state.port for configuring the URL for the indexer * Fix parsing of PDF files * Read markdown files from streamed data and update unit tests * On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system * Init: refactor indexer/batch endpoint to support a generic file ingestion format * Add features to better support indexing from files sent by the desktop client * Initial commit with Electron application - Adds electron app * Add import for pymupdf, remove import for pypdf * Allow user to configure khoj host URL * Remove search type configuration from index.html * Use v1 path for current indexer routes	2023-09-06 12:04:18 -07:00
sabaimran	922222a813	Fix anyio package version to avoid backwards compatibility issue with start_blocking_portal method	2023-08-31 14:14:13 -07:00
sabaimran	7c35da9fc4	Fix bug in /chat endpoint for general and update depdendencies	2023-08-28 14:12:11 -07:00
sabaimran	b45e1d8c0d	Fix plaintext HTML parsing and rendering (#464 ) * Store conversation command options in an Enum * Move to slash commands instead of using @ to specify general commands * Calculate conversation command once & pass it as arg to child funcs * Add /notes command to respond using only knowledge base as context This prevents the chat model to try respond using it's general world knowledge only without any references pulled from the indexed knowledge base * Test general and notes slash commands in openai chat director tests --------- Co-authored-by: Debanjum Singh Solanky <debanjum@gmail.com>	2023-08-27 11:24:30 -07:00
Debanjum Singh Solanky	3c58ab5fcb	Unmark Python 3.8 as supported in khoj-assistant pypi package	2023-08-16 00:58:59 -07:00
Debanjum Singh Solanky	84d774ea34	Retain desktop builds for 3 days to allow user tests Upgrade minimum tiktoken version to work for encoding gpt4	2023-08-08 23:02:13 -07:00
Debanjum	cc951450fb	Build Khoj Debian package on Ubuntu 20.04 to work with Glibc 2.31 (#424 ) Build the Debian package using Ubuntu 20.04 instead of 22.04 as Ubuntu 20.04 comes pre-installed with glibc_2.31 unlike Ubuntu 22.04 which uses glibc_2.35	2023-08-06 23:21:51 -07:00
Debanjum Singh Solanky	75c16432a4	Loosen dateparser dependency to get python3.10 wheel for regex package This should reduce chances of installation errors due to regex package being built from source for python3.11 Previously, the regex dependency of dateparser = 1.1.1 didn't have a wheel for python 3.11. This would trigger building the regex package from scratch which would fail for a lot of folks	2023-08-06 22:48:40 -07:00
sabaimran	d00f51b531	Fix the minimum version of the transformers library required to address #404	2023-08-02 09:10:14 -07:00
sabaimran	124d97c26d	Replace Falcon 🦅 model with Llama V2 🦙 for offline chat (#352 ) * Working example with LlamaV2 running locally on my machine - Download from huggingface - Plug in to GPT4All - Update prompts to fit the llama format * Add appropriate prompts for extracting questions based on a query based on llama format * Rename Falcon to Llama and make some improvements to the extract_questions flow * Do further tuning to extract question prompts and unit tests * Disable extracting questions dynamically from Llama, as results are still unreliable	2023-07-27 20:51:20 -07:00
sabaimran	8b2af0b5ef	Add support for our first Local LLM 🤖🏠 (#330 ) * Add support for gpt4all's falcon model as an additional conversation processor - Update the UI pages to allow the user to point to the new endpoints for GPT - Update the internal schemas to support both GPT4 models and OpenAI - Add unit tests benchmarking some of the Falcon performance * Add exc_info to include stack trace in error logs for text processors * Pull shared functions into utils.py to be used across gpt4 and gpt * Add migration for new processor conversation schema * Skip GPT4All actor tests due to typing issues * Fix Obsidian processor configuration in auto-configure flow * Rename enable_local_llm to enable_offline_chat	2023-07-26 16:27:08 -07:00
Debanjum Singh Solanky	9c76150895	Migrate from PyQT6 to PySide6	2023-07-11 18:43:44 -07:00
Debanjum Singh Solanky	70f6b8266c	Upgrade minimum supported pydantic version	2023-07-03 12:22:56 -07:00
Debanjum Singh Solanky	0f993b332e	Drop support for Ledger as a separate content type Khoj will soon get a generic text indexing content type. This along with a file filter should suffice for searching through Ledger transactions, if required. Having a specific content type for niche use-case like ledger isn't useful. Removing unused content types will reduce khoj code to manage.	2023-07-02 16:57:49 -07:00
Debanjum Singh Solanky	30459ee4ba	Fix Khoj subtitle in desktop entry, pyproject, cli and Obsidian Readme	2023-07-02 16:09:07 -07:00

1 2

74 commits