sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-27 17:35:07 +01:00

Author	SHA1	Message	Date
Debanjum	11ce3e2268	Update Text Chunking Strategy to Improve Search Context (#645 ) ## Major - Parse markdown, org parent entries as single entry if fit within max tokens - Parse a file as single entry if it fits with max token limits - Add parent heading ancestry to extracted markdown entries for context - Chunk text in preference order of para, sentence, word, character ## Minor - Create wrapper function to get entries from org, md, pdf & text files - Remove unused Entry to Jsonl converter from text to entry class, tests - Dedupe code by using single func to process an org file into entries Resolves #620	2024-04-08 13:56:38 +05:30
sabaimran	f57f9f672d	Address Notion, Image tech debt in indexing code path (#687 ) * Add support for using OAuth2.0 in the Notion integration * Add notion to the admin page * Remove unnecessary content_index and image search/setup references * Trigger background job to start indexing Notion after user configures it * Add a log line when a new Notion integration is setup * Fix references to the configure_content methods	2024-04-05 12:10:03 +05:30
Debanjum Singh Solanky	29c1c18042	Increase search distance to get relevant content for chat post indexer update More content indexed per entry would result in an overall scores lowering effect. Increase default search distance threshold to counter that - Details - Fix expected results post indexing updates - Fix search with max distance post indexing updates - Minor - Remove openai chat actor test for after: operator as it's not expected anymore	2024-04-04 02:41:55 +05:30
Debanjum	4d30f7d1d9	Short-circuit API rate limiter for unauthenticated users (#607 ) ### Major - Short-circuit API rate limiter for unauthenticated user Calls by unauthenticated users were failing at API rate limiter as it failed to access user info object. This is a bug. API rate limiter should short-circuit for unauthenicated users so a proper Forbidden response can be returned by API Add regression test to verify that unauthenticated users get 403 response when calling the /chat API endpoint ### Minor - Remove trailing slash to normalize khoj url in obsidian plugin settings - Move used /api/config API controllers into separate module - Delete unused /api/beta API endpoint - Fix error message rendering in khoj.el, khoj obsidian chat - Handle deprecation warnings for subscribe renew date, langchain, pydantic & logger.warn	2024-01-17 00:59:52 +05:30
Debanjum Singh Solanky	d26a4ffcea	Only run the OpenAI chat client, /online test when API keys are set	2024-01-17 00:36:03 +05:30
Debanjum Singh Solanky	ba99089a12	Short-circuit API rate limiter for unauthenticated user Calls by unauthenticated users were failing at API rate limiter as it failed to access user info object. This is a bug. API rate limiter should short-circuit for unauthenicated users so a proper Forbidden response can be returned by API Add regression test to verify that unauthenticated users get 403 response when calling the /chat API endpoint	2024-01-12 00:23:50 +05:30
Debanjum Singh Solanky	4ded32cc64	Test 1000 file upload limit to index/update API endpoint Due to FastAPI limitation	2024-01-03 22:14:36 +05:30
sabaimran	79913d4c17	Add isort to the pre-commit configuration and apply it to the whole project (#595 ) * Apply isort to the entire repository * Fix missing import issues in text_to_entries * Fix imports in migration files	2023-12-28 18:04:02 +05:30
sabaimran	e438853b09	Add additional unit tests to verify behavior of unsubscribed/subscribed users	2023-11-26 13:09:00 -08:00
sabaimran	73e38fccf3	Explicitly set billing to off in the test for being able to index a large set of data	2023-11-25 20:48:32 -08:00
sabaimran	b2afbaa315	Add support for rate limiting the amount of data indexed - Add a dependency on the indexer API endpoint that rounds up the amount of data indexed and uses that to determine whether the next set of data should be processed - Delete any files that are being removed for adminstering the calculation - Show current amount of data indexed in the config page	2023-11-25 20:28:04 -08:00
sabaimran	1e2af083f0	Rename the data_sources module to content	2023-11-21 22:11:32 -08:00
sabaimran	2bb989e9d8	Resolve merge conflicts and fix some import ordering	2023-11-21 12:30:43 -08:00
sabaimran	a474c31e02	Move the django app into the src/khoj folder for better organization and functionality - Our pypi package currently does not work because the django app and associated database is not included. To remedy this issue, move the app into the src/khoj folder. This has the added benefit of improved organization of the codebase, as all server related code is now in a single folder - Update associated file paths and system references	2023-11-21 10:56:04 -08:00
sabaimran	b8e6883a81	Merge branch 'master' of github.com:khoj-ai/khoj into features/internet-enabled-search	2023-11-19 16:20:08 -08:00
sabaimran	ec06d2c446	Move data indexer files into a separate folder under processor. Update assoc UTs	2023-11-16 17:19:55 -08:00
Debanjum Singh Solanky	ddb07def0d	Test search uses ancestor headings as context for improved results - Update test data to add deeper outline hierarchy for testing hierarchy as context - Update collateral tests that need count of entries updated, deleted asserts to be updated	2023-11-16 03:05:19 -08:00
Debanjum Singh Solanky	922983bd53	Set max cos distance to 0.18. Test search API query with max distance	2023-11-15 20:26:21 -08:00
Debanjum Singh Solanky	5a6ab9cc85	Fix failing client tests	2023-11-15 00:17:44 -08:00
Debanjum Singh Solanky	8f200cf53f	Remove unused parameter from configure_search_type method	2023-11-14 19:09:35 -08:00
sabaimran	262a8574d1	Add a test to verify that a user without data sucessfully returns a respones to the /search endpoint	2023-11-10 14:00:58 -08:00
sabaimran	3678aa5614	Add tests to validate expected behaviors in the multi-user scenario	2023-11-04 14:29:30 -07:00
Debanjum Singh Solanky	d92a2d03a7	Rename Files, Classes from X_To_JSONL to more appropriate X_To_Entries These content processors are converting content into entries in DB instead of entries in JSONL file	2023-11-01 14:51:33 -07:00
Debanjum Singh Solanky	bcbee05a9e	Rename DbModels Embeddings, EmbeddingsAdapter to Entry, EntryAdapter Improves readability as name has closer match to underlying constructs - Entry is any atomic item indexed by Khoj. This can be an org-mode entry, a markdown section, a PDF or Notion page etc. - Embeddings are semantic vectors generated by the search ML model that encodes for meaning contained in an entries text. - An "Entry" contains "Embeddings" vectors but also other metadata about the entry like filename etc.	2023-10-31 18:50:54 -07:00
Debanjum	9acc722f7f	[Multi-User Part 4]: Authenticate using API Tokens (#513 ) ### ✨ New - Use API keys to authenticate from Desktop, Obsidian, Emacs clients - Create API, UI on web app config page to CRUD API Keys - Create user API keys table and functions to CRUD them in Database ### 🧪 Improve - Default to better search model, [gte-small](https://huggingface.co/thenlper/gte-small), to improve search quality - Only load chat model to GPU if enough space, throw error on load failure - Show encoding progress, truncate headings to max chars supported - Add instruction to create db in Django DB setup Readme ### ⚙️ Fix - Fix error handling when configure offline chat via Web UI - Do not warn in anon mode about Google OAuth env vars not being set - Fix path to load static files when server started from project root	2023-10-26 12:33:03 -07:00
sabaimran	4b6ec248a6	[Multi-User Part 3]: Separate chat sesssions based on authenticated users (#511 ) - Add a data model which allows us to store Conversations with users. This does a minimal lift over the current setup, where the underlying data is stored in a JSON file. This maintains parity with that configuration. - There does _seem_ to be some regression in chat quality, which is most likely attributable to search results. This will help us with #275. It should become much easier to maintain multiple Conversations in a given table in the backend now. We will have to do some thinking on the UI.	2023-10-26 11:37:41 -07:00
sabaimran	a8a82d274a	[Multi-User Part 2]: Add login pages and gate access to application behind login wall (#503 ) - Make most routes conditional on authentication if anonymous mode is not enabled. If anonymous mode is enabled, it scaffolds a default user and uses that for all application interactions. - Add a basic login page and add routes for redirecting the user if logged in	2023-10-26 10:17:29 -07:00
sabaimran	216acf545f	[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account (#498 ) - Partition configuration for indexing local data based on user accounts - Store indexed data in an underlying postgres db using the `pgvector` extension - Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time - Apply filters using SQL queries - Start removing many server-level configuration settings - Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB. - Update the Docker image and docker-compose.yml to work with the new application design	2023-10-26 09:42:29 -07:00
sabaimran	963cd165eb	Resolve merge conflicts	2023-10-19 14:39:05 -07:00
Debanjum Singh Solanky	7b1c62ba53	Mark test_get_configured_types_via_api unit test as flaky It passes locally on running individually but fails when run in parallel on local or CI	2023-10-17 05:56:00 -07:00
Debanjum Singh Solanky	5efae1ad55	Update indexer API endpoint query params for force, content type New URL query params, `force' and `t' match name of query parameter in existing Khoj API endpoints Update Desktop, Obsidian and Emacs client to call using these new API query params. Set `client' query param from each client for telemetry visibility	2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky	84654ffc5d	Update indexer API endpoint URL to index/update from indexer/batch New URL follows action oriented endpoint naming convention used for other Khoj API endpoints Update desktop, obsidian and emacs client to call this new API endpoint	2023-10-17 04:58:13 -07:00
sabaimran	c125995d94	[Multi-User]: Part 0 - Add support for logging in with Google (#487 ) * Add concept of user authentication to the request session via GoogleUser	2023-10-14 19:39:13 -07:00
Debanjum Singh Solanky	bed3aff059	Update tests to test multi-part/form method of pushing files to index Instead of using the previous method to push data as json payload of POST request pass it as files to upload via the multi-part/form to the batch indexer API endpoint	2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky	6aa69da3ef	Put indexer API endpoint under /api path segment Update FastAPI app router, desktop app and to use new url path to batch indexer API endpoint All api endpoints should exist under /api path segment	2023-10-09 21:35:58 -07:00
sabaimran	76562f4250	Add front-end Electron application for Khoj local file syncing (#473 ) * Initial version - setup a file-push architecture for generating embeddings with Khoj * Use state.host and state.port for configuring the URL for the indexer * Fix parsing of PDF files * Read markdown files from streamed data and update unit tests * On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system * Init: refactor indexer/batch endpoint to support a generic file ingestion format * Add features to better support indexing from files sent by the desktop client * Initial commit with Electron application - Adds electron app * Add import for pymupdf, remove import for pypdf * Allow user to configure khoj host URL * Remove search type configuration from index.html * Use v1 path for current indexer routes	2023-09-06 12:04:18 -07:00
sabaimran	4854258047	Move to a push-first model for retrieving embeddings from local files (#457 ) * Initial version - setup a file-push architecture for generating embeddings with Khoj * Update unit tests to fix with new application design * Allow configure server to be called without regenerating the index; this no longer works because the API for indexing files is not up in time for the server to send a request * Use state.host and state.port for configuring the URL for the indexer * On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system	2023-08-31 12:55:17 -07:00
Debanjum	7919787fb7	Use Slash Commands and Add Notes Slash Command (#463 ) * Store conversation command options in an Enum * Move to slash commands instead of using @ to specify general commands * Calculate conversation command once & pass it as arg to child funcs * Add /notes command to respond using only knowledge base as context This prevents the chat model to try respond using it's general world knowledge only without any references pulled from the indexed knowledge base * Test general and notes slash commands in openai chat director tests * Update gpt4all tests to use md configuration * Add a /help tooltip * Add dynamic support for describing slash commands. Remove default and treat notes as the default type --------- Co-authored-by: sabaimran <narmiabas@gmail.com>	2023-08-26 18:11:18 -07:00
sabaimran	5ccb01343e	Add Offline chat to Obsidian (#359 ) * Add support for configuring/using offline chat from within Obsidian * Fix type checking for search type * If Github is not configured, /update call should fail * Fix regenerate tests same as the update ones * Update help text for offline chat in obsidian * Update relevant description for Khoj settings in Obsidian * Simplify configuration logic and use smarter defaults	2023-07-28 18:47:56 -07:00
Debanjum Singh Solanky	d078e7b1f6	Clean up search type usage in khoj server, tests and Readme	2023-07-18 19:57:55 -07:00
Debanjum Singh Solanky	b9fb656657	Update Tests to setup both content_index, search_models before testing This is required by the updated structure of Khoj setup - Add content_config pytest fixture, pass bi_encoder from search_models.[text\|image]_search	2023-07-14 01:29:48 -07:00
Debanjum Singh Solanky	0f993b332e	Drop support for Ledger as a separate content type Khoj will soon get a generic text indexing content type. This along with a file filter should suffice for searching through Ledger transactions, if required. Having a specific content type for niche use-case like ledger isn't useful. Removing unused content types will reduce khoj code to manage.	2023-07-02 16:57:49 -07:00
Debanjum Singh Solanky	c9db5321e7	Remove unused org-music as an indexable content type from Khoj Org-music was just a custom content type that worked with org-music. It was mostly only useful for me. Cleaning up that code will reduce number of content types for khoj to manage.	2023-07-02 16:21:21 -07:00
Debanjum Singh Solanky	f516d127c8	Update client tests to expect "all" as a valid new content type	2023-06-28 22:07:02 -07:00
Saba	019d3732de	Rename orgmode_search to org_search	2023-06-13 16:06:54 -07:00
Debanjum Singh Solanky	acd14a5e41	Wire up PDF to jsonl processor to Khoj server layer (API, config) - Specify PDF content to index via khoj.yml - Index PDF content on app start, reconfigure - Expose PDF as a search type via API	2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky	2bed4c3b50	Fix configuring search types & /config/types API when no plugin configured - Test /config/types API when no plugin configured, only plugin configured and no content configured scenarios - Do not throw null reference exception while configuring search types when no plugin configured - Do not throw null reference exception on calling /config/types API when no plugin configured Resolves bug introduced by #173	2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky	b09350c052	Fix to return only enabled content types via the new config/types API - Previously was return all core content types even if they had not been setup - Add test to validate only configured content types are returned by the api/config/types API endpoint	2023-02-28 22:08:26 -06:00
Debanjum Singh Solanky	ede6eb6879	Re-enable testing search and update API with image content type It may have been disabled due to issues with image search earlier	2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky	88a9eadfba	Use client pytest fixture to test API with plugin type configured	2023-02-28 20:25:51 -06:00

1 2

80 commits