Commit graph

58 commits

Author SHA1 Message Date
Debanjum Singh Solanky
d92a2d03a7 Rename Files, Classes from X_To_JSONL to more appropriate X_To_Entries
These content processors are converting content into entries in DB
instead of entries in JSONL file
2023-11-01 14:51:33 -07:00
Debanjum Singh Solanky
bcbee05a9e Rename DbModels Embeddings, EmbeddingsAdapter to Entry, EntryAdapter
Improves readability as name has closer match to underlying
constructs

- Entry is any atomic item indexed by Khoj. This can be an org-mode
  entry, a markdown section, a PDF or Notion page etc.

- Embeddings are semantic vectors generated by the search ML model
  that encode the meaning contained in an entry's text.

- An "Entry" contains "Embeddings" vectors but also other metadata
  about the entry like filename etc.
2023-10-31 18:50:54 -07:00
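The Entry versus Embeddings distinction described in the commit above can be pictured with a small sketch. The field names (`raw`, `file_path`, `embeddings`) are illustrative assumptions, not the actual Khoj model definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """Hypothetical stand-in for one indexed item (org entry, markdown section, PDF page, ...)."""

    raw: str  # text of the entry (assumed field name)
    file_path: str  # metadata about the entry, here its source file (assumed field name)
    embeddings: List[float] = field(default_factory=list)  # semantic vector from the search model


# An Entry carries the embeddings vector alongside its other metadata.
entry = Entry(raw="* Buy milk", file_path="notes/todo.org", embeddings=[0.12, -0.03, 0.88])
```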
Debanjum
9acc722f7f
[Multi-User Part 4]: Authenticate using API Tokens (#513)
### New
- Use API keys to authenticate from Desktop, Obsidian, Emacs clients
- Create API, UI on web app config page to CRUD API Keys
- Create user API keys table and functions to CRUD them in Database

### 🧪 Improve
- Default to better search model, [gte-small](https://huggingface.co/thenlper/gte-small), to improve search quality
- Only load chat model to GPU if enough space, throw error on load failure
- Show encoding progress, truncate headings to max chars supported
- Add instruction to create db in Django DB setup Readme

### ⚙️ Fix
- Fix error handling when configuring offline chat via Web UI
- Do not warn in anon mode about Google OAuth env vars not being set
- Fix path to load static files when server started from project root
2023-10-26 12:33:03 -07:00
sabaimran
4b6ec248a6
[Multi-User Part 3]: Separate chat sessions based on authenticated users (#511)
- Add a data model which allows us to store Conversations with users. This does a minimal lift over the current setup, where the underlying data is stored in a JSON file. This maintains parity with that configuration.
- There does _seem_ to be some regression in chat quality, which is most likely attributable to search results.

This will help us with #275. It should become much easier to maintain multiple Conversations in a given table in the backend now. We will have to do some thinking on the UI.
2023-10-26 11:37:41 -07:00
sabaimran
a8a82d274a
[Multi-User Part 2]: Add login pages and gate access to application behind login wall (#503)
- Make most routes conditional on authentication *if anonymous mode is not enabled*. If anonymous mode is enabled, it scaffolds a default user and uses that for all application interactions.
- Add a basic login page and add routes for redirecting the user if logged in
2023-10-26 10:17:29 -07:00
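The anonymous-mode behaviour described in the commit above could look roughly like this. It is a minimal sketch: the `User` type, `DEFAULT_USER` and `resolve_user` are hypothetical names for illustration, not the server's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    username: str


# Hypothetical default account scaffolded when anonymous mode is enabled.
DEFAULT_USER = User(username="default")


def resolve_user(authenticated_user: Optional[User], anonymous_mode: bool) -> Optional[User]:
    """Return the user a request should act as, or None if the route should be gated."""
    if anonymous_mode:
        # Anonymous mode: every interaction uses the default user.
        return DEFAULT_USER
    # Otherwise only an authenticated user may proceed.
    return authenticated_user
```

A route handler would redirect to the login page whenever `resolve_user` returns `None`.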
sabaimran
216acf545f
[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account (#498)
- Partition configuration for indexing local data based on user accounts
- Store indexed data in an underlying postgres db using the `pgvector` extension
- Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time
- Apply filters using SQL queries
- Start removing many server-level configuration settings
- Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB.
- Update the Docker image and docker-compose.yml to work with the new application design
2023-10-26 09:42:29 -07:00
sabaimran
963cd165eb Resolve merge conflicts 2023-10-19 14:39:05 -07:00
Debanjum Singh Solanky
7b1c62ba53 Mark test_get_configured_types_via_api unit test as flaky
It passes locally on running individually but fails when run in
parallel on local or CI
2023-10-17 05:56:00 -07:00
Debanjum Singh Solanky
5efae1ad55 Update indexer API endpoint query params for force, content type
New URL query params, `force' and `t' match name of query parameter in
existing Khoj API endpoints

Update Desktop, Obsidian and Emacs client to call using these new API
query params. Set `client' query param from each client for telemetry
visibility
2023-10-17 04:58:13 -07:00
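As an illustration of the query params named in the commit above, a client might build the request URL like this. The host, the `/api/v1/index/update` path and the parameter values are assumptions made for this sketch. Only the param names `force`, `t` and `client` come from the commit message.

```python
from urllib.parse import urlencode


def build_index_update_url(host: str, force: bool, content_type: str, client: str) -> str:
    """Build an indexer URL using the `force`, `t` and `client` query params."""
    # The path below is an assumed example, not confirmed by the commit message.
    query = urlencode({"force": str(force).lower(), "t": content_type, "client": client})
    return f"{host}/api/v1/index/update?{query}"


url = build_index_update_url("http://localhost:8000", force=True, content_type="markdown", client="emacs")
```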
Debanjum Singh Solanky
84654ffc5d Update indexer API endpoint URL to index/update from indexer/batch
New URL follows action oriented endpoint naming convention used for
other Khoj API endpoints

Update desktop, obsidian and emacs client to call this new API
endpoint
2023-10-17 04:58:13 -07:00
sabaimran
c125995d94
[Multi-User]: Part 0 - Add support for logging in with Google (#487)
* Add concept of user authentication to the request session via GoogleUser
2023-10-14 19:39:13 -07:00
Debanjum Singh Solanky
bed3aff059 Update tests to test multi-part/form method of pushing files to index
Instead of pushing data as the JSON payload of a POST request,
pass it as files uploaded via multi-part/form to the batch indexer API endpoint
2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky
6aa69da3ef Put indexer API endpoint under /api path segment
Update FastAPI app router and desktop app to use new URL path to
batch indexer API endpoint

All api endpoints should exist under /api path segment
2023-10-09 21:35:58 -07:00
sabaimran
76562f4250
Add front-end Electron application for Khoj local file syncing (#473)
* Initial version - setup a file-push architecture for generating embeddings with Khoj
* Use state.host and state.port for configuring the URL for the indexer
* Fix parsing of PDF files
* Read markdown files from streamed data and update unit tests
* On application startup, load in embeddings from configuration files, rather than regenerating the corpus based on file system
* Init: refactor indexer/batch endpoint to support a generic file ingestion format
* Add features to better support indexing from files sent by the desktop client
* Initial commit with Electron application
- Adds electron app
* Add import for pymupdf, remove import for pypdf
* Allow user to configure khoj host URL
* Remove search type configuration from index.html
* Use v1 path for current indexer routes
2023-09-06 12:04:18 -07:00
sabaimran
4854258047
Move to a push-first model for retrieving embeddings from local files (#457)
* Initial version - setup a file-push architecture for generating embeddings with Khoj
* Update unit tests to fix with new application design
* Allow configure server to be called without regenerating the index; this no longer works because the API for indexing files is not up in time for the server to send a request
* Use state.host and state.port for configuring the URL for the indexer
* On application startup, load in embeddings from configuration files, rather than regenerating the corpus based on file system
2023-08-31 12:55:17 -07:00
Debanjum
7919787fb7
Use Slash Commands and Add Notes Slash Command (#463)
* Store conversation command options in an Enum

* Move to slash commands instead of using @ to specify general commands

* Calculate conversation command once & pass it as arg to child funcs

* Add /notes command to respond using only knowledge base as context

This prevents the chat model from trying to respond using only its
general world knowledge, without any references pulled from the
indexed knowledge base

* Test general and notes slash commands in openai chat director tests

* Update gpt4all tests to use md configuration

* Add a /help tooltip

* Add dynamic support for describing slash commands. Remove default and treat notes as the default type

---------

Co-authored-by: sabaimran <narmiabas@gmail.com>
2023-08-26 18:11:18 -07:00
sabaimran
5ccb01343e
Add Offline chat to Obsidian (#359)
* Add support for configuring/using offline chat from within Obsidian
* Fix type checking for search type
* If Github is not configured, /update call should fail
* Fix regenerate tests same as the update ones
* Update help text for offline chat in obsidian
* Update relevant description for Khoj settings in Obsidian
* Simplify configuration logic and use smarter defaults
2023-07-28 18:47:56 -07:00
Debanjum Singh Solanky
d078e7b1f6 Clean up search type usage in khoj server, tests and Readme 2023-07-18 19:57:55 -07:00
Debanjum Singh Solanky
b9fb656657 Update Tests to setup both content_index, search_models before testing
This is required by the updated structure of Khoj setup

- Add content_config pytest fixture, pass bi_encoder from
  search_models.[text|image]_search
2023-07-14 01:29:48 -07:00
Debanjum Singh Solanky
0f993b332e Drop support for Ledger as a separate content type
Khoj will soon get a generic text indexing content type. This along
with a file filter should suffice for searching through Ledger
transactions, if required.

Having a specific content type for a niche use-case like Ledger isn't
useful. Removing unused content types will reduce the khoj code to manage.
2023-07-02 16:57:49 -07:00
Debanjum Singh Solanky
c9db5321e7 Remove unused org-music as an indexable content type from Khoj
Org-music was just a custom content type that worked with org-music.
It was mostly only useful for me.

Cleaning up that code will reduce number of content types for khoj to
manage.
2023-07-02 16:21:21 -07:00
Debanjum Singh Solanky
f516d127c8 Update client tests to expect "all" as a valid new content type 2023-06-28 22:07:02 -07:00
Saba
019d3732de Rename orgmode_search to org_search 2023-06-13 16:06:54 -07:00
Debanjum Singh Solanky
acd14a5e41 Wire up PDF to jsonl processor to Khoj server layer (API, config)
- Specify PDF content to index via khoj.yml
- Index PDF content on app start, reconfigure
- Expose PDF as a search type via API
2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
2bed4c3b50 Fix configuring search types & /config/types API when no plugin configured
- Test /config/types API when no plugin configured, only plugin configured
  and no content configured scenarios
- Do not throw null reference exception while configuring search types
  when no plugin configured
- Do not throw null reference exception on calling /config/types API
  when no plugin configured

Resolves bug introduced by #173
2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky
b09350c052 Fix to return only enabled content types via the new config/types API
- Previously it returned all core content types even if they had not
  been set up
- Add test to validate only configured content types are returned by
  the api/config/types API endpoint
2023-02-28 22:08:26 -06:00
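The fix in the commit above amounts to filtering content types by whether they are configured. A minimal sketch follows. The config shape and the `get_configured_types` helper are assumptions for illustration.

```python
from typing import Dict, List, Optional


def get_configured_types(content_config: Dict[str, Optional[dict]]) -> List[str]:
    """Return only the content types that have a non-empty configuration."""
    return [name for name, config in content_config.items() if config]


# Hypothetical content config: only org has been set up.
content_config = {"org": {"input_files": ["notes.org"]}, "markdown": None, "image": None}
enabled_types = get_configured_types(content_config)
```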
Debanjum Singh Solanky
ede6eb6879 Re-enable testing search and update API with image content type
It may have been disabled due to issues with image search earlier
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
88a9eadfba Use client pytest fixture to test API with plugin type configured 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
68bd5d9ebc Configure API routes after set up search types while configuring server
Configure app routes after configuring server.
Import API routers after search type is dynamically populated.
Allow API to recognize the dynamically populated plugin search types
as valid type query param.
Enable searching for plugin type content.
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
5e83baab21 Use Black to format Khoj server code and tests 2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky
25a749ca1d Use the src/ layout to fix packaging Khoj for PyPi
- Why
  The khoj pypi packages should be installed in `khoj' directory.
  Previously it was being installed into `src' directory, which is a
  generic top level directory name that is discouraged from being used

- Changes
 - move src/* to src/khoj/*
 - update `setup.py' to `find_packages' in `src' instead of project root
 - rename imports to form `from khoj.*' in complete project
 - update `constants.web_directory' path to use `khoj' directory
 - rename root logger to `khoj' in `main.py'
 - fix image_search tests to use the newly renamed `khoj' logger
 - update config, docs, workflows to reference new path `src/khoj'
2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky
d292bdcc11 Do not version API. Premature given current state of the codebase
- Reason
  - All clients that currently consume the API are part of Khoj
  - Any breaking API changes will be fixed in clients immediately
  - So decoupling client from API is not required
  - This removes the burden of maintaining multiple versions of the API
2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky
e42a38e825 Version Khoj API, Update frontends, tests and docs to reflect it
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
  under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mention, reference only versioned api endpoints
2022-09-28 20:08:38 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl' scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl' processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
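The short-circuit described in the commit above can be sketched as follows. The quoted-word filter syntax, the `search` signature and the `semantic_search` callback are assumptions for illustration, not the actual Khoj code.

```python
import re
from typing import Callable, List

# Assumed filter syntax for this sketch: a single word wrapped in double quotes.
WORD_FILTER = re.compile(r'"([^"\s]+)"')


def search(query: str, entries: List[str], semantic_search: Callable[[str, List[str]], List[str]]) -> List[str]:
    """Apply word filters, then run semantic search only if query text remains."""
    filter_words = [word.lower() for word in WORD_FILTER.findall(query)]
    filtered = [entry for entry in entries if all(word in entry.lower() for word in filter_words)]

    remaining_query = WORD_FILTER.sub("", query).strip()
    if not remaining_query:
        # Filter-only query: return the filtered entries, skip semantic search and re-ranking.
        return filtered
    return semantic_search(remaining_query, filtered)
```

Passing the expensive step in as a callback makes the early return easy to see: for a filter-only query the callback is never invoked.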
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- Keeping a stored word filter index from going stale on entry
  updates is more of a hassle
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
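Regenerating the word filter index at startup, as the commit above opts for, can be as simple as building an in-memory inverted index. This is a sketch under assumptions: the tokenization and the `build_word_index` helper are illustrative, not the actual implementation.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_word_index(entries: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercased word to the ids of the entries that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(entry_id)
    return dict(index)


# Rebuilt from the current entries on every start, so it cannot go stale.
word_index = build_word_index(["Buy milk and eggs", "Call mom", "Milk the cow"])
```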
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query, in which case there is no trailing space
2022-09-04 02:38:57 +03:00
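The end-marker idea in the commit above can be illustrated with a regular expression that only matches a word once its closing quote has been typed. The exact pattern and the `completed_filter_words` helper are assumptions for this sketch.

```python
import re
from typing import List

# A filter word only counts once both the opening and closing quotes are present.
COMPLETED_FILTER = re.compile(r'"([^"\s]+)"')


def completed_filter_words(query: str) -> List[str]:
    """Return filter words whose closing quote has been typed."""
    return COMPLETED_FILTER.findall(query)
```

With this pattern, a query ending in an unclosed quote such as `tasks "mil` yields no filter words, so the expensive filter does not run while the user is still typing.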
Debanjum Singh Solanky
30c3eb372a Update Tests to Configure Filters and Setup Text Search 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
972523e8a9 Re-enable tests for image search
Verify if recent fixes resolve test flakiness
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
bc423d8f76 Disable image search in tests. Import global state from constants module
- Upstream issues causing load of image search model to fail.
  Disable tests related to image search for now
2022-08-06 02:47:52 +03:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by setting r=true as a query param when querying /search API
  - It's an optional param that defaults to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired,
    at the cost of lower accuracy
2022-07-26 22:56:36 +04:00
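The trade-off in the commit above can be sketched with two scoring callbacks: a cheap one applied to every entry and an expensive one applied only when re-ranking is requested. The `search` signature, `top_k` and both scorer callbacks are hypothetical stand-ins for the bi-encoder and cross-encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def search(
    query: str,
    entries: List[str],
    bi_score: Scorer,
    cross_score: Scorer,
    rerank: bool = False,
    top_k: int = 3,
) -> List[str]:
    """Rank entries with the cheap scorer, optionally re-rank the top hits."""
    hits = sorted(entries, key=lambda entry: bi_score(query, entry), reverse=True)[:top_k]
    if rerank:
        # The expensive scorer only sees the top_k hits, and only when asked for.
        hits = sorted(hits, key=lambda entry: cross_score(query, entry), reverse=True)
    return hits
```

Defaulting `rerank` to `False` mirrors the commit's choice: callers get the fast path unless they opt in to the slower, more accurate ordering.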
Debanjum Singh Solanky
65fea7681a Rename notes search type to org search, now that markdown notes supported 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types were mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and makes it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
c1369233db Consistently use "entry", "score" in json response for all search types
- Had already made some progress on this earlier by updating the image
  search responses. But needed to update the text search responses to
  use lowercase entry and score

- Update khoj.el to consume the updated json response keys for text
  search
2022-07-20 20:33:27 +04:00
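A consistent response shape like the one described in the commit above might be produced as follows. Only the lowercase `entry` and `score` keys come from the commit message. The `to_response` helper and the score rounding are assumptions for illustration.

```python
import json
from typing import Dict, List, Tuple, Union


def to_response(hits: List[Tuple[str, float]]) -> List[Dict[str, Union[str, float]]]:
    """Serialize search hits with the same lowercase keys for every search type."""
    return [{"entry": text, "score": round(float(score), 3)} for text, score in hits]


payload = json.dumps(to_response([("Buy milk", 0.91234)]))
```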
Debanjum Singh Solanky
c9ff97451b Fix tests to handle updated response types by API 2022-07-20 03:01:56 +04:00
Debanjum Singh Solanky
68ee88cebc Fix image search tests after update to API response for image search types
- Look for 'entry' key in response json instead of 'Entry'
- Expect image where id = alphanumeric order of image name
2022-07-20 01:37:01 +04:00
Debanjum Singh Solanky
732b2d287f Give the project a short, less generic name. Rename it to Khoj
- Semantic Search was just a placeholder used to test the idea out
  Didn't want to get into naming at that point of time
2022-07-19 18:26:16 +04:00