Commit graph

3142 commits

Author SHA1 Message Date
Debanjum Singh Solanky
d527b644f4 Update content by source via API. Make web client use this API for config 2023-11-07 03:41:19 -08:00
Debanjum Singh Solanky
9ab327a2b6 Store the data source of each entry in database
This will be useful for updating, deleting entries by their data
source. Data source can be one of Computer, Github or Notion for now

Store each file/entries source in database
2023-11-07 02:18:48 -08:00
Debanjum Singh Solanky
c82cd0862a Delete deprecated content config pages for local files from web client
The desktop app now manages syncing local computer files to index
The server only manages "cloud" data source like github and notion.
2023-11-06 23:55:37 -08:00
Debanjum Singh Solanky
97cf8339aa Rename Sync button, Force Sync toggle to Save, Save All buttons 2023-11-06 21:57:37 -08:00
Debanjum Singh Solanky
a08b152358 Improve log messages in text_entries and memory leak unit test 2023-11-06 19:27:31 -08:00
sabaimran
6c8689e4ae Update corresponding chat UX in the desktop client as well 2023-11-06 16:18:41 -08:00
sabaimran
e01ecf1419 /s/references/reference to fix bug of jumping references 2023-11-06 16:12:25 -08:00
Debanjum
38f24a037d
Improve Indexing Text Entries (#535)
Major
- Ensure search results logic consistent across migration to DB, multi-user
- Manually verified search results for sample queries look the same across migration
 - Flatten indexing code for better indexing progress tracking and code readability

Minor
- a4f407f Test memory leak on MPS device when generating vector embeddings
- ef24485 Improve Khoj with DB setup instructions in the Django app readme (for now)
- f212cc7 Arrange remaining text search tests in arrange, act, assert order
- 022017d Fix text search tests to test updated indexing log messages
2023-11-06 16:01:53 -08:00
sabaimran
270f7b3eb3 Update the chat UI to have richer representation of the references 2023-11-05 15:46:43 -08:00
sabaimran
d697d752c2
Use repeat rather than manually specify auto in grid-template-rows
Co-authored-by: Debanjum <debanjum@gmail.com>
2023-11-05 15:23:42 -08:00
sabaimran
5f1e37fff0 Adjust indentation for css property 2023-11-05 14:33:23 -08:00
Debanjum Singh Solanky
a4f407f595 Test memory leak on MPS device when generating vector embeddings
Slope threshold of 2.0 determined qualitatively on local Mac device
Minor unused import and clean-up
2023-11-05 03:48:54 -08:00
Debanjum Singh Solanky
ef24485ada Improve Khoj with DB setup instructions in the Django app readme (for now) 2023-11-05 02:04:52 -08:00
sabaimran
084a8becc5 Fix but to prevent default in chat trigger 2023-11-04 20:13:33 -07:00
Debanjum Singh Solanky
5489e98b9c Do not index org heading entries by default
This is to maintain the previous default behavior
2023-11-04 20:09:25 -07:00
Debanjum Singh Solanky
34b5a86d1d Use SentenceTransformer to disable progress bar when encoding query
The Langchain HuggingFaceEmbeddings wrapper doesn't support disabling
progressbar, not especially for only query but not documents.

This makes the logs noisy with encoding progressbar for each
incremental queries

No features of the Langchain wrapper for SentenceTransformer was
currently being used anyway for now, and we can always switch back to
it if required
2023-11-04 20:09:25 -07:00
Debanjum Singh Solanky
dc9946fc03 Flatten nested loops, improve progress reporting in text_to_jsonl indexer
Flatten the nested loops to improve visibilty into indexing progress

Reduce spurious logs, report the logs at aggregated level and update
the logging description text to improve indexing progress reporting
2023-11-04 20:09:25 -07:00
sabaimran
88eeee3f4b Move try/catch for import one line later 2023-11-04 19:46:47 -07:00
sabaimran
dbaa892665 Flip catching modulenotfound to import error exception 2023-11-04 19:34:10 -07:00
sabaimran
8c3d5a49da Add try/except around image extraction step 2023-11-04 19:27:18 -07:00
sabaimran
fdfab39942 Update the config UI to show all files indexed with option to delete
- Given the separation of the client and server now, the web UI will no longer support configuration of local file paths of data to index
- Expose a way to show all the files that are currently set for indexing, along with an option to delete all or specific files
2023-11-04 19:03:34 -07:00
sabaimran
800bb4f458 Remove references to demo
- The demo setting is no longer necessary for the time being, as we won't have anymore demo instances
2023-11-04 17:17:04 -07:00
sabaimran
b5972e9311 Use OCR to extract image text in PDFs 2023-11-04 17:15:28 -07:00
Debanjum Singh Solanky
8273bf26b7 Fix multi-line chat input and output render on web, desktop clients
- Remove spurious whitespace in chat input box on page load being
  added because text area element was ending on newline
- Do not insert newline in message when send message by hitting enter key
  This would be more evident when send message with cursor in the
  middle of the sentence, as a newline would be inserted at the cursor
  point
- Remove chat message separator tokens from model output. Model
  sometimes starts to output text in it's chat format
2023-11-04 01:09:35 -07:00
Debanjum Singh Solanky
2f1756cc15 Do not use icon for each file, folder to index in desktop app.
Other minor fixes based on PR feedback
2023-11-04 00:13:10 -07:00
Debanjum Singh Solanky
e8f568d79c Make splash screen wider, opaque and fix it's spinner radius
Radius should be such that final spin doesn't extend out of the circle
Opaque background improves contrast for better visual
2023-11-03 23:59:21 -07:00
Debanjum Singh Solanky
3ef05f4803 Use css var for main font color in search, chat page of desktop app 2023-11-03 23:59:21 -07:00
Debanjum Singh Solanky
a19cbde2d7 Add About page for Khoj to Desktop app. Expose it via system tray
- Pass current khoj version from package.json to about page via
  electron IPC between backend js and frontend page
- Update Khoj information in default About screen as well, in case
  it's exposed anywhere else
2023-11-03 23:59:21 -07:00
Debanjum Singh Solanky
a327294ee9 Rename khoj.js to utils.js in web and desktop client apps 2023-11-03 18:13:37 -07:00
Debanjum Singh Solanky
db57eeaefe Console log a welcome message on loading Desktop client 2023-11-03 05:15:41 -07:00
Debanjum Singh Solanky
6fae6fb2a4 Merge branch 'features/multi-user-support-khoj' into improve-client-app-theming 2023-11-03 04:58:41 -07:00
Debanjum Singh Solanky
4cd76311ad Slow down spinning at end of splash sequence. Make animation bigger 2023-11-03 04:28:17 -07:00
Debanjum Singh Solanky
34661c33a2 Show splash screen on starting desktop app 2023-11-03 03:19:08 -07:00
Debanjum Singh Solanky
126d3f4563 Render each file, folder to index row with icon in desktop app
Make the file, folders to index look less like an editable field
2023-11-03 02:48:42 -07:00
Debanjum Singh Solanky
80ae132cad Update Desktop, Obsidian client color theme to lighter yellow
- Update background color to a different shade of white
- Make primary and primary hover colors less intense and more aligned
  with lantern flame shade
- Add water, leaf, flower color variables
2023-11-03 02:48:42 -07:00
sabaimran
fb6ebd19fc
Fix refactor bugs, CSRF token issues for use in production (#531)
Fix refactor bugs, CSRF token issues for use in production
* Add flags for samesite settings to enable django admin login
* Include tzdata to dependencies to work around python package issues in linux
* Use DJANGO_DEBUG flag correctly
* Fix naming of entry field when creating EntryDate objects
* Correctly retrieve openai config settings
* Fix datefilter with embeddings name for field
2023-11-02 23:02:38 -07:00
Debanjum Singh Solanky
345856e7be Merge branch 'master' of github.com:khoj-ai/khoj into features/multi-user-support-khoj
Merge changes to use latest GPT4All with GPU, GGUF model support into
khoj multi-user support rearchitecture branch
2023-11-02 22:44:25 -07:00
Debanjum Singh Solanky
041074ccd6 Make chat the landing page for the desktop app
Chat, unlike search, doesn't knowledge base indexing setup.
So you can get started with chat much faster.
2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
3801105b2a Make chat the landing page for the web app
Chat, unlike search, doesn't knowledge base indexing setup.
So you can get started with chat much faster.
2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
0d4e7d46c2 Fix color and size of profile picture circle in nav pane 2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
4fbe8ac6b1 Console log a welcome message on loading web client 2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
9fc6c97139 Use Khoj standard font family, weight in web client settings page 2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
b6f07099cd Simplify login page styling on web client
- Center all elements: icon, text and button
- Use khoj icon not logo-text
- Simplify login title text
2023-11-02 20:42:21 -07:00
Debanjum Singh Solanky
7b7f6d3bc8 Update web client theme to a lighter
- Update background color to a different shade of white
- Make primary and primary hover colors less intense and more aligned
  with lantern flame shade
- Add water, leaf, flower color variables
2023-11-02 20:42:21 -07:00
sabaimran
fe860aaf83 Merge branch 'features/multi-user-support-khoj' of github.com:khoj-ai/khoj into features/multi-user-support-khoj 2023-11-02 14:56:01 -07:00
sabaimran
2c9496bcf1 Add additional null checks in the migrate_server_pg script 2023-11-02 14:55:58 -07:00
sabaimran
20df0f5330 Use url_path_for for creating the login page URL in the application 2023-11-02 14:55:14 -07:00
sabaimran
fd11b78552
Fix migration script error when openai not available (#530) 2023-11-02 11:28:08 -07:00
sabaimran
fe6720fa06
[Multi-User Part 8]: Make conversation processor settings server-wide (#529)
- Rather than having each individual user configure their conversation settings, allow the server admin to configure the OpenAI API key or offline model once, and let all the users re-use that code.
- To configure the settings, the admin should go to the `django/admin` page and configure the relevant chat settings. To create an admin, run `python3 src/manage.py createsuperuser` and enter in the details. For simplicity, the email and username should match.
- Remove deprecated/unnecessary endpoints and views for configuring per-user chat settings
2023-11-02 10:43:27 -07:00
Debanjum Singh Solanky
12b3eeae9e Use Khoj fonts on config page of web and desktop apps too
Previously pico.css font-families were being selected for the config
page. This was different from the fonts used by index.html, chat.html

This improves spacing issue of heading further
2023-11-01 17:50:50 -07:00
Debanjum Singh Solanky
022d695309 Switch to narrow view below width of 700px on web client
This makes the dropdown menu align better to the profile picture in
mobile view
2023-11-01 17:49:44 -07:00
Debanjum Singh Solanky
6a0adfbfbb Default to profile picture with Initial if user has no profile picture 2023-11-01 17:49:44 -07:00
Tuan Nguyen
354605e73e
Autofocus to chat input when openning chat (#524) 2023-11-01 16:09:45 -07:00
Debanjum Singh Solanky
d92a2d03a7 Rename Files, Classes from X_To_JSONL to more appropriate X_To_Entries
These content processors are converting content into entries in DB
instead of entries in JSONL file
2023-11-01 14:51:33 -07:00
Debanjum Singh Solanky
2ad2055bcb Remove user null check in API controllers that require authentication 2023-11-01 14:38:19 -07:00
Debanjum Singh Solanky
7ac5a4766d Match spacing of navigation header pane in config vs search/chat pages 2023-11-01 14:38:19 -07:00
Debanjum Singh Solanky
2e3a4a6a9b Use Jinja macro to deduplicate navigation header HTML 2023-11-01 14:38:12 -07:00
Debanjum Singh Solanky
c631b61a81 Put colors shared by index, chat html into khoj css global variables 2023-11-01 02:13:24 -07:00
Debanjum Singh Solanky
f585a71744 Put logout, settings under dropdown menu with logged in user's profile picture
- Create dropdown menu. Put settings page, logout action under it
- Make user's profile picture the dropdown menu heading
- Create khoj.js to store shared js across web client
  It currently stores the dropdown menu open, close functionality
- Put shared styling for khoj dropdown menu under khoj.css
2023-11-01 02:13:24 -07:00
Debanjum Singh Solanky
58a7171911 Show truncated API key for identification & restrict table width
- Use a function to generate API Key table row HTML, to dedup logic
- Show delete, copy icon hints on hover
- Reduce length of copied message to not expand table width
- Truncating API key helps keep the API key table width within width
  of smaller width displays
2023-10-31 23:10:26 -07:00
Debanjum Singh Solanky
9cebd7f856 Add emoji icons to Search, Chat, Settings items in nav menu of Web client
Emoji icons have already been added to the Search, Chat and Settings
top navigation menu in the desktop client. This change adds these to
the web client as well
2023-10-31 22:38:44 -07:00
Debanjum Singh Solanky
f77336ba61 Add key icon for API keys table in Web client config page 2023-10-31 19:01:09 -07:00
Debanjum Singh Solanky
87e6b1eab9 Rename TextEmbeddings to TextEntries for improved readability
Improves readability as name has closer match to underlying
constructs
2023-10-31 18:55:59 -07:00
Debanjum Singh Solanky
bcbee05a9e Rename DbModels Embeddings, EmbeddingsAdapter to Entry, EntryAdapter
Improves readability as name has closer match to underlying
constructs

- Entry is any atomic item indexed by Khoj. This can be an org-mode
  entry, a markdown section, a PDF or Notion page etc.

- Embeddings are semantic vectors generated by the search ML model
  that encodes for meaning contained in an entries text.

- An "Entry" contains "Embeddings" vectors but also other metadata
  about the entry like filename etc.
2023-10-31 18:50:54 -07:00
sabaimran
54a387326c
[Multi-User Part 6]: Address small bugs and upstream PR comments (#518)
- 08654163cb: Add better parsing for XML files
- f3acfac7fb: Add a try/catch around the dateparser in order to avoid internal server errors in app
- 7d43cd62c0: Chunk embeddings generation in order to avoid large memory load
- e02d751eb3: Addresses comments from PR #498 
- a3f393edb4: Addresses comments from PR #503 
- 66eb078286: Addresses comments from PR #511 
- Address various items in https://github.com/khoj-ai/khoj/issues/527
2023-10-31 17:59:53 -07:00
sabaimran
5f3f6b7c61
[Multi-User Part 5]: Add a production Docker file and use a gunicorn configuration with it (#514)
- Add a productionized setup for the Khoj server using `gunicorn` with multiple workers for handling requests
- Add a new Dockerfile meant for production config at `ghcr.io/khoj-ai/khoj:prod`; the existing Docker config should remain the same
2023-10-26 13:15:31 -07:00
Debanjum
9acc722f7f
[Multi-User Part 4]: Authenticate using API Tokens (#513)
###  New
- Use API keys to authenticate from Desktop, Obsidian, Emacs clients
- Create API, UI on web app config page to CRUD API Keys
- Create user API keys table and functions to CRUD them in Database

### 🧪 Improve
- Default to better search model, [gte-small](https://huggingface.co/thenlper/gte-small), to improve search quality
- Only load chat model to GPU if enough space, throw error on load failure
- Show encoding progress, truncate headings to max chars supported
- Add instruction to create db in Django DB setup Readme

### ⚙️ Fix
- Fix error handling when configure offline chat via Web UI
- Do not warn in anon mode about Google OAuth env vars not being set
- Fix path to load static files when server started from project root
2023-10-26 12:33:03 -07:00
sabaimran
4b6ec248a6
[Multi-User Part 3]: Separate chat sesssions based on authenticated users (#511)
- Add a data model which allows us to store Conversations with users. This does a minimal lift over the current setup, where the underlying data is stored in a JSON file. This maintains parity with that configuration.
- There does _seem_ to be some regression in chat quality, which is most likely attributable to search results.

This will help us with #275. It should become much easier to maintain multiple Conversations in a given table in the backend now. We will have to do some thinking on the UI.
2023-10-26 11:37:41 -07:00
sabaimran
a8a82d274a
[Multi-User Part 2]: Add login pages and gate access to application behind login wall (#503)
- Make most routes conditional on authentication *if anonymous mode is not enabled*. If anonymous mode is enabled, it scaffolds a default user and uses that for all application interactions.
- Add a basic login page and add routes for redirecting the user if logged in
2023-10-26 10:17:29 -07:00
sabaimran
216acf545f
[Multi-User Part 1]: Enable storage of settings for plaintext files based on user account (#498)
- Partition configuration for indexing local data based on user accounts
- Store indexed data in an underlying postgres db using the `pgvector` extension
- Add migrations for all relevant user data and embeddings generation. Very little performance optimization has been done for the lookup time
- Apply filters using SQL queries
- Start removing many server-level configuration settings
- Configure GitHub test actions to run during any PR. Update the test action to run in a containerized environment with a DB.
- Update the Docker image and docker-compose.yml to work with the new application design
2023-10-26 09:42:29 -07:00
Debanjum Singh Solanky
9677eae791 Expose CLI flag to disable using GPU for offline chat model
- Offline chat models outputing gibberish when loaded onto some GPU.
  GPU support with Vulkan in GPT4All seems a bit buggy

- This change mitigates the upstream issue by allowing user to
  manually disable using GPU for offline chat

Closes #516
2023-10-25 17:51:46 -07:00
Debanjum Singh Solanky
0f1ebcae18 Upgrade to latest GPT4All. Use Mistral as default offline chat model
GPT4all now supports gguf llama.cpp chat models. Latest
GPT4All (+mistral) performs much at least 3x faster.

On Macbook Pro at ~10s response start time vs 30s-120s earlier.
Mistral is also a better chat model, although it hallucinates more
than llama-2
2023-10-22 19:04:23 -07:00
sabaimran
963cd165eb Resolve merge conflicts 2023-10-19 14:39:05 -07:00
Debanjum Singh Solanky
8346e1193c Release Khoj version 0.13.0 2023-10-18 03:43:54 -07:00
Debanjum Singh Solanky
6631fc38db Delete plaintext config via API. Catch any offline model loading exception 2023-10-18 03:37:45 -07:00
Debanjum Singh Solanky
53abd1a506 Mark sync completed on desktop client, even when no files to send
Previously Sync spinner on desktop config screen would hang when no
files to send to server & the Sync button had been manually triggered
2023-10-18 01:30:56 -07:00
Debanjum Singh Solanky
71b0012e8c Set offline chat config to default value if unset on server load 2023-10-18 00:59:43 -07:00
Debanjum Singh Solanky
cf1cdc3fe1 Disambiguate input_filter variable names in fs_syncer functions 2023-10-17 23:32:10 -07:00
Debanjum Singh Solanky
e3cd8b4150 Only index files returned by input-filter globs in fs_syncer
Ignore .org, .pdf etc. suffixed directories under `input-filter' from
being evaluated as files.

Explicitly filter results by input-filter globs to only index files,
not directory for each text type

Add test to prevent regression

Closes #448
2023-10-17 23:32:10 -07:00
Debanjum Singh Solanky
51363d280d Do not configure khoj server for pull based indexing from khoj.el
Do not make khoj server pull update index on Obsidian plugin load.
Index is updated on push from plugin instead now/
2023-10-17 21:47:19 -07:00
Debanjum Singh Solanky
d9d133dfb9 Read text files as utf-8, instead of default os locale
On Windows, the default locale isn't utf8. Khoj had regressed to
reading files in OS specified locale encoding, e.g cp1252, cp949 etc.

It now explicitly uses utf8 encoding to read text files for indexing

Resolves #495, resolves #472
2023-10-17 21:47:19 -07:00
Debanjum
3d4576ae38
Fix encoding binary files for sync from the Desktop, Obsidian client (#506)
- Fix encoding binary files like PDFs for sync from Desktop client
- Fix encoding binary files like PDFs for sync from Obsidian client
2023-10-17 15:37:22 -07:00
Debanjum Singh Solanky
c8293998d9 Fix encoding binary files like PDFs for sync from Obsidian client
Use readBinary to read binary files like PDFs instead of read
2023-10-17 15:08:30 -07:00
sabaimran
ba60c869c9 Fix encoding binary files like PDFs for sync from Desktop client
Use readFileSync, Buffer to pass appropriately formatted binary data
2023-10-17 15:08:23 -07:00
Andrew Spott
3d7381446d
Changed globbing. Now doesn't clobber a users glob if they want to a… (#496)
* Changed globbing.  Now doesn't clobber a users glob if they want to add it, but will (if just given a directory), add a recursive glob.

Note: python's glob engine doesn't support `{}` globing, a future option is to warn if that is included.

* Fix typo in globformat variable

* Use older glob pattern for plaintext files

---------

Co-authored-by: Saba <narmiabas@gmail.com>
2023-10-17 11:26:06 -07:00
sabaimran
2646c8554d Provide a default value to offline_chat configuration of the conversation processor 2023-10-17 10:35:22 -07:00
Debanjum Singh Solanky
b8976426eb Update offline chat model config schema used by Emacs, Obsidian clients
The server uses a new schema for the conversation config. The Emacs,
Obsidian clients need to use this schema to update the conversation
config
2023-10-17 07:01:35 -07:00
Debanjum
ecc6fbfeb2
Push Files to Index from Emacs, Obsidian & Desktop Clients using Multi-Part Forms (#499)
### Overview
- Add ability to push data to index from the Emacs, Obsidian client
- Switch to standard mechanism of syncing files via HTTP multi-part/form. Previously we were streaming the data as JSON
  - Benefits of new mechanism
    - No manual parsing of files to send or receive on clients or server is required as most have in-built mechanisms to send multi-part/form requests
    - The whole response is not required to be kept in memory to parse content as JSON. As individual files arrive they're automatically pushed to disk to conserve memory if required
    - Binary files don't need to be encoded on client and decoded on server

### Code Details
### Major
- Use multi-part form to receive files to index on server
- Use multi-part form to send files to index on desktop client
- Send files to index on server from the khoj.el emacs client
  - Send content for indexing on server at a regular interval from khoj.el
- Send files to index on server from the khoj obsidian client
- Update tests to test multi-part/form method of pushing files to index

#### Minor
- Put indexer API endpoint under /api path segment
- Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method
- Improve emoji, message on content index updated via logger
- Don't call khoj server on khoj.el load, only once khoj invoked explicitly by user
- Improve indexing of binary files
  - Let fs_syncer pass PDF files directly as binary before indexing
  - Use encoding of each file set in indexer request to read file 
- Add CORS policy to khoj server. Allow requests from khoj apps, obsidian & localhost
- Update indexer API endpoint URL to` index/update` from `indexer/batch`

Resolves #471 #243
2023-10-17 06:05:15 -07:00
Debanjum Singh Solanky
6a4f1b2188 Add more client, request details in logs by index/update API endpoint 2023-10-17 05:43:29 -07:00
Debanjum Singh Solanky
5efae1ad55 Update indexer API endpoint query params for force, content type
New URL query params, `force' and `t' match name of query parameter in
existing Khoj API endpoints

Update Desktop, Obsidian and Emacs client to call using these new API
query params. Set `client' query param from each client for telemetry
visibility
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
84654ffc5d Update indexer API endpoint URL to index/update from indexer/batch
New URL follows action oriented endpoint naming convention used for
other Khoj API endpoints

Update desktop, obsidian and emacs client to call this new API
endpoint
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
e347823ff4 Log telemetry for index updates via push to API endpoint 2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
05be6bd877 Clicking Update Index in Obsidian settings should push files to index
Use the indexer/batch API endpoint to regenerate content index rather
than the previous pull based content indexing API endpoint
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
13a3122bf3 Stop configuring server to pull files to index from Obsidian client
Obsidian client now pushes vault files to index instead
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
99a2c934a3 Add CORS policy to allow requests from khoj apps, obsidian & localhost
Using fetch from Khoj Obsidian plugin was failing due to cross-origin
request and method: no-cors didn't allow passing x-api-key custom
header. And using Obsidian's request with multi-part/form-data wasn't
possible either.
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
541cd59a49 Let fs_syncer pass PDF files directly as binary before indexing
No need to do unneeded base64 encoding/decoding to pass pdf contents
for indexing from fs_syncer to pdf_to_jsonl
2023-10-17 04:58:13 -07:00
Debanjum Singh Solanky
d27dc71dfe Use encoding of each file set in indexer request to read file
Get encoding type from multi-part/form-request body for each file
Read text files as utf-8 and pdfs, images as binary
2023-10-17 04:58:12 -07:00
Debanjum Singh Solanky
8e627a5809 Pass any files to be deleted to indexer API via Khoj Obsidian plugin
- Keep state of previously synced files to identify files to be deleted
- Last synced files stored in settings for persistence of this data
  across Obsidian reboots
2023-10-17 03:34:49 -07:00
Debanjum Singh Solanky
f2e293a149 Push Vault files to index to Khoj server using Khoj Obsidian plugin
Use the multi-part/form-data request to sync Markdown, PDF files in
vault to index on khoj server

Run scheduled job to push updates to value for indexing every 1 hour
2023-10-17 03:05:30 -07:00
Debanjum Singh Solanky
6baaaaf91a Test request body of multi-part form to update content index from khoj.el 2023-10-16 23:54:32 -07:00
Debanjum Singh Solanky
79b3f8273a Make khoj.el send files to be deleted from index to server 2023-10-16 23:53:02 -07:00
Debanjum Singh Solanky
f64fa06e22 Initialize the Khoj Transient menu on first run instead of load
This prevents Khoj from polling the Khoj server until explicitly
invoked via `khoj' entrypoint function.

Previously it'd make a request to the khoj server every time Emacs or
khoj.el was loaded

Closes #243
2023-10-16 19:11:46 -07:00
Debanjum
b4949f7f0b
Improve Offline Chat Model Experience (#494)
- Make offline chat model user configurable. Use `filename` of any [GPT4All supported  model](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models.json) like below:
- Run GPT4All Chat Model on GPU, when available via [GPT4All Vulcan support](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan)
- Use default Llama 2 supported by GPT4All
- Make `tokenizer` and `max-prompt-size` of chat model user configurable. E.g When using chat models not in [this pre-defined list](https://github.com/khoj-ai/khoj/blob/master/src/khoj/processor/conversation/utils.py) that support larger context window or a different tokenizer.

Closes #406, #418
2023-10-16 17:44:49 -07:00
Debanjum Singh Solanky
644c3b787f Scale no. of chat history messages to use as context with max_prompt_size
Previously lookback turns was set to a static 2. But now that we
support more chat models, their prompt size vary considerably.

Make lookback_turns proportional to max_prompt_size. The truncate_messages
can remove messages if they exceed max_prompt_size later

This lets Khoj pass more of the chat history as context for models
with larger context window
2023-10-16 17:22:28 -07:00
Debanjum Singh Solanky
df1d74a879 Use max_prompt_size, tokenizer from config for chat model context stuffing 2023-10-15 16:52:53 -07:00
Debanjum Singh Solanky
116595b351 Use chat_model specified in new offline_chat section of config
- Dedupe offline_chat_model variable. Only reference offline chat
  model stored under offline_chat. Delete the previous chat_model
  field under GPT4AllProcessorConfig

- Set offline chat model to use via config/offline_chat API endpoint
2023-10-15 16:37:49 -07:00
Debanjum Singh Solanky
feb4f17e3d Update chat config schema. Make max_prompt, chat tokenizer configurable
This provides flexibility to use non 1st party supported chat models

- Create migration script to update khoj.yml config
  - Put `enable_offline_chat' under new `offline-chat' section
    Referring code needs to be updated to accomodate this change
  - Move `offline_chat_model' to `chat-model' under new `offline-chat' section
  - Put chat `tokenizer` under new `offline-chat' section
  - Put `max_prompt' under existing `conversation' section
    As `max_prompt' size effects both openai and offline chat models
2023-10-15 16:35:11 -07:00
sabaimran
c125995d94
[Multi-User]: Part 0 - Add support for logging in with Google (#487)
* Add concept of user authentication to the request session via GoogleUser
2023-10-14 19:39:13 -07:00
Debanjum Singh Solanky
247e75595c Use AutoTokenizer to support more tokenizers 2023-10-14 16:54:52 -07:00
Saba
ff2dbadc9d Use computed plaintext_content to set file content rather than calling f.read again 2023-10-14 13:28:34 -07:00
Debanjum Singh Solanky
1ad8b150e8 Add default tokenizer, max_prompt as fallback for non-default offline chat models
Pass user configured chat model as argument to use by converse_offline

The proper fix for this would allow users to configure the max_prompt
and tokenizer to use (while supplying default ones, if none provided)
For now, this is a reasonable start.
2023-10-13 22:48:56 -07:00
Debanjum Singh Solanky
56bd69d5af Improve Llama v2 extract questions actor and associated prompt
- Format extract questions prompt format with newlines and whitespaces
- Make llama v2 extract questions prompt consistent

- Remove empty questions extracted by offline extract_questions actor
- Update implicit qs extraction unit test for offline search actor
2023-10-13 22:48:56 -07:00
sabaimran
09bb3686cc
Strip the incoming query from the slash conversation command (#500)
* Strip the incoming query from the slash conversation command before passing it to the model or for search
* Return q when content index not loaded
* Remove -n 4 from pytest ini configuration to isolate test failures
2023-10-13 21:11:23 -07:00
Debanjum Singh Solanky
96c0b21285 Sync desktop app package.json with other Khoj clients metadata
- Make `bump_version.sh' script set version for the Khoj desktop app too
- Sync Khoj desktop app authors, license, description and version with
  the other interfaces and server
- Update description in packages metadata to match project subtitle on Github
2023-10-13 20:43:55 -07:00
sabaimran
80fb56b8a5 Sync deksktop app package version with the other releases 2023-10-13 19:23:00 -07:00
Debanjum Singh Solanky
b669aa2395 Clean and fix the content indexing code in the Emacs client
- Pass payloads as unibyte. This was causing the request to fail for
  files with unicode characters
- Suppress messages with file content in on index updates
- Fix rendering response from server on index update API call
- Extract code to populate body of index update HTTP request with files
2023-10-13 18:00:37 -07:00
Debanjum Singh Solanky
bea196aa30 Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method
Previously global state of `url-request-method' would affect the
kind of request made to api/config/data API endpoint as it wasn't
being explicitly being set before calling the API endpoint

This was done with the assumption that the default value of GET for
url-request-method wouldn't change globally

But in some cases, experientially, it can get changed. This was
resulting in khoj.el load failing as POST request was being made
instead which would throw error
2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky
292f0420ad Send content for indexing on server at a regular interval from khoj.el
- Allow indexing frequency to be configurable by user
- Ensure there is only one khoj indexing timer running
2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky
fc99431754 Send files to index on server from the khoj.el emacs client
- Add elisp variable to set API key to engage with the Khoj server
- Use multi-part form to POST the files to index to the indexer API
  endpoint on the khoj server
2023-10-12 20:58:52 -07:00
Debanjum Singh Solanky
68018ef397 Use multi-part form to send files to index on desktop client
- Add typing for variables in for loop and other minor formatting clean-up
- Assume utf8 encoding for text files and binary for image, pdf files
2023-10-12 20:58:49 -07:00
Debanjum Singh Solanky
7190b3811d Remove all filter terms in user query from defiltered_query
Previously only the the last filter's terms were getting effectively
applied as the `filter.defilter' operation was being done on
`user_query' but was updating the `defiltered_query'
2023-10-12 20:56:17 -07:00
Debanjum Singh Solanky
60e9a61647 Use multi-part form to receive files to index on server
- This uses existing HTTP affordance to process files
  - Better handling of binary file formats as removes need to url encode/decode
  - Less memory utilization than streaming json as files get
    automatically written to disk once memory utilization exceeds preset limits
  - No manual parsing of raw files streams required
2023-10-11 23:58:23 -07:00
Debanjum Singh Solanky
9ba173bc2d Improve emoji, message on content index updated via logger
Use mailbox closed with flag down once content index completed.

Use standard, existing logger messages in new indexer messages, when
files to index sent by clients
2023-10-11 17:12:03 -07:00
Debanjum Singh Solanky
6aa69da3ef Put indexer API endpoint under /api path segment
Update FastAPI app router, desktop app and to use new url path to
batch indexer API endpoint

All api endpoints should exist under /api path segment
2023-10-09 21:35:58 -07:00
Debanjum Singh Solanky
f6f7a62d80 Wait for user to stop typing to trigger search from khoj.el in Emacs
- Improves user experience by aligning idle time with search latency
  to avoid display jitter (to render results) while user is typing

- Makes the idle time configurable

Closes #480
2023-10-06 12:44:45 -07:00
sabaimran
5c4f0d42b7 Return new default config in API endpoint 2023-10-06 12:30:09 -07:00
sabaimran
052b25af0a Update default configuration passed to Khoj clients to circumvent valiation issues 2023-10-06 12:29:15 -07:00
Debanjum Singh Solanky
a85ff941ca Make offline chat model user configurable
Only GPT4All supported Llama v2 models will work given the prompt
structure is not currently configurable
2023-10-04 20:41:14 -07:00
Debanjum Singh Solanky
d1ff812021 Run GPT4All Chat Model on GPU, when available
GPT4All now supports running models on GPU via Vulkan
2023-10-04 18:42:12 -07:00
Debanjum Singh Solanky
13b16a4364 Use default Llama 2 supported by GPT4All
Remove custom logic to download custom Llama 2 model.
This was added as GPT4All didn't support Llama 2 when it was added to Khoj
2023-10-03 19:01:54 -07:00
sabaimran
4a5ed7f06c
Update Khoj package version for Electron, Desktop app (#492)
* Address package upgrade for Electron application
* Update package version for Electron desktop application
2023-10-03 12:21:32 -07:00
sabaimran
3f962a55c3
Fix Linux Desktop Application (#491)
* Use separate functions for adding files and folders to configuration for indexing
* Add a loading bar while data is syncing
* Bump the minor version for the application
2023-10-03 11:43:19 -07:00
sabaimran
63b3696af0 Release Khoj version 0.12.3 2023-09-26 22:41:11 -07:00
sabaimran
d2f9bca1cf Fix null ref issue in query method and update logic for determining whether khoj is already configured in obsidian 2023-09-26 22:33:44 -07:00
sabaimran
2f18383349 Release Khoj version 0.12.2 2023-09-26 11:59:47 -07:00
sabaimran
588f35b6e9 Add max prompt size for gpt-3.5-turbo-16k 2023-09-26 10:57:35 -07:00
sabaimran
4e370d7a18 Release Khoj version 0.12.1 2023-09-26 09:24:53 -07:00
sabaimran
3675aa348a Update naming of Khoj in manifest.json for Obsidian 2023-09-26 09:24:36 -07:00
sabaimran
a82d1becc3 Release Khoj version 0.12.0 2023-09-26 09:17:56 -07:00
sabaimran
38f0df3d53 Remove unused icons from electron app folder 2023-09-26 07:56:29 -07:00
sabaimran
5e16074b92 Fix comparison for search type in plugins mode 2023-09-25 10:57:17 -07:00
sabaimran
2dd15e9f63
Resolve issues with GPT4All and fix prompt for yesterday extract questions date filter (#483)
- GPT4All integration had ceased working with 0.1.7 specification. Update to use 1.0.12. At a later date, we should also use first party support for llama v2 via gpt4all
- Update the system prompt for the extract_questions flow to add start and end date to the yesterday date filter example.
- Update all setup data in conftest.py to use new client-server indexing pattern
2023-09-18 14:41:26 -07:00
sabaimran
b225d1188c Fix formatting of gpt.py 2023-09-18 11:09:02 -07:00
Jonny-GM
34b202b868
More lenient date searching (#481)
* Modify DateFilter to use compiled entry key
* Instruct search to include date in query
* Minor prompt change
* Prompt fix
2023-09-18 10:46:00 -07:00
sabaimran
16874e1953 Provide force fallback for regeneration 2023-09-12 16:35:07 -07:00
sabaimran
9f42a1a036 Propagate flags to configure index command 2023-09-11 10:33:44 -07:00
sabaimran
343854752c
Improve docker builds for local hosting (#476)
* Remove GPT4All dependency in pyproject.toml and use multiplatform builds in the dockerization setup in GH actions
* Move configure_search method into indexer
* Add conditional installation for gpt4all
* Add hint to go to localhost:42110 in the docs. Addresses #477
2023-09-08 17:07:26 -07:00
sabaimran
dccfae3853
Remove PySide dependency and deprecate desktop builds (#475)
* Remove PySide, gui option from code
* Remove pyside 6 dependency from code
* Remove workflows which build desktop applications
* Update unit tests and update line in documentation
* Remove additional references to pyinstaller, gui
* Add uninstall steps to normal uninstall instructions
2023-09-07 11:36:27 -07:00
sabaimran
76562f4250
Add front-end Electron application for Khoj local file syncing (#473)
* Initial version - setup a file-push architecture for generating embeddings with Khoj
* Use state.host and state.port for configuring the URL for the indexer
* Fix parsing of PDF files
* Read markdown files from streamed data and update unit tests
* On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system
* Init: refactor indexer/batch endpoint to support a generic file ingestion format
* Add features to better support indexing from files sent by the desktop client
* Initial commit with Electron application
- Adds electron app
* Add import for pymupdf, remove import for pypdf
* Allow user to configure khoj host URL
* Remove search type configuration from index.html
* Use v1 path for current indexer routes
2023-09-06 12:04:18 -07:00
bholagabbar
205dc90746
Fix notion title bug (#474)
* Update notion_to_jsonl.py
* Fix try-catch block
2023-09-05 10:47:42 -07:00
sabaimran
4854258047
Move to a push-first model for retrieving embeddings from local files (#457)
* Initial version - setup a file-push architecture for generating embeddings with Khoj
* Update unit tests to fix with new application design
* Allow configure server to be called without regenerating the index; this no longer works because the API for indexing files is not up in time for the server to send a request
* Use state.host and state.port for configuring the URL for the indexer
* On application startup, load in embeddings from configurations files, rather than regenerating the corpus based on file system
2023-08-31 12:55:17 -07:00
sabaimran
92cbfef7ab Skip plaintext file indexing if there's a parsing issue and log the file 2023-08-29 14:34:08 -07:00
sabaimran
74409c2c64 Release Khoj version 0.11.4 2023-08-29 11:44:35 -07:00
sabaimran
1b85958bcc trim chat input start 2023-08-28 19:18:10 -07:00
sabaimran
e592f6eac8 Release Khoj version 0.11.3 2023-08-28 14:46:03 -07:00
sabaimran
7c35da9fc4 Fix bug in /chat endpoint for general and update depdendencies 2023-08-28 14:12:11 -07:00
sabaimran
bc09143856 Release Khoj version 0.11.2 2023-08-28 10:16:13 -07:00
Debanjum Singh Solanky
01b310635e Enable passing search query filters via chat and test it 2023-08-28 09:24:32 -07:00
Debanjum Singh Solanky
794bad8bcb Make date_filter.extract_date_range method always return a list type 2023-08-28 00:55:28 -07:00
Debanjum Singh Solanky
d5a2de6222 Add method to extract filter terms from query to all filters
- Test the get_filter_term method in all 3 word, file, date filters
- Make the existing can_filter method by default in base filter abstract class
2023-08-28 00:55:28 -07:00
Debanjum
150105505b
Add Default chat command. Make Khoj ask clarifying questions (#468)
- Make Khoj ask clarifying questions when answer not in provided context
- Add default conversation command to auto switch b/w general, notes modes
- Show filtered list of commands available with the currently input text
- Use general prompt when no references found and not in Notes mode
- Test general and notes slash commands in offline chat director tests
2023-08-28 00:52:57 -07:00
Debanjum Singh Solanky
eb6cd4f8d0 Use general prompt when no references found and not in Notes mode 2023-08-28 00:47:02 -07:00
Debanjum Singh Solanky
edffbad837 Make Khoj ask clarifying questions when answer not in provided context
Previously it would just refuse ask for clarification. This improves
the chat quality score for the existing director tests
2023-08-28 00:47:02 -07:00
Debanjum Singh Solanky
75c1016ec0 Show filtered list of commands available with the currently input text 2023-08-28 00:46:10 -07:00
Debanjum Singh Solanky
74605f6159 Add default conversation command to auto switch b/w general, notes modes
This was the default behavior but behavior regressed when adding slash
commands in PR #463
2023-08-28 00:46:10 -07:00
sabaimran
cbc978ea08 Update help links for notion, github to point to the main docs 2023-08-27 15:02:55 -07:00
sabaimran
b45e1d8c0d
Fix plaintext HTML parsing and rendering (#464)
* Store conversation command options in an Enum
* Move to slash commands instead of using @ to specify general commands
* Calculate conversation command once & pass it as arg to child funcs
* Add /notes command to respond using only knowledge base as context
This prevents the chat model to try respond using it's general world
knowledge only without any references pulled from the indexed
knowledge base
* Test general and notes slash commands in openai chat director tests
---------

Co-authored-by: Debanjum Singh Solanky <debanjum@gmail.com>
2023-08-27 11:24:30 -07:00
Debanjum
7919787fb7
Use Slash Commands and Add Notes Slash Command (#463)
* Store conversation command options in an Enum

* Move to slash commands instead of using @ to specify general commands

* Calculate conversation command once & pass it as arg to child funcs

* Add /notes command to respond using only knowledge base as context

This prevents the chat model to try respond using it's general world
knowledge only without any references pulled from the indexed
knowledge base

* Test general and notes slash commands in openai chat director tests

* Update gpt4all tests to use md configuration

* Add a /help tooltip

* Add dynamic support for describing slash commands. Remove default and treat notes as the default type

---------

Co-authored-by: sabaimran <narmiabas@gmail.com>
2023-08-26 18:11:18 -07:00
sabaimran
e64357698d
Skip indexing single bad markdown, plaintext file (#460) 2023-08-23 15:34:56 -07:00
sabaimran
84bd579077 Format the chat outputted message with code, bolding, or italics. Add a copy button for code. Closes #445. 2023-08-19 20:02:57 -07:00
sabaimran
f9e09ba490 Do not try downloading model from GPT4All if the user is not connected to the internet 2023-08-19 19:09:21 -07:00
Debanjum Singh Solanky
3ff4e19dd2 Release Khoj version 0.11.1 2023-08-16 22:53:29 -07:00
sabaimran
4fb8c2c5e1 Pass a SIGTERM to tell the uvicorn server to exit and gracefully kill the thread 2023-08-16 21:27:05 -07:00
sabaimran
4e03dfea43
Attach the parent to the server thread, allowing the kill signal to trigger a graceful exit (#446) 2023-08-16 19:36:10 -07:00
Debanjum Singh Solanky
26c3977fb9 Remove info hint to reindex khoj on unexpected search results
The index corruption was issue resolved a while ago in #325 and
hasn't cropped up again
2023-08-16 00:58:59 -07:00
sabaimran
def909a913
Revert "Open Web interface within Desktop app in GUI mode" (#444) 2023-08-15 23:26:28 -07:00
sabaimran
6562ec6531 Release Khoj version 0.11.0 2023-08-14 19:25:03 -07:00
sabaimran
0ea901c7c1
Allow indexing to continue even if there's an issue parsing a particular org file (#430)
* Allow indexing to continue even if there's an issue parsing a particular org file
* Use approximation in pytorch comparison in text_search UT, skip additional file parser errors for org files
* Change error of expected failure
2023-08-14 07:56:33 -07:00
sabaimran
7b907add77
Add support for indexing plaintext files (#420)
* Add support for indexing plaintext files
- Adds backend support for parsing plaintext files generically (.html, .txt, .xml, .csv, .md)
- Add equivalent frontend views for setting up plaintext file indexing
- Update config, rawconfig, default config, search API, setup endpoints
* Add a nifty plaintext file icon to configure plaintext files in the Web UI
* Use generic glob path for plaintext files. Skip indexing files that aren't in whitelist
2023-08-09 15:44:40 -07:00
Ellen7ions
26bddcb65c
Add support for starting a new line with shift-enter (#412)
* Add support for starting a new line with shift-enter
* Remove useless comments. Set font-size: medium.
* Update src/khoj/interface/web/chat.html
Update the styling to have the padding, margin and line-height like before.
Co-authored-by: Debanjum <debanjum@gmail.com>
* Update src/khoj/interface/web/chat.html
Make the chat-body scroll to the bottom after resizing
Co-authored-by: Debanjum <debanjum@gmail.com>
---------
Co-authored-by: Debanjum <debanjum@gmail.com>
2023-08-07 19:49:07 -07:00
Debanjum Singh Solanky
97609e4995 Use 500px png of khoj logo instead svg for much smaller asset size
The khoj logo svg was 1.3Mb. The 500px png of it is 38Kb.
Given all usage of khoj-logo are below 230px this should work fine
2023-08-07 18:27:11 -07:00
Debanjum
14a816d173
Open Web interface within Desktop app in GUI mode (#429)
Previously the GUI mode (with khoj --gui or using the desktop app) would open the web interface in the users default web browser. Now the web interface is just rendered within the app itself using PyQT's Webview. This gives it a more proper app like feel
2023-08-07 17:48:30 -07:00
Debanjum Singh Solanky
378b96ec1b Open the khoj app window maximized on startup 2023-08-07 15:39:05 -07:00
Debanjum Singh Solanky
ea734ba1c8 Open app in native view on starting it in GUI mode instead of on web browser
- Opens settings page on first run and landing page after in GUI mode
  Previously was only opening the GUI on linux after first run as it
  doesn't have a system tray
- Both the views are from the web interface but are rendered within
  the app instead of the browser
2023-08-07 13:41:42 -07:00
Debanjum Singh Solanky
9c494705a8 Open the search, chat or config view in app from the system tray menu 2023-08-07 13:41:42 -07:00
Debanjum Singh Solanky
cc36b87345 Render the web interface directly within the desktop app as a webview 2023-08-07 13:41:12 -07:00
Jason Qin
3ef1b7073d
Update obsidian/manifest.json
Closes #426
2023-08-07 10:41:39 -07:00
sabaimran
738cf650b3
Explicitly set Khoj to use the default locale of the user (#425)
- Explicitly set locale using `locale.setLocale(locale.LC_ALL, '')` for localization. Relevant for datetime libraries. See [Python 3 documentation](https://docs.python.org/3/library/locale.html#locale.setlocale).
2023-08-07 09:23:24 -07:00
Muftawo
c8ef619090
fixed reference link to landing page (#417)
* Fixed zsh error no matches found
* Fixed home page 404 error
2023-08-04 10:38:14 -07:00
sabaimran
78012b8111 Avoid null ref issue when setting model state for web UI. Closes #410 2023-08-03 00:39:06 -07:00
sabaimran
0baed742e4
Add checksums to verify the correct model is downloaded as expected (#405)
* Add checksums to verify the correct model is downloaded as expected
- This should help debug issues related to corrupted model download
- If download fails, let the application continue
* If the model is not download as expected, add some indicators in the settings UI
* Add exc_info to error log if/when download fails for llamav2 model
* Simplify checksum checking logic, update key name in model state for web client
2023-08-02 23:26:52 -07:00
Debanjum Singh Solanky
e6e3acdbe4 Release Khoj version 0.10.1 2023-08-01 23:55:13 -07:00
Debanjum Singh Solanky
7c1d70aa17 Bump GPT4All response generation batch size to 512 from 256
A batch size of 512 performs ~20% better on a XPS with no GPU and 16Gb
RAM. Seems worth the tradeoff for now
2023-08-01 23:34:02 -07:00
Debanjum
16c6bfce8e
Improve Quality and Reliability of Offline Chat (#393)
# Incoming
## Major
### Fix Prompt Size Exceeded Issue
- Fix issues related to prompt size, Closes #386. Use the correct tokenizer to calculate whether the input needs to be truncated or not.

### Improve Llama 2 Model Download
- Use the correct download link for LlamaV2 -- should have been using the small model, but was using the medium
- Add better downloading logic to retry download if it failed, Closes #379 

### Fix Segmentation Fault due to Race
- Add a lock around generating chat responses from the offline model to avoid segmentation faults. Closes #367.
- Add a loading symbol to the web chat UI when the model is thinking. Closes #392

### Improve Chat Response Latency
- Improve performance of offline chat by increasing batch size (via `n_batch`) to automatically engage more cores/GPU, using smaller model and fixing prompt vs response token generation numbers. Closes #363

### Fix Fake Dialogue Continuation
- Fix formatting of user query with offline chat, this was contributing to #398
- Stop Llama 2 from Creating Fake Dialogue Continuations. Closes #398

## Minor
- Improve default message for Chat window on web when it's not configured. Include hint to use offline chat.
- Add null check in `perform_chat_checks` method
- Add offline chat director unit tests

## Performance Analysis (Time to First Token)
|  | v0.10.0 | this branch |
|-|-|-|
| Query 1 | 52s | 28s |
| Query 2 | 33s| 42s |
| Query 3 | 67s| 38s|
2023-08-01 22:07:27 -07:00
Debanjum Singh Solanky
44292afff2 Put offline model response generation behind the chat lock as well
Not just the chat response streaming
2023-08-01 21:53:52 -07:00
Debanjum Singh Solanky
1812473d27 Extract new schema version for each migration script into a variable
This should ease readability, indicates which version this
migration script will update the schema to once applied
2023-08-01 21:41:08 -07:00
Debanjum Singh Solanky
b9937549aa Simplify migration scripts management. Make them use static version
- Only make them update config when it's run conditions are satisfies
- Use static schema version to simplify reasoning about run conditions
2023-08-01 21:28:20 -07:00
Debanjum Singh Solanky
185a1fbed7 Remove old chat setup timer. It is mislabelled, irrelevant since streaming 2023-08-01 20:52:00 -07:00
Debanjum Singh Solanky
c2b7a14ed5 Fix context, response size for Llama 2 to stay within max token limits
Create regression text to ensure it does not throw the prompt size
exceeded context window error
2023-08-01 20:52:00 -07:00
Debanjum Singh Solanky
6e4050fa81 Make Llama 2 stop generating response on hitting specified stop words
It would previously some times start generating fake dialogue with
it's internal prompt patterns of <s>[INST] in responses.

This is a jarring experience. Stop generation response when hit <s>

Resolves #398
2023-08-01 20:52:00 -07:00
Debanjum Singh Solanky
aa6846395d Fix offline model migration script to run for version < 0.10.1
- Use same batch_size in extract question actor as the chat actor
- Log final location the chat model is to be stored in, instead of
  it's temp filename while it is being downloaded
2023-08-01 20:51:53 -07:00
Ikko Eltociear Ashimine
49abb9df9c
Fix typo in orgnode.py (#397)
Fix spelling of Ouput in org parser property drawer comment to Output.
2023-08-01 19:54:57 -07:00
sabaimran
f409e16137 Update some of the extract question prompts for llamav2 2023-08-01 12:23:36 -07:00
sabaimran
b11b00a9ff Add log line for time to first response 2023-08-01 10:57:38 -07:00
sabaimran
778df6be71 Add a logline when the offline model migration script runs 2023-08-01 09:27:42 -07:00
sabaimran
3a5d93d673 Add migration script for getting the new offline model 2023-08-01 09:25:05 -07:00
sabaimran
90efc2ea7a Update comments and add explanations 2023-08-01 09:24:03 -07:00
sabaimran
f7e03f6d63 Switch spinner snake case -> camel case 2023-08-01 08:52:25 -07:00
sabaimran
1c52a6993f add a lock around chat operations to prevent the offline model from getting bombarded and stealing a bunch of compute resources
- This also solves #367
2023-08-01 00:23:17 -07:00
sabaimran
6c3074061b Disable the input bar when chat response is in flight 2023-08-01 00:21:39 -07:00
sabaimran
c14cbe926a Add a loading symbol to web chat. Closes #392 2023-07-31 23:35:48 -07:00
sabaimran
8054bdc896 Use n_batch parameter to increase resource consumption on host machine (and implicitly engage GPU) 2023-07-31 23:25:08 -07:00
sabaimran
e55e9a7b67 Fix unit tests and truncation logic 2023-07-31 21:37:59 -07:00
sabaimran
2335f11b00 Add better error handling for download processes incase of failure 2023-07-31 21:07:38 -07:00
sabaimran
209975e065 Resolve merge conflicts: let Khoj fail if the model tokenizer is not found 2023-07-31 19:12:26 -07:00
sabaimran
2d6c3cd4fa Misc. quality improvements for Llama V2
- Fix download url -- was mapping to q3_K_M, but fixed to use q4_K_S
- Use a proper Llama Tokenizer for counting tokens for truncation with Llama
- Add additional null checks when running
2023-07-31 19:11:20 -07:00
sabaimran
ca195097d7 Update chat hint message at first run 2023-07-31 17:46:09 -07:00
Debanjum Singh Solanky
ded606c7cb Fix format of user query during general conversation with Llama 2 2023-07-31 17:21:14 -07:00
Debanjum Singh Solanky
48e5ac0169 Do not drop system message when truncating context to max prompt size
Previously the system message was getting dropped when the context
size with chat history would be more than the max prompt size
supported by the cat model

Now only the previous chat messages are dropped or the current
message is truncated but the system message is kept to provide
guidance to the chat model
2023-07-31 17:21:14 -07:00
sabaimran
88ef86ad5c
Fix typing issues for mypy (#372) 2023-07-30 19:27:48 -07:00
sabaimran
ca2c942b65 Add typing to compiled_references and inferred_queries 2023-07-30 19:10:30 -07:00
sabaimran
3646fd1449 Add a warning to indicate that Khoj is not configured to work with personal data sources 2023-07-30 18:52:10 -07:00
sabaimran
996832dc72 Allow user to chat even if content types aren't configured - use empty references 2023-07-30 18:47:45 -07:00
Debanjum Singh Solanky
53810a0ff7 Create khoj config dir if non-existant, before writing to khoj env file 2023-07-30 01:35:36 -07:00
sabaimran
f65d157244 Release Khoj version 0.10.0 2023-07-28 19:27:47 -07:00
Debanjum Singh Solanky
f76af869f1 Do not log the gpt4all chat response stream in khoj backend
Stream floods stdout and does not provide useful info to user
2023-07-28 19:14:04 -07:00
sabaimran
5ccb01343e
Add Offline chat to Obsidian (#359)
* Add support for configuring/using offline chat from within Obsidian
* Fix type checking for search type
* If Github is not configured, /update call should fail
* Fix regenerate tests same as the update ones
* Update help text for offline chat in obsidian
* Update relevant description for Khoj settings in Obsidian
* Simplify configuration logic and use smarter defaults
2023-07-28 18:47:56 -07:00
Debanjum
b3c1507708
Merge pull request #361 from khoj-ai/configure-offline-chat-from-emacs
- Configure using Offline Chat from Emacs: 
- Enable, Disable Offline Chat from Emacs

- Use: Enable offline chat with `(setq khoj-chat-offline t)' during khoj setup
- Benefits: Offline chat models are better for privacy but not great at answering questions
2023-07-28 18:06:58 -07:00
sabaimran
9f78db0579
Let Offline chat override OpenAI API settings (#362)
* Let Offline chat override OpenAI API settings
* Download the offline model whenever offline chat is enabled
* Add progressbar for download for llamav2 model to track progress
* Change ordering of n due to switch of default processor
* Flip ordering of offline/openai checks when extracting questions from query
2023-07-28 17:26:20 -07:00
Debanjum Singh Solanky
ebfbef1f68 Configure using offline chat from Emacs
Closes #358
2023-07-28 16:07:33 -07:00
sabaimran
29081f4429 Adjust parameters for offline chat 2023-07-27 22:22:09 -07:00
sabaimran
124d97c26d
Replace Falcon 🦅 model with Llama V2 🦙 for offline chat (#352)
* Working example with LlamaV2 running locally on my machine

- Download from huggingface
- Plug in to GPT4All
- Update prompts to fit the llama format

* Add appropriate prompts for extracting questions based on a query based on llama format

* Rename Falcon to Llama and make some improvements to the extract_questions flow

* Do further tuning to extract question prompts and unit tests

* Disable extracting questions dynamically from Llama, as results are still unreliable
2023-07-27 20:51:20 -07:00
Debanjum Singh Solanky
715d56d4f0 Use new schema to update khoj.yml config from khoj.el 2023-07-26 17:34:16 -07:00
sabaimran
8b2af0b5ef
Add support for our first Local LLM 🤖🏠 (#330)
* Add support for gpt4all's falcon model as an additional conversation processor
- Update the UI pages to allow the user to point to the new endpoints for GPT
- Update the internal schemas to support both GPT4 models and OpenAI
- Add unit tests benchmarking some of the Falcon performance
* Add exc_info to include stack trace in error logs for text processors
* Pull shared functions into utils.py to be used across gpt4 and gpt
* Add migration for new processor conversation schema
* Skip GPT4All actor tests due to typing issues
* Fix Obsidian processor configuration in auto-configure flow
* Rename enable_local_llm to enable_offline_chat
2023-07-26 16:27:08 -07:00
sabaimran
23d77ee338
Fix import issues in desktop image builds (#343) 2023-07-26 15:45:52 -07:00
Debanjum Singh Solanky
7722a9c347 Default to using the gpt-3.5-turbo model for chat from khoj.el 2023-07-22 00:29:26 -07:00
Debanjum Singh Solanky
f0d4a4cf9a Revert "Make configure_content functional. Do not pass content index state to it."
This reverts commit 2ddee7e745 as it
broke partial updates of the content index for just the specified
content types
2023-07-21 13:59:09 -07:00
sabaimran
82c725817e Merge branch 'master' of github.com:khoj-ai/khoj 2023-07-21 13:24:05 -07:00
sabaimran
596e11ec6d Use the same function for computing entries for IDs regardless of whether it has prev entries 2023-07-21 13:23:56 -07:00
Debanjum Singh Solanky
2ddee7e745 Make configure_content functional. Do not pass content index state to it. 2023-07-20 23:24:08 -07:00
sabaimran
1610d2ebd9
📝 Add a documentation base for Khoj! (#333)
* Add docs for more organized, accessible information detailing Khoj setup
* Delete duplicated files
* Add a coverpage without enabling it. Add logo and theme
* Remove obsidian README.md
* Add plausible script to index.html via docsify
2023-07-20 22:34:25 -07:00
Debanjum Singh Solanky
3e59be7f1d Release Khoj version 0.9.0 2023-07-18 19:59:27 -07:00
Debanjum Singh Solanky
d078e7b1f6 Clean up search type usage in khoj server, tests and Readme 2023-07-18 19:57:55 -07:00
Debanjum Singh Solanky
4d910936b7 Fix triggering index update on khoj server from khoj.el 2023-07-18 19:57:54 -07:00
Debanjum Singh Solanky
5c7d7f558d Make AI model used for Khoj chat configurable from khoj.el
- Fix bug. Set the unused model-name to a standad default value
2023-07-18 19:57:54 -07:00
Debanjum
5f2be2a9bb
Merge pull request #298 from HyunggyuJang/patch-1
Encode config as utf-8 during setup in khoj.el. This will allow utf-8 encoded files etc to be passed in config
2023-07-18 17:54:11 -07:00
Debanjum Singh Solanky
429e1b4b48 Regenerate index to apply corruption fixes on first run of new khoj 2023-07-18 16:10:47 -07:00
Debanjum Singh Solanky
83e1088d42 Manage khoj.yml config migrations on app start. Version the schema
- Add version to khoj.yml schema
  Versioning the khoj.yml config schema will simplify future migrations
2023-07-18 16:10:10 -07:00
Debanjum Singh Solanky
71e8ddd9a2 Check if PDF is configured before showing it as an option in khoj.el 2023-07-17 15:49:20 -07:00
Debanjum
d00c5da8b7
Merge pull request #325 from khoj-ai/stablize-simplify-content-indexing
## Stabilize and Simplify Content Indexing

### Major Updates
- 9bcca43 Unify logic to update entries when indexing from scratch or incrementally
- 89c7819 Unify logic to update embeddings when indexing from scratch or incrementally
- 6a0297c Stable sort new entries when marking entries for update
- 58d86d7 Unify logic to configure server from API or on server start
- Create tests to ensure old entries, embeddings in index are unaffected on adding new entries
  - Refer: 1482fd4, 7669b85, 88d1a29 
  - ad41ef3 Make normalization of embeddings configurable to test this in c73feeb

### Minor Updates
- 1673bb5 Add todo state to compiled form of each entry
- 6e70b91 Remove unused `dump_jsonl` helper method 
- 7ad9603 Improve naming of lock
- b02323a Improve naming text search test methods

Resolves #190
2023-07-17 14:51:10 -07:00
Debanjum Singh Solanky
3e3a1ecbc8 Start app even if server init fails to let user fix it
Show stacktrace on error to help debugging
2023-07-17 14:33:02 -07:00
Debanjum Singh Solanky
ef6a0044f4 Drop embeddings of deleted text entries from index
Previously the deleted embeddings would continue to be in the index,
even after the entry was deleted
2023-07-16 03:47:05 -07:00
Debanjum Singh Solanky
ad41ef3991 Make normalizing embeddings configurable 2023-07-16 02:16:33 -07:00
Debanjum Singh Solanky
89c7819cb7 Unify logic to generate embeddings from scratch and incrementally
This simplifies the `compute_embeddings' method and avoids potential
later divergence in handling the index regenerate vs update scenarios
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
6a0297cc86 Stable sort new entries when marking entries for update 2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
6e70b914c2 Remove unused dump_jsonl method
The entries index is stored ingzipped jsonl files for each content type
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
9bcca43299 Use single func to handle indexing from scratch and incrementally
Previous regenerate mechanism did not deduplicate entries with same key
So entries looked different between regenerate and update
Having single func, mark_entries_for_update, to handle both scenarios
will avoid this divergence

Update all text_to_jsonl methods to use the above method for
generating index from scratch
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
1673bb5558 Add todo state to compiled form of each org-mode entry 2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
7ad96036b0 Improve lock name to config_lock instead of search_index_lock
It is used to lock updates to all app config state, including processor
2023-07-16 01:45:53 -07:00
Debanjum Singh Solanky
58d86d7876 Use single func to configure server via API and on server start
Improve error messages on failure to configure server components
2023-07-16 01:45:53 -07:00
sabaimran
a15711e635 Fix null type checks in get /config 2023-07-15 15:53:56 -07:00
sabaimran
e590d75b20
Start Khoj even when config is not valid (#320)
* Add icon to indicate bad config, start Khoj even if there was an issue setting up the index
2023-07-15 14:11:54 -07:00
sabaimran
49ab201c30
Fix issues importing PySide in Docker container (#322)
* Rather than installing PyQT dependencies, remove codepaths that require pyqt files in no-gui mode
2023-07-15 13:33:13 -07:00
sabaimran
ba47f2ab39 Merge branch 'master' of github.com:debanjum/khoj 2023-07-14 22:28:05 -07:00
sabaimran
874cffd256 Add additional support for parsing notion workspaces 2023-07-14 22:27:56 -07:00
Debanjum
52f68167ce
Merge pull request #317 from khoj-ai/reduce-memory-consumption-by-search-model-duplication
Reuse Search Models across Content Types to reduce Memory Consumption

- Memory consumption now only scales with search models used, not with content types. 
  Previously each content type had it's own copy of the search ML models. 
  That'd result in 300+ Mb per enabled text content type

- Split model state into 2 separate state objects, `search_models` and `content_index`. 
  This allows loading text_search and image_search models first
  and then reusing them across all content_types in content_index

- The change should cut down memory utilization quite a bit for most users.
  I see a >50% drop in memory utilization on my Khoj instance. 
  But this will vary for each user based on the amount of content indexed vs number of plugins enabled.

- This change does not solve the RAM utilization scaling with size of the index,
  as the whole content index is still kept in RAM while Khoj is running

Should help with #195, #301 and #303
2023-07-14 19:54:12 -07:00
Debanjum Singh Solanky
f08e9539f1 Release lock after updating index even if update fails to prevent deadlock
Wrap acquire/release locks in try/catch/finally when updating content
index and search models to prevent lock not being released on error
and causing a deadlock
2023-07-14 16:57:27 -07:00
sabaimran
37f7f9fd1d
Add additional telemetry for system understanding (#316)
* Add additional telemetry in order to understand which data sources are the most useful
* Make actions side by side in the configuration page
* Restore main run command
* Update links to point to wiki pages for Github, Notion integrations
* Stanardize nomenclature of the api_type to use _config suffix

Remove header fields that aren't actually helpful for understanding config usage
2023-07-14 10:14:07 -07:00
Debanjum Singh Solanky
86e2bec9a0 Reuse Search Models across Content Types to Reduce Memory Consumption
- Memory consumption now only scales with search models used, not with
  content types as well. Previously each content type had it's own
  copy of the search ML models. That'd result in 300+ Mb per enabled
  content type

- Split model state into 2 separate state objects, `search_models' and
  `content_index'.
  This allows loading text_search and image_search models first and then
  reusing them across all content_types in content_index

- This should cut down memory utilization quite a bit for most users.
  I see a ~50% drop in memory utilization.

  This will, of course, vary for each user based on the amount of
  content indexed vs number of plugins enabled

- This does not solve the RAM utilization scaling with size of the index.
  As the whole content index is still kept in RAM while Khoj is running

Should help with #195, #301 and #303
2023-07-14 01:27:22 -07:00
Debanjum
b2718d330c
Merge pull request #304 from migrate-from-pyqt-to-pyside
Migrate from PyQT6 to PySide6
2023-07-13 11:54:47 -07:00
sabaimran
31e933207f Set default values for sys.stdout if they're unavailable 2023-07-12 22:22:49 -07:00
Debanjum Singh Solanky
9c76150895 Migrate from PyQT6 to PySide6 2023-07-11 18:43:44 -07:00
HyunggyuJang
88c42b3043
Encode data as utf-8
otherwise it will complain, see 1c85531090
2023-07-11 17:06:05 +09:00
Debanjum Singh Solanky
f664a74e77 Update Khoj server to run on non standard port, 42110 instead of 8000
Resolves #295
2023-07-10 21:27:58 -07:00
sabaimran
effb52f859 Fix demo rendering with the new header 2023-07-10 21:16:19 -07:00
sabaimran
55f5be7b03 Release Khoj version 0.8.2 2023-07-10 14:39:32 -07:00
sabaimran
9a63f89f33 Merge branch 'master' of github.com:debanjum/khoj 2023-07-10 14:31:19 -07:00
sabaimran
53809298c0 Release Khoj version 0.8.1 2023-07-10 14:30:04 -07:00
tjsousa
5b37e988e6
Allow using configured GPT chat model (#292)
My account doesn't have gpt-4 enabled and it wouldn't work as the default value was always used from extract_questions, where the caller could use the configured model.
2023-07-10 14:24:40 -07:00
Debanjum Singh Solanky
75ff871217 Release Khoj version 0.8.0 2023-07-10 13:37:51 -07:00
Debanjum Singh Solanky
979088b3dc Add tooltip helper text on web settings page buttons
- Provide more details on what clicking configure, initialize buttons
  or changing the results count slider does
- This shows up on user hovering over those buttons
2023-07-10 13:32:41 -07:00
Debanjum Singh Solanky
255781e135 Use relative link on logo to jump to correct page on local and cloud 2023-07-10 13:22:20 -07:00
Debanjum Singh Solanky
b2d229c116 Move header pane style to base khoj.css for reuse. Fix logo size 2023-07-10 13:10:17 -07:00
Debanjum Singh Solanky
20cb314171 Open the Khoj config page in the browser on first run 2023-07-10 12:10:20 -07:00
sabaimran
07cf5a214a
Check if PDF files are present in the Obsidian vault before initializing the Khoj configuration (#293) 2023-07-10 10:33:04 -07:00
sabaimran
7364bac8ae Make the header take up less space
- Use a single row for the header
- Needed custom styling for each page because each of them are different in subtle ways, unfortunately
2023-07-09 22:31:37 -07:00
sabaimran
62704cac09
Add a plugin which allows users to index their Notion pages (#284)
* For the demo instance, re-instate the scheduler, but infrequently for api updates
- In constants, determine the cadence based on whether it's a demo instance or not
- This allow us to collect telemetry again. This will also allow us to save the chat session
* Conditionally skip updating the index altogether if it's a demo isntance
* Add backend support for Notion data parsing
- Add a NotionToJsonl class which parses the text of Notion documents made accessible to the API token
- Make corresponding updates to the default config, raw config to support the new notion addition
* Add corresponding views to support configuring Notion from the web-based settings page
- Support backend APIs for deleting/configuring notion setup as well
- Streamline some of the index updating code
* Use defaults for search and chat queries results count
* Update pagination of retrieving pages from Notion
* Update state conversation processor when update is hit
* frequency_penalty should be passed to gpt through kwargs
* Add check for notion in render_multiple method
* Add headings to Notion render
* Revert results count slider and split Notion files by blocks
* Clean/fix misc things in the function to update index
- Use the successText and errorText variables appropriately
- Name parameters in function calls
- Add emojis, woohoo
* Clean up and further modularize code for processing data in Notion
2023-07-09 15:29:26 -07:00
Debanjum
77755c0284
Fix Packaging the Khoj Desktop Apps (#289)
* Add langchain static files and pytorch metadata to Khoj native app

* Add pillow static files, metadata & hidden imports to Khoj native app

* Fix path to web interface static files on Khoj native app

* Add tiktoken hidden imports to make chat work from Khoj native app

* Fix Khoj native app to run with GUI mode enabled

This got broken when we moved from using the --no-gui flag to using
--gui in https://github.com/khoj-ai/khoj/pull/263
2023-07-09 10:21:16 -07:00
sabaimran
4c135ea316
Make streaming optional for the /chat endpoint (#287)
* Update the /chat endpoint to conditionally support streaming

- If streams are enabled, return the threadgenerator as it does currently
- If stream is disabled, return a JSON response with the response/compiled references separated out
- Correspondingly, update the chat.html UI to use the streamed API, as well as Obsidian
- Rename chat/init/ to chat/history

* Update khoj.el to use the /history endpoint

- Update corresponding unit tests to use stream=true

* Remove & from call to /chat for obsidian

* Abstract functions out into a helpers.py file and clean up some of the error-catching
2023-07-09 10:12:09 -07:00
Debanjum Singh Solanky
0a86220d42 Use default values, delete content config on disable and update state 2023-07-07 20:36:16 -07:00
Debanjum Singh Solanky
362063f5fe By default, connect to Khoj server over IPv4 from Obsidian plugin 2023-07-07 20:36:16 -07:00
Debanjum Singh Solanky
571e8c2548 Add rerank, index corruption hint on search page of web interface
Similar to the hint alrady in the Obsidian search modal
Closes #272
2023-07-07 20:36:16 -07:00
Debanjum Singh Solanky
61e131f95c Hide unused model field from chat settings on web interface 2023-07-07 18:43:53 -07:00
Debanjum Singh Solanky
af30d01e85 Move to newer chat models to extract questions & summarize chats
Deprecate usage of the older gpt3 models in-place of the newer chat
based models
- text-davinci-003 is only 50% cheaper than gpt4 and less reliable for
  question extraction
- Using gpt-3.50turbo for summarization should reduce cost of chat

- Keep conversation.chat_session as a list instead of a string
- Update completion_with_backoff func to use ChatML format
2023-07-07 17:32:27 -07:00
Debanjum Singh Solanky
171ce19e1f Update date filter to allow quoting values in single quotes 2023-07-07 17:13:47 -07:00
Debanjum Singh Solanky
e588f7c528 Deprecate unused beta search and answer API endpoints 2023-07-07 16:38:07 -07:00
Debanjum Singh Solanky
c9fc4d1296 Revert to using cross-encoder to improve search results used by chat 2023-07-07 15:31:34 -07:00
Debanjum Singh Solanky
11f0a9f196 Fix chat tests since streaming. Pass args correctly to chat methods
- Fix testing gpt converse method after it started streaming responses
- Pass stop in model_kwargs dictionary and api key in openai_api_key
  parameter to chat completion methods. This should resolve the arg
  warning thrown by OpenAI module
2023-07-07 15:23:44 -07:00
Debanjum Singh Solanky
48870d9170 Fix parsing questions generated by extract_questions actor into list
The previous json parsing was failing to handle questions with date
filters

Fix the chat actor tests to run without throwing error with freezegun
complaining about importing transformers.local_llama model

Remove quote escapes from date filter examples provided to
extract_questions actor
2023-07-07 15:18:55 -07:00
Debanjum Singh Solanky
279662620b Move results count to settings page on web. Use it for search & chat
- Before
  Only the search interface had the results count configuration option

- After
  - The results count is set on the settings page instead of the
    search page
  - Both search and chat can use the configured results count instead
    of just search
2023-07-07 14:08:08 -07:00
Debanjum Singh Solanky
2ec8da89e8 Remove Update button from Khoj Search page on the Web interface
The settings page on the Khoj web interface already has a configure
button. Don't need the Update button on the search page as well
2023-07-07 12:49:58 -07:00
Debanjum Singh Solanky
bf427cd8dd Set no. of results used to generate chat response from Khoj Emacs 2023-07-07 12:34:50 -07:00
Debanjum Singh Solanky
1d77fe712c Set no. of results used to generate chat response from Khoj Obsidian 2023-07-07 12:32:32 -07:00
Debanjum Singh Solanky
2f31de5ed5 Set no. of references to use for chat configurable in Chat API 2023-07-07 12:29:36 -07:00
Debanjum Singh Solanky
d97682fdac Use tooltip, placeholders to guide Khoj setup via web settings page 2023-07-06 21:37:48 -07:00
Debanjum Singh Solanky
f5cf09424b Use more descriptive field names for content type settings on Khoj web
Resolves #281
2023-07-06 20:47:39 -07:00
Debanjum Singh Solanky
a2c668268f Use node-fetch >=3.1.0 in khoj obsidian plugin to avoid security vulnerability 2023-07-06 13:05:52 -07:00
sabaimran
d688ddf92c
Re-instate the scheduler for the demo instances (#279)
* For the demo instance, re-instate the scheduler, but infrequently for api updates

- In constants, determine the cadence based on whether it's a demo instance or not
- This allow us to collect telemetry again. This will also allow us to save the chat session

* Conditionally skip updating the index altogether if it's a demo isntance
2023-07-06 11:01:32 -07:00
Debanjum Singh Solanky
8f36572a9b Improve typing, null checks in controllers and gpt functions 2023-07-05 20:49:25 -07:00
Debanjum
6c2a8a5bce
️ Stream Responses by Khoj Chat on Web, Obsidian
- What
   - Stream chat responses from OpenAI API to Web, Obsidian clients
      - Implement using a callback function which manages a queue where new tokens can be placed as they come on. As the thread is read from, tokens are removed.
      - When the final token has been processed, add the `compiled_references` to the queue to be rendered by the `chat` client
      - When the thread has been closed, save the accumulated conversation log in the user's history using a `partial func`
      - Incrementally decode tokens on the front end and add them as they appear from the streamed response

- Why
This significantly reduces perceived latency and OpenAI API request timeouts for Chat

Closes https://github.com/khoj-ai/khoj/issues/257
2023-07-05 20:02:11 -07:00
Debanjum Singh Solanky
e111eda6ae Make client, app_config optional in telemetry logger for correct typing 2023-07-05 18:57:38 -07:00
Debanjum Singh Solanky
e562114f6b Improve comments, var names in js for chat streaming on web interface 2023-07-05 18:57:27 -07:00
Debanjum Singh Solanky
46269ddfd3 Fix chat logging messages to get context without flooding logs 2023-07-05 18:27:06 -07:00
Debanjum Singh Solanky
0ba838b53a Show temp status message in Khoj Obsidian chat while Khoj is thinking
- Scroll to bottom after adding temporary status message and
references too
2023-07-05 18:02:43 -07:00
Debanjum Singh Solanky
8271abe729 Use optional chaining operator to extract khojBannerSubmit from conditional 2023-07-05 18:02:43 -07:00
Debanjum Singh Solanky
c12ec1fd03 Show temp status message in Khoj web chat while Khoj is thinking
- Scroll to bottom after adding temporary status message and
references too
2023-07-05 18:02:30 -07:00
sabaimran
257a421e45 Bonus: add try-catch logic around telemetry upload in case of JSON serializability issues 2023-07-05 15:12:18 -07:00
sabaimran
4e6b66b139 Add support for streaming chat response from OpenAI to Obsidian
- I needed to installed node-fetch to accomplish this, as the built-in request object from Obsidian doesn't seem to support streaming and the built-in fetch object is very sensitive to any and all cross origin requests
2023-07-05 15:01:22 -07:00
sabaimran
3ff5074cf5 Log the end-to-end time of generating a streamed response from OpenAI 2023-07-05 14:59:44 -07:00
sabaimran
68e635cc32 Remove additional comments and debug statements 2023-07-05 11:33:56 -07:00
sabaimran
67a8795b1f Clean-up commented out code 2023-07-05 11:24:40 -07:00
sabaimran
79b1b1d350 Save streamed chat conversations via partial function passed to the ThreadGenerator 2023-07-04 17:33:52 -07:00
sabaimran
afd162de01 Add reference notes to result response from GPT when streaming is completed
- NOTE: results are still not being saved to conversation history
2023-07-04 12:47:50 -07:00
sabaimran
8f491d72de Initial code with chat streaming working (warning: messy code) 2023-07-04 10:14:39 -07:00
Debanjum Singh Solanky
5889eceba4 Make text selectable in Khoj chat modal on Obsidian
Previously the text in the Khoj chat modal couldn't be copied as it
was not selectable

Resolves #206
2023-07-03 23:24:04 -07:00
sabaimran
89354def9b Update request timeout window to 20 seconds 2023-07-03 22:28:18 -07:00
sabaimran
b1940519c3 Log error if unable to decode chunk from Github 2023-07-03 16:29:32 -07:00
Debanjum Singh Solanky
ecf9730cd7 Disable Chat, Search on Web if Khoj not configured & show next steps 2023-07-03 16:04:32 -07:00
sabaimran
017e8c1aef
Skip indexing a PDF that has an indexing error (#274) 2023-07-03 15:55:11 -07:00
sabaimran
a6f313589e Release Khoj version 0.7.1 2023-07-03 12:26:41 -07:00
sabaimran
8bfd5828e6 Remove deprecation notice since we're opening the web UI by default 2023-07-03 12:01:09 -07:00
sabaimran
92d81d3b16
Initialize the search.model field to SearchModels() and fix Reinitialize API call (#273) 2023-07-03 11:32:44 -07:00
sabaimran
61403138d5
Merge pull request #269 from khoj-ai/features/simplify-configuration-steps
Simplify some common configuration steps
2023-07-03 00:16:51 -07:00
sabaimran
ea3dc2cfa3 Simplify rendering of content type pages and logic of selecting config 2023-07-03 00:15:29 -07:00
sabaimran
260272dca2 Check if state.config is populated before configuring via the update method 2023-07-03 00:10:56 -07:00
sabaimran
bf8914d0c8 Fix default config initialization for for chat.html 2023-07-03 00:00:47 -07:00
Debanjum
faad1297f4
Drop Support for Org Music, Ledger Content Types
Removing unused content types will reduce khoj code to manage

- 0f993b3 Drop support for Ledger as a separate content type
   Khoj will soon get a generic text indexing content type in Index plain text files #237.
   This along with a file filter should suffice for searching through Ledger transactions

- c9db532 Remove unused org-music as an indexable content type from Khoj
   Org-music was just a custom content type that worked with org-music.
   It was mostly only useful for me.
2023-07-02 17:48:29 -07:00
Debanjum Singh Solanky
0f993b332e Drop support for Ledger as a separate content type
Khoj will soon get a generic text indexing content type. This along
with a file filter should suffice for searching through Ledger
transactions, if required.

Having a specific content type for niche use-case like ledger isn't
useful. Removing unused content types will reduce khoj code to manage.
2023-07-02 16:57:49 -07:00
sabaimran
fa218ff5aa Fix call to update for Reinitialize button 2023-07-02 16:31:30 -07:00
sabaimran
a8b83da872 Merge branch 'master' of github.com:debanjum/khoj into features/simplify-configuration-steps 2023-07-02 16:21:54 -07:00
Debanjum Singh Solanky
c9db5321e7 Remove unused org-music as an indexable content type from Khoj
Org-music was just a custom content type that worked with org-music.
It was mostly only useful for me.

Cleaning up that code will reduce number of content types for khoj to
manage.
2023-07-02 16:21:21 -07:00
sabaimran
b86a3bb0c5 Merge branch 'master' of github.com:debanjum/khoj into fix/obsidian-setup-issues 2023-07-02 16:21:05 -07:00
sabaimran
a52c1c8380 Use built-in app.vault to determine whether there are any PDF files within 2023-07-02 16:20:43 -07:00
sabaimran
eff1436857 Overwrite existing PDFs in Obsidian as well, make if-block more legible 2023-07-02 16:17:25 -07:00
Debanjum Singh Solanky
30459ee4ba Fix Khoj subtitle in desktop entry, pyproject, cli and Obsidian Readme 2023-07-02 16:09:07 -07:00
sabaimran
1a1b044d12 Simplify settings pages for configuration
- Add one-click disablement
- Remove fields that probably don't need to be edited (our implementation details)
- Add a green tick if a given field is configured
2023-07-02 16:04:05 -07:00
sabaimran
e4c445f805 Add try-except-finally blocks around configure calls in /update 2023-07-02 13:35:02 -07:00
sabaimran
4b02a8c788 Fix PDF setup in Obsidian plugin and force Obsidian configuration for markdown 2023-07-02 12:37:24 -07:00
sabaimran
2a7e4f2b71 Escape special characters in the URL when adding a link to the remote file 2023-07-02 09:13:28 -07:00
sabaimran
c747562897 Update the GUI to just be a simple box with a button for the web UI 2023-07-01 20:37:21 -07:00
sabaimran
bab7f39d47 Move logic to open the web browser into the GUI section 2023-07-01 20:11:27 -07:00
sabaimran
36537606da Update unit test and preserve prior operational ordering in main.py 2023-07-01 20:02:35 -07:00
sabaimran
ea9ae4ae28 Configure Khoj to automatically open the browser to their web home page when Khoj is up 2023-07-01 19:46:31 -07:00
sabaimran
d2083dd395 Remove bespoke processing for GithubToJsonl file demo 2023-07-01 19:09:22 -07:00
sabaimran
a71440f62a Update the guidance in the error message if config is not set 2023-07-01 19:09:00 -07:00
sabaimran
7db97d8aa9 Fix: don't try to render the search_type.ALL 2023-07-01 19:08:19 -07:00
sabaimran
f0f6390366 Make --no-gui the default behavior of Khoj and update corresponding documentation 2023-07-01 19:07:59 -07:00
Debanjum Singh Solanky
d77e05c279 Release Khoj version 0.7.0 2023-07-01 05:44:22 -07:00
Debanjum Singh Solanky
30d87a9a01 Update color of Khoj chat in Obsidinan plugin to Lantern theme 2023-07-01 02:18:47 -07:00
Debanjum Singh Solanky
51826d28d6 Ensure clicking Update in Khoj Obsidian indexes PDF files too 2023-07-01 02:18:47 -07:00
sabaimran
dac2d14380 Handle file names appropriately for md files and render commits in github results 2023-07-01 01:20:58 -07:00
sabaimran
dbe713604d Fix error in tests for markdown_to_jsonl 2023-07-01 00:49:40 -07:00
sabaimran
931aab4464 Handle case for when headers value is None 2023-07-01 00:37:30 -07:00
sabaimran
d01afb3ee4 Fix path issues for URL-based markdown files 2023-07-01 00:25:11 -07:00
sabaimran
31655447e7 Add the sign-up list to the chat page as well and update copy 2023-06-30 21:43:01 -07:00
sabaimran
796102c74e Add separate configuration if the given Khoj instance is meant for demo
- In theory, this will be suitable for any Khoj instance that's meant for external-facing purposes (as in, outside of the user's network)
- Prevent re-indexing for Github data if this is a demo instance
- Fix up some issues with the CSS which made settings page small in mobile
- In the frontend views for Khoj, add a button to get on the waitlist and links to the landing page
2023-06-30 20:38:55 -07:00
sabaimran
db3026739d Resolve diffs in api.py to make /chat endpoint async with new request parameter 2023-06-30 00:25:37 -07:00
sabaimran
ef72508914 Try/catch around github file decoding, await call to search in chat API, fix img width 2023-06-30 00:23:21 -07:00
Debanjum Singh Solanky
b950889f47 Fix org-mode web renderer to handle results containing list in block
- Break out of rendering list if at end of org block in org.js
- This would previous hang rendering results in web interface

Should try fix this upstream in org.js as well
2023-06-29 19:01:25 -07:00
sabaimran
780c769567 Add additional request headers to improve telemetry 2023-06-29 18:51:24 -07:00
sabaimran
6c10d68262
Merge pull request #253 from khoj-ai/features/github-issues-indexing
Support indexing Github issues as well as corresponding comments
2023-06-29 16:02:47 -07:00
sabaimran
b2dd946c6d Rename issue to entry method for accuracy 2023-06-29 15:23:50 -07:00
Debanjum Singh Solanky
51dfa48e2b Have Khoj support Python 3.11 as Pytorch supports it now
- Previously Khoj could only support Python upto 3.10 due to pytorch.
  But lots of folks had python 3.11 installed by default on their machines.

  This required installing python 3.10 and dealing with virtual envs.

  With Torch >= 2.0.1 now able to support python 3.11, at least one
  class of installation troubles for Khoj should drop. See
  https://github.com/pytorch/pytorch/issues/86566 for reference

- Preliminary testing indicates using the new torch 2.x may reduce
  search time by 25% (from 80ms to 60ms on Mac M1)

- Update Docs to not require mentioning python <=3.10 required
- Update Github test workflow to run khoj tests with python 3.11 too
2023-06-29 15:13:26 -07:00
sabaimran
65bf894302 Interpret org files as a list and put them in separate divs. Update styling of search results to separate into cards 2023-06-29 15:12:48 -07:00
Debanjum Singh Solanky
d212298573 Make Configure button on web interface incrementally update by default
We should add a way to force index everything.

But force indexing should not be the default when user is just trying
update content to index
2023-06-29 14:52:51 -07:00
Debanjum Singh Solanky
da2de21339 Only return requested result count even if search in multiple content types
- Set results_count to default value at start so it is an int, never None
2023-06-29 14:49:05 -07:00
sabaimran
77672ac0ae Demarcate different results with a border box
- Add back support for searching by type Github
- Remove custom class name in markdown js file
2023-06-29 14:14:25 -07:00
sabaimran
6edc32f2f4 Accept current changes to include issues in rendering flow 2023-06-29 12:25:29 -07:00
sabaimran
ab7dabe74f Explicitly use Union type for function parameters for lint checks 2023-06-29 11:44:30 -07:00
sabaimran
fecf6700d2 Limit small image rendering to just the avatar images 2023-06-29 11:27:18 -07:00
sabaimran
70e550250a Add an additional data source for issues from Github repositories + quality of life updates
- Use a request session to reduce the overhead of setting up a new connection with the Github URL each request
- Use the streaming feature for the REST api to reduce some of the memory footprint
2023-06-29 10:59:54 -07:00
Debanjum Singh Solanky
5f2717cc4b Use logger.warning since logger.warn is deprecated 2023-06-28 22:15:27 -07:00
Debanjum Singh Solanky
56ce97ef9e Use async/await in tests for query method of text and image search
The text, image search query method has become async. So async/await
is required to get results correctly in tests etc
2023-06-28 22:07:02 -07:00
Debanjum Singh Solanky
b1767f93d6 Get any configured asymmetric search model to encode query for search
- Set image_search.query to async to use it with multi-threading
  This is same as text_search.query being set to an async method
- Exit search early if no search_model is defined in state.model
2023-06-28 22:07:02 -07:00
Debanjum Singh Solanky
8eae7c898c Put each result under org heading when query for "all" content type in khoj.el
- Add "all" as default content type when no content type retrieved
  from server
2023-06-28 22:07:02 -07:00
Debanjum Singh Solanky
630bf995f1 Style each result based on its content type in same view on Khoj web
- So when searching across content types (with content-type = "all")
  org-mode results get rendered differently than markdown, PDF etc. results

- Set div class for each result separately instead of a single uber div
  for styling. This allows styling div of each result based on the
  content-type of that result

- No need to create placeholder "all" content type on web interface as
  server is passing an all content type by itself
2023-06-28 22:07:01 -07:00
Debanjum Singh Solanky
1773a78339 Fix createRequestUrl method signature to fetch results from khoj web 2023-06-28 12:10:45 -07:00
Debanjum Singh Solanky
212b1a96c8 Create "all" search type for search across all content types on khoj server
Allows moving logic to handle search across all content types to
server from clients
2023-06-28 11:34:26 -07:00
Debanjum Singh Solanky
0636ceaf14 Merge branch 'master' of github.com:khoj-ai/khoj into parallelize-search-across-all-asymmetric-text-content-types
Conflicts:
- src/khoj/routers/api.py: Use theirs
2023-06-27 16:10:32 -07:00
Debanjum Singh Solanky
510bb7e684 Use typing union in text_search for python 3.8 compatible type hinting 2023-06-27 15:59:50 -07:00
Debanjum Singh Solanky
1b11d5723d Extract search request URL builder into js function in web interface 2023-06-27 15:50:41 -07:00
Debanjum Singh Solanky
09f739b8cc Null check config, log warning instead of error when configuring search 2023-06-27 15:48:48 -07:00
sabaimran
9d62d66a77 Simplify construction of repo shorthand in GithubToJsonl 2023-06-27 15:05:03 -07:00
sabaimran
227169ebde Support configuration of multiple Github repositories in the settings interface
- Add cards to configure each of the Github repositories
- Fix a bug in the API which caused all other settings to be wiped when updating one of the content types
- Provide an error message to the user if they have a misconfiguration in their chat settings
2023-06-27 14:10:09 -07:00
sabaimran
37a1f15c38 Add backend support for indexing multiple repositories
- Add support for indexing org files as well as markdown files from the Github repository and update corresponding search view
- Support indexing a list of repositories
2023-06-27 12:06:15 -07:00
sabaimran
ddd550e6f4 Add call to use X-CSRFToken in relevant POST methods 2023-06-26 12:38:00 -07:00
sabaimran
35e24d7851 Fix null checking in state for content config API and telemetry API 2023-06-26 11:37:34 -07:00
sabaimran
5e39421f56 Merge branch 'master' of github.com:debanjum/khoj 2023-06-25 11:41:47 -07:00
sabaimran
4410a3bb4b Limit max width of the pre tag to 100% of the screen width 2023-06-25 11:41:15 -07:00
sabaimran
ffe66b848a Use a single column tempalte for config plugins when in mobile 2023-06-25 11:27:41 -07:00
Debanjum Singh Solanky
b1890aa050 Null check intermediary objects when config not fully initialized 2023-06-24 15:34:18 -07:00
Debanjum Singh Solanky
946af0889d Improve showing status message on saving config via web interface
- Show success/failure status message much closer to the save button
  Previously status message was shown on top of the page, which wasn't
  always in view and wasn't easily seen
- Improve the status message to more clearly show next steps on success
2023-06-24 00:49:57 -07:00
Debanjum Singh Solanky
40d1abfe50 Update the new /config APIs to configure Khoj for first time users
- Setup state.config and sub-components from unset state
- Setup search types with default settings
2023-06-24 00:45:30 -07:00
Debanjum Singh Solanky
edabede93a Fix post configuration state update on error or success on config html 2023-06-23 14:52:25 -07:00
Debanjum Singh Solanky
4744d69221 Resolve button name, anchor tag feedback. Add status message to settings page
- Use "Configure" name for settings config action
- Use more standard anchor tag instead of button
- Add configure status message
2023-06-23 09:48:38 -07:00
Debanjum Singh Solanky
26abafa658 Highlight currently active tab in web interface for orientation 2023-06-22 00:33:28 -07:00
Debanjum Singh Solanky
2728c714d7 Put pico.css in local assets. Move common css styling into khoj.css 2023-06-22 00:33:11 -07:00
Debanjum Singh Solanky
20a37697de Add Khoj header with navigation pane to Search and Chat Interfaces 2023-06-22 00:33:11 -07:00
Debanjum Singh Solanky
c467a0cbb0 Update UI of config sub pages to use khoj lantern theme styling 2023-06-22 00:33:11 -07:00
Debanjum Singh Solanky
0ce2ec590a Update main config page on khoj server to match khoj lantern theme 2023-06-21 20:25:25 -07:00
Debanjum Singh Solanky
d30a9ddd33 Use Khoj Logo on Search, Chat pages of Web Interface 2023-06-21 12:34:53 -07:00
Debanjum Singh Solanky
6d4aad57e1 Use new Khoj Lantern Logo in Web, Emacs, Obsidian UIs and Docs 2023-06-21 01:57:22 -07:00
Debanjum Singh Solanky
69d4fa6525 Rename project links across repo from debanjum/khoj to khoj-ai/khoj 2023-06-21 00:13:21 -07:00
Debanjum Singh Solanky
5c4eb950d5 Search across all content types via khoj.el on Emacs
If no content-type selected in transient menu option, khoj.el queries
khoj server without content-type parameter (t) set.

This results in search across all enabled asymmetric search text
content types
2023-06-20 23:39:56 -07:00
Debanjum Singh Solanky
2cd3e799d3 Improve null and type checks 2023-06-20 23:30:59 -07:00
Debanjum Singh Solanky
d5fb4196de Update web interface to allow querying all content types at once 2023-06-20 22:21:50 -07:00
Debanjum Singh Solanky
5c7c8d1f46 Use async/await to fix parallelization of search across content types 2023-06-20 22:21:50 -07:00
Debanjum Singh Solanky
1192e49307 Pass default value matching argument types expected by text_search methods 2023-06-20 22:21:50 -07:00
Debanjum Singh Solanky
0144e610d6 Only search across content types that work with asymmetric search 2023-06-20 22:21:46 -07:00
Debanjum Singh Solanky
f6a7aa6c96 Style Khoj chat on web interface with new lantern theme
- Color khoj chat message with new yellow theme color
- Update Khoj chat emoji to lantern
- Add page type to title of pages on web interface
2023-06-20 01:39:33 -07:00
Debanjum Singh Solanky
6d94d6e75a Encode the asymmetric, symmetric search queries in parallel for speed
Use timer to measure time to encode queries and total search time
2023-06-20 01:18:17 -07:00
Debanjum Singh Solanky
d292dc03b3 Use new Khoj Logotype in Web interface 2023-06-20 01:13:06 -07:00
Debanjum Singh Solanky
db07362ca3 Encode user query as same across search types to speed up query time
- Add new filter abstract method to remove filter terms from query
- Use the filter method to remove filter terms, encode this defiltered
  query and pass it to the query methods of each search types

TODO: Encoding query is still taking 100-200 ms unlike before. Need to
investigate why
2023-06-19 23:29:54 -07:00
Debanjum Singh Solanky
285d17af2a Search in parallel across all enabled content types requested via API
- Update API to return content from all enabled content types when type
  is not set to specific type in HTTP request param
- To do this efficiently run the search queries in parallel threads
2023-06-19 23:29:06 -07:00
Debanjum Singh Solanky
79d325fbb6 Fix triggering @general queries in Khoj Chat 2023-06-19 23:05:33 -07:00
Debanjum Singh Solanky
e97a20d70c Set conversation type if query param set, else return chat history
Only initialize variables if query is not empty, to avoid unnecessary
compute, variable null checks etc.

Fixes #230
2023-06-19 19:59:16 -07:00
sabaimran
4722a2c16d Add Github configuration page and success notifications 2023-06-18 10:06:45 -07:00
sabaimran
668135c763 Merge branch 'master' of github.com:debanjum/khoj into features/pretty-config-page 2023-06-18 08:35:09 -07:00
sabaimran
81183a1fe1 Address misc PR comments and update logo in all clients
- Rename the new logo to reflect accuracy on size (e.g., 128x128)
- Update the icns file for Mac
- Update nomenclature in settings pages
2023-06-18 08:34:58 -07:00
Debanjum Singh Solanky
a44cde2865 Show hint to re-index vault if wonky results in Obsidian search modal
Remove spurious indentation in Obsidian styles.css

Resolves #207
2023-06-18 04:53:51 -07:00
Debanjum Singh Solanky
595cc5b0f5 Use printer icon for PDF logs. Only split lines if file at web link in web interface 2023-06-18 02:26:03 -07:00
Debanjum Singh Solanky
e31a540a5e Get all md files recursively in repository by passing recursive param
Previously the `get_markdown_files' method was only getting files at
root of the repository

Fix, improve logger messages in github to jsonl processor
2023-06-18 01:47:15 -07:00
Debanjum Singh Solanky
6fdac24416 Set page size to 100 to reduce requests required to Github API to 1/3
- Default is 30. So number of paginated requests required to get all
  items (commits, files) will reduce by 67%

- No need to increase page size for the get tree Github API request from
  `get_markdown_files'

  Get tree Github API doesn't support pagination and return 100K items
  in response. This should be way more than enough for our current
  use-cases
2023-06-18 01:44:36 -07:00
Debanjum Singh Solanky
87975e589a Fix passing auth token to Github API to increase rate limits by x85
- Previously wasn't prefixing "token" to PAT token in Auth header
  This resulted in the request being considered unauthenticated

- Unauthenticated requests to Github API are limited to 60 requests/hour
  Authenticated requests to Github API are allowed 5000 requests/hour
2023-06-18 01:19:26 -07:00
Debanjum Singh Solanky
9c70af960c Extract logic to get file content from Github into a separate method 2023-06-18 01:19:13 -07:00
Debanjum Singh Solanky
10d4c38ce9 Extract Wait for rate limit reset logic into a function for reuse 2023-06-18 01:06:46 -07:00
sabaimran
aad7f825e0 Remove music configuration 2023-06-17 21:23:56 -07:00
sabaimran
5f97afbfac Ignore type checks from mypy in subindexed fields 2023-06-17 16:53:36 -07:00
sabaimran
c2d46de8bc Add endpoint for regenerating directly from the config page and add music content-type 2023-06-17 15:47:33 -07:00
sabaimran
ded3100caf Update the configuration page to make config management easier
- Add a central configuration management page to make management of config details easier
- Add relevant api endpoints both for client and server to update/request data as necessary
- Attempt to update the favicon
2023-06-17 15:21:28 -07:00
Debanjum Singh Solanky
3f24e53b6e Render URL as link in web interface if file param of result is a web link 2023-06-17 04:26:40 -07:00
Debanjum Singh Solanky
63ec84ad78 Store Github URL of Markdown files on Github in file jsonl param 2023-06-17 04:23:01 -07:00
Debanjum Singh Solanky
0c1c7583b5 Handle pagination, API rate limits. Get all commits from Github repo 2023-06-17 04:21:39 -07:00
Debanjum Singh Solanky
31d17d0b22 Index commits message from repository with the github plugin 2023-06-17 02:59:54 -07:00
Debanjum Singh Solanky
c29c141a7e Use Github Rest API to index Markdown files in Github Repository
The Llama_Hub Github plugin is fairly limited.

The Github Rest API is well supported and can easily be extended to
index commit messages, issues, discussions, PRs etc.
2023-06-17 02:16:13 -07:00
Saba
ac96f43b1b Remove try-catch specific to Github plugin; consolidate GUI logic 2023-06-16 23:46:25 -07:00
Saba
019d3732de Rename orgmode_search to org_search 2023-06-13 16:06:54 -07:00
Saba
08d79f5ba4 Unify types used in Github and other text-based configs. Fix typing issues 2023-06-13 15:52:36 -07:00
Saba
a6cd96a6a9 Add a Github plugin which can be used to read from a Github repository 2023-06-13 14:40:06 -07:00
Debanjum
c68cde4803
Log clients calling API endpoints on Khoj server
- Make API endpoints on Khoj server accept `client` as request parameter
  - Khoj API endpoints: /chat, /search, /update
- Make Khoj clients set `client` request param when calling the API endpoints on the Khoj server
  - Khoj clients: Emacs, Obsidian and Web
- Also log khoj server_version running to telemetry server
2023-06-09 18:36:49 +05:30
sabaimran
59fa48036f
Merge pull request #224 from debanjum/fix/message-exceeds-prompt-size
Pass truncated message as string in ChatMessage when exceeding max prompt size
2023-06-08 17:32:53 -07:00
Debanjum Singh Solanky
139a3ba060 Update server to log new server version field to telemetry db 2023-06-08 14:14:21 +05:30
Saba
5d5ebcbf7c Rename truncate messages method and update unit tests to simplify assertion logic 2023-06-06 23:25:43 -07:00
Saba
7119ed0849 Run pre-commit script 2023-06-05 19:29:23 -07:00
Saba
6212d7c2e8 Remove debug line 2023-06-05 19:00:25 -07:00
Saba
f65ff9815d Move message truncation logic into a separate function. Add unit tests with factory boy. 2023-06-05 18:58:29 -07:00
Debanjum Singh Solanky
eb6175e9b0 Update description field in webmanifest of Khoj, Khoj Chat PWA 2023-06-06 01:53:42 +05:30
Debanjum Singh Solanky
bb2363f324 Set client request param when calling khoj server APIs from Web 2023-06-06 00:05:00 +05:30
Debanjum Singh Solanky
caab55fbdd Set client request param when calling khoj server APIs from Obsidian 2023-06-06 00:04:46 +05:30
Debanjum Singh Solanky
de2494154f Set client request param when calling khoj server APIs from Emacs 2023-06-06 00:02:10 +05:30
Debanjum Singh Solanky
168c11cea7 Make server API endpoints accept client as query param
- The chat, search and update API will accept client as request param.
- This will allow logging the client from which these APIs was called.
2023-06-05 23:57:08 +05:30
Debanjum Singh Solanky
8617cf1389 Push telemetry to Posthog to grok Khoj usage 2023-06-05 22:47:49 +05:30
Debanjum Singh Solanky
d13db2e666 Make old telemetry server forward requests to new server 2023-06-05 13:06:45 +05:30
Saba
5f4223efb4 Increase timeout to OpenAI call 2023-06-04 20:49:47 -07:00
Saba
0e63a90377 Fix the mechanism to retrieve the message content 2023-06-04 20:25:37 -07:00
Saba
f0efe0177e Pass truncated message as string in ChatMessage when exceeding max prompt size 2023-06-04 19:33:46 -07:00
Saba
068ee0ac5e Swap elif with else, as usage of this method does not use openai_api_key 2023-06-04 02:25:08 -07:00
Saba
6508379d7b Use api_key keyword argument to set the openai_api_key parameter for GPT 2023-06-04 00:57:00 -07:00
Debanjum Singh Solanky
7af8a56434 Remove filename from reference before rendering references in khoj.el
Fixes bug where actual reference heading in next line jumping out of
references footnote section
2023-06-02 10:42:44 +05:30
Debanjum Singh Solanky
ec280067ef Do not retrieve relevant notes when having a general chat with Khoj
- This improves latency of @general chat by avoiding unnecessary
  compute
- It also avoids passing references in API response when they haven't
  been used to generate the chat response. So interfaces don't have to
  add logic to not render them unnecessarily
2023-06-02 10:42:44 +05:30
Debanjum Singh Solanky
90439a8db1 Update Khoj subtitle to AI personal assistant for your digital brain 2023-06-02 10:42:44 +05:30
Debanjum Singh Solanky
e9ed7a19fd Update search prompt to extract PDF search type. Fix extract_question prompt 2023-06-02 10:06:03 +05:30
Debanjum Singh Solanky
bbe3bf9733 Render PDF search results in Khoj Obsidian interface
- Make plugin update khoj server config to index PDF files in vault too
- Make Obsidian plugin update index for PDF files in vault too
- Show PDF results in Khoj Search modal as well
  - Ensure combined results are sorted by score across both types
- Jump to PDF file when select it PDF search result from modal
2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
e3892945d4 Render PDF search results in Khoj.el Emacs interface 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
85144006a1 Render PDF search results in khoj web interface 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
acd14a5e41 Wire up PDF to jsonl processor to Khoj server layer (API, config)
- Specify PDF content to index via khoj.yml
- Index PDF content on app start, reconfigure
- Expose PDF as a search type via API
2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
286b500f66 Create PDF to JSONL processor using PyPDF and LangChain
Switch `pydantic' to >= 1.9.1 else `langchain.document_loaders' starts
throwing typing error for python 3.8, 3.9
2023-06-01 21:41:49 +05:30
Debanjum Singh Solanky
1b3effd8e6 Fork Markdown to JSONL processor as start template for PDF to Jsonl Processor 2023-06-01 09:13:31 +05:30
Debanjum Singh Solanky
1cd9ecd449 Truncate last message if still over max supported prompt size by model 2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
ed4d0f9076 Simplify argument names used in khoj openai completion functions
- Match argument names passed to khoj openai completion funcs with
  arguments passed to langchain calls to OpenAI
- This simplifies the logic in the khoj openai completion funcs
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
703a7c89c0 Reduce retry count and request timeout for faster response or failure
- Fix bug where both LangChain and Khoj retry requests 6 times each.
  So a total of 12 requests at >1minute intervals for each chat
  response in case of OpenAI API being down

- Retrying too many times when the API is failing doesn't help
- The earlier 60 second request timeout was spacing out the interval
  between retries way too much. This slowed down chat response times
  quite a bit when API was being flaky

- With these updates you'll know if call to chat API failed in under a
  minute
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
18081b3bc6 Use LangChain to call GPT over API 2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
277d2f5c96 Do not add "Notes:" suffix to chat messages when no notes retrieved
This was causing spurious "Notes:" suffix being added to Khoj Chat in
response
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
334be4e600 Use LangChain to call OpenAI for Khoj Chat
- Use ChatModel and ChatOpenAI to call OpenAI chat model instead of
  using OpenAI package directly
- This is being done as part of migration to rely on LangChain for
  creating agents and managing their state
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
efcf7d1508 Extract prompts as LangChain Prompt Templates into a separate module
Improves code modularity, cleanliness. Reduces bloat in GPT.py module
2023-06-01 08:50:58 +05:30
Debanjum Singh Solanky
b484953bb3 Import app state correctly to generate embeddings with OpenAI model
Resolves #216
2023-05-28 10:21:54 +05:30
Debanjum Singh Solanky
a0d0dbaca7 Fix link to Khoj Obsidian Demo video in Readmes 2023-05-23 04:23:08 +05:30
Debanjum Singh Solanky
ebb5d7b8e5 Release Khoj version 0.6.2 2023-05-17 20:04:20 +05:30
Debanjum Singh Solanky
d02415edcc Write generated server id to env file when env file does not contain it 2023-05-17 19:38:44 +05:30
Debanjum Singh Solanky
dc0626856e Put the telemetry db in a separate directory by default 2023-05-17 18:58:47 +05:30
Debanjum Singh Solanky
e9f04dc644 Add dockerfile to containerize telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
07b19964d4 Schedule jobs at (co-)prime intervals to reduce overlap in job runs 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
d42f0f5055 Add basic telemetry server for khoj 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
134cce9d32 Batch upload telemetry data at regular interval instead of while querying 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
3ede919c66 Log usage of /search, /chat, /update API endpoints to telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
f2e89f6f46 Add khoj app helper methods to log app usage to a telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
9ca61d62ff Enable/disable logging telemetry by setting bool in khoj.yml config
We log usage telemetry by default, unless setting explicitly set in
khoj.yml
2023-05-15 23:26:38 +08:00
Debanjum Singh Solanky
131b8407b5 Allow Khoj Chat to respond to general queries not in reference notes
- Khoj chat will now respond to general queries if:
  1. no relevant reference notes available or
  2. when explicitly induced by prefixing the chat message with "@general"

- Previously Khoj Chat would a lot of times refuse to respond to
  general queries not answerable from reference notes or chat history

- Make chat quality tests more robust
  - Add more equivalent chat response options refusing to answer
  - Force haiku writing to not give any preable, just the haiku
2023-05-12 18:42:40 +08:00
Debanjum Singh Solanky
cc75f986b2 Test text search index only updates on changes to text content 2023-05-12 17:37:34 +08:00
Debanjum Singh Solanky
f9ccce430e Allow configuring OpenAI chat model for Khoj chat
- Simplifies switching between different OpenAI chat models. E.g GPT4
- It was previously hard-coded to use gpt-3.5-turbo. Now it just
  defaults to using gpt-3.5-turbo, unless chat-model field under
  conversation processor updated in khoj.yml
2023-05-03 23:01:13 +08:00
Debanjum Singh Solanky
6b535cc345 Snip prepended heading to avoid crossing model max_token limits
Otherwise if heading > max_tokens than the search models will just see
a heading (with repeated filename) for each compiled entry and not
actual content.

100 characters should be sufficient to include filename (not path) and
entry heading. If longer rather truncate to pass entry unique text to
model for search context
2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky
02aeee60aa Set filename as top heading of org entries for better search context
Previously filename was only being appended to markdown entries.

Test filename getting prepended to compiled entry as heading
2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky
94825a70b9 Set heading of md entries to improve search context for long entries
Otherwise if a markdown entry is longer than max_tokens, the split
entries (apart from first one) do not get their heading context set
2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky
5de04621b5 Set filename as top heading of md entries for better search context
Previously filename was appended to the end of the compiled entry.
This didn't provide appropriate structured context

Test filename getting prepended as heading to compiled entry
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
0e3fb59e09 Entries with no md headings should not get heading prefix prepended
Files with no headings would previously get their entry be prefixed
with a markdown heading prefix (#)
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
45a991d75c Prepend entry heading to all compiled org snippets to improve search context
All compiled snippets split by max tokens (apart from first) do not
get the heading as context.

This limits search context required to retrieve these continuation
entries
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
3386cc92b5 Fix khoj server config update in khoj.el by unquoting list to cl-push to
- cl-push expects a generatlized variable. Else throws (setf quote)
  undefined warning
- This results in the config call failing on calling khoj entrypoint
2023-05-03 15:10:56 +08:00
Debanjum Singh Solanky
948a4274e4 Fix documentation strings and simplify not null checks 2023-05-02 21:47:50 +08:00
Debanjum Singh Solanky
731ef5688f Use cl-pushnew to fix byte-compile errors with using add-to-list 2023-05-02 21:47:38 +08:00
Debanjum Singh Solanky
f046523b33 Improve khoj.el messages to convey state of khoj server
- Remove waiting for server message as it hides the messages from the
  server
- Fix the nil message that were being rendered, by checking before
  showing messages from server
- Consistently prefix messages from khoj with khoj.el
2023-04-28 11:15:13 +08:00
Debanjum Singh Solanky
76df393eb5 Only call khoj server configure API from khoj.el when config updated
Previously khoj.el was calling the server configure API even when
config was same as before.
This had broken the khoj search as you type experience from emacs

Also show more details to user about what in khoj is being configured
2023-04-27 20:45:16 +08:00
Debanjum Singh Solanky
ceae06ae9d Fix khoj.el compilation warnings around unused variables 2023-04-27 20:45:16 +08:00
Debanjum Singh Solanky
8269adf849 Refactor khoj-setup in khoj.el for readability. No functional change 2023-04-27 20:45:00 +08:00
Debanjum Singh Solanky
865d12b6f2 Fix escaping quote in chat references to prevent it breaking out of html 2023-04-27 20:45:00 +08:00
Debanjum Singh Solanky
26cb878327 Add Yarn lockfile for Khoj Obsidian 2023-04-18 00:57:11 +07:00
Debanjum Singh Solanky
e3180d63e6 Sync Khoj Obsidian Tagline with Khoj tagline 2023-04-18 00:56:50 +07:00
Debanjum Singh Solanky
62e6e09521 Release Khoj version 0.6.1 2023-04-17 23:31:35 +07:00
Debanjum Singh Solanky
b079fb31bc Replace Windows path separators in indexName configured via Khoj Obsidian
Resolves #185, #199

- Issue
  IndexName created from Obsidian Absolute Vault path wasn't replacing
  windows path, drive separators with underscore. It was only
  replacing unix path separators

- Fix
  Also replace windows drive and path separators with _ while creating
  IndexName in Khoj Obsidian plugin
2023-04-17 16:55:33 +07:00
Debanjum Singh Solanky
d90df966a9 Make khoj logger use utf-8 encoding when writing to khoj log file
Resolve logger error issue mentioned in #199
2023-04-17 16:55:07 +07:00
Debanjum Singh Solanky
dc3f399f91 Fix to get score associated with SearchResponse in result as string 2023-04-16 20:22:51 +07:00
Debanjum Singh Solanky
d5000c63e1 Update Readmes to use python -m pip install khoj-assistant
Makes it easier to tell pip associated with which python is being
used. Easier to debug when users have different versions of python
installed (e.g 3.10 and 3.11)
2023-04-16 20:17:20 +07:00
Debanjum Singh Solanky
453c84ab79 Add Screenshots of Khoj Chat Interface on Emacs, Obsidian to Readmes 2023-04-07 23:19:47 +07:00
Debanjum Singh Solanky
35aa06067f Release Khoj version 0.6.0
Upload styles.css via release workflow
2023-03-31 18:13:16 +07:00
Debanjum Singh Solanky
5673bd5b96 Keep original formatting in compiled text entry strings
- Explicity split entry string by space during split by max_tokens
- Prevent formatting of compiled entry from being lost
- The formatting itself contains useful information
  No point in dropping the formatting unnecessarily,
  even if (say) the currrent search models don't account for it (yet)
2023-03-30 14:02:46 +07:00
Debanjum Singh Solanky
a2ab68a7a2 Include filename of markdown entries for search indexing
Append originating filename to compiled string of each entry for
better search quality by providing more context to model

Update markdown_to_jsonl tests to ensure filename being added

Resolves #142
2023-03-30 13:51:36 +07:00
Debanjum Singh Solanky
67129964a7 Create Note with Query as title from within Khoj Search Modal
This follows expected behavior for obsidain search modals
E.g Ominsearch and default Obsidian search.

The note creation code is borrowed from Omnisearch.

Resolves #133
2023-03-30 13:51:36 +07:00
Debanjum Singh Solanky
d3257cb24e Style the search result. Use Obsidian theme colors and font-size
Based on PR #135
2023-03-30 12:35:29 +07:00
Debanjum Singh Solanky
40091489c0 For each result: snip it by lines, show filename, remove frontmatter
Based on PR #135
Resolves #134
2023-03-30 12:34:55 +07:00
Debanjum Singh Solanky
240db7b4f0 Add screenshot of Khoj chat on Obsidian to Readme. Fix links 2023-03-30 02:49:05 +07:00
Debanjum Singh Solanky
234be96e53 Fix processor key used to configure chat model in khoj obsidian 2023-03-30 01:47:09 +07:00
Debanjum Singh Solanky
c8c0cfd10e Add Chat features, setup and usage to Khoj Obsidian plugin Readme 2023-03-30 00:32:24 +07:00
Debanjum Singh Solanky
7ecae224e7 Configure OpenAI API Key from the Khoj plugin setting in Obsidian 2023-03-29 23:54:08 +07:00
Debanjum Singh Solanky
3d616c8d65 Use Obsidian font sizes. Improve input field, reference indexing
- Give space in the input field. Too narrow previously
- References should be indexed from 1 instead of 0
- Use Obsidian font size variables to scale fonts in chat appropriately
2023-03-29 22:13:55 +07:00
Debanjum Singh Solanky
23bd737f6b Use chat input element to send message on Enter. No send button required 2023-03-29 22:13:30 +07:00
Debanjum Singh Solanky
81e98c3079 Scroll to bottom of modal on open and message send 2023-03-29 18:12:12 +07:00
Debanjum Singh Solanky
59ff1ae27f Use obsidian theme colors for bg, text. Restrict css namespace via prefix 2023-03-29 18:12:12 +07:00
Debanjum Singh Solanky
001ac7b5eb Style Obsidian Chat Modal like Khoj Chat Web Interface
- Add message sender, date metadata as message footer
- Use css directly from Khoj Chat Web Interface.
  - Modify it to work under a Obsidian modal
  - So replace html, body styling from web interface to instead
    styling new "khoj-chat" class attached to contentEl of modal
2023-03-29 18:12:12 +07:00
Debanjum Singh Solanky
112f388ada Render references next to chat responses by khoj in chat modal 2023-03-28 18:11:03 +07:00
Debanjum Singh Solanky
1d3d949962 Render conversation logs on page load 2023-03-28 14:56:29 +07:00
Debanjum Singh Solanky
cd46a17e5f Add Khoj Chat Modal, Command in Khoj Obsidian to Chat using API 2023-03-28 14:56:29 +07:00
Debanjum Singh Solanky
c0972e09e6 Rename KhojModal to KhojSearchModal, a more specific name for it
In preparation to introduce Khoj chat in Obsidian
2023-03-28 14:56:29 +07:00
Debanjum Singh Solanky
64fff1d372 Release Khoj version 0.5.0 2023-03-28 03:35:59 +07:00
Debanjum Singh Solanky
fc218508f9 Update khoj.el docs and Emacs Readme for chat, simplified setup 2023-03-27 22:02:47 +07:00
Debanjum Singh Solanky
83a7ccd729 Fix docstrings and method ordering in khoj.el 2023-03-27 18:33:09 +07:00
Debanjum Singh Solanky
5c2327ee4f Configure org directories to index from khoj.el
Converts paths to glob style regexes that will index all org files
recursively under the specified list of path

Should help setup for org-roam users from khoj.el
2023-03-27 18:30:53 +07:00
Debanjum Singh Solanky
6e8a40906d Allow disabling automatic server setup. Fix server start vs ready logic
- khoj-auto-setup controls whether to automatically check for and
  setup khoj server from within Emacs
- extract install, start, configure sequence into public, interactive
  method. Allows calling khoj-setup during package load via init.el

- Fix: Do not attempt to configure or wait for server ready if
  user has said no to auto-setup request
- Fix logic to mark server started vs ready
  - Previously the started/running vs ready variables defs were getting
    intertwined
  - Server started indicates server bootup has been triggered
  - Server ready indicates server API ready to accept requests
2023-03-27 17:53:08 +07:00
Debanjum Singh Solanky
526a927bce Fix org entry extraction test, variable prefixed with khoj in khoj.el
Discovered via failing build and test workflows on Github
2023-03-27 16:44:50 +07:00
Debanjum Singh Solanky
7243059507 Track index update asynchronously via moon phase progressbar in khoj.el 2023-03-27 06:01:04 +07:00
Debanjum Singh Solanky
8a9055f918 Restrict server messages show in echo area to main server files 2023-03-27 04:59:55 +07:00
Debanjum Singh Solanky
ae535a06eb Configure Khoj chat using khoj.el by setting OpenAI API key in Emacs 2023-03-27 04:59:54 +07:00
Debanjum Singh Solanky
36b17d4ae0 Generalize the directory from config extraction elisp method 2023-03-27 03:44:03 +07:00
Debanjum Singh Solanky
924424c754 Throw actionable exceptions when content types or chat not configured 2023-03-27 02:47:44 +07:00
Debanjum Singh Solanky
359a2cacef Fix khoj--server-running to work with unconfigured or external server
- If khoj server started outside emacs, khoj--server-ready should be set
to true by khoj--server-running method (instead of waiting for proc msg)

- If khoj server is unconfigured the /config/types endpoint wouldn't
return anything. Using config/data/default allows checking khoj server
running status without requiring it to be configured as well
2023-03-27 02:45:59 +07:00
Debanjum Singh Solanky
d7fb9a596e Auto configure server before loading khoj-menu
If the config hasn't changed there'll be no update. If config has
changed indexing will get triggered asynchronously. But user cannot
make query till indexing done

As easier to know when server ready to configure
2023-03-27 02:44:02 +07:00
Debanjum Singh Solanky
8a21aff438 Make khoj.el server start, stop, restart, setup methods interactive
No need to erase temporary buffers before working on them
2023-03-27 01:53:15 +07:00
Debanjum Singh Solanky
cb40a96c85 Index configured org files from khoj.el
- Set `khoj-org-files-index' to list of files to index
- Defaults to indexing org-agenda-files
- Uses khoj server api to configure org files to index
2023-03-27 01:05:26 +07:00
Debanjum Singh Solanky
50760acc37 Wait for Khoj server to get ready before opening khoj.el transient menu
- Use process filter, sentinel to mark when khoj server is ready or not
- Display server messages for visibility into server boot-up process
- Wait until server ready to open khoj transient menu in Emacs
  Until then khoj features wouldn't work anyway, so avoids confusion
2023-03-26 13:00:01 +07:00
Debanjum Singh Solanky
82eb4bfd0d Setup Khoj server on opening khoj from with Emacs
- Create helper methods to check, stop, restart, setup khoj server
- (Ask to) setup khoj server on calling khoj main entrypoint function
2023-03-26 10:12:06 +07:00
Debanjum Singh Solanky
99d19dcf43 Start Khoj server from Emacs using khoj.el 2023-03-26 09:38:46 +07:00
Debanjum Singh Solanky
c92d79118a Install Khoj server from Emacs using khoj.el 2023-03-26 08:50:03 +07:00
Debanjum Singh Solanky
e281a498b4 Style Khoj search org buffer via elisp instead of in-buffer settings 2023-03-26 06:34:18 +07:00
Debanjum Singh Solanky
4f655d20ae Style Khoj chat directly via elisp instead of via in-buffer settings 2023-03-26 06:03:30 +07:00
Debanjum Singh Solanky
f6ff7b1beb Render foonote reference links as superscript for Khoj Chat on Emacs 2023-03-26 05:33:08 +07:00
Debanjum Singh Solanky
67c850a4ac Add retry logic to OpenAI API queries to increase Chat tenacity
- Move completion and chat_completion into helper methods under utils.py
- Add retry with exponential backoff on OpenAI exceptions using
  tenacity package. This is officially suggested and used by other
  popular GPT based libraries
2023-03-26 05:12:35 +07:00
Debanjum Singh Solanky
ff846f05c5 Clean-up khoj.el based on linting helpers and manual review 2023-03-25 05:47:49 +07:00
Debanjum Singh Solanky
7e36f421f9 Truncate message logs to below max supported prompt size by model
- Use tiktoken to count tokens for chat models
- Make conversation turns to add to prompt configurable via method
  argument to generate_chatml_messages_with_context method
2023-03-25 05:13:56 +07:00
Debanjum Singh Solanky
4725416fbd Use shortcut keybindings in buffer to ease sending messages to Khoj 2023-03-25 05:06:01 +07:00
Debanjum Singh Solanky
508b2176b7 Update Chat API, Logs, Interfaces to store, use references as list
- Remove the need to split by magic string in emacs and chat interfaces
- Move compiling references into string as context for GPT to GPT layer
- Update setup in tests to use new style of setting references
- Name first argument to converse as more appropriate "references"
2023-03-24 22:10:11 +07:00
Debanjum Singh Solanky
b08745b541 Keep chat messages at 1 empty line visible distance in khoj.el
- Clean redundant concat, format string
- Improve variable name to emojified sender
2023-03-24 22:10:11 +07:00
Debanjum Singh Solanky
27217a330d Time chat API sub-components for performance analysis
Time and the search query extraction, search and response generation
components
2023-03-24 20:39:41 +07:00
Debanjum Singh Solanky
5e9558d39d Stylize references shown as footnote links in chat messages
- Render references as superscript
- Show reference definitions on hover over reference links to ease access
- Truncate reference def shown on hover to 70 char
  - Add continuation suffix, ..., when reference definition truncated
2023-03-24 20:38:05 +07:00
Debanjum Singh Solanky
cf28f104c7 Register separate timestamps for user query and response by Khoj Chat 2023-03-24 18:31:58 +07:00
Debanjum Singh Solanky
93e2aff786 Add references as org footnotes instead of links 2023-03-24 18:31:42 +07:00
Debanjum Singh Solanky
d78454d4ad Load Khoj Chat buffer before asking for query to provide context 2023-03-24 13:43:46 +07:00
Debanjum Singh Solanky
863933daaa Resolve build issues found by melpazoid 2023-03-23 02:25:34 +04:00
Debanjum Singh Solanky
e9ca04af0d Require dash, org to run ERT tests for khoj.el 2023-03-23 01:46:26 +04:00
Debanjum Singh Solanky
06df394d6c Style chat messages as org-mode entries in Emacs
- Style Message as Org Entries instead of List
- Put khoj response as child of user query entry
  - Improves color coding for readability
  - Allows folding each back-n-forth
- Put timestamp of message received into property drawer
- Use standardized time format for new and old chat messages
2023-03-22 12:00:43 -06:00
Debanjum Singh Solanky
364e6c11af Render chat history from API in chat buffer on first run
- Generalize the render-chat-response method to handle rendering
  history or chat response from chat API reponse

- Trigger rendering of khoj chat history if Khoj chat buffer not
  created for this session yet
2023-03-22 12:00:35 -06:00
Debanjum Singh Solanky
36b52fdd0a Properly escape reference links before rendering
- Use org-insert-link method to improve link rendering robustness
  Previous simple mechanism to crete org-links would result in links
  escaping out of formating. Use a user-facing org-mode method to
  remove/reduce probability of this

- Replace newlines with space to render reference notes as links
2023-03-22 11:05:38 -06:00
Debanjum Singh Solanky
72f63a6ef7 Add basic chat interface for Khoj on Emacs
- Query khoj chat API to get Khoj Chat response to user message
- Render chat messages as a org-mode list in format:
  - [sender-name]: *[message]*
    - /[receive-date]/
- Add references as org links with context visible on hover,
  but no jump to note
- Require dash library for khoj.el to simplify list manipulation.
  Use `-map-indexed' method from dash
2023-03-22 10:47:55 -06:00
Debanjum Singh Solanky
e4d67694e1 Add search to method, variable names meant for khoj search in khoj.el
In preparation to introduce Khoj chat in Emacs
2023-03-21 21:44:11 -06:00
Debanjum Singh Solanky
2f6284872d Mention Khoj needs Python version 3.10 or lower in docs 2023-03-20 15:18:19 -06:00
Debanjum Singh Solanky
601ff2541b Revert to using GPT to extract search queries from users message
- Reasons:
  - GPT can extract date aware search queries with date filters
    better than ChatGPT given the same prompt.
  - Need quality more than cost savings for now.
  - Need to figure ways to improve prompt for ChatGPT before using it
2023-03-18 17:56:13 -06:00
Debanjum Singh Solanky
e28526bbc9 Extract search queries from users message using ChatGPT as Search Actor
- Reasons
  - ChatGPT should be better at following instructions than GPT
  - At 1/10th the cost, it's much cheaper than using older GPT models
2023-03-18 16:33:24 -06:00
Debanjum Singh Solanky
939d7731da Fix-up Search Actor GPT's response for decoding it as valid JSON 2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky
f63fd0995e Pass more search results as context to Chat Actor to improve inference 2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky
10836dedee Search should return user message if GPT response is not valid JSON
Previously would throw if GPT response is not valid JSON. Better to
return original message to use for search instead
2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky
08f5fb315f Add answers to context for Search Actor to generate relevant queries
Update Search Actor prompt with answers, more precise primer and
two more examples for context

Mark the 3 chat quality tests using answer as context to generate
queries as expected to pass. Verify that the 3 tests pass now, unlike
before when the Search Actor did not have the answers for context
2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky
45cb510421 Loosen search results score thresold used by chat for more context 2023-03-18 16:30:55 -06:00
Debanjum Singh Solanky
d871e04a81 Use past user messages, inferred questions as context to extract questions
- Keep inferred questions in logs
- Improve prompt to GPT to try use past questions as context
- Pass past user message and inferred questions as context to help GPT
  extract complete questions
- This should improve search results quality

- Example Expected Inferred Questions from User Message using History:
  1. "What is the name of Arun's daughter?"
    => "What is the name of Arun's daughter"
  2. "Where does she study?" =>
    => "Where does Arun's daughter study?" OR
    => "Where does Arun's daughter, Reena study?"
2023-03-18 16:30:50 -06:00
Debanjum Singh Solanky
1a5d1130f4 Generate search queries from message to answer users chat questions
The Search Actor allows for
1. Looking up multiple pieces of information from the notes
   E.g "Is Bob older than Tom?" searches for age of Bob and Tom in 2 searches
2. Allow date aware user queries in Khoj chat
   Answer time range based questions
   Limit search to specified timeframe in question using date filter
   E.g "What national parks did I visit last year?" adds
   dt>="2022-01-01" dt<"2023-01-01" to Khoj search

Note: Temperature set to 0. Message to search queries should be deterministic
2023-03-18 16:28:51 -06:00
Debanjum
e75e13d788
Create Tests to Measure Chat Quality, Capabilities
Create Rubric to Test Chat Quality and Capabilities

### Issues
- Previously the improvements in quality of Khoj Chat on changes was uncertain
- Manual testing on my evolving set of notes was slow and didn't assess all expected, desired capabilities

### Fix
1. Create an Evaluation Dataset to assess Chat Capabilities
   - Create custom notes for a fictitious person (I'll publish a book with these soon 😅😋)
   - Add a few of Paul Graham's more personal essays. *[Easy to get as markdown](https://github.com/ofou/graham-essays)*
2. Write Unit Tests to Measure Chat Capabilities
   - Measure quality at 2 separate layers
     - **Chat Actor**: These are the narrow agents made of LLM + Prompt. E.g `summarize`, `converse` in `gpt.py`
     - **Chat Director**: This is the chat orchestration agent. It calls on required chat actors, search through user provided knowledge base (i.e notes, ledger, image) etc to respond appropriately to the users message.  This is what the `/api/chat` API exposes.
   - Mark desired but not currently available capabilities as expected to fail <br />
     This still allows measuring the chat capability score/percentage while only failing capability tests which were passing before on any changes to chat
2023-03-16 11:30:52 -06:00
Debanjum Singh Solanky
7526a50dd4 Extract conversation processor utility funcs from gpt.py into utils.py 2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
24ddebf3ce Make converse prompt more precise. Fix default arg vals in gpt methods
- Set conversation_log arg default to dict
- Increase default temperature to 0.2 for a little creativity in
  answering
- Make GPT be more reliable in looking at past conversations for
  forming response
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
8609e3129e Fix, improve displaying chat messages, sources by Khoj in web interface
Pretty pretty json in conversation logs
2023-03-14 11:24:47 -06:00
Debanjum
6c0e82b2d6
Merge Improve Khoj Chat PR #183 from debanjum/improve-chat-interface
# Improve Khoj Chat
## Main Changes
- Use the new [API](https://openai.com/blog/introducing-chatgpt-and-whisper-apis) for [ChatGPT](https://openai.com/blog/chatgpt) to improve conversation quality and cost
- Improve Prompt to answer query using indexed notes
  - Previously was asking GPT to summarize the notes
  - Both the chat and answer API use this new prompt
- Support Multi-Turn conversations
  - Pass previous messages and associated reference notes to ChatGPT for context
- Show note snippets referenced to generate response
  - Allows fact-checking, getting details
- Simplify chat interface by using only single unified chat type for now

## Miscellaneous
- Replace summarize with answer API. Summarize via API not useful for now
- Only pass Khoj search results above a threshold confidence to GPT for context
  - Allows Khoj to say don't know if it can't find answer to query from notes
  - Allows relying on (only) conversation history to generate response in multi-turn conversation
- Move Chat API out of beta. Update Readme
2023-03-10 19:03:44 -06:00
Debanjum Singh Solanky
cccd225247 Deduplicate and simplify logic to render chat message with reference 2023-03-10 18:58:11 -06:00
Debanjum Singh Solanky
b9caad458e Type score_threshold with union, not |, to support python <3.10 2023-03-10 18:58:11 -06:00
Debanjum Singh Solanky
a71f168273 Move the chat API out of beta. Save chat sessions at 15min intervals 2023-03-10 17:20:52 -06:00
Debanjum Singh Solanky
8bb8824d0c Bump khoj versions in obsidian, emacs files 2023-03-10 15:23:17 -06:00
Debanjum Singh Solanky
e16d0b6d7e Open references notes used for chat on mobile too (by clicking)
Requires clicking the reference as hover doesn't work on mobile
2023-03-09 17:13:07 -06:00
Debanjum Singh Solanky
c3c7b8a951 Make Khoj chat a separate Progressive Web App (PWA) for easier access 2023-03-09 13:45:06 -06:00
Debanjum Singh Solanky
3838f9d8e3 Remove explicitly asking GPT to say I don't know in prompt for now
GPT still mostly says I don't know when answer not in notes or chats

But with this its more inclined to answer general questions not in
chats or notes while informing user that the information is not from
existing chats or notes
2023-03-09 12:11:44 -06:00
Debanjum Singh Solanky
f7b8cdd02e Log prompts being passed to GPT for debugging 2023-03-08 19:17:52 -06:00
Debanjum Singh Solanky
2739a492b4 Log message metadata along with Khoj message instead of user message
References should be attached to khoj chat messsage rather than the
users message in the chat interface
2023-03-08 19:16:24 -06:00
Debanjum Singh Solanky
87d1e1341d Show reference notes used as response context in chat interface 2023-03-08 19:16:24 -06:00
Debanjum Singh Solanky
280061e1fa Do not deduplicate search results used for chat context
- Chat uses compiled form of search results, not the raw entries to
  provide context for chat. The compiled snipped search results
  themselves are unique and using multiple of them for context from
  the same raw note is fine if they cross the score and rank thresholds

  This should improve the context provided for chat

- Also apply score_threshold, no deduplication to the answers API
2023-03-06 23:51:31 -06:00
Debanjum Singh Solanky
672f61529e Make getting deduped search results configurable via Search API 2023-03-06 23:48:46 -06:00
Debanjum Singh Solanky
4fb628975c Fix jumping to note from Khoj Obsidian search modal result on Windows
- Issue
  The file path separator by khoj server and the Obsidian vault were
  different on Windows
- Fix
  Normalize file path to use forward slash(/) to find the matching
  note file in the Obsidian vault for jump to it

Resolves #177
2023-03-05 21:07:54 -06:00
Debanjum Singh Solanky
b6cdc5c7cb Do not expose answer API as a chat type in chat web interface or API
Answer does not rely on past conversations, just the knowledge base.
It is meant for one off interactions, like search rather than a
continuing conversation like chat

For now it is only exposed via API. Later it will be expose in the
interfaces as well

Remove ability to select different chat types from the chat web
interface as there is only a single chat type

Stop appending answers to the conversation logs
2023-03-05 18:21:59 -06:00
Debanjum Singh Solanky
7f994274bb Support multi-turn conversations in chat mode
- Only use decent quality search results, if any, as context
- Pass source results used by previous chat messages as context
- Loosen prompt to allow looking at previous chats and notes to answer
- Pass current date for context

- Make GPT provide reason when it can't answer the question. Gives
  user context to tune their questions
2023-03-05 18:21:39 -06:00
Debanjum Singh Solanky
d73042426d Support filtering for results above threshold score in search API 2023-03-05 18:21:39 -06:00
Debanjum Singh Solanky
45f461d175 Keep search results passed to GPT as context in conversation logs
This will be useful to
1. Show source references used to arrive at answer
2. Carry out multi-turn conversations
2023-03-05 16:00:19 -06:00
Debanjum Singh Solanky
7cad1c9428 Only use past chat message, not session summaries as chat context
Passing only chat messages for current active, and summaries
for past session isn't currently as useful
2023-03-05 16:00:18 -06:00
Debanjum Singh Solanky
ad1f1cf620 Improve and simplify Khoj Chat using ChatGPT
- Set context by either including last 2 chat messages from active
  session or past 2 conversation summaries from conversation logs

- Set personality in system message
- Place personality system message before last completed back & forth
  This may stop ChatGPT forgetting its personality as conversation progresses given:
  - The conditioning based on system role messages is light
  - If system message is too far back in conversation history, the
    model may forget its personality conditioning
  - If system message at end of conversation, the model can think its
    the start of a new conversation
  - Inserting the system message before last completed back & forth should
    prevent ChatGPT from assuming its the start of a new conversation
    while not losing personality conditioning from the system message

- Simplfy the Khoj Chat API to for now just answer from users notes
  instead of trying to infer other potential interaction types.
  - This is the default expected behavior from the feature anyway
  - Use the compiled text of the top 2 search results for context

- Benefits of using ChatGPT
  - Better model
  - 1/10th the price
  - No hand rolled prompt required to make GPT provide more chatty,
    assistant type responses
2023-03-05 01:24:13 -06:00
Debanjum Singh Solanky
9d42b5d60d Use multiple compiled search results for more relevant context to GPT
Increase temperature to allow GPT to collect answer across multiple
notes
2023-03-05 01:24:13 -06:00
Debanjum Singh Solanky
c3b624e351 Introduce improved answer API and prompt. Use by default in chat web interface
- Improve GPT prompt
  - Make GPT answer users query based on provided notes instead
    of summarizing the provided notes
  - Make GPT be truthful using prompt and reduced temperature
  - Use Official OpenAI Q&A prompt from cookbook as starting reference
- Replace summarize API with the improved answer API endpoint
- Default to answer type in chat web interface. The chat type is not
  fit for default consumption yet
2023-03-05 01:24:13 -06:00
Debanjum Singh Solanky
7184508784 Mention Python and Pip need to be installed in Main and Emacs Readme 2023-03-02 21:28:54 -06:00
Debanjum Singh Solanky
211e460398 Output date filter from cache log at debug level. Remove unused imports
Other logs not directly useful to user have already been converted
to debug log levels in 1ae4016. Just forgot to convert this log line too
2023-03-02 15:41:32 -06:00
Debanjum Singh Solanky
b6dbe4dd1d Do not try retrieve an unconfigured core content type in Config GUI
Previous behavior was resulting in a null reference error. As key for
the core content/search type was not present in current config

Fallback to using default config for unconfigured core content type
instead

See #165 for details
2023-03-02 11:09:31 -06:00
Debanjum Singh Solanky
1ae40163a9 Show user friendly information logs by default for context
- Use emojis to make info logs easier to read
- Inform when khoj is ready to use
- Provide information on what khoj is doing while starting up
- Inform when content/search types and processors are setup
- Inform when models are being loaded from the web as this step can
  take time
- Convert all other info logs to be only shown in verbose mode
2023-03-01 16:39:07 -06:00
Debanjum Singh Solanky
fe03ba3dce Index intro text before headings in org files
- Text before headings was not being indexed due to buggy orgnode
  parsing logic
- Resolved indexing intro text from files with and without headings in
  them
- Ensure intro text node has heading set to all title lines collected
  from the file

Resolves #165
2023-03-01 12:11:33 -06:00
Debanjum Singh Solanky
7ad251b8ef Log and Continue on OSError while collating dates for date filters
Log to understand if error, date can be handled better
Mitigates #172
2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky
2bed4c3b50 Fix configuring search types & /config/types API when no plugin configured
- Test /config/types API when no plugin configured, only plugin configured
  and no content configured scenarios
- Do not throw null reference exception while configuring search types
  when no plugin configured
- Do not throw null reference exception on calling /config/types API
  when no plugin configured

Resolves bug introduced by #173
2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky
8914dbd073 Fix creating GUI panels for unconfigured search, processor types
Repro:
1. Open khoj server with `khoj` on first run
2. Install/enable Khoj Obsidian plugin (to configure khoj server)
3. Restart khoj server with `khoj`

Bug:
- Unconfigured processor and search_types are instantiated as None in
  self.current_config
- While creating the desktop GUI, these null configs are attempted to
  be accessed as valid dictionaries for creating their GUI panels
- This results in the null ref errors

Fix:
Use default config to create their GUI elements for unconfigured
search and processor types

Resolves #167
2023-03-01 01:20:58 -06:00
Debanjum Singh Solanky
b09350c052 Fix to return only enabled content types via the new config/types API
- Previously was return all core content types even if they had not been
  setup
- Add test to validate only configured content types are returned by
  the api/config/types API endpoint
2023-02-28 22:08:26 -06:00
Debanjum Singh Solanky
b177adf3a7 Return value of search_type in /config/type API endpoint
- Remove need for interfaces to downcase content types returned by API
  before using the type in search and other API endpoint
- Fix to check for search_type.name in plugin keys instead of value
2023-02-28 21:49:26 -06:00
Debanjum Singh Solanky
88344f9ed2 Improve rendering search results of plugin content types on web interface
Render only the entry from plugin search response instead of raw json
Use the results-ledger styling for results-plugin styling
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
c2814fce58 Improve rendering search results of plugin content types in khoj.el
Render only the entry from plugin search response instead of raw json
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
f3f24387ec Use new config/types API to set enabled content types on web interface 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
1e43f1a12e Use new config/types API to set enabled content types in khoj.el menu 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
9d38eadd42 Return enabled content types via api/config/types API endpoint
Simplifies dynamically populating enabled content types for interfaces
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
68bd5d9ebc Configure API routes after set up search types while configuring server
Configure app routes after configuring server.
Import API routers after search type is dynamically populated.
Allow API to recognize the dynamically populated plugin search types
as valid type query param.
Enable searching for plugin type content.
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
d91c7e2761 Search for plugin content via the search API 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
47b58a2a4d Configure, use dynamically instantiated SearchType enum on app start
The SearchType is now dynamically populated with core and configured
plugin types

Use the new dynamic SearchType enum from state.py across codebase
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
ab0d3a08e2 Index configured plugins on app start and via update API endpoint 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
55a032e8c4 Add processor to index entries from jsonl files for plugins
- Read, merge entries from input jsonl files and filters
- Mark new, modified entries for update
2023-02-24 02:54:12 -06:00
Debanjum Singh Solanky
fcbbe8c759 Read content plugin configs from Khoj config YAML
Configure external text content plugins via the Khoj YAML
Reuse existing TextContentConfig definition for external text content plugins
2023-02-23 23:57:32 -06:00
Debanjum Singh Solanky
61b6ee2857 Use helper script to bump khoj pre-release versions 2023-02-17 20:31:51 -06:00
Debanjum Singh Solanky
053d6141f3 Ignore ts typing error, Fix SPDX license identifier in Obsidian plugin 2023-02-17 18:19:01 -06:00
Debanjum Singh Solanky
36be3c4b8f Fix or ignore MyPy issues in PyQt desktop GUI code
- Remove unneeded type ignore for mps with the latest mypy
- Stop excluding PyQT desktop GUI code from MyPy checks
- Do not warn about unused ignores. Some issue with mypy giving
  different errors in different environments (venv, system and pre-commit)
2023-02-17 16:13:05 -06:00
Debanjum Singh Solanky
051f0e3fb5 Add, configure and run pre-commit locally and in test workflow 2023-02-17 13:31:36 -06:00
Debanjum Singh Solanky
5e83baab21 Use Black to format Khoj server code and tests 2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky
8b293edd7c Move mypy config into pyproject.toml. Ignore 2 remaining mypy issues 2023-02-16 03:33:08 -06:00
Debanjum Singh Solanky
c641eb4ad6 Improve rendering log and error stacktraces using the Rich package
- Use Rich to render uvicorn, fastAPI logs as well
  The previous CustomFormatter only worked on khoj logs
- Improve rendering stacktrace on errors using Rich
2023-02-15 16:19:32 -06:00
Debanjum Singh Solanky
bc7477ea3e Move Emacs, Obsidian plugin code out from under src/khoj directory
- What
  - The Emacs and Obsidian interfaces stay in their original
    directories under src/
  - src/khoj now only contains code meant for pypi packaging

- Benefits
  - This avoids having to update khoj MELPA, Obsidian plugin config as
    the Emacs, Obsidian code is under their original directories
  - It separates the code in src/khoj meant for python packaging from
    code for external interfaces like Emacs and Obsidian
2023-02-14 15:44:22 -06:00
Debanjum Singh Solanky
25a749ca1d Use the src/ layout to fix packaging Khoj for PyPi
- Why
  The khoj pypi packages should be installed in `khoj' directory.
  Previously it was being installed into `src' directory, which is a
  generic top level directory name that is discouraged from being used

- Changes
 - move src/* to src/khoj/*
 - update `setup.py' to `find_packages' in `src' instead of project root
 - rename imports to form `from khoj.*' in complete project
 - update `constants.web_directory' path to use `khoj' directory
 - rename root logger to `khoj' in `main.py'
 - fix image_search tests to use the newly rename `khoj' logger
 - update config, docs, workflows to reference new path `src/khoj'
2023-02-14 15:19:06 -06:00
Debanjum
84322b2a45
Demo using Search in Khoj Obsidian Plugin 2023-02-14 08:43:50 -08:00
Debanjum Singh Solanky
a4dcb20622 Add setting to toggle auto configuring of khoj backend from Obsidian
- By default the obsidian plugin automatically configures the khoj
  backend to index the current vault
- For more complex scenarios, users can manage their ~/.khoj/khoj.yml
  manually by toggling the auto-configure setting off in the khoj
  plugin settings

Resolves #156
2023-02-13 20:15:28 -06:00
Debanjum Singh Solanky
24aa696ef5 Indicate indexing active on Update button in Obsidian plugin settings
Use moon rotating through phases to indicate notes indexing in progress

Resolves #129
2023-02-13 19:28:19 -06:00
Debanjum Singh Solanky
11517ba8eb Encode jsonl data as utf8 for gzip write for consistent read/write encoding
Should help with issue #89
2023-02-12 17:33:23 -06:00
Debanjum Singh Solanky
3ec41c4d64 Wrap lines for org, markdown results in khoj search results buffer 2023-02-12 07:33:50 -06:00
Debanjum Singh Solanky
9a013ec48f Add more details to setup Khoj backend in Obsidian plugin readme 2023-02-12 07:31:13 -06:00
Jason Axelson
6d5930363a Fix obsidian plugins doc link
Also make it more obvious where the link is going, initially I thought
the link was to another official khoj documentation site.
2023-02-10 07:11:21 -10:00
Debanjum Singh Solanky
215235efd2 Bump khoj pre-release version 2023-02-08 20:24:36 -03:00
Debanjum Singh Solanky
2445664d40 Deprioritize searching for Music content over other text content 2023-02-07 02:41:31 -03:00
Debanjum Singh Solanky
2e052913b6 Search in first configured content type when no search type set
Instead of searching through all configured content types but only
returning results of the last configured content type
2023-02-07 02:41:31 -03:00
Debanjum Singh Solanky
a26ab31d20 Allow chat with markdown notes if no org-mode content configured 2023-02-07 02:41:31 -03:00
Debanjum Singh Solanky
fbb7747dcc Read Markdown file as utf8 instead of the default encoding used by OS
- Background
  1. Obsidian stores markdown notes as utf8[1]
  2. By default, the python `open' command uses the OS locale encoding[2]

  This was causing the `UnicodeDecodeError: <locale_encoding> codec can't decode byte' error

- Fix
  - Read markdown files as utf8
    The Obsidian plugin is the main use-case for markdown files in
    khoj currently and that stores md files as utf8.
    Do not assume utf8 for other content types like org-mode, beancount for now.
  - Fail if error in reading file as utf8, instead of ignoring errors.
    Would rather have user realize that their files are not going to
    get indexed correctly.

[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
2023-02-06 21:04:50 -03:00
Debanjum Singh Solanky
66dca6cf33 Add Docs to Search across Languages, Uninstall Khoj to Readme
Add details and fixes to Obsidian, Main readme
based on feedback, confusion from the Obsidian plugin announcement
2023-02-06 21:04:50 -03:00
Debanjum Singh Solanky
cba9a6a703 Use List, Tuple, Set from typing to support Python 3.8 for khoj
Before Python 3.9, you can't directly use list, tuple, set etc for
type hinting

Resolves #130
2023-02-06 01:23:52 -03:00
Debanjum Singh Solanky
f26cee604d Update Khoj Plugin Install Instructions. Rename main Readme to README
Khoj plugin page from within Obsidian isn't recognized. Seems like it
needs an uppercase readme file only. So it doesn't show the Khoj
readme from within Obsidian itself.
2023-01-27 20:01:31 -03:00
Debanjum Singh Solanky
2e13e15625 Ensure markdown entries in khoj.el results separated by empty line
- Update khoj.el test to reflect updated rendering logic
- Move ledger render function before image rendered to group functions
  with similar logic closer
2023-01-26 19:13:02 -03:00
Debanjum Singh Solanky
85ae46f429 Use thread_last to make results rendering funcs more readable in khoj.el 2023-01-26 18:59:44 -03:00
Debanjum Singh Solanky
b415f87093 Split code in onChooseSuggestion method to make it more readable
Split find file, jump to file code to make onChooseSuggestion more readable
- Use find, instead of using return in forEach to get first match
- Move the jump to file+heading code out from forEach
2023-01-26 18:26:24 -03:00
Debanjum Singh Solanky
37063f6a38 Truncate query to 8k chars for find similar notes from obsidian plugin
Truncate current file data passed to khoj backend API via query string
below default query size supported by popular servers
2023-01-26 18:26:24 -03:00
Debanjum Singh Solanky
4456cf5c8f No need to use then or finally in async functions after an await 2023-01-26 18:26:24 -03:00
Debanjum Singh Solanky
4070be637c Pass app object from plugin instance to child objects and functions
Do not reference global app object from child objects and funcs
directly.

It is only available for debugging purposes and access to it maybe
dropped in the future.
2023-01-26 18:26:24 -03:00
Debanjum Singh Solanky
c203c6a3fd Use Sentence case for Find Similar Note command name in Khoj Obsidian 2023-01-26 18:26:24 -03:00
Debanjum Singh Solanky
e18124ef6f Add badge for tests and update project subtitle in khoj.el Readme 2023-01-23 20:52:03 -03:00
Debanjum Singh Solanky
86e808abfb Test get-current-text helpers for Find Similar feature in khoj.el 2023-01-23 20:33:47 -03:00
Debanjum Singh Solanky
be6acda212 Create khoj.el tests. Test rendering results of each content types 2023-01-23 20:33:47 -03:00
Debanjum Singh Solanky
0d0bf3b5aa Simplify get-current-text functions for Find Similar in khoj.el
Use existing functions like `string-trim', `thing-at-point' and
remove unneeded code from the two functions
2023-01-23 19:15:52 -03:00
Debanjum Singh Solanky
07e9e4ecc3 Get current paragraph text when point at start of paragraph in khoj.el
Previously if cursor was at start of current paragraph, it would get
text for the current and next paragraph, instead of just the current one
2023-01-23 18:05:54 -03:00
Debanjum Singh Solanky
a0b03c8bb1 Get current entry text when point at heading for Find Similar in khoj.el
Previously if cursor was at heading of current entry, it would find entries
similar to the previous outline heading, instead of the current one
2023-01-23 10:01:25 -03:00
Debanjum Singh Solanky
013c7c10a4 Bump khoj pre-release version 2023-01-22 18:45:56 -03:00
Debanjum Singh Solanky
ad3c9b5f44 Bump khoj version to 0.2.5 in preparation for release 2023-01-22 18:18:21 -03:00
Debanjum Singh Solanky
9ed056c7e7 Use consistent indentation in Khoj Emacs Readme 2023-01-22 18:04:12 -03:00
Debanjum Singh Solanky
0980c6e87f Update Emacs Usage section in Readme. Add find-similar, menu usage 2023-01-22 18:04:12 -03:00
Debanjum Singh Solanky
6908b6eed3 Truncate image queries below max tokens length supported by ML model
This would previously return the infamous tensor size mismatch error
Verify this error is not raised since adding the query truncation logic
2023-01-21 14:11:00 -03:00
Debanjum Singh Solanky
3d9ed91e42 Search by image at path only if query of form "file:/path/to/image"
Previously no query syntax helpers, like the "file:" prefix, were used
before checking if query contains file path.

This made query to image search brittle to misinterpretation and
pointless checking

Add test to verify search by image at file works as expected
2023-01-21 14:06:56 -03:00
Debanjum Singh Solanky
b7aa22a059 Change order of arg passed to query-api-and-render-results by importance 2023-01-20 22:13:24 -03:00
Debanjum Singh Solanky
936a88fa7e Find items of specified type similar to current text item at point
- Support querying with text surrounding point in any text buffer
  Previously could only find items similar to org entry at point

- Find similar items of specified content type indexed on khoj
  Previously only looked for similar org entries indexed on khoj

  Now uses the content-type configured in khoj transient menu to find
  items of the specified content type

- Details
  - Generalize the get-current-org-entry-text func to get text for any
    outline section
  - Replace leading whitespaces from query text as well
  - Create method to get current paragraph text from non-outline mode
    buffers
  - Update transient, find-similar funcs to pass, use content-type
    configured in khoj transient menu
  - Generalize query title creation logic to remove markdown headings
    prefix (#) apart from org heading prefix (*) as well
  - Update last used khoj content-type and results from the
    find-similar and update funcs for later reuse
  - Jump to top of results buffer after results rendered
2023-01-20 22:12:54 -03:00
Debanjum Singh Solanky
17aaadea1f Find notes similar to current org entry at point 2023-01-20 05:14:54 -03:00
Debanjum Singh Solanky
44bbc0a417 Add section separators to khoj.el for easier code traversal 2023-01-19 23:36:54 -03:00
Debanjum Singh Solanky
48ad3c535e Use default content types if fail to call backend on khoj.el load
Do not want khoj.el to fail on init/load if khoj backend not running
2023-01-19 20:13:49 -03:00
Debanjum Singh Solanky
9f0bd0a361 Add Github workflow for khoj.el build and quality checks
Add khoj.el build badge to khoj.el Readme
2023-01-19 20:13:19 -03:00
Debanjum Singh Solanky
0dd1cba272 Rename configuration sections in khoj.el transient menu 2023-01-19 03:03:08 -03:00
Debanjum Singh Solanky
5d0f369186 Add ability to quit khoj transient with standard q keybinding 2023-01-19 02:47:07 -03:00
Debanjum Singh Solanky
87c7cf4272 Use single khoj func as entrypoint. Group khoj.el code into sections
- Give more relevant, specific name to khoj suffix commands
- Remove `khoj-simple'. Have single `khoj' function for entrypoint
2023-01-19 02:38:19 -03:00
Debanjum Singh Solanky
9d64a009fd Allow updating khoj content index from within khoj.el
- Split transient config menu by type
2023-01-18 23:07:59 -03:00
Debanjum Singh Solanky
a8d0c7d905 Rename search type to more apt content type in khoj.el 2023-01-18 22:13:49 -03:00
Debanjum Singh Solanky
00daea16df Allow setting default-search-type to image. Make docstrings compact 2023-01-18 22:01:17 -03:00
Debanjum Singh Solanky
216b17cfd0 Dynamically populate content type choices when khoj transient invoked 2023-01-18 22:00:56 -03:00
Debanjum Singh Solanky
5f446b1440 Convert main khoj.el entrypoint into transient menu for richer configuration 2023-01-18 21:50:07 -03:00
Debanjum Singh Solanky
5c07dcd219 Fix, update Obsidian Readme. Add Find Similar Notes to Implementation section 2023-01-18 00:22:26 -03:00
Debanjum
b7fc344be1
Search for Similar Notes from Obsidian Plugin
Enable searching for notes similar to the current note being viewed

## Main Changes
- 39a18e2 Extend search modal to search for similar notes
  - Hide input field on init, Trigger search on opening modal when in similar notes mode
  - Set input to contents of current markdown file and get notes similar to it
  - Re-rank, by default, when searching for similar notes
  - Filter out current note from similar note search results
- 0bed410 Only show `Find Similar Note' command in Editor
2023-01-18 00:10:10 -03:00
Debanjum Singh Solanky
6119d0a69e Add usage of "Find Similar Notes" command to the Khoj Obsidian Readme 2023-01-18 00:03:13 -03:00
Debanjum Singh Solanky
657e455785 Remove unused `onunload' method in main.ts of khoj obsidian plugin 2023-01-17 23:46:38 -03:00
Debanjum Singh Solanky
0bed410712 Limit Find Similar Note command to be triggered from Editor
Fixup indentation and comments
2023-01-17 19:34:48 -03:00
Debanjum Singh Solanky
39a18e2080 Add ability to search for similar notes in Khoj Obsidian
- Hide input field on init, Trigger search on opening modal in similar notes mode
- Set input to current markdown file and get similar notes to it
- Enable rerank when searching for similar notes
- Filter out current note from similar note search results
2023-01-17 19:07:18 -03:00
Debanjum Singh Solanky
ffaef92476 Encode query string before passing as query param to search API 2023-01-17 18:04:11 -03:00
Debanjum Singh Solanky
d5a7cc5b0f Compact code to map results from search API into SearchResult objects
Make code compact for readability
Remove unneeded temporary variables and return statements
2023-01-17 18:04:11 -03:00
Debanjum Singh Solanky
8ab7a26bde Update Khoj on Obsidian screenshots in Main and Plugin Readme
- Screenshot querying "Setup Editor" on test vault with Khoj Readmes
- New features showcase:
  - information keybindings, rerank keybinding at bottom of modal
  - fixed top level headings in search results
  - search results snipped if greater than N words
2023-01-17 13:58:50 -03:00
Debanjum Singh Solanky
7b4f78776c Fix extracting Markdown Entries with Top Level Headings
- Previously top level headings would have get stripped of the
  space between heading text and the prefix # symbols. That is,
  `# Top Level Heading' would get converted to `#Top Level Heading'
- This would mess up their rendering as a heading in search results

- Add unit tests to text_to_jsonl processors to prevent regression
2023-01-17 13:06:28 -03:00
Debanjum Singh Solanky
1a296518c5 Limit total words for each Search Result rendered in search modal
Provides a more consistent rendering of results in modal.
Makes it easier to see more results in modal.
To see complete entry, user can always just jump to entry from modal
2023-01-17 13:06:14 -03:00
Debanjum Singh Solanky
e7b89f7fd0 Return compiled entry in additional details of /api/search response
This can be used to highlight portion of raw entry to highlight and
for passing to summarizer to stay with max_tokens limit supported by
GPT models
2023-01-16 22:56:06 -03:00
Debanjum Singh Solanky
7071d081e9 Increase max_tokens returned by GPT summarizer. Remove default params 2023-01-16 22:55:36 -03:00
Debanjum Singh Solanky
3d9cdadbbb Add codebase visualization of Khoj Obsidian to Khoj Obsidian Readme 2023-01-15 14:09:21 -03:00
Debanjum Singh Solanky
d02ba325aa Handle empty chat history returned by API to chat.html on web interface 2023-01-15 13:51:16 -03:00
Debanjum
3f2ea039a7
Add Chat page to the Khoj Web Interface
### Overview
- Provide a chat interface to engage with and inquire your notes
- Simplify interacting with the beta `chat` and `summarize` APIs

### Use
- Open `<khoj-url>/chat`, by default at http://localhost:8000/chat?type=summarize
- Type your queries, see summarized response by Khoj from your notes

**Note**:
 - **You will need to add an API key from OpenAI to your khoj.yml**
 - **Your query and top note from search result will be sent to OpenAI for processing**

## Details
- 177756b Show chat history on loading chat page on web interface
- d8ee0f0 Save chat history to disk for persistence, seeing chat logs
- 5294693 Style chat messages as speech bubbles
- d170747 Add khoj web interface and chat styling to new chat page on khoj web
- de6c146 Implement functional, unstyled chat page for khoj web interface
2023-01-13 23:02:19 -03:00
Debanjum Singh Solanky
16d4560ff8 Comment css styling of chat page for later reference 2023-01-13 22:40:01 -03:00
Debanjum Singh Solanky
cfef346d03 Do not update query field to ever chat message
It doesn't work as well with chat, unlike for search page
Use more appropriate thinking face emoji for you instead of surprise face
2023-01-13 22:24:26 -03:00
Debanjum Singh Solanky
177756be7e Fetch chat history from backend and render it on chat page load 2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
330febaa1a Update conversation logs from /beta/summary API endpoint too 2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
cb6f0b53c9 Make user_message_metadata arg to message_to_log in gpt.py optional
- Use a default user_message_metadata if arg not set
- Update conversation to use `by' as `you' and `khoj'
2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
cc2456e411 Update /beta/chat API to return chat history if no query param passed 2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
d8ee0f0e9a Use scheduler to save chat history to disk every 5 minutes
- The previous mechanism to trigger saving on shutdown event did not work
- Use scheduler to persist chat sessions to disk at a 5 minute interval
  - This improve time granularity, fixed interval of saving chat logs
  - It may lose ~5 minutes of chat history until mechanism to also
    write on shutdown found/resolved
- Create conversation directory if it doesn't exist before attempting write
- Reset chat_session after writing it to disk
2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
5294693e97 Style message as speech bubbles on chat page of web interface
- Wrap messages into speech bubbles
  - Color messages by khoj blue, sender grey
  - Add those standard protrusions to the speech bubbles for fun

- Align bubbles left or right based on sender
  - messages by khoj are left aligned, message by self are right aligned

- Put message metadata like sender and time under speech bubble
  - use data-* attribute and ::after css pseudo-selector for this

- Update renderMessage func to accept time param, remove unused type_ param
2023-01-13 22:01:57 -03:00
Debanjum Singh Solanky
7723d656dc Do not force GPT to summarize note using past tense
Not all notes are in the past. Notes can be about stuff in the future.
Casting them to past tense gives the impression that they've already
happened / been done.
2023-01-13 13:10:35 -03:00
Debanjum Singh Solanky
2842e3a035 Automatically scroll to bottom of chat body on new messages 2023-01-13 13:09:51 -03:00
Debanjum Singh Solanky
34014635d0 Improve colors, fix contrast for accessability on web interface
- Changes
  - Use blue color for khoj heading font
    - This fixes the title color issue

  - Update background to lighter shade
    - This fixes the body text color issue

  - Update colors for todo, done, miscellaneous todo state, tag color
    - This does not fix the color contrast issue but seems like an acceptable solution
    - Using white text rather than black text on blue background
      better even though the black text on blue background passes the
      WCAG acceptable contrast score
    - For details see blog post:
      https://uxmovement.com/buttons/the-myths-of-color-contrast-accessibility/

  - Add border to tags to give them tag pills look and differntiate
    from todo states

  - Buttons and inputs
    - Change background color of input fields like type dropdown,
      update button and results count counter, to match background
      color of page
    - Add shadow on hover over button, dropdowns

Resolves #111
2023-01-12 21:59:50 -03:00
Debanjum Singh Solanky
d170747ec2 Add khoj web interface & chat styling to new chat page on khoj web
- Ensure message input box sticks to bottom of screen
- Ensure chat logs div is scrollable when logs become longer than screen
  Do not make the whole page scroll, just the chat logs body div
2023-01-12 21:58:46 -03:00
Debanjum Singh Solanky
de6c146290 Implement functional, unstyled chat page for khoj web interface
Expose it at /chat URL
2023-01-12 21:53:25 -03:00
Debanjum Singh Solanky
e6793816f9 Upgrade Khoj.el Readme. Add TOC, Screenshot, Features Sections
- Update Query filter details
2023-01-12 02:14:02 -03:00
Debanjum Singh Solanky
26f791e9ad Update Obsidian Plugin Readme. Add Khoj icon to Khoj Modal Placeholder text
- Fold Query Filter, Demo Description
- Add Limitations to Readme
- Add *Update* index bullet to Troubleshooting Options
2023-01-12 01:48:52 -03:00
Debanjum Singh Solanky
3e63af5c94 Constrain grid rows to fix layout of Khoj web interface on Chrome 2023-01-12 01:48:52 -03:00
Debanjum Singh Solanky
50c797962c Jump to Search Result from Khoj Modal even on Obsidian Android
Uses longest file path match to find markdown file in vault
corresponding to file of search result returned by Khoj

Allow jumping to search result from khoj plugin modal on Android too
2023-01-11 19:44:11 -03:00
Debanjum Singh Solanky
51ea6d9c9b Do not force index update when configure backend on plugin load
- Backend can handle incremental updates
- Avoid khoj usability delay by avoiding recomputed everytime vault opened
2023-01-11 17:17:08 -03:00
Debanjum Singh Solanky
5996d47d7c Trigger input event to Get, Render Reranked results from Khoj backend
Previous mechanism of manually triggering getSuggestions,
renderSuggestions flow was corrupting traversing and opening
reranked search results in KhojModal

Emulate event that would anyway trigger the get & render of results in
modal. This lets obsidian core handle the flow without digging too
deep into obsidian cores handling of the flow. Lowers the chance of
breakage
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
1c813a6884 Convert results count setting to slider in plugin settings pane 2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
4e1abd1b72 Disable update button while indexing vault in plugin settings 2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
513c86c6a1 Set index file paths relative to current or default path on khoj backend
We need the index file paths to make sense on the khoj backend server

Having path of index on backend relative to current vault directory
on frontend ignores the fact that the frontend maybe on a different
machine than the khoj backend server

Using unique index name per vault allows switching vaults without
overwriting indices of other vaults created on khoj backend when khoj
obsidian plugin is loaded on opening a different vault
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
4407e23c19 Only index current vault on Khoj. Remove plugin setting to configure it
- Overview
  Limits using Khoj with a single vault at a time. This is
  automatically configured to the most recently opened vault.

  Once directory filters are supported on backend, the plugin will be
  updated to index multiple vault but search only current vault from
  current vaults khoj obsidian plugin

- Code Details
 - Remove setting to configure Vault directory from Khoj Obsidian plugin
 - Automatically configure Khoj to index only current Vault.
 - Overwrites any previous vaults that were intended to be indexed by
   Khoj backend
 - Force update of index after configuring vault

- Why
  It's not helpful for now and can lead to more problems, confusion.
  Once directory filters
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
86a1e43605 Return HTTP Exception on /api/update API call failure
- Previously the backend was just throwing backend error.
  The frontend calling the /update API wasn't getting notified
- Now the frontend can react appropriately and make the issue
  visible to the user
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
5af2b68e2b Update plugin notifications for errors and success
- Only show notification on plugin load and failure.
- In settings page, set current backend status at top of pane instead
  of showing notification
  Notices bubbles cluttered the UI while typing updates to settings
- Show notification once index updated via settings pane button click
  There was no notification on index updated, which usually takes time
  on the backend
2023-01-11 16:39:23 -03:00
Debanjum Singh Solanky
853192932a setCTA on Khoj Obsidian plugin button. Minor cleanup of space, tabs 2023-01-10 23:36:02 -03:00
Debanjum Singh Solanky
da49ea272c Add placeholder text to modal in Khoj Obsidian plugin 2023-01-10 22:50:11 -03:00
Debanjum Singh Solanky
580f4aca23 Add hints to Modal for available Keybindings 2023-01-10 22:03:47 -03:00
Debanjum Singh Solanky
b52cd85c76 Allow Reranking Results using Keybinding from Khoj Search Modal 2023-01-10 21:59:38 -03:00
Debanjum Singh Solanky
7991ab7a86 Add button in Obsidian plugin settings to force re-indexing your vault 2023-01-10 19:49:12 -03:00
Debanjum Singh Solanky
f046a95f3d Track connectedToBackend as a setting. Use it across obsidian plugin
- Display warning at top of khoj obsidian plugin settings
- Make search command available only if connected to backend
- Show warning notice on clicking khoj search ribbon button

- Call saveData after configureKhojBackend to ensure
  connnectedToBackend setting saved after being (potentially) updated
  in configureKhojBackend function
2023-01-10 17:28:47 -03:00
Debanjum Singh Solanky
768e874185 Load obsidian plugin even if fail to connect to backend but show warning
- Previously the plugin would not load if cannot connect to Khoj backend
  - Silently failing to load with no reason provided is not helpful
- Load plugin to allow user to fix the Khoj URL in their plugin setting
- Show reason for khoj plugin not working. More helpful than failing silently
2023-01-10 17:20:02 -03:00
Debanjum Singh Solanky
aa22d83172 Create and use a context manager to time code
Use the timer context manager in all places where code was being timed

- Benefits
  - Deduplicate timing code scattered across codebase.
  - Provides single place to manage perf timing code
  - Use consistent timing log patterns
2023-01-09 19:48:16 -03:00
Debanjum Singh Solanky
93f39dbd43 Add typing to text_search. Reformat code to set existing_embedding 2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
db7483329c Only import type hint packages for type checking. Avoids circular imports
Use annotations from the __future__ package to avoid having to quote
type hints. This import will not be required after Python 3.11
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
e5254a8e56 Create BaseEncoder class. Make OpenAI encoder its child. Use for typing
- Set type of all bi_encoders to BaseEncoder

- Make load_model return type Union of CrossEncoder and BaseEncoder
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
cf7400759b Remove unused render_results method from text and image search
It's a relic from when khoj was being used as a python module
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
afcfc3cd62 Split text_search.query logic into separate methods for modularity
The query method had become too big.

Extract out filter, score, sort and deduplicate logic used by
text_search.query into separate methods.

This should improve readabilty of code.
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
8dc6ee8b6c Pass `model' arg to extract_search_type method from beta search API
Issue caught by mypy
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
8498903641 Fix, add typing to Filter and TextSearchModel classes
- Changes
  - Fix method signatures of BaseFilter subclasses.
    Else typing information isn't translating to them
  - Explicitly pass `entries: list[Entry]' as arg to `load' method
  - Fix type of `raw_entries' arg to `apply' method
    to list[Entry] from list[str]
  - Rename `raw_entries' arg to `apply' method to `entries'
  - Fix `raw_query' arg used in `apply' method of subclasses to `query'
  - Set type of entries, corpus_embeddings in TextSearchModel

- Verification
  Ran `mypy --config-file .mypy.ini src' to verify typing
2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
eace7c6215 Use torch.tensor as torch.Tensor cannot create tensor on MPS device
- `torch.Tensor' is apparently a legacy tensor constructor
- Using that to create tensor on MPS devices throws error:
  RuntimeError: legacy constructor expects device type: cpu but device type: mps was passed
- `torch.tensor' can handle creating tensors on Mac GPU (MPS) fine
2023-01-09 19:47:19 -03:00
Debanjum Singh Solanky
9def3f8c6f Add exception handling to beta APIs, in case OpenAI API call fails 2023-01-09 01:27:06 -03:00
Debanjum Singh Solanky
7b164de021 Add beta API to summarize top search result using an OpenAI model
This is unlike the more general chat API that combines summarization
of top search result and conversing with the OpenAI model

This should give faster summary results. As no intent categorization
API call required
2023-01-09 01:25:59 -03:00
Debanjum Singh Solanky
d36da46f7b Truncate prompt to not exceed OpenAI prompt limit
Truncate prompt containing the top retrieved entry to 500 words to
avoid triggering the max_token limit error
2023-01-09 00:51:46 -03:00
Debanjum Singh Solanky
237123d18c Fix tests for the conversation processor
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibilty
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky
918af5e6f8 Make OpenAI conversation model configurable via khoj.yml
- Default to using `text-davinci-003' if conversation model not
  explicitly configured by user. Stop using the older `davinci' and
  `davinci-instruct' models

- Use `model' instead of `engine' as parameter.
  Usage of `engine' parameter in OpenAI API is deprecated
2023-01-09 00:17:51 -03:00
Debanjum Singh Solanky
74e779f8d0 Fix /beta/chat API to use Entry class instead of old dictionary pattern
Search returns response of type SearchResponse instead of a dict now
2023-01-08 15:28:26 -03:00
Debanjum Singh Solanky
f2436039a0 Improve readability of GPT prompt strings in conversation processor 2023-01-08 15:27:41 -03:00
Debanjum Singh Solanky
6119005838 Improve comments, exceptions, typing and init of OpenAI model code 2023-01-08 00:36:18 -03:00
Debanjum Singh Solanky
c0ae8eee99 Allow using OpenAI models for search in Khoj
- Init processor before search to instantiate `openai_api_key'
  from `khoj.yml'. The key is used to configure search with openai models

- To use OpenAI models for search in Khoj
  - Set `encoder' to name of an OpenAI model. E.g text-embedding-ada-002
  - Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI'
  - Set `model-directory' to `null', as online model cannot be stored on disk
2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
826f9dc054 Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.

Dropping very long words allow models to create better embeddings by
passing more of the useful context from the entry to the model
2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
6a30a13326 Only create model directory if the optional field is set in SearchConfig 2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
2fe37a090f Make type of encoder to use for embeddings configurable via khoj.yml
- Previously `model_type' was set in the setup of each `search_type'
  - All encoders were of type `SentenceTransformer'
  - All cross_encoders were of type `CrossEncoder'

- Now `encoder-type' can be configured via the new `encoder_type' field
  in `TextSearchConfig' under `search-type` in `khoj.yml`.

- All the specified `encoder-type' class needs is an `encode' method
  that takes entries and returns embedding vectors
2023-01-07 23:09:12 -03:00
Debanjum Singh Solanky
d55d7d53dc Fix GPU usage by Khoj on Macs to speed up search and indexing
- Ensure all tensors are on MPS device before doing operations across them

- Background
  - GPU is used by default for Khoj on MacOS now
    - Needed PyTorch > 1.13.0 on Macs to use GPU, which we do now
  - MPS should speed up search and indexing on MacOS
2023-01-05 15:39:09 -03:00
Debanjum
abd035e2fa
Merge PR #112 to fix quote usage in khoj.el docstring from suliveevil/master
Fix usage warning for unescaped single quote in `khoj.el' docstring. 
Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs
2023-01-05 13:24:11 -03:00
Debanjum Singh Solanky
e792523849 Bump version in metadata packages for khoj, khoj.el and obsidian plugin 2023-01-05 12:50:27 -03:00
suliveevil
b2812b409f
fix docstring usage warning
 Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
 Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
2023-01-05 16:47:38 +08:00
Debanjum Singh Solanky
47015ee6cc Fold Demo video descriptions, analysis by default in main Readme 2023-01-04 20:13:43 -03:00
Debanjum Singh Solanky
da17ff6ac8 Add Upgrade instructions for Khoj.el Readme. Fix version of khoj.el 2023-01-04 20:06:39 -03:00
Debanjum Singh Solanky
66ccd0c970 Create Obsidian plugin for Khoj
- Features
  - Search using Khoj from within the Obsidian app
    Allow Natural language search on your (markdown) notes in Obsidian Vault

  - Show search results as rendered (instead of raw) Markdown
    Improve legibility of the results

  - Jump to selected note from search result in Khoj search modal
    Simplify seeing result within its original note context

  - Automatically configure khoj to index markdown files in current vault
    Reduce khoj setup steps for plugin users by using reasonable defaults

    - Code updates the markdown config in khoj.yml and triggers index update
    - It can be configured by user in khoj plugin settings, if required

  - Add Demo and detailed Readme for the Obsidian plugin
    Ease setup and usage. Give context about capabilities

- Miscellaneous
  - Trying keep a mono repo until the Khoj project is mature enough
    to reduce maintainance burden
2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky
feddb6ce62 Add start_url to khoj webmanifest to show Khoj as PWA on Chrome 2023-01-04 13:37:56 -03:00
Debanjum Singh Solanky
3dee1aed9e Create /config/data/default API endpoint to serve default khoj config
This can ease configuring khoj from the different interfaces

- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
  configure khoj.
2023-01-03 21:52:34 -03:00
Debanjum Singh Solanky
ce945f7a90 Configure processors too on calling /update API
- Previously only search was being reconfigured
- But Processors are configured on app start too
- Match that behavior on calling /update API
2023-01-03 21:51:02 -03:00
Debanjum Singh Solanky
9d31988f42 Allow starting khoj in non-GUI mode without config file instantiated
- Start khoj server (in non-GUI mode) without needing config file
  already instantiated.
  - But throw warning to configure khoj to use it
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
  in khoj
2023-01-03 21:36:59 -03:00
Debanjum Singh Solanky
52664dd96c Allow recursive glob pattern (**) to add files to search index
- Simplify configuring files to index For Obsidian/Org-Roam type
  systems with lots of small files in khoj.yml using `input-filter'
2023-01-03 01:32:58 -03:00
Debanjum Singh Solanky
152e5f1661 Return the file of each search result in response
- Useful for enabling jump to note functionality in interfaces
- It will be used in the Khoj plugin for Obsidian
2023-01-03 01:25:34 -03:00
Debanjum Singh Solanky
c535953915 Update index automatically in non GUI mode too
- Poll scheduler every minute using threading.Timer
  - Use 60 seconds polling interval to avoid fork bombing
- Schedule next via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
2023-01-01 21:03:19 -03:00
Debanjum Singh Solanky
701d92e17b Lock the index before updating it via API or Scheduler
- There are 3 paths to updating/setting the index (stored in state.model)
  - App start
  - API
  - Scheduler

- Put all updates to the index behind a lock. As multiple updates path
that could (potentially) run at the same time (via API or Scheduler)
2023-01-01 17:09:36 -03:00
Debanjum Singh Solanky
3b0783aab9 Automate updating embeddings, search index on a hourly schedule
- Use the schedule pypi package
- Use QTimer to poll schedule.run_pending() regularly for jobs to run
2023-01-01 17:09:36 -03:00
Debanjum
06c25682c9
Split text entries by max tokens supported by ML models
### Background
There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated)

### Issue
Until now entries exceeding max token size would silently get truncated during embedding generation.
So the truncated portion of the entries would be ignored when matching queries with entries
This would degrade the quality of the results

### Fix
- e057c8e Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- b283650 Deduplicate results for user query by raw text before returning results

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
2022-12-26 18:23:43 +00:00
Debanjum Singh Solanky
17fa123b4e Split entries by max tokens while converting Beancount entries To JSONL 2022-12-26 15:14:32 -03:00
Debanjum Singh Solanky
f209e30a3b Split entries by max tokens while converting Markdown entries To JSONL 2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky
24676f95d8 Fix comments, use minimal test case, regenerate test index, merge debug logs
- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky
b283650991 Deduplicate results for user query by raw text before returning results
- Required because entries are now split by the max_word count supported
  by the ML models
- This would now result in potentially duplicate hits, entries being
  returned to user
- Do deduplication after ranking to get the top ranked deduplicated
  results
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
53cd2e5605 Regenerate initial model in asymmetric reload test to reduce flakyness
- Fix logger message when converting org node to entries
- Remove unused import from conftest
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
c79919bd68 Split entries by max tokens while converting Org entries To JSONL
- Test usage the entry splitting by max tokens in text search
2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky
08dc5e3324 Update instructions in khoj.el to install it from MELPA stable
- The instructions suggest installing khoj-assistant via pip install.
  This installs the latest tagged/release version of khoj
- To match that version user should install khoj.el from MELPA stable
  instead of MELPA
2022-12-23 19:08:38 -03:00
Debanjum Singh Solanky
e057c8e208 Add method to split entries by specified max tokens limit
- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky
d3e175370f Update readme to install khoj.el from MELPA stable unless using pre-release khoj
Update readme to ask user to install khoj.el from MELPA when a
pre-release version of the main khoj app is installed. Else install
khoj.el from MELPA Stable
2022-12-20 23:29:22 -03:00
Debanjum Singh Solanky
cd463c5085 Update Khoj.el Install Instructions on Emacs 2022-12-20 11:06:33 -03:00
Debanjum Singh Solanky
23ca5a2d43 Improve (un-)quoting of funcs used in `khoj--get-enabled-content-types'
- Based on melpa package feedback for khoj.el
- Verified these changes don't affect behavior of the function
2022-12-19 18:02:23 -03:00
Debanjum Singh Solanky
5db3a67df5 Fix Khoj Emacs package URL in khoj.el 2022-12-14 22:49:19 -03:00
Debanjum Singh Solanky
abad6d5f44 Declare external khoj.el funcs. Remove undefined func warnings on install 2022-12-14 22:36:04 -03:00
Debanjum Singh Solanky
c52383b11c Delete stale, unused installation helper script 2022-12-03 13:36:47 -03:00
Debanjum Singh Solanky
1990d09032 Bump khoj version in setup.py, khoj.el to 0.2.0 2022-12-02 14:58:54 -03:00
Debanjum Singh Solanky
a9cfd8b800 Extract hash func for incremental text indexing into separate method 2022-10-26 13:56:58 +05:30
Debanjum Singh Solanky
0de2ff9c97 Add __init__.py to routers directory to register it as a package 2022-10-25 20:40:40 +05:30
Debanjum Singh Solanky
55d2fea9be Move Custom Formatter class for logger to util.helper module from main.py 2022-10-20 00:32:24 +05:30
Debanjum Singh Solanky
1c40f97114 Merge branch 'master' of github.com:debanjum/khoj into modularize-api-and-increase-typing
- Conflicts:
  - src/interface/emacs/khoj.el
    Use our update to `config-url', use their `url-request-method'
2022-10-19 16:46:53 +05:30
Debanjum Singh Solanky
e1b5a87920 Rename Frontend Router to Web Client. Fix logger usage in routers
- Use logger in api_beta router instead of print statements
- Remove unused logger in web client router
2022-10-19 16:36:48 +05:30
Debanjum
4abd51cb04
Merge pull request #99 from telotortium/method
Explicitly set `url-request-method' to GET in khoj.el
2022-10-19 10:31:37 +00:00
Debanjum Singh Solanky
c467df8fa3 Setup `mypy' for static type checking 2022-10-08 17:33:13 +03:00
Debanjum Singh Solanky
d292bdcc11 Do not version API. Premature given current state of the codebase
- Reason
  - All clients that currently consume the API are part of Khoj
  - Any breaking API changes will be fixed in clients immediately
  - So decoupling client from API is not required
  - This removes the burden of maintaining muliple versions of the API
2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given it's usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
99754970ab Type the /search API response to better document the response schema
- Both Text, Image Search were already giving list of entry, score
- This change just concretizes this change and exposes this in the API
  documentation (i.e OpenAPI, Swagger, Redocs)
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
0521ea10d6 Put image score breakdown under `additional' field in search response
- Update web, emacs interfaces to consume the scores from new schema
2022-10-08 12:06:01 +03:00
Debanjum Singh Solanky
e42a38e825 Version Khoj API, Update frontends, tests and docs to reflect it
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
  under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
2022-09-28 20:08:38 +03:00
Robert Irelan
d25e1d8e86
fix: explicitly set url-request-method
In my installation, it appears that `url-request-method` is sometimes set
globally to POST.  Need to explicitly set it to ensure that GET is always
used as intended.
2022-09-19 15:46:46 -04:00
Debanjum Singh Solanky
ee65a4f2c7 Merge /reload, /regenerate into single /update API endpoint
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
2022-09-16 00:53:19 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
c16ae9e344 Ignore "Legacy way to download model" warning for upstream dependency 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
3169e3b78e Use ellipsis instead of pass in base filter abstract methods for aesthetic 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
bf1ae038cb Get XMP metadata from image using Pillow. Remove ExifTool dependency
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
  - This also removes dependency on system Exiftool package for
    XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
be57c711fd Revert OrgNode.hasTag func to method instead of property as accepts argument 2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capabilitity that needlessly increases
  maintainance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
1680a617da Reflect updates to query and results count in URL
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
  - Prevents error when try process results of empty query
2022-09-13 23:39:24 +03:00
Debanjum Singh Solanky
34314e859a Call /reload instead of /regenerate API to update index from web interface
- As `/reload` updates index incrementally, it's relatively quick
- This makes exposing `/reload` endpoint a better default to expose
  via the web interface than `the /regenerate' endpoint
2022-09-12 23:39:10 +03:00
Debanjum Singh Solanky
13b5d5082f Create input field to set results count on the web interface
Resolves #96
2022-09-12 23:24:46 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky
afc84de234 Make word filter regex explicit. Allow hyphen in word filters
Helps with #88
2022-09-12 17:05:29 +03:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
940c8fac8c Use app LRU, not functools LRU decorator, to cache search results in router
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller not
  wrapped in generically named function when using functools LRU decorator
2022-09-12 09:38:48 +03:00
Debanjum Singh Solanky
c6fa09d8fc Fix querying with include word filter from web interface
- Not encoding the `query' string before querying the backend API with
  it was causing the "+" prefix for include word filter to be lost
2022-09-12 09:27:02 +03:00
Debanjum Singh Solanky
1502fbc9e9 Add index_heading_entries flag to default and sample khoj configs 2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky
7216cdff58 Add Date, Word filter for Org-Music content 2022-09-11 17:29:34 +03:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky
db37e38df7 Create OrgNode hasBody method. Use it in org_to_jsonl checks 2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
b4878d76ea Extract entries from scratch when regenerate requested
- Do not rely on previously extracted entries to find new entries in
regenerate scenario
2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix striping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky
a7cf6c8458 Use dictionary instead of list to track entry to file maps 2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky
3e1323971b Stack function calls in jsonl converters to avoid unneeded variables 2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky
4eb84c7f51 Log performance metrics for beancount, markdown to jsonl conversion 2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
030fab9bb2 Support incremental update of Markdown entries, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
91aac83c6a Support incremental update of Beancount transactions, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
b01b4d7daa Extract logic to mark entries for embeddings update into helper function
- This could be re-used by other text_to_jsonl converters like
  markdown, beancount
2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
f97308bef2 Fix log message on writing JSONL data to file 2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
91d11ccb49 Only hash compiled entry to identify new/updated entries to update
- Comparing compiled entries is the appropriately narrow target to
  identify entries that need to encode their embedding vectors. Given we
  pass the compiled form of the entry to the model for encoding

- Hashing the whole entry along with it's raw form was resulting in a
  bunch of entries being marked for updated as LINE: <entry_line_no>
  is a string added to each entries raw format.

- This results in an update to a single entry resulting in all entries
  below it in the file being marked for update (as all their line
  numbers have changed)

- Log performance metrics for steps to convert org entries to jsonl
2022-09-10 21:01:44 +03:00
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  everytime
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
  entries for incremental update correctly
2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time. Reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky
ec675d27d3 Suppress non-actionable HuggingFace FutureWarning shown on app start 2022-09-10 16:43:14 +03:00
Debanjum Singh Solanky
1ac6a71ff0 Add --version flag to show installed version of khoj 2022-09-10 16:40:19 +03:00
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
07b98d35f1 Use filename or #+TITLE as heading for 0th level content in org files
- Set LINE, SOURCE link properties in property drawer correctly for
  content which falls under no heading
- See Issue #83 for more details
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
d835467f2c Throw exception if no valid entries found in specified content files
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with clear message when no entries found
  should make clarify the issue to be fixed
- See issue #83 for details
2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky
e00bb53336 Init word filter dictionary with default value as set to simplify code 2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky
4d776d9c7a Bump khoj version to 0.1.9 2022-09-09 07:50:15 +03:00
Debanjum Singh Solanky
588f598949 Pass empty list of `input_files' to FileBrowser on first run
- Default config has `input_files' set to None
- This was being passed to `FileBrowser' on Initialization
- But `FileBrowser' expects `content_files' of list type, not None
- This resulted in an unexpected NoneType failure
2022-09-09 07:26:40 +03:00
Debanjum Singh Solanky
3ddffdfba4 Create config directory before setting up logging to file under it
- The logging to file code expects the config directory to already be setup
- But parent directory of config file was being set up later in code
- This resulted in app start failing with ~/.khoj dir does not exist error
2022-09-09 07:21:42 +03:00
Debanjum Singh Solanky
762607fc9f Log processed entries by org_to_jsonl only if verbosity > 2
Output too verbose for even debug mode logging. So gated behind -vvv
2022-09-06 23:03:29 +03:00
Debanjum Singh Solanky
490157cafa Setup File Filter for Markdown and Ledger content types
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
  - Word, Date Filters were accidently removed from the above types yesterday
  - File Filter is the only filter that newly got added
2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky
94cf3e97f3 Log app logs to file for posthoc debugging and performance analysis 2022-09-06 14:51:48 +03:00
Debanjum Singh Solanky
3707a4cdd4 Improve date filter perf. Precompute date to entry map, Cache results
- Precompute date to entry map
- Cache results for faster recall
- Log preformance timers in date filter
2022-09-05 18:21:29 +03:00
Debanjum Singh Solanky
31503e7afd Do not pass embeddings as argument to filter.apply method 2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky
965bd052f1 Make search filters return entry ids satisfying filter
- Filter entries, embeddings by ids satisfying all filters in query
  func, after each filter has returned entry ids satisfying their
  individual acceptance criteria

- Previously each filter would return a filtered list of entries.
  Each filter would be applied on entries filtered by previous filters.
  This made the filtering order dependent

- Benefits
  - Filters can be applied independent of their order of execution
  - Precomputed indexes for each filter is not in danger of running
    into index out of bound errors, as filters run on original entries
    instead of on entries filtered by filters that have run before it
  - Extract entries satisfying filter only once instead of doing
    this for each filter

- Costs
  - Each filter has to process all entries even if previous filters
    may have already marked them as non-satisfactory
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7dd20d764c Pre-compute file to entry map in file filter to mark ids to include faster 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
2890b4cd44 Simplify extracting entries satisfying file filter 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search as all text search content types do not have
  file being set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7e083d3e96 Cache results for file filters passed in query for faster filtering 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f634399f23 Convert simple file filters with no path separator into regex
- Specify just file name to get all notes associated with file at path
- E.g `query` with `file:"file1.org"` will return `entry1`
  if `entry1` is in `file1.org` at `~/notes/file.org`

- Test
  - Test converting simple file name filter to regex for path match
  - Test file filter with space in file name
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
1f9fd28b34 Create File Filter to filter files to query. Add tests for file filter 2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky
e4418746f2 Create Abstract Base Class for Filters. Make Word, Date Filter Child of BaseFilter 2022-09-04 18:48:16 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky
6087862521 Use LRU helper class for explicit filter cache 2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky
191a656ed7 Use word to entry map, list comprehension to speed up explicit filter
- Code Changes
  - Use list comprehension and `torch.index_select' methods
    - to speed selection of entries, embedding tensors satisfying filter
    - avoid deep copy of entries, embeddings
    - avoid updating existing lists (of entries, embeddings)

  - Use word to entry map and set operations to mark entries satisfying
    inclusion, exclusion filters

- Results
  - Speed up explicit filtering by two orders of magnitude
  - Improve consistency of speed up across inclusion and exclusion filtering
2022-09-04 15:22:35 +03:00
Debanjum Singh Solanky
28d3dc1434 Deep copy entries, embeddings in filters. Defer till actual filtering
- Only the filter knows when entries, embeddings are to be manipulated.
  So move the responsibility to deep copy before manipulating entries,
  embeddings to the filters

- Create deep copy in filters. Avoids creating deep copy of entries,
  embeddings when filter results are being loaded from cache etc
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
3308e68edf Cache explicitly filtered entries, embeddings by required, blocked words 2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query in which case no space
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
8d9f507df3 Load entries_by_word_set from file only once on first load of explicit filter 2022-09-04 00:37:37 +03:00
Debanjum Singh Solanky
858d86075b Use regexes to check if any explicit filters in query. Test can_filter 2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky
546fad570d Use regex to extract include, exclude filter words from query 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
ffb8e3988e Use Python Logging Framework to Time Performance of Explicit Filter 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
c7de57b8ea Pre-compute entry word sets to improve explicit filter query performance 2022-09-03 16:16:31 +03:00
Debanjum Singh Solanky
094bd18e57 Use python standard logging framework for app logs
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
  - verbose = 0 maps to logging.WARN
  - verbose = 1 maps to logging.INFO
  - verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky
d0531c3064 Update URL QueryParam when Type set in Dropdown on Web Interface
- This also pushes the updated URL state to history
- Allows jumping back to the web interface after clicking on an image
  and having the type set to image search
- Previously type would get reset to the default search type on
  jumping back
2022-08-28 12:22:22 +03:00
Debanjum Singh Solanky
2eae32d743 Time, Log Image Search Performance 2022-08-28 00:28:46 +03:00
Debanjum Singh Solanky
c3ca99841b Scale down images to generate image embeddings faster, with less memory
- CLIP doesn't need full size images for generating embeddings with
  decent search results. The sentence transformers docs use images
  scaled to 640px width

- Benefits
  - Normalize image sizes
  - Increase image embeddings generation speed
  - Decrease memory usage while generating embeddings from images
2022-08-24 14:09:02 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum
144986ebfd
Fix, Improve Desktop GUI Splash Screen and Main Window
- 5e6625a Fix file browser to not add empty line when no file/dir selected
- 8098b8c Bring main window to Top when open from System Tray
- 1c122a8 Place window near top so buttons are not hidden by OS bottom bar
- dfe2546 Set Khoj Icon on Main Desktop Window
- 1b1f8f9 Move Splash screen text below icon. Set the text color to black
- 450f644 Fix path to remove shared libraries when packaging the Windows app
2022-08-20 23:19:01 +00:00
Debanjum Singh Solanky
5e6625ac68 Fix file browser to not add empty line when no file/dir selected
- When no file selected in file browser an empty line/entry gets added
  to input entries list
- Bug got introduced due to insufficient update on change to add
  instead of insert
- Update is_none_or_empty helper method to also check for empty string
2022-08-21 02:03:28 +03:00
Debanjum Singh Solanky
8098b8c3a8 Bring Configure Window to Top when Opened from System Tray
- Previously the window could get hidden behind other app windows when
  user clicked configure from the system tray
2022-08-20 23:38:43 +03:00
Debanjum Singh Solanky
1c122a8a91 Place window near top so buttons are not hidden by OS bottom bar 2022-08-20 22:38:06 +03:00
Debanjum Singh Solanky
dfe2546c04 Set Khoj Icon on Main Desktop Window 2022-08-20 20:36:15 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
acc9091260 Use MPS on Apple Mac M1 to GPU accelerate Encode, Query Performance
- Note: Support for MPS in Pytorch is currently in v1.13.0 nightly builds
  - Users will have to wait for PyTorch MPS support to land in stable builds
- Until then the code can be tweaked and tested to make use of the GPU
  acceleration on newer Macs
2022-08-20 14:44:06 +03:00
Debanjum Singh Solanky
7de9c58a1c Load models, corpus embeddings onto GPU device for text search, if available
- Pass device to load models onto from app state.
  - SentenceTransformer models accept device to load models onto during initialization
- Pass device to load corpus embeddings onto from app state
2022-08-20 14:04:18 +03:00
Debanjum Singh Solanky
dc8dcc94a6 Bump Khoj.el package version to 0.1.6 2022-08-19 20:48:42 +03:00
Debanjum Singh Solanky
ffbf15eff8 Add helper function to identify when app running as pyinstaller app
Useful for when want the app to behave differently in pyinstaller app
scenario with frozen python. And in development scenarios
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
6c5c1c33c1 Turn off Tokenizers Parallelism. Khoj doesn't support it right now
- Forking and multiprocess are problemantic in frozen python
  scenarios. This will cause issues when running App packaged by
  pyinstaller
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
d4072974d7 Use of XMP metadata in Khoj Image Search is broken. Disable by default
- CLIP Image score and XMP metadata score are not combining well.
  When combined they give non sensical results. Enable only once
  figure how best to combine the two.

- Show scores with higher precision for image search
  - Image search scores seem to be mostly be between 0.2 - 0.3 for some reason
  - Higher precision scores make it easier to understand the quality
    of returned results perceived by the model itself
2022-08-19 19:17:28 +03:00
Debanjum Singh Solanky
7c4417126c Append files, directories selected by user to config in Desktop GUI
- Allows adding multiple image directories via GUI
- Allow adding multiple files in different directories via GUI
- Previously users couldn't add multiple directories via GUI
  They'd have to manually append to input field if multiple files, directories
- To clear/overwrite is much easier.
  The user can just select text to delete in input area
2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
00ddcfdac8 Use .ico icon when packaging for Windows (and Linux) using Pynstaller 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
60dacf3f2c Show splash screen on app start. Only supported on Windows, Linux 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
0079c13bf7 Set input-directories in config for image search type on Desktop GUI
- Issue
  Fix configuring image search from Desktop GUI. It was broken before.
  The Desktop GUI was updating input-files field under content-type > image.
  This field is not used for image search. So image search couldn't be
  configured from the Desktop GUI

- Fix
  - Set input-directories when field of search type image is set from GUI
  - Otherwise set input-files field in config
2022-08-18 18:29:55 +03:00
Debanjum Singh Solanky
c4fd661909 Move the experimental /chat API to under /beta/chat 2022-08-16 16:36:15 +03:00
Debanjum Singh Solanky
b8913476ba Fix if condition in router to trigger markdown search 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
9bc4fd539e Set Web Interface URL from loaded state in Desktop GUIs. Not hard-coded 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
7f479b0104 Improve Displaying Error to User on Khoj window in Desktop GUI
- Show a helpful error message in the GUI to the user, instead of the
  crashing if loading config fails, for e.g if file wasn't found
- Collate GUI errors into an ErrorType enum class
- Remove previous error messages before showing the new one
2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
873bb9dd97 Do not force the Khoj window to always be on top. It's needlessly annoying 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
67ab40bb01 Regenerate embeddings everytime user clicks configure in Desktop GUI
Previously if the embeddings were already there only the khoj.yml
config file would get updated. The embeddings would remain old.

1. This results in a stale app state where the config doesn't
   match the embeddings

2. Currently the user cannot update their config from the config
   screen. They'd have to use a combination of config screen and web
   interface>regenerate button to trigger it or delete their ~/.khoj dir

This commit should resolve the above issues
2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
2647e6bab4 Display re-ranked results triggered via keybinding in khoj.el
- Prevent immediate overwrite of re-ranked results by
  incremental-search without rerank triggered via post-command-hook.

- This triggers right after the reranking results are rendered, so
  user never ends up seeing them
2022-08-15 18:41:12 +03:00
Debanjum Singh Solanky
a91d2df300 Simplify Emacs interface to only rerank results on explicit command 2022-08-15 06:20:13 +03:00
Debanjum Singh Solanky
e846829a2e Reset Khoj.el version to align with Khoj package version 2022-08-15 06:20:13 +03:00
Debanjum Singh Solanky
fed0b591af Package Khoj as Debian app in Github Release Workflow 2022-08-14 05:07:58 +03:00
Debanjum Singh Solanky
541e03da3d Make khoj.el pass checkdoc, package-lint, flycheck checks
- Add docstrings, mention args in them. Make docstring crisper
- prefix funcs, variables with khoj--
- Require emacs >27.1 for json-parse-buffer
- Use lexical binding
- Add quickstart docs to elisp file itself
- Bump version of khoj.el
2022-08-13 21:37:41 +03:00
Debanjum Singh Solanky
3300378804 Minimal formatting to render beancount results legibly on web interface 2022-08-13 05:03:45 +03:00
Debanjum Singh Solanky
a0759dd923 Convert Configure Screen into the Main Application Window
- What
  - Convert the config screen into the main application window
    with configuration as just one of the functionality it provides
  - Rename config screen to main window to match new designation

- Why
  - System Tray isn't available everywhere (e.g Linux)
  - This requires moving functionality into a normal window for cross-compat
2022-08-13 02:05:52 +03:00
Debanjum Singh Solanky
684f497abe Handle no System Tray on Linux (Gnome)
- What
  - On Linux
    - Show Configure Screen, even if not first run experience
    - Do no show system tray on Linux
    - Quit app on closing Configure Screen
  - On Windows, Mac
    - Show Configure screen only if first run experience
    - Show system tray always
    - Do not quit app on closing Configure Screen

- Why
  - Configure screen is the only GUI element on Linux. So closing it
    should close the application
  - On Windows, Mac the system tray exists, so app should not be closed
    on closing configure screen
2022-08-13 01:00:20 +03:00
Debanjum Singh Solanky
c2815c5d09 Open Search from Khoj Configure Screen
- Start evolving configure screen away from just being a configure screen
  - Update Window Title to just say Khoj
- Allow Opening Web Interface to Search from Khoj configure screen
- Rename "Start" Button to more accurate "Configure"
- Disable Search button on first run and while configuring app
2022-08-13 00:43:49 +03:00
Debanjum Singh Solanky
28a91ad1fd Deep copy the default_config constant to prevent it being overwritten
- Issue
  - In the previous form, updates to self.current_config would update
    default_config as python does a shallow copy
  - So self.current_config is just referencing the values of default_config
  - Hence updates to current_config updates the default_config values too
  - This is not what we want

- Fix
  - Deep copy the default_config values. Now updates to
    self.current_config wouldn't affect the default_config
2022-08-12 23:54:16 +03:00
Debanjum Singh Solanky
62ac41ce3b Reload settings in a separate thread to not freeze Config Screen
- Generating embeddings takes time
- If user enables a content type and clicks start.
  The app starts to generate embeddings when loading the new settings
- Run this function in a separate thread to keep config screen responsive
- But disable start button to prevent re-entrant threads
- Also show a minimal visual indication that the app is saving state
2022-08-12 23:34:00 +03:00
Debanjum Singh Solanky
927547d0af Update Title of Configure Screen to follow "<Screen> - App" pattern 2022-08-12 22:53:10 +03:00
Debanjum Singh Solanky
32ac1ea1b6 Allow user to quit application from the terminal via SIGINT
Call python interpreter at regular interval to handle any interrupt
signals. create custom handler to terminate server and application
2022-08-12 21:11:58 +03:00
Debanjum Singh Solanky
43301d488a Increase Width of Configure Screen 2022-08-12 18:34:47 +03:00
Debanjum Singh Solanky
9baea9c9fd Let Input Fields Wrap. Adjust Height based on Text in Field
- Convert Input Fields into PlainTextEdit
- Display Each Selected File on a Separate Line in Field
- Set Height of FileBrowser Input Field based on Number of Lines/Files
2022-08-12 18:33:56 +03:00
Debanjum Singh Solanky
b7b96110e9 Rename FileBrowser Button Text to "Select" instead of "Add" 2022-08-12 17:08:40 +03:00
Debanjum Singh Solanky
a1c58a9470 Create, Use a Labelled Text Field for the Conversation Input Field
- This fixes the field expanding when configure screen is expanded
- Allows for reusability of the labelled text field
- Simplifies the logic to save settings for conversation processor
2022-08-12 16:59:15 +03:00
Debanjum Singh Solanky
fa7e36cada Rename external *.js files to *.min.js to mark them as vendored
- Excludes from Github language stats.
  See linguists/vendor.yml for exclusion rules
- Signifies them as external for Khoj developers too
2022-08-12 04:08:50 +03:00
Debanjum Singh Solanky
110e3df0b7 Set default config in the constant module. Use from there to configure app
- Avoid having to pass the khoj_sample.yml data file into pip, native apps
- Packaging data files into python packages is annoying.
  - There's `MANIFEST.in`, `data_files` and `package_data` in setup.py
  - Bdist, wheel, generated source tarball use different set of these fields
    and put the data files in different locations
  - Rather just code the default config into a constant. Avoid
    pointless file reads as well this way
2022-08-12 02:18:46 +03:00
Debanjum Singh Solanky
fad2f3a2e7 Resolve config_file to absolute right at start on parsing args in cli
- Assume path is absolute in yaml util module while saving, loading file
  - This follows same convention as jsonl. Which just operates on
    passed file path, assuming it is of appropriate form.
    Responsibility to put it in appropriate form is on the caller, for now
2022-08-12 01:34:08 +03:00
Debanjum Singh Solanky
44fe70513a Handle situation where default config directory or file does not exist
- Include khoj_sample.yml in pip package to load default config from
- Create khoj config directory if it doesn't exist
- Load config from khoj_sample.yml if khoj.yml config doesn't exist
2022-08-12 01:17:34 +03:00
Debanjum Singh Solanky
41520e1608 Improve Docstring for Configure Screen and System Tray class, funcs 2022-08-11 23:36:02 +03:00
Debanjum Singh Solanky
a748acfeeb Merge branch 'master' of github.com:debanjum/khoj into create-native-gui
Conflicts:
- src/main.py
  - router functions have moved to router
  - move logic to handle null query perf timer variables into
    router.py
  - set main.py to current branch, not master
2022-08-11 21:09:42 +03:00
Debanjum Singh Solanky
6af2d6bb6d Add Flag to Start App without Native GUI 2022-08-11 20:59:57 +03:00
Debanjum Singh Solanky
b74ca1def6 Wrap error message instead of expanding screen to show message 2022-08-11 20:51:56 +03:00
Debanjum Singh Solanky
2646fa825b Get Files from File input line to match user expectation
- If a user manually edits the input file lines, clicking start should
  use that. Currently it just looks at the files selected last via file
  browser
- We want to allow users to manually enter file paths in field. Which
  is why the field hasn't been set to read-only
2022-08-11 20:48:45 +03:00
Debanjum Singh Solanky
dad9133598 Split save_settings method into smaller methods for modularization 2022-08-11 20:00:52 +03:00
Debanjum Singh Solanky
56ba91fec8 Remove unused methods in file browser widget. Improve name of existing 2022-08-11 19:46:09 +03:00
Debanjum Singh Solanky
fd4e41495c Use appropriate label for directory input types to minimize confusion 2022-08-11 19:45:19 +03:00
Debanjum Singh Solanky
c1e1466fb1 Validate new config before write. Show error if new config invalid 2022-08-11 19:18:22 +03:00
Debanjum Singh Solanky
1ff049599f Show current config on config screen. Load default config if config unset
- Track current (saved/loaded) config separate from the new config (to
  be written) when user clicks Start

- Fallback to using default config when no config for the specific
  content type or processor is specified in khoj.yml
  - Earlier were only loading default config on first run, not after

- Create Child CheckBox, LineEdit classes for Processor Widgets
  - Create ProcessorType, similar to SearchType
  - Track ProcessorType the widgets are associated with
  - Simplify update, save, load of config based on type
2022-08-11 19:11:25 +03:00
Debanjum Singh Solanky
23e06f483d Do not emit type tags when dumping config YAML to file 2022-08-11 19:08:36 +03:00
Debanjum Singh Solanky
678fb6a3c7 Add Settings Panel for Conversation Settings to Config Screen 2022-08-11 04:52:40 +03:00
Debanjum Singh Solanky
c1fcf44405 Initialize Settings on Config Screen with Existing Settings from File 2022-08-11 04:51:33 +03:00
Debanjum Singh Solanky
3cec6229ad Hot swap backend config via config screen start button click
- Update configuration to use by the backend, while app is running
- Trigger after user hits start button with their config.
  The config gets written to khoj.yml file first, then the updated
  config is loaded onto memory
2022-08-11 00:32:11 +03:00
Debanjum Singh Solanky
f7fdf8d8ce Refactor app start to start server even if backend not configured
- Decouple configuring backend from starting server.
  Backend search and processors can be configured after the backend
  server has started

- Set global state in main instead of in configure_server method.
  This allows the app to start even if configure_server exits early in
  the first run scenario, where no config available to configure server

- Now start server, even if no config, before GUI started in main

- This refactor of app startup flow will allow users to configure
  backend using the configure screen after server start
2022-08-11 00:13:14 +03:00
Debanjum Singh Solanky
34018c7d4b Store args passed from commandline at app start in global app state 2022-08-11 00:11:35 +03:00
Debanjum Singh Solanky
cc6ef0f450 Save configure screen settings to app config yaml on clicking Start 2022-08-10 23:10:39 +03:00
Debanjum Singh Solanky
dae65c5b6b Create child class of Qt CheckBox to track search type it enables/disables 2022-08-10 22:44:37 +03:00
Debanjum Singh Solanky
f42f54019b Type parent_layout passed as arguments to ConfigureScreen methods 2022-08-10 22:43:20 +03:00
Debanjum Singh Solanky
f63f11186f Pass config file for app to configure screen 2022-08-10 22:42:32 +03:00
Debanjum Singh Solanky
82a7059b6a Only setup conversation processor if it has configuration set 2022-08-10 22:34:03 +03:00
Debanjum Singh Solanky
9628ca073c Extract conversation processor from config into separate function
- Only pass processor config arg required by configure_processor. Not
  the unused full config object
- Type arguments passed to methods configure processors
- Import json for use by conversation processor to load logs
2022-08-10 22:33:33 +03:00
Debanjum Singh Solanky
62eb66b8ca Rename load_config_from_file to more descriptive parse_config_from_file 2022-08-10 22:28:51 +03:00
Debanjum Singh Solanky
328cc00439 Create global constant to store app root directory 2022-08-10 20:09:03 +03:00
Debanjum Singh Solanky
d2c7b28172 Extract code to load config from YAML file into new utils.yaml module 2022-08-10 20:07:44 +03:00
Debanjum Singh Solanky
150ae19660 Indent Timestamps, Drawers at Body Level in OrgNode Entry Representation 2022-08-10 18:55:37 +03:00
Debanjum Singh Solanky
fd31d339c1 Remove spurious space in Entries without Todo in OrgNode Entry Repr 2022-08-10 13:48:44 +03:00
Debanjum Singh Solanky
eddf88f818 Org buffer customization settings to tail of khoj.el results buffer
- Results get priority screen real estate
- Allows quick speed key based traversal of results as cursor
  on switching to buffer is at top level heading
  - E.g C-x o n n o 2 jumps to entry in actual file of second result
  - Unlike before when it is at the #+STARTUP org buffer customization
    settings
2022-08-10 12:57:37 +03:00
Debanjum Singh Solanky
daef276fd1 Add files for each search type. Extract config on clicking start
- Only allow adding files with appropriate file extension for each search type
  - e.g .org for org-mode search, directory for image search

- Extract file paths added to config and enablement state of each search type
  - This extracted state will be used to populate the khoj.yml config file
2022-08-10 03:27:22 +03:00
Debanjum Singh Solanky
d74134e6cc Reuse Single Method to Create Setting Panels for each Search Type 2022-08-09 23:50:43 +03:00
Debanjum Singh Solanky
509d52e2cd Toggle Editability instead of Visibility of Per Search Type Settings
- Simplifies the configure screen layout and allows it to be of constant width

- It was buggy, the configure screen would dynamically expand but not
  restore back to original size on disabling search type after enable
2022-08-09 23:34:54 +03:00
Debanjum Singh Solanky
3c788f1d29 Rename configure window to more generic configure screen 2022-08-09 22:44:05 +03:00
Debanjum Singh Solanky
c50ab7c3ad Split config settings GUI into functions. Convert Config Window to Dialog 2022-08-09 22:36:41 +03:00
Debanjum Singh Solanky
664713b24e Extract Qt GUI code from main.py into separate interface/desktop dir 2022-08-09 22:12:29 +03:00
Debanjum Singh Solanky
84c1fc701d Fix query timing variables from being referenced before assignment 2022-08-09 21:06:37 +03:00
Debanjum Singh Solanky
57026b802c Set size of rendered images using user customizable vars 2022-08-09 21:06:37 +03:00
Debanjum Singh Solanky
0a758c9f0f By default, wait for 2 seconds before initiating rerank in khoj.el
- Subjectively, previous default seems to aggressive based on usage
  Doesn't give time for user to think and type their query
2022-08-09 21:06:30 +03:00
Debanjum Singh Solanky
f01fb16ebb Use single hyphen in name of user configurable variables in khoj.el
- Follow convention, two hyphens indicate variable private to library
- Defcustom are user configurable variables. So they should have single -
- Use khoj-results-count variable directly in code
2022-08-09 20:49:34 +03:00
Debanjum Singh Solanky
cd59982c9c Add Qt Button to save Khoj configuration in Khoj Configuration Window 2022-08-09 20:42:44 +03:00
Debanjum Singh Solanky
2c77caf06c Group ledger, org setting widgets into child Qt widgets of config window 2022-08-09 20:42:44 +03:00
Debanjum Singh Solanky
027da719aa Open Configure Window on First Run or from System Tray
- Trigger FRE if no config loaded. Open Configure Window automatically
- Else user can manually open config window from App on System Tray
2022-08-09 17:05:27 +03:00
Debanjum Singh Solanky
a588a8e21f Make config_file an optional argument. It can be generated on FRE
- Make config_file an optional arg. It defaults to default khoj config dir
- Return args.config as None if no config_file explicitly passed by user
- Parent can use args.config = None as signal to trigger first run experience
2022-08-09 17:02:02 +03:00
Debanjum Singh Solanky
21af122447 Clean up unused methods, module imports. Add comments 2022-08-09 16:59:38 +03:00
Debanjum Singh Solanky
80fa9fde6a Quit GUI via SysTray instead of sys.exit to cleanly terminate server 2022-08-08 23:49:26 +03:00
Debanjum Singh Solanky
e5691f9d1d PyInstaller Spec to Wrap Khoj into a Basic Native App
- Verified functionality on MacOS

- Add ICNS Icon to use as MacOS App Icon
- Spec generated by PyInstaller:
  ```sh
  pyinstaller \
       src/main.py \
       --windowed \
       --onefile \
       --name "Khoj" \
       --target-arch arm64 \
       -i src/interface/web/assets/icons/favicon.icns \
       --add-data "src/interface/web:src/interface/web" \
       --copy-metadata tqdm \
       --copy-metadata regex \
       --copy-metadata requests \
       --copy-metadata packaging \
       --copy-metadata filelock \
       --copy-metadata numpy \
       --copy-metadata tokenizers
  ```
2022-08-08 23:23:02 +03:00
Debanjum Singh Solanky
ef009323e7 Use sys.exit to quit via system tray. Fix pip install cmd in Readme 2022-08-08 21:42:36 +03:00
Debanjum Singh Solanky
eacd95bebd Start Creating Native Configure Page using PyQt 2022-08-08 18:31:47 +03:00
Debanjum Singh Solanky
dddc57e132 Rename get-enabled-search-types to get-enabled-content-types as more appropriate 2022-08-07 18:53:14 +03:00
Debanjum Singh Solanky
127c6e78df Only show keybindings for enabled search types in simple info menu too
Convert the khoj--keybindings-info-message into a func
Dynamically generate info menu
Show keybindings for enabled search types only
2022-08-07 18:40:35 +03:00
Debanjum Singh Solanky
d08c25b62b Make default search type used in the Emacs interface configurable 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
5a10c47499 Allow setting music as search type in khoj.el. Had forgotten to include it earlier 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
ebee716026 Only show keybindings reference for enabled search types in khoj.el 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
6dc9801f45 Get Khoj search-types enabled by user in Emacs 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
f3c1512c38 Fix to let user to start enter query right after initiating khoj on emacs
- Fix regression since moving to use `which-key-show-full-keymap~
- The above function reads user keypress, so eats up 1 keypress
  before starting to enter query
- No way to pass no-paging config via the external function to the
  internally used which-key--show-keymap function that does allow
  setting no-paging to not read user keypress
- So use the internal function instead and set no-paging arg to t
2022-08-07 15:57:08 +03:00
Debanjum Singh Solanky
e95686c89c Show complete Khoj keybindings when initiate search in Emacs
- The keybindings to select search types was previously confusing as
  it only highlighted the final symbol to press (the C-x was shown but
  it wasn't made apparent that it had to be pressed before)

- Previously some keybindings unrelated to khoj were also being shown
  in the which-key popup. Now only the khoj keybindings are visible
2022-08-06 16:36:57 +03:00
Debanjum Singh Solanky
4696eadc02 Fix definition of khoj--search-<content-type> functions in khoj.el 2022-08-06 15:19:01 +03:00
Debanjum Singh Solanky
c5bf051a29 Rename initialize_{search,processor,server} to configure_{search,procesor,server}
- Search is being reconfigured multiple times in /regenerate and
  n/reload. More appropriate name is configure_ rather than initialize_
  for it
- Standardize name of methods under configure.py
2022-08-06 03:23:02 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
b04c84721b Extract configure and routers from main.py into separate modules
- Main.py was becoming too big to manage. It had both
  controllers/routers and component configurations (search, processors)
  in it

- Now that the native app GUI code is also getting added to the main
  path, good time to split/modularize/clean main.py

- Put global state into a separate file to share across modules
2022-08-06 02:39:18 +03:00
Debanjum Singh Solanky
083fefdd07 Create Native Menu Bar with PyQt to open Search, Config webpages
- Run FastAPI server in a separate thread.
  - This allows starting both the server and gui in parallel

- Create System Tray for Khoj
  - Contains menu items that open search or config pages in browser

- Rearrange code to have only the code required to start Backend and
  GUI in the run() method
  - Move the backend setup code into a separate method
2022-08-06 01:00:25 +03:00
Debanjum Singh Solanky
9fa3345000 Show available Khoj keybindings to customize search using which-key
Fallback to showing simple khoj keybindings info message in echo area
when which-key not available
2022-08-05 20:24:29 +03:00
Debanjum Singh Solanky
6a8b2a6936 Do not run incremental search when query is empty 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
609cd6e8bb Show keybindings to set khoj search type in echo area to assist user 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
48e4a983c5 Allow switching search type in the middle of querying Khoj on Emacs
- More generally, this allows configuring the khoj search anytime
  while in khoj minibuffer window
- Earlier could only configure search type at the start of the search
2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
48c33b93cc Generalize khoj keymap to func that can update existing keybdings 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
19c4701f3f Default to ledger search from files with .beancount extensions 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
cc9a395e0a Keep name of buffer for Khoj results in a variable 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
0a5c6d067a Do not prompt user to set search type before querying Khoj via Emacs
- What
  - Default to last used search type, when no search type specified
  - Allow user to change search type before they enter query (and
    after they've called khoj), if they want

- Why
  - Reduce time from intent to results by using reasonable defaults
  - Make interactions smoother, more intuitive
2022-08-05 19:35:38 +03:00
Debanjum Singh Solanky
24ccba74d4 Put type dropdown, regenerate button on same row. Regain screen space 2022-08-05 06:17:43 +03:00
Debanjum Singh Solanky
017e287b8a Remove redundant query as title in results section
- Regain screen real-estate
- Remove unused parameters, html being returned by org.js
2022-08-05 06:17:25 +03:00
Debanjum Singh Solanky
06afeec7e2 Hide stars of org entry results on Emacs to reduce visual clutter
They've all been normlized to the same level and hence don't hold much
data. So good opportunity to reduce, non-useful visual clutter
2022-08-05 05:27:57 +03:00
Saba
d1fe6353b5 Check whether processor_config exists during shutdown event 2022-08-04 21:57:36 -04:00
Debanjum Singh Solanky
4d4d2ff921 Ensure all org entries are unfolded in results buffer on Emacs 2022-08-05 04:54:29 +03:00
Debanjum Singh Solanky
49ef741d4b Prevent Zoom on Input in Web Interface. Document Pip upgrade in Readme
- Name /Reload API Controller Reload
2022-08-05 03:51:34 +03:00
Debanjum Singh Solanky
675e821d95 Make embeddings, jsonl paths absolute. Create directories if non-existent 2022-08-05 02:57:59 +03:00
Debanjum Singh Solanky
d5b43eb836 Use input filter in image search setup. Input filter wasn't used earlier 2022-08-05 02:40:03 +03:00
Debanjum Singh Solanky
ca5a8bd113 Make config file a positional argument, as it is required
- Test invalid config file path throws. Remove redundant cli test

- Simplify cli parser code
  - Do not need to explicitly check if args.config_file set.
    argparser checks for positional arguments automatically

- Use standard semantics for cli args
  - All positional args are required. Non positional args are optional

- Improve command line --help description
2022-08-05 01:09:40 +03:00
Debanjum Singh Solanky
1374065092 Mark all required fields for config. Throw if no input_* field specified
- Add custom validator to throw if neither input_filter or
  input_<files|directories> are specified

- Set field expecting paths to type Path

- Now that default_config isn't used in code. We can update
  fields in rawconfig to specify whether they're required or not.
  This lets pydantic validate config file and throw appropriate error
2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky
f78d6ae754 Create khoj_sample file with all configurable fields in one place
- Reason
  - Simplifies code. No merge_dict required
  - 1 place for user to see all configurables, defaults and required values

- Details
  - Remove default_config from code. Set defaults in khoj_sample.yml itself
  - Keep fields required to be set by user as empty in khoj_sample to YAML
  - Set defaults for fields not requiring configuration by user
2022-08-05 01:08:33 +03:00
Debanjum Singh Solanky
3abf3e5ee0 Update merge_dicts to recursively merge the dictionaries
Previously it was only merging dictionary at the first/top level
2022-08-04 22:46:20 +03:00
Debanjum Singh Solanky
61c26ba611 Only show large Khoj favicon on web interface
- Do not want browsers to use the small, grainy favicons
- Firefox for Android does use the bigger icon, when it's the only one available
- Update svg to match the 144x144 ratio just for consistency
2022-08-04 14:33:29 +03:00
Debanjum Singh Solanky
1649fa644c Autofocus on Query field in Web Interface. Improve time to query 2022-08-04 05:23:19 +03:00
Debanjum Singh Solanky
71fcb1087f Add icons for web interface to render on more browsers and as PWA
Safari, Firefox for Android etc don't support SVG Favicons yet
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
5b6b7ec123 Delete khoj network connections on incremental search teardown on Emacs interface
Currently only get into this state when debug breakpoints on backend
are keeping the connection open and user exits khoj search from Emacs
Results in a number of open connections that slow khoj down.
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
555c1088cc Cache queries in /search controller using LRU cache
- Most concretely right now,
  it eliminates the re-rank latency hit
  on re-rank triggered on user hitting enter
  after re-rank is already done on user idle
  in the emacs interface

- Improves search latency of (incremental) search
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
38df727ef4 Fix escape sequence usage in strings. Remove unneeded import of os
Rename /config API method to config to match it's purpose. UI is
anyway too generic, and not what it is doing
2022-08-03 18:51:55 +03:00
Debanjum Singh Solanky
f642450ed9 Disable Incremental Search for Images on Web
Bug introduced in commit da118b3fed
2022-08-03 11:52:51 +03:00
Debanjum Singh Solanky
b9e6273644 Include interfaces in pip package. Fix paths to web interface in app 2022-08-03 00:02:39 +03:00
Debanjum Singh Solanky
1b55462fb0 Convert search_filter, conversation dir to proper modules
Add __init__.py files to their directories
2022-08-02 20:23:42 +03:00
Debanjum Singh Solanky
5108d45951 Wrap application startup steps into a method 2022-08-02 20:13:14 +03:00
Debanjum Singh Solanky
0ebfbb43ce Nest org, md results at level 2 on Emacs interface. Improve readability
- Makes it easier to fold/unfold, traverse and read results
- This 2 level nesting is already being used on the web interface

- Previously we were using the original nesting depth of the entry.
  This was aimed at providing more of the orginal context of the
  results. But currently this additional information does not provide
  as much, for the decreased legibility of the results
2022-08-01 04:01:18 +03:00
Debanjum Singh Solanky
1201bfddf3 Simplify name of config css from config-style.css to config.css 2022-08-01 01:34:00 +03:00
Debanjum Singh Solanky
075dba5d64 Use Khoj Title, Favicon in Config Page for Consistency 2022-08-01 01:27:14 +03:00
Debanjum Singh Solanky
56a4429f01 Move web interface to configure application into src/interface/web directory
- Improve code layout by ensuring all web interface specific code
  under the src/interface/web directory
- Rename config API to more specifi /config instead of /ui
- Rename config data GET, POST api to /config/data instead of /config
2022-08-01 00:53:42 +03:00
Debanjum
bb2ccec1ca
Populate type dropdown on the web interface with only enabled search types
- Previously we were statically populating types dropdown field in the web interface with all available search types
- This change populates the type dropdown field with only search types that are enabled/configured
- It queries the `/config` backend API to see which of the available search types are configured
2022-08-01 00:20:45 +03:00
Debanjum Singh Solanky
8b6058c879 Fix instantiating type field with value from URL query parameter
- Populate via `.then` after enabled search types in dropdown are
  populated
- Call to `/config` API is async and will usually complete after the value of type field is set from url
- So value of type field would earlier be overridden when search types
  dropdown is populated after the call to `/config` API completes
2022-08-01 00:04:50 +03:00
Debanjum Singh Solanky
be253bab39 Populate type dropdown with only enabled search types in web interface
- Get /config API and check config for which available search types is
  populated. This gives us the list of enabled search types
- Dynamically populate search type field with enabled search types only
2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky
0abd40aeb7 Only set query field when appropriate query param passed via URL
- Setting query value to default option when query param wasn't
  passed via URL was overriding placeholder text in query field

- We wanted placeholder text in field, not the query field to actually
  be populated by placeholder text

- This clears field when user starts typing query into the query field,
  instead of them having to manually delete the  default text populated
2022-07-31 22:29:23 +03:00
Debanjum Singh Solanky
17c38b526a Default config for each search types to None
- Setting up default compressed-jsonl, embeddings-file was only required
  for org search_type, while org-files and org-filter were allowed to be
  passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
  command line argument as well for org search type
- Now that all search types are only configurable via config file, We
  can default all search types to None. The default config for the
  rest of the search types wasn't being used anyway
2022-07-31 22:23:57 +03:00
Debanjum Singh Solanky
b83021a723 Improve code readability of merge_dicts helper method 2022-07-31 22:07:56 +03:00
Debanjum Singh Solanky
38aede68f2 Only configure org via config file for consistency across search types
- Previously org-files were configurable via cmdline args.
  Where as none of the other search types are
- This is an artifact of how the application grew
- It can be removed for better consistency and
  equal preference given all search types
2022-07-31 22:02:03 +03:00
Saba
b55159f5bd Fix URL for khoj.el quelpa setup instructions 2022-07-29 23:01:04 -04:00
Debanjum Singh Solanky
da118b3fed Simplify incremental search function used in web interface
Re-rank isn't passed to image search API in search function.
So don't need to check type in incremental_search function too
2022-07-29 23:18:01 +04:00
Debanjum Singh Solanky
3079614981 Allow set up of search form via query params in web interface
- Default search type to org, instead of images
2022-07-29 23:13:26 +04:00
Debanjum Singh Solanky
02ca2c05a1 Add Eagle Icon for Khoj to Web, Emacs Interfaces and Readme 2022-07-29 17:50:29 +04:00
Debanjum Singh Solanky
78314263a0 Add Table of Contents, Features, Performance Details to Readme 2022-07-29 17:08:17 +04:00
Debanjum Singh Solanky
ed181f47c9 Prettify rendering of org music results on Khoj web interface 2022-07-29 04:28:22 +04:00
Debanjum Singh Solanky
7e5291a38e Make org result headings at same level. Improve spacing of results
Having org-mode result headings change size based on their depth in
the source document makes is a confusing UI experience.

Improve font-size, line-spacing and margins of results to make
delineation between entries, and differntiating between entry heading
and it's body easier to visually infer.

Do not white-space: pre-line. Improves rendering of Markdown results
2022-07-29 01:55:46 +04:00
Debanjum Singh Solanky
4d5183063c Create images directory if doesn't exist, to store image search results 2022-07-28 21:30:31 +04:00
Debanjum Singh Solanky
a9bc17a6b0 Prettify Render of Markdown Results in Web Interface 2022-07-28 20:56:37 +04:00
Debanjum Singh Solanky
a6ae74f52e Move JS files like org.js into a separate assets/ directory 2022-07-28 20:46:48 +04:00
Debanjum Singh Solanky
a12eaa4ce0 Move Khoj image results into a child images/ directory 2022-07-28 20:45:12 +04:00
Debanjum
a71253e137
Support Incremental Search on Web Interface
## Support Incremental Search on Khoj Web Interface
- Use default, fast path to query /search API while user is typing
- Upgrade to cross-encoder re-ranked results once user hits enter on search box

## Improve Render of Org Results on Web Interface
- We were previously just wrapping results from /search API into a pre formatted div field. This was not easy to read
- Use [org.js](https://mooz.github.io/org-js/) to render results from Khoj `/search` API as proper HTML
- Improve org.js to render all task states, stylize task tags and make org-mode results look more like original content

Closes #42 #41
2022-07-28 09:31:57 -07:00
Debanjum Singh Solanky
e8029bf415 Extract and Highlight org-mode tags in HTML render of search results 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
c6c248df26 Improve styling of org-mode results to original alignment, line breaks 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
9f59897eeb Highlight all org-mode task states in HTML. Not just TODO, DONE.
- Make logic to extract, mark todo state in org.js more generic
- Add default todo state styling to html
2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
f040b3f65c Stylize TODO/DONE states with CSS 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
581b6097c7 Clean Results. Remove TOC, Heading Number and Property Drawers 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
965a93a2f2 Add Basic HTML Rendering of Org-Mode Results 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
1da44d4dfe Add Incremental Search to Khoj Web Interface 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
af1dd31401 Do not pass verbose argument to image_search.query() as not supported 2022-07-28 19:52:58 +04:00
Debanjum Singh Solanky
80ac10835c Rerank results on normal minibuffer exit
In current state:
 - Rerank results:
   - If user idles while entering query OR
   - exits normally

 - Do not rerank results:
   - If user exits abnormally, e.g via C-g from query
2022-07-28 03:37:16 +04:00
Debanjum Singh Solanky
1b759597df Make incremental search more robust. Follow standard user expectations
- Rename functions to more standard, descriptive names
- Keep known, required code for incremental search
  - E.g Do not set buffer local flag in hooks on minibuffer setup

- Only query when user in khoj minibuffer
  - Use active-minibuffer-window and track khoj minibuffer
  - (minibuffer-prompt) is not useful for our use-case here

- (For now) Run re-rank only if user idle while querying
  - Do not run rerank on teardown/completion
    - The reranking lag (~2s) is annoying; hit enter,
      wait to see results
    - Also triggered when user exits abnormally,
      so C-g also results in rerank which is even more annoying
  - Emacs will still hang if re-ranking gets triggered on idle but
    that's better than always getting triggered. And better than not
    having mechanism to get results re-ranked via cross-encoder at all
2022-07-28 02:52:27 +04:00
Debanjum Singh Solanky
9a6eee31be Make number of results to get from Khoj API customizable in khoj.el 2022-07-27 18:55:18 +04:00
Debanjum Singh Solanky
9302b45fe0 Use khoj-incremental as the main khoj func. Rename khoj to khoj-simple
- Update khoj-simple to work cross-encoder re-ranked results like before
- Increment major version as incremental search considered a breaking
  change and a major update to search capability
2022-07-27 18:18:17 +04:00
Debanjum Singh Solanky
09727ac3be Make bi-encoder return fewer results to reduce cross-encoder latency 2022-07-27 07:26:02 +04:00
Debanjum Singh Solanky
9ab3edf6d6 Re-rank incremental search results using cross-encoder if user idle
This provides a relatively smooth mechanism
- to improve relevance of results on idle
- while providing the rapid, incremental results while typing
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
ad242cafa7 Support querying all text search types in incremental search
- Before incremental search was hard-coded to only query org
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
bfcb962cbe Use post-command-hook to only query on user input
- Hooking into after-change-functions results in system logs triggering query
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
0d49398954 Reuse code to query api, render results. Formalize method, arg names 2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
fd1963d781 Implement Basic Incremental Search Interface in Emacs for Org Mode Notes 2022-07-27 03:05:00 +04:00
Debanjum Singh Solanky
3fa7d8f03a Skeleton to allow incremental search on Khoj via Emacs 2022-07-27 02:48:27 +04:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by settings &?r=true when querying /search API
  - It's an optional param that default to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired
    but for lower accuracy
2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky
b1e64fd4a8 Improve search speed. Only apply filter if filter keywords in query
- Formalize filters into class with can_filter() and filter() methods

- Use can_filter() method to decide whether to apply filter and
  create deep copies of entries and embeddings for it

- Improve search speed for queries with no filters
  as deep copying entries, embeddings takes the most time
  after cross-encodes scoring when calling the /search API

  Earlier we would create deep copies of entries, embeddings
  even if the query did not contain any filter keywords
2022-07-26 22:47:26 +04:00
Debanjum Singh Solanky
f094c86204 Trace query response performance and display timings in verbose mode 2022-07-26 21:03:53 +04:00
Debanjum Singh Solanky
65fea7681a Rename notes search type to org search, now that markdown notes supported 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
4c24202e42 Update documentation. Simplify, reflect current capabilities 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
d4d7dbaca6 Support Natural Search on Markdown Files
- Reason:
  Allow natural search on markdown based notes, documentation,
  websites etc

- Details:
  - Create markdown processor to extract Markdown entries (identified by
    Heading) into standard jsonl format required by text_search
  - Update API, Configs to support interfacing with new markdown type
  - Update Emacs, Web clients to support interfacing with new markdown
    type via API
  - Update Readme to mentiond markdown is also supported

Closes #35
2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types were mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and making it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
0917f1574d Consolidate jsonl helper methods in a single file under utils module 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
de726c4b6c Minor fixes to unused installer utility script 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
5aad297286 Reuse logic to extract entries across symmetric, asymmetric search
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types

Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky
e220ecc00b Generate compiled form of each transaction directly in the beancount processor
- The logic for compiling a beancount entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows symmetric search to be generic and not be aware of
  beancount specific properties that were extracted by the
  beancount-to-jsonl processor layer

- Now symmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be location, transaction, chat etc, it doesn't have to care
2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky
06cf425314 Generate compiled form of each entry directly in the org-mode processor
- The logic for compiling an org-mode entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows asymmetric search to be generic and not be aware of
  org-mode specific properties that were extracted by the org-to-jsonl
  processor layer

- Now asymmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be mail, chat, markdown, org-mode etc, it doesn't have to care
2022-07-21 02:08:02 +04:00
Debanjum Singh Solanky
4ead79d272 Make Notes Search Natural Language Date Aware
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings

- The (new?) model seems to understand dates. So can give more
  relevant entries if date in natural language mentioned in query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
  will give different results, with the second prioritizing entries
  mentioning any entries with closed, scheduled dates from 1984
2022-07-21 01:06:49 +04:00
Debanjum Singh Solanky
d50bfb5188 Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests 2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky
70e70d4b15 Rename 'embed' key to more generic 'compiled' for jsonl extracted results
- While it's true those strings are going to be used to generated
  embeddings, the more generic term allows them to be used elsewhere as
  well

- Their main property is that they are processed, compiled for
  usage by semantic search

- Unlike the 'raw' string which contains the external representation
  of the data, as is
2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky
c1369233db Consistently use "entry", "score" in json response for all search types
- Had already made some progress on this earlier by updating the image
  search responses. But needed to update the text search responses to
  use lowercase entry and score

- Update khoj.el to consume the updated json response keys for text
  search
2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky
d68a9dc445 Sort extracted images before computing their embeddings
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
  across machines.
- Use case:
  A more powerful, always on machine actually computes the image embeddings regularly
  The client machine just load these periodically to provide semantic search functionality
2022-07-20 03:51:27 +04:00
Debanjum Singh Solanky
c4c7f38b15 Fix extracting image names from multiple image directories 2022-07-20 03:40:49 +04:00
Debanjum Singh Solanky
bdc1b9f2bb Resolve edge case errors in encoding image metadata
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
  - return empty strings in such a scenario instead of ". "
2022-07-20 02:58:43 +04:00
Debanjum Singh Solanky
2a5445216c Image input directory not required by collate result as image_name already absolute path 2022-07-20 02:56:23 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
70221bb038 Allow filtering transactions by date in symmetric ledger 2022-07-19 20:58:24 +04:00
Debanjum Singh Solanky
b673d26a12 Extract Entries in a standardized format across text search types
Issue:
 - Had different schema of extracted entries for symmetric_ledger vs asymmetric

 - Entry extraction for asymmetric was dirty, relying on cryptic
   indices to store raw entry vs cleaned entry meant to be passed to embeddings

 - This was pushing the load of figuring out what property to extract
   from each entry to downstream processes like the filters

 - This limited the filters to only work for asymmetric search, not for
   symmetric_ledger

- Fix
   - Use consistent format for extracted entries
     {
       'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
       'raw'  : raw_entry_string_meant_to_be_passed_to_use
     }

 - Result
   - Now filters can be applied across search types, and the specific
     field they should be applied on can be configured by each search
     type
2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky
e66cd5bf59 Only extract transactions from Beancount
- Earlier was extracting all entries starting with dates but the other
  type of entries like account open/close, asserts etc aren't useful for
  querying
2022-07-19 19:50:58 +04:00
Debanjum Singh Solanky
732b2d287f Give the project a short, less generic name. Rename it to Khoj
- Semantic Search was just a placeholder used to test the idea out
  Didn't want to get into naming at that point of time
2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 is more accurate
  - The exact previous model isn't benchmarked but based on the
    performance of the closest model to it. Seems like the new model
    maybe similar in speed and size

- On very preliminary evaluation of the model, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
4a90972e38 Use a better model for asymmetric semantic search
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
  - It doubles the encoding speed of all entries (down from ~8min to 4mins)
  - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)

[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
5e302dbcda Fix using 1 column layout on small screens 2022-07-18 02:40:16 +04:00
Debanjum Singh Solanky
7d16b673b1 Use Single Column Layout for Small Screens on Web Interface 2022-07-18 02:08:52 +04:00
Debanjum Singh Solanky
31a221a76b Auto focus cursor on query input box to simplify, speed interactions
- Avoids having to click the query input box
- Just open page, type whatever and hit enter to do image search
  - For other search types select appropriate type from dropdown
2022-07-16 19:39:15 +04:00
Debanjum Singh Solanky
06b0c720d6 Improve Rendering of Image Search Results in Emacs
- Use shr to render image response from html in result buffer
  Earlier was using org-mode. But rendering HTML with shr seems cleaner
- Use Headings to Add highlights
- Use Random to Force fetch of Image. Similar to what was done for Web interface
- Remove trailing elisp brackets from response
- Show query match scores by image model for each image in results
2022-07-16 19:31:49 +04:00
Debanjum Singh Solanky
28ec9af589 Extract image URL location from response in elisp after API update 2022-07-16 18:43:55 +04:00
Debanjum Singh Solanky
47613cba1f Improve Landing Page Look in General and Layout for Mobile
- Ask for 6 Images to Fill Grid into 3x2 Layout
- Submit Form on Hitting Enter
2022-07-16 16:55:13 +04:00
Debanjum Singh Solanky
cf207d6ebe Add title, heading to the semantic search web interface 2022-07-16 03:44:29 +04:00
Debanjum Singh Solanky
e0d8398b27 Normalize metadata match score to work better with image match score
- Metadata match score were consistently giving higher scores by a
  factor of ~3x wrt to image match score. This was resulting in all
  results being from the metadata match with query and none from the
  image match with query.
- Scaling the metadata match scores down by scaling factor seems to
  give more consistently give a blend of results from both image and
  metadata matches
2022-07-16 03:39:33 +04:00
Debanjum Singh Solanky
a3fc82817d Log and continue on image metadata encoding error due to Tensor size mismatch 2022-07-16 03:39:19 +04:00
Debanjum Singh Solanky
f26d0ddbbd Minor fix to asymmetric search when no entries returned 2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
ca3f93e641 Add button on web interface to regenerate embeddings of specified type 2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
231cc91e14 Force reload of images every time user clicks search button
Adding a random, unused url param at the end of the img.src string
fixes the issue. As the browser thinks it's a new image and doesn't
use the image data that's already cached because of which it wasn't
even making the fetch call for the image
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
a6aef62a99 Create Basic Landing Page to Query Semantic Search and Render Results
- Allow viewing image results returned by Semantic Search.
  Until now there wasn't any interface within the app to view image
  search results. For text results, we at least had the emacs interface

- This should help with debugging issues with image search too
  For text the Swagger interface was good enough
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
4e27ae0577 Ease access to image result for given query by image_search
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
  directly in browser
- Return image, metadata scores for each image in response as well
  This should help get a better sense of image scores along both
  XMP metadata and whole image axis
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
801e59a20d Allow explicit filters when querying Ledger transactions 2022-07-15 23:41:54 +04:00
Debanjum Singh Solanky
0e979587e0 Add configurable filter support to Symmetric Ledger Search 2022-07-14 23:40:41 +04:00
Debanjum Singh Solanky
85077bc1d1 Handle unparseable date range passed via date filter in query
- Do not reuse the same list
- Just create new list, so only parsed data is in it
2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky
a60de2c02b Include date filter in asymmetic search on music as well 2022-07-14 22:37:17 +04:00
Debanjum Singh Solanky
c3b3e8959d Put entry splitting regex in explicit filter into a variable for code readability 2022-07-14 22:00:10 +04:00
Debanjum Singh Solanky
3aac3c7d52 Run explicit filter on raw entry, add more terms to split entries by
- With \t Last Word in Headings was suffixed by \t and so couldn't be
filtered by
- User interacts with raw entries, so run explicit filters on raw entry
   - For semantic search using the filtered entry is cleaner, still
2022-07-14 21:54:04 +04:00
Debanjum Singh Solanky
7640e2ab0c Wrap attempt to extract dates from entry in try/catch
- Not all YYYY-MM-DD strings in entry are necessarily dates
2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky
9de2097182 Fix date filter usage with multi word queries. Simplify date regex 2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky
dcb6fe479e Fix date_filter query, entry in query range check. Add tests for it
- Fix date_filter date_in_entry within query range check
  - Extracted_date_range is in [included_date, excluded_date) format
  - But check was checking for date_in_entry <= excluded_date
  - Fixed it to do date_in_entry < excluded_date

- Fix removal of date filter from query
- Add tests for date_filter
2022-07-14 20:01:35 +04:00
Debanjum Singh Solanky
011f81fac5 Fix date_filter to handle non overlapping date ranges 2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky
70ac35b2a5 Compute Date Range to filter entries to, from Comparators, Dates in Query 2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky
e6db3e3d00 Prefer Dates From Future only when specific words in date string
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
  dates in the future but dateparser's parse doesn't parse it at all.
  E.g parse('5 months from now') returns nothing

- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
  parse('5 months') to dateparser.parse works as expected
2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky
4a201d52af Add, test date filter regex and date parsing to get natural date range 2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky
b54588717f Filter for entries with dates specified by user in query
- Create Date filter
  - Users can pass dates in YYYY-MM-DD format in their query
- Use it to filter asymmetric search to user specified dates
2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky
b82aef26bf Make filters to apply before semantic search configurable
Details
--
- The filters to apply are configured for each type in the search controller
- Muliple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
  query, entries and embeddings before semantic search is performed

Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky
c92789d20a Extract explicit pre-search filter function into a separate module
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query

Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky
6d7ab50113 Run Explicit Filter on Entries, Embeddings before Semantic Search for Query
- Issue
  - Explicit filtering was earlier being done after search by bi-encoder
    but before re-ranking by cross-encoder

  - This was limiting the quality of results being returned. As the
    bi-encoder returned results which were going to be excluded. So the
    burden of improving those limited results post filtering was on the
    cross-encoder by re-ranking the remaining results based on query

- Fix
  - Given the embeddings corresponding to an entry are at the same index
    in their respective lists. We can run the filter for blocked,
    required words before the search by the bi-encoder model. And limit
    entries, embeddings being considered for the current query

- Result
  - Semantic search by the bi-encoder gets to return most relevant
    results for the query, knowing that the results aren't going to be
    filtered out after. So the cross-encoder shoulders less of the
    burden of improving results

- Corollary
  - This pre-filtering technique allows us to apply other explicit
    filters on entries relevant for the current query
    - E.g limit search for entries within date/time specified in query
2022-07-12 18:25:42 +04:00
Debanjum Singh Solanky
7677465f23 Fix passing of device to setup method in /reload, /regenerate API
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky
eda4b65ddb Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU
- Move embeddings to CUDA GPU for compute, when available
- Normalize embeddings and Use Dot Product instead of Cosine
2022-06-30 00:59:57 +04:00
Debanjum Singh Solanky
b89fc2f4ac Add /reload API to reload model embeddings and entries from file
- The reload API adds the ability to separate out the loading of
  embeddings from file without having to restart app or (re-)generate embeddings

- Before this the only way to load model from file was by restarting app
- The other way to reload the model embeddings by regenerating them
  was to expensive for larger datasets

- This unlocks at least 1 use-case, where
  - we regenerate model via an app instance running on a separate server and
  - just reload the generated embeddings on the client device

  - This allows us to offload the expensive embedding generation
    compute to a background server while letting

  - This avoids having to (re-)restart application on client device or
    be forced to generate embeddings on the client device itself

  - But it requires the model relevant files to be synced to the client device
    This can be done with any file syncing application like Syncthing

  - We can then call /regenerate on server and /reload client on a
    regular schedule to keep our data up to date on semantic search
2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
f5d6d1e752 Tiny style fix to separate functions by 2 newlines 2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
85fbe1c42b Normalize org notes path to be relative to home directory
- This is still clunky but it should be commitable
- General enough that it'll work even when a users notes are not in the home directory
- While solving for the special case where:
  - Notes are being processed on a different machine and used on a different machine
  - But the notes directory is in the same location relative to home on both the machines
2022-06-28 19:16:11 +04:00
Debanjum Singh Solanky
094eaf3fcc Fix minor bugs in OrgNode parser
- Bugs discovered from writing org-node tests
2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky
36495038dd Fix storing parsed CLOSED date in OrgNode
The CLOSED date was getting parsed but not stored
Adding setClosed at start also fixed the issue
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
1c5754bf95 Simplify storing Tags in OrgNode object
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
  - Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
  for code readability
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
51a43245d3 Escape square brackets in file+heading based org-mode links 2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky
04610f453a Include scheduled date, deadline date and close date in repr of org node
- Now that excluding the times line from the raw body of node,
  show it in repr so user can see it for reference

- But the model doesn't need to see it for it's embeddings to be
  confused by
2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky
367d7377df Ignore scheduled, closed, deadline time and logbook start, end in org node body
- Gives cleaner embeddings for semantic search
- Hopefully improves results and reduces size, compute
2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
b77ccadcba Make property key regex more strict. Property key has to be alphanumeric 2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
ac9d746444 Fix Tags extraction in Org Node parser
- Previous version required two tags at least to work, not sure why
- Fixed it to extract all tags, even if only one tag in heading
2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky
fb86be8cd9 Add ID, File+Heading based Links to Org-Mode Entries
- Add links to property drawer
- This ensures results returned by semantic search contain these links
- This allows the user to jump to entry within original file for context
- The ID, file+heading based links are more robust to find relevant
  entry in original file than the line no based link,
  as edits being done by user to original files between embedding regenerations
2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky
de23fc2051 Revert Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search
Sentence Transformer MSMarco Model isn't date aware
So no use of adding scheduled, deadline dates to model embeddings for consideration

This reverts commit a2a08d1354.
2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky
a2a08d1354 Add Scheduled, Deadlne date to Model Embeddings for Date Aware Search 2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky
cfbd5c4ecc Update global model on regenerate via API 2022-06-17 00:49:06 +03:00
Debanjum Singh Solanky
c78bf84eef Introduce search api endpoint that auto infers search type intent
- Introduce prompt for GPT to automatically extract user's search intent
- Expose new search api endpoint to use that to set SearchType being
  passed to search API
- Currently meant as an experimental API to gauge usefulness,
  extendability. Evaluating for phone or voice use-case
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
8ef7917014 Fix json format passed in prompt to GPT 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
f57b7f65ea Wrap prompts for GPT in triple quotes to improve prompt readability
To prompt improve readability:
- Remove newline escape sequence and use actual newline directly
  - This avoids one long line of text as prompt and
- Remove escaping of double quotes
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1eba7b1c6f Use empty_escape_sequence constant to strip response text from gpt 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1c3a1420f8 Update asymmetric extract_entries method to handle uncompressed jsonl
This is similar to what was done for the symmetric extract_entries
method earlier
2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky
3d8a07f252 Extract empty line escape sequences var into constants file for reuse 2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky
bb5d0d8908 Improve Semantic Search Buffer Names in Emacs
- Allow multiple semantic searches buffers to exist simultaneously
  - Uniquify semantic search buffer namew
- Add query and search-type to semantic search buffer name for easier
  disambiguration, search and find appropriate
2022-02-26 18:30:14 -05:00
Debanjum Singh Solanky
b68558651b Improve Extraction of Beancount Entries
- Only extract entries starting with YYYY-MM-DD from Beancount
- Strip Trailing Escape Sequences from Entries
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
b3ac2dd730 Improve Results Rendered on Emacs from Semantic Search on Ledger
- Add search query to top of buffer as Beancount comment
- Remove trailing ) from response
- Separate entries by empty line
- Load beancount-mode in semantic search on ledger buffer
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
502c68d4f8 Remove trailling escape sequence in ledger search response entries
- Fix loading entries from jsonl in extract_entries method
  - Only extract Title from jsonl of each entry
    This is the only thing written to the jsonl for symmetric ledger
  - This fixes the trailing escape seq in loaded entries
  - Remove the need for semantic-search.el response reader to do pointless complicated cleanup

- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
  Both methods were doing similar work

- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
248aa632c0 Do not throw warning for beancount files with .beancount extension 2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
76cd63f4bd Fix count of processed jsonl entries shown to user by ledger processor
Count lines not chars
2022-02-26 17:46:06 -05:00
Saba
33bc62dc19 Fix type of use_xmp_metadata to be bool, rather than str 2022-01-24 21:53:26 -05:00
Debanjum Singh Solanky
179153dc5a Rename RawConfig Types for Consistency
- Naming convention - [ContentType][ConfigType]Config
  - Where [ConfigType] ~ Content, Search, Processor
  - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation

- Current Configs:
  - Content:
    - Org Notes
    - Org Music
    - Image
    - Ledger/Beancount

  - Search:
     - Asymmetric
     - Symmetric
     - Image

  - Processor:
    - Conversation
2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky
c64e0c2965 Load model from HuggingFace if model_directory unset in config YAML
- Do not save/load the model to/from disk when model_directory unset
in config.yml
- Add symmetric search default config to cli.py
2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
510faa1904 Save Image Search Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
934ec233b0 Add Search Config for Symmetric Model. Save Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
b63026d97c Save Asymmetric Search Model to Disk
- Improve application load time
- Remove dependence on internet to startup application and perform semantic search
2022-01-14 17:36:27 -05:00
Debanjum Singh Solanky
2e53fbc844 Fix the user intent extraction prompt for GPT. Clean up chatbot test 2022-01-12 10:36:01 -05:00
Debanjum Singh Solanky
ea28897cdd Remove deprecated conversation_history field from config 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
5a686b7be9 Add logs for chat bot in verbose mode 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
6dc2a99d35 Merge branch 'master' of github.com:debanjum/semantic-search into add-summarize-capability-to-chat-bot
- Fix openai_api_key being set in ConfigProcessorConfig
- Merge addition of config UI and config instantiation updates
2021-12-20 13:30:42 +05:30
Debanjum Singh Solanky
65da7daf1f Load, Save Conversation Session Summaries to Log. s/chat_log/chat_session
Conversation logs structure now has session info too instead of just chat info
Session info will allow loading past conversation summaries as context for AI in new conversations

{
    "session": [
    {
        "summary": <chat_session_summary>,
        "session-start": <session_start_index_in_chat_log>,
        "session-end": <session_end_index_in_chat_log>
    }],
    "chat": [
    {
        "intent": <intent-object>
        "trigger-emotion": <emotion-triggered-by-message>
        "by": <AI|Human>
        "message": <chat_message>
        "created": <message_created_date>
    }]
}
2021-12-15 10:17:07 +05:30
Saba
97a6dfaa1e Use default value False for verbose parameter, and small changes
Pass config as parameter to initialize_search, change name of API methods to handle config CRUD operations, and initalize config to FullConfig
2021-12-11 14:13:14 -05:00
Saba
9536358d34 Fix key error model_name issue by upgrade sentence-transformers version
Refer to https://github.com/UKPLab/sentence-transformers/issues/1241
Also user verbose flag passed through function parameters in image_search
2021-12-11 11:58:19 -05:00
Saba
ce7a751e6b Fix passing verbose flag down in symmetric_ledger.py 2021-12-11 11:36:32 -05:00
Saba
d65190c3ee Update unit tests, files with removing model suffix to config types 2021-12-09 08:50:38 -05:00
Debanjum Singh Solanky
0ac1e5f372 Summarize chat logs and notes returned by semantic search via /chat API 2021-12-08 02:34:07 +05:30
Saba
76e9e9da2f Update unit tests to use the new BaseModel types 2021-12-05 09:31:39 -05:00
Saba
9b16cdbb41 Use past tense for verbose log 2021-12-04 11:45:44 -05:00
Saba
10e4065e05 Consolidate the search config models and pass verbose as a top level flag 2021-12-04 11:43:48 -05:00
Saba
43e647835b Append Model Suffixed to config models 2021-12-04 10:51:21 -05:00
Saba
e068968b35 Update imports for raw config models in config.py 2021-12-04 10:44:55 -05:00
Saba
4d6284b0af Remove Test suffix from Config models 2021-12-04 10:44:13 -05:00
Saba
7fcc8d2cef Add null check for processor config 2021-12-04 10:11:00 -05:00
Saba
7ca4fc3453 Resolve mrege conflicts with updated processor conversation data model 2021-11-28 16:22:52 -05:00
Saba
87a6c2d716 Use parse_obj instead of parse_raw as incoming data is in dict 2021-11-28 14:34:32 -05:00
Saba
5d50487d83 Linting
New line at end of config.html
Remove debug print statement
2021-11-28 13:32:56 -05:00
Saba
6f466c8d99 Use global config and add a regenerate button to the config ui' && git push 2021-11-28 13:28:22 -05:00
Saba
34d1e4199c Use alias generator when deserializing the config file 2021-11-28 13:05:48 -05:00
Saba
19b81e82f0 Write back to the raw config.yml file on update 2021-11-28 12:34:40 -05:00
Saba
8837b02de6 dump updated config to a yaml file 2021-11-28 12:26:07 -05:00
Saba
5b80b87379 Streamline None checking in initialize_search 2021-11-28 12:05:04 -05:00
Saba
bf8ae31e6a Streamline None checking in initialize_search 2021-11-28 11:59:45 -05:00
Saba
da52433d89 Update to re-use the raw config base models in config.py as well 2021-11-28 11:57:33 -05:00
Saba
6292fe4481 Update to re-use the raw config base models in config.py as well 2021-11-28 11:57:13 -05:00
Saba
311c4b7e7b Working API request body parsing to /post config! 2021-11-28 11:16:33 -05:00
Saba
66183cc298 Working API request body parsing to /post config! 2021-11-28 11:12:26 -05:00
Debanjum Singh Solanky
5cd920544d Add GPT method to summarize notes and chat logs 2021-11-28 13:08:05 +05:30
Debanjum Singh Solanky
1785047ea6 Improve understand primer and load understand response as dict 2021-11-28 13:04:16 +05:30
Saba
64645c3ac1 Begin type checking/input validation effort 2021-11-27 21:47:56 -05:00
Saba
9a0264b7fc Add a dummy POST config endpoint, integrate with editable UI 2021-11-27 20:36:03 -05:00
Saba
f3b03ea5b7 Make raw data reactive to changes 2021-11-27 19:17:15 -05:00
Debanjum Singh Solanky
67c3cd7372 Wire up GPT understand method to /chat API. Log conversation metadata too 2021-11-28 00:04:39 +05:30
Saba
3db06eee3f Basic example of serving conifg as JSON and retriving on button click 2021-11-27 10:49:33 -05:00
Saba
3d4471e107 Merge branch 'master' of github.com:debanjum/semantic-search into saba/configui 2021-11-27 08:52:48 -05:00
Debanjum Singh Solanky
ccfb97e1a7 Wire up minimal conversation processor. Expose it over /chat API endpoint
Ensure conversation history persists across application restart
2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky
a99b4b3434 Make conversation processor configurable 2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky
d4e1120b22 Add GPT based conversation processor to understand intent and converse with user
- Allow conversing with user using GPT's contextually aware, generative capability
- Extract metadata, user intent from user's messages using GPT's general understanding
2021-11-27 18:12:01 +05:30
Saba
baee52648d Set up basic ui page with no functionality 2021-11-26 14:51:11 -05:00
debanjum
46661b3057 Ensure top_k never more than total entries to run symmetric search on 2021-11-16 11:32:21 -08:00
debanjum
8c858d1a94 Reduce symmetric search results for cross-encoder to re-rank to improve search speed 2021-11-16 11:31:19 -08:00
Debanjum Singh Solanky
f3fd5ae978 Improve code comments. Do not import unused modules in asymmetric search 2021-11-17 00:58:31 +05:30
Debanjum Singh Solanky
8cf2465e8e Ensure top_k never more than total entries to search from 2021-11-17 00:56:31 +05:30
Debanjum Singh Solanky
4d37ace3d6 Reduce search results for cross-encoder to re-rank to improve search speed
Search time on my notes reduced from 14s to 4s. Cross-encoder
re-ranking step takes majority time, not the cosine similarity search
2021-11-17 00:50:28 +05:30
Debanjum Singh Solanky
1832e418e5 Use raw string for regex in orgnode to fix deprecation warning 2021-10-02 17:38:31 -07:00
Debanjum Singh Solanky
f59e321419 Update CLIP model load path 2021-10-02 16:50:06 -07:00
Debanjum Singh Solanky
c47a8cdf16 Allow configuring host, port or unix socket of server via CLI 2021-10-02 16:16:33 -07:00
Debanjum Singh Solanky
516f28b082 Merge branch 'master' of github.com:debanjum/semantic-search 2021-09-30 04:17:32 -07:00
Debanjum Singh Solanky
d2905c4be6 Move tests out to project root. Use absolute import in project
tests/ directory in project root is more standard.
Just had to use absolute path for internal module imports to get it to
work
2021-09-30 04:12:14 -07:00
Debanjum Singh Solanky
58bb420f69 Fix image_metadata argument ordering bug. Add E2E image search test
- Image search test seems a little flaky
- Interchanged argument was causing inaccurate results earlier
2021-09-30 03:30:47 -07:00
Debanjum Singh Solanky
d5597442f4 Modularize Code. Wrap Search, Model Config in Classes. Add Tests
Details
  - Rename method query_* to query in search_types for standardization
  - Wrapping Config code in classes simplified mocking test config
  - Reduce args beings passed to a function by passing it as single
    argument wrapped in a class
  - Minimize setup in main.py:__main__. Put most of it into functions
    These functions can be mocked if required in tests later too

Setup Flow:
  CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel
2021-09-30 02:04:04 -07:00
Debanjum Singh Solanky
f4dd9cd117 Use type specific model for other search types too. Expose them via SearchModels
- Wrap Image, Music, Ledger search into the type of SearchModel they use
  Similar to what was done for notes model by wrapping it's config
  into an AsymmetricSearchModel.

- Use the uber wrapper class to expose all type specific search models
2021-09-29 21:09:42 -07:00
Debanjum Singh Solanky
352d2930ee Use multiple threads to generate model embeddings. Other minor formating 2021-09-29 20:47:58 -07:00
Debanjum Singh Solanky
e22e0b41e3 Wrap asymmetric search model into SearchModels. Test notes search end-to-end
- Wrap asymmetric search model parameters into AsymmetricSearchModel class
- Create wrapper for all search type models. Put notes search model into it
- Test notes search end-to-end from client API layer to results.
  Use model build on test data
2021-09-29 20:47:35 -07:00
Debanjum Singh Solanky
cde11a2331 Wrap search type enablement status in a search settings class
- Cleaner, more idiomatic usage of a global variable
- Simplifies mocking when testing client in pytest as setting wrapped
  in object rather than a simple type. So passed around by reference
2021-09-29 19:18:33 -07:00
Debanjum Singh Solanky
81ce0cacc3 Only allow supported search types to /search, /regenerate APIs
- Use a SearchType to limit types that can be passed by user
- FastAPI automatically validates type passed in query param
- Available type options show up in Swagger UI, FastAPI docs
- controller code looks neater instead of doing string comparisons for type
- Test invalid, valid search types via pytest
2021-09-29 19:12:56 -07:00
Debanjum Singh Solanky
5db08c5293 Set query as heading of notes search results in Emacs Org buffer 2021-09-29 13:30:15 -07:00
Debanjum Singh Solanky
fdb60a8dcf Set Query as Heading of Image Search Results Emacs Buffer 2021-09-16 12:30:06 -07:00
Debanjum Singh Solanky
169ddcc8c6 Make Using XMP Metadata to Enhance Image Search Optional, Configurable
- Break the compute embeddings method into separate methods:
  compute_image_embeddings and compute_metadata_embeddings

- If image_metadata_embeddings isn't defined, do not use it to enhance
  search results. Given image_metadata_embeddings wouldn't be defined
  if use_xmp_metadata is False, we can avoid unnecessary addition of
  args to query method
2021-09-16 12:01:05 -07:00
Debanjum Singh Solanky
a4a23d7a72 Batch encode XMP metadata from images too for image_search 2021-09-16 11:11:36 -07:00
Debanjum Singh Solanky
3afe054312 Make image batch size to encode configurable via config.yml 2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky
41c328dae0 Batch encode images to keep memory consumption manageable
- Issue:
  Process would get killed while encoding images
  for consuming too much memory

- Fix:
  - Encode images in batches and append to image_embeddings
  - No need to use copy or deep_copy anymore with batch processing.
    It would earlier throw too many files open error

Other Changes:
  - Use tqdm to see progress even when using batch
  - See progress bar of encoding independent of verbosity (for now)
2021-09-16 10:15:54 -07:00
Debanjum Singh Solanky
d8abbc0552 Use XMP metadata in images to improve image search
- Details
  - The CLIP model can represent images, text in the same vector space

  - Enhance CLIP's image understanding by augmenting the plain image
    with it's text based metadata.
    Specifically with any subject, description XMP tags on the image

  - Improve results by combining plain image similarity score with
    metadata similarity scores for the highest ranked images

- Minor Fixes
  - Convert verbose to integer from bool in image_search.
    It's already passed as integer from the main program entrypoint

  - Process images with ".jpeg" extensions too
2021-09-16 08:55:20 -07:00
Debanjum Singh Solanky
0e34c8f493 Allow semantic search on images from Emacs
Images are rendered inline a temporary org-mode buffer
2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky
7d5514ecaa Allow user to override inferred search type with other valid options 2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky
3bdeeb1e19 Autoload main semantic-search function 2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky
f4bde75249 Decouple results shown to user and text the model is trained on
- Previously:
  The text the model was trained on was being used to
  re-create a semblance of the original org-mode entry.

- Now:
  - Store raw entry as another key:value in each entry json too
    Only return actual raw org entries in results
    But create embeddings like before
  - Also add link to entry in file:<filename>::<line_number> form
    in property drawer of returned results
    This can be used to jump to actual entry in it's original file
2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky
7ee3007070 Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present 2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
0263d4d068 Enable semantic search for songs in org-music
Org-Music: https://github.com/debanjum/org-music
2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
fd7888f3d4 Resolve relative file paths to config YAML file in cli.py 2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky
fc531a1915 Resolve relative file paths to model embeddings in all search types 2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky
4daeddbbda Enable Semantic Search on Images 2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky
fd217fe8b7 Enable Semantic Search for Beancount transactions 2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky
97263b8209 Move CLI into a separate module. Move CLI tests into a separate file 2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky
78a1f4ebb4 Use YAML file to allow user to configure application. Add tests
- YAML Config
  - Can specify all params[1] earlier being passed via cmd args in config YAML
  - Can now also configure sentence-transformer models to use etc for search
    - [1] Config params
       - org files
       - compressed entries file config path
       - embeddings file config path

  - Include sample_config.yaml
  - Include sample .org file from this repos readmes

- CLI
  - Configuration Priority: Config via cmd > Config via YAML > Default Config
  - Test CLI, include test config.yml for the tests

- Set default type to None unless set via query param to API
  Run notes search if search_enabled, also if type is None (default)
  Prepares for running queries on all search types unless type
  specified in API query param

- Update Readme
2021-08-21 19:07:39 -07:00
Debanjum Singh Solanky
bafc86d583 Add helpers to merge dictionaries and get keys deep inside a dictionary 2021-08-21 18:27:50 -07:00
Debanjum Singh Solanky
252266b62a Pass type of item via regenerate API. Default type query param to None 2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky
ff7207a6bd Extract commandline arguments into separate testable method 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
a3a1100be9 Arrange modules in standardized ordering 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
569e30b1c8 Create a few basic tests 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
af9660f28e Move application files under src directory. Update Readmes
- Remove callign asymmetric search script directly command.
  It doesn't work anymore on calling directly due to internal package
  import issues
2021-08-17 04:11:03 -07:00