Commit graph

1997 commits

Author SHA1 Message Date
Debanjum Singh Solanky
ec41482324 Upgrade default cross-encoder to mixedbread ai's mxbai-rerank-xsmall
Previous cross-encoder model was a few years old, newer models should
have improved in quality. Model size increases by 50% compared to
previous for better performance, at least on benchmarks
2024-04-24 09:50:09 +05:30
Debanjum Singh Solanky
7eaf9367fe Support more embedding models by making query, docs encoding configurable
Most newer, better embeddings models add a query, docs prefix when
encoding. Previously Khoj admins couldn't configure these, so it
wasn't possible to use these newer models.

This change allows configuring the kwargs passed to the query, docs
encoders by updating the search config in the database.
2024-04-24 09:49:17 +05:30
Debanjum Singh Solanky
4f7237b158 Make chat actors generate valid json with more local models
Improve tool, online search, webpage links, docs search chat actor
prompts. Ensure works with hermes-2-pro and llama-3.

Be more specific about generating JSON and not saying anything else.
2024-04-24 09:40:00 +05:30
Debanjum Singh Solanky
a2e4e4bede Add support for Llama 3 in Khoj offline mode
- Improve extract question prompts to explicitly request JSON list
- Use llama-3 chat format if HF repo_id mentions llama-3. The
  llama-cpp-python logic for detecting when to use llama-3 chat format
  isn't robust enough currently
2024-04-24 09:40:00 +05:30
Debanjum Singh Solanky
8e77b3dc82 Fix infer_max_tokens func when configured_max_tokens is set to None 2024-04-24 09:36:29 +05:30
Debanjum Singh Solanky
8196ab62f9 Make valid file extension checking case insensitive on Desktop app 2024-04-24 09:35:20 +05:30
Debanjum Singh Solanky
5def14e3bb Skip indexing non-existent folders on Desktop app 2024-04-24 09:35:20 +05:30
Debanjum Singh Solanky
cd05f262a6 Pass auth headers to fix lazy load of chat messages on Desktop app 2024-04-24 09:35:20 +05:30
Debanjum Singh Solanky
4d5d3e6433 Set chat-message height to height of content in web, desktop
In some cases, especially with image generation requests, this was
causing the chat messages to overlap in the chat UI
2024-04-24 09:35:20 +05:30
sabaimran
60658a8037 Get rid of enable flag for the offline chat processor config
- Default, assume that offline chat is enabled if there is an offline chat model option configured
2024-04-23 23:08:29 +05:30
sabaimran
ac474fce38 Ensure that the tokenizer and max prompt size are used the wrapper method 2024-04-23 21:22:23 +05:30
Olatoyan George
ad59180fb8
Added indication in the desktop UI for back-end connectivity (#711)
* Changed the styling of the link that takes a user to the settings page into a button
* added an indicator that shows if a user is connected to the server or not
* made a class name more descriptive and also made the text in first run message more intuitive
* changed the command to install dependencies in the README.md
* changed the class name of the first run message text to be more descriptive
* added icons in the desktop UI that shows if a file is synced successfully or not
* made the link class name in the homepage more descriptive
* fixed the hover issue on status box in the chat header pane
* fixed hovering issue on status box on macOS
2024-04-23 16:43:48 +05:30
Debanjum
419b044ac5
Use set, inferred max token limits wherever chat models are used (#713)
- User configured max tokens limits weren't being passed to
  `send_message_to_model_wrapper'
- One of the load offline model code paths wasn't reachable. Remove it
  to simplify code
- When max prompt size isn't set infer max tokens based on free VRAM
  on machine
- Use min of app configured max tokens, vram based max tokens and
  model context window
2024-04-23 16:42:35 +05:30
AjaySDwivedi1
abf6f963ea
Replaced reinitialize and save all button to a sync button in config.… (#701)
Replaced reinitialize and save all button to a sync button in config
2024-04-23 16:42:11 +05:30
Debanjum Singh Solanky
c39c4e4ec4 Improve prompt for online search query generation chat actor
- Allow searching github, pypi for information about Khoj
- Enable creating multiple search queries by rewording prompt
2024-04-22 01:32:11 +05:30
Debanjum Singh Solanky
175169c156 Use set, inferred max token limits wherever chat models are used
- User configured max tokens limits weren't being passed to
  `send_message_to_model_wrapper'
- One of the load offline model code paths wasn't reachable. Remove it
  to simplify code
- When max prompt size isn't set infer max tokens based on free VRAM
  on machine
- Use min of app configured max tokens, vram based max tokens and
  model context window
2024-04-20 11:23:28 +05:30
Debanjum Singh Solanky
002cd14a65 Only let agent use online search tool if connected to it 2024-04-20 11:19:48 +05:30
Debanjum Singh Solanky
75c9ebbc54 Only show uvicorn debug logs at higher verbosity levels
Don't automatically show the uvicorn logs when in_debug_mode, only
show on at least verbosity = 2, i.e when start khoj with -vv flag
2024-04-20 11:18:01 +05:30
sabaimran
d11354f9c8 Remove additional references to image content config 2024-04-17 13:00:50 +05:30
sabaimran
105dbf49e4 Fix max_duration_in_seconds for the update_embeddings job 2024-04-17 13:00:18 +05:30
Debanjum Singh Solanky
8e0bae894d Extract run with process lock logic into func. Use for content reindexing 2024-04-17 12:31:19 +05:30
Debanjum Singh Solanky
e9f608174b Fix access to Khoj admin panel from non HTTPS custom domains
To access the Khoj admin panel from a non HTTPS custom domain the
`KHOJ_NO_SSL' and `KHOJ_DOMAIN' env vars need to be explictly set.

See the updated setup docs for details.

Resolves #662
2024-04-17 03:20:05 +05:30
sabaimran
b0059654c9 Do not create an import error if the resend module is not available 2024-04-17 01:00:22 +05:30
sabaimran
f04ead7c37 Remove seting up log line for configuring image search 2024-04-17 00:45:39 +05:30
sabaimran
0208688801 Increase factor for n_ctx reduciton to 2e6 2024-04-17 00:41:36 +05:30
Debanjum Singh Solanky
1f2ffce85b Copy chat message with it's markdown formatting in Web, Desktop apps 2024-04-16 22:10:34 +05:30
sabaimran
91c8b137f1
Add a database lock for jobs that shouldn't be run by multiple workers (#706)
* Add a database lock for jobs that shouldn't be run by multiple workers

* Import relevant functions from utils.helpers
2024-04-16 21:29:27 +05:30
sabaimran
adb2e8cc5f Check if n is populated before making a comparison 2024-04-16 02:05:58 +05:30
Debanjum Singh Solanky
6707ccc463 Check before updating "chat" key in meta_log in chat history API endpoint 2024-04-15 21:06:47 +05:30
Debanjum Singh Solanky
4e7812fe55 Use Django management cmd to update inline images in DB to/from WebP/PNG
This provides Khoj server admins more control on migrating their S3
images to WebP format from PNG
2024-04-15 20:19:49 +05:30
Debanjum Singh Solanky
7fab8d6586 Only use chat messages count in history API endpoint when set by client 2024-04-15 19:12:57 +05:30
Debanjum
6b3ef61dd2
Improve Chat Page Load Perf, Offline Chat Perf and Miscellaneous Fixes (#703)
### Store Generated Images as WebP 
- 78bac4ae Add migration script to convert PNG to WebP references in database
- c6e84436 Update clients to support rendering webp images inline
- d21f22ff Store Khoj generated images as webp instead of png for faster loading

### Lazy Fetch Chat Messages to Improve Time, Data to First Render
This is especially helpful for long conversations with lots of images
- 128829c4 Render latest msgs on chat session load. Fetch, render rest as they near viewport
- 9e558577 Support getting latest N chat messages via chat history API

### Intelligently set Context Window of Offline Chat to Improve Performance
- 4977b551 Use offline chat prompt config to set context window of loaded chat model

### Fixes
- 148923c1 Fix to raise error on hitting rate limit during Github indexing
- b8bc6bee Always remove loading animation on Desktop app if can't login to server
- 38250705 Fix `get_user_photo` to only return photo, not user name from DB

### Miscellaneous Improvements
- 689202e0 Update recommended CMAKE flag to enable using CUDA on linux in Docs
- b820daf3 Makes logs less noisy
2024-04-15 18:34:29 +05:30
Debanjum Singh Solanky
a352940dfd Use Django management command to update images URL in DB to WebP
This provides Khoj server admins more control on migrating their S3
images to WebP format from PNG
2024-04-15 17:53:41 +05:30
Debanjum Singh Solanky
7d8e8eb0cf Use Enum to type text-to-image intent of Khoj chat response 2024-04-15 17:53:40 +05:30
Debanjum Singh Solanky
128829c477 Show latest msgs on chat session load. Fetch rest as they near viewport
- Reduces time to first render when loading long chat sessions
- Limits size of first page load, when loading long chat sessions

These performance improvements are maximally felt for large chat
sessions with lots of images generated by Khoj

Updated web and desktop app to support these changes for now
2024-04-15 16:10:56 +05:30
Debanjum Singh Solanky
9e5585776c Support getting latest N chat messages via chat history API
Get latest N if N > 0, else return all messages except latest N from
the conversation
2024-04-15 15:32:32 +05:30
Debanjum Singh Solanky
e5ff85f6fb Start fetching khoj css before icons to reduce time with no styling
This should reduce frequency of page load jitter when icons are loaded
before style is applied
2024-04-15 15:32:32 +05:30
Debanjum Singh Solanky
d5de59d411 Do not assume results key present in notion content when indexing 2024-04-15 08:02:20 +05:30
Debanjum Singh Solanky
4977b55106 Use offline chat prompt config to set context window of loaded chat model
Previously you couldn't configure the n_ctx of the loaded offline chat
model. This made it hard to use good offline chat model (which these
days also have larger context) on machines with lower VRAM
2024-04-14 02:35:36 +05:30
Debanjum Singh Solanky
148923c13a Fix to raise error on hitting rate limit during Github indexing 2024-04-13 22:09:13 +05:30
sabaimran
f24d71c71c
Improve the agents UX (#702)
- Make the chat buttons look more clickable
- Show agent name in new conversation message
- Add an icon to the CTA to send agent a message
2024-04-13 20:11:37 +05:30
Debanjum Singh Solanky
78bac4ae05 Add migration script to convert PNG to WebP references in database 2024-04-13 19:06:28 +05:30
Debanjum Singh Solanky
c6e8443631 Update clients to support rendering webp images inline
This is for self-hosted scenarios where AWS S3 uploads is not enabled
2024-04-13 13:11:18 +05:30
Debanjum Singh Solanky
d21f22ffa1 Store Khoj generated images as webp instead of png for faster loading 2024-04-13 13:03:32 +05:30
Debanjum Singh Solanky
b820daf38f Makes logs less noisy
- Show telemetry enabled/disabled state on init, not every 2 minutes
- Convert no docs synced logs to debug level instead of warning
  Having synced docs isn't as important to use Khoj now, unlike before
2024-04-13 11:22:58 +05:30
Debanjum Singh Solanky
b8bc6bee83 Always remove loading animation on Desktop app if can't login to server 2024-04-13 11:02:44 +05:30
Debanjum Singh Solanky
382507051f Fix get_user_photo to only return photo, not user name from DB 2024-04-13 11:02:30 +05:30
sabaimran
f06ec485cb Fix redirect url process for login flow, existing user 2024-04-12 17:10:05 +05:30
sabaimran
b86e68a29d Make it easier to view agents in the admin page 2024-04-12 13:02:22 +05:30
sabaimran
1377a44a1a Suppress debug logs from uvicorn.error to avoid clutter from websockets
- If application is not in DEBUG_MODE
2024-04-12 12:12:16 +05:30
Debanjum Singh Solanky
89b8ec3546 Release Khoj version 1.10.2 2024-04-12 11:53:32 +05:30
Debanjum Singh Solanky
50b4788a91 Remove chat loading animation in login required state on Desktop app 2024-04-12 11:50:54 +05:30
Debanjum Singh Solanky
b3f4794d91 Remove the unnecessary async/await func chains on Desktop app 2024-04-12 11:49:25 +05:30
Debanjum Singh Solanky
1e30a072d4 Just use file ext to identify indexable files to fix Desktop app install
- Magika on Desktop app was too bloated (100Mb to 250Mb) and broke
  install for some reason. Not sure why it was causing the app install
  to fail but do not have time to currently investigate

- Just use file extensions whitelist it's good enough for now. Let
  server handle the deeper identification of file type
2024-04-12 11:16:07 +05:30
Debanjum Singh Solanky
5c7797dbca Only check content type if file extension cannot identify text file 2024-04-12 03:40:42 +05:30
Debanjum Singh Solanky
7d2ef728e6 Fix identifying pdf files on server
Introduced bug in previous commit that would stop indexing PDF files
as trying to check content_group instead of mime_type is application/pdf
2024-04-12 03:07:46 +05:30
Debanjum Singh Solanky
07f8fb5c5b Release Khoj version 1.10.1 2024-04-12 02:18:07 +05:30
Debanjum Singh Solanky
a7d9102c33 Make identifying text, code files with Magika more robust on server
Use identified content group rather than mime_type to find text files.
2024-04-12 02:12:26 +05:30
Debanjum Singh Solanky
60337086f9 Release Khoj version 1.10.0 2024-04-12 01:01:02 +05:30
Debanjum Singh Solanky
34c3f70203 Index only files with valid text extension in folders synced by Desktop app
This maintains consistent set of indexable files from Desktop app,
whether indexing via file or folder filters
2024-04-12 00:59:54 +05:30
Debanjum
9a48f72041
Index more text file types from Desktop, Github (#692)
### Index more text file types 
- Index all text, code files in Github repos. Not just md, org files
- Send more text file types from Desktop app and improve indexing them
- Identify file type by content & allow server to index all text files

### Deprecate Github Indexing Features
- Stop indexing commits, issues and issue comments in a Github repo
- Skip indexing Github repo on hitting Github API rate limit

### Fixes and Improvements
- **Fix indexing files in sub-folders from Desktop app**
- Standardize structure of text to entries to match other entry processors
2024-04-12 00:08:29 +05:30
Debanjum Singh Solanky
0819b83d0b Fix constructing status update strings for intermediate chat steps 2024-04-11 20:31:32 +05:30
Debanjum Singh Solanky
d15b9bc272 Tell doc search actor to not generate online queries for doc search
This can pick up irrelevant details from notes
2024-04-11 19:49:41 +05:30
Debanjum Singh Solanky
15a78b19ad Improve Inferred Document Search Query Extraction from GPT
Using stop_words = "\n" was preventing JSON responses with newlines in
them
2024-04-11 19:24:04 +05:30
Debanjum Singh Solanky
653681967e Show inferred document search queries in intermediate chat step on Web app 2024-04-11 19:24:04 +05:30
Debanjum Singh Solanky
997741119a Show better intermediate steps when responding to chat via web socket
- Show internet search, webpage read, image query, image generation steps
- Standardize, improve rendering of the intermediate steps on the web app

Benefits:
1. Improved transparency, allow users to see what Khoj is doing behind
   the scenes and modify their query patterns to improve response quality
2. Reduced websocket connection keep alive timeouts for long running steps
2024-04-11 18:04:40 +05:30
sabaimran
fae7900f19 Remove more 2024-04-11 00:27:44 +05:30
sabaimran
5d1dd3e2b7 If resend not enabled, don't send the welcome email 2024-04-10 23:52:42 +05:30
sabaimran
d2f9c43c8e Use datetime.timezone.utc instead of datetime.utc 2024-04-10 23:07:43 +05:30
Debanjum Singh Solanky
f2dc9709b7 Use Magika to more robustly identify text files to send for indexing
- `file-type' doesn't handle mis-labelled files or files without
   extensions well

- Only show supported file types in file selector dialog on Desktop app
  Use Magika to get list of text file extensions. Combine with other
  supported extensions to get complete list of supported file extensions.
  Use it to limit selectable files in the File Open dialog.

  Note: Folder selector will index text files with no extensions as well
2024-04-10 22:44:24 +05:30
sabaimran
3fe94a67b0
Send welcome emails when a new user signs up (#691)
* Don't trigger any re-indexing on server initailization

* Integrate Resend to send welcome emails when a new user signs up

- Only send if this is the first time they've signed in
- Configure welcome email with basic styling, as more complex designs don't work and style tag did not work
2024-04-10 19:57:33 +05:30
Debanjum
6d153022f6
Improve nav pane, chat session UI on Desktop, Web app (#693)
### Enable copying chat messages. Improve copy button behavior and styling
  - Add button to copy chat messages on Desktop, Web apps
  - Improve copy button's icon, hover color & click animation in Desktop, Web apps

### Improve Navigation, Chat Session Panes on Desktop, Web apps
  - Dynamically generate navigation menu based on user info from server
  - Create API endpoint to get authenticated user information
  - Collapse navigation tabs into icons on mobile. Add spacing to them
  - Add Chat navigation tab back to top pane on Web app
  - Use proper icons for Search, Chat and Agents tab on navigation pane
  
### Miscellaneous Improvements
  - Make current chat expand to full width when session panel collapsed on Desktop App
  - Add chat session loading spinner to Desktop App (same as Web app)
   
### Fixes 
  - Show title bar in Khoj desktop app on Windows to simplify close, minimize etc.
  - Only render first run setup message once if error or server not running
  - Fix showing Search navigation tab from Agent pages on web client
2024-04-10 19:54:12 +05:30
Debanjum Singh Solanky
48d249db9e Center the nav item text and user profile initial icons 2024-04-10 19:38:43 +05:30
Debanjum Singh Solanky
60f6a1c6f1 Use svg icons in nav pane to standardize styling on Web, Desktop apps
Emojis varied based on device. svg icons standardize icon styles of
the web, desktop apps
2024-04-10 19:38:43 +05:30
Debanjum Singh Solanky
cccea484e4 Pass username, location context in system prompt instead of chat message
The username and location in system prompt should disambiguate user
context from user's actual message for the chat model.

It doesn't need to be told to not mention the context or acknowledge
the context instructions in it's response, as it understands that this
information is just context and not part of the user's actual message.
2024-04-10 15:05:33 +05:30
Debanjum Singh Solanky
804c04f7b9 Do not render copy message button on every Khoj thinking step
Only render copy chat message button once, after message text is rendered
2024-04-10 14:48:36 +05:30
sabaimran
a4afada746 Remove client-side timeouts for the khoj socket 2024-04-10 13:35:25 +05:30
Debanjum Singh Solanky
cadeaac769 Align conversation sessions side panel on Desktop app with Web app
- Move new conversation button to right of "Conversation" title
- Reduce size of chat message loading ellipsis animation
- Add loading animation for chat session
2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
1c3d129e08 Add button to copy chat messages on Desktop client 2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
0a5a91619e Improve copy button's icon, hover color & click animation in Desktop UI 2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
184873213c Add button to copy chat messages on Web client 2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
f56522cb8e Improve copy button's icon, hover color & click animation in Web UI 2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
8ff3890ba8 Dynamically generate navigation menu based on user info from server 2024-04-10 10:34:36 +05:30
Debanjum Singh Solanky
94c69eb8e3 Create API endpoint to get authenticated user information
This help clients render UI with user information
2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
377e979800 Make current chat expand to full width when session panel collapsed
This behavior also matches web client behavior on chat session panel
collapse
2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
913dcdfbcd Only render first run setup message once if error or server not running 2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
3b630841bd s/aget_all_filenames_by_source/get_all_filenames_by_source as sync func 2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
e45edbb992 Collapse navigation tabs into icons on mobile. Add spacing to them 2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
93edd5427f Add Chat navigation tab back to top pane on web client
Reduces user confusion on how to go to chat pane
Add emoji's for each tab to provide cleaner, iconified division between
the nav options
2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
8159d1ab25 Fix showing Search navigation tab from Agent pages on web client
The `has_documents' flag wasn't being passed. So the search tab
always showing up as empty instead of being dynamically enabled if
documents had been indexed.
2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
76cb543347 Show title bar in Khoj desktop app on Windows 2024-04-09 21:04:44 +05:30
Debanjum Singh Solanky
f040418cf1 Fix indexing files in sub-folders on the Desktop app
- `fs.readdir' func in node version 18.18.2 has buggy `recursive' option
  See nodejs/node#48640, effect-ts/effect#1801 for details

- We were recursing down a folder in two ways on the Desktop app.
  Remove `recursive: True' option to the `fs.readdirSync' method call
  to recurse down via app code only
2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
a8dec1c9d5 Index all text, code files in Github repos. Not just md, org files 2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
8291b898ca Standardize structure of text to entries to match other entry processors
Add process_single_plaintext_file func etc with similar signatures as
org_to_entries and markdown_to_entries processors

The standardization makes modifications, abstractions easier to create
2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
079f409238 Skip indexing Github repo on hitting Github API rate limit
Sleep until rate limit passed is too expensive, as it keeps a
app worker occupied.

Ideally we should schedule job to contine after rate limit wait time
has passed. But this can only be added once we support jobs scheduling.
2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
d5c9b5cb32 Stop indexing commits, issues and issue comments in Github indexer
Normal indexing quickly Github hits rate limits. Purpose of exposing
Github indexer is for indexing content like notes, code and other
knowledge base in a repo.

The current indexer doesn't scale to index metadata given Github's
rate limits, so remove it instead of giving a degraded experience of
partially indexed repos
2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
7ff1bd9f8b Send more text file types from Desktop app and improve indexing them
- Allow syncing more file types from desktop app to index on server
  - Use `file-type' package to identify valid text file types on Desktop app

- Split plaintext entries into smaller logical units than a whole file
  Since the text splitting upgrades in #645, compiled chunks have more
  logical splits like paragraph, sentence.
  Show those (potentially) smaller snippets to the user as references

- Tangential Fix:
  Initialize unbound currentTime variable for error log timestamp
2024-04-09 20:19:40 +05:30
Debanjum Singh Solanky
89915dcb4c Identify file type by content & allow server to index all text files
- Use Magika's AI for a tiny, portable and better file type
  identification system
- Existing file type identification tools like `file' and `magic'
  require system level packages, that may not be installed by default
  on all operating systems (e.g `file' command on Windows)
2024-04-09 20:19:39 +05:30
sabaimran
312528d471 Fix typo in SECURE_PROXY_SSL_HEADER settings 2024-04-09 12:33:21 +05:30
sabaimran
e56c5e67dd Revert SSL Redirect setting as it prevents the admin page from loading 2024-04-09 12:24:48 +05:30
sabaimran
1770bb174b Add UUID to the KhojUser search fields and inc frequency of telemetry job to 2 mins 2024-04-09 11:51:51 +05:30
sabaimran
ab51ae9091 Use SECURE_SSL_REDIRECT to ensure requests are routed to https always 2024-04-09 10:18:12 +05:30
sabaimran
1c229dad91 Set daily limit for unsubsribed users to 5 in websocket API 2024-04-08 21:16:48 +05:30
sabaimran
27815d982c Redirect user to the login page when either of the csrf token inputs is missing 2024-04-08 20:22:17 +05:30
sabaimran
d257629f81 Handle case when properties field isn't present in the page 2024-04-08 16:15:47 +05:30
sabaimran
089e0d028b Add a more gracefull error message when the rate limit is exceeded 2024-04-08 15:20:54 +05:30
Debanjum
11ce3e2268
Update Text Chunking Strategy to Improve Search Context (#645)
## Major
- Parse markdown, org parent entries as single entry if fit within max tokens
- Parse a file as single entry if it fits with max token limits
- Add parent heading ancestry to extracted markdown entries for context
- Chunk text in preference order of para, sentence, word, character

## Minor
- Create wrapper function to get entries from org, md, pdf & text files
- Remove unused Entry to Jsonl converter from text to entry class, tests
- Dedupe code by using single func to process an org file into entries

Resolves #620
2024-04-08 13:56:38 +05:30
Debanjum Singh Solanky
67b1178aec Remove debug logs generated while compiling org-mode entries 2024-04-08 13:01:24 +05:30
sabaimran
731ad03348 Skip indexing commits that are missing properties 2024-04-07 15:19:07 +05:30
sabaimran
376eaf64cd Check if results are present in the pages or db response in Notion 2024-04-07 15:19:07 +05:30
Debanjum Singh Solanky
8222615280 Do not add original user message to knowledge search queries for offline chat
It's not required anymore. The extracted questions by the offline chat
model being used should be good enough.
2024-04-07 11:29:35 +05:30
sabaimran
351fb31a34 Add webpage search to socket codepath, add a feature page for online search 2024-04-07 09:23:29 +05:30
Debanjum Singh Solanky
4be4c53222 Release Khoj version 1.9.0 2024-04-05 17:13:58 +05:30
sabaimran
2aedd3c819 Increase freq. of telemetry upload to every 5 minutes 2024-04-05 14:13:47 +05:30
sabaimran
3b1234d084 Await the calls to the db in the notion.py file 2024-04-05 13:58:14 +05:30
sabaimran
00a67e9524 Add additional log lines when configuring the Notion settings for a user in the callback 2024-04-05 13:19:24 +05:30
sabaimran
d23f7da8e3 Handle the case where a previous serach model isn't set when updating the model 2024-04-05 13:18:51 +05:30
sabaimran
f57f9f672d
Address Notion, Image tech debt in indexing code path (#687)
* Add support for using OAuth2.0 in the Notion integration
* Add notion to the admin page
* Remove unnecessary content_index and image search/setup references
* Trigger background job to start indexing Notion after user configures it
* Add a log line when a new Notion integration is setup
* Fix references to the configure_content methods
2024-04-05 12:10:03 +05:30
sabaimran
a60321b68e Push khoj to include inline references when possible 2024-04-04 10:31:13 +05:30
sabaimran
5bdcb4e69c Wait for location data to be returned before setting up the socket connection 2024-04-04 10:31:13 +05:30
Debanjum Singh Solanky
00f599ea78 Fix passing flags to re.split to break org, md content by heading level
`re.MULTILINE' should be passed to the `flags' argument, not the
`max_splits' argument of the `re.split' func

This was messing up the indexing by only allowing a maximum of
re.MULTILINE splits. Fixing this improves the search quality to
previous state
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
32ac0622ff Extract dates from compiled text entries 2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
29c1c18042 Increase search distance to get relevant content for chat post indexer update
More content indexed per entry would result in an overall scores
lowering effect. Increase default search distance threshold to counter that

- Details
  - Fix expected results post indexing updates
  - Fix search with max distance post indexing updates

- Minor
  - Remove openai chat actor test for after: operator as it's not expected anymore
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
ad4fa4b2f4 Fix adding file path instead of stem to markdown entries 2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
44b3247869 Update logical splitting of org-mode text into entries
- Major
  - Do not split org file, entry if it fits within the max token limits
    - Recurse down org file entries, one heading level at a time until
      reach leaf node or the current parent tree fits context window
    - Update `process_single_org_file' func logic to do this recursion

  - Convert extracted org nodes with children into entries
    - Previously org node to entry code just had to handle leaf entries
    - Now it recieve list of org node trees
    - Only add ancestor path to root org-node of each tree
    - Indent each entry trees headings by +1 level from base level (=2)

- Minor
  - Stop timing org-node parsing vs org-node to entry conversion
    Just time the wrapping function for org-mode entry extraction
    This standardizes what is being timed across at md, org etc.
  - Move try/catch to `extract_org_nodes' from `parse_single_org_file'
    func to standardize this also across md, org
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
eaa27ca841 Only add spaces after heading if any tags in orgnode raw entry repr 2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
2ea8a832a0 Log error when fail to index md file. Fix, improve typing in md_to_entries 2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
44eab74888 Dedupe code by using single func to process an org file into entries
Add type hints to orgnode and org-to-entries packages
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
db2581459f Parse markdown parent entries as single entry if fit within max tokens
These changes improve context available to the search model.
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with sparse, short heading/entry trees

Previously we'd always split markdown files by headings, even if a
parent entry was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to the search model to select appropriate entries for a query,
especially from short entry trees

Revert back to using regex to parse through markdown file instead of
using MarkdownHeaderTextSplitter. It was easier to implement the
logical split using regexes rather than bend MarkdowHeaderTextSplitter
to implement it.
- DFS traverse the markdown knowledge tree, prefix ancestry to each entry
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
982ac1859c Parse markdown file as single entry if it fits with max token limits
These changes improve entry context available to the search model
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with small files

Previously we split all markdown files by their headings,
even if the file was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to select the appropriate entries for a given query for the search model,
especially from short knowledge trees
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
d8f01876e5 Add parent heading ancestory to extracted markdown entries for context
Improve, update the markdown to entries extractor tests
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
86575b2946 Chunk text in preference order of para, sentence, word, character
- Previous simplistic chunking strategy of splitting text by space
  didn't capture notes with newlines, no spaces. For e.g in #620

- New strategy will try chunk the text at more natural points like
  paragraph, sentence, word first. If none of those work it'll split
  at character to fit within max token limit

- Drop long words while preserving original delimiters

Resolves #620
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
a627f56a64 Remove unused Entry to Jsonl converter from text to entry class, tests
This was earlier used when the index was plaintext jsonl file. Now
that documents are indexed in a DB this func is not required.

Simplify org,md,pdf,plaintext to entries tests by removing the entry
to jsonl conversion step
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
28105ee027 Create wrapper function to get entries from org, md, pdf & text files
- Convert extract_org_entries function to actually extract org entries
  Previously it was extracting intermediary org-node objects instead
  Now it extracts the org-node objects from files and converts them
  into entries
- Create separate, new function to extract_org_nodes from files
- Similarly create wrapper funcs for md, pdf, plaintext to entries

- Update org, md, pdf, plaintext to entries tests to use the new
  simplified wrapper function to extract org entries
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
f01a12b1d2 Improve styling of chat sessions side panel
- Move green server connected dot to the bottom. Show status when
  disconnected from server
- Move "New conversation" button to right of the "Conversation" title
- Center alignment of the new conversation and connection status buttons
2024-04-04 01:43:26 +05:30
sabaimran
dd1e5e145a Use List[Any] for typing 2024-04-03 21:46:41 +05:30
sabaimran
b8087c4c8e Add typing to empty list variables in github_to_entries 2024-04-03 21:41:36 +05:30
sabaimran
d036fdfc26 If tree is not in the contents, then just return empty files list 2024-04-03 17:55:25 +05:30
Debanjum Singh Solanky
f915b2bd14 Fix passing model_name param to chatml formatter for online chat 2024-04-03 17:21:43 +05:30
sabaimran
6aa88761b8 Skip creating the default agent if there's no default conversation config 2024-04-03 17:21:01 +05:30
sabaimran
b4f71e06b3 Add timeout after 10 minutes of inactivity on socket 2024-04-02 22:12:27 +05:30
sabaimran
f48426623d resolve merge conflict in chat.html 2024-04-02 17:29:48 +05:30
sabaimran
bf1187f465 Use new online/websearch logic and add agent to chat_metadata 2024-04-02 17:20:38 +05:30
sabaimran
867e1007d1 Remove superfluous newline 2024-04-02 17:20:08 +05:30
sabaimran
228ad68042 Merge with origin/master 2024-04-02 17:02:21 +05:30
sabaimran
776550d5ce Add a migration for updating the default chat model, update for existing users 2024-04-02 17:01:31 +05:30
sabaimran
47fc7e1ce6 Rebase with matser 2024-04-02 16:16:06 +05:30
Debanjum
215ab6e66a
Extract More Dates from entries to improve Date Filter (#683)
- Overview
  - Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year)
  - Extract some natural, partial dates as well from entries
- Capability
  Add ability to extract the following additional date forms:
  - Natural Dates: 21st April 2000, February 29 2024
  - Partial Natural Dates: March 24, Mar 2024
  - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20
  Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters
- Performance
  Using regexes is MUCH faster than using the `dateparser' python library
  It's a little crude but gives acceptable performance for large datasets
2024-04-02 16:14:53 +05:30
Debanjum Singh Solanky
7afee2d55c Let offline chat model set context window. Improve, fix prompts 2024-03-31 16:19:35 +05:30
Debanjum Singh Solanky
4228965c9b Handle msg truncation when question is larger than max prompt size
Notice and truncate the question it self at this point
2024-03-31 15:50:06 +05:30