The previous cross-encoder model was a few years old. Newer models should
have improved in quality. Model size increases by 50% compared to the
previous model for better performance, at least on benchmarks.
Most newer, better embeddings models add a query, docs prefix when
encoding. Previously Khoj admins couldn't configure these, so it
wasn't possible to use these newer models.
This change allows configuring the kwargs passed to the query, docs
encoders by updating the search config in the database.
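
A minimal sketch of how per-encoder kwargs could carry the query, docs prefixes. The `SearchModelConfig` class, its field names and the `fake_encode` stand-in are hypothetical, not Khoj's actual schema or encoder. The `prompt` keyword and the "query: ", "passage: " prefixes are assumptions used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SearchModelConfig:
    """Hypothetical search config holding kwargs for each encoder."""

    query_encode_kwargs: dict = field(default_factory=dict)
    docs_encode_kwargs: dict = field(default_factory=dict)


def fake_encode(texts, prompt="", **_ignored):
    """Stand-in for an embedding model's encode call.

    It only returns the text the model would see, so the effect of the
    configured prefix is visible without loading a real model.
    """
    return [f"{prompt}{text}" for text in texts]


# Assumed prefixes, chosen for illustration only
config = SearchModelConfig(
    query_encode_kwargs={"prompt": "query: "},
    docs_encode_kwargs={"prompt": "passage: "},
)

queries = fake_encode(["how to bake bread"], **config.query_encode_kwargs)
docs = fake_encode(["Bread needs flour."], **config.docs_encode_kwargs)
```

Keeping the kwargs as plain dicts means a new model with different prefix conventions only needs a config update, not a code change.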
Improve tool, online search, webpage links, docs search chat actor
prompts. Ensure works with hermes-2-pro and llama-3.
Be more specific about generating JSON and not saying anything else.
- Improve extract question prompts to explicitly request JSON list
- Use llama-3 chat format if HF repo_id mentions llama-3. The
llama-cpp-python logic for detecting when to use llama-3 chat format
isn't robust enough currently
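
The repo_id check described above could look roughly like the sketch below. The function name `infer_chat_format` and the example repo ids are hypothetical. Returning `None` is assumed to mean "leave chat format detection to the library".

```python
from typing import Optional


def infer_chat_format(repo_id: str) -> Optional[str]:
    """Pick the llama-3 chat format when the HF repo id mentions llama-3.

    Returns None otherwise, so the caller can fall back to the default
    chat format detection.
    """
    return "llama-3" if "llama-3" in repo_id.lower() else None


# Hypothetical repo ids, for illustration only
llama3_format = infer_chat_format("example-org/Meta-Llama-3-8B-Instruct-GGUF")
other_format = infer_chat_format("example-org/Hermes-2-Pro-Mistral-7B-GGUF")
```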
* Changed the styling of the link that takes a user to the settings page into a button
* Added an indicator that shows if a user is connected to the server or not
* Made a class name more descriptive and also made the text in first run message more intuitive
* Changed the command to install dependencies in the README.md
* Changed the class name of the first run message text to be more descriptive
* Added icons in the desktop UI that show if a file is synced successfully or not
* Made the link class name in the homepage more descriptive
* Fixed the hover issue on status box in the chat header pane
* Fixed hovering issue on status box on macOS
- User configured max tokens limits weren't being passed to
`send_message_to_model_wrapper'
- One of the load offline model code paths wasn't reachable. Remove it
to simplify code
- When max prompt size isn't set infer max tokens based on free VRAM
on machine
- Use min of app configured max tokens, vram based max tokens and
model context window
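
A rough sketch of the max tokens selection described above. The function and parameter names are hypothetical. How the VRAM based limit is estimated is not shown here, it is simply taken as an input.

```python
from typing import Optional


def resolve_max_tokens(
    model_context_window: int,
    configured_max_tokens: Optional[int] = None,
    vram_based_max_tokens: Optional[int] = None,
) -> int:
    """Use the smallest of the limits that are actually available."""
    limits = [model_context_window]
    if configured_max_tokens:
        limits.append(configured_max_tokens)
    if vram_based_max_tokens:
        limits.append(vram_based_max_tokens)
    return min(limits)


all_limits_set = resolve_max_tokens(8192, configured_max_tokens=4000, vram_based_max_tokens=6000)
vram_only = resolve_max_tokens(8192, vram_based_max_tokens=6000)
context_only = resolve_max_tokens(8192)
```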
To access the Khoj admin panel from a non HTTPS custom domain the
`KHOJ_NO_SSL' and `KHOJ_DOMAIN' env vars need to be explicitly set.
See the updated setup docs for details.
Resolves #662
### Store Generated Images as WebP
- 78bac4ae Add migration script to convert PNG to WebP references in database
- c6e84436 Update clients to support rendering webp images inline
- d21f22ff Store Khoj generated images as webp instead of png for faster loading
### Lazy Fetch Chat Messages to Improve Time, Data to First Render
This is especially helpful for long conversations with lots of images
- 128829c4 Render latest msgs on chat session load. Fetch, render rest as they near viewport
- 9e558577 Support getting latest N chat messages via chat history API
### Intelligently set Context Window of Offline Chat to Improve Performance
- 4977b551 Use offline chat prompt config to set context window of loaded chat model
### Fixes
- 148923c1 Fix to raise error on hitting rate limit during Github indexing
- b8bc6bee Always remove loading animation on Desktop app if can't login to server
- 38250705 Fix `get_user_photo` to only return photo, not user name from DB
### Miscellaneous Improvements
- 689202e0 Update recommended CMAKE flag to enable using CUDA on linux in Docs
- b820daf3 Makes logs less noisy
- Reduces time to first render when loading long chat sessions
- Limits size of first page load, when loading long chat sessions
These performance improvements are maximally felt for large chat
sessions with lots of images generated by Khoj
Updated web and desktop app to support these changes for now
Previously you couldn't configure the n_ctx of the loaded offline chat
model. This made it hard to use good offline chat models (which these
days also have larger context windows) on machines with lower VRAM
- Show telemetry enabled/disabled state on init, not every 2 minutes
- Convert no docs synced logs to debug level instead of warning
Having synced docs isn't as important to use Khoj now, unlike before
- Magika on Desktop app was too bloated (100MB to 250MB) and broke
install. Not sure why it was causing the app install to fail, but
do not have time to investigate currently
- Just use a file extension whitelist. It's good enough for now. Let
server handle the deeper identification of file type
### Index more text file types
- Index all text, code files in Github repos. Not just md, org files
- Send more text file types from Desktop app and improve indexing them
- Identify file type by content & allow server to index all text files
### Deprecate Github Indexing Features
- Stop indexing commits, issues and issue comments in a Github repo
- Skip indexing Github repo on hitting Github API rate limit
### Fixes and Improvements
- **Fix indexing files in sub-folders from Desktop app**
- Standardize structure of text to entries to match other entry processors
- Show internet search, webpage read, image query, image generation steps
- Standardize, improve rendering of the intermediate steps on the web app
Benefits:
1. Improved transparency, allow users to see what Khoj is doing behind
the scenes and modify their query patterns to improve response quality
2. Reduced websocket connection keep alive timeouts for long running steps
- `file-type' doesn't handle mis-labelled files or files without
extensions well
- Only show supported file types in file selector dialog on Desktop app
Use Magika to get list of text file extensions. Combine with other
supported extensions to get complete list of supported file extensions.
Use it to limit selectable files in the File Open dialog.
Note: Folder selector will index text files with no extensions as well
* Don't trigger any re-indexing on server initialization
* Integrate Resend to send welcome emails when a new user signs up
- Only send if this is the first time they've signed in
- Configure welcome email with basic styling, as more complex designs and the style tag did not work
### Enable copying chat messages. Improve copy button behavior and styling
- Add button to copy chat messages on Desktop, Web apps
- Improve copy button's icon, hover color & click animation in Desktop, Web apps
### Improve Navigation, Chat Session Panes on Desktop, Web apps
- Dynamically generate navigation menu based on user info from server
- Create API endpoint to get authenticated user information
- Collapse navigation tabs into icons on mobile. Add spacing to them
- Add Chat navigation tab back to top pane on Web app
- Use proper icons for Search, Chat and Agents tab on navigation pane
### Miscellaneous Improvements
- Make current chat expand to full width when session panel collapsed on Desktop App
- Add chat session loading spinner to Desktop App (same as Web app)
### Fixes
- Show title bar in Khoj desktop app on Windows to simplify close, minimize etc.
- Only render first run setup message once if error or server not running
- Fix showing Search navigation tab from Agent pages on web client
The username and location in system prompt should disambiguate user
context from user's actual message for the chat model.
It doesn't need to be told to not mention the context or acknowledge
the context instructions in its response, as it understands that this
information is just context and not part of the user's actual message.
- Move new conversation button to right of "Conversation" title
- Reduce size of chat message loading ellipsis animation
- Add loading animation for chat session
The `has_documents' flag wasn't being passed. So the search tab was
always showing up as empty instead of being dynamically enabled if
documents had been indexed.
- `fs.readdir' func in node version 18.18.2 has buggy `recursive' option
See nodejs/node#48640, effect-ts/effect#1801 for details
- We were recursing down a folder in two ways on the Desktop app.
Remove `recursive: true' option to the `fs.readdirSync' method call
to recurse down via app code only
Add process_single_plaintext_file func etc with similar signatures as
org_to_entries and markdown_to_entries processors
The standardization makes modifications, abstractions easier to create
Sleeping until the rate limit has passed is too expensive, as it keeps
an app worker occupied.
Ideally we should schedule a job to continue after the rate limit wait
time has passed. But this can only be added once we support job scheduling.
Normal indexing quickly hits Github rate limits. Purpose of exposing
Github indexer is for indexing content like notes, code and other
knowledge base in a repo.
The current indexer doesn't scale to index metadata given Github's
rate limits, so remove it instead of giving a degraded experience of
partially indexed repos
- Allow syncing more file types from desktop app to index on server
- Use `file-type' package to identify valid text file types on Desktop app
- Split plaintext entries into smaller logical units than a whole file
Since the text splitting upgrades in #645, compiled chunks have more
logical splits like paragraph, sentence.
Show those (potentially) smaller snippets to the user as references
- Tangential Fix:
Initialize unbound currentTime variable for error log timestamp
- Use Magika's AI for a tiny, portable and better file type
identification system
- Existing file type identification tools like `file' and `magic'
require system level packages, that may not be installed by default
on all operating systems (e.g `file' command on Windows)
## Major
- Parse markdown, org parent entries as single entry if they fit within max tokens
- Parse a file as single entry if it fits within max token limits
- Add parent heading ancestry to extracted markdown entries for context
- Chunk text in preference order of para, sentence, word, character
## Minor
- Create wrapper function to get entries from org, md, pdf & text files
- Remove unused Entry to Jsonl converter from text to entry class, tests
- Dedupe code by using single func to process an org file into entries
Resolves #620
* Add support for using OAuth2.0 in the Notion integration
* Add notion to the admin page
* Remove unnecessary content_index and image search/setup references
* Trigger background job to start indexing Notion after user configures it
* Add a log line when a new Notion integration is setup
* Fix references to the configure_content methods
`re.MULTILINE' should be passed to the `flags' argument, not the
`max_splits' argument of the `re.split' func
This was messing up the indexing by only allowing a maximum of
re.MULTILINE splits. Fixing this restores the search quality to its
previous state
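
The effect of the misplaced argument can be reproduced with a short sketch. The heading pattern and sample text below are illustrative, not the ones used in the indexer. `re.MULTILINE` has the integer value 8, so passing it as the max splits argument both caps the number of splits at 8 and leaves the multiline flag unset.

```python
import re

text = "\n".join(f"# Heading {i}" for i in range(12))

# Cap effect: the text has 11 newlines, but max splits is re.MULTILINE (= 8)
capped = re.split(r"\n", text, maxsplit=int(re.MULTILINE))

# Flag effect: without the MULTILINE flag, `^` only matches at the
# very start of the text, so only the first heading is split off
buggy = re.split(r"^# ", text, maxsplit=int(re.MULTILINE))

# Fixed call: pass re.MULTILINE via the flags argument
fixed = re.split(r"^# ", text, flags=re.MULTILINE)

print(int(re.MULTILINE), len(capped), len(buggy), len(fixed))
```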
More content indexed per entry has an overall score-lowering effect.
Increase the default search distance threshold to counter that
- Details
- Fix expected results post indexing updates
- Fix search with max distance post indexing updates
- Minor
- Remove openai chat actor test for after: operator as it's not expected anymore
- Major
- Do not split org file, entry if it fits within the max token limits
- Recurse down org file entries, one heading level at a time, until
reaching a leaf node or the current parent tree fits the context window
- Update `process_single_org_file' func logic to do this recursion
- Convert extracted org nodes with children into entries
- Previously org node to entry code just had to handle leaf entries
- Now it receives a list of org node trees
- Only add ancestor path to root org-node of each tree
- Indent each entry tree's headings by +1 level from base level (=2)
- Minor
- Stop timing org-node parsing vs org-node to entry conversion
Just time the wrapping function for org-mode entry extraction
This standardizes what is being timed across md, org etc.
- Move try/catch to `extract_org_nodes' from `parse_single_org_file'
func to standardize this also across md, org
These changes improve context available to the search model.
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with sparse, short heading/entry trees
Previously we'd always split markdown files by headings, even if a
parent entry was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to the search model to select appropriate entries for a query,
especially from short entry trees
Revert back to using regex to parse through markdown file instead of
using MarkdownHeaderTextSplitter. It was easier to implement the
logical split using regexes rather than bend MarkdownHeaderTextSplitter
to implement it.
- DFS traverse the markdown knowledge tree, prefix ancestry to each entry
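
A minimal sketch of the regex based, depth first split described above, not the actual implementation. The function name, the whitespace based token count and the returned `(ancestry, text)` tuples are assumptions for illustration. A section is kept whole if it fits within the limit, otherwise it is split one heading level deeper.

```python
import re


def split_markdown(text, max_tokens, level=1, ancestry=()):
    """Return a list of (heading ancestry, entry text) tuples.

    Tokens are approximated by whitespace separated words. Sections that
    still exceed the limit below heading level 6 are returned as is.
    """
    if not text.strip():
        return []
    if len(text.split()) <= max_tokens or level > 6:
        return [(ancestry, text)]

    # Zero-width split just before each heading of the current level
    sections = re.split(rf"^(?=#{{{level}}} )", text, flags=re.MULTILINE)

    entries = []
    for section in sections:
        heading = re.match(rf"#{{{level}}} (.*)", section)
        path = ancestry + (heading.group(1).strip(),) if heading else ancestry
        entries.extend(split_markdown(section, max_tokens, level + 1, path))
    return entries


doc = "# A\nintro\n## B\nbody b words\n## C\nbody c words\n"
whole = split_markdown(doc, max_tokens=100)
parts = split_markdown(doc, max_tokens=6)
```

With the larger limit the file stays a single entry. With the smaller limit it is split into the intro under A, then B and C, each carrying its heading path.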
These changes improve entry context available to the search model
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with small files
Previously we split all markdown files by their headings,
even if the file was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to the search model to select the appropriate entries for a given query,
especially from short knowledge trees
- Previous simplistic chunking strategy of splitting text by space
didn't capture notes with newlines but no spaces. For example, see #620
- New strategy will try to chunk the text at more natural points like
paragraph, sentence, word first. If none of those work it'll split
at character to fit within max token limit
- Drop long words while preserving original delimiters
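
The preference order above can be sketched as a small recursive splitter. This is an illustration under assumptions, not the splitter used in the change: the limit is counted in characters rather than tokens, the separators are a guess at "paragraph, sentence, word", and long words are split at character level rather than dropped.

```python
def chunk_text(text, max_len, separators=("\n\n", ". ", " ", "")):
    """Split text into chunks of at most max_len characters.

    Tries paragraph, then sentence, then word boundaries, and falls back
    to splitting at character level. Delimiters stay attached to the
    preceding piece, so joining the chunks gives back the original text.
    """
    if len(text) <= max_len:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]

    parts = text.split(sep)
    if len(parts) == 1:
        return chunk_text(text, max_len, rest)

    chunks, current = [], ""
    for i, part in enumerate(parts):
        piece = part + (sep if i < len(parts) - 1 else "")
        if len(current) + len(piece) <= max_len:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators
            chunks.extend(chunk_text(piece, max_len, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "First para. Two sentences here.\n\nSecond para is short.\n\n" + "x" * 50
chunks = chunk_text(sample, max_len=30)
```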
Resolves #620
This was earlier used when the index was plaintext jsonl file. Now
that documents are indexed in a DB this func is not required.
Simplify org,md,pdf,plaintext to entries tests by removing the entry
to jsonl conversion step
- Convert extract_org_entries function to actually extract org entries
Previously it was extracting intermediary org-node objects instead
Now it extracts the org-node objects from files and converts them
into entries
- Create separate, new function to extract_org_nodes from files
- Similarly create wrapper funcs for md, pdf, plaintext to entries
- Update org, md, pdf, plaintext to entries tests to use the new
simplified wrapper function to extract org entries
- Move green server connected dot to the bottom. Show status when
disconnected from server
- Move "New conversation" button to right of the "Conversation" title
- Center alignment of the new conversation and connection status buttons
- Overview
- Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year)
- Extract some natural, partial dates as well from entries
- Capability
Add ability to extract the following additional date forms:
- Natural Dates: 21st April 2000, February 29 2024
- Partial Natural Dates: March 24, Mar 2024
- Structured Dates: 20/12/24, 20.12.2024, 2024/12/20
Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters
- Performance
Using regexes is MUCH faster than using the `dateparser' python library
It's a little crude but gives acceptable performance for large datasets
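
A crude sketch of regex based extraction for the structured date forms listed above. The patterns and the `extract_dates` name are illustrative, not the ones added in this change. It assumes day-first ordering for dates like 20/12/24 and maps 2-digit years to the 2000s. Natural and partial dates are left out for brevity.

```python
import re
from datetime import datetime

# Assumed forms: DD/MM/YY, DD.MM.YYYY and YYYY/MM/DD with dot or slash separators
DAY_FIRST = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4}|\d{2})\b")
YEAR_FIRST = re.compile(r"\b(\d{4})[./](\d{1,2})[./](\d{1,2})\b")


def extract_dates(text):
    """Return datetimes for structured dates found in text, in order of appearance."""
    found = []
    for match in YEAR_FIRST.finditer(text):
        year, month, day = map(int, match.groups())
        found.append((match.start(), year, month, day))
    for match in DAY_FIRST.finditer(text):
        day, month, year = map(int, match.groups())
        if year < 100:
            year += 2000  # assumption: 2-digit years are in the 2000s
        found.append((match.start(), year, month, day))

    dates = []
    for _, year, month, day in sorted(found):
        try:
            dates.append(datetime(year, month, day))
        except ValueError:
            continue  # skip impossible dates like 99/99/2024
    return dates


dates = extract_dates("Met on 20/12/24, again on 20.12.2024 and 2024/12/20")
```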