Previous logic was more brittle to break with simple unbalanced
'{' or '}' string present in the event data. This method of trying to
identify valid json obj was fairly brittle. It only allowed json
objects or processed event as raw strings.
Now we buffer chunk until we see our unicode magic delimiter and only
then process it.
This is much less likely to break based on event data and the
delimiter is more tunable if we want to reduce rendering breakage
likelihood further
- Deduplicate code to collect chat telemetry by relying on
end_llm_response event
- Log time to first token and total chat response time for latency
analysis of Khoj as an agent. Not just the latency of the LLM
- Remove duplicate timer in the image generation path
Do not need response generator to stuff compiled references in chat
stream using "### compiled references:" separator.
References are now sent to clients as structured json while streaming
## Overview
- Gemma 2 is a new open model family by Google. They've released a 9B, 29B param model. A 2B model is also expected.
- It performs really well on the Chatbot arena and shows good performance when testing within Khoj as well.
- Llama.cpp support for Gemma 2 architecture seems to have stabilized
- If Gemma 2 performs well in further testing, it can be made the default offline chat model for Khoj
- Once the 2B param model is released, the model size to download can be automatically chosen based on (V)RAM available
## Major
- Support Gemma 2 for Offline Chat
- Improve and fix chat model prompts for better, consistent context
## Minor
- Fix and improve offline chat actor, director tests
- Improve offline chat truncation to consider chat message delimiter tokens
Previously loading animation would be at top of message. Moving it to
bottom is more intuitve and easier to track.
Remove white-space: pre from list elements. It was adding too much y
axis padding to chat messages (and train of thought)
- Details
Only return notes refs, online refs, inferred queries and generated
response in non-streaming mode. Do not return train of throught and
other status messages
Incorporate missing logic from old chat API router into new one.
- Motivation
So we can halve chat API code by getting rid of the duplicate logic
for the websocket router
The deduplicated code:
- Avoids inadvertant logic drift between the 2 routers
- Improves dev velocity
- Overview
Use simpler HTTP Streaming Response to send status messages, alongside
response and references from server to clients via API.
Update web client to use the streamed response to show train of thought,
stream response and render references.
- Motivation
This should allow other Khoj clients to pass auth headers and recieve
Khoj's train of thought messages from server over simple HTTP
streaming API.
It'll also eventually deduplicate chat logic across /websocket and
/chat API endpoints and help maintainability and dev velocity
- Details
- Pass references as a separate streaming message type for simpler
parsing. Remove passing "### compiled references" altogether once
the original /api/chat API is deprecated/merged with the new one
and clients have been updated to consume the references using this
new mechanism
- Save message to conversation even if client disconnects. This is
done by not breaking out of the async iterator that is sending the
llm response. As the save conversation is called at the end of the
iteration
- Handle parsing chunked json responses as a valid json on client.
This requires additional logic on client side but makes the client
more robust to server chunking json response such that each chunk
isn't itself necessarily a valid json.
- Convert functions in SSE API path into async generators using yields
- Validate image generation, online, notes lookup and general paths of
chat request are handled fine by the web client and server API
- Add day of week to system prompt of openai, anthropic, offline chat models
- Pass more context to offline chat system prompt to
- ask follow-up questions
- know where to find information about khoj (itself)
- Fix output mode selection prompt. Log error if model does not select
valid option from list of valid output modes provided
- Use consistent names for question, answers passed to
extract_questions_offline prompt
- Log which model extracts question, what the offline chat model sees
as context. Similar to debug log shown for openai models
- Pass system message as the first user chat message as Gemma 2
doesn't support system messages
- Use gemma-2 chat format
- Pass chat model name to generic, extract questions chat actors
Used to figure out chat template to use for model
For generic chat actor argument was anyway available but not being
passed, which is confusing
- Deprecate khoj-assistant pypi package. Use more accurate and
succinct pypi project name, khoj
- Update references to sye khoj pypi package in docs and code instead
of the legacy khoj-assistant pypi package
- Update pypi workflow to publish to both khoj, khoj-assistant for now
- Update stale python 3.9 support mentioned in our pyproject. Can't
support python 3.9 as depend on latest django which support >=3.10
- Major
- Ask for prompt in prose
- Remove seed from SD3 image generation to improve diversity of output
for a given prompt
Otherwise for conversations with similar sounding
prompts, the images would be almost exactly the same. This maybe
another indicator of SD3's inability to capture detailed
instructions
- Consistently use "prompt" wording instead of "query" in improved
image generation prompts.
Previously a mix of those terms were being used, which could confuse
the chat model
- Minor
- Add day of week to prompt
- Remove 2-5 sentence limit on instructions to SD3. It seems to be
able to follow longer instructions just with less fidelity than
DALLE. And the 2-5 sentence instruction limit wasn't being adhered to
- Improve ability to edit, improve the image based on follow-up
instructions by the user
- Align prompts for DALLE and SD3. Only difference is to wrap text to
be rendered in quotes for SD3. This improves it's ability to render
requested text. DALLE cannot render text as well or consistently
- Because we're using a FastAPI api framework with a Django ORM, we're running into some interesting conditions around connection pooling and clean-up. We're ending up with a large pile-up of open, stale connections to the DB recurringly when the server has been running for a while. To mitigate this problem, given starlette and django run in different python threads, add a middleware that will go and call the connection clean up method in each of the threads.
- Issue
The Khoj docker build would fail with `ImportError: libGL.so.1: cannot open shared object file: No such file or directory`. This was required by the Khoj RapidOCR python package dependency.
- Fix
A minimal set of system packages have been added to resolve this issue.