Ensure order of new embedding insertion on incremental update
does not affect the order and value of existing embeddings when
normalization is turned off
Asymmetric was older name used to differentiate between symmetric,
asymmetric search.
Now that text search just uses asymmetric search stick to simpler name
- Current incorrect behavior:
All entries with duplicate compiled form are kept on regenerate
but on update only the last of the duplicated entries is kept
This divergent behavior is not ideal to prevent index corruption
across reconfigure and update
* Update the /chat endpoint to conditionally support streaming
- If streams are enabled, return the threadgenerator as it does currently
- If stream is disabled, return a JSON response with the response/compiled references separated out
- Correspondingly, update the chat.html UI to use the streamed API, as well as Obsidian
- Rename chat/init/ to chat/history
* Update khoj.el to use the /history endpoint
- Update corresponding unit tests to use stream=true
* Remove & from call to /chat for obsidian
* Abstract functions out into a helpers.py file and clean up some of the error-catching
- Fix testing gpt converse method after it started streaming responses
- Pass stop in model_kwargs dictionary and api key in openai_api_key
parameter to chat completion methods. This should resolve the arg
warning thrown by OpenAI module
The previous json parsing was failing to handle questions with date
filters
Fix the chat actor tests to run without throwing error with freezegun
complaining about importing transformers.local_llama model
Remove quote escapes from date filter examples provided to
extract_questions actor
Khoj will soon get a generic text indexing content type. This along
with a file filter should suffice for searching through Ledger
transactions, if required.
Having a specific content type for niche use-case like ledger isn't
useful. Removing unused content types will reduce khoj code to manage.
Org-music was just a custom content type that worked with org-music.
It was mostly only useful for me.
Cleaning up that code will reduce number of content types for khoj to
manage.
- Khoj chat will now respond to general queries if:
1. no relevant reference notes available or
2. when explicitly induced by prefixing the chat message with "@general"
- Previously Khoj Chat would a lot of times refuse to respond to
general queries not answerable from reference notes or chat history
- Make chat quality tests more robust
- Add more equivalent chat response options refusing to answer
- Force haiku writing to not give any preable, just the haiku
Previously filename was appended to the end of the compiled entry.
This didn't provide appropriate structured context
Test filename getting prepended as heading to compiled entry
All compiled snippets split by max tokens (apart from first) do not
get the heading as context.
This limits search context required to retrieve these continuation
entries
- Explicity split entry string by space during split by max_tokens
- Prevent formatting of compiled entry from being lost
- The formatting itself contains useful information
No point in dropping the formatting unnecessarily,
even if (say) the currrent search models don't account for it (yet)
Append originating filename to compiled string of each entry for
better search quality by providing more context to model
Update markdown_to_jsonl tests to ensure filename being added
Resolves#142
- Use tiktoken to count tokens for chat models
- Make conversation turns to add to prompt configurable via method
argument to generate_chatml_messages_with_context method
- Remove the need to split by magic string in emacs and chat interfaces
- Move compiling references into string as context for GPT to GPT layer
- Update setup in tests to use new style of setting references
- Name first argument to converse as more appropriate "references"
Merge pull request #189 from debanjum/add-search-actor-to-improve-notes-lookup-for-chat
### Introduce Search Actor
Search actor infers Search Queries from user's message
- Capabilities
- Use previous messages to add context to current search queries[^1]
This improves quality of responses in multi-turn conversations.
- Deconstruct users message into multiple search queries to lookup notes[^2]
- Use relative date awareness to add date filters to search queries[^3]
- Chat Director now does the following:
1. [*NEW*] Use Search Actor to generate search queries from user's message
2. Retrieve relevant notes from Knowledge Base using the Search queries
3. Pass retrieved relevant notes to Chat Actor to respond to user
### Add Chat Quality Tests
- Test Search Actor capabilities
- Mark Chat Director Tests for Relative Date, Multiple Search Queries as Expected Pass
### Give More Search Results as Context to Chat Actor
- Loosen search results score threshold to work better for searches with date filters
- Pass more search results (up to 5 from 2) as context to Chat Actor to improve inference
[^1]: Multi-Turn Example
Q: "When did I go to Mars?"
Search: "When did I go to Mars?"
A: "You went to Mars in the future"
Q: "How was that experience?"
Search: "How my Mars experience?"
*This gives better context for the Chat actor to respond*
[^2]: Deconstruct Example:
Is Alpha older than Beta? => What is Alpha's age? & When was Beta born?
[^3]: Date Example:
Convert user messages containing relative dates like last month, yesterday to date filters on specific dates like dt>="2023-03-01"
Update Search Actor prompt with answers, more precise primer and
two more examples for context
Mark the 3 chat quality tests using answer as context to generate
queries as expected to pass. Verify that the 3 tests pass now, unlike
before when the Search Actor did not have the answers for context
- Remove stale message_to_prompt test
It is too broad, reduces maintainability.
Remove as it doesn't really need its own test right now
- Setting skipif at module level for chat actor, director tests
reduces code duplication as earlier was using decorator on each chat
test
Combine hand-written custom notes and PG essays with personal
content to bulk up notes count
Delete old documentation markdown as not a representative dataset for
application (which is more tuned for personal notes)
- Chat directors are broad agents.
- Chat directors orchestrate narrow actor agents to synthesize
final response for the user
- Agents are Prompts + ML Model
- Test Chat Director Capabilities
1. [X] Answer from retrieved notes
2. [X] Answer from chat history
3. [X] Answer general questions
4. [X] Carry out multi-turn conversation
5. [X] Say don't know when answer not in provided context
6. [X] Answers that require current date awareness
This test is expected to fail as the chat is not capable of doing
this without the Search actor. But the test allows assessing chat quality
7. [X] Date-aware aggregation across multiple different notes
This test is expected to fail as the chat is not capable of doing
this without the Search actor. But the test allows assessing chat quality
8. [X] Ask clarification questions if no unambiguous answer in provided context
9. [X] Retrieve answer from chat history beyond lookback window
This test is expected to fail as the chat director is not capable
of searching chat history yet. But the test allows assessing chat quality
10. [X] Retrieve context for answer using multiple independent
searches on knowledge base
This test is expected to fail as the chat is not capable of doing
this without the Search actor. But the test allows assessing chat quality
- Index markdown test data as knowledge base. As easier to get good
markdown content (vs org)
- Setup markdown_content_config, processor_config and chat_client to
test chat API
- Mark chat quality tests, register custom mark for chat quality
- Filter unhelpful deprecation warnings from within dateparser library
- Error if tests use unregistered marks
- Chat actors are narrow agents (prompt + ML model)
Chat actors are different from the Chat director. who orchestrates
the narrow actor agents to synthesize final response to the user
- Test Chat Actor Capabilities
1. Answer from retrieved notes
2. Answer from chat history
3. Answer general questions
4. Carry out multi-turn conversation
5. Say don't know when answer not in provided context
6. Answers that require current date awareness
7. Date-aware aggregation across multiple different notes
8. Ask clarification questions if no unambiguous answer in provided context
This test is expected to fail as the chat is not capable of doing
this consistently yet. But having the test allows assessing chat quality
- Use Openai API Key from OPENAI_API_KEY environment variable
- Gitignore .env file, python virtualenv directory
Put OpenAI API Key in .env file to run chatbot tests via vscode
The .env file is default location for importing env vars
Answer does not rely on past conversations, just the knowledge base.
It is meant for one off interactions, like search rather than a
continuing conversation like chat
For now it is only exposed via API. Later it will be expose in the
interfaces as well
Remove ability to select different chat types from the chat web
interface as there is only a single chat type
Stop appending answers to the conversation logs
- Text before headings was not being indexed due to buggy orgnode
parsing logic
- Resolved indexing intro text from files with and without headings in
them
- Ensure intro text node has heading set to all title lines collected
from the file
Resolves#165
- Test /config/types API when no plugin configured, only plugin configured
and no content configured scenarios
- Do not throw null reference exception while configuring search types
when no plugin configured
- Do not throw null reference exception on calling /config/types API
when no plugin configured
Resolves bug introduced by #173
- Previously was return all core content types even if they had not been
setup
- Add test to validate only configured content types are returned by
the api/config/types API endpoint
Configure app routes after configuring server.
Import API routers after search type is dynamically populated.
Allow API to recognize the dynamically populated plugin search types
as valid type query param.
Enable searching for plugin type content.
- What
- The Emacs and Obsidian interfaces stay in their original
directories under src/
- src/khoj now only contains code meant for pypi packaging
- Benefits
- This avoids having to update khoj MELPA, Obsidian plugin config as
the Emacs, Obsidian code is under their original directories
- It separates the code in src/khoj meant for python packaging from
code for external interfaces like Emacs and Obsidian
- Why
The khoj pypi packages should be installed in `khoj' directory.
Previously it was being installed into `src' directory, which is a
generic top level directory name that is discouraged from being used
- Changes
- move src/* to src/khoj/*
- update `setup.py' to `find_packages' in `src' instead of project root
- rename imports to form `from khoj.*' in complete project
- update `constants.web_directory' path to use `khoj' directory
- rename root logger to `khoj' in `main.py'
- fix image_search tests to use the newly rename `khoj' logger
- update config, docs, workflows to reference new path `src/khoj'
Previously no query syntax helpers, like the "file:" prefix, were used
before checking if query contains file path.
This made query to image search brittle to misinterpretation and
pointless checking
Add test to verify search by image at file works as expected
- Previously top level headings would have get stripped of the
space between heading text and the prefix # symbols. That is,
`# Top Level Heading' would get converted to `#Top Level Heading'
- This would mess up their rendering as a heading in search results
- Add unit tests to text_to_jsonl processors to prevent regression
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibilty
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
Long words (>500 characters) provide less useful context to models.
Dropping very long words allow models to create better embeddings by
passing more of the useful context from the entry to the model
- Remove property drawer from test entry for max_words splitting test
- Property drawer is not required for the test
- Keep minimal test case to reduce chance for confusion
- Issue
ML Models truncate entries exceeding some max token limit.
This lowers the quality of search results
- Fix
Split entries by max tokens before indexing.
This should improve searching for content in longer entries.
- Miscellaneous
- Test method to split entries by max tokens
- Reason
- All clients that currently consume the API are part of Khoj
- Any breaking API changes will be fixed in clients immediately
- So decoupling client from API is not required
- This removes the burden of maintaining muliple versions of the API
- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given it's usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries' class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries' objects
instead of as dictionaries
- `text_search' and `text_to_jsonl' methods are annotated with
type hints for the new `Entries' type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update' method into `TextToJsonl' base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl' classes
- Avoid circular reference issues on importing `Entries' class
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl; scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl; processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
- For queries with only filters in them short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
generated by a separate server/app instance