- Main.py was becoming too big to manage. It had both
controllers/routers and component configurations (search, processors)
in it
- Now that the native app GUI code is also getting added to the main
path, good time to split/modularize/clean main.py
- Put global state into a separate file to share across modules
- Test invalid config file path throws. Remove redundant cli test
- Simplify cli parser code
- Do not need to explicitly check if args.config_file set.
argparser checks for positional arguments automatically
- Use standard semantics for cli args
- All positional args are required. Non positional args are optional
- Improve command line --help description
- Add custom validator to throw if neither input_filter or
input_<files|directories> are specified
- Set field expecting paths to type Path
- Now that default_config isn't used in code. We can update
fields in rawconfig to specify whether they're required or not.
This lets pydantic validate config file and throw appropriate error
- Reason
- Simplifies code. No merge_dict required
- 1 place for user to see all configurables, defaults and required values
- Details
- Remove default_config from code. Set defaults in khoj_sample.yml itself
- Keep fields required to be set by user as empty in khoj_sample to YAML
- Set defaults for fields not requiring configuration by user
- Setting up default compressed-jsonl, embeddings-file was only required
for org search_type, while org-files and org-filter were allowed to be
passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
command line argument as well for org search type
- Now that all search types are only configurable via config file, We
can default all search types to None. The default config for the
rest of the search types wasn't being used anyway
- Previously org-files were configurable via cmdline args.
Where as none of the other search types are
- This is an artifact of how the application grew
- It can be removed for better consistency and
equal preference given all search types
- Reason:
Allow natural search on markdown based notes, documentation,
websites etc
- Details:
- Create markdown processor to extract Markdown entries (identified by
Heading) into standard jsonl format required by text_search
- Update API, Configs to support interfacing with new markdown type
- Update Emacs, Web clients to support interfacing with new markdown
type via API
- Update Readme to mentiond markdown is also supported
Closes#35
- The code for both the text search types were mostly the same
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and making it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
- The all-MiniLM-L6-v2 is more accurate
- The exact previous model isn't benchmarked but based on the
performance of the closest model to it. Seems like the new model
maybe similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (down from ~8min to 4mins)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
Conversation logs structure now has session info too instead of just chat info
Session info will allow loading past conversation summaries as context for AI in new conversations
{
"session": [
{
"summary": <chat_session_summary>,
"session-start": <session_start_index_in_chat_log>,
"session-end": <session_end_index_in_chat_log>
}],
"chat": [
{
"intent": <intent-object>
"trigger-emotion": <emotion-triggered-by-message>
"by": <AI|Human>
"message": <chat_message>
"created": <message_created_date>
}]
}
Details
- Rename method query_* to query in search_types for standardization
- Wrapping Config code in classes simplified mocking test config
- Reduce args beings passed to a function by passing it as single
argument wrapped in a class
- Minimize setup in main.py:__main__. Put most of it into functions
These functions can be mocked if required in tests later too
Setup Flow:
CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel
- Wrap Image, Music, Ledger search into the type of SearchModel they use
Similar to what was done for notes model by wrapping it's config
into an AsymmetricSearchModel.
- Use the uber wrapper class to expose all type specific search models
- Wrap asymmetric search model parameters into AsymmetricSearchModel class
- Create wrapper for all search type models. Put notes search model into it
- Test notes search end-to-end from client API layer to results.
Use model build on test data
- Cleaner, more idiomatic usage of a global variable
- Simplifies mocking when testing client in pytest as setting wrapped
in object rather than a simple type. So passed around by reference