- Avoid having to pass the khoj_sample.yml data file into pip, native apps
- Packaging data files into python packages is annoying.
- There's `MANIFEST.in`, `data_files` and `package_data` in setup.py
- Bdist, wheel and the generated source tarball each use a different subset
of these fields and put the data files in different locations
- Instead, just code the default config into a constant. This also
avoids pointless file reads
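A minimal sketch of the constant-based default described above. The keys in DEFAULT_CONFIG and the helper name get_default_config are illustrative assumptions, not the actual khoj schema.

```python
from copy import deepcopy

# Hypothetical default config. The keys are placeholders for illustration only.
DEFAULT_CONFIG = {
    "content-type": {"org": None, "markdown": None},
    "processor": {"conversation": None},
}


def get_default_config() -> dict:
    """Return a fresh copy so callers cannot mutate the shared constant."""
    return deepcopy(DEFAULT_CONFIG)
```

Returning a deep copy keeps the module-level constant unchanged when a caller edits the config it received.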
- Assume path is absolute in yaml util module while saving, loading file
- This follows the same convention as jsonl, which just operates on
the passed file path, assuming it is of the appropriate form.
The responsibility to put it in the appropriate form is on the caller, for now
- Include khoj_sample.yml in pip package to load default config from
- Create khoj config directory if it doesn't exist
- Load config from khoj_sample.yml if khoj.yml config doesn't exist
- Track current (saved/loaded) config separate from the new config (to
be written) when user clicks Start
- Fallback to using default config when no config for the specific
content type or processor is specified in khoj.yml
- Earlier, the default config was only loaded on the first run, not afterwards
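The fallback above could look roughly like this. The function name resolve_config and the nested dict layout are assumptions made for this sketch.

```python
from typing import Any, Optional


def resolve_config(user_config: dict, default_config: dict,
                   section: str, name: str) -> Optional[Any]:
    """Return the user's setting for section/name, else the default one."""
    # A section may be missing or explicitly set to None in either config
    user_section = user_config.get(section) or {}
    value = user_section.get(name)
    if value is None:
        default_section = default_config.get(section) or {}
        return default_section.get(name)
    return value
```

Because the lookup happens on every call, the default applies on any run where khoj.yml lacks the entry, not only on the first run.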
- Create Child CheckBox, LineEdit classes for Processor Widgets
- Create ProcessorType, similar to SearchType
- Track ProcessorType the widgets are associated with
- Simplify update, save, load of config based on type
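A rough sketch of tracking a ProcessorType on each widget. The enum members, the ProcessorLineEdit stand-in (a plain class, not a real GUI widget) and update_processor_config are all hypothetical names for illustration.

```python
from enum import Enum
from typing import Iterable


class SearchType(str, Enum):
    # Illustrative members only
    Org = "org"
    Markdown = "markdown"


class ProcessorType(str, Enum):
    # Illustrative member only
    Conversation = "conversation"


class ProcessorLineEdit:
    """Stand-in for a GUI line edit that remembers its processor type."""

    def __init__(self, processor_type: ProcessorType, text: str = ""):
        self.processor_type = processor_type
        self.text = text


def update_processor_config(config: dict, widgets: Iterable[ProcessorLineEdit]) -> dict:
    """Write each widget's text under the key named by its processor type."""
    processors = config.setdefault("processor", {})
    for widget in widgets:
        processors[widget.processor_type.value] = widget.text
    return config
```

Keying the update on the widget's own type removes the need for per-widget branches when saving or loading.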
- Make config_file an optional arg. It defaults to default khoj config dir
- Return args.config as None if no config_file explicitly passed by user
- Parent can use args.config = None as signal to trigger first run experience
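A sketch of the optional config_file argument using argparse. The flag names `--config-file`/`-c` and the load_config stand-in are assumptions for this example. Real config parsing is left out.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def load_config(config_file: Path) -> dict:
    # Hypothetical stand-in for parsing the YAML config file
    return {"config-file": str(config_file)}


def cli(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Sketch of an optional config file argument")
    parser.add_argument("--config-file", "-c", type=Path, default=None,
                        help="Path to the config file. Optional.")
    args = parser.parse_args(argv)
    # args.config stays None when the user passed no config file,
    # which the caller can treat as the first run signal
    args.config = None if args.config_file is None else load_config(args.config_file)
    return args
```

A caller can then branch on `cli().config is None` to start the first run experience.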
- Main.py was becoming too big to manage. It had both
controllers/routers and component configurations (search, processors)
in it
- Now that the native app GUI code is also getting added to the main
path, good time to split/modularize/clean main.py
- Put global state into a separate file to share across modules
- Test invalid config file path throws. Remove redundant cli test
- Simplify cli parser code
- No need to explicitly check whether args.config_file is set.
argparse checks for required positional arguments automatically
- Use standard semantics for cli args
- All positional args are required. Non positional args are optional
- Improve command line --help description
- Add custom validator to throw if neither input_filter nor
input_<files|directories> are specified
- Set field expecting paths to type Path
- Now that default_config isn't used in code, we can update
fields in rawconfig to specify whether they're required or not.
This lets pydantic validate the config file and throw an appropriate error
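The validation idea above, sketched with a standard library dataclass standing in for the pydantic model so the example has no dependencies. The class name TextContentConfig and its field names are assumptions. The pydantic version would use its own validator mechanism instead of `__post_init__`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class TextContentConfig:
    # Required fields: no default, so omitting them raises a TypeError
    compressed_jsonl: Path
    embeddings_file: Path
    # Optional fields: at least one of the two must be set
    input_files: Optional[List[Path]] = None
    input_filter: Optional[str] = None

    def __post_init__(self):
        # Coerce path-like fields to Path
        self.compressed_jsonl = Path(self.compressed_jsonl)
        self.embeddings_file = Path(self.embeddings_file)
        if self.input_files is not None:
            self.input_files = [Path(p) for p in self.input_files]
        # Custom validation: throw if neither input source is specified
        if not self.input_files and not self.input_filter:
            raise ValueError("Set at least one of input_files or input_filter")
```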
- Reason
- Simplifies code. No merge_dict required
- 1 place for user to see all configurables, defaults and required values
- Details
- Remove default_config from code. Set defaults in khoj_sample.yml itself
- Keep fields that the user must set empty in khoj_sample.yml
- Set defaults for fields not requiring configuration by user
- Setting up default compressed-jsonl, embeddings-file was only required
for org search_type, while org-files and org-filter were allowed to be
passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
command line argument as well for org search type
- Now that all search types are only configurable via the config file, we
can default all search types to None. The default config for the
rest of the search types wasn't being used anyway
- Previously org-files were configurable via cmdline args,
whereas none of the other search types were
- This is an artifact of how the application grew
- It can be removed for better consistency and
equal preference given to all search types
- Reason:
Allow natural search on markdown based notes, documentation,
websites etc
- Details:
- Create markdown processor to extract Markdown entries (identified by
Heading) into standard jsonl format required by text_search
- Update API, Configs to support interfacing with new markdown type
- Update Emacs, Web clients to support interfacing with new markdown
type via API
- Update Readme to mention that markdown is also supported
Closes #35
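A minimal sketch of splitting Markdown into heading-delimited entries and writing them as jsonl. The function names and the "entry" key are assumptions, not the processor's actual output schema. This naive regex also treats a `#` line inside a fenced code block as a heading.

```python
import json
import re
from typing import List

# A heading is a line starting with 1 to 6 '#' characters followed by whitespace
HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)


def extract_markdown_entries(text: str) -> List[str]:
    """Split text into entries, each starting at a heading line."""
    starts = [match.start() for match in HEADING.finditer(text)]
    entries = []
    for index, start in enumerate(starts):
        end = starts[index + 1] if index + 1 < len(starts) else len(text)
        entry = text[start:end].strip()
        if entry:
            entries.append(entry)
    return entries


def entries_to_jsonl(entries: List[str]) -> str:
    """Serialize entries as one JSON object per line."""
    return "".join(json.dumps({"entry": entry}, ensure_ascii=False) + "\n"
                   for entry in entries)
```

Text before the first heading is dropped in this sketch, since it does not belong to any heading-identified entry.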
- The code for both the text search types was mostly the same.
It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
text_search type
- This simplifies the app and makes it easier to process other
text types
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types
Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
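A sketch of what a shared load_jsonl helper might look like. Gzip handling keyed on a `.gz` suffix is an assumption made here because the config mentions compressed jsonl. The actual helper may differ.

```python
import gzip
import json
from pathlib import Path
from typing import List


def load_jsonl(input_path: Path) -> List[dict]:
    """Load a jsonl file, one JSON object per non-empty line."""
    # Assumption: compressed files are gzip and end in .gz
    open_fn = gzip.open if input_path.suffix == ".gz" else open
    with open_fn(input_path, "rt", encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file if line.strip()]
```

The helper operates on the path as passed. Putting it in the appropriate form stays the caller's responsibility.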
- The all-MiniLM-L6-v2 model is more accurate
- The exact previous model isn't benchmarked, but based on the
performance of the closest model to it, the new model
may be similar in speed and size
- On very preliminary evaluation of the model, the new model seems
faster, with pretty decent results
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
- It doubles the encoding speed of all entries (encoding time down from ~8 min to 4 min)
- It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
The conversation log structure now has session info too, instead of just chat info.
Session info will allow loading past conversation summaries as context for the AI in new conversations
{
"session": [
{
"summary": <chat_session_summary>,
"session-start": <session_start_index_in_chat_log>,
"session-end": <session_end_index_in_chat_log>
}],
"chat": [
{
"intent": <intent-object>,
"trigger-emotion": <emotion-triggered-by-message>,
"by": <AI|Human>,
"message": <chat_message>,
"created": <message_created_date>
}]
}
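A small sketch of building a log in the structure above. The helper names, the ISO 8601 date format and treating session-end as an inclusive index are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


def new_conversation_log() -> dict:
    return {"session": [], "chat": []}


def add_message(log: dict, message: str, by: str, intent: Optional[Any] = None,
                trigger_emotion: Optional[str] = None,
                created: Optional[str] = None) -> int:
    """Append a chat message and return its index in the chat log."""
    if by not in ("AI", "Human"):
        raise ValueError("by must be 'AI' or 'Human'")
    log["chat"].append({
        "intent": intent,
        "trigger-emotion": trigger_emotion,
        "by": by,
        "message": message,
        # Assumption: store the creation date as an ISO 8601 string
        "created": created or datetime.now(timezone.utc).isoformat(),
    })
    return len(log["chat"]) - 1


def close_session(log: dict, summary: str, start: int, end: int) -> None:
    """Record a session summary covering chat indices start to end."""
    log["session"].append({
        "summary": summary,
        "session-start": start,
        "session-end": end,
    })
```

A new conversation could then read the "summary" of earlier sessions from `log["session"]` to use as context.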