Create Schema Migrator and Reindex to Apply Index Corruption Fixes
- 83e1088 Manage `khoj.yml` config migrations on app start. Version the `khoj.yml` schema
- 429e1b4 Regenerate index to apply corruption fixes on first run of this khoj version
Otherwise users would need to manually re-index their contents with khoj
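
A minimal sketch of how versioned config migrations could be chained on app start. The `version` key, the migration steps, and the renamed keys are all hypothetical and only illustrate the idea; they are not Khoj's actual schema or migrations.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]


def migrate_0_to_1(config: Config) -> Config:
    # Hypothetical step: rename the "content" key and stamp the new version
    migrated = dict(config)
    migrated["content-type"] = migrated.pop("content", {})
    migrated["version"] = 1
    return migrated


def migrate_1_to_2(config: Config) -> Config:
    # Hypothetical step: ensure a "search-type" section exists
    migrated = dict(config)
    migrated.setdefault("search-type", {})
    migrated["version"] = 2
    return migrated


# Ordered list of (source version, migration step) pairs
MIGRATIONS: List[Tuple[int, Callable[[Config], Config]]] = [
    (0, migrate_0_to_1),
    (1, migrate_1_to_2),
]


def migrate_config(config: Config) -> Tuple[Config, bool]:
    """Apply each pending migration in order. Unversioned configs count as version 0."""
    changed = False
    for from_version, step in MIGRATIONS:
        if config.get("version", 0) == from_version:
            config = step(config)
            changed = True
    return config, changed
```

In a setup like this, the `changed` flag could be what triggers a one-time index regeneration on the first run after an upgrade.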
## Stabilize and Simplify Content Indexing
### Major Updates
- 9bcca43 Unify logic to update entries when indexing from scratch or incrementally
- 89c7819 Unify logic to update embeddings when indexing from scratch or incrementally
- 6a0297c Stable sort new entries when marking entries for update
- 58d86d7 Unify logic to configure server from API or on server start
- Create tests to ensure old entries, embeddings in index are unaffected on adding new entries
- Refer: 1482fd4, 7669b85, 88d1a29
- ad41ef3 Make normalization of embeddings configurable to test this in c73feeb
### Minor Updates
- 1673bb5 Add todo state to compiled form of each entry
- 6e70b91 Remove unused `dump_jsonl` helper method
- 7ad9603 Improve naming of lock
- b02323a Improve naming of text search test methods
Resolves #190
Ensure order of new embedding insertion on incremental update
does not affect the order and value of existing embeddings when
normalization is turned off
Asymmetric was the older name, used to differentiate between symmetric and
asymmetric search.
Now that text search only uses asymmetric search, stick to the simpler name
The previous regenerate mechanism did not deduplicate entries with the same key,
so entries looked different between regenerate and update
Having a single func, mark_entries_for_update, handle both scenarios
avoids this divergence
Update all text_to_jsonl methods to use the above method for
generating index from scratch
- Current incorrect behavior:
All entries with duplicate compiled form are kept on regenerate
but on update only the last of the duplicated entries is kept
This divergence makes it harder to prevent index corruption
across reconfigure and update
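
The unified behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the actual `mark_entries_for_update` implementation: the function name, the dict-based entries, and the `compiled` key used for deduplication are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Entry = Dict[str, str]


def mark_entries_for_update_sketch(
    current: List[Entry], previous: Optional[List[Entry]] = None
) -> List[Tuple[Optional[int], Entry]]:
    """Return (previous_index_or_None, entry) pairs for both scratch and incremental runs."""
    previous = previous or []

    # Deduplicate current entries by compiled form. A later duplicate replaces
    # the earlier value, but the key keeps its first-seen position (stable order).
    current_by_key: Dict[str, Entry] = {}
    for entry in current:
        current_by_key[entry["compiled"]] = entry

    previous_keys = {entry["compiled"] for entry in previous}

    # Entries already in the index keep their previous position and index
    existing = [
        (index, entry)
        for index, entry in enumerate(previous)
        if entry["compiled"] in current_by_key
    ]
    # New entries are appended after existing ones, marked with no previous index
    new = [
        (None, entry)
        for key, entry in current_by_key.items()
        if key not in previous_keys
    ]
    return existing + new
```

Calling it with no previous entries corresponds to indexing from scratch, so both paths share the same deduplication and ordering rules.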
Reuse Search Models across Content Types to reduce Memory Consumption
- Memory consumption now only scales with search models used, not with content types.
Previously each content type had its own copy of the search ML models.
That'd result in 300+ MB per enabled text content type
- Split model state into 2 separate state objects, `search_models` and `content_index`.
This allows loading text_search and image_search models first
and then reusing them across all content_types in content_index
- The change should cut down memory utilization quite a bit for most users.
I see a >50% drop in memory utilization on my Khoj instance.
But this will vary for each user based on the amount of content indexed vs number of plugins enabled.
- This change does not solve the RAM utilization scaling with size of the index,
as the whole content index is still kept in RAM while Khoj is running
Should help with #195, #301 and #303
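
A rough sketch of the split into two state objects, assuming a hypothetical `FakeEncoder` class in place of the real ML models. The field names mirror `search_models` and `content_index` from the description, but the class layout is illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class FakeEncoder:
    """Hypothetical stand-in for a heavy ML model; counts how often it is loaded."""

    instances = 0

    def __init__(self, name: str):
        FakeEncoder.instances += 1
        self.name = name


@dataclass
class SearchModels:
    # Models are loaded once and shared by every content type
    text_search: Optional[FakeEncoder] = None
    image_search: Optional[FakeEncoder] = None


@dataclass
class ContentIndex:
    # Per content type state holds only the indexed entries, not a model copy
    entries_by_type: Dict[str, List[str]] = field(default_factory=dict)


def configure(content_types: List[str]):
    # Load the search models first
    search_models = SearchModels(text_search=FakeEncoder("text"))
    # Then build the index for each content type, reusing the shared models
    content_index = ContentIndex()
    for content_type in content_types:
        content_index.entries_by_type[content_type] = []
    return search_models, content_index
```

With this layout, enabling another text content type adds an index entry but no additional model instance.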
Wrap acquire/release of locks in try/except/finally when updating the content
index and search models, so a lock is always released on error
instead of causing a deadlock
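
A minimal sketch of the locking pattern, assuming a hypothetical `config_lock`, `state` dict, and `build_index` callable; none of these names are taken from the Khoj codebase.

```python
import threading
from typing import Callable

config_lock = threading.Lock()  # hypothetical lock guarding shared state
state = {"content_index": None}  # hypothetical shared state


def update_content_index(build_index: Callable[[], object]) -> bool:
    """Rebuild the index under the lock. Returns False if the rebuild failed."""
    config_lock.acquire()
    try:
        state["content_index"] = build_index()
    except Exception:
        # Keep the old index on failure; the finally block still runs
        return False
    finally:
        # Always release, so an error cannot leave the lock held
        config_lock.release()
    return True
```

Using `with config_lock:` would give the same release guarantee in less code; the explicit form is shown because it matches the acquire/release wording above.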
* Add additional telemetry in order to understand which data sources are the most useful
* Make actions side by side in the configuration page
* Restore main run command
* Update links to point to wiki pages for GitHub, Notion integrations
* Standardize nomenclature of the api_type to use _config suffix
Remove header fields that aren't actually helpful for understanding config usage
- Memory consumption now only scales with search models used, not with
content types as well. Previously each content type had its own
copy of the search ML models. That'd result in 300+ MB per enabled
content type
- Split model state into 2 separate state objects, `search_models` and
`content_index`.
This allows loading text_search and image_search models first and then
reusing them across all content_types in content_index
- This should cut down memory utilization quite a bit for most users.
I see a ~50% drop in memory utilization.
This will, of course, vary for each user based on the amount of
content indexed vs number of plugins enabled
- This does not solve the RAM utilization scaling with size of the index,
as the whole content index is still kept in RAM while Khoj is running
Should help with #195, #301 and #303
* Add a GitHub workflow that allows you to build dev versions of Desktop applications
* Add pull_request trigger for testing
* Fix errant open quote in Package Khoj App step
* Nix the release step, since this isn't associated with any tags
- Set retention period for uploaded artifacts to 1 day
* Remove pull_request trigger - limit to manual triggers and pushes to master
Just use a random static version for Khoj in the Docker image, as otherwise
the hatch vcs dynamic versioning requires the .git directory in the
docker image too
My account doesn't have gpt-4 enabled, so it wouldn't work: the default model value was always used in extract_questions, even though the caller could pass in the configured model.
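
A hedged sketch of the fix's shape. The signatures, the `chat` caller, and the returned dict are hypothetical stand-ins, and no API call is made; the point is only that the caller forwards the configured model instead of relying on the default.

```python
from typing import Dict, Optional


def extract_questions(text: str, model: str = "gpt-4") -> Dict[str, str]:
    # Stand-in: return the request that would be sent rather than calling an API
    return {"model": model, "prompt": f"Extract search queries from: {text}"}


def chat(text: str, configured_model: Optional[str] = None) -> Dict[str, str]:
    # Forward the configured model when one is set; otherwise fall back to the default
    if configured_model:
        return extract_questions(text, model=configured_model)
    return extract_questions(text)
```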