khoj/src at a58c243bc0d55f37874be2293e6c7845150c737e - sij/khoj


mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-29 02:13:02 +01:00


Debanjum 06c25682c9 Split text entries by max tokens supported by ML models	2022-12-26 18:23:43 +00:00

### Background
There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector. For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated)

### Issue
Until now, entries exceeding the max token size would silently get truncated during embedding generation. So the truncated portion of the entries would be ignored when matching queries with entries. This would degrade the quality of the results.

### Fix
- `e057c8e` Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- `b283650` Deduplicate results for user query by raw text before returning results

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
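The fix described in the commit message above can be illustrated with a minimal sketch. This is not khoj's actual code: the function names `split_entry_by_max_tokens` and `deduplicate_by_raw`, the `raw` dictionary key, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration.

```python
from typing import Dict, List


def split_entry_by_max_tokens(text: str, max_tokens: int = 256) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Hypothetical helper: words stand in for model tokens here.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


def deduplicate_by_raw(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first result for each distinct raw entry text.

    Assumes results are already ordered best match first, and that each
    result carries the full original entry under the (assumed) "raw" key.
    """
    seen = set()
    unique = []
    for result in results:
        if result["raw"] not in seen:
            seen.add(result["raw"])
            unique.append(result)
    return unique


# Usage: a 5-word entry split with a limit of 2 words per chunk.
chunks = split_entry_by_max_tokens("one two three four five", max_tokens=2)

# Two chunks of the same long entry match a query; only the first is kept.
hits = [
    {"raw": "long entry", "compiled": "chunk 1"},
    {"raw": "long entry", "compiled": "chunk 2"},
    {"raw": "other entry", "compiled": "chunk 1"},
]
unique_hits = deduplicate_by_raw(hits)
```

Splitting means one long entry can produce several indexed chunks, so a single query may match the same entry more than once. Deduplicating by the raw text before returning results keeps each entry to one appearance.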
interface	Update instructions in khoj.el to install it from MELPA stable	2022-12-23 19:08:38 -03:00
processor	Split entries by max tokens while converting Beancount entries To JSONL	2022-12-26 15:14:32 -03:00
routers	Add __init__.py to routers directory to register it as a package	2022-10-25 20:40:40 +05:30
search_filter	Use new Text Entry class to track text entries in Intermediate Format	2022-10-08 12:06:05 +03:00
search_type	Fix comments, use minimal test case, regenerate test index, merge debug logs	2022-12-25 22:33:04 -03:00
utils	Delete stale, unused installation helper script	2022-12-03 13:36:47 -03:00
__init__.py	Move application files under src directory. Update Readmes	2021-08-17 04:11:03 -07:00
configure.py	Move Custom Formatter class for logger to util.helper module from main.py	2022-10-20 00:32:24 +05:30
main.py	Move Custom Formatter class for logger to util.helper module from main.py	2022-10-20 00:32:24 +05:30