Mirror of https://github.com/khoj-ai/khoj.git (synced 2024-11-29 02:13:02 +01:00)

Commit 06c25682c9
### Background

There is a limit to the maximum number of input tokens (words) that an ML model can encode into an embedding vector. For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces), [2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated).

### Issue

Until now, entries exceeding the max token size would silently get truncated during embedding generation. The truncated portion of those entries was therefore ignored when matching queries with entries, which degraded the quality of search results.

### Fix
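The fix described above can be sketched as splitting each entry into chunks that fit within the model's token limit before embedding, so no text is silently dropped. This is a minimal illustration, not khoj's actual implementation; the function name `split_entry` and the word-based splitting (rather than true tokenizer word pieces) are simplifying assumptions.

```python
def split_entry(text: str, max_words: int = 256) -> list[str]:
    """Split an entry into chunks of at most `max_words` words.

    Embedding models like multi-qa-MiniLM-L6-cos-v1 truncate input
    beyond ~256 word pieces, so embedding each chunk separately keeps
    the whole entry searchable instead of losing the truncated tail.
    Note: word count only approximates the model's word-piece count.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk can then be embedded independently, and a query matching text anywhere in the original entry will match at least one of its chunks.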