khoj/src
Debanjum 06c25682c9
Split text entries by max tokens supported by ML models
### Background
There is a limit on the number of input tokens (roughly, words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated).

### Issue
Until now, entries exceeding the max token size were silently truncated during embedding generation.
So the truncated portion of an entry was ignored when matching queries with entries.
This degraded the quality of the results.

### Fix
- e057c8e Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- b283650 Deduplicate results for user query by raw text before returning results
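
The two steps in the fix can be sketched roughly as below. This is a minimal illustration, not khoj's actual implementation: the function names `split_by_max_tokens` and `deduplicate_by_raw_text`, the `raw` key on each hit, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def split_by_max_tokens(text: str, max_tokens: int = 256) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words approximate tokens. A model's own tokenizer may
    produce more word pieces than words for the same text.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


def deduplicate_by_raw_text(hits: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first hit for each distinct raw entry text.

    Assumption: hits arrive ranked best-first, and every chunk of a split
    entry carries the full original entry under the hypothetical "raw" key.
    """
    seen = set()
    unique = []
    for hit in hits:
        if hit["raw"] not in seen:
            seen.add(hit["raw"])
            unique.append(hit)
    return unique
```

Because several chunks of one long entry can match the same query, deduplicating by raw text keeps that entry from appearing more than once in the results.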

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
2022-12-26 18:23:43 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| interface | Update instructions in khoj.el to install it from MELPA stable | 2022-12-23 19:08:38 -03:00 |
| processor | Split entries by max tokens while converting Beancount entries To JSONL | 2022-12-26 15:14:32 -03:00 |
| routers | Add __init__.py to routers directory to register it as a package | 2022-10-25 20:40:40 +05:30 |
| search_filter | Use new Text Entry class to track text entries in Intermediate Format | 2022-10-08 12:06:05 +03:00 |
| search_type | Fix comments, use minimal test case, regenerate test index, merge debug logs | 2022-12-25 22:33:04 -03:00 |
| utils | Delete stale, unused installation helper script | 2022-12-03 13:36:47 -03:00 |
| __init__.py | Move application files under src directory. Update Readmes | 2021-08-17 04:11:03 -07:00 |
| configure.py | Move Custom Formatter class for logger to util.helper module from main.py | 2022-10-20 00:32:24 +05:30 |
| main.py | Move Custom Formatter class for logger to util.helper module from main.py | 2022-10-20 00:32:24 +05:30 |