The query method had become too big.
Extract the filter, score, sort and deduplicate logic used by
`text_search.query' into separate methods.
This should improve code readability.
- Changes
- Fix method signatures of BaseFilter subclasses.
Otherwise typing information doesn't propagate to them
- Explicitly pass `entries: list[Entry]' as arg to `load' method
- Fix type of `raw_entries' arg to `apply' method
from `list[str]' to `list[Entry]'
- Rename `raw_entries' arg to `apply' method to `entries'
- Rename `raw_query' arg used in `apply' method of subclasses to `query'
- Set type of entries, corpus_embeddings in TextSearchModel
- Verification
Ran `mypy --config-file .mypy.ini src' to verify typing
- `torch.Tensor' is apparently a legacy tensor constructor
- Using it to create tensors on MPS devices throws this error:
RuntimeError: legacy constructor expects device type: cpu but device type: mps was passed
- `torch.tensor' can handle creating tensors on Mac GPU (MPS) fine
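For illustration, a minimal sketch of the fix (the availability guard and values are illustrative):

    import torch

    device = "mps" if torch.backends.mps.is_available() else "cpu"

    # The legacy constructor rejects non-CPU devices:
    #   embeddings = torch.Tensor([0.1, 0.2], device=device)  # RuntimeError on MPS
    # torch.tensor accepts any device, including the Mac GPU (MPS)
    embeddings = torch.tensor([0.1, 0.2], device=device)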
This is unlike the more general chat API, which combines summarizing
the top search result with conversing with the OpenAI model.
This should give faster summary results, as no intent categorization
API call is required.
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibility
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
- Default to using `text-davinci-003' if conversation model not
explicitly configured by user. Stop using the older `davinci' and
`davinci-instruct' models
- Use `model' instead of `engine' as parameter.
Usage of `engine' parameter in OpenAI API is deprecated
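For reference, a sketch of the updated call using the pre-1.0 openai Python client (prompt and max_tokens are placeholders):

    import openai

    # Pass `model' instead of the deprecated `engine' parameter
    response = openai.Completion.create(
        model="text-davinci-003",  # default when no conversation model is configured
        prompt="Summarize the following notes: ...",
        max_tokens=100,
    )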
- Init processor before search to instantiate `openai_api_key'
from `khoj.yml'. The key is used to configure search with openai models
- To use OpenAI models for search in Khoj
- Set `encoder' to the name of an OpenAI model, e.g. `text-embedding-ada-002'
- Set `encoder-type' in `khoj.yml' to `src.utils.models.OpenAI'
- Set `model-directory' to `null', as online model cannot be stored on disk
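Sketched as a `khoj.yml' fragment (the `asymmetric' search type is illustrative; the three settings are the ones listed above):

    search-type:
      asymmetric:
        encoder: text-embedding-ada-002
        encoder-type: src.utils.models.OpenAI
        model-directory: null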
Long words (>500 characters) provide less useful context to models.
Dropping very long words allows models to create better embeddings by
passing more of the useful context from the entry to the model
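A minimal sketch of such a filter (the function name is illustrative; the 500-character cutoff is from the note above):

    def drop_long_words(text: str, max_word_length: int = 500) -> str:
        "Remove words too long to carry useful semantic context."
        return " ".join(word for word in text.split() if len(word) <= max_word_length)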
- Previously `model_type' was set in the setup of each `search_type'
- All encoders were of type `SentenceTransformer'
- All cross_encoders were of type `CrossEncoder'
- Now `encoder-type' can be configured via the new `encoder_type' field
in `TextSearchConfig' under `search-type' in `khoj.yml'.
- All the specified `encoder-type' class needs is an `encode' method
that takes entries and returns embedding vectors
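A hypothetical encoder satisfying that contract (the class name and toy embedding are illustrative):

    class KeywordEncoder:
        "Any class with an `encode' method mapping entries to vectors works."

        def encode(self, entries: list[str], **kwargs) -> list[list[float]]:
            # Toy embedding: (character count, word count) per entry.
            # A real encoder would call a local model or remote API here
            return [[float(len(entry)), float(len(entry.split()))] for entry in entries]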
- Ensure all tensors are on MPS device before doing operations across them
- Background
- GPU is used by default for Khoj on MacOS now
- Using the GPU on Macs needs PyTorch > 1.13.0, which we now have
- MPS should speed up search and indexing on MacOS
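A sketch of the device handling (tensor shapes are illustrative):

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    # Mixing CPU and MPS tensors in one operation raises a RuntimeError,
    # so move everything to the chosen device before computing scores
    corpus_embeddings = torch.rand(100, 384).to(device)
    query_embedding = torch.rand(384).to(device)
    scores = corpus_embeddings @ query_embedding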
Fix usage warnings for unescaped single quotes in `khoj.el' docstrings.
Converts usage of '<text>' into `<text>' to use the correct quote forms in generated docs
⛔ Warning (comp): khoj.el:119:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:120:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:121:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
⛔ Warning (comp): khoj.el:168:2: Warning: docstring has wrong usage of unescaped single quotes (use \= or different quoting)
- Features
- Search using Khoj from within the Obsidian app
Allow Natural language search on your (markdown) notes in Obsidian Vault
- Show search results as rendered (instead of raw) Markdown
Improve legibility of the results
- Jump to selected note from search result in Khoj search modal
Simplify seeing result within its original note context
- Automatically configure khoj to index markdown files in current vault
Reduce khoj setup steps for plugin users by using reasonable defaults
- Code updates the markdown config in khoj.yml and triggers index update
- It can be configured by user in khoj plugin settings, if required
- Add Demo and detailed Readme for the Obsidian plugin
Ease setup and usage. Give context about capabilities
- Miscellaneous
- Trying to keep a mono repo until the Khoj project is mature enough,
to reduce maintenance burden
This can ease configuring khoj from the different interfaces
- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
configure khoj.
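A sketch of that flow using the requests library (the base URL and markdown config keys are illustrative):

    import requests

    BASE_URL = "http://localhost:8000"  # address of the running khoj server

    # Fetch the current (default) config, tweak only the desired portion,
    # then post it back to configure khoj
    config = requests.get(f"{BASE_URL}/api/config/data").json()
    config["content-type"]["markdown"]["input-filter"] = ["~/vault/**/*.md"]
    requests.post(f"{BASE_URL}/api/config/data", json=config)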
- Start khoj server (in non-GUI mode) without needing a config file to
already exist
- But throw a warning that khoj must be configured before it can be used
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
in khoj
- Poll scheduler every minute using threading.Timer
- Use a 60 second polling interval to avoid fork bombing
- Schedule next via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
- There are 3 paths to updating/setting the index (stored in state.model)
- App start
- API
- Scheduler
- Put all updates to the index behind a lock, as multiple update paths
could (potentially) run at the same time (via API or Scheduler); see the sketch below
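A minimal sketch combining the polling and the lock (the `schedule' job library and function names are assumptions):

    import threading
    import schedule  # assumed job library; any run_pending-style scheduler works

    POLL_INTERVAL = 60.0           # seconds; coarse polling avoids fork bombing
    index_lock = threading.Lock()  # one lock shared by all three update paths

    def update_index(build_index):
        "Serialize index updates from app start, the API and the scheduler."
        global model
        with index_lock:
            model = build_index()

    def poll_scheduler():
        schedule.run_pending()  # run any scheduled jobs that are due
        timer = threading.Timer(POLL_INTERVAL, poll_scheduler)  # schedule next poll
        timer.daemon = True     # daemon thread allows clean program interrupt
        timer.start()

    poll_scheduler()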
- Remove property drawer from test entry for max_words splitting test
- Property drawer is not required for the test
- Keep minimal test case to reduce chance for confusion
- Required because entries are now split by the max_word count supported
by the ML models
- This could now result in potentially duplicate hits and entries being
returned to the user
- Do deduplication after ranking to get the top ranked deduplicated
results
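A sketch of the post-ranking deduplication (the hit structure is illustrative):

    def deduplicate_hits(ranked_hits: list[dict]) -> list[dict]:
        "Keep only the top-ranked hit per entry; assumes best-first ordering."
        seen: set[str] = set()
        deduped = []
        for hit in ranked_hits:
            if hit["entry"] not in seen:
                seen.add(hit["entry"])
                deduped.append(hit)
        return deduped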
- The instructions suggest installing khoj-assistant via pip install.
This installs the latest tagged/release version of khoj
- To match that version, users should install khoj.el from MELPA Stable
instead of MELPA
- Issue
ML Models truncate entries exceeding some max token limit.
This lowers the quality of search results
- Fix
Split entries by max tokens before indexing.
This should improve searching for content in longer entries.
- Miscellaneous
- Test method to split entries by max tokens
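A sketch of the splitting step (the token limit and whitespace tokenization are simplifications):

    def split_by_max_tokens(text: str, max_tokens: int = 256) -> list[str]:
        "Split an entry into chunks small enough for the model to encode fully."
        words = text.split()
        return [
            " ".join(words[i : i + max_tokens])
            for i in range(0, len(words), max_tokens)
        ]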
Update readme to ask user to install khoj.el from MELPA when a
pre-release version of the main khoj app is installed. Else install
khoj.el from MELPA Stable
- Reason
- All clients that currently consume the API are part of Khoj
- Any breaking API changes will be fixed in clients immediately
- So decoupling client from API is not required
- This removes the burden of maintaining multiple versions of the API
- Context
- The app maintains all text content in a standard, intermediate format
- The intermediate format was loaded, passed around as a dictionary
for easier, faster updates to the intermediate format schema initially
- The intermediate format is reasonably stable now, given its usage
by all 3 text content types currently implemented
- Changes
- Concretize text entries into `Entries' class instead of using dictionaries
- Code is updated to load, pass around entries as `Entries' objects
instead of as dictionaries
- `text_search' and `text_to_jsonl' methods are annotated with
type hints for the new `Entries' type
- Code and Tests referencing entries are updated to use class style
access patterns instead of the previous dictionary access patterns
- Move `mark_entries_for_update' method into `TextToJsonl' base class
- This is a more natural location for the method as it is only
(to be) used by `text_to_jsonl' classes
- Avoid circular reference issues on importing `Entries' class
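A minimal sketch of such an entry class (the raw/compiled fields follow the wording used elsewhere in this log; everything else is illustrative):

    from dataclasses import dataclass

    @dataclass
    class Entry:
        raw: str = ""       # entry text as it appears in the source file
        compiled: str = ""  # normalized form passed to the model for encoding

    # Class-style access replaces the earlier dictionary-style access:
    #   entry.compiled  instead of  entry["compiled"]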
- Both Text and Image Search were already returning a list of (entry, score)
- This change just codifies that behavior and exposes it in the API
documentation (i.e. OpenAPI, Swagger, ReDoc)
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
under a new router package. Version tags are applied in main.py via prefixes
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mention and reference only the versioned API endpoints
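A sketch of the versioned wiring in main.py (router construction and prefixes are illustrative):

    from fastapi import APIRouter, FastAPI

    api, api_beta, web_client = APIRouter(), APIRouter(), APIRouter()

    app = FastAPI()
    app.include_router(api, prefix="/api")            # v1.0 endpoints
    app.include_router(api_beta, prefix="/api/beta")  # beta endpoints
    app.include_router(web_client)                    # frontend routes, no prefix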
In my installation, it appears that `url-request-method` is sometimes set
globally to POST. Need to explicitly set it to ensure that GET is always
used as intended.
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
- Start standardizing implementation of the `text_to_jsonl' processors
- `text_to_jsonl' scripts already had a shared structure
- This change starts to codify that implicit structure
- Benefits
- Ease adding more `text_to_jsonl' processors
- Allow merging shared functionality
- Help with type hinting
- Drawbacks
- Lower agility to change. But this was already an implicit issue as
the text_to_jsonl processors got more deeply wired into the app
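A sketch of the codified structure (only the `TextToJsonl' name and `mark_entries_for_update' appear in this log; the method signatures are assumptions):

    from abc import ABC, abstractmethod

    class TextToJsonl(ABC):
        "Shared base class for the text_to_jsonl processors."

        @abstractmethod
        def process(self, previous_entries=None) -> list:
            "Convert source text content into the standard jsonl entry format."

        def mark_entries_for_update(self, current_entries, previous_entries):
            "Shared helper to diff entries across runs for incremental updates."
            ...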
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
- This also removes dependency on system Exiftool package for
XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
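A sketch of the Pillow-based extraction (requires Pillow >= 8.2 and defusedxml; the file path is illustrative):

    from PIL import Image

    with Image.open("tests/data/image_with_xmp.jpg") as image:
        xmp_metadata = image.getxmp()  # XMP as a nested dict; no Exiftool needed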
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
- Prevents an error when trying to process the results of an empty query
- As `/reload' updates the index incrementally, it's relatively quick
- This makes `/reload' a better default endpoint to expose
via the web interface than the `/regenerate' endpoint
- For queries with only filters in them short-circuit and return
filtered results. No need to run semantic search, re-ranking.
- Add client tests for filter-only queries and quoted queries
- Image search already uses a sorted list of images to process
- Prevents the index of entries from desyncing when entries and embeddings
are generated by a separate server/app instance
- Update existing code and tests to process input-filters as a list
instead of a str
- Test `text_to_jsonl' get files methods to work with combination of
`input-files' and `input-filters'
Resolves #84
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI Swagger API docs look better, as the `search' controller is no
longer wrapped in a generically named function by the functools LRU decorator
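A sketch of the replacement cache (function names are illustrative):

    import logging

    logger = logging.getLogger(__name__)
    query_cache: dict[str, list] = {}  # cleared on updates to entries, embeddings

    def run_search(q: str) -> list:
        "Placeholder for the actual semantic search pipeline."
        return []

    def search(q: str) -> list:
        "Hand-rolled cache; leaves the controller function itself unwrapped."
        if q in query_cache:
            logger.debug("Returning results from cache")
            return query_cache[q]
        results = run_search(q)
        query_cache[q] = results
        return results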
- Issue
- Indent regex was previously catching escape sequences like newlines
- This was resulting in entries with only escape sequences in their body
being prepended to property drawers etc. during rendering
- Fix
- Update indent regex to only look for spaces in each line
- Only render body when body contains non-escape characters
- Create test to prevent this regression from silently resurfacing
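Illustrative patterns (the actual regexes in the code may differ):

    import re

    # Old: \s also matches newlines, so a body containing only escape
    # sequences looked like indentation of the property drawer above it
    old_indent_regex = re.compile(r"^\s*", re.MULTILINE)
    # New: only literal spaces at the start of a line count as indentation
    new_indent_regex = re.compile(r"^ *", re.MULTILINE)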
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
- Let the specific text_to_jsonl method decide which of the
TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
methods in check
- Keeping the word filter from going stale on entry updates is more
hassle than it's worth
- Generating the index on 120K lines of notes takes 1s. Loading it from
file takes 0.2s. For less content the load time difference will be even smaller
- Let go of the startup time improvement for simplicity, for now
- Comparing compiled entries is the appropriately narrow target to
identify entries that need to encode their embedding vectors, given we
pass the compiled form of the entry to the model for encoding
- Hashing the whole entry along with its raw form was resulting in a
bunch of entries being marked for update, as `LINE: <entry_line_no>'
is a string added to each entry's raw form
- So an update to a single entry resulted in all entries below it in the
file being marked for update (as all their line numbers had changed)
- Log performance metrics for steps to convert org entries to jsonl
- Having tags as sets was returning them in a different order
every time
- This resulted in spuriously identifying existing entries as new
because their tags ordering changed
- Converting tags to lists fixes the issue and correctly identifies
updated and new entries for incremental update
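A small demonstration of the problem (output order varies across Python processes due to string hash randomization):

    # Sets iterate in hash order, which can differ between runs, so
    # identical tags could serialize differently on every indexing run
    # and spuriously change the entry hash
    print(list({"khoj", "search", "notes"}))

    # A list preserves a stable order across runs
    print(["khoj", "search", "notes"])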
- What
- Hash the entries and compare to find new/updated entries
- Reuse embeddings encoded for existing entries
- Only encode embeddings for updated or new entries
- Merge the existing and new entries and embeddings to get the updated
entries, embeddings
- Why
- Given most note text entries are expected to be unchanged
across time, reusing their earlier encoded embeddings should
significantly speed up embedding updates
- Previously we were regenerating embeddings for all entries,
even if they had existed in previous runs
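A sketch of that flow (function and field names are illustrative; hashing the compiled form follows the notes above):

    import hashlib

    def merge_entries(current_entries, previous_entries, previous_embeddings, encode):
        "Reuse embeddings of unchanged entries; encode only new/updated ones."
        # Hash only the compiled form: the raw form embeds line numbers, so
        # one edit would mark every entry below it in the file as updated
        previous_rows = {
            hashlib.md5(entry.compiled.encode("utf-8")).hexdigest(): row
            for row, entry in enumerate(previous_entries)
        }
        embeddings = []
        for entry in current_entries:
            key = hashlib.md5(entry.compiled.encode("utf-8")).hexdigest()
            if key in previous_rows:
                embeddings.append(previous_embeddings[previous_rows[key]])  # reuse
            else:
                embeddings.append(encode(entry.compiled))  # new or updated entry
        return current_entries, embeddings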