sij/khoj

mirror of https://github.com/khoj-ai/khoj.git synced 2024-11-23 23:48:56 +01:00

Author	SHA1	Message	Date
Debanjum Singh Solanky	07b98d35f1	Use filename or #+TITLE as heading for 0th level content in org files - Set LINE, SOURCE link properties in property drawer correctly for content which falls under no heading - See Issue #83 for more details	2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky	d6bd7bf3e1	Fix initializing OrgNode level to string to parse org files - Parsed `level` argument passed to OrgNode during init is expected to be a string, not an integer - This was resulting in app failure only when parsing org files with no headings, like in issue #83, as level is set to string of `*`s the moment a heading is found in the current file	2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky	d835467f2c	Throw exception if no valid entries found in specified content files - Previously we were failing if no valid entries while computing embeddings. This was obscuring the actual issue of no valid entries found in the specified content files - Throwing an exception early with a clear message when no entries are found should clarify the issue to be fixed - See issue #83 for details	2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky	e00bb53336	Init word filter dictionary with default value as set to simplify code	2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky	4d776d9c7a	Bump khoj version to 0.1.9	2022-09-09 07:50:15 +03:00
Debanjum	b58b7d7483	Create App Directory, Fix Initialization GUI on First Run - `588f598` Pass empty list of `input_files` to `FileBrowser` on first run - `3ddffdf` Create config directory before setting up logging to file under it Resolves #78 Resolves #79 Resolves #80	2022-09-09 04:40:22 +00:00
Debanjum Singh Solanky	588f598949	Pass empty list of `input_files' to FileBrowser on first run - Default config has `input_files' set to None - This was being passed to `FileBrowser' on Initialization - But `FileBrowser' expects `content_files' of list type, not None - This resulted in an unexpected NoneType failure	2022-09-09 07:26:40 +03:00
Debanjum Singh Solanky	3ddffdfba4	Create config directory before setting up logging to file under it - The logging to file code expects the config directory to already be setup - But parent directory of config file was being set up later in code - This resulted in app start failing with ~/.khoj dir does not exist error	2022-09-09 07:21:42 +03:00
Debanjum	79894efc7a	Resolve GUI Issues in Docker Build - `17354aa` Install `pyqt` system package in Docker image to get qt dependencies - `5d3aeba` Do not start GUI when Khoj started from Docker - `26ff66f` (Re-)Enable image search via Docker image as image search issues fixed Resolves #76	2022-09-08 07:55:06 +00:00
Debanjum Singh Solanky	26ff66f38b	(Re-)Enable image search via Docker image as image search issues fixed	2022-09-08 10:42:34 +03:00
Debanjum Singh Solanky	17354aaffd	Install pyqt system package in Docker image to get qt dependencies Otherwise app start fails with pyqt package import related errors. See #76 for bug	2022-09-08 10:39:11 +03:00
Debanjum Singh Solanky	5d3aeba22f	Use --no-gui flag on starting Khoj from docker-compose As the GUI wouldn't work when run from a docker container	2022-09-08 10:37:39 +03:00
Debanjum Singh Solanky	e4d40e4d4d	Update setup.py version, Readme. Remove faulty release badge for now	2022-09-07 14:51:03 +03:00
Debanjum Singh Solanky	35d81de1a1	Update khoj version to 0.1.7 in setup.py This should have been done right after the 0.1.6 release. To allow pre-release versions for 0.1.7 published to pypi from master to be installable. Currently they're being published as 0.1.6 pre-release versions instead	2022-09-07 13:38:15 +03:00
Debanjum Singh Solanky	762607fc9f	Log processed entries by org_to_jsonl only if verbosity > 2 Output too verbose for even debug mode logging. So gated behind -vvv	2022-09-06 23:03:29 +03:00
Debanjum Singh Solanky	490157cafa	Setup File Filter for Markdown and Ledger content types - Pass file associated with entries in markdown, beancount to json converters - Add File, Word, Date Filters to Ledger, Markdown Types - Word, Date Filters were accidentally removed from the above types yesterday - File Filter is the only filter that newly got added	2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky	94cf3e97f3	Log app logs to file for posthoc debugging and performance analysis	2022-09-06 14:51:48 +03:00
Debanjum	0a78cd5477	Create File Filter. Improve, Consolidate Filter Code ### General Filter Improvements - `e441874` Create Abstract Base Class for all filters to inherit from - `965bd05` Make search filters return entry ids satisfying filter - `092b9e3` Setup Filters when configuring Text Search for each Search Type - `31503e7` Do not pass embeddings in argument to `filter.apply` method as unused ### Create File Filter - `7606724` Add file associated with each entry to entry dict in `org_to_jsonl` converter - `1f9fd28` Create File Filter to filter files specified in query - `7dd20d7` Pre-compute file to entry map to speed up file based filter - `7e083d3` Cache results for file filters passed in query for faster filtering - `2890b4c` Simplify extracting entries satisfying file filter ### Miscellaneous - `f930324` Rename `explicit filter` to more appropriate name `word filter` - `3707a4c` Improve date filter perf. Precompute date to entry map, Cache results	2022-09-05 15:29:55 +00:00
Debanjum Singh Solanky	3707a4cdd4	Improve date filter perf. Precompute date to entry map, Cache results - Precompute date to entry map - Cache results for faster recall - Log performance timers in date filter	2022-09-05 18:21:29 +03:00
Debanjum Singh Solanky	31503e7afd	Do not pass embeddings as argument to filter.apply method	2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky	965bd052f1	Make search filters return entry ids satisfying filter - Filter entries, embeddings by ids satisfying all filters in query func, after each filter has returned entry ids satisfying their individual acceptance criteria - Previously each filter would return a filtered list of entries. Each filter would be applied on entries filtered by previous filters. This made the filtering order dependent - Benefits - Filters can be applied independent of their order of execution - Precomputed indexes for each filter is not in danger of running into index out of bound errors, as filters run on original entries instead of on entries filtered by filters that have run before it - Extract entries satisfying filter only once instead of doing this for each filter - Costs - Each filter has to process all entries even if previous filters may have already marked them as non-satisfactory	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7dd20d764c	Pre-compute file to entry map in file filter to mark ids to include faster	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	2890b4cd44	Simplify extracting entries satisfying file filter	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7606724dbc	Add file of each entry to entry dict in org_to_jsonl converter - This will help filter query to org content type using file filter - Do not explicitly specify items being extracted from json of each entry in text_search as all text search content types do not have file being set in jsonl converters	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	7e083d3e96	Cache results for file filters passed in query for faster filtering	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	f634399f23	Convert simple file filters with no path separator into regex - Specify just file name to get all notes associated with file at path - E.g `query` with `file:"file1.org"` will return `entry1` if `entry1` is in `file1.org` at `~/notes/file1.org` - Test - Test converting simple file name filter to regex for path match - Test file filter with space in file name	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	092b9e329d	Setup Filters when configuring Text Search for each Search Type - Allows enabling different filters for different Text Search Types - Use FileFilter in Text Search on Org Files	2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky	1f9fd28b34	Create File Filter to filter files to query. Add tests for file filter	2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky	e4418746f2	Create Abstract Base Class for Filters. Make Word, Date Filter Child of BaseFilter	2022-09-04 18:48:16 +03:00
Debanjum Singh Solanky	c9f6200007	Ignore pytest_cache directory from git using .gitignore	2022-09-04 17:19:22 +03:00
Debanjum Singh Solanky	f930324350	Rename explicit filter to word filter to be more specific	2022-09-04 17:18:47 +03:00
Debanjum	d153d420fc	Improve Latency of Explicit Filter ### Goal - Improve explicit filter latency to work better with incremental search ### Reasons for High Explicit Filter Latency - Deleting entries to be excluded from existing list of entries, embeddings - Explicit filtering on partial words during incremental search - Creating word set for all entries on the fly during query - Deep copying of entries, embeddings before applying filter ### Improvement Details - Major - `191a656` Use word to entry map, list comprehension to speed up explicit filter - Use list comprehension and `torch.index_select` methods - to speed selection of entries, embedding tensors satisfying filter - avoid deep copy and direct manipulation of entries, embeddings - Use word to entry map and set operations to mark entries satisfying inclusion, exclusion filters - `c7de57b` Pre-compute entry word sets to improve explicit filter query performance - `3308e68` Cache explicitly filtered entries, embeddings by required, blocked words - `cdcee89` Wrap explicit filter words in quotes to trigger filter - E.g `+"word_to_include"` instead of `+word_to_include` - Signals explicit filter term completed - Prevents latency due to incremental search with explicit filtering on partial terms - Minor - `28d3dc1` Deep copy entries, embeddings in filters. Defer till actual filtering - `8d9f507` Load entries_by_word_set from file only once on first load of explicit filter - `546fad5` Use regex to check for and extract include, exclude filter words from query - `b7d259b` Test Explicit Include, Exclude Filters ### Results - Improve exclude word filter latency from 20s+ to 0.02s on 120K line notes corpus	2022-09-04 13:55:17 +00:00
Debanjum Singh Solanky	6087862521	Use LRU helper class for explicit filter cache	2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky	8f3326c8d4	Create LRU helper class for caching	2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky	191a656ed7	Use word to entry map, list comprehension to speed up explicit filter - Code Changes - Use list comprehension and `torch.index_select' methods - to speed selection of entries, embedding tensors satisfying filter - avoid deep copy of entries, embeddings - avoid updating existing lists (of entries, embeddings) - Use word to entry map and set operations to mark entries satisfying inclusion, exclusion filters - Results - Speed up explicit filtering by two orders of magnitude - Improve consistency of speed up across inclusion and exclusion filtering	2022-09-04 15:22:35 +03:00
Debanjum Singh Solanky	28d3dc1434	Deep copy entries, embeddings in filters. Defer till actual filtering - Only the filter knows when entries, embeddings are to be manipulated. So move the responsibility to deep copy before manipulating entries, embeddings to the filters - Create deep copy in filters. Avoids creating deep copy of entries, embeddings when filter results are being loaded from cache etc	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	3308e68edf	Cache explicitly filtered entries, embeddings by required, blocked words	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	cdcee89ae5	Wrap words in quotes to trigger explicit filter from query - Do not run the more expensive explicit filter until the word to be filtered is completed by user. This requires an end sequence marker to identify end of explicit word filter to trigger filtering - Space isn't a good enough delimiter as the explicit filter could be at the end of the query in which case no space	2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky	8d9f507df3	Load entries_by_word_set from file only once on first load of explicit filter	2022-09-04 00:37:37 +03:00
Debanjum Singh Solanky	858d86075b	Use regexes to check if any explicit filters in query. Test can_filter	2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky	546fad570d	Use regex to extract include, exclude filter words from query	2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky	b7d259b1ec	Test Explicit Include, Exclude Filters	2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky	ffb8e3988e	Use Python Logging Framework to Time Performance of Explicit Filter	2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky	30c3eb372a	Update Tests to Configure Filters and Setup Text Search	2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky	c7de57b8ea	Pre-compute entry word sets to improve explicit filter query performance	2022-09-03 16:16:31 +03:00
Debanjum Singh Solanky	094bd18e57	Use python standard logging framework for app logs - Stop passing verbose flag around app methods - Minor remap of verbosity levels to match python logging framework levels - verbose = 0 maps to logging.WARN - verbose = 1 maps to logging.INFO - verbose >=2 maps to logging.DEBUG - Minor clean-up of app: unused modules, conversation file opening	2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky	d0531c3064	Update URL QueryParam when Type set in Dropdown on Web Interface - This also pushes the updated URL state to history - Allows jumping back to the web interface after clicking on an image and having the type set to image search - Previously type would get reset to the default search type on jumping back	2022-08-28 12:22:22 +03:00
Debanjum Singh Solanky	2eae32d743	Time, Log Image Search Performance	2022-08-28 00:28:46 +03:00
Debanjum Singh Solanky	c3ca99841b	Scale down images to generate image embeddings faster, with less memory - CLIP doesn't need full size images for generating embeddings with decent search results. The sentence transformers docs use images scaled to 640px width - Benefits - Normalize image sizes - Increase image embeddings generation speed - Decrease memory usage while generating embeddings from images	2022-08-24 14:09:02 +03:00
Debanjum Singh Solanky	ea4fdd9134	Fix logic to ignore notes with no body. Add tests to prevent regression - Notes with empty newlines in body were not being ignored - Add regression tests to avoid above regression in org_to_jsonl conversion	2022-08-21 19:41:40 +03:00
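
Commits `965bd052f1`, `191a656ed7` and `7dd20d764c` above describe filters that each return the ids of entries they accept, using precomputed word-to-entry and file-to-entry maps, with the query step intersecting those ids once at the end. The following is a minimal sketch of that idea only, not the code from the khoj repository: the class names, method signatures, and entry dict keys (`text`, `file`) are assumptions made for illustration, and the embedding selection via `torch.index_select` mentioned in the commits is left out to keep the example dependency-free.

```python
from collections import defaultdict


class WordFilter:
    """Hypothetical word filter: returns ids of entries passing include/exclude words."""

    def __init__(self, entries):
        self.all_ids = set(range(len(entries)))
        # Precompute the word -> entry ids map once, instead of on every query
        self.word_to_ids = defaultdict(set)
        for idx, entry in enumerate(entries):
            for word in entry["text"].lower().split():
                self.word_to_ids[word].add(idx)

    def apply(self, required=(), blocked=()):
        ids = set(self.all_ids)
        for word in required:
            ids &= self.word_to_ids.get(word.lower(), set())
        for word in blocked:
            ids -= self.word_to_ids.get(word.lower(), set())
        return ids


class FileFilter:
    """Hypothetical file filter: returns ids of entries that come from the given files."""

    def __init__(self, entries):
        self.all_ids = set(range(len(entries)))
        # Precompute the file -> entry ids map once
        self.file_to_ids = defaultdict(set)
        for idx, entry in enumerate(entries):
            self.file_to_ids[entry["file"]].add(idx)

    def apply(self, files=()):
        if not files:
            return set(self.all_ids)
        ids = set()
        for file in files:
            ids |= self.file_to_ids.get(file, set())
        return ids


def select_entries(entries, id_sets):
    """Intersect ids from every filter, then extract matching entries only once."""
    included = set(range(len(entries)))
    for ids in id_sets:
        included &= ids
    return [entries[i] for i in sorted(included)]


entries = [
    {"text": "buy milk today", "file": "todo.org"},
    {"text": "read book today", "file": "notes.org"},
    {"text": "buy book", "file": "notes.org"},
]
word_ids = WordFilter(entries).apply(required=["today"], blocked=["milk"])
file_ids = FileFilter(entries).apply(files=["notes.org"])
result = select_entries(entries, [word_ids, file_ids])
```

Because every filter runs against the original entries and set intersection is commutative, the result does not depend on the order in which filters run. The cost noted in `965bd052f1` still applies: each filter considers all entries, even ones another filter has already rejected.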

676 commits