Commit graph

89 commits

Author SHA1 Message Date
Debanjum Singh Solanky
52664dd96c Allow recursive glob pattern (**) to add files to search index
- Simplify configuring files to index For Obsidian/Org-Roam type
  systems with lots of small files in khoj.yml using `input-filter'
2023-01-03 01:32:58 -03:00
Debanjum Singh Solanky
17fa123b4e Split entries by max tokens while converting Beancount entries To JSONL 2022-12-26 15:14:32 -03:00
Debanjum Singh Solanky
f209e30a3b Split entries by max tokens while converting Markdown entries To JSONL 2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky
53cd2e5605 Regenerate initial model in asymmetric reload test to reduce flakyness
- Fix logger message when converting org node to entries
- Remove unused import from conftest
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
c79919bd68 Split entries by max tokens while converting Org entries To JSONL
- Test usage the entry splitting by max tokens in text search
2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky
e057c8e208 Add method to split entries by specified max tokens limit
- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky
a9cfd8b800 Extract hash func for incremental text indexing into separate method 2022-10-26 13:56:58 +05:30
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given it's usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
be57c711fd Revert OrgNode.hasTag func to method instead of property as accepts argument 2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capabilitity that needlessly increases
  maintainance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky
db37e38df7 Create OrgNode hasBody method. Use it in org_to_jsonl checks 2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix striping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky
a7cf6c8458 Use dictionary instead of list to track entry to file maps 2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky
3e1323971b Stack function calls in jsonl converters to avoid unneeded variables 2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky
4eb84c7f51 Log performance metrics for beancount, markdown to jsonl conversion 2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
030fab9bb2 Support incremental update of Markdown entries, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
91aac83c6a Support incremental update of Beancount transactions, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
b01b4d7daa Extract logic to mark entries for embeddings update into helper function
- This could be re-used by other text_to_jsonl converters like
  markdown, beancount
2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
91d11ccb49 Only hash compiled entry to identify new/updated entries to update
- Comparing compiled entries is the appropriately narrow target to
  identify entries that need to encode their embedding vectors. Given we
  pass the compiled form of the entry to the model for encoding

- Hashing the whole entry along with it's raw form was resulting in a
  bunch of entries being marked for updated as LINE: <entry_line_no>
  is a string added to each entries raw format.

- This results in an update to a single entry resulting in all entries
  below it in the file being marked for update (as all their line
  numbers have changed)

- Log performance metrics for steps to convert org entries to jsonl
2022-09-10 21:01:44 +03:00
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  everytime
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
  entries for incremental update correctly
2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time. Reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
07b98d35f1 Use filename or #+TITLE as heading for 0th level content in org files
- Set LINE, SOURCE link properties in property drawer correctly for
  content which falls under no heading
- See Issue #83 for more details
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue #83, as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
762607fc9f Log processed entries by org_to_jsonl only if verbosity > 2
Output too verbose for even debug mode logging. So gated behind -vvv
2022-09-06 23:03:29 +03:00
Debanjum Singh Solanky
490157cafa Setup File Filter for Markdown and Ledger content types
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
  - Word, Date Filters were accidently removed from the above types yesterday
  - File Filter is the only filter that newly got added
2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search as all text search content types do not have
  file being set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
094bd18e57 Use python standard logging framework for app logs
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
  - verbose = 0 maps to logging.WARN
  - verbose = 1 maps to logging.INFO
  - verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum Singh Solanky
150ae19660 Indent Timestamps, Drawers at Body Level in OrgNode Entry Representation 2022-08-10 18:55:37 +03:00
Debanjum Singh Solanky
fd31d339c1 Remove spurious space in Entries without Todo in OrgNode Entry Repr 2022-08-10 13:48:44 +03:00
Debanjum Singh Solanky
38df727ef4 Fix escape sequence usage in strings. Remove unneeded import of os
Rename /config API method to config to match it's purpose. UI is
anyway too generic, and not what it is doing
2022-08-03 18:51:55 +03:00
Debanjum Singh Solanky
1b55462fb0 Convert search_filter, conversation dir to proper modules
Add __init__.py files to their directories
2022-08-02 20:23:42 +03:00
Debanjum Singh Solanky
f094c86204 Trace query response performance and display timings in verbose mode 2022-07-26 21:03:53 +04:00
Debanjum Singh Solanky
d4d7dbaca6 Support Natural Search on Markdown Files
- Reason:
  Allow natural search on markdown based notes, documentation,
  websites etc

- Details:
  - Create markdown processor to extract Markdown entries (identified by
    Heading) into standard jsonl format required by text_search
  - Update API, Configs to support interfacing with new markdown type
  - Update Emacs, Web clients to support interfacing with new markdown
    type via API
  - Update Readme to mentiond markdown is also supported

Closes #35
2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky
0917f1574d Consolidate jsonl helper methods in a single file under utils module 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
5aad297286 Reuse logic to extract entries across symmetric, asymmetric search
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types

Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky
e220ecc00b Generate compiled form of each transaction directly in the beancount processor
- The logic for compiling a beancount entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows symmetric search to be generic and not be aware of
  beancount specific properties that were extracted by the
  beancount-to-jsonl processor layer

- Now symmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be location, transaction, chat etc, it doesn't have to care
2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky
06cf425314 Generate compiled form of each entry directly in the org-mode processor
- The logic for compiling an org-mode entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows asymmetric search to be generic and not be aware of
  org-mode specific properties that were extracted by the org-to-jsonl
  processor layer

- Now asymmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be mail, chat, markdown, org-mode etc, it doesn't have to care
2022-07-21 02:08:02 +04:00