Commit graph

5 commits

Author SHA1 Message Date
Debanjum Singh Solanky
7b4f78776c Fix extracting Markdown Entries with Top Level Headings
- Previously top level headings would get stripped of the
  space between heading text and the prefix # symbols. That is,
  `# Top Level Heading' would get converted to `#Top Level Heading'
- This would mess up their rendering as a heading in search results

- Add unit tests to text_to_jsonl processors to prevent regression
2023-01-17 13:06:28 -03:00
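
The heading fix above can be illustrated with a minimal sketch. The function name `extract_markdown_entries` and the regex-based splitting are assumptions made for illustration, not the project's actual implementation. The point is that splitting on a zero-width lookahead leaves the `# ` prefix, including its space, attached to each entry.

```python
import re
from typing import List


def extract_markdown_entries(markdown: str) -> List[str]:
    """Split Markdown text into entries, one per heading (hypothetical helper)."""
    # Split at the start of every line that begins with one or more '#'
    # followed by whitespace. The lookahead is zero-width, so the heading
    # prefix and the space after it are kept rather than consumed.
    parts = re.split(r"^(?=#+\s)", markdown, flags=re.MULTILINE)
    # Drop empty fragments, e.g. the one before a heading on the first line.
    return [part.strip() for part in parts if part.strip()]


entries = extract_markdown_entries("# Top Level Heading\nbody\n## Sub Heading\nmore")
```

A regression test in this spirit would assert that the first entry still starts with `# Top Level Heading` rather than `#Top Level Heading`.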
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl' scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl' processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
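
One way to picture the shared structure described above is an abstract base class. Only the class name `TextToJsonl` comes from the commit message; the method names `extract_entries`, `to_jsonl` and `process`, and the `PlaintextToJsonl` subclass, are hypothetical and chosen for this sketch.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class TextToJsonl(ABC):
    """Shared structure for processors that turn text content into JSONL."""

    @abstractmethod
    def extract_entries(self, text: str) -> List[Dict[str, str]]:
        """Format-specific step: split raw text into entries."""

    def to_jsonl(self, entries: List[Dict[str, str]]) -> str:
        """Shared step: serialize one JSON object per line."""
        return "".join(json.dumps(entry, ensure_ascii=False) + "\n" for entry in entries)

    def process(self, text: str) -> str:
        """Shared pipeline: extract entries, then serialize them."""
        return self.to_jsonl(self.extract_entries(text))


class PlaintextToJsonl(TextToJsonl):
    """Hypothetical processor: one entry per non-empty paragraph."""

    def extract_entries(self, text: str) -> List[Dict[str, str]]:
        paragraphs = [p.strip() for p in text.split("\n\n")]
        return [{"raw": p} for p in paragraphs if p]
```

Under this layout a new processor only implements `extract_entries`, while serialization is merged into the base class and the common interface gives type checkers something to work with.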
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents the index of entries from desyncing when entries, embeddings
  are generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
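
The reasoning behind sorted processing can be sketched as follows, assuming embeddings are matched to entries by position. The helper name `files_in_index_order` and the file paths are hypothetical. If two instances discover the same files in different orders, sorting gives both the same entry order, so entry i and embedding i refer to the same content on either side.

```python
from typing import Iterable, List


def files_in_index_order(files: Iterable[str]) -> List[str]:
    """Return files deduplicated and sorted so every instance sees one order."""
    return sorted(set(files))


# Two instances that discovered the same files in different orders.
instance_a = files_in_index_order(["notes/b.md", "notes/a.md", "journal.md"])
instance_b = files_in_index_order(["journal.md", "notes/a.md", "notes/b.md"])
```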
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existing code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
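
A rough sketch of combining `input-files` with a list of `input-filters` is shown below. The function name `get_files`, its signature, and the use of glob patterns for the filters are assumptions for illustration; the commit message only states that filters are now processed as a list instead of a string.

```python
import glob
from typing import List, Optional


def get_files(
    input_files: Optional[List[str]] = None,
    input_filters: Optional[List[str]] = None,
) -> List[str]:
    """Combine explicitly listed files with files matched by any filter."""
    explicit = set(input_files or [])
    matched = set()
    # Each filter is treated as a glob pattern (an assumption of this sketch).
    for pattern in input_filters or []:
        matched.update(glob.glob(pattern, recursive=True))
    # Sort the union so the result is stable across runs.
    return sorted(explicit | matched)
```

A call such as `get_files(["todo.md"], ["notes/*.md", "journal/**/*.md"])` would then return the union of the explicit file and every match from both patterns.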
Debanjum Singh Solanky
d3267554ae Add basic tests for markdown to jsonl conversion
2022-09-11 00:15:27 +03:00