Read Markdown file as utf8 instead of the default encoding used by OS

### Background
  1. Obsidian stores markdown notes as `utf8`[1]
  2. By default, Python's `open` function uses the OS locale encoding[2]
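
  To check which encoding a given machine falls back to, something like the sketch below can help. `default_text_encoding` is a hypothetical helper name used only for illustration; it is not part of khoj.

```python
import locale


def default_text_encoding() -> str:
    """Return the locale encoding that `open` falls back to when no
    `encoding` argument is passed (hypothetical helper, not from khoj)."""
    # Passing False avoids changing the process locale as a side effect.
    return locale.getpreferredencoding(False)


# The printed value depends on the OS and its locale settings.
print(default_text_encoding())
```

  If this prints anything other than a UTF-8 name, `open(path)` without an explicit `encoding` may misread UTF-8 notes on that machine.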

### Issue
  Given the above, if the OS locale encoding isn't `utf8`, reading a markdown file that contains non-ASCII characters can fail with the `UnicodeDecodeError: <locale_encoding> codec can't decode byte` error

### Fix
  - Read markdown files as `utf8`
    The Obsidian plugin is currently the main use-case for markdown files in khoj, and Obsidian stores md files as `utf8`.
    Do not assume utf8 for other content types, like org-mode and beancount, for now.
  - Fail if a file cannot be read as utf8, instead of ignoring decoding errors.
    It is better for the user to realize that their files are not going to be indexed correctly.
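
  The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the khoj code: `read_markdown` is a hypothetical helper, and decoding as ASCII stands in for an OS whose locale encoding is not utf8.

```python
import tempfile
from pathlib import Path


def read_markdown(path: Path) -> str:
    """Read a markdown file as UTF-8 (hypothetical helper, not from khoj).

    errors="strict" is already the default; it is spelled out to show that
    undecodable bytes raise UnicodeDecodeError rather than being ignored.
    """
    with open(path, "r", encoding="utf8", errors="strict") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_bytes("# Café notes\n".encode("utf8"))

    # Explicit UTF-8 decoding works regardless of the OS locale.
    content = read_markdown(note)

    # Assumption: ASCII stands in for a non-UTF-8 locale encoding.
    try:
        with open(note, "r", encoding="ascii") as f:
            f.read()
        locale_read_failed = False
    except UnicodeDecodeError:
        locale_read_failed = True

    # A file that is not valid UTF-8 fails loudly instead of being mangled.
    bad = Path(tmp) / "latin1.md"
    bad.write_bytes("# Café notes\n".encode("latin-1"))
    try:
        read_markdown(bad)
        bad_file_rejected = False
    except UnicodeDecodeError:
        bad_file_rejected = True
```

  Failing on the Latin-1 file is the intended trade-off here: the error surfaces at indexing time rather than producing silently corrupted entries.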

[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
Debanjum 2023-02-07 01:46:42 -03:00 committed by GitHub
commit 99a03da3f7
2 changed files with 11 additions and 4 deletions


@@ -24,13 +24,20 @@ jobs:
   test:
     name: Run Tests
     runs-on: ubuntu-latest
-    steps:
+    strategy:
+      fail-fast: false
+      matrix:
+        python_version:
+          - 3.8
+          - 3.9
+          - 3.10
+    steps:
       - uses: actions/checkout@v3
-      - name: Set up Python 3.10
+      - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: '3.10'
+          python-version: ${{ matrix.python_version }}
       - name: Install Dependencies
         run: |


@@ -97,7 +97,7 @@ class MarkdownToJsonl(TextToJsonl):
         entries = []
         entry_to_file_map = []
         for markdown_file in markdown_files:
-            with open(markdown_file) as f:
+            with open(markdown_file, 'r', encoding='utf8') as f:
                 markdown_content = f.read()
                 markdown_entries_per_file = []
                 for entry in re.split(markdown_heading_regex, markdown_content, flags=re.MULTILINE):