Read Markdown file as utf8 instead of the default encoding used by OS

### Background

1. Obsidian stores markdown notes as `utf8`[1]
2. By default, Python's built-in `open` function uses the OS locale encoding[2]

### Issue

Given the background above, if the OS locale encoding isn't `utf8`, reading a note fails with the `UnicodeDecodeError: <locale_encoding> codec can't decode byte` error.

### Fix

- Read markdown files as `utf8`.
  The Obsidian plugin is currently the main use-case for markdown files in khoj, and it stores md files as `utf8`. Do not assume utf8 for other content types like org-mode or beancount for now.
- Fail if a file cannot be read as utf8, instead of ignoring errors. It is better for users to realize that their files are not going to be indexed correctly.

[1]: https://forum.obsidian.md/t/better-handle-md-files-not-stored-in-utf8-format/13524/3
[2]: https://docs.python.org/3/library/functions.html#open
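
The behaviour described in the fix can be sketched in isolation as follows. This is a minimal illustration, not the khoj implementation: the helper name `read_markdown` and the sample file names are hypothetical. It relies on `open` defaulting to strict error handling, so bytes that are not valid utf8 raise an error rather than being silently dropped.

```python
import tempfile
from pathlib import Path


def read_markdown(path):
    """Hypothetical helper: read a markdown file as utf8, failing on invalid bytes."""
    # An explicit encoding makes the result independent of the OS locale.
    # errors is left at its default ('strict'), so undecodable bytes raise
    # UnicodeDecodeError instead of being ignored.
    with open(path, "r", encoding="utf8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # A note stored as utf8 is read back unchanged, whatever the locale is.
    good = Path(tmp) / "note.md"
    good.write_bytes("# Café ☕\n".encode("utf8"))
    content = read_markdown(good)

    # A note stored in another encoding (latin-1 here) fails loudly.
    bad = Path(tmp) / "latin1.md"
    bad.write_bytes("# Café\n".encode("latin-1"))
    try:
        read_markdown(bad)
        failed = False
    except UnicodeDecodeError:
        failed = True
```

Leaving the error unhandled inside the helper is deliberate: the caller sees the failure and can report which file is not utf8, rather than indexing partially decoded text.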
2024-11-24 07:55:07 +01:00 · 2023-02-07 01:46:42 -03:00 · 2023-02-07 01:46:42 -03:00 · 99a03da3f7
commit 99a03da3f7
parent d3e82b918f c11f7b47e4
2 changed files with 11 additions and 4 deletions
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -24,13 +24,20 @@ jobs:
  test:
    name: Run Tests
    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python_version:
+          - '3.8'
+          - '3.9'
+          - '3.10'
    steps:
      - uses: actions/checkout@v3
-      - name: Set up Python 3.10
+      - name: Set up Python
        uses: actions/setup-python@v4
        with:
-          python-version: '3.10'
+          python-version: ${{ matrix.python_version }}
      - name: Install Dependencies
        run: |
--- a/src/processor/markdown/markdown_to_jsonl.py
+++ b/src/processor/markdown/markdown_to_jsonl.py
@@ -97,7 +97,7 @@ class MarkdownToJsonl(TextToJsonl):
        entries = []
        entry_to_file_map = []
        for markdown_file in markdown_files:
-            with open(markdown_file) as f:
+            with open(markdown_file, 'r', encoding='utf8') as f:
                markdown_content = f.read()
                markdown_entries_per_file = []
                for entry in re.split(markdown_heading_regex, markdown_content, flags=re.MULTILINE):