View each chunk of a non-hierarchical file as a separate corpus

If raw_is_compiled is set, the document being chunked has no inherent
hierarchical structure.

The corpus_id shouldn't be shared for these chunks.

Otherwise, all chunks of a plain text file are shown as a single result
during the default (dedupe) search.
Debanjum Singh Solanky 2024-07-07 13:25:25 +05:30
parent 2d35004371
commit 00620356e6

@@ -108,7 +108,7 @@ class TextToEntries(ABC):
 raw=entry.raw,
 heading=entry.heading,
 file=entry.file,
-corpus_id=corpus_id,
+corpus_id=uuid.uuid4() if raw_is_compiled else corpus_id,
 )
 )
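
The effect of this change on deduplicated search can be illustrated with a minimal sketch. Everything except `corpus_id`, `raw_is_compiled`, and `uuid.uuid4()` is hypothetical: the `Chunk` class, `assign_corpus_ids`, and `dedupe_by_corpus` are illustrative stand-ins, and the dedupe step is assumed to keep one result per `corpus_id`, which may differ from how the project's search actually deduplicates.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunked entry."""
    compiled: str
    corpus_id: str


def assign_corpus_ids(texts, raw_is_compiled):
    """Assign corpus ids the way the changed line does.

    Chunks of a non-hierarchical file (raw_is_compiled is True) each get a
    fresh id; otherwise all chunks share one id.
    """
    shared_id = uuid.uuid4()
    return [
        Chunk(
            compiled=text,
            corpus_id=str(uuid.uuid4() if raw_is_compiled else shared_id),
        )
        for text in texts
    ]


def dedupe_by_corpus(chunks):
    """Assumed dedupe behavior: keep the first chunk seen per corpus_id."""
    first_per_corpus = {}
    for chunk in chunks:
        first_per_corpus.setdefault(chunk.corpus_id, chunk)
    return list(first_per_corpus.values())


texts = ["first paragraph", "second paragraph", "third paragraph"]

# Plain text file: every chunk survives dedupe as its own result.
plain_results = dedupe_by_corpus(assign_corpus_ids(texts, raw_is_compiled=True))

# Shared corpus_id: the chunks collapse into a single result.
shared_results = dedupe_by_corpus(assign_corpus_ids(texts, raw_is_compiled=False))
```

Under these assumptions, the plain text chunks stay distinct after dedupe, while chunks that share a `corpus_id` collapse into one result, which is the behavior the commit message describes avoiding for plain text files.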