Skip to content

Conversation

@rodjuncode
Copy link

Hello! I was trying to train my own vector model, but got decoding errors during the linebreaks removal. I figured out that forcing the encoding (encoding="utf-8") during the file opening would fix it. Added to both input and output files. Also, added "ensure_ascii=False" to the json dump, so the "\u" characters are dumped as human readable chars.

Hope this helps, and thanks for the great work!

Forcing utf 8 guarantees compatibility with non-english languages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant