Various technical articles, IT-related tutorials, software information, and development journals
Thursday, May 21, 2020
Facebook's download-your-information JSON encodes non-ASCII as UTF-8
Today I wrote a script to re-upload Facebook posts from the JSON files produced by their download-your-information tool. Getting the post text itself was straightforward except for non-ASCII characters. JSON's \uXXXX escape sequences carry 16 bits each, but Facebook's never used more than a byte (nothing above \u00ff). Apparently they store their strings as UTF-8 and write each byte out as its own JSON escape, so a JSON parser reads every UTF-8 byte as a separate character. The .NET Framework (used by my script) represents strings as UTF-16, so a multi-byte character such as é (UTF-8 bytes C3 A9) came through mangled as the two characters Ã©. To get the correct text, I had to take the string from the JSON, map each character to a byte, and decode the byte array as UTF-8 to produce a correct UTF-16 string in memory.
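Here is a minimal sketch of that repair. It is written in Python rather than .NET, so it is not my actual script, and the function name fix_facebook_text and the sample "post" key are made up for illustration. It relies on the fact that Latin-1 maps code points 0 to 255 onto the same byte values, which makes it a convenient way to turn each character back into a byte.

```python
import json


def fix_facebook_text(mangled: str) -> str:
    """Reinterpret a string whose characters are really UTF-8 bytes.

    Hypothetical helper: each character is assumed to have a code point
    in the range 0-255. Encoding as Latin-1 maps every such character to
    the byte with the same value; decoding those bytes as UTF-8 then
    recovers the intended text. A character above 255 raises
    UnicodeEncodeError, which signals the string was not mangled this way.
    """
    return mangled.encode("latin-1").decode("utf-8")


# Sample input imitating the export: "é" (UTF-8 bytes C3 A9) is written
# as two separate escapes. The "post" key is an assumption for the demo.
raw = '{"post": "caf\\u00c3\\u00a9"}'
parsed = json.loads(raw)

fixed = fix_facebook_text(parsed["post"])
print(fixed)
```

Plain ASCII text passes through unchanged, since ASCII characters are single bytes in both Latin-1 and UTF-8, so the function should be safe to apply to every string in the export as long as all of them were written the same way.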
Labels:
facebook