Thursday, May 21, 2020

Facebook's download-your-information JSON encodes non-ASCII as UTF-8

Today I wrote a script to reupload Facebook posts from the JSON files produced by their download-your-information tool. Getting the post text itself was straightforward except for non-ASCII characters. A JSON \uXXXX escape sequence specifies a 16-bit code unit, but the values in Facebook's escapes never exceeded one byte. Apparently they store their strings as UTF-8 and write each byte out as its own JSON escape. A JSON parser treats each escape as a character, and the .NET Framework (used by my script) represents strings as UTF-16, so every UTF-8 byte became a separate UTF-16 character and the text came out mangled. To get the correct text, I had to take the string from the JSON, map each character to a byte, and decode the byte array as UTF-8, which produces the correct UTF-16 string in memory.
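
The fix can be sketched in a few lines. This is not the script described above (that one ran on the .NET Framework); it is a minimal Python analogue of the same three steps. The function name fix_facebook_text and the sample JSON with its "post" key are made up for illustration and are not taken from Facebook's actual export format.

```python
import json

def fix_facebook_text(mangled: str) -> str:
    """Undo the byte-per-escape mangling (hypothetical helper name).

    Every character in the parsed string is assumed to hold one UTF-8
    byte (a value from 0 to 255). Latin-1 maps each such character to the
    byte with the same value, and decoding those bytes as UTF-8 recovers
    the intended text.
    """
    return mangled.encode("latin-1").decode("utf-8")

# Illustrative sample: "café", where é (UTF-8 bytes C3 A9) was written
# as two separate \u escapes, one per byte.
raw = r'{"post": "caf\u00c3\u00a9"}'

parsed = json.loads(raw)["post"]   # mangled: each byte became a character
fixed = fix_facebook_text(parsed)  # repaired text

print(fixed)
```

Encoding as Latin-1 raises UnicodeEncodeError if the string contains a character above U+00FF, which is a useful signal that a given string was not mangled this way and should be left alone.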
