Difference between UNICODE and UTF-8 files in Windows Notepad

Posted by decipherinfosys on February 25, 2013

While generating flat files through SSIS for a feed process at a client site, we noticed that the developer had left the file encoding set to UNICODE, on the assumption that this was the best practice and that the receiving system would have no trouble consuming the file.  The file was being opened in Notepad.

When you generate a flat file in Windows, you can choose the encoding, just as you can when saving from Notepad: ANSI, UNICODE, Unicode big-endian or UTF-8.  What is important to understand is that UNICODE here really means UTF-16 little-endian, and ANSI means the system's default code page, which is Code Page 1252 on English and Western European installations of Windows.
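To make the difference concrete, here is a minimal Python sketch that encodes the same short string under each of the four options and prints the resulting bytes. The mapping from Notepad's labels to Python codec names, and the helper name encode_like_notepad, are our own assumptions for illustration. The sketch also assumes the classic Notepad behaviour of prefixing UTF-8 files with a BOM, and it takes ANSI to be Code Page 1252.

```python
import codecs

# Assumed mapping of Notepad's encoding labels to (Python codec, BOM bytes).
# ANSI is taken to be Code Page 1252, which holds on Western installs only.
NOTEPAD_ENCODINGS = {
    "ANSI": ("cp1252", b""),
    "UNICODE": ("utf-16-le", codecs.BOM_UTF16_LE),
    "Unicode big-endian": ("utf-16-be", codecs.BOM_UTF16_BE),
    "UTF-8": ("utf-8", codecs.BOM_UTF8),
}


def encode_like_notepad(text, label):
    """Hypothetical helper: return the BOM (if any) followed by the encoded text."""
    codec, bom = NOTEPAD_ENCODINGS[label]
    return bom + text.encode(codec)


for label in NOTEPAD_ENCODINGS:
    # Print the raw bytes in hex so the BOM and the byte order are visible.
    print(f"{label:20} {encode_like_notepad('café', label).hex(' ')}")
```

Note how the accented character takes one byte in ANSI, two bytes in UTF-8, and two bytes in either UTF-16 variant, where the plain ASCII letters also double in size.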

Microsoft’s Notepad writes UTF-16 with a Byte Order Mark (BOM) and also looks for that BOM when reading the file.  If you are unaware of what a BOM is, read the Wikipedia entry on the byte order mark.  So, in the case of a UNICODE file, the BOM is what tells Notepad whether the file is UTF-16 big-endian or little-endian.  If Notepad cannot find a BOM, it calls a Windows API function named IsTextUnicode, which examines the data and attempts to guess the encoding.  If that guess is wrong, Notepad displays the wrong glyphs.
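The BOM check itself is easy to sketch. The Python below is only an illustration of the idea, not Notepad's actual code: the function names are hypothetical, the IsTextUnicode heuristic is not reproduced, and falling back to Code Page 1252 when no BOM is found is our assumption.

```python
import codecs

# Known BOM byte sequences and the codec each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (codec, bom_length), or (None, 0) when the data has no BOM."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0


def decode_with_bom(data, fallback="cp1252"):
    """Decode using the BOM if present; otherwise use the assumed fallback."""
    codec, skip = sniff_bom(data)
    if codec is None:
        # No BOM: this is the point where Notepad would have to guess.
        return data.decode(fallback, errors="replace")
    return data[skip:].decode(codec)


with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
without_bom = "hi".encode("utf-16-le")
print(repr(decode_with_bom(with_bom)))     # BOM found, decoded as UTF-16 LE
print(repr(decode_with_bom(without_bom)))  # no BOM, fallback mis-decodes it
```

The second call shows the failure mode in miniature: without a BOM, the UTF-16 bytes are read as single-byte text, so every other character comes out as a NUL.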

The best approach, in our opinion, is to use UTF-8 everywhere.  It is a universally accepted encoding, so even if you are sharing files across different operating systems, you can still be assured that the receiving system reads the data correctly.
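As a small illustration of that advice outside of SSIS, the Python sketch below writes a delimited feed file with the encoding stated explicitly, rather than relying on the platform default, and reads it back. The function names, the pipe delimiter, the file name and the sample rows are all made up for the example.

```python
import os
import tempfile


def write_feed(path, rows, delimiter="|"):
    """Write rows as delimited text, explicitly encoded as UTF-8 (no BOM)."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            f.write(delimiter.join(row) + "\r\n")


def read_feed(path, delimiter="|"):
    """Read the file back, again naming the encoding explicitly."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return [line.rstrip("\r\n").split(delimiter) for line in f]


# Hypothetical sample data containing non-ASCII characters.
rows = [["1", "Zoë", "München"], ["2", "José", "São Paulo"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feed.txt")
    write_feed(path, rows)
    print(read_feed(path) == rows)
```

Naming the encoding on both the writing and the reading side is the point of the example: neither system then depends on the other's default code page.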

