
Working with Unicode in a Spreadsheet

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The standard is maintained by the Unicode Consortium, and as of March 2019 the most recent version, Unicode 12.0, contains a repertoire of 137,993 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji. More information may be found at https://en.wikipedia.org/wiki/Unicode. The full set of Unicode characters can be found at http://www.unicode.org/charts/, https://en.wikipedia.org/wiki/List_of_Unicode_characters or https://unicode-table.com/en. You can search for a Unicode character by name at https://unicodelookup.com.

Genstat internally represents Unicode using the UTF-8 encoding, which is backwards compatible with ASCII. UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format – 8-bit. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so valid ASCII text is also valid UTF-8-encoded Unicode. For more on UTF-8 see https://en.wikipedia.org/wiki/UTF-8.
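The variable-width behaviour described above can be checked outside Genstat. The following is a minimal sketch in Python (not Genstat code); the sample characters and the helper name utf8_width are chosen purely for illustration.

```python
# Illustration only: Python's built-in str.encode reports the UTF-8 bytes
# used for each character. The sample characters are arbitrary examples.
samples = ["A", "α", "€", "😀"]


def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point in hex, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_width(ch)} byte(s) -> {encoded.hex(' ')}")
```

The ASCII letter takes a single byte with the same value as in ASCII, while characters with higher code points take two, three or four bytes.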

Genstat will handle both UTF-8 and Unicode text from text files and the clipboard, and converts all Unicode to UTF-8. UTF-8 is the format used within Excel, and Genstat supports loading UTF-8 from Excel and Open Office spreadsheets. How this is done can be controlled with the options for handling Unicode in data in Tools | Spreadsheet (Conversions tab) and in column names in Tools | Spreadsheet (Columns tab).

When a spreadsheet contains Unicode characters, they are displayed as expected in the data, names, descriptions and factor labels. Any dialog that accepts text also accepts Unicode from the clipboard or keyboard. Unicode can be pasted from the clipboard into cells as usual. However, when any of the cells in a column contain Unicode, the text cannot be edited in place (in-place editing is currently supported only for ASCII text); instead, the Edit Unicode Text dialog pops up for you to type in the Unicode text. The same behaviour occurs in the Edit Factor Levels and Labels, Rename Columns, Recode Column and Code to Groups dialogs.

The fonts in the cell edit dialogs will be larger than in the normal ASCII version, to display the Unicode symbols better. If your keyboard does not support Unicode (various language keyboards, e.g. Chinese or Thai, support systems for typing Unicode characters in the supported language), you can copy and paste the characters from the Internet, or use the TXINTEGERCODES directive to create characters from their Unicode decimal codes. A Unicode character’s code is normally given in hex format, and this must be converted to its decimal equivalent for use in Genstat. For example, the Greek letter alpha (α) has a Unicode hex code of 0x3B1, which is 945 in decimal (945 = 16*16*3 + 16*11 + 1), so

TXINTEGERCODES [CONVERTTO=text] CODE=945; TEXT=alpha
PRINT alpha

will put an α in the output window, which may then be copied and pasted into any field in a spreadsheet, menu or dialog. Note that https://unicodelookup.com is a good website for finding the decimal code for a Unicode character.
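If you only have the hex code for a character, the conversion to decimal can also be done programmatically. The following is a minimal sketch in Python (not Genstat code); the function name hex_to_decimal and the accepted prefixes are assumptions made for this illustration.

```python
def hex_to_decimal(hex_code: str) -> int:
    """Convert a Unicode hex code such as '0x3B1', 'U+03B1' or '3B1' to decimal."""
    cleaned = hex_code.strip().upper()
    # Drop the common prefixes used when quoting Unicode code points.
    for prefix in ("U+", "0X"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return int(cleaned, 16)


code = hex_to_decimal("0x3B1")
print(code)        # decimal value to pass to TXINTEGERCODES
print(chr(code))   # the character that the code represents
```

For the alpha example above, this gives the same value as the hand calculation 16*16*3 + 16*11 + 1.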

In Windows 10 you can also type Unicode characters into edit windows by pressing and holding down the Alt key, typing the decimal code for the Unicode character on the numeric keypad, and then releasing the Alt key. The number lock mode for the keypad needs to be on for this to work; press the Num Lock key if the number lock indicator light is not on. For example, α can be entered with the keystrokes Alt+945, releasing the Alt key after typing the 5.

If the Unicode characters are hard to distinguish, you may need a larger font size, which can be set in Tools | Spreadsheet (Appearance tab). Note: if the Unicode character is not supported in your selected font, Windows will substitute another font that supports the character. The substituted font may vary according to which Windows language packs you have installed.

See also

How to Handle Unicode characters in a column name dialog
Rename Column Cursor 
Duplicate Column Name Warning
Column Attributes/Format
Edit Factor Levels and Labels
Rename Columns
Recode Column
Code to Groups
Edit Unicode Text
Tools | Spreadsheet (Columns tab)
Tools | Spreadsheet (Appearance tab)
TXINTEGERCODES directive

Updated on September 13, 2019
