Development on GAE using Khmer Unicode

To develop with Khmer Unicode text, you need to configure a few settings. This tutorial continues the series that began with Setup GAE Environment and completes a Hello World example with Khmer Unicode text. It shows how to set up a Khmer Unicode project using Eclipse with PyDev for Python on Google App Engine (GAE). It covers the Unicode-specific setup needed to handle Unicode text and how to display Khmer Unicode correctly in your code and in the web browser output.

Requirements

  1. You have already set up the environment by following the tutorial Setup GAE Environment.
  2. Your development system is set up for Khmer Unicode and you can type Khmer Unicode text.
  3. You have basic knowledge of Python. This is not a Python tutorial.

Set the File Encoding to UTF-8 (8-bit Unicode Transformation Format)

To display Khmer Unicode in Eclipse, first make sure your project's file encoding is set to UTF-8 so that Khmer Unicode text is preserved when you save the file.
  1. Right-click the helloworld project and select Properties.
  2. In the Resource section, under "Text file encoding", select Other: UTF-8. Then under "New text file line delimiter", select Unix. Unix line endings suit the GAE environment.
  3. Also make sure the default file encoding preference is UTF-8. Go to Window -> Preferences, then General -> Workspace, and select the same options as in the previous step.
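
To double-check the result of these settings outside Eclipse, a small script can confirm that a saved file really is UTF-8 with Unix line endings. This is only a sketch: the function name check_source_file is hypothetical, and it is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io


def check_source_file(path):
    """Return a list of problems found in the file (empty list if none).

    check_source_file is a hypothetical helper, not part of Eclipse or GAE.
    """
    problems = []
    with io.open(path, "rb") as f:      # read raw bytes, no decoding yet
        data = f.read()
    try:
        data.decode("utf-8")            # fails if the bytes are not UTF-8
    except UnicodeDecodeError as exc:
        problems.append("not valid UTF-8: %s" % exc)
    if b"\r\n" in data:                 # CRLF means Windows line endings
        problems.append("contains Windows (CRLF) line endings")
    return problems
```

Calling check_source_file on the path to helloworld.py should return an empty list once the project is configured as described.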

Configure Unicode Display

Next, configure the font so that Eclipse can display Khmer Unicode text. If you skip this step, Khmer Unicode text will not appear in the editor at all, even though the characters exist in the file. This differs from a web browser, which displays a box for each character when Khmer Unicode is not set up properly.
  1. In the Preferences window, select General -> Appearance -> Colors and Fonts.
  2. Click Basic, then "Text Font"; the three buttons on the right become enabled.
  3. Click "Change..." and select a Khmer Unicode font (e.g. Khmer OS System).
  4. Click OK in the Font dialog.
  5. Do the same for the Console font in the Debug category.
  6. Click OK in the Preferences window.

Now test that you can see Khmer text in the text editor:
  1. From the Package Explorer, double-click helloworld.py (under helloworld -> src).
  2. Switch to the Khmer Unicode keyboard layout (press left Alt and right Shift together).
  3. Enter Khmer text in the file by changing the last line to something like this:
    print 'សួស្ដី ពិភពលោក!  Hello, world!'
    
  4. To confirm that the encoding is correct, save the file (press Ctrl-S). It should save without any prompt. If you get the error dialog "Save could not be completed.", go back to the file encoding setup steps above. As a side note, if you run this code directly, you may get the following error:
    File "C:\workspace\helloworld\src\helloworld.py", line 3
    SyntaxError: Non-ASCII character '\xe1' in file C:\workspace\helloworld\src\helloworld.py on line 3,
    but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
    
    In this case, Python does not know that the file contains UTF-8 characters. See the next steps to complete the setup. If you run the code under GAE, you will not hit this error, but it is still recommended to go through the next steps.

Handling UTF-8 in Python

To tell the Python compiler that your code contains UTF-8 text, declare the file encoding in the file header. To do that, add the following line at the beginning of the file:
# -*- coding: utf-8 -*-
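As a rough illustration of how such a declaration is recognised, the sketch below applies the pattern documented in PEP 263 to the first two lines of a source file. The names CODING_RE and declared_encoding are hypothetical helpers introduced only for this example.

```python
import re

# Pattern documented in PEP 263 for an encoding declaration comment.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named on line 1 or 2, or None if none is found.

    declared_encoding is a hypothetical helper for illustration only.
    """
    for line in source_lines[:2]:       # only the first two lines count
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None
```

For example, declared_encoding(["# -*- coding: utf-8 -*-", "print 'x'"]) returns "utf-8", while a file with no such comment in its first two lines yields None.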
When you have UTF-8 strings in your code, prefix them with a "u" to tell the compiler that they are Unicode text. For example:
# -*- coding: utf-8 -*-

s = 'ក'     # declare as a regular (byte) string
utf = u'ក'  # declare as a Unicode string

print "s type:", type(s), " utf type:", type(utf)
print "s length:", len(s), " utf length:", len(utf)
This code outputs:
s type: <type 'str'>  utf type: <type 'unicode'>
s length: 3  utf length: 1
Notice that the same string is interpreted differently. The variable s, which is not declared as Unicode, has length 3 (the 3 bytes of its UTF-8 encoding), whereas the variable utf is of type unicode and has length 1, as intended.
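
The relationship between the two lengths can also be seen by encoding and decoding explicitly. The following is a minimal sketch, written so that it should behave the same on Python 2 and Python 3; the variable names text and raw are chosen only for this example.

```python
# -*- coding: utf-8 -*-
text = u'ក'                    # one Khmer letter, KA (U+1780)
raw = text.encode('utf-8')     # the bytes that end up in the saved file

print(len(text))               # number of Unicode characters
print(len(raw))                # number of UTF-8 bytes
print(raw.decode('utf-8') == text)   # decoding restores the original text
```

Decoding the bytes back with the same encoding gives the original one-character string, which is why working with Unicode strings makes counting and slicing Khmer text more predictable.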

Output Khmer Unicode

Notice that the above example does not print the Unicode text itself. If you print it with "print utf", you may get an error like this:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
This is because Python tries to convert the Unicode string to the output's default encoding, which here cannot represent Khmer characters. To choose the encoding explicitly, do this:
print utf.encode("utf-8")
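The difference between the failing and the working case can be reproduced without a console. The sketch below assumes a Windows code page such as cp1252 as the limited encoding; the actual code page behind the error above is not stated, so treat that choice as an assumption.

```python
# -*- coding: utf-8 -*-
text = u'ក'

try:
    # cp1252 is an assumed example of a code page with no Khmer characters.
    text.encode('cp1252')
    outcome = 'encoded'
except UnicodeEncodeError:
    outcome = 'cannot encode'

print(outcome)
print(repr(text.encode('utf-8')))   # UTF-8 can represent every character
```

The first attempt fails because the code page has no slot for the Khmer letter, while UTF-8 encodes it without trouble.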
Note that in GAE you can output a Khmer string as a normal string, without marking it as Unicode, and print it without encoding it to UTF-8. The drawback is that such a string is difficult to parse properly, because it is handled as bytes rather than characters. To complete the new Khmer Unicode Hello World version, we need to do a few more things.
  1. Tell the web browser that the character encoding of the output is UTF-8 by changing the HTTP header line to:
    print 'Content-Type: text/html; charset=UTF-8'
    
  2. Now tell the browser to render the text with a Khmer Unicode font by styling the HTML tag:
    print "<h1 style='font-family: \"Khmer OS\"'>"
    
  3. Now change the output string to Khmer Unicode and specify the encoding:
    print u'សួស្ដី ពិភពលោក!  Hello, world!'.encode("UTF-8")
    
Here is the complete code:
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=UTF-8'
print ''
print "<h1 style='font-family: \"Khmer OS\"'>"
print u'សួស្ដី ពិភពលោក!  Hello, world!'.encode("UTF-8")
print "</h1>"
Here is the output:
Hello World output in Khmer
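
The same response can also be assembled as a single byte string, which makes it easier to inspect before it is sent. This is only a sketch of that idea: build_response and its font parameter are hypothetical names, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
def build_response(greeting, font="Khmer OS"):
    """Return the full response (header, blank line, body) as UTF-8 bytes.

    build_response is a hypothetical helper, not a GAE API.
    """
    lines = [
        u"Content-Type: text/html; charset=UTF-8",
        u"",                                        # blank line ends the headers
        u"<h1 style='font-family: \"%s\"'>" % font,
        greeting,
        u"</h1>",
    ]
    # Join as Unicode first, then encode once at the output boundary.
    return u"\n".join(lines).encode("utf-8")
```

Keeping the text as Unicode until the last step and encoding once mirrors the explicit .encode("UTF-8") call in the complete code above.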

Tip

Even without a font-family specified, most browsers, such as Firefox or Internet Explorer, can still render Khmer Unicode text acceptably, but some ligatures might not display correctly. It is recommended that you specify the font-family to ensure that the text renders correctly. Setting letter spacing, for example "letter-spacing:-0.05em;", causes Firefox 2 to render the text incorrectly, while Microsoft Internet Explorer renders it fine. You have now completed a basic Khmer Unicode Hello World example. Enjoy coding and share your knowledge.