Digital Representation of Khmer Language

This article is excerpted from "INFORMATION RETRIEVAL FOR KHMER DOCUMENTS: CHALLENGES AND APPROACHES TO WORD SEGMENTATION", a thesis written by Phylypo Tum in August 2007. The content comes from Chapter 2, "KHMER OVERVIEW", one of the thesis's seven chapters.


Cambodian, or Khmer (pronounced “kmâr”), is the official language of Cambodia, a Southeast Asian country almost half the size of California that is bordered by Thailand, Laos, and Vietnam. Cambodia is home to about 14 million people. Khmer speakers also live in the United States, France, Canada, Australia, and Malaysia, and the language is spoken in the northern part of Thailand and the southern part of Vietnam.

Khmer belongs to the eastern branch of the Mon-Khmer language family, which is a subgroup of the Austroasiatic language family (Huffman, 1967; Sak-Humphry, 1996). Khmer script is derived from Pallava, a script of southern India. The oldest dated Khmer inscription is from A.D. 611 (Huffman). The language has remained largely unique and independent since its origin. Khmer script is written from left to right and has no capitalization. Instead, “aksaa muul”, or round style, is used for headlines and bold text, while “aksaa crieng”, or slanted style, is used for normal text (The Unicode Consortium, 2003).

Khmer is a non-tonal language: differences in pitch do not change the meaning of a word. Khmer words are not inflected to mark tense, singular or plural, or gender. Like many other Asian scripts, Khmer script does not separate words with spaces.
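The absence of spaces between words is what makes word segmentation a problem for information retrieval. The following minimal Python sketch is an illustration added here, not code from the thesis. It shows that ordinary whitespace tokenization returns a two-word Khmer phrase as a single token. The helper name `is_khmer` and the sample phrase are chosen for this example; the only outside fact assumed is that the Unicode Khmer block spans U+1780 to U+17FF.

```python
# Unicode "Khmer" block: U+1780 through U+17FF (inclusive).
KHMER_BLOCK = range(0x1780, 0x1800)


def is_khmer(ch: str) -> bool:
    """Return True if the single character lies in the Unicode Khmer block."""
    return ord(ch) in KHMER_BLOCK


# Illustrative phrase "ភាសាខ្មែរ" ("Khmer language"): the word for "language"
# followed by the word for "Khmer", written with no space between them.
phrase = "\u1797\u17b6\u179f\u17b6\u1781\u17d2\u1798\u17c2\u179a"

# Whitespace tokenization, the usual first step for space-delimited scripts.
tokens = phrase.split()

# Count how many code points of the phrase fall inside the Khmer block.
khmer_chars = sum(is_khmer(c) for c in phrase)

print(len(tokens))               # 1: the two words come back as one token
print(khmer_chars, len(phrase))  # 9 9: every code point is in the Khmer block
```

Because splitting on whitespace cannot find the boundary between the two words, a retrieval system for Khmer needs a dedicated segmentation step before indexing.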

See the PDF version for the following sections:

  1. Introduction to Khmer Script
  2. Properties of Khmer Script
  3. Digital Representation of Khmer Script
  4. Introduction to Khmer Unicode
  5. References