Friday, December 22, 2006

Read and Write Vietnamese text in UTF-8

Some of us have had problems viewing Vietnamese text on the screen. To help, I pointed to other places, but to some extent that muddled the subject even more. Well, I had made the naive assumption that your PC's setup is perfect...like mine, and that you could read Vietnamese text right from the beginning, so that our core effort would be on typing Vietnamese text, not reading it! :-)

Since Tran Ngoc Lan, one of our ND classmates, raised the aforementioned issue, I have put up a few steps below to correct it, and if they cannot crack the nut, it will swell into one of our New Year's resolutions! Yep, so here I go quickly:

A. To read Vietnamese text

For IE6:

1. Tools/Internet Options/Accessibility/
Check the "ignore font styles specified on Webpages" box. See image 1.
Comment: many webpages call for fonts that are not
installed in C:\Windows\fonts, or for old fonts that do
not support Unicode. When Windows cannot locate
them, it falls back on its own fonts. It is best to ignore the
page's font choices and force Windows to use the system fonts
we select, which we will do in the next step.

2. Tools/Internet Options/Fonts/
Use the settings shown in image 2:
Language script: "Latin based"
Webpage font: "Times New Roman" or "Arial"
Plain text font: "Courier New"
Comment: these Unicode font files are larger than 200 KB. To test which fonts can display UTF-8 text properly, copy some Vietnamese text from our blog to the clipboard ([ctrl][c]), then open MS-Word/Wordpad/EditPlus and paste ([ctrl][v]) into its working window. Set the screen font in the editor to Arial or Times New Roman to display the text. These two fonts are known to support Unicode.

3. Click [OK] to go up one level, then [OK] again to return to the main webpage, where the text should now be displayed properly. Click View/Encoding on the menu bar to check that it is set to UTF-8. Adjust the text size in View/Text Size. See image 3.

For Firefox:

1. Tools/Options/Content/ click on the [Advanced] button.
See image 4.

2. Uncheck the box "allow pages to choose their own fonts, instead of selections above". Click the drop-down button on "default character encoding" and select "UTF-8". The rest can be set as shown in image 5; later we'll come back here and change the font styles if needed.

3. Click OK to go back one level, then OK again to go back to the main page. We should see the text displayed properly. To adjust the text size, go to View/Text Size or use the shortcut keys [ctrl][+] or [ctrl][-]. See image 6.


Note: To read a Vietnamese text message on Yahoo webmail, we must manually force the web browser into UTF-8 encoding, then refresh the page.
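
To see why forcing UTF-8 helps, here is a minimal Python sketch of what goes wrong when a browser guesses the encoding. The sample text and the choice of Latin-1 as the wrong guess are my assumptions for illustration only; a real browser may guess something else.

```python
# Hypothetical sample text: "tiếng Việt", written with escapes so the
# example does not depend on how this page itself is encoded.
text = "ti\u1ebfng Vi\u1ec7t"

# What a UTF-8 webpage actually sends over the wire.
utf8_bytes = text.encode("utf-8")

# A browser that wrongly assumes a single-byte Western encoding
# (Latin-1 is used here as a stand-in) turns every byte into one character.
garbled = utf8_bytes.decode("latin-1")

# Forcing UTF-8 amounts to decoding the very same bytes correctly.
fixed = utf8_bytes.decode("utf-8")

print(repr(garbled))  # scrambled text, one character per byte
print(fixed)          # the original Vietnamese text
```

The bytes never change; only the browser's interpretation of them does, which is why a refresh after switching the encoding is enough.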

B. To write Vietnamese text:
This requires installing a Vietnamese keyboard driver, a Unicode text editor and Unicode fonts.
Why? Read about the Nuts and Bolts here

1. Install and run the Unikey driver software:
A popular Vietnamese keyboard driver is Unikey, which can be downloaded here. After installing the driver, double-click its shortcut icon to show its current settings box. We will keep the default settings, which should match image 7 exactly, that is:

"Bang ma: unicode" (encoding scheme: unicode UTF-8)
"Kieu go: telex" (typing style: telex)
"Phim chuyen: CTL-SHFT" (switch key: CTL-SHFT)

So, when we press CTL-SHFT, the little icon in the bottom right corner (Notification Area/System Tray) will toggle between [E] and [V], that is, between [E]nglish and [V]ietnamese mode.
Note: when typing a password (seen as ****...) to log in to a network, remember to switch back to [E] mode.

2. Run a Unicode-capable text editor:
With the Unikey driver running in the background (in [V] mode), launch Wordpad/MS-Word/EditPlus.
Press [Alt][Tab] or use the mouse pointer to switch to a window that currently shows a webpage, such as this webpage. Drag the mouse over some Vietnamese text and press CTL-C to copy it to the clipboard. Switch back to the text editor and press CTL-V to paste it into the editing window. The Vietnamese text should be displayed properly. If not, try different screen fonts such as Times New Roman or Arial.
Start typing some Vietnamese text. For further help on telex and other typing styles, click on the Unikey [huong dan] (help) button. The following was extracted for quick reference on the telex style:

Vietnamese typing keys of the TELEX style:

s sắc (acute)
f huyền (grave)
r hỏi (hook above)
x ngã (tilde)
j nặng (dot below)
z removes a tone mark already typed. Example: toansz = toan

w the breve in ă, the horn in ư and ơ
A lone w automatically becomes ư.

aa â
dd đ
ee ê
oo ô

] quick key for ư
[ quick key for ơ
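
For the curious, the table above can be sketched in a few lines of Python. This is only a toy under my own assumptions, not how Unikey is actually written: the function names are made up, only lowercase keys are handled, the bracket shortcuts are left out, and the tone mark is simply put on the last vowel of the word, which is wrong for some words (a real driver follows proper tone placement rules).

```python
import unicodedata

# Combining marks for the five telex tone keys in the table above.
TONE_KEYS = {
    "s": "\u0301",  # sắc (acute)
    "f": "\u0300",  # huyền (grave)
    "r": "\u0309",  # hỏi (hook above)
    "x": "\u0303",  # ngã (tilde)
    "j": "\u0323",  # nặng (dot below)
}
TONE_MARKS = set(TONE_KEYS.values())
# aa, dd, ee, oo -> â, đ, ê, ô
DOUBLES = {"a": "\u00e2", "d": "\u0111", "e": "\u00ea", "o": "\u00f4"}
# aw, ow, uw -> ă, ơ, ư
W_FORMS = {"a": "\u0103", "o": "\u01a1", "u": "\u01b0"}
VOWELS = set("aeiouy\u00e2\u00ea\u00f4\u0103\u01a1\u01b0")


def strip_tone(ch):
    """Remove any tone mark but keep the letter shape (â, ơ, ...)."""
    decomposed = unicodedata.normalize("NFD", ch)
    kept = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def last_vowel_index(out):
    """Index of the last vowel in the current word, or None if there is none."""
    for i in range(len(out) - 1, -1, -1):
        if out[i].isspace():
            return None
        if strip_tone(out[i]) in VOWELS:
            return i
    return None


def telex_to_text(keys):
    """Convert a string of telex key presses into Vietnamese text (toy version)."""
    out = []
    for key in keys:
        prev = out[-1] if out else ""
        idx = last_vowel_index(out)
        if key in TONE_KEYS and idx is not None:
            # Simplification: the tone always goes on the last vowel.
            toned = strip_tone(out[idx]) + TONE_KEYS[key]
            out[idx] = unicodedata.normalize("NFC", toned)
        elif key == "z" and idx is not None:
            out[idx] = strip_tone(out[idx])
        elif key in DOUBLES and prev == key:
            out[-1] = DOUBLES[key]
        elif key == "w" and prev in W_FORMS:
            out[-1] = W_FORMS[prev]
        elif key == "w":
            out.append(W_FORMS["u"])  # a lone w becomes ư
        else:
            out.append(key)
    return "".join(out)


print(telex_to_text("tieengs vieetj"))
print(telex_to_text("toansz"))
```

Typing "tieengs vieetj" should come out as "tiếng việt", and "toansz" as plain "toan", matching the z example in the table.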

Finally, we want to save the Vietnamese text into a file, e.g. myvntext.txt.
In Wordpad, the save dialog window is shown in image 8; note the line "Save as type: unicode text document".
In EditPlus, the save dialog window is shown in image 9; note the line "Encoding: UTF-8".
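
For those who would rather check the saved file with a script, here is a minimal Python sketch of writing and reading a UTF-8 text file. The sample text and the use of the temp folder are my assumptions; only the file name comes from the step above. As far as I know, Wordpad's "unicode text document" type writes UTF-16 rather than UTF-8, so a file saved that way would need encoding="utf-16" instead; please treat that as something to verify on your own PC.

```python
import os
import tempfile

text = "Xin ch\u00e0o"  # "Xin chào", a hypothetical sample

# myvntext.txt is the name used above; the temp folder is only an assumption.
path = os.path.join(tempfile.gettempdir(), "myvntext.txt")

# Save the text as UTF-8, the equivalent of "Encoding: UTF-8" in EditPlus.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back as text, telling Python which encoding to expect.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Read the raw bytes to see how the accented letter was stored.
with open(path, "rb") as f:
    raw = f.read()

print(read_back)
print(raw)  # the letter à takes two bytes, the plain letters one each

os.remove(path)  # tidy up the sample file
```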

3. Using the Unikey conversion tool to convert other encoding schemes to UTF-8:
Frequently, we run into webpages that use legacy encodings. We need to convert their text to UTF-8 for viewing. Unikey provides a utility to do so, working entirely on the clipboard.
With the editor program open and Unikey running in the background (in [V] mode), drag the mouse over the scrambled Vietnamese text and press CTL-C to copy it to the clipboard.
3a. Press CTL-SHFT-F6; a dialog box appears as in image 10. Set [nguon] (source) = "VIQR", [dich] (destination) = "unicode", then click [chuyen ma] (convert). A message box reports that the conversion succeeded. Click the [dong] (close) button.
3b. The converted UTF-8 text is now in the clipboard. Press CTL-V in the Wordpad editor window and the text should read correctly. If not, go back to step 3a and set [nguon] to a different encoding.
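
To give a feel for what such a conversion does, here is a toy Python sketch for the VIQR case. It is my own illustration, not Unikey's code. It assumes the usual VIQR marks (' ` ? ~ . for the tones, ^ ( + for the letter shapes, dd for đ), ignores VIQR's escape character, and so it will wrongly accent a vowel that is followed by a real full stop or question mark.

```python
import unicodedata

# VIQR mark -> Unicode combining mark (assumed from the usual VIQR convention).
VIQR_MARKS = {
    "^": "\u0302",  # circumflex: a^ -> â
    "(": "\u0306",  # breve:      a( -> ă
    "+": "\u031b",  # horn:       o+ -> ơ
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}
VOWELS = set("aeiouyAEIOUY")


def base_of(ch):
    """Return the plain letter underneath any accents (ệ -> e)."""
    return unicodedata.normalize("NFD", ch)[0]


def viqr_to_unicode(viqr):
    """Convert VIQR text to Unicode text (toy version, no escape handling)."""
    out = []
    for ch in viqr:
        if ch in VIQR_MARKS and out and base_of(out[-1]) in VOWELS:
            # Attach the mark to the vowel just before it.
            combined = out[-1] + VIQR_MARKS[ch]
            out[-1] = unicodedata.normalize("NFC", combined)
        elif ch in "dD" and out and out[-1] in ("d", "D"):
            # dd -> đ, DD or Dd -> Đ
            out[-1] = "\u0111" if out[-1] == "d" else "\u0110"
        else:
            out.append(ch)
    return "".join(out)


unicode_text = viqr_to_unicode("Tie^'ng Vie^.t")
print(unicode_text)                   # readable Vietnamese
print(unicode_text.encode("utf-8"))   # the UTF-8 bytes a file or page would hold
```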

Techie note:



Without rehashing the technical jargon, let's see what happens when the key 'a' is pressed on the keyboard:

1. The microcontroller housed under the keyboard's hood detects the 'a' key press, retrieves its code (for 'a', hex=61, dec=97 in ASCII) and sends it to the PC's iron box.

2. The Unikey driver receives the 'a' code, saves it, then examines the previously saved code. Since that is null (nothing), it passes the 'a' code on to the editor program.

3. The editor program gets the 'a' code, recognises it as a normal symbol and saves it. The code is then passed to the display routine, which looks up the instructions for drawing the 'a' pattern on the screen (a matter of pixel set and reset instructions contained in the font file).

4. Next, the acute accent key (written ['] here; 's' in telex) is pressed. As in (2), the driver combines it with the previously stored 'a' code to make a two-byte UTF-8 code and sends it to the editor program, preceded by a backspace code. The backspace is necessary so that the editor program clears the stored 'a' from its buffer and the screen display routine goes back one cursor position.

The editor saves the UTF-8 code in its buffer and, since it is Unicode compatible, translates these 2 bytes into a code point (in effect an address pointer) that locates the 'a + acute accent' ( á ) pattern in the font for display on the screen.
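
The backspace trick in the steps above can be played out in a toy Python simulation. Everything here is hypothetical: the function names are mine, the "driver" knows only one rule (telex 's' after 'a'), and a real driver and editor exchange key messages rather than Python lists.

```python
BACKSPACE = "\b"


def driver_output(prev, key):
    """What the toy driver sends for `key`, given the key it passed on before."""
    if prev == "a" and key == "s":       # telex: s adds the acute accent
        return [BACKSPACE, "\u00e1"]     # erase the 'a', then send á
    return [key]                         # anything else passes straight through


def editor_receive(buffer, events):
    """Apply the received characters to the toy editor's buffer."""
    for ch in events:
        if ch == BACKSPACE:
            if buffer:
                buffer.pop()             # drop the character before the cursor
        else:
            buffer.append(ch)
    return buffer


def simulate(keys):
    """Feed key presses through the driver into the editor; return the text."""
    buffer, prev = [], None
    for key in keys:
        editor_receive(buffer, driver_output(prev, key))
        prev = key
    return "".join(buffer)


result = simulate("as")
print(result)                  # the single letter á
print(result.encode("utf-8"))  # its two-byte UTF-8 form
```

Without the backspace, the editor would end up holding both the plain 'a' and the accented letter.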

Unicode fonts have the Vietnamese symbols scattered from the base range (the first 128 locations, where the ASCII symbols reside) into the Latin-1 Supplement, Latin Extended-A, Latin Extended-B and Latin Extended Additional pages. Hence the need for a 16-bit address page pointer. Some old encoding techniques kept all the symbols within a single byte (256 locations) by reusing some of the ASCII control-character positions for accented letters.
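
A quick way to see this scattering is to ask Python for the code point of a few letters. The sample letters and the helper name describe are my own choice for this sketch.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, number of UTF-8 bytes) for one character."""
    return ord(ch), unicodedata.name(ch), len(ch.encode("utf-8"))


# a, á, ă, ơ, ệ written as escapes so the page encoding does not matter.
samples = ["a", "\u00e1", "\u0103", "\u01a1", "\u1ec7"]

for ch in samples:
    code_point, name, n_bytes = describe(ch)
    print(f"{ch}  U+{code_point:04X}  {n_bytes} byte(s)  {name}")
```

Only the plain 'a' sits in the ASCII range; á still fits below 256, while ă, ơ and ệ have code points above 255 and so cannot be addressed with a single byte.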

Finally, one can easily tell which part is at fault:
* If a font is not Unicode ready, a vowel carrying 2 accent marks is displayed as a square box.
* If the editor cannot read Unicode text, the text appears all scrambled on the screen.

That's all there is to it. I hope you will soon discover the joy of reading and writing Vietnamese text on the net.

For help, comments, criticism, or to report an error/problem, please e-mail me: duchyca@yahoo.com
Or you can also post on this webpage.