UTF-8 is a nice, efficient encoding of the Unicode character set. It is an attractive option for pages that are mostly plain old ASCII with an occasional exotic character from the larger Unicode space. Here is my special symbol sampler.
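
To make the efficiency claim concrete, here is a small Python sketch (Python is just a convenient choice for illustration; the sample sentence is made up). It compares the size of a mostly-ASCII sentence in UTF-8 and in UTF-16, and shows how many bytes individual characters take in UTF-8.

```python
# Compare encoded sizes of a mostly-ASCII sentence in UTF-8 and UTF-16.
# The sample text is invented for this illustration.
sample = "The Greek letter α and the Chinese character 山 are rare guests here."

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte-order mark


def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


print("characters:", len(sample))
print("UTF-8 bytes:", len(utf8_bytes))
print("UTF-16 bytes:", len(utf16_bytes))
for ch in "Aα山":
    print(repr(ch), "takes", utf8_length(ch), "byte(s) in UTF-8")
```

ASCII characters stay at one byte each, so only the occasional exotic character costs extra.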

Here is the saga of getting my web site converted to utf-8. I write this because it has required many forays into obscure niches of the cyberworld. I may have to do this again and I will have forgotten some of the niches.

HTTP

A browser talks to a web server using HTTP. When a browser requests a page, the web server responds first with some text called the ‘HTTP header’, which may declare the character encoding of the forthcoming page, for example UTF-8. This is specified in the HTTP protocol. See “charset” in this RFC. If the header is silent about the encoding, a meta tag within the head element of the HTML file proper may convey the same information. A tag such as
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
in a file announces the file’s format.
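
The order described above (header first, meta tag as a fallback) can be sketched in Python. This is a simplified illustration, not what any particular browser does; real browsers apply further rules. The function names are hypothetical, and the meta-tag search is a rough regular expression rather than a real HTML parser.

```python
import re
from email.message import Message

# Rough pattern for a charset declared inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None if absent


def charset_from_meta(body: bytes):
    """Look for a charset declaration in the first part of the page."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(content_type, body: bytes):
    """Prefer the HTTP header; fall back to the meta tag; else None."""
    return charset_from_header(content_type) or charset_from_meta(body)


page = (b'<html><head><META http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head></html>')
print(detect_charset("text/html", page))
print(detect_charset("text/html; charset=ISO-8859-1", page))
```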

I see here that links to a file can include a charset attribute for the target. I wonder which overrides! Here are the legal charset values.

Web Server

How does the web server know that some particular file is utf-8? Apache, at least, looks in each directory in the path name within the URL for a file named “.htaccess”. The last such file with a line such as
AddType zzz .html
instructs the server to tell the browser, via HTTP, that any file whose name ends with “.html” has MIME type zzz. "text/html; charset=UTF-8" is such a type; it must be quoted in the AddType line because it contains a space.
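
The lookup described above can be modeled with a toy Python sketch. It is not Apache's implementation, which merges many kinds of directives and honors its main configuration. It only walks the directories in a URL path and lets the deepest matching AddType line win. The function name, the dictionary standing in for the file system, and the sample paths are all hypothetical.

```python
import shlex


def effective_type(htaccess_by_dir: dict, url_path: str):
    """Return the MIME type from the last matching AddType line, or None.

    htaccess_by_dir maps a directory ("" for the site root, "/notes", ...)
    to the text of its .htaccess file.
    """
    parts = url_path.strip("/").split("/")
    filename = parts[-1]

    # Directories from the site root down to the one holding the file.
    dirs = [""]
    for part in parts[:-1]:
        dirs.append(dirs[-1] + "/" + part)

    result = None
    for directory in dirs:
        for line in htaccess_by_dir.get(directory, "").splitlines():
            tokens = shlex.split(line)  # honors the quotes around the type
            if len(tokens) >= 3 and tokens[0].lower() == "addtype":
                mime, extensions = tokens[1], tokens[2:]
                for ext in extensions:
                    suffix = ext if ext.startswith(".") else "." + ext
                    if filename.endswith(suffix):
                        result = mime  # later (deeper) lines override
    return result


site = {
    "": "AddType text/html .html",
    "/notes": 'AddType "text/html; charset=UTF-8" .html',
}
print(effective_type(site, "/notes/utf8.html"))
print(effective_type(site, "/index.html"))
```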

The web server may ignore the .htaccess files, depending on the configuration file it reads at startup. On the Mac this file is “/etc/httpd/httpd.conf” (“/private/etc/httpd/httpd.conf” seems to be a twin, since “/etc” seems to be the same directory as “/private/etc”). As of OS X 10.5.2 the normal web server is Apache 2.2 and the httpd.conf file is in /etc/apache2. See the comments in this file concerning the “AllowOverride” directive. See the sudo command about modifying this file; the administrative password is required.

Tools

I prefer to edit my own HTML markup. The Apple program TextEdit comes with Mac OS X and handles UTF-8 files natively. Under Preferences > Rich text processing, I choose “Ignore rich text commands in HTML files”, which causes the editor to show rather than interpret HTML markup. Under Preferences > Default plain text processing, I select UTF-8 for both Open and Save. In this mode Unicode characters are first class and fit on the clipboard. For input I use the new keyboard facility thus:
Under Apple > System Preferences > International > Input Menu I select the US keyboard plus the Character Palette
and select “Show input menu in menu bar”. A small U.S. flag shows in the menu bar, from which I can summon the character palette where I can find all Unicode characters.

TextEdit treats the Unix line end and the Mac line end alike when it displays text on the screen. It puts Mac line ends wherever you type a return. This causes me surprisingly little trouble. Indeed this HTML file has both. The browsers I have tested tolerate both.
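
If the mixture ever does become a nuisance, it is easy to tidy up. Here is a minimal Python sketch (the function name is hypothetical) that rewrites every line end as the Unix one. It also handles the two-character Windows line end, which the text above does not mention, so that a stray one does not turn into two line breaks.

```python
def normalize_line_ends(text: str) -> str:
    """Rewrite Mac (CR) and Windows (CR LF) line ends as Unix (LF)."""
    # Replace CR LF first so it does not become two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "first\rsecond\nthird\r\nfourth"
print(normalize_line_ends(mixed).split("\n"))
```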

This code converts from the character reference form of Unicode to utf-8 thus:

cv < xref.html > xutf8.html
It converts back thus:
cv b < xutf8.html > xref.html
It is sort of picky but tries to explain what it doesn’t like.
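
For readers without that program, the same two conversions can be approximated in Python. This is a sketch of the idea, not the cv program itself, and the function names are hypothetical. It leaves named references such as &lt; alone, and also numeric references below 128, so that the markup keeps its meaning.

```python
import re

# Matches numeric character references such as &#x3b1; or &#945;
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def refs_to_text(markup: str) -> str:
    """Replace non-ASCII numeric character references with the characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code <= 0xDFFF
        if code < 128 or code > 0x10FFFF or is_surrogate:
            return match.group(0)  # leave ASCII and invalid references alone
        return chr(code)
    return NUMERIC_REF.sub(replace, markup)


def text_to_refs(markup: str) -> str:
    """Replace every non-ASCII character with a hexadecimal reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in markup
    )


original = "alpha: &#x3b1; or &#945;, less-than: &lt; or &#60;"
converted = refs_to_text(original)
print(converted)
# Writing converted.encode("utf-8") to a file yields the UTF-8 form.
print(text_to_refs(converted))
```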

Here is too much information on UTF-8.

Alternatives to UTF-8

There are three other ways to invoke Unicode that avoid telling the web server to tell the browser to watch out for UTF-8 encoding. These also avoid finding a UTF-8 savvy editor with some sort of virtual keyboard with tens of thousands of keys. A Greek alpha can be had by putting any of “&alpha;”, “&#x3b1;” or “&#945;” in your html file. 3b1 is hexadecimal for 945 which is the Unicode code point for lower case Greek alpha. Upper case alpha is correspondingly “&Alpha;”, “&#x391;” or “&#913;” {to wit: “α”, “α” or “α”; “Α”, “Α” or “Α”}. If you know your Greek alphabet you can count from there. Otherwise you can consult the whole zoo of possibilities.
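
The arithmetic is easy to check with a few lines of Python, using the standard html module to decode the three forms. This is only a quick verification sketch; the variable names are made up.

```python
import html

lower_forms = ["&alpha;", "&#x3b1;", "&#945;"]
upper_forms = ["&Alpha;", "&#x391;", "&#913;"]

# All three spellings decode to the same character.
lower = {html.unescape(form) for form in lower_forms}
upper = {html.unescape(form) for form in upper_forms}
print(lower, upper)

# Hexadecimal 3b1 and decimal 945 are the same code point.
print(int("3b1", 16), int("391", 16))

# Counting from alpha: the next code point is beta.
print(chr(945 + 1))
```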

You can even do Asian characters: Mountain in Chinese is 山 (&#x5c71;), or 山 (&#23665;). (UTF-8: 山) On the Mac, Asian characters come in many “fonts”, take tens of megabytes, and are optional but free. They are part of the platform and not the browser. The browser must be Unicode savvy. The seven modern browsers I have on my Mac are all Unicode savvy. I don’t know the situation on Windows. ᴒ∀❊⺀⺁⺃⺄⺅⺆⺇⺈⺉⻄