HTML Character Entities

This page enumerates the character entities HTML provides. These are tokens between an & (ampersand) and a ; (semicolon), of form &token;, which an HTML renderer displays as the symbol the enclosed token describes. Wherever an HTML document uses a character that is not covered by the encoding its web server will claim it uses, it should represent that character by a character entity. One can also use numeric character entities, consisting of a number (the Unicode code-point for the character) between &# and a ; (semicolon) – however the verbal character entities (when available) are more intelligible to anyone reading the page source.
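As a rough illustration of how the verbal and numeric forms name the same character, here is a minimal Python sketch. It uses the standard library's html.unescape as a stand-in for an HTML renderer, which is an assumption: real browsers may differ in detail. The hexadecimal spelling, &#x…;, is a second numeric form that HTML also accepts.

```python
import html

# Three spellings of the same character, e with an acute accent (U+00E9).
named = html.unescape("&eacute;")       # verbal (named) entity
decimal = html.unescape("&#233;")       # numeric entity, decimal code-point
hexadecimal = html.unescape("&#xE9;")   # numeric entity, hexadecimal code-point

print(named, decimal, hexadecimal)
```

The verbal form is the only one of the three that tells someone reading the source which character is meant without a look-up.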

When a web server responds to a request for a page, it reports a (content, as opposed to transfer) encoding, which specifies how the stream of bytes delivered should be interpreted as characters. If the page's author used an authoring tool which worked in some native encoding, but the server doesn't know about this (so doesn't report it, or reports some default at odds with it), this can confuse the user agent. Since this happens fairly often, many user agents attempt to guess the actual encoding; but the less we rely on programs to guess, the less scope they have for bugginess. A page can contain meta-data which specifies data equivalent to that in HTTP headers, but it is in principle hopeless to specify the encoding (and some other details, such as MIME content-type) this way, though in practice it may help the user agent's guess-work. The user agent can't correctly read this meta-data unless it is already correctly interpreting the byte-stream as characters, and that is exactly what the meta-data is supposed to tell it how to do: a classic chicken and egg problem. Consequently, web pages should be written in the character encoding the web server will report for them. HTML character entities provide a way to use characters absent from the encoding advertised by the web server.
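The last point can be sketched in Python: given the encoding the server will advertise, replace each character that encoding cannot represent with an entity. The function name to_entities is hypothetical, invented for this illustration; it prefers a verbal entity from the standard library's HTML 4 table (html.entities.codepoint2name) and falls back to a numeric one. It deliberately leaves &, < and > alone, so it is not a general HTML escaper.

```python
from html.entities import codepoint2name


def to_entities(text: str, encoding: str) -> str:
    """Replace characters that `encoding` cannot represent with entities.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable: keep the raw character
            out.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Prefer the verbal form; fall back to the decimal code-point.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


print(to_entities("café ≤ ℕ", "ascii"))
print(to_entities("café ≤ ℕ", "latin-1"))
```

Under ASCII the accented letter becomes &eacute;, while under Latin-1 it can stay raw; in both cases ≤ becomes &le; and ℕ, which has no HTML 4 name, becomes &#8469;.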

Some authoring tools will allow the user to switch character sets (and hence, typically, encodings) without stopping to warn about this issue. While this provides a convenient way to author a page using a wide repertoire of characters, the results are more or less guaranteed to display unintelligibly to the page's readers. For contrast, if the page uses a character entity that some browser does not support, it will typically display the &token; verbatim; this may not look beautiful to the reader, but at least it won't look like some arbitrary other (i.e. wrong) character. If your authoring tool leaves raw characters in web documents, you may find the demoronizer useful. It ain't perfect, but I haven't written anything better yet.
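The contrast between the two failure modes can be shown with a small Python sketch. It assumes the raw character was saved as UTF-8 but is read as Latin-1, and it uses html.unescape as a stand-in for a browser that does not know a given token; both choices are illustrative assumptions, not a description of any particular tool or browser.

```python
import html

# A raw character saved in one encoding but read in another:
# the reader sees two wrong characters instead of one right one.
raw = "é".encode("utf-8")
misread = raw.decode("latin-1")

# A token the decoder does not recognise is passed through verbatim.
unknown = html.unescape("x &cdots; y")

print(misread)
print(unknown)
```

The first result is silently wrong; the second is ugly but at least visibly unconverted.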

The non-experimental parts of this page are derived from the HTML 4 character entity set, which supersedes and subsumes the Web Project's description of the ISO 8859 Latin 1 "ISO 8879:1986//ENTITIES Added Latin 1//EN" character entities. I provide illustrations, so you can see what's what (and whether your browser copes), and have shuffled the order. Jukka Korpela provides something similar in a table.

In due course this page should be updated to take account of stuff I've been told about Unicode; here are pages of charts and names. Ian Hickson's data: URI kitchen can also be useful … and, speaking of Ian, HTML 5 has a much expanded repertoire of character entities, which I should document some day. Some entries from it are included below.

Another update, possibly superseding the preceding: in April 2010, the W3C MathML WG published its entity definitions for characters, which aim to be fairly comprehensive. The set is way too big to assimilate here.

In an attempt to make it easier to find particular characters, I've also broken the list into logical groups:

Accented letters
acute, circumflex, grave, umlaut, other vowels and consonants.
Symbols
lone accents, legal symbols, enclosures (e.g. quotation marks), punctuation, spaces, Unicode magic, a miscellany and symbols I didn't recognise until I looked them up. So I can sympathise with the many browsers that still don't recognise them.
Mathematical Symbols
binary operators, arithmetic comparators, set relations, prefix or unary operators, special names, a miscellany and the Greek alphabet.
Arrows
Of so many kinds, with so many variations, that they needed their own section.
… and, finally, experiments.

Aside from the table of Greek letters, each entry is of form:

what
how [ = emacs ] [ comments ]

in which the bold punctuation won't be bold in actual entries, and portions enclosed in [] aren't always present (and the [ and ] themselves never are).

Note that raw, numeric and emacs forms are only provided for actual ISO 8859 Latin-1 characters; and that only browsers which claim to support HTML 5 can be complained at for failure to support the rest.

Accented Letters

Symbols

Mathematical Symbols

See also floor and ceil enclosures, above; þeoretical physicists should also see the Icelandic letter eth, above; and the section of arrows.

There may be more mathematical symbols in a table in W3.org's tour of HTML 3. See also W3.org's table of HTML MATH mode symbols, if it still exists.

Special type-faces

There are a few font-styles that have been widely adopted in mathematics to provide distinct forms of letters that have thereby taken on their own meanings; thus ℕ and its friends are from an 𝕠𝕡𝕖𝕟 type-face, whose letters can be obtained by putting opf; after the plain ASCII letter and an & before, e.g. ℂ, ℕ, ℚ, ℝ and ℤ. There's also a rather 𝔊𝔬𝔱𝔥𝔦𝔠 type-face, using suffix fr; in the same way, but I find it mostly unreadable: for example, 𝔄 is its A, which I would flatly fail to recognise if I hadn't just looked it up in the table; compare 𝔘 (which is U). Then there's a 𝓈𝒸𝓇𝒾𝓅𝓉 font-face, using suffix scr; to get 𝒜, ℬ, … 𝒴, 𝒵, 𝒶, … 𝒾, 𝒿, 𝓀, … 𝓏. It's OK, but I doubt I'll use it much.
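The naming pattern described above (letter, then suffix, wrapped in & and ;) is easy to check mechanically. The following Python sketch is only an illustration: the helper name styled is hypothetical, and html.unescape stands in for a browser that supports the HTML 5 entity set.

```python
import html


def styled(letter: str, suffix: str) -> str:
    """Build &<letter><suffix>; and decode it (hypothetical helper)."""
    return html.unescape(f"&{letter}{suffix};")


# Open-face letters commonly used for number sets.
print(" ".join(styled(c, "opf") for c in "CNQRZ"))
# The Gothic A and U mentioned above.
print(styled("A", "fr"), styled("U", "fr"))
# Script capitals A and B.
print(styled("A", "scr"), styled("B", "scr"))
```

Some of these letters live in the Letterlike Symbols block and others far away among the mathematical alphanumerics, which is one more reason to prefer the names to raw code-points.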

There's also at least some of the Hebrew alphabet: ℵ, ℶ, ℷ, …, but that's as far as it seems to get (early 2018).

Arrows

There are up to eight directions for arrows – each way horizontal or vertical and each of the diagonals between these – and many styles of arrow, albeit not every style has all directions.
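For the plainest style of arrow, the eight directions can be listed with a short Python sketch. The entity names used here come from the HTML 5 entity set rather than from this page, and the direction labels are my own; html.unescape again stands in for a browser.

```python
import html

# Named entities for the plain arrow in each of the eight directions.
# Names are taken from the HTML 5 entity set; labels are illustrative.
ARROWS = {
    "left": "larr", "up": "uarr", "right": "rarr", "down": "darr",
    "up-left": "nwarr", "up-right": "nearr",
    "down-right": "searr", "down-left": "swarr",
}

decoded = {direction: html.unescape(f"&{name};")
           for direction, name in ARROWS.items()}

for direction, arrow in decoded.items():
    print(f"{direction:>10}  {arrow}  U+{ord(arrow):04X}")
```

Other styles follow similar naming schemes, but, as noted, not every style has all eight.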

There's a whole mess of others, such as right angle with downwards zig-zag, ⍼, but this starts to get into unrecognisable gibberish.

Experiments

I lob experiments in here to see if they work. Some are inspired by TeX, others by wishful thinking and randomness. W3.org doesn't sanction them and I don't necessarily think they're a good idea.

&cdots;
&cdots; centered dots
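One way to check whether an experimental name has since become standard is to look it up in an entity table. This Python sketch uses the standard library's copy of the HTML 5 table (html.entities.html5); the helper name is_standard is hypothetical, and the result reflects Python's table, not what any given browser does.

```python
from html.entities import html5


def is_standard(name: str) -> bool:
    """True if &name; appears in Python's copy of the HTML 5 entity table."""
    return f"{name};" in html5


print(is_standard("cdots"))   # the experimental name above
print(is_standard("ctdot"))   # a standard name for a midline ellipsis
```

So far as that table goes, &cdots; remains an experiment, while &ctdot; names a similar row of centred dots.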

And if you think I should have done all that with tables (I admit I've been tempted – much of it is crying out for it; and I've succumbed for the Greek alphabet), please pause to consider that even under Lynx I can use this list-form to find the right code to type into a file I'm writing. If I did it with tables, it would look a total mess under browsers that don't support them, which would greatly diminish its utility. Meanwhile the present form works fine in Arena, Mozilla, Grail – all free (as in liberty) – as well as the proprietary (but gratis) Opera and (gratis once you've paid for the operating system they want you to use) IE.

On semantic character entities

Back in 2002, I suggested to the W3C's CSS folk that maybe it'd be a good idea for style sheet mechanisms to provide for mapping style-sheet-defined &…; tokens to official ones. To illustrate why this would be useful, consider:

Similarly, one might wish to have &tensor; map to ⊗, &union; to ∪, &iff; to ⇔ and so on, enabling mnemonic names even for character entities with only one (orthodox) reading. When co-opting some Unicode character to serve some particular purpose, it would likewise make sense to give it a mnemonic name, indicative of that purpose, thereby abstracting away the choice of particular Unicode character selected to denote it.
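The mapping idea can be sketched as a pre-processing step in Python. This is not the style-sheet mechanism proposed to the W3C, only an illustration of the mapping itself: the ALIASES table and the function expand_aliases are hypothetical, and the official names on the right-hand side (otimes, cup) are the standard entities for ⊗ and ∪.

```python
import html
import re

# Hypothetical author-defined mnemonic names, mapped to official entity names.
ALIASES = {"tensor": "otimes", "union": "cup"}


def expand_aliases(markup: str) -> str:
    """Rewrite &alias; tokens to their official names; leave others alone."""
    def swap(match: re.Match) -> str:
        name = match.group(1)
        return f"&{ALIASES.get(name, name)};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", swap, markup)


source = "A &tensor; B &union; C &amp; D"
official = expand_aliases(source)
print(official)
print(html.unescape(official))
```

The author writes the mnemonic name, and only the table needs to change if a different character is later chosen to denote the same thing.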

It turns out this can be achieved in various ways.

I still think, fundamentally, that the semantic web would be better served by scrapping the whole ghastly mess of character entities in the DTDs and replacing them with a style-sheet-based approach (or an approach similar to that taken by style sheets). The W3C could perfectly readily provide a standard set of style sheets specifying the present entities (and browsers could still have these built in) for backwards compatibility, but authors would be enabled to provide domain-specific semantic names for the characters they're using and @import the default specs. Doing it via style-sheets is more compatible with existing infrastructure than doing it via DTDs and, in any case, what we're doing here is specifying presentation for character entities, so it belongs in style sheets.


Valid CSS ? Not Valid HTML (due to experimental entities). Written by Eddy.