Bad schemas for names

This is a rant about something that routinely bugs me about the anglo-centrism of those who write for computer audiences. It's also a rant about how those who try to teach the benefits of XML fail to actually think about informatics – doing so would enable them to make a better case for XML. The rant is provoked by the example that seemingly every tutorial on XML feels obliged to give, which they always do badly. A relatively mild example goes like this:

<?xml version="1.0"?>
<person>
 <name>
  <firstname>Paul</firstname>
  <lastname>McCartney</lastname>
 </name>
 <job>Singer</job>
 <gender>Male</gender>
</person>

I've actually chosen one of the less severe transgressors here; they only give this in passing and don't belabour their error by presenting it as a good way to do things. The error is in using firstname and lastname as element types – I might quibble with using name the way they do, but the rest is good.

So what's my beef with using firstname and lastname? Well, first, there's the simple fact that they're fatuous – that is, Paul appears first and McCartney appears last, so why throw in spurious mark-up to say that? For my purposes, that's a minor quibble, but it's actually a serious issue for a tutorial, at least when it's trying to get across that XML is a good thing. The reasonable reader is going to wonder how dumb the system must be, if it needs the blindingly obvious pointed out. Still, that's incidental: my real concern is the bad informatics implicit in the chosen schema.

What this example tells me, about the collection of information about people that's illustrated by one person's data above, is that those collecting it have not thought about the data they're collecting. They've taken a structure that works for most of the people they know and adopted it as a scheme into which to shoe-horn the data for all the people it doesn't match. I anticipate that the rest of the software using this collection will, for example, start mails to people in the list with Dear Paul, using firstname, which will lead to stupid situations when you want to write to someone who is normally addressed by their middle name: the data will then have to tag the middle name as firstname just to get the mail sensibly addressed. The mark-up will then make it obvious that we have a misnomer on our hands: what was tagged as firstname should really have been tagged as a personal name. Likewise, when you want to find all the members of clan McCartney in your collection, you'll filter on lastname; what you're really doing is using it as family name. The case becomes even more obvious with names from China, among other places where the personal name appears after the family name.

Now you might object, to my complaint, that I'm just grumbling about nomenclature: I'll come back to why there's more to it than that, but let me just, first, point out that nomenclature matters. If you name a data-field wrong, people are going to end up putting the wrong datum in it, because they'll supply the thing the name appears to say it wants. If you give something, that has a generic name, a name that's specific to one culture, you run the risk of offending folk of another culture, to whom the generic name would be appropriate but your chosen name is misguided. You will make stupid mistakes like failing to notice that two people with the same first name are part of the same family, or mistakenly thinking that two people with the same personal name are from the same family.

Now, in fact, naming is really rather complex: sufficiently so that I'd even go so far as to suggest that those writing XML tutorials steer clear of it unless they're really willing to do their homework. On the other hand, if they do bother to do that homework, and XML is as good as they say, then they should be able to make a really good illustration out of naming (and getting their hands dirty with some real informatics might do them no harm; if nothing else, it'd earn them some respect from any readers who've thought about the problem). It has the potential to make a good illustration for the simple reason that it is really quite complex; which means it can illustrate the power and flexibility of XML. I'll have a stab at doing this, below, but I should mention up front that I'm not an XML expert, so I probably won't do it as well as it could be done.

At the very least, I'd like to see those writing tutorials use personalname and familyname instead of firstname and lastname. It would at least set a good example to their students; the labels are then semantic rather than merely describing a layout fact that happens, in the examples the author bothers to use, to coincide with the semantic property they should really be tagging.
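To make the difference concrete, here is a minimal Python sketch (not taken from any tutorial) that reads two records tagged with personalname and familyname. The people wrapper element, the function names and the choice of records are assumptions made purely for illustration. It contrasts a greeting built from the semantic tag with one built from whatever happens to come first.

```python
import xml.etree.ElementTree as ET

# Hypothetical records using the personalname/familyname labels suggested
# above. Document order follows each person's own culture; the tags carry
# the meaning. The <people> wrapper is an assumption of this sketch.
RECORDS = """
<people>
 <person>
  <name>
   <personalname>Paul</personalname>
   <familyname>McCartney</familyname>
  </name>
 </person>
 <person>
  <name>
   <familyname>Ho</familyname>
   <personalname>Su Lian</personalname>
  </name>
 </person>
</people>
"""

def greeting(name):
    """Informal salutation built from the semantic tag, not from position."""
    return "Dear " + name.findtext("personalname")

def positional_greeting(name):
    """What a firstname/lastname mindset does: greet by whatever comes first."""
    return "Dear " + list(name)[0].text

names = ET.fromstring(RECORDS).findall("person/name")
greetings = [greeting(n) for n in names]
positional = [positional_greeting(n) for n in names]

print(greetings)   # ['Dear Paul', 'Dear Su Lian']
print(positional)  # ['Dear Paul', 'Dear Ho']
```

The positional version only works for names whose layout happens to match the author's own culture; the tagged version doesn't care about order.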

Trying to do it right

Human culture is diverse and complex, and names are terribly important to people – some can easily get highly offended if you get their name wrong or if you use the wrong parts of it for particular social interactions. Different cultures even have different semantics to names, quite apart from word order. The norms governing relevant customs, furthermore, change over time – my grandfather, for example, would have expected letters from his peers to address him as Welbourne while I, with exactly the same name, expect to be addressed as Edward. The software to deploy a datastore full of information about people needs, as a result, to encode a set of rules that will conform to social norms; the datastore, meanwhile, must record the information that software shall need.

Stating the problem

Let's start with a sample of some names of real people, that don't so neatly fit the simplistic firstname lastname pattern assumed by so many tutorials:

Aubrey de Grey

He might also have some middle names, but his friends call him Aubrey and he inherited the whole of de Grey from his parents: a family name may have more than one word (without being hyphenated) and fragments of a name aren't always capitalised.

Anne van Kesteren

Another two-part family name; and, while we're at it, notice that a personal name that some cultures only give to girls may, in another, be given to boys. (They're given to children, who then grow up into relevant adults, of course.)

Tollef Fog Heen

Another two-parter of a family name, this time with both halves capitalised, still with no hyphenation.

Mark (Tarquin) Wilton-Jones

I'm not sure, but I doubt any official documents call him Tarquin; still, that's how he prefers to be addressed, just as I prefer to be addressed as Eddy.

Margaret Louise Scot

Always addressed as Louise, even though it's not her first name.

Ho Su Lian

She inherited Ho from her parents and the right way to address her is as Su Lian; a personal name may be more than one word – and the order of parts depends on culture.

Jan Vidar Krey

Like many (but by no means all) Norwegians with two personal names, he's addressed using both, as Jan Vidar.

Haraldur Karlsson

He's Icelandic, so his family name is actually a patronymic: it's not a patrilineal family name (i.e. the same as his father's, and his before him), it literally says he's the son of Karl; if he has a sister, she'll be Karlsdottir.

… and that's not even going near to Russian complications or the Arab practice of indicating the names of sons (if a name ends abu Rashid it means the person thus named has a son named Rashid). Notice that none of these is unusual within relevant cultures; if we were to delve into the complications that arise from idiosyncratic naming, it would get more complex yet.

There are also plenty of women who, on marriage, retain their maiden name (family name at birth) as a last-but-one name and append their husband's family name; many others discard their maiden name, replacing it with their husband's family name; and some women don't change their names on marriage. There are also some men who, on marriage, discard their family name and adopt their wife's. Names that derive from family names are worth identifying as such in the mark-up. Sometimes an ancestral maiden name will be used as a penultimate name for a child; whether this counts as a personal name or a family name is a further complication.

So we have personal names, one or some of which may be used as a familiar name, and family names, some of which may be transient. We have various forms of address – formal and informal; social, professional (among peers) and commercial (e.g. when your bank or doctor addresses you); spoken and written; and, as to written, one may care about the distinction between proper forms for use on the outside of an envelope, in the salutation that begins the letter and when referring to the person in writing others shall read – and we need to be able to indicate the fragments of name relevant to each. To do the job fully, one would, I suspect, have to duplicate at least some fragments – or, at the very least, have the schema support doing so.

Name fragments may be derived from father's personal name or may be the same as that of father; we may as well cater to the possibility of the same for mother, although I don't know of a culture that does so (but check Russian). The correct terms for those are patronymic (from father's personal name), patrilineal (same as father), matronymic and matrilineal.

There are plenty of other complications. When folk go by a personal name other than the first, some simply omit the first; others include its initial; I suspect some vary, depending on context. At least one author used his first and last names when writing one genre of fiction, but included his middle name as an initial when writing another genre. Folk have titles or qualifications that are conventionally included when naming them; the form this takes may vary with context, either as a prefix or a suffix (prefix Dr. or suffix PhD., for example). Names that a family has reused down several generations may be qualified to indicate which of the people with a given name is indicated, as for example the USAish convention of Senior, Junior and subsequent numbering with Roman numerals.

Of course, for any given application, only some of the details shall be relevant; it's important to identify which those are and not waste too much time and effort on capturing (much) more information than you actually need; but what you do choose to capture, you need to annotate correctly, so that your software can use it correctly.

Partial Solution

So, clearly, we can have mark-up delineating personal and family parts of names; and we can delineate preferred name directly when it's present as part of the name, but we need some form of alternatives structure for dealing with the case of preferred forms of address. I'll use a mixture of elements and classes. Our original quoted example can become:

<?xml version="1.0"?>
<person>
 <name>
  <personal>Paul</personal>
  <patrilineal>McCartney</patrilineal>
 </name>
 <job>Singer</job>
 <gender>Male</gender>
</person>

with nothing more than some trivial renaming, as long as we introduce some simple rules like allowing that, absent any indication of preferred form of address, normative rules are to be applied. Now let's try the examples I listed above, plus a few more for illustrative purposes:

<?xml version="1.0"?>
<!-- names: a wrapper element (an assumed name) so the list is a single well-formed document -->
<names>
 <name>
  <personal>Aubrey</personal>
  <patrilineal>de Grey</patrilineal>
 </name>
 <name>
  <personal>Anne</personal>
  <patrilineal>van Kesteren</patrilineal>
 </name>
 <name>
  <personal>Tollef</personal>
  <patrilineal>Fog Heen</patrilineal>
 </name>
 <name>
  <personal>
   <alternative class="official">Mark</alternative>
   <alternative class="preferred nick">Tarquin</alternative>
  </personal>
  <patrilineal>Wilton-Jones</patrilineal>
 </name>
 <name>
  <personal>Margaret</personal>
  <personal class="preferred">Louise</personal>
  <patrilineal>Scot</patrilineal>
 </name>
 <name>
  <patrilineal>Ho</patrilineal>
  <personal>Su Lian</personal>
 </name>
 <name>
  <personal>Jan Vidar</personal>
  <patrilineal>Krey</patrilineal>
 </name>
 <name>
  <personal>Haraldur</personal>
  <patronymic>Karlsson</patronymic>
 </name>
 <name>
  <alternative class="official">
   <personal>Edward</personal>
   <patrilineal>Welbourne</patrilineal>
  </alternative>
  <alternative class="preferred nick">Eddy</alternative>
 </name>
</names>

Note that I've presumed that Tarquin is OK with being addressed as Tarquin Wilton-Jones, in contrast to my preference, which is for Eddy to be used only if you aren't using my surname. I'm also not sure Chinese names are inherited patrilineally; I just guessed.
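As a rough illustration of how software might apply the rule that, absent any indication of preferred form of address, normative rules are applied, here is a minimal Python sketch. It uses a few of the names marked up above. The names root element, the function names and the particular fallback (use the first personal name) are assumptions of the sketch, not part of the schema, and a real application would need rules suited to its own users' cultures.

```python
import xml.etree.ElementTree as ET

# A few of the names marked up above, wrapped in a root element (an
# assumption of this sketch) so that they parse as one document.
SAMPLE = """
<names>
 <name>
  <personal>
   <alternative class="official">Mark</alternative>
   <alternative class="preferred nick">Tarquin</alternative>
  </personal>
  <patrilineal>Wilton-Jones</patrilineal>
 </name>
 <name>
  <personal>Margaret</personal>
  <personal class="preferred">Louise</personal>
  <patrilineal>Scot</patrilineal>
 </name>
 <name>
  <patrilineal>Ho</patrilineal>
  <personal>Su Lian</personal>
 </name>
 <name>
  <alternative class="official">
   <personal>Edward</personal>
   <patrilineal>Welbourne</patrilineal>
  </alternative>
  <alternative class="preferred nick">Eddy</alternative>
 </name>
</names>
"""

def has_class(element, word):
    """True if the space-separated class attribute contains word."""
    return word in element.get("class", "").split()

def informal_name(name):
    """Pick the name to use when addressing someone informally."""
    # An explicit preference, wherever it sits inside the name, wins.
    for element in name.iter():
        if has_class(element, "preferred"):
            return "".join(element.itertext()).strip()
    # Assumed normative rule: fall back to the first personal name.
    personal = name.find(".//personal")
    return personal.text.strip() if personal is not None else None

informal = [informal_name(n) for n in ET.fromstring(SAMPLE).findall("name")]
print(informal)  # ['Tarquin', 'Louise', 'Su Lian', 'Eddy']
```

This only covers the informal case. Formal address, and the distinction between a nick that may be combined with the surname and one that may not, would need further rules.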


Written by Eddy.