Why is XML case-sensitive?


Sriram Krishnan asks a strange question:

I see someone flaming someone else for not being XHTML compliant. Tim Bray - if you're reading this, I want to know something. Why is XML case-sensitive? No human-being ever thinks in case-sensitive terms. A is a. End of story. So now, I have a situation where writing <html> </HTML> wouldn't be XHTML compliant. And what do I get out of XHTML apart from geek-bragging rights and this strange idea of 'standards-compliance'? Does it give me more freedom? Does it help my viewers? My customers?
Well, this guy is definitely heavily sloppy-HTML-contaminated. What? <html> </HTML> isn't XHTML compliant? Thank God! Anyway, Tim Bray does answer his question:
XML markup is case-sensitive because the cost of monocasing in Unicode is horrible, horrible, horrible. Go look at the source code in your local java or .Net library.

Also, not only is it expensive, it's just weird. The upper-case of 'é' is different in France and Quebec, and the lower-case of 'I' is different here and in Turkey.

XML was monocase until quite late in its design, when we ran across this ugliness. I had a Java-language processor called Lark - the world's first - and when XML went case-sensitive, I got a factor of three performance improvement, it was all being spent in toLowerCase(). -Tim
Nice.
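
To make the "it's just weird" part of Tim's answer concrete, here is a minimal sketch using Python's built-in Unicode case mappings. It is only an illustration and not what Lark or any XML parser did. Note that Python's `str.upper()` and `str.lower()` are locale-independent, so the Turkish behaviour is shown through the dotless and dotted I characters rather than through a locale switch.

```python
# Illustrative only: Python's default (locale-independent) Unicode case mappings.

# German sharp s has no single-character uppercase in the default mapping,
# so upper-casing changes the length of the string.
sharp_s_upper = "ß".upper()

# Turkish capital dotted I (U+0130) lower-cases to two code points:
# 'i' followed by a combining dot above (U+0307).
dotted_i_lower = "\u0130".lower()

# Case mapping is not a reversible round trip: 'ß' -> 'SS' -> 'ss'.
round_trip = "ß".upper().lower()

# Turkish dotless i (U+0131) upper-cases to plain 'I', which lower-cases
# back to dotted 'i', so the original letter is lost.
dotless_round_trip = "\u0131".upper().lower()

print(sharp_s_upper, len(dotted_i_lower), round_trip, dotless_round_trip)
```

A parser that had to monocase every name would need tables and special cases like these on every comparison, which is the cost a case-sensitive comparison avoids.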


9 Comments

> At the turn of the 19th century, when ASCII was defined, after IBM's EBCDIC format...

Development on ASCII began in 1960. EBCDIC development began afterwards, in 1963. And both of those were in the latter half of the 20th century. As a point of reference, Unix development began in 1969 (same decade as ASCII and EBCDIC).

Compare your incorrect sentence:
"Help my uncle Jack off a horse"
to the correct version:
"Help my uncle, Jack, off a horse"

and you'll realize you don't know what you're talking about.

gstangler said:

> Unicode is incapable of making this mapping in a single clock.

This is actually not true. English text encoded in Unicode can still be converted between cases using bitwise operations.

The article Dave linked states that "change in case does not change the meaning of a word in spoken language."

Compare:
I had to help my uncle Jack off a horse.
I had to help my uncle jack off a horse.

> Separate from that, being case insensitive is simply a form of laziness and/or lack of discipline.

Absolutely wrong.
Being case sensitive is a form of laziness on the part of the parser author who doesn't want to write code to equate A to a.

For a really good article on why case sensitivity is not only stupid but dangerous, lazy and wrong, see here:

http://www.tonymarston.co.uk/php-mysql/case-sensitive-software-is-evil.html

XML tags, etc., have no requirement to be English. That means that tags such as MyTAG and Mytag are as different to any OS as MyTAG and My123 or My_1_.

A-Z and a-z are two sets of 26 completely unique numerical values in the system.

At the turn of the 19th century, when ASCII was defined, after IBM's EBCDIC format, some genius realized it would be beneficial to map A-Z and a-z with just a single bit of difference, so that these 'special' characters could be converted from upper to lower case and vice versa. This feature was purely for performance, since bit masking takes a single clock cycle on all hardware.
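
The single-bit mapping described here can be sketched as follows. This is a minimal illustration in Python; the helper names `ascii_toggle_case` and `ascii_lower` are made up for the example and do not come from any library.

```python
CASE_BIT = 0x20  # the one bit that separates 'A' (0x41) from 'a' (0x61)


def ascii_toggle_case(ch: str) -> str:
    """Flip the case of a single ASCII letter; return anything else unchanged."""
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return chr(ord(ch) ^ CASE_BIT)
    return ch


def ascii_lower(text: str) -> str:
    """Lower-case only A-Z by setting the case bit; other characters pass through."""
    return "".join(chr(ord(c) | CASE_BIT) if "A" <= c <= "Z" else c for c in text)


print(ascii_toggle_case("A"), ascii_toggle_case("z"), ascii_lower("MyTAG_1"))

# Outside ASCII the pattern does not hold in general: U+0100 (capital A with
# macron) and U+0101 (small a with macron) differ in bit 0x01, not 0x20.
print(hex(ord("\u0100") ^ ord("\u0101")))
```

The range check matters: applying the mask blindly would corrupt digits, punctuation and any non-ASCII letters, so the trick only covers the 52 ASCII letters.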

The good people on the Unicode committee, although I'm sure they tried, failed to incorporate such an efficiency.

Since massive amounts of data processing now incorporate XML, performance is very important.

Unicode is incapable of making this mapping in a single clock.

Separate from that, being case insensitive is simply a form of laziness and/or lack of discipline.

Greg

Does XML case sensitivity have anything to do with W3C consortium rules?

Well, not *all* Ant users like it. Personally, I think case insensitivity is a plague that makes code more complicated everywhere it appears. (Instead of being able to simply compare strings, you now have to know to do a case-insensitive compare). I'd rather that Ant made a clear decision (e.g. lowercase only for everything) rather than allowing users this "flexibility" to have their Ant scripts look different from those in the next cubicle, for no good reason.

Interesting point. Ant is case insensitive on all attributes, because users like it, but we remain case sensitive on element names because the XML parser refuses to match up element ends with element starts if they don't match properly.
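
The parser behaviour described in this comment is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Ant's Java parser, and the helper `get_attr_ignore_case` is a hypothetical illustration of application-level attribute matching, not Ant's actual code.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def get_attr_ignore_case(element, name):
    """Hypothetical helper: look up an attribute while ignoring ASCII case."""
    wanted = name.lower()
    for key, value in element.attrib.items():
        if key.lower() == wanted:
            return value
    return None


print(is_well_formed("<html></html>"))  # start and end tags match
print(is_well_formed("<html></HTML>"))  # end tag differs only in case

target = ET.fromstring('<target Name="build"/>')
print(target.get("name"))                    # exact lookup is case sensitive
print(get_attr_ignore_case(target, "name"))  # the application chooses to ignore case
```

Element matching happens inside the parser, so an application cannot relax it, while attribute lookup happens after parsing and can be made as lenient as the application wants.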
