On Mon, 25 Aug 1997 [log in to unmask] wrote:
> Martin, you write (> ) in response to my earlier posting (>> ):-
> > URLs are indeed moving towards UTF-8, but not in the sense that
> > raw UTF-8 octets would be included e.g. in an iso-8859-1 document.
> > HTML transports characters, and so an iso-8859-1 document will
> > transport characters in iso-8859-1 (+others with SGML/HTML
> > constructs). It is only when the URL is prepared for a HTTP
> > request, or otherwise extracted from the document, that it will
> > be converted (as a kind of "normalization") to UTF-8.
> This is useful because it allows a "canonical" UTF-8 URL to be specified in
> a local (i.e., keyboardable) character set.
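
The normalization step described in the quoted text above can be sketched roughly as follows. This is only an illustration, assuming Python's standard `urllib.parse.quote`; the helper name `normalize_url_path` and the sample path are made up for the example and are not part of any proposal in this thread.

```python
from urllib.parse import quote

def normalize_url_path(raw: bytes, document_charset: str) -> str:
    """Hypothetical helper: turn the octets found in a document into a UTF-8 based URL path."""
    # Step 1: the document transports characters, so decode using its own charset.
    characters = raw.decode(document_charset)
    # Step 2: quote() encodes non-ASCII characters as UTF-8 and %HH-escapes them;
    # "/" is kept as-is so the path structure survives.
    return quote(characters, safe="/", encoding="utf-8")

# The same characters, as they would appear in an iso-8859-1 document.
latin1_path = "/café/menü.html".encode("iso-8859-1")
normalized = normalize_url_path(latin1_path, "iso-8859-1")
print(normalized)
```

The point of the sketch is that the result depends only on the characters, not on the charset of the document they were extracted from.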
> > While strictly speaking, only using "UNKNOWN-8BIT" would be
> > enough for MHTML, and I guess we should indeed use only
> > that if the only other alternative is "US-ASCII", I think
> > it is very bad practice to drop or ignore information that
> > is in many cases present and that may be put to very good
> > use. So only allowing "UNKNOWN-8BIT", or only "UNKNOWN-8BIT"
> > plus "US-ASCII", is not a good solution.
> The advantage of specifying "UNKNOWN-8BIT" is that it is consistent with
> employing an octet by octet comparison of an URL extracted from a text/html
> root object and a decoded Content-base and/or Content-location header. If
> we start specifying other character sets (including UTF-8), then this
> implies a canonicalization to UTF-8 (consistent with that which you
> describe above) of URLs extracted from a text/html root object :- a) on
> transmission prior to encoding Content-base and/or Content-location header,
> and b) on reception, prior to an UTF-8 comparison of a URL extracted from a
> text/html root object and a decoded Content-base and/or Content-location
> header.

It could indeed imply a canonicalization to UTF-8, but this is
definitely not intended for MHTML purposes, as the text about
the irrelevance of "charset" for MHTML in my proposal should
express clearly enough.
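
To make the two comparison styles under discussion concrete, here is a small sketch. It assumes the header value has already been decoded to raw octets; the function names and the example URL are hypothetical and only serve the illustration.

```python
def octets_equal(url_in_body: bytes, url_in_header: bytes) -> bool:
    """Octet-by-octet comparison: no charset knowledge is needed."""
    return url_in_body == url_in_header

def canonical_equal(url_in_body: bytes, body_charset: str,
                    url_in_header: bytes, header_charset: str) -> bool:
    """Comparison after canonicalizing both sides to UTF-8."""
    body_utf8 = url_in_body.decode(body_charset).encode("utf-8")
    header_utf8 = url_in_header.decode(header_charset).encode("utf-8")
    return body_utf8 == header_utf8

body_url = "http://example.org/café".encode("iso-8859-1")
header_same_charset = "http://example.org/café".encode("iso-8859-1")
header_in_utf8 = "http://example.org/café".encode("utf-8")

# Same octets on both sides: the plain comparison is enough.
same_octets = octets_equal(body_url, header_same_charset)
# Different encodings of the same characters: the plain comparison fails ...
mixed_octets = octets_equal(body_url, header_in_utf8)
# ... and only the canonicalizing comparison recognizes the match.
mixed_canonical = canonical_equal(body_url, "iso-8859-1", header_in_utf8, "utf-8")
```

As long as header and body carry the same octets, the first function suffices, which is why "charset" can stay irrelevant for the matching itself.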
A canonicalization could become necessary if MTAs have a look
at the message headers or message bodies and transcode them.
I don't know whether this would be allowed or frequent.
But it could happen that an MTA serving as a gateway between
the ASCII and EBCDIC world would have a look at a header and
say: Hey, US-ASCII, my MUAs won't grok that, so I transcode
to EBCDIC. Then "UNKNOWN-8BIT" would indeed be safer. BUT
if an MTA is transcoding, chances are good that it transcodes
both the headers and the bodies, and then we would be in
terribly bad shape if we used "UNKNOWN-8BIT", because this
header would stay octet-by-octet the same (or probably
worse, ASCII characters getting transcoded to EBCDIC,
whereas QP would remain as octet values), but the body,
hopefully correctly labeled, would get transcoded and look
different.

So my bet on this is that we are safest if we try to
keep the "charset" label on the body and on the URL
in sync, because then the chances are best that both
get transcoded in the same way if an MTA decides to
transcode.

If anybody has an idea about how frequent transcoding
is done by MTAs, and in what form exactly, this would
be very helpful.
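
For what it is worth, the gateway scenario above can be simulated in a few lines. This is a sketch under loud assumptions: the `gateway_transcode` function is a hypothetical stand-in for whatever a real MTA might do, Python's `cp500` codec stands in for "EBCDIC", and the quoted-printable header value is invented for the example.

```python
import quopri

def gateway_transcode(octets: bytes, source: str, target: str) -> bytes:
    """Hypothetical gateway: reinterpret the octets as characters and re-encode them."""
    return octets.decode(source).encode(target)

url = "café.html"
body_url = url.encode("iso-8859-1")   # URL as it appears in the labeled body
header_qp = b"caf=E9.html"            # same octets, quoted-printable, in the header

# The gateway transcodes the correctly labeled body to EBCDIC (cp500 here).
body_after = gateway_transcode(body_url, "iso-8859-1", "cp500")

# Case 1: header labeled "UNKNOWN-8BIT". The gateway cannot transcode it,
# so the decoded octet values stay exactly as they were sent.
header_after_unknown = quopri.decodestring(header_qp)
unknown_still_matches = body_after == header_after_unknown

# Case 2: header carries the same charset label as the body,
# so the gateway can transcode both in the same way.
header_after_labeled = gateway_transcode(
    quopri.decodestring(header_qp), "iso-8859-1", "cp500")
labeled_still_matches = body_after == header_after_labeled
```

In this simulation the match survives only in the second case, where the labels on header and body are kept in sync.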