MHTML Archives

Subject: Re: More on wrongly(?) formatted urls
From: Nick Shelness
Reply-To: IETF working group on HTML in e-mail
Date: Mon, 25 Aug 1997 15:37:19 +0100

Martin, you write (> ) in response to my earlier posting (>> ):-

>> The encoding of URLs that cannot be represented legally without encoding
>> in MIME Content-base or Content-location headers MUST employ the encoding
>> method described in RFC 2047. If the URL to be encoded contains only
>> octets in the ABNF range %d32-126 then an RFC 2047 charset parameter value of
>> "US-ASCII" or "UNKNOWN-8BIT" [RFC 1428] MUST be specified. If the URL to
>> be encoded contains octets in the ABNF ranges %d0-31 or %d127-255, then an
>> RFC 2047 charset parameter value of "UNKNOWN-8BIT" [RFC 1428] MUST be
>> specified.
>>       ASIDE - We could require "UNKNOWN-8BIT" in all cases and be
>>       done with it! Note, the use of "UTF-8" is problematic because
>>       in that case, 2047 decoding may be to a local character set.
>>       This is not possible in the case of "UNKNOWN-8BIT".
> The use of UTF-8 is very problematic because the octets taken out
> of an HTML document won't have anything to do with UTF-8 unless
> the HTML document itself is in UTF-8 (which is a rare case currently
> and nothing necessitating special treatment in the future).
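
The charset rule quoted from the earlier posting can be illustrated with a small sketch. It is not taken from the thread: the function names are hypothetical, the "B" (base64) encoding is an arbitrary choice between the two encodings RFC 2047 offers, and the 75-character limit on an encoded-word is ignored for brevity.

```python
import base64


def choose_charset(url_octets: bytes) -> str:
    """Pick a charset label following the proposed rule (hypothetical helper)."""
    # Only octets in %d32-126: "US-ASCII" is permitted
    # (the proposal would also allow "UNKNOWN-8BIT" here).
    if all(32 <= octet <= 126 for octet in url_octets):
        return "US-ASCII"
    # Any octet in %d0-31 or %d127-255: "UNKNOWN-8BIT" is required.
    return "UNKNOWN-8BIT"


def encode_word(url_octets: bytes) -> str:
    """Build one RFC 2047 encoded-word using the "B" encoding.

    Assumption: the whole URL fits in a single encoded-word; a real
    implementation would have to split long values.
    """
    charset = choose_charset(url_octets)
    payload = base64.b64encode(url_octets).decode("ascii")
    return f"=?{charset}?B?{payload}?="


if __name__ == "__main__":
    print(encode_word(b"http://example.com/index.html"))
    print(encode_word(b"http://example.com/caf\xe9.html"))
```

Because the label is "UNKNOWN-8BIT" whenever high octets are present, a decoder has no basis for converting the payload to a local character set and must hand back the original octets unchanged.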


> URLs are indeed moving towards UTF-8, but not in the sense that
> raw UTF-8 octets would be included e.g. in an iso-8859-1 document.
> HTML transports characters, and so an iso-8859-1 document will
> transport characters in iso-8859-1 (+others with SGML/HTML
> constructs). It is only when the URL is prepared for an HTTP
> request, or otherwise extracted from the document, that it will
> be converted (as a kind of "normalization") to UTF-8.

This is useful because it allows a "canonical" UTF-8 URL to be specified in
a local (i.e., keyboardable) character set.
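
A rough sketch of that normalization follows, under assumptions not stated in the thread: the function name is hypothetical, the document charset is taken as already known, and the resulting UTF-8 octets are percent-escaped for use in a request.

```python
from urllib.parse import quote

# Characters left untouched: URL delimiters plus "%" so that escapes
# already present in the URL are not escaped a second time.
SAFE = ":/?#[]@!$&'()*+,;=%~"


def normalize_url(raw: bytes, document_charset: str) -> str:
    """Map a URL as it appears in a document to a UTF-8 based form.

    raw              -- the URL octets as found in the HTML document
    document_charset -- the charset of that document (assumed known)
    """
    # Step 1: recover the characters the document is transporting.
    characters = raw.decode(document_charset)
    # Step 2: re-express them as UTF-8 octets and percent-escape them.
    return quote(characters, safe=SAFE, encoding="utf-8")


if __name__ == "__main__":
    # The same URL typed in an iso-8859-1 and in a UTF-8 document.
    print(normalize_url(b"http://example.com/caf\xe9", "iso-8859-1"))
    print(normalize_url(b"http://example.com/caf\xc3\xa9", "utf-8"))
```

Both calls yield the same string, which is the sense in which the UTF-8 form acts as a canonical form independent of the local character set.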

> While strictly speaking, only using "UNKNOWN-8BIT" would be
> enough for MHTML, and I guess we should indeed use only
> that if the only other alternative is "US-ASCII", I think
> it is very bad practice to drop or ignore information that
> is in many cases present and that may be put to very good
> use. So only allowing "UNKNOWN-8BIT", or only "UNKNOWN-8BIT"
> plus "US-ASCII", is not a good solution.

The advantage of specifying "UNKNOWN-8BIT" is that it is consistent with
employing an octet-by-octet comparison of a URL extracted from a text/html
root object and a decoded Content-base and/or Content-location header. If
we start specifying other character sets (including UTF-8), then this
implies a canonicalization to UTF-8 (consistent with that which you
describe above) of URLs extracted from a text/html root object: a) on
transmission, prior to encoding a Content-base and/or Content-location header,
and b) on reception, prior to a UTF-8 comparison of a URL extracted from a
text/html root object and a decoded Content-base and/or Content-location
header.
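
The difference between the two matching strategies can be sketched as below. This is an illustration only: both function names are hypothetical, and the charsets of the body and of the header are assumed to be known to the receiver.

```python
def match_octets(body_url: bytes, header_url: bytes) -> bool:
    """Octet-by-octet comparison, as implied by "UNKNOWN-8BIT".

    No charset knowledge is needed; the octets either agree or they do not.
    """
    return body_url == header_url


def match_canonical(body_url: bytes, body_charset: str,
                    header_url: bytes, header_charset: str) -> bool:
    """Comparison after canonicalizing both sides to UTF-8.

    Requires the charset of the text/html body and of the decoded header.
    """
    body_utf8 = body_url.decode(body_charset).encode("utf-8")
    header_utf8 = header_url.decode(header_charset).encode("utf-8")
    return body_utf8 == header_utf8


if __name__ == "__main__":
    in_body = b"caf\xe9.gif"        # as written in an iso-8859-1 document
    in_header = b"caf\xc3\xa9.gif"  # as carried in a UTF-8 labelled header
    print(match_octets(in_body, in_header))
    print(match_canonical(in_body, "iso-8859-1", in_header, "utf-8"))
```

The octet comparison is simpler but only matches when sender and receiver see identical octets; the canonical comparison tolerates a charset difference at the cost of a conversion step on both transmission and reception.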

