On Tue, 19 Aug 1997, Larry Masinter wrote:
Many thanks to Larry for the careful wording.
I wonder whether we shouldn't exchange (a) and (b), because
it seems that the current (b) is the method of choice.
[(a) is %HH encoding with its difficulties, (b) is RFC 2047]
> (b) Use the encoding method for message headers described
> in RFC 2047, using either a charset value of "US-ASCII",
> or, if the URL contains octets outside of the 7-bit range,
> "UNKNOWN-8BIT" [RFC 1428], or "UTF8", as appropriate.
First, please note that it is "UTF-8". Many thanks to Larry for
bringing in UTF-8 for URLs here. I wish we would already have all URLs
(with such octets) in UTF-8, and could just write it that way.
But UTF-8 URLs are not supposed to work by just labeling them as
UTF-8. If an HTML document, encoded say in KOI-8, contains 8-bit
octets in a URL, then these octets have to be
interpreted in KOI-8, and not in UTF-8. HTML transports characters,
and these may undergo transcoding, e.g. in proxies or by cut-and-paste.
So a URL extracted from an HTML document labeled as KOI-8 should
itself be labeled as KOI-8. Labeling as UTF-8 is only appropriate
if the URL is indeed transcoded to UTF-8 (but then comparison between
the URL inside the KOI-8 document and the URL encoded with UTF-8 will
be quite difficult). So here the earlier text by Jacob should be used.
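
To make the point concrete, here is a minimal sketch in Python. It is
only an illustration under assumptions: the path segment "путь" is a
made-up example, and "koi8-r" is used as the concrete charset behind
what is called KOI-8 above. It shows that the same characters give
different octets in the two encodings, that relabeling KOI-8 octets
as UTF-8 fails, and that an RFC 2047 encoded-word labeled with the
actual charset round-trips.

```python
from email.header import Header, decode_header
from urllib.parse import quote

# Hypothetical URL path segment containing Cyrillic characters.
path = "путь"

# Assumption: "koi8-r" stands in for the KOI-8 charset of the document.
koi8_octets = path.encode("koi8-r")  # octets as found in the KOI-8 document
utf8_octets = path.encode("utf-8")   # octets after real transcoding to UTF-8

# Same characters, different octets: octet-wise comparison fails.
octets_differ = koi8_octets != utf8_octets

# Merely labeling the KOI-8 octets as UTF-8 does not make them UTF-8.
try:
    koi8_octets.decode("utf-8")
    relabel_ok = True
except UnicodeDecodeError:
    relabel_ok = False

# The %HH forms (method (a)) differ in the same way.
koi8_escaped = quote(path, encoding="koi8-r")
utf8_escaped = quote(path, encoding="utf-8")

# Method (b): an RFC 2047 encoded-word labeled with the actual charset.
encoded_word = Header(path, charset="koi8-r").encode()
raw, label = decode_header(encoded_word)[0]
round_trip = raw.decode(label)  # the label tells the reader how to decode

print(koi8_escaped)
print(utf8_escaped)
print(encoded_word)
```

With the charset label carried along, the receiver can recover the
characters; without it, the two %HH strings look like unrelated URLs.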