On Mon, 25 Aug 1997 [log in to unmask] wrote:
> The encoding of URLs that cannot be represented legally without encoding in
> MIME Content-Base or Content-Location headers MUST employ the encoding
> method described in RFC 2047. If the URL to be encoded contains only octets
> in the ABNF range %d32-126 then an RFC 2047 charset parameter value of
> "US-ASCII" or "UNKNOWN-8BIT" [RFC 1428] MUST be specified. If the URL to be
> encoded contains octets in the ABNF ranges %d0-31 or %d127-255, then an RFC
> 2047 charset parameter value of "UNKNOWN-8BIT" [RFC 1428] MUST be
> specified.
> ASIDE - We could require "UNKNOWN-8BIT" in all cases and be
> done with it! Note, the use of "UTF-8" is problematic because
> in that case, 2047 decoding may be to a local character set.
> This is not possible in the case of "UNKNOWN-8BIT".
The use of UTF-8 is very problematic because the octets taken out
of an HTML document won't have anything to do with UTF-8 unless
the HTML document itself is in UTF-8 (which is currently a rare
case, and one that would need no special treatment in the future).
URLs are indeed moving towards UTF-8, but not in the sense that
raw UTF-8 octets would be included e.g. in an iso-8859-1 document.
HTML transports characters, and so an iso-8859-1 document will
transport characters in iso-8859-1 (+others with SGML/HTML
constructs). It is only when the URL is prepared for an HTTP
request, or otherwise extracted from the document, that it will
be converted (as a kind of "normalization") to UTF-8.
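As a rough illustration of the difference between the octets in
the document and the octets after conversion, here is a small
Python sketch. The URL is a made-up example, and escaping the
UTF-8 octets as %HH is my assumption about how the converted
form would be written out; nothing here is prescribed by the draft.

```python
from urllib.parse import quote

# Hypothetical URL containing one non-ASCII character (e with acute accent).
url_chars = "http://example.org/caf\u00e9"

# Octets as they would appear inside an iso-8859-1 HTML document.
latin1_octets = url_chars.encode("iso-8859-1")

# Octets after conversion of the same characters to UTF-8.
utf8_octets = url_chars.encode("utf-8")

# Assumed final step: write the UTF-8 octets as %HH escapes.
normalized = quote(url_chars, safe=":/", encoding="utf-8")

print(latin1_octets)
print(utf8_octets)
print(normalized)
```

The same character is one octet (0xE9) in the document and two
octets (0xC3 0xA9) after conversion, so the raw document octets
cannot simply be labeled as UTF-8.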
Strictly speaking, using only "UNKNOWN-8BIT" would be enough
for MHTML, and I guess we should indeed use only that if the
only other alternative is "US-ASCII". But I think it is very
bad practice to drop or ignore information that is in many
cases present and that may be put to very good use. So allowing
only "UNKNOWN-8BIT", or only "UNKNOWN-8BIT" plus "US-ASCII",
is not a good solution.
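To make the alternative concrete, here is a minimal Python sketch
of one possible reading: label the encoded-word with the
document's charset when it is known, and fall back to the rule
from the quoted draft text otherwise. The function names, the
example URL, and the choice of the "B" encoding are all my
assumptions, and the sketch ignores the RFC 2047 limit of 75
characters per encoded-word.

```python
import base64
from typing import Optional


def fallback_charset(octets: bytes) -> str:
    """Charset label per the quoted draft rule, for when nothing better is known."""
    # Only octets in %d32-126: "US-ASCII" is allowed (as is "UNKNOWN-8BIT").
    if all(32 <= b <= 126 for b in octets):
        return "US-ASCII"
    # Any octet in %d0-31 or %d127-255: "UNKNOWN-8BIT" is required.
    return "UNKNOWN-8BIT"


def encode_url_word(octets: bytes, document_charset: Optional[str] = None) -> str:
    """Build a single RFC 2047 encoded-word using the "B" (base64) encoding."""
    # Keep the charset information if the document provides it (hypothetical policy).
    charset = document_charset or fallback_charset(octets)
    payload = base64.b64encode(octets).decode("ascii")
    return f"=?{charset}?B?{payload}?="


# Hypothetical URL octets taken from an iso-8859-1 document.
octets = "http://example.org/caf\u00e9".encode("iso-8859-1")

labeled = encode_url_word(octets, "ISO-8859-1")  # charset information kept
unlabeled = encode_url_word(octets)              # falls back to UNKNOWN-8BIT

print(labeled)
print(unlabeled)
```

Both forms carry the same octets, but only the labeled one tells
the receiver which characters those octets stand for.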