MHTML Archives

Subject: Re: More on wrongly(?) formatted urls
From: Martin J. Dürst <[log in to unmask]>
Reply-To: IETF working group on HTML in e-mail <[log in to unmask]>
Date: Mon, 25 Aug 1997 15:01:15 +0200

On Mon, 25 Aug 1997 [log in to unmask] wrote:

> The encoding of URLs that cannot be represented legally without encoding in
> MIME Content-Base or Content-Location headers MUST employ the encoding
> method described in RFC 2047. If the URL to be encoded contains only octets
> in the ABNF range %d32-126 then an RFC 2047 charset parameter value of
> "US-ASCII" or "UNKNOWN-8BIT" [RFC 1428] MUST be specified. If the URL to be
> encoded contains octets in the ABNF ranges %d0-31 or %d127-255, then an RFC
> 2047 charset parameter value of "UNKNOWN-8BIT" [RFC 1428] MUST be
> specified.
>       ASIDE - We could require "UNKNOWN-8BIT" in all cases and be
>       done with it! Note, the use of "UTF-8" is problematic because
>       in that case, 2047 decoding may be to a local character set.
>       This is not possible in the case of "UNKNOWN-8BIT".
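
To make the quoted rule concrete, here is a minimal sketch of the charset
selection it describes. The function names are hypothetical, the choice of
the "B" (base64) encoding is an assumption made for brevity, and the sketch
ignores the 75-character limit that RFC 2047 places on each encoded-word.

```python
import base64


def choose_charset(url_octets: bytes) -> str:
    """Pick the charset label required by the quoted draft rule."""
    # Only octets in %d32-126: "US-ASCII" is allowed
    # ("UNKNOWN-8BIT" would also be legal in this case).
    if all(32 <= octet <= 126 for octet in url_octets):
        return "US-ASCII"
    # Any octet in %d0-31 or %d127-255: the rule demands "UNKNOWN-8BIT".
    return "UNKNOWN-8BIT"


def encode_url_header_value(url_octets: bytes) -> str:
    """Wrap the raw URL octets in a single RFC 2047 encoded-word."""
    charset = choose_charset(url_octets)
    payload = base64.b64encode(url_octets).decode("ascii")
    return f"=?{charset}?B?{payload}?="
```

Note that a URL containing a raw iso-8859-1 octet such as 0xFC ends up
labelled "UNKNOWN-8BIT" under this rule, so the fact that the document was
in iso-8859-1 is lost.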

The use of UTF-8 is very problematic because the octets taken out
of an HTML document won't have anything to do with UTF-8 unless
the HTML document itself is in UTF-8 (which is currently rare, and
which would not need special treatment in the future anyway).

URLs are indeed moving towards UTF-8, but not in the sense that
raw UTF-8 octets would be included e.g. in an iso-8859-1 document.
HTML transports characters, and so an iso-8859-1 document will
transport characters in iso-8859-1 (+others with SGML/HTML
constructs). It is only when the URL is prepared for an HTTP
request, or otherwise extracted from the document, that it will
be converted (as a kind of "normalization") to UTF-8.
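
A rough sketch of that conversion step, assuming the URL octets are taken
from a document whose charset is known. The function name, the set of
characters left unescaped, and the example.org URL are illustrative
assumptions, not part of any specification.

```python
from urllib.parse import quote

# Characters left unescaped (an assumption for this sketch): URL delimiters
# plus "%", so that existing escapes are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%~"


def url_for_request(raw_octets: bytes, document_charset: str) -> str:
    """Turn URL octets from a document into a UTF-8 based request URL."""
    # Step 1: the document transports characters, so decode the octets
    # using the document's own charset.
    text = raw_octets.decode(document_charset)
    # Step 2: "normalize" to UTF-8 and percent-escape the non-ASCII octets.
    return quote(text, safe=SAFE, encoding="utf-8")


octets = "http://example.org/Dürst".encode("iso-8859-1")
print(url_for_request(octets, "iso-8859-1"))
```

The same octets could not be labelled "UTF-8" directly: the single octet
0xFC that stands for "ü" in iso-8859-1 is not a valid UTF-8 sequence.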

Strictly speaking, using only "UNKNOWN-8BIT" would be enough for
MHTML, and I guess we should indeed use only that if the only
other alternative is "US-ASCII". But I think it is very bad
practice to drop or ignore information that is present in many
cases and that may be put to very good use. So allowing only
"UNKNOWN-8BIT", or only "UNKNOWN-8BIT" plus "US-ASCII", is not
a good solution.

Regards,        Martin.
