MHTML Archives


Subject: Re: More on wrongly(?) formatted urls
From: Martin J. Dürst
Reply-To: IETF working group on HTML in e-mail
Date: Mon, 18 Aug 1997 20:53:26 +0200


This is some detail left over from the MHTML meeting in Munich.

The problem of transporting URLs in mail headers was discussed.
For cases where the URL syntax conflicts with the syntax for
mail headers, in particular when the URL contains characters
that are illegal in a header, such as space, the proposal by
Ed Levinson (excerpts below) was accepted, as far as I remember.

The advantage of this proposal is that it does not interfere
with URL encoding (%HH). If %HH were also used to protect the
header, the two layers of encoding would in some cases be
difficult to disentangle.

When Ed originally proposed his idea, I had some concerns
regarding 8-bit octets (there are no 8-bit characters in
URLs!). During the meeting, Harald Alvestrand helped to
sort some of this out by pointing me to an interesting RFC.

On Thu, 24 Jul 1997, Ed Levinson wrote:

> URLs that have characters inappropriate for an 822 header,
> SPACE, CTLs, double quotes, backslashes, and 8-bit characters
> (RFC 2017, URL access-type) be converted into RFC 2047 (Message
> Header Extensions) encoded words using the Q encoding.
> Using an encoding different from the URL %hh encoding eliminates
> the ambiguity that Larry Masinter pointed out.
> is it
> when converted to a Content-Location: becomes, under %hh encoding
>         Content-Location:
> and then decoded back is
> is it
> Remember?  On "abc/def" is a file name, not two
> directories.
> Using 2047 the Content-Location becomes
>   Content-Location:
>       =?us-ascii?q?
> which decodes to the exact string we started with.
> Doing this makes mhtml blind to miscoded and misguided URLs.  As Stef
> delights in saying, "it becomes somebody else's problem" ;-).
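
To illustrate the round trip Ed describes, here is a minimal Python
sketch. The function name q_encode_url, the SAFE set, and the
example.com URL are assumptions made up for illustration and are not
part of the proposal; the URLs from Ed's original example are not
reproduced here. The encoder writes every octet outside a deliberately
small safe set as =HH, which is stricter than RFC 2047 requires, and it
ignores the 75-character limit on encoded words, so a long URL would
still have to be split into several encoded words.

```python
from email.header import decode_header

# Octets copied through literally. Everything else (SPACE, CTLs, '"',
# '\\', '=', '?', '_', '%', ':', '.', 8-bit octets, ...) becomes =HH.
# This set is an illustrative, conservative choice.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789!*+-/"
)

def q_encode_url(octets: bytes, charset: str = "us-ascii") -> str:
    """Wrap the URL octets in a single RFC 2047 Q-encoded word."""
    body = "".join(chr(b) if b in SAFE else f"={b:02X}" for b in octets)
    return f"=?{charset}?q?{body}?="

# Hypothetical URL containing both a %hh escape and a raw SPACE.
url = b"http://example.com/abc%2Fdef file"
word = q_encode_url(url)

# Decoding with the standard library returns the original octets,
# with the %2F escape untouched.
decoded, charset = decode_header(word)[0]
```

Because "%" itself is written as =25, the %hh escapes inside the URL
pass through the header unchanged and are never decoded by the mail
layer, which is the point of using a second, different encoding.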

As long as the URL only contains 7-bit octets, the use of US-ASCII
as the charset in the RFC 2047 encoding is the best choice. The
question is: What to use in the case of 8-bit octets?
If the URL comes from an HTML file with well-established character
encoding (charset), that charset can be used. But what do we do
if the charset cannot be established? For this, Harald pointed me
to RFC 1428, "Transition of Internet Mail from Just-Send-8 to
8bit-SMTP/MIME", by G. Vaudreuil, which defines "UNKNOWN-8BIT".
This is registered as MIBenum 2079 in the IANA registry (Harald:
an explanatory comment such as "Not a real charset, use when
charset is not known." would be a very nice improvement to the
charset registry).

I therefore propose the following text for the new version of
the MHTML spec (adapted from Ed above):

   URLs that contain characters or octets inappropriate for an
   822 header (SPACE, CTLs, double quotes, backslashes, 8-bit
   octets, and so on) MUST be encoded using the method for
   message headers described in RFC 2047. As long as there
   are no 8-bit octets, the charset value "US-ASCII" MUST be used.
   For URLs containing 8-bit octets, the original character encoding
   (charset) SHOULD be used if it is known without doubt. Otherwise,
   the charset value "UNKNOWN-8BIT" (RFC 1428, MIBenum 2079) MUST
   be used.

NOTE: For MHTML processing (URL matching), the charset value is
        irrelevant, but it may be relevant for other operations
        on the URL.
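
The charset selection rule in the proposed text can be sketched in
Python as follows. The function name choose_charset and its
known_charset parameter are hypothetical: known_charset stands for a
charset that has been established without doubt (for example from the
HTML file the URL came from), and None means it could not be
established.

```python
from typing import Optional

def choose_charset(url_octets: bytes,
                   known_charset: Optional[str] = None) -> str:
    """Pick the charset label for the RFC 2047 encoded word."""
    # No 8-bit octets: US-ASCII MUST be used.
    if all(b < 0x80 for b in url_octets):
        return "US-ASCII"
    # 8-bit octets and a reliably known charset: it SHOULD be used.
    if known_charset is not None:
        return known_charset
    # 8-bit octets, charset not established: fall back to UNKNOWN-8BIT.
    return "UNKNOWN-8BIT"
```

Since the charset value is irrelevant for URL matching, a receiver
following this sketch would compare the decoded octets only and ignore
the label returned here.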

Hope this helps,                Martin.
