At 09.13 -0700 97-08-22, Larry Masinter wrote:
> 1738 is being updated. In the real world, people are using all kinds of
> characters as "reserved", and the truth is you're really taking a risk
> if you encode something on your own that wasn't encoded.
>
> In the interest of safety, I think you're better off *not* recommending
> using %xx encoding as a way of making illegal URLs safer. But this
> is still just implementation advice.
>
> > I assume this means that any other character, if occurring in the value
> > submitted to a mailer for a Content-Location, must be encoded either
> > using the RFC 1738 encoding method or the RFC 2047 encoding method.
>
> It is misleading to talk about "encoding a character using the
> RFC 1738 encoding method", because the RFC 1738 encoding method
> is not a character-by-character encoding. That is, you have to
> look at the whole URL and the scheme and the context of the
> character. RFC 2047 encoding, on the other hand, can be decided
> character-by-character, because it is at a different layer.
Here is a new draft text, based on your suggestions. Exclamation
marks in the margin mark changes from the previous draft text:
Handling of URLs containing inappropriate characters
Some URLs may contain characters that are inappropriate for an
RFC 822 header, either because the URL itself has an incorrect
syntax or the URL syntax has changed to allow characters not
allowed in mail headers. To include such a URL in a mail
header, an implementation can either (a) arrange so that the
URL becomes correctly formatted or (b) encode the header using
the encoding method described in RFC 2047.
Method (a) MUST be applied to the URL both in Content-
Location headers and in body text. It MUST NOT be reversed by
receiving mailers before matching hyperlinks to body parts.
Method (b) MUST NOT be applied to the URL in the HTML text and
MUST be reversed by receiving clients before comparing
hyperlinks in body text to URLs in Content-Location headers.
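As a rough illustration of the receiving side of methods (a) and (b), the following Python sketch reverses any RFC 2047 encoding in a Content-Location value and then compares the resulting octets with a hyperlink taken from the body text, leaving %xx sequences untouched. The function names and the example.com URLs are invented for this illustration, and the sketch ignores relative URL resolution.

```python
from email.header import decode_header


def content_location_octets(header_value: str) -> bytes:
    """Reverse RFC 2047 encoding (method b) and return the URL as octets."""
    octets = b""
    for chunk, _charset in decode_header(header_value):
        # decode_header returns str when no encoded-word is present and
        # bytes otherwise; the charset label is deliberately ignored.
        if isinstance(chunk, str):
            chunk = chunk.encode("ascii")
        octets += chunk
    return octets


def matches(header_value: str, href: bytes) -> bool:
    """Compare octet for octet; %xx sequences (method a) are NOT decoded."""
    return content_location_octets(header_value) == href


# Method (b): the encoded-word is reversed before the comparison.
encoded = "=?US-ASCII?Q?http://www.example.com/my=20file.gif?="
print(matches(encoded, b"http://www.example.com/my file.gif"))

# Method (a): %20 stays as it is, so it does not match a literal space.
plain = "http://www.example.com/my%20file.gif"
print(matches(plain, b"http://www.example.com/my file.gif"))
```

The first comparison succeeds and the second fails, which is the intended asymmetry: only the RFC 2047 layer is undone by the receiver.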
Method (a) is not always easy. It may include cooperation with
the user and the software which produced the faulty URL. The
encoding method of RFC 1738 can make a correct URL faulty if
not done the right way. Changing the URL of documents already
available on the Internet or an Intranet may invalidate
existing links to this document. Changing the HTML body may
! invalidate message integrity checks. For these reasons, this
! standard recommends method (b).
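To illustrate why the RFC 1738 encoding cannot be applied character by character, here is a small Python sketch. It contrasts a naive encoder, which escapes every reserved character, with one that touches only the path component. Both helper names and the example URL are assumptions made for this illustration; a real implementation would also need scheme-specific rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def naive_encode(url: str) -> str:
    """Escape every character outside the unreserved set, ignoring context."""
    return quote(url, safe="")


def encode_path_only(url: str) -> str:
    """Escape unsafe characters in the path; keep existing %xx and '/'."""
    parts = urlsplit(url)
    fixed_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, fixed_path, parts.query, parts.fragment)
    )


faulty = "http://www.example.com/my file.gif?x=1"

# The scheme separator, slashes and '?' are escaped too, so the result
# is no longer recognisable as an http URL.
print(naive_encode(faulty))

# Only the space in the path is escaped.
print(encode_path_only(faulty))
```

The naive result no longer parses as an http URL at all, which is the kind of damage the paragraph above warns about.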
! With method (b), the charset US-ASCII can be used, or, if the
URL contains octets outside of the 7-bit range, "UNKNOWN-8BIT"
[RFC 1428] or "UTF-8" may be appropriate. Note that for MHTML
processing (matching of URLs in body text to URLs in Content-
Location headers) the choice of character encoding need not be
the "correct" choice; it need only be a choice which, after
reversal of the encoding by the receiving mailer, returns the
same octet string as before the encoding.
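The round-trip property can be checked with a short Python sketch. The hand-written Q encoder below is a simplification made for this illustration: it does not split long URLs into several encoded-words, and the helper names and the example URL are hypothetical. Decoding uses the standard library's email.header.decode_header, which returns the raw octets without interpreting the charset label.

```python
from email.header import decode_header

# Octets left literal inside a Q-encoded word; everything else becomes =XX.
_LITERAL = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/"
)


def q_encode(octets: bytes, charset_label: str) -> str:
    """Wrap the octets in a single RFC 2047 Q-encoded word."""
    body = "".join(chr(b) if b in _LITERAL else "=%02X" % b for b in octets)
    return "=?%s?Q?%s?=" % (charset_label, body)


def q_decode(encoded_word: str) -> bytes:
    """Reverse the encoding; expects input that contains an encoded-word."""
    return b"".join(chunk for chunk, _charset in decode_header(encoded_word))


# A URL with 8-bit octets (Latin-1 here, but the sender may not know that).
url = b"http://www.example.com/r\xe4ksm\xf6rg\xe5s.gif"

for label in ("UNKNOWN-8BIT", "UTF-8", "ISO-8859-1"):
    # Whatever label is chosen, the same octet string comes back.
    assert q_decode(q_encode(url, label)) == url
```

Because the comparison is made on octets, a wrong charset label changes nothing for hyperlink matching; it would only matter to software that tries to display the URL as text.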
------------------------------------------------------------------------
Jacob Palme (Stockholm University and KTH)
for more info see URL: http://www.dsv.su.se/~jpalme