> (a) Use the encoding scheme described in RFC 1738 [URL].
> If this method is used, the corresponding URL in the HTML
> text must also be changed with the same encoding. This has
> the disadvantage that an URL which could be used for direct
> network retrieval will not work any more, that the HTML
> text may not any more agree with the corresponding document
> on the net, and that electronic seals may not work any more.
> Warning: RFC 1738 encoding may change the meaning of an
> URL. For example: "one/two%2ethree" is not the same
> URL as "one%2etwo%2ethree".
> (b) If the URL is illegal, inform the user and ask the
> user to correct it (in both the HTML text and the URL of
> the object it refers to).
> (c) Use the encoding method for message headers described
> in RFC 2047. As long as there are no 8-bit octets, the
> charset value "US-ASCII" MUST be used. For URLs containing
> 8-bit octets, the original character encoding (charset)
> SHOULD be used if it is known without doubt. Otherwise,
> the charset value "UNKNOWN-8BIT" (RFC 1428, MIBenum 2079)
> MUST be used.
At 12.14 +0200 97-08-19, Martin J. Dürst replied:
> Why do we need three methods? I have not heard from anyone
> at the meeting that all are needed. I guess we can keep
> implementations simpler if we define only one method.
First note that all this discussion is only on how to handle
illegal URLs. URLs which have the permitted URL syntax according
to RFC 1738 will never need any further encoding.
Method (c) means that you allow illegal URLs, like URLs containing
the space character which is not allowed according to RFC 1738.
My feeling is that we should not recommend, as the only method
to use, a method which means you send illegal URLs, when there
are two methods, (a) and (b), which mean you send correctly
formatted URLs.
If only one of methods (a) and (b) is to be recommended, I
would prefer method (b), since a user who has produced faulty
URLs may prefer to be warned about this. However, method
(a) has the advantage, in those cases where it does not
corrupt the URL, that it is automatic (no trouble for the
user) and produces legal URLs.
Methods (a) and (b) are actually on a different layer than
method (c). Methods (a) and (b) are ways of ensuring that
the URLs used are correct according to RFC 1738, while method
(c) is a way of coping with URLs which are not correct and
cannot be made correct for some reason.
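
To make the first layer concrete, here is a minimal sketch in Python of
what methods (a) and (b) might look like in a mailer. It is an
illustration only, not part of the draft. The function names, the
default charset, and the decision to accept every RFC 1738 reserved
character regardless of scheme are my assumptions. illegal_positions()
supports method (b) by telling the caller where to point the user, and
percent_encode_illegal() is method (a).

```python
import re

# Characters RFC 1738 allows unencoded: alphanumerics, the "safe" and
# "extra" sets, and the reserved set ";/?:@&=". Simplification (an
# assumption of this sketch): reserved characters are accepted
# everywhere, whatever the scheme.
_ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "$-_.+!*'(),"
    ";/?:@&="
)
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def illegal_positions(url):
    """Method (b) helper: indexes of characters that are not legal as written."""
    bad = []
    i = 0
    while i < len(url):
        if _ESCAPE.match(url, i):
            i += 3  # an existing %XX escape is legal, skip it whole
            continue
        if url[i] not in _ALLOWED:
            bad.append(i)
        i += 1
    return bad


def percent_encode_illegal(url, charset="utf-8"):
    """Method (a): %-encode only the illegal characters.

    Existing %XX escapes and reserved characters are left untouched, so
    the sketch never re-encodes something that was already legal. The
    charset used for non-ASCII characters is a hypothetical default.
    """
    out = []
    i = 0
    while i < len(url):
        if _ESCAPE.match(url, i):
            out.append(url[i:i + 3])
            i += 3
            continue
        ch = url[i]
        if ch in _ALLOWED:
            out.append(ch)
        else:
            out.extend("%%%02X" % b for b in ch.encode(charset))
        i += 1
    return "".join(out)
```

A mailer using this sketch would call illegal_positions() first. If the
list is empty the URL needs no further treatment. Otherwise it can either
ask the user, or apply percent_encode_illegal() to both the
Content-Location value and the URL in the HTML text.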
Here is a new draft text, which more clearly shows that this
is a matter of several layers:
URLs in Content-Location and Content-Base headers SHOULD
have the permitted syntax for URLs according to RFC 1738.
In particular, this means that many characters, for example
SPACE, SHOULD be encoded using the % method specified in
RFC 1738.
In certain cases, a mailer may be provided with HTML text
containing wrongly formatted URLs, for example containing
unencoded SPACE characters, in hyperlinks to other body
parts. In such cases, the mailer has several options.
One option is to correct the URL in both the Content-Location
and the HTML text. Such a correction can sometimes be done
automatically, but may in some cases require communication
with the user. In some cases this option is not suitable,
because it requires rewriting of the original HTML text
and it may cause the Content-Location value to be different
from the location from where an object is actually available
on the Internet. In such cases, an alternative option is
to keep the HTML text as it is, and encode the Content-
Location header using the encoding methods from RFC 2047.
RFC 2047 encoding must therefore be reversed by recipients
before comparing URLs in Content-Location headers with
URLs in HTML text bodies. RFC 1738 encoding, however, should
not be reversed before this comparison.
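
The second layer can be sketched in the same way. The Python below is
only an illustration under stated assumptions: the function names are
hypothetical, the charset choice follows the rule quoted under (c)
earlier in this message, and the sketch emits a single Q-encoded word
without splitting it to respect the RFC 2047 length limit on
encoded-words, which a real mailer would have to do for long URLs. It
uses header_encode from the standard email.quoprimime module and
decode_header from email.header.

```python
from email.header import decode_header
from email.quoprimime import header_encode


def pick_charset(raw, known_charset=None):
    """Charset label for the encoded-word, following rule (c)."""
    if all(b < 0x80 for b in raw):
        return "US-ASCII"
    # 8-bit octets present: use the real charset only if the caller
    # is sure of it, otherwise fall back to UNKNOWN-8BIT.
    return known_charset or "UNKNOWN-8BIT"


def encode_content_location(raw, known_charset=None):
    """Sender side: wrap the raw URL octets in one RFC 2047 Q-encoded word."""
    return header_encode(raw, charset=pick_charset(raw, known_charset))


def decode_content_location(value):
    """Recipient side: reverse the RFC 2047 encoding, returning raw octets.

    Octets are returned rather than text because the charset may be
    UNKNOWN-8BIT, which cannot be decoded to characters.
    """
    parts = []
    for chunk, _charset in decode_header(value):
        parts.append(chunk if isinstance(chunk, bytes) else chunk.encode("ascii"))
    return b"".join(parts)


def same_url(header_value, html_url_octets):
    """Compare after undoing RFC 2047 only; %XX escapes are NOT decoded."""
    return decode_content_location(header_value) == html_url_octets
```

With this sketch, a Content-Location built from the octets of
"http://example.com/my file.html" matches an HTML link containing the
unencoded space, but does not match a link written as
"http://example.com/my%20file.html", since the RFC 1738 encoding is
deliberately left in place during the comparison.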
The text I have written above is certainly not easier to
understand, but it is more correct and more clearly pinpoints
the difference between the two encoding methods.
Jacob Palme (Stockholm University and KTH)
for more info see URL: http://www.dsv.su.se/~jpalme