On Tue, 19 Aug 1997, Jacob Palme wrote:
> I wrote:
> > (a) Use the encoding scheme described in RFC 1738 [URL].
> > If this method is used, the corresponding URL in the HTML
> > text must also be changed with the same encoding. This has
> > the disadvantage that an URL which could be used for direct
> > network retrieval will not work any more, that the HTML
> > text may not any more agree with the corresponding document
> > on the net, and that electronic seals may not work any more.
> > Warning: RFC 1738 encoding may change the meaning of an
> > URL. For example: "one/two%2ethree" is not the same
> > URL as "one%2etwo%2ethree".
> > (b) If the URL is illegal, inform the user and ask the
> > user to correct it (in both the HTML text and the URL of
> > the object it refers to).
> > (c) Use the encoding method for message headers described
> > in RFC 2047. As long as there are no 8-bit octets, the
> > charset value "US-ASCII" MUST be used. For URLs containing
> > 8-bit octets, the original character encoding (charset)
> > SHOULD be used if it is known without doubt. Otherwise,
> > the charset value "UNKNOWN-8BIT" (RFC 1428, MIBenum 2079)
> > MUST be used.
> At 12.14 +0200 97-08-19, Martin J. Dürst replied:
> > Why do we need three methods? I have not heard from anyone
> > at the meeting that all are needed. I guess we can keep
> > implementations simpler if we define only one method.
> First note that all this discussion is only on how to handle
> illegal URLs. URLs which have the permitted URL syntax according
> to RFC 1738 will never need any further encoding.
Yes and no. URL syntax may change in the future.
> Method (c) means that you allow illegal URLs, like URLs containing
> the space character which is not allowed according to RFC 1738.
> My feeling is that we should not recommend as the only method
> to use, a method which means you send illegal URLs, when there
> are two methods, (a) and (b), which means you send correctly
> formatted URLs.
They are correctly formatted. But their semantics may have
gone wrong. See below for why that could happen.
> If only one of method (a) and (b) is to be recommended, I
> would prefer method (b), since a user who has produced faulty
> URLs may prefer to get warned about this. However, method
> (a) has the advantage, in those cases where it does not
> corrupt the URL, that it is automatic (no trouble for the
> user) and produces legal URLs.
There are two different situations, the one where the mail
sender is the same as the HTML author, and the other where
they are different. As we are concerned with HTML over Mail,
I think we should abstract from this difference. A good
HTML authoring tool will most probably not let the user
produce illegal URLs, or at least warn him. But that's not
our concern. It is the concern of vendors and other software
producers to organize things nicely, but we can't assume
any scenario. Method (b) might be what the user perceives
in a tightly integrated package of an HTML editor and a mail
UA, but we shouldn't prescribe it because it only covers part
of our usage scenarios and, for the protocol we are concerned
with, it is a user interface issue. It's about the same as if
we had an additional method (0) saying: try to make sure
somehow you never get illegal URLs.
Now back to method (a). It produces syntactically legal
URLs. But syntactic legality is only half of the job.
The URL can get corrupted in that process, and that's why
method (a) should go away. There are two cases that can
cause this:
1) Reserved characters: The distinction between %2F and /
is in many cases crucial.
2) 8-bit octets: The assumed practice of taking the 8-bit
octets as they appear in the HTML text and converting
them to %HH may work in some limited cases, but is
actually severely broken. If you store a Cyrillic
page in KOI-8 and later convert it to ISO-8859-5,
the above assumed practice will produce two different
URLs for the same resource. At least one of them will
fail. The ultimately correct thing here is to take
the characters in the URL, convert them to UTF-8,
and then use the %HH escaping. It is rather probable
that HTML 4.0, in its final version, will prescribe
such behaviour (with the necessary/possible precautions
for backwards compatibility). Method (a) under these
considerations will just produce more headaches.
Method (c) has the best chances of a recipient
being able to reach the original URL.
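
To make the two cases concrete, here is a minimal sketch in Python. It is only an illustration, not part of any proposal: the path segment "файл" and the helper segments() are made up for the example, and "koi8_r" stands in for the KOI-8 family of charsets.

```python
from urllib.parse import quote, unquote

# Case 1: reserved characters. "%2F" and "/" are not interchangeable.
def segments(url_path):
    """Split a URL path on literal "/" first, then unescape each segment.
    (Hypothetical helper, written only for this illustration.)"""
    return [unquote(s) for s in url_path.split("/")]

escaped_slash = segments("one%2Ftwo/three")   # two segments
literal_slash = segments("one/two/three")     # three segments

# Case 2: 8-bit octets. Escaping the raw octets of the page gives a
# result that depends on the charset the page happens to be stored in.
path = "файл"  # hypothetical Cyrillic path segment

from_koi8 = quote(path.encode("koi8_r"))      # page stored in KOI-8
from_iso = quote(path.encode("iso8859_5"))    # same page in ISO-8859-5
from_utf8 = quote(path.encode("utf-8"))       # convert to UTF-8, then %HH

print(escaped_slash, literal_slash)
print(from_koi8, from_iso, from_utf8)
```

The first two escaped forms differ although the characters are identical, so at most one of them can match what the server expects. Going through UTF-8 gives one form regardless of how the page is stored.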
Because of these problems with method (a), I propose to only
have method (c). I'm sorry I haven't given the full reasoning