On Wed, 20 Aug 1997, Jacob Palme wrote:
> At 12.55 +0200 97-08-20, Martin J. Dürst wrote:
> > > First note that all this discussion is only on how to handle
> > > illegal URLs. URLs which have the permitted URL syntax according
> > > to RFC 1738 will never need any further encoding.
> > Yes and no. URL syntax may change in the future.
> Perhaps. But it would surprise me if the URL syntax changes in ways
> to allow unusual encoded characters. The present syntax very carefully
> is designed to only allow a common subset of ASCII, so as to ensure
> that URLs can be rendered anywhere in the world and can easily be
> copied by hand.
I don't want to go into too many details here, but it can be a lot
easier for a Japanese user to copy a Japanese URL for a Japanese
document (one that is already named in Japanese on the local system)
by hand than to have to use ASCII for this. You can substitute many
other languages for "Japanese". The %HH encoding will continue to be
available where it is really needed.
> If the URL syntax is extended, for example to allow
> national characters, these will probably be encoded in the URL string,
> so that no further encoding is needed.
It may indeed be that the formal URL syntax standard continues, for
quite some time, to treat only %HH-escaped URLs as correct URLs.
But as I said, HTML 4.0, as well as XML, may define that if
national characters are used in places where a URL is expected,
they are shortcuts for the %HH-escaped UTF-8 encoding of those
characters. This way the URL syntax itself won't change; HTML
will merely have a special and very convenient shortcut convention
for certain URLs. As a consequence, MHTML may have to deal with
things that are not legal URLs, but are perfectly legal
shortcuts for URLs inside HTML.
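To illustrate the shortcut convention I mean (no standard has fixed the
exact rule yet, so this is only a sketch): a national character written
where a URL is expected would stand for the %HH-escaped form of its
UTF-8 octets.

```python
import urllib.parse

# Illustration only: a non-ASCII character appearing in an href would
# be read as shorthand for the %HH-escaped form of its UTF-8 octets.
# U+4F8B ("example" in Japanese/Chinese) is the three octets E4 BE 8B
# in UTF-8.
char = "\u4f8b"
escaped = urllib.parse.quote(char.encode("utf-8"))
print(escaped)  # %E4%BE%8B
```

Note that the expansion goes through UTF-8 regardless of the charset the
HTML document itself happens to be stored in; that is exactly what makes
the resulting URL independent of the document encoding.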
> It is not common practice in IETF to make standards for what might
> be needed, perhaps, some time in the future. That is the OSI method
> of standards development, which has proven to be unsuccessful. The
> IETF view is to standardise only what we need and understand just
This is definitely an important point. But please note two things:
(1) ISO methods would not even allow dealing with
    illegal URLs in the first place. They are illegal, so
    they are supposed not to exist, so there is no need
    to deal with them. That we deal with them here is a
    peculiarity, and an advantage, of the IETF process.
(2) The fact that these illegal URLs exist shows a current need
    for them. They are not needed at some point in the future;
    they are used now. Because it works in local situations
    (intranets and sometimes national/regional contexts), people
    just use it without becoming aware that it is illegal
    or that it does not work worldwide. But we understand
    that (if you or somebody else doesn't, please tell me,
    and I will expand). So I see it as our responsibility,
    when we discuss various methods to solve a problem
    we have, to evaluate their advantages and disadvantages
    against our current understanding.
> > > Method (c) means that you allow illegal URLs, like URLs containing
> > > the space character which is not allowed according to RFC 1738.
> > > My feeling is that we should not recommend as the only method
> > > to use, a method which means you send illegal URLs, when there
> > > are two methods, (a) and (b), which means you send correctly
> > > formatted URLs.
> > They are correctly formatted. But maybe their semantics got
> > wrong. See below for why that could happen.
> No, RFC 1738 clearly specifies that for example SPACE characters
> are not allowed in URLs. An URL which contains a SPACE, which is
> not encoded as %20, is thus not in agreement with RFC 1738.
I didn't say anything else. I said that after using method (a)
or (b), the URLs are correctly formatted, but their semantics
may be wrong or lost.
The problem is that because of the specifics of MHTML, namely
transporting one standard over another standard, we have to
live both with differences between those standards (as we have
seen, e.g., in the case of line endings) and with differences
between the standards and actual implementations.
What counts most for us is to preserve the workings of HTML
across mail. Trying to force actual HTML to conform better to
standards is futile. Trying to correct some things that don't
conform to some standard is worthwhile to attempt,
but if it breaks functionality, as proposal (a) currently
risks doing, we had better go with a working solution than
with a conforming solution.
> > Method (b) might be what the user perceives
> > in a tightly integrated package of an HTML editor and a mail
> > UA, but we shouldn't prescribe it because it only covers part
> > of our usage scenarios and for the protocol we are concerned
> > with is an user interface issue.
> It was not my intention to prescribe. My intention was to list
> three different methods, and allow implementors to choose any
> or all of them.
This is a worthwhile idea. But proposing more methods than
necessary may only increase implementation complexity. And
proposing a method that does not work is not something we
should do. I admit that I wasn't fully aware of the
problems of method (a), and so I don't blame anybody else
for not having been aware of them; but now that I am aware
of the problems, I clearly think we should not propose it.
> > Now back to method (a). It produces syntactically legal
> > URLs. But syntactic legality is only half of the job.
> > The URL can get corrupted in that process, and that's why
> > method (a) should go away. There are two cases that can
> > happen:
> > 1) Reserved characters: The distinction between %2F and /
> > is in many cases crucial.
> "/" is not a forbidden character in an URL. It is allowed,
> properly used, and the mailer thus need not encode it further.
> I cannot see any reason for a mailer which encounters a "/"
> in an URL to assume that the "/" is there otherwise than
> as defined in the URL syntax.
Okay. SPACE is not a legal URL character, so it cannot be
used as a reserved character, and we can therefore safely
encode it. "/" is not forbidden in URLs, and I assume it is
not forbidden in mail headers either, so we do not need to
encode it and can therefore preserve the distinction between
"/" and %2F in the original URL. The question then is: are
there characters that are legal URL characters, are reserved
[or may in the future be used in some place as reserved], and
cannot appear in a mail header? The currently reserved
characters are:
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
If there is no overlap, there is no problem here, and I
currently don't see one.
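To make the point concrete, a small sketch (Python, purely as an
illustration; the paths are made up) of why SPACE is always safe to
escape while "/" versus %2F must be left alone:

```python
import urllib.parse

# SPACE is not a legal URL character, so escaping it cannot change
# the meaning of any conforming URL.
print(urllib.parse.quote("a file.html", safe="/"))  # a%20file.html

# "/" is reserved: it separates path segments. An escaped %2F is a
# literal slash *inside* one segment, so these name different things
# and an encoder must not conflate them.
path_a = "dir/file"    # two segments: "dir" and "file"
path_b = "dir%2Ffile"  # one segment that contains a slash
print(urllib.parse.unquote(path_b))  # dir/file -- same octets after
                                     # decoding, different structure
```

This is exactly why a mailer that blindly escaped every "/" it saw
would corrupt the URL: the escaping step is not reversible once the
reserved/unreserved distinction is lost.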
> > 2) 8-bit octets: The assumed practice of taking the 8-bit
> > octets as they appear in the HTML text and convert
> > them to %HH may work in some limited cases, but is
> > actually severely broken. If you store a Cyrillic
> > page in KOI-8 and later convert it to iso-8859-5,
> > the above assumed practice will produce two different
> > URLs for the same resource.
> Your argument is an argument for changing the encoding
> method specified in RFC 1738. It cannot be the task of mailers
> to correct deficiencies in the encoding method of RFC 1738,
> those should be corrected by modifying RFC 1738. Perhaps you
> can submit a proposal on this to IETF?
We are working on this. But there is no change to the encoding
method specified in RFC 1738. RFC 1738 only says "take the octets,
and escape them with %HH". This applies to the octets the server
is expecting. It does not apply to the octets that may be found
in a particular encoding of an HTML document [if it did, it would
be broken, as the KOI-8/iso-8859-5 example above shows].
HTML contains characters, not octets. This should be clear at
least since HTML 2.0. Of course these characters have to be
encoded as octets in some way to pass over the network. But taking
the octets of whatever particular encoding we happen to encounter,
constructing a syntactically correct URL from them, and assuming
that this URL will also be semantically correct (i.e. that it will
correspond to what the server is expecting) is just wrong and
dangerous.
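The KOI-8/iso-8859-5 problem is easy to demonstrate (a sketch in
Python; the particular character and charsets are just one example):
the same Cyrillic character has different octet values in the two
encodings, so naively escaping "the octets of the document" yields two
different URLs for what the author means to be one resource.

```python
import urllib.parse

# Cyrillic small "a" (U+0430) is octet 0xC1 in KOI8-R but 0xD0 in
# iso-8859-5. Escaping the document's raw octets therefore produces
# a different URL depending on how the page happens to be stored.
char = "\u0430"
print(urllib.parse.quote(char.encode("koi8_r")))     # %C1
print(urllib.parse.quote(char.encode("iso8859_5")))  # %D0
```

Escaping the UTF-8 octets of the *character* instead, as sketched
earlier, gives the same URL whichever charset the document is stored in.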