On Wed, 20 Aug 1997, Jacob Palme wrote:

> At 12.55 +0200 97-08-20, Martin J. Dürst wrote:
> > > First note that all this discussion is only on how to handle
> > > illegal URLs. URLs which have the permitted URL syntax according
> > > to RFC 1738 will never need any further encoding.
> >
> > Yes and no. URL syntax may change in the future.
>
> Perhaps. But it would surprise me if the URL syntax changes in ways
> to allow unusual encoded characters. The present syntax very
> carefully is designed to only allow a common subset of ASCII, so as
> to ensure that URLs can be rendered anywhere in the world and can
> easily be copied by hand.

I don't want to go into too many details here, but it can be much
easier for Japanese users to copy a Japanese URL for a Japanese
document (one that is already named in Japanese on a local system) by
hand than to have to use ASCII for this. You can substitute "Japanese"
with many other nationalities. The %HH encoding will continue to be
available where it is really needed.

> If the URL syntax is extended, for example to allow national
> characters, these will probably be encoded in the URL string, so
> that no further encoding is needed.

It may indeed be that the formal URL syntax standard will, for quite
some time, continue to view only %HH-escaped URLs as correct URLs. But
as I said, HTML 4.0, as well as XML, may define that if national
characters are used in places where an URL is expected, they are
shortcuts for the %HH-escaped UTF-8 encoding of the said characters.
This way, the URL syntax won't change; only HTML will have a special
and very convenient shortcut convention for certain URLs. As a
consequence, MHTML may have to deal with these things that are not
legal URLs, but perfectly legal shortcuts for URLs inside HTML.

> It is not common practice in IETF to make standards for what might
> be needed, perhaps, some time in the future. That is the OSI method
> of standards development, which has proven to be unsuccessful.
> The IETF view is to standardise only what we need and understand
> just now.

This is definitely an important point. But please note two things:

(1) ISO methods would not have the possibility to deal with illegal
URLs in the first place. They are illegal, so they are supposed not to
exist, so there is no need to deal with them. That we deal with them
here is a peculiarity and an advantage of the IETF process.

(2) The fact that these illegal URLs exist shows a current need for
them. They are not something that might be needed in the future; they
are used now. Because it works in local situations (intranets and
sometimes national/regional contexts), people just use it and are not
aware that it is illegal or that it does not work worldwide. But we
understand that (if you or somebody else doesn't, please tell me, and
I will expand). So I see it as our responsibility, when we discuss
various methods to solve a problem we have, to evaluate their
advantages and disadvantages against our current understanding.

> > > Method (c) means that you allow illegal URLs, like URLs
> > > containing the space character which is not allowed according to
> > > RFC 1738. My feeling is that we should not recommend as the only
> > > method to use, a method which means you send illegal URLs, when
> > > there are two methods, (a) and (b), which means you send
> > > correctly formatted URLs.
> >
> > They are correctly formatted. But maybe their semantics got
> > wrong. See below for why that could happen.
>
> No, RFC 1738 clearly specifies that for example SPACE characters
> are not allowed in URLs. An URL which contains a SPACE, which is
> not encoded as %20, is thus not in agreement with RFC 1738.

I didn't say anything else. I said that after using method (a) or (b),
the URLs are correctly formatted, but maybe their semantics got
corrupted or lost.
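To make this concrete: escaping an illegal SPACE produces a conforming
URL, but blindly re-escaping a URL that is already legal produces a
URL that is still syntactically correct while naming a different
resource, which is precisely the kind of semantic damage I mean. The
sketch below uses Python 3's urllib.parse purely as a present-day
illustration; the helper names belong to that library, not to RFC
1738:

```python
from urllib.parse import quote

# An illegal URL containing a raw SPACE: escaping it to %20 yields
# a URL that conforms to RFC 1738.
print(quote("my file.html"))    # prints my%20file.html

# But blindly escaping a URL that is *already* legal double-encodes
# the percent sign: the result is still syntactically correct, yet
# it now names a different resource.
print(quote("my%20file.html"))  # prints my%2520file.html
```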
The problem is that, because of the specifics of MHTML, namely
transporting one standard over another standard, we have to live both
with differences between those standards (as we have seen, e.g., in
the case of line endings) and with differences between the standards
and the actual implementations. What counts most for us is to preserve
the workings of HTML across mail. Trying to force actual HTML to
conform better to standards is futile. Trying to correct some things
that don't conform to some standard is worthwhile to attempt, but if
it breaks functionality, as proposal (a) currently risks doing, we are
better off with a working solution than with a conforming one.

> > Method (b) might be what the user perceives in a tightly
> > integrated package of an HTML editor and a mail UA, but we
> > shouldn't prescribe it because it only covers part of our usage
> > scenarios and for the protocol we are concerned with is an user
> > interface issue.
>
> It was not my intention to prescribe. My intention was to list
> three different methods, and allow implementors to choose any
> or all of them.

This is a worthwhile idea. But proposing more methods than necessary
may only increase implementation complexity, and proposing a method
that does not work is not something we should do. I admit that I
wasn't very well aware of the problems of method (a), and so I don't
blame anybody else for not having been aware of them; but now that I
am aware of the problems, I clearly think we should not propose it.

> > Now back to method (a). It produces syntactically legal URLs. But
> > syntactic legality is only half of the job. The URL can get
> > corrupted in that process, and that's why method (a) should go
> > away. There are two cases that can happen:
> >
> > 1) Reserved characters: The distinction between %2F and / is in
> >    many cases crucial.
>
> "/" is not a forbidden character in an URL.
> It is allowed, properly used, and the mailer thus need not encode it
> further. I cannot see any reason for a mailer which encounters a "/"
> in an URL to assume that the "/" is there otherwise than as defined
> in the URL syntax.

Okay. SPACE is not a legal URL character, so it cannot be used as a
reserved character, and we can therefore safely encode it. "/" is not
forbidden in URLs, and I assume it is not forbidden in mail headers
either, so we do not need to encode it and can therefore preserve the
distinction between "/" and %2F in the original URL.

The question then is: are there characters that are legal URL
characters, are reserved [or may in the future be used in some place
as reserved], and cannot appear in a mail header? The currently
reserved characters are:

   reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"

If there is no overlap, there is no problem here, and I stand
corrected.

> > 2) 8-bit octets: The assumed practice of taking the 8-bit octets
> >    as they appear in the HTML text and converting them to %HH may
> >    work in some limited cases, but is actually severely broken. If
> >    you store a Cyrillic page in KOI-8 and later convert it to
> >    iso-8859-5, the above assumed practice will produce two
> >    different URLs for the same resource.
>
> Your argument is an argument for changing the encoding method
> specified in RFC 1738. It cannot be the task of mailers to correct
> deficiencies in the encoding method of RFC 1738; those should be
> corrected by modifying RFC 1738. Perhaps you can submit a proposal
> on this to IETF?

We are working on this. But there is no change to the encoding method
specified in RFC 1738. RFC 1738 only says "take the octets, and
escape them with %HH". This applies to the octets the server is
expecting. It does not apply to the octets that may be found in a
particular encoding of an HTML document [if it did, it would be
broken, as can be seen with the KOI-8/iso-8859-5 example above]. HTML
contains characters, not octets.
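Both halves of the argument can be made concrete with a short sketch.
The code below uses Python 3's urllib.parse purely as a present-day
illustration; the library and its quote/unquote helpers are of course
not part of RFC 1738 or of this discussion:

```python
from urllib.parse import quote, unquote

# 1) Reserved characters: once %2F has been decoded, it can no longer
#    be told apart from a literal "/", so any decode/re-encode pass
#    through a mailer destroys a distinction the URL may rely on.
assert unquote("a%2Fb") == unquote("a/b") == "a/b"

# 2) 8-bit octets: the same Cyrillic character "б" has different
#    octets in KOI8-R (0xC2) and in iso-8859-5 (0xD1), so escaping
#    "the octets as they appear in the HTML text" produces two
#    different URLs for the same resource.
print(quote("б", encoding="koi8_r"))     # prints %C2
print(quote("б", encoding="iso8859_5"))  # prints %D1
```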
That HTML contains characters, not octets, should have been clear at
least since HTML 2.0. Of course these characters have to be encoded as
octets in some way to pass over the network. But taking the octets of
a particular encoding that we just happen to encounter, constructing a
syntactically correct URL from them, and assuming that this URL will
also be semantically correct (i.e. that it will correspond to what the
server is expecting) is just wrong and dangerous.

Regards,   Martin.
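P.S. To make the UTF-8 shortcut convention mentioned above concrete:
under that convention, the escaping of a character no longer depends
on the charset the surrounding HTML document happens to be stored in.
The sketch below uses Python 3's urllib.parse purely as a present-day
illustration; nothing in HTML 4.0, XML, or RFC 1738 prescribes this
library:

```python
from urllib.parse import quote

# Cyrillic "б" is U+0431; its UTF-8 encoding is the two octets
# D0 B1, so the character always escapes to the same %HH sequence,
# whatever charset the HTML document itself is stored in.
print(quote("б", encoding="utf-8"))  # prints %D0%B1
```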