This is some detail left over from the MHTML meeting in Munich.
The problem of transporting URLs in mail headers was discussed.
For those cases where the URL syntax conflicted with the
syntax for mail headers, in particular when the URL
contained illegal characters such as space, the proposal by
Ed Levinson (excerpts below) was accepted (as far as I remember).
The advantage of this proposal is that it does not interfere
with URL encoding (%HH), which would in some cases be difficult
When Ed originally proposed his idea, I had some concerns
regarding 8-bit octets (there are no 8-bit characters in
URLs!). During the meeting, Harald Alvestrand helped to
sort some of this out by pointing me to an interesting RFC.
On Thu, 24 Jul 1997, Ed Levinson wrote:
> URLs that have characters inappropriate for an 822 header,
> SPACE, CTLs, double quotes, backslashes, and 8-bit characters
> (RFC 2017, URL access-type) be converted into RFC 2047 (Message
> Header Extensions) encoded words using the Q encoding.
> Using an encoding different from the URL %hh encoding eliminates
> the ambiguity that Larry Masinter pointed out.
> http://hosed.xison.com/abc%2fdef/this is it
> when converted to a Content-Location: becomes, under %hh encoding
> Content-Location: http://hosed.xison.com/abc%2fdef/this%20is%20it
> and then decoded back is
> http://hosed.xison.com/abc/def/this is it
> Remember? On hosed.xison.com "abc/def" is a file name, not two
> Using 2047 the Content-Location becomes
> which decodes to the exact string we started with.
> Doing this makes mhtml blind to miscoded and misguided URLs. As Stef
> delights in saying, "it becomes somebody else's problem" ;-).
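As an illustration only (not part of Ed's message), the difference
between the two approaches can be sketched in Python. The helper names
q_encode and q_decode are hypothetical, the set of characters left
unencoded is a deliberately conservative assumption, and the sketch
handles a single encoded word only, so it ignores the 75-character
limit that RFC 2047 places on encoded words.

```python
import re
from urllib.parse import unquote

# Characters passed through unchanged (assumption: a conservative subset;
# RFC 2047 allows more, depending on where the encoded word appears).
SAFE = set("abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789!*+-/")


def q_encode(octets: bytes, charset: str = "US-ASCII") -> str:
    """Wrap the octets in one RFC 2047 encoded word using the Q encoding."""
    out = []
    for b in octets:
        ch = chr(b)
        if ch == " ":
            out.append("_")            # Q encoding writes SPACE as "_"
        elif ch in SAFE:
            out.append(ch)
        else:
            out.append("=%02X" % b)    # everything else becomes =HH
    return "=?%s?Q?%s?=" % (charset, "".join(out))


def q_decode(word: str) -> bytes:
    """Recover the original octets from a single Q-encoded word."""
    m = re.fullmatch(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=", word)
    if m is None:
        raise ValueError("not a single Q-encoded word")
    text = m.group(2).replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda h: bytes([int(h.group(1), 16)]), text)


original = "http://hosed.xison.com/abc%2fdef/this is it"

# %hh route: escape the spaces, then decode every %hh on receipt.
# The %2f that was already in the URL is decoded too, so the URL changes.
percent_round_trip = unquote(original.replace(" ", "%20"))

# RFC 2047 route: the existing %2f is carried through untouched.
q_round_trip = q_decode(q_encode(original.encode("ascii"))).decode("ascii")

print(percent_round_trip == original)  # False
print(q_round_trip == original)        # True
```

The point of the sketch is only that the Q encoding uses "=" and "_"
as its escape characters, so it never has to interpret the "%" that
belongs to the URL itself.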
As long as the URL only contains 7-bit octets, the use of US-ASCII
as the charset in the RFC 2047 encoding is the best choice. The
question is: What to use in the case of 8-bit octets?
If the URL comes from an HTML file with well-established character
encoding (charset), that charset can be used. But what do we do
if the charset cannot be established? For this, Harald pointed me
to RFC 1428, "Transition to 8bit-SMTP/MIME", by G. Vaudreuil,
which defines "unknown-8bit". This is registered as MIBenum 2079
in the IANA registry (Harald: an explanatory comment such as:
"Not a real charset, use when charset is not known." would be
a very nice improvement to the charset registry).
I therefore propose the following text for the new version of
the MHTML spec (adapted from Ed above):
URLs that contain characters or octets inappropriate for an
822 header, such as SPACE, CTLs, double quotes, backslashes,
and so on, and 8-bit octets MUST be encoded using the method
for message headers described in RFC 2047. As long as there
are no 8-bit octets, the charset value "US-ASCII" MUST be used.
For URLs containing 8-bit octets, the original character encoding
(charset) SHOULD be used if it is known without doubt. Otherwise,
the charset value "UNKNOWN-8BIT" (RFC 1428, MIBenum 2079) MUST be used.
NOTE: For MHTML processing (URL matching), the charset value is
irrelevant, but it may be relevant for other operations
on the URL.
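The charset rule in the proposed text could be sketched as follows.
This is an illustration under assumptions, not spec text: the function
name choose_charset and its known_charset parameter are hypothetical,
and "known without doubt" is reduced here to the caller passing a
charset name or None.

```python
from typing import Optional


def choose_charset(url_octets: bytes,
                   known_charset: Optional[str] = None) -> str:
    """Pick the charset label for the RFC 2047 encoded word."""
    # No 8-bit octets: US-ASCII is required by the proposed text.
    if all(b < 0x80 for b in url_octets):
        return "US-ASCII"
    # 8-bit octets and a reliably known source charset: use it.
    if known_charset is not None:
        return known_charset
    # 8-bit octets and no reliable charset: fall back to RFC 1428.
    return "UNKNOWN-8BIT"


print(choose_charset(b"http://a.example/b c"))
print(choose_charset(b"http://a.example/\xe9", "ISO-8859-1"))
print(choose_charset(b"http://a.example/\xe9"))
```

Note that a known charset is ignored when the URL is pure 7-bit, since
the proposed text requires "US-ASCII" in that case.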
Hope this helps, Martin.