I have been observing this thread with growing concern. IMHO we have one
and only one objective in the MHTML standard. That is to specify a
reversible way of encoding an HTML aggregate (i.e., a text/html root and
the objects to which it refers via URLs) in a MIME structure. This is only
problematic when text within an HTML aggregate is employed directly in a
MIME header and that header then violates RFC 822 header syntax. In this
case, we have a pre-existing means (defined in RFC 2047) of encoding that
content. I would, therefore, argue that RFC 2110 bis should contain the
following shorter and much more prescriptive text:-
A text/html root object may contain absolute or relative URLs that cannot
be employed directly in MIME Content-base or Content-location headers
because doing so would violate RFC 822 header syntax. When these URLs are
encountered, they may either be:-
a) replaced in the text/html root object by absolute or relative URLs that
   can be employed directly in MIME Content-base or Content-location
   headers without violating RFC 822 header syntax, or
b) retained in the text/html root object and encoded in MIME Content-base
   or Content-location headers so as not to violate RFC 822 header syntax.
The replacement of URLs that cannot be represented legally in MIME
Content-base or Content-location headers is beyond the scope of this
ASIDE - We may wish to discuss various methods for URL
replacement including replacing HTTP URLs with CID URLs
and RFC 1738 encoding of URLs in the Informational MHTML RFC.
The encoding of URLs that cannot be represented legally without encoding in
MIME Content-base or Content-location headers MUST employ the encoding
method described in RFC 2047. If the URL to be encoded contains only octets
in the ABNF range %d32-126 then an RFC 2047 charset parameter value of
"US-ASCII" or "UNKNOWN-8BIT" [RFC 1428] MUST be specified. If the URL to be
encoded contains octets in the ABNF ranges %d0-31 or %d127-255, then an RFC
2047 charset parameter value of "UNKNOWN-8BIT" [RFC 1428] MUST be specified.
ASIDE - We could require "UNKNOWN-8BIT" in all cases and be
done with it! Note that the use of "UTF-8" is problematic because,
in that case, RFC 2047 decoding may convert the URL to a local
character set. This is not possible with "UNKNOWN-8BIT".
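
To make the charset rule above concrete, here is a rough Python sketch of
how a sending agent might apply it. The function names are hypothetical,
and two choices are my own assumptions rather than part of the proposed
text: it picks "US-ASCII" where either charset would be allowed, and it
always uses the RFC 2047 "B" (base64) encoding, splitting the URL so that
no encoded-word exceeds the 75 character limit of RFC 2047.

```python
import base64


def pick_charset(url_octets: bytes) -> str:
    """Choose the RFC 2047 charset label according to the proposed rule."""
    # Only octets in %d32-126: "US-ASCII" or "UNKNOWN-8BIT" is allowed.
    # This sketch arbitrarily prefers "US-ASCII" (an assumption).
    if all(32 <= octet <= 126 for octet in url_octets):
        return "US-ASCII"
    # Any octet in %d0-31 or %d127-255: "UNKNOWN-8BIT" is required.
    return "UNKNOWN-8BIT"


def encode_url_rfc2047(url_octets: bytes) -> str:
    """Encode URL octets as one or more RFC 2047 "B" encoded-words."""
    prefix = "=?" + pick_charset(url_octets) + "?B?"
    suffix = "?="
    # An encoded-word may be at most 75 characters long, so work out how
    # many raw octets fit into the base64 text of a single encoded-word.
    room = 75 - len(prefix) - len(suffix)
    step = (room // 4) * 3
    words = []
    for start in range(0, len(url_octets), step):
        chunk = url_octets[start:start + step]
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    # Adjacent encoded-words are separated by folding white space, which
    # an RFC 2047 decoder ignores when it rejoins them.
    return "\r\n ".join(words)


if __name__ == "__main__":
    print("Content-location: " + encode_url_rfc2047(b"http://x/\xe9"))
```

Requiring "UNKNOWN-8BIT" in all cases, as the aside suggests, would reduce
pick_charset to a constant.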
Prior to comparing the content of Content-base and Content-location headers
against URLs in a text/html root object, any RFC 2047 encoding of these
headers MUST be reversed.
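
A receiving agent could reverse the encoding before comparison roughly as
follows. This is only a sketch: header_octets and header_matches_url are
hypothetical names, and it assumes that the URL taken from the text/html
root object is available as the raw octets that appear in that object. It
relies on decode_header from Python's standard email.header module, which
returns the decoded octets without converting them to a local character
set.

```python
from email.header import decode_header


def header_octets(header_value: str) -> bytes:
    """Reverse any RFC 2047 encoding and return the raw URL octets."""
    octets = bytearray()
    for part, _charset in decode_header(header_value):
        # A header with no encoded-words comes back as str; encoded-words
        # come back as bytes. The charset label is deliberately ignored,
        # because the comparison is made octet by octet.
        if isinstance(part, str):
            part = part.encode("latin-1")
        octets += part
    return bytes(octets)


def header_matches_url(header_value: str, url_octets: bytes) -> bool:
    """Compare a Content-base or Content-location value with an HTML URL."""
    return header_octets(header_value.strip()) == url_octets


if __name__ == "__main__":
    value = "=?UNKNOWN-8BIT?B?aHR0cDovL3gv6Q==?="
    print(header_matches_url(value, b"http://x/\xe9"))
```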