Sunday, September 09, 2007

Email and MIME(3): message header extension

RFC 2047. In the earlier RFC documents, the values of field names and field bodies are limited to US-ASCII characters. The message header extension lets users write header values in their own language (e.g. Chinese). The full details are in RFC 2047.

An 'encoded-word' is defined by the following ABNF grammar:

    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
    charset = token    ; see section 3
    encoding = token   ; see section 4
    token = 1*<Any CHAR except SPACE, CTLs, and especials>
    especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
                <"> / "/" / "[" / "]" / "?" / "." / "="
    encoded-text = 1*<Any printable ASCII character other than "?" or SPACE>
                   ; (but see "Use of encoded-words in message
                   ; headers", section 5)

The "?" character is used within an 'encoded-word' to separate the various portions of the 'encoded-word' from one another, and thus cannot appear in the 'encoded-text' portion.

An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used. While there is no limit to the length of a multiple-line header field, each line of a header field that contains one or more 'encoded-word's is limited to 76 characters.

Unencoded white space characters (such as SPACE and HTAB) are FORBIDDEN within an 'encoded-word'. For example, the character sequence

    =?iso-8859-1?q?this is some text?=

would be parsed as four 'atom's, rather than as a single 'atom' (by an RFC 822 parser) or 'encoded-word' (by a parser which understands 'encoded-words'). The correct way to encode the string "this is some text" is to encode the SPACE characters as well, e.g.

    =?iso-8859-1?q?this=20is=20some=20text?=

Section 5 of RFC 2047 lists the places where an 'encoded-word' may appear: in place of a 'text' token in an unstructured field such as Subject or Comments, within a 'comment', and in place of a 'word' within a 'phrase'. These are the ONLY locations where an 'encoded-word' may appear. In particular:

+ An 'encoded-word' MUST NOT appear in any portion of an 'addr-spec'.
+ An 'encoded-word' MUST NOT appear within a 'quoted-string'.
+ An 'encoded-word' MUST NOT be used in a Received header field.
+ An 'encoded-word' MUST NOT be used in a parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.

Initially, the legal values for "encoding" are "Q" and "B". Both are described in section 4 of RFC 2047. The "Q" encoding is recommended for use when most of the characters to be encoded are in the ASCII character set; otherwise, the "B" encoding should be used.
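To make the 'encoded-word' rules above more concrete, here is a minimal Python sketch. It is not taken from the RFC. The helper names (is_valid_encoded_word, decode_words, choose_encoding) are made up for illustration, the regular expression is only an approximation of the grammar (it also rejects a backslash in a token, which is an assumption), and "most characters are ASCII" is interpreted here as "at least half". Only email.header from the standard library is used for the actual encoding and decoding.

```python
import re
from email.header import Header, decode_header

# token: printable ASCII except SPACE and the 'especials'.
# Assumption: backslash is rejected too, to stay on the conservative side.
_TOKEN = r"[!#$%&'*+\-0-9A-Z^_`a-z{|}~]+"
# encoded-text: printable ASCII except "?" and SPACE.
_TEXT = r"[!->@-~]+"
ENCODED_WORD = re.compile(
    r"=\?(" + _TOKEN + r")\?(" + _TOKEN + r")\?(" + _TEXT + r")\?="
)


def is_valid_encoded_word(word: str) -> bool:
    """Check the grammar and the 75-character limit (hypothetical helper)."""
    return len(word) <= 75 and ENCODED_WORD.fullmatch(word) is not None


def decode_words(value: str) -> str:
    """Decode a header value that may contain encoded-words."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Unencoded fragments come back with charset None.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


def choose_encoding(text: str) -> str:
    """Pick "Q" when at least half of the characters are ASCII, else "B".

    The one-half threshold is an assumption; the RFC only says "most".
    """
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return "Q" if ascii_count * 2 >= len(text) else "B"


if __name__ == "__main__":
    # The unencoded spaces make the first form invalid.
    print(is_valid_encoded_word("=?iso-8859-1?q?this is some text?="))
    print(is_valid_encoded_word("=?iso-8859-1?q?this=20is=20some=20text?="))
    print(decode_words("=?iso-8859-1?q?this=20is=20some=20text?="))

    # Let the standard library build an encoded-word for a Chinese subject.
    subject = Header("中文", charset="utf-8").encode()
    print(subject, "->", decode_words(subject))
    print(choose_encoding("中文"))
```

Header.encode() also takes care of splitting long text into several 'encoded-word's, so in practice the regular expression is mainly useful for checking header values produced by other software.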
