Sunday, September 09, 2007

Email and MIME(1)

Note: Most of the content in this article is quoted from RFC 2822 (http://tools.ietf.org/html/rfc2822).

The format of an Internet message is defined in RFC 2822 (its predecessors are RFC 822 and RFC 733).
MIME is defined in RFC 2045 through RFC 2049.
An Internet message consists of two parts: header and body.
The standard header fields of mail and MIME are registered in RFC 4021, which is a collection of the header fields defined in other RFC documents (e.g. RFC 2822). It is therefore a quick reference that programmers can look up.

Basics:
At the most basic level, a message is a series of characters. A message that conforms to RFC 2822 is composed of characters with values in the range 1 through 127, interpreted as US-ASCII characters. Messages are divided into lines of characters. A line is a series of characters that is delimited with the two characters carriage-return and line-feed; that is, the carriage return (CR) character (ASCII value 13) followed immediately by the line feed (LF) character (ASCII value 10). (The carriage-return/line-feed pair is usually written in this document as "CRLF".)
A message consists of header fields (collectively called "the header of the message") followed, optionally, by a body. The header is a sequence of lines of characters with special syntax as defined in RFC 2822. The body is simply a sequence of characters that follows the header and is separated from the header by an empty line (i.e., a line with nothing preceding the CRLF).
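The header/body split described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name split_message is made up for this article, the message is already in memory as bytes, and line endings are strictly CRLF.

```python
def split_message(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw message into (header, body) at the first empty line."""
    header, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        # No empty line found: everything is header and there is no body.
        return raw, b""
    # Put back the CRLF that terminates the last header field.
    return header + b"\r\n", body


raw = b"From: a@example.com\r\nSubject: Hi\r\n\r\nHello.\r\n"
header, body = split_message(raw)
print(header)  # the two header fields, each ending in CRLF
print(body)    # the body, without the separating empty line
```

The empty line itself belongs to neither part, so it is dropped.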

Header Fields
Format: field-name:field-body

Header fields are lines composed of a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon. A field body may be composed of any US-ASCII characters, except for CR and LF. However, a field body may contain CRLF when used in header "folding" and "unfolding".
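As a rough sketch of these rules, the following Python function splits one header line into its name and body and checks the character ranges mentioned above. The name parse_field is hypothetical, and the function assumes the line has already been unfolded and stripped of its terminating CRLF.

```python
def parse_field(line: str) -> tuple[str, str]:
    """Split one unfolded header line (without CRLF) into (name, body)."""
    name, colon, body = line.partition(":")
    if not colon or not name:
        raise ValueError("missing field name or colon")
    # Field names may only use printable US-ASCII (33..126); the colon is
    # excluded automatically because we split at the first one.
    if any(not 33 <= ord(ch) <= 126 for ch in name):
        raise ValueError(f"illegal character in field name: {name!r}")
    if "\r" in body or "\n" in body:
        raise ValueError("an unfolded field body must not contain CR or LF")
    return name, body.strip()


print(parse_field("Subject: This is a test"))
```

Stripping the surrounding white space from the body is a convenience of this sketch, not something the syntax requires.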

For convenience, however, and to deal with the 998/78 character limitations per line, the field body portion of a header field can be split into a multiple-line representation; this is called "folding". The general rule is that wherever this standard allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.

For example, the header field:
Subject: This is a test
can be represented as (note the space that begins the continuation line):
Subject: This
 is a test

Note: Though structured field bodies are defined in such a way that folding can take place between many of the lexical tokens (and even within some of the lexical tokens), folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items in preference to other places where the field could be folded, even if it is allowed elsewhere.
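To illustrate folding at a higher-level syntactic break, here is a minimal Python sketch that builds a comma-separated field and inserts the CRLF only after a comma. The function name fold_after_commas and the small limit used in the example are assumptions chosen for illustration; real software has to handle many more cases (quoted strings, comments, very long single items).

```python
def fold_after_commas(name: str, items: list[str], limit: int = 78) -> str:
    """Build 'name: a, b, c', folding after a comma when a line would exceed limit."""
    lines = []
    current = f"{name}:"
    for index, item in enumerate(items):
        last = index == len(items) - 1
        piece = f" {item}" + ("" if last else ",")
        if len(current) + len(piece) > limit and current != f"{name}:":
            lines.append(current)
            current = piece  # the continuation line starts with the WSP
        else:
            current += piece
    lines.append(current)
    return "\r\n".join(lines) + "\r\n"


folded = fold_after_commas(
    "To", ["a@example.com", "b@example.com", "c@example.com"], limit=34
)
print(folded)
```

With the limit of 34 used here, the third address no longer fits on the first line, so the CRLF is placed after the second comma and the continuation line begins with a space.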

The process of moving from this folded multiple-line representation of a header field to its single-line representation is called "unfolding". Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP. Each header field should be treated in its unfolded form for further syntactic and semantic evaluation.
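Unfolding as described here fits in a single regular expression. The sketch below is one possible way to write it in Python; the function name unfold is made up, and WSP is taken to mean space or horizontal tab.

```python
import re


def unfold(header: str) -> str:
    """Remove every CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", header)


print(repr(unfold("Subject: This\r\n is a test\r\n")))
# 'Subject: This is a test\r\n'
```

The lookahead keeps the white space character itself, so only the CRLF disappears and the CRLF that terminates the field is left alone.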
Body
The body of a message is simply lines of US-ASCII characters. The only two limitations on the body are as follows:
- CR and LF MUST only occur together as CRLF; they MUST NOT appear independently in the body.
- Lines of characters in the body MUST be limited to 998 characters, and SHOULD be limited to 78 characters, excluding the CRLF.
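The first limitation can be checked mechanically. The following Python sketch looks for a CR or LF that does not appear as part of a CRLF pair. The names are hypothetical, and the normalize_newlines helper is an assumption of this article (a common repair step), not something RFC 2822 prescribes.

```python
import re

# A CR that is not followed by LF, or an LF that is not preceded by CR.
BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def has_bare_cr_or_lf(body: bytes) -> bool:
    """Return True if the body contains a CR or LF outside a CRLF pair."""
    return BARE_CR_OR_LF.search(body) is not None


def normalize_newlines(body: bytes) -> bytes:
    """Rewrite lone CR, lone LF, and CRLF uniformly as CRLF."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", body)


print(has_bare_cr_or_lf(b"line one\r\nline two\r\n"))  # False
print(has_bare_cr_or_lf(b"line one\nline two\n"))      # True
```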
Token definition:
NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127

text            =       %d1-9 /         ; Characters excluding CR and LF
                        %d11 /
                        %d12 /
                        %d14-127 /
                        obs-text

specials        =       "(" / ")" /     ; Special characters used in
                        "<" / ">" /     ;  other parts of the syntax
                        "[" / "]" /
                        ":" / ";" /
                        "@" / "\" /
                        "," / "." /
                        DQUOTE
How to escape special characters
Some characters are reserved for special interpretation, such as delimiting lexical tokens. To permit use of these characters as uninterpreted data, a quoting mechanism is provided.

quoted-pair = ("\" text) / obs-qp

Where any quoted-pair appears, it is to be interpreted as the text character alone. That is to say, the "\" character that appears as part of a quoted-pair is semantically "invisible".
Note: The "\" character may appear in a message where it is not part of a quoted-pair. A "\" character that does not appear in a quoted-pair is not semantically invisible. The only places in this standard where quoted-pair currently appears are ccontent, qcontent, dcontent, no-fold-quote, and no-fold-literal.
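A small Python sketch of the quoting mechanism follows. Both function names are made up for this article. As the note above says, the backslash is only "invisible" inside the constructs that allow quoted-pair, so unquote_pairs is meant to be applied to the content of a quoted-string, comment, or domain literal, never to a whole field body.

```python
import re


def unquote_pairs(text: str) -> str:
    """Replace each quoted-pair (backslash + character) with the character alone."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


def quote_string(text: str) -> str:
    """Wrap text in DQUOTEs, writing backslash and DQUOTE as quoted-pairs."""
    return '"' + re.sub(r'([\\"])', r"\\\1", text) + '"'


quoted = quote_string('say "hi"')
print(quoted)                       # "say \"hi\""
print(unquote_pairs(quoted[1:-1]))  # say "hi"
```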

Comments
Strings of characters enclosed in parentheses are considered comments so long as they do not appear within a "quoted-string". There are several places in this standard where comments and FWS may be freely inserted. To accommodate that syntax, an additional token for "CFWS" is defined for places where comments and/or FWS can occur.
FWS             =       ([*WSP CRLF] 1*WSP) /   ; Folding white space
                        obs-FWS

ctext           =       NO-WS-CTL /     ; Non white space controls
                        %d33-39 /       ; The rest of the US-ASCII
                        %d42-91 /       ;  characters not including "(",
                        %d93-126        ;  ")", or "\"

ccontent        =       ctext / quoted-pair / comment

comment         =       "(" *([FWS] ccontent) [FWS] ")"

CFWS            =       *([FWS] comment) (([FWS] comment) / FWS)
A comment is normally used in a structured field body to provide some human readable informational text. Since a comment is allowed to contain FWS, folding is permitted within the comment. Also note that since quoted-pair is allowed in a comment, the parentheses and backslash characters may appear in a comment so long as they appear as a quoted-pair.
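Because comments can nest and can contain quoted-pairs, removing them needs a small state machine rather than a simple search for parentheses. The Python sketch below is one way to do it; the function name strip_comments is hypothetical, and the sketch simply deletes each comment without trying to tidy the white space around it.

```python
def strip_comments(field_body: str) -> str:
    """Remove (possibly nested) comments, leaving quoted-strings untouched."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # True while inside a quoted-string
    i = 0
    while i < len(field_body):
        ch = field_body[i]
        if ch == "\\" and i + 1 < len(field_body) and (depth or in_quotes):
            # quoted-pair: skip it inside a comment, keep it inside a quoted-string
            if depth == 0:
                out.append(field_body[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif depth > 0 and ch == ")":
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments("Pete(A wonderful \\) chap) <pete(his account)@silly.test(his host)>"))
# Pete <pete@silly.test>
```

The escaped ")" inside the first comment does not close it, which is exactly the case the quoted-pair rule exists for.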

Email Address
address         =       mailbox / group

mailbox         =       name-addr / addr-spec

name-addr       =       [display-name] angle-addr

angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr

group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]

display-name    =       phrase

mailbox-list    =       (mailbox *("," mailbox)) / obs-mbox-list

address-list    =       (address *("," address)) / obs-addr-list
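In practice there is rarely a need to implement this grammar by hand. As an illustration, Python's standard library offers email.utils.parseaddr for a single mailbox and email.utils.getaddresses for a list. The two wrapper functions below are hypothetical names used only to show the calls; the addresses are examples.

```python
from email.utils import getaddresses, parseaddr


def split_mailbox(mailbox: str) -> tuple[str, str]:
    """Return (display-name, addr-spec) for a single mailbox."""
    return parseaddr(mailbox)


def split_address_list(field_body: str) -> list[tuple[str, str]]:
    """Return one (display-name, addr-spec) pair per mailbox in the list."""
    return getaddresses([field_body])


print(split_mailbox("John Doe <jdoe@machine.example>"))
# ('John Doe', 'jdoe@machine.example')
print(split_address_list("Mary Smith <mary@x.test>, jdoe@example.org"))
# [('Mary Smith', 'mary@x.test'), ('', 'jdoe@example.org')]
```

A mailbox written as a bare addr-spec has no display name, so the first element of its pair is an empty string.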
Field Definition
fields          =       *(trace
                          *(resent-date /
                           resent-from /
                           resent-sender /
                           resent-to /
                           resent-cc /
                           resent-bcc /
                           resent-msg-id))
                        *(orig-date /
                        from /
                        sender /
                        reply-to /
                        to /
                        cc /
                        bcc /
                        message-id /
                        in-reply-to /
                        references /
                        subject /
                        comments /
                        keywords /
                        optional-field)
Date: The origination date specifies the date and time at which the creator of the message indicated that the message was complete and ready to enter the mail delivery system.
From: The author(s) of the message. This field contains more than one mailbox when the message has more than one author.
Sender: The mailbox of the agent responsible for the actual transmission of the message. This is different from "From:". For example, if a secretary sends a message on behalf of another person, the "From:" field should hold the mailbox of that person (the author) and the "Sender:" field should hold the mailbox of the secretary. If the author and the transmitter are identical, the "Sender:" field should not be used.
Reply-To: When this field is present, it indicates the mailbox(es) to which the author of the message suggests that replies be sent.
To: Contains the address(es) of the primary recipient(s) of the message.
Cc: Contains the addresses of others who are to receive the message, though the content of the message may not be directed at them.
Bcc: Contains addresses of recipients of the message whose addresses are not to be revealed to other recipients of the message. There are three ways in which the "Bcc:" field is used: the "Bcc:" line is removed before the message is sent to any recipient; the "To:" and "Cc:" recipients get a copy without the "Bcc:" line while the "Bcc:" recipients get a separate copy that keeps it; or the "Bcc:" field is sent with no addresses, which only tells the recipients that blind copies were sent to someone.
Message-ID: Provides a unique message identifier that refers to a particular version of a particular message.
In-Reply-To and References: These two fields are used when creating a reply to a message. They hold the message identifier of the original message and the message identifiers of other messages (for example, in the case of a reply to a message which was itself a reply). The "In-Reply-To:" field may be used to identify the message (or messages) to which the new message is a reply, while the "References:" field may be used to identify a "thread" of conversation.
Keywords: Contains a comma-separated list of one or more words or quoted-strings.
Subject: Contains a short string identifying the topic of the message.
Comments: Contains any additional comments on the text of the body of the message.
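To show how several of these fields work together, here is a hedged Python sketch that builds a reply with the standard email package. The function name make_reply, the "Re: " subject prefix, and the domain passed to make_msgid are assumptions made for this example. The way "References:" is built (the parent's "References:", or its "In-Reply-To:" if there is none, followed by the parent's "Message-ID:") follows the convention given in RFC 2822, section 3.6.4.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def make_reply(original: EmailMessage, author: str, text: str) -> EmailMessage:
    """Build a reply whose header fields link it to the original message."""
    reply = EmailMessage()
    reply["Date"] = formatdate(localtime=True)
    reply["From"] = author
    # Prefer the mailbox the original author suggested for replies.
    reply["To"] = str(original.get("Reply-To", original["From"]))
    reply["Subject"] = "Re: " + str(original.get("Subject", ""))
    reply["Message-ID"] = make_msgid(domain="example.org")  # hypothetical domain
    parent_id = original["Message-ID"]
    if parent_id:
        reply["In-Reply-To"] = str(parent_id)
        parent_refs = original.get("References") or original.get("In-Reply-To") or ""
        reply["References"] = (str(parent_refs) + " " + str(parent_id)).strip()
    reply.set_content(text)
    return reply


original = EmailMessage()
original["From"] = "John Doe <jdoe@machine.example>"
original["Subject"] = "Saying Hello"
original["Message-ID"] = "<1234@local.machine.example>"

reply = make_reply(original, "Mary Smith <mary@example.net>", "Hi John.")
print(reply["In-Reply-To"])
print(reply["References"])
```

The sketch assumes the original message carries at least a "From:" field; a robust version would handle its absence.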
Resent Field
Resent fields SHOULD be added to any message that is reintroduced by a user into the transport system. A separate set of resent fields SHOULD be added each time this is done. All of the resent fields corresponding to a particular resending of the message SHOULD be together. Each new set of resent fields is prepended to the message; that is, the most recent set of resent fields appears earlier in the message. No other fields in the message are changed when resent fields are added.

Each of the resent fields corresponds to a particular field elsewhere in the syntax. For instance, the "Resent-Date:" field corresponds to the "Date:" field and the "Resent-To:" field corresponds to the "To:" field. In each case, the syntax for the field body is identical to the syntax given previously for the corresponding field.

When resent fields are used, the "Resent-From:" and "Resent-Date:" fields MUST be sent. The "Resent-Message-ID:" field SHOULD be sent. "Resent-Sender:" SHOULD NOT be used if "Resent-Sender:" would be identical to "Resent-From:".

The purpose of using resent fields is to have the message appear to the final recipient as if it were sent directly by the original sender, with all of the original fields remaining the same. Each set of resent fields corresponds to a particular resending event. That is, if a message is resent multiple times, each set of resent fields gives identifying information for each individual time. Resent fields are strictly informational.
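The prepending rule can be sketched as follows in Python. The function name add_resent_fields and the representation of the header as a list of already formatted lines are assumptions of this example; the point is only that each new block goes on top and nothing else is touched.

```python
def add_resent_fields(header_lines: list[str], resent_from: str, resent_to: str,
                      resent_date: str, resent_msg_id: str) -> list[str]:
    """Return a new header with one block of resent fields prepended."""
    block = [
        f"Resent-From: {resent_from}",
        f"Resent-To: {resent_to}",
        f"Resent-Date: {resent_date}",
        f"Resent-Message-ID: {resent_msg_id}",
    ]
    # The existing fields are kept unchanged and in their original order.
    return block + header_lines


header = ["From: John Doe <jdoe@machine.example>", "Subject: Saying Hello"]
once = add_resent_fields(header, "mary@example.net", "jane@x.test",
                         "Mon, 24 Nov 1997 14:22:01 -0800", "<78910@example.net>")
twice = add_resent_fields(once, "jane@x.test", "bob@y.test",
                          "Tue, 25 Nov 1997 09:00:00 -0800", "<111213@x.test>")
print("\n".join(twice))
```

After the second call, the block added by jane@x.test comes first, followed by the earlier block and then the original fields.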
Trace fields
The trace fields are a group of header fields consisting of an optional "Return-Path:" field and one or more "Received:" fields. A full discussion is given in RFC 2821.
Return-Path: Contains a pair of angle brackets that enclose an optional addr-spec.
Received: Contains a (possibly empty) list of name/value pairs followed by a semicolon and a date-time specification.
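The structure of a "Received:" field body can be illustrated with a naive Python sketch. It assumes an already unfolded field body, splits at the last semicolon, and pairs up white-space-separated tokens. Real "Received:" fields often contain comments and other irregularities that this does not handle. The function name split_received is made up; parsedate_to_datetime is part of Python's email.utils.

```python
from email.utils import parsedate_to_datetime


def split_received(field_body: str):
    """Return (name/value pairs, datetime) from an unfolded Received: body."""
    pairs_text, _, date_text = field_body.rpartition(";")
    tokens = pairs_text.split()
    # Tokens alternate between names and values: from X by Y via Z ...
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    return pairs, parsedate_to_datetime(date_text.strip())


body = ("from x.y.test by example.net via TCP with ESMTP id ABC12345 "
        "for <mary@example.net>; 21 Nov 1997 10:05:43 -0600")
pairs, when = split_received(body)
print(pairs)
print(when.isoformat())
```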


Limits on the number of characters in a single line
There are two limits that this standard places on the number of characters in a line. Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF.

The 998 character limit is due to limitations in many implementations which send, receive, or store Internet Message Format messages that simply cannot handle more than 998 characters on a line. Receiving implementations would do well to handle an arbitrarily large number of characters in a line for robustness' sake. However, there are so many implementations which (in compliance with the transport requirements of [RFC2821]) do not accept messages containing more than 1000 characters per line, including the CR and LF, that it is important for implementations not to create such messages.

The more conservative 78 character recommendation is to accommodate the many implementations of user interfaces that display these messages which may truncate, or disastrously wrap, the display of more than 78 characters per line, in spite of the fact that such implementations are non-conformant to the intent of this specification (and that of [RFC2821] if they actually cause information to be lost). Again, even though this limitation is put on messages, it is incumbent upon implementations which display messages to handle an arbitrarily large number of characters in a line (certainly at least up to the 998 character limit) for the sake of robustness.
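A sending program can check both limits before it hands a message to the transport system. The Python sketch below reports which lines break the MUST limit (998) and which only exceed the SHOULD limit (78), counting characters without the CRLF. The function name check_line_lengths is an assumption of this article.

```python
def check_line_lengths(message: bytes) -> tuple[list[int], list[int]]:
    """Return (over_998, over_78) as lists of 1-based line numbers."""
    over_998, over_78 = [], []
    for number, line in enumerate(message.split(b"\r\n"), start=1):
        if len(line) > 998:
            over_998.append(number)   # violates the MUST limit
        elif len(line) > 78:
            over_78.append(number)    # only exceeds the SHOULD limit
    return over_998, over_78


message = b"a" * 78 + b"\r\n" + b"b" * 79 + b"\r\n" + b"c" * 999 + b"\r\n"
print(check_line_lengths(message))  # ([3], [2])
```

Header lines that are too long can usually be repaired by folding; body lines that are too long need a content transfer encoding, which is a MIME topic.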
