    use MIME::Charset;

    $charset = MIME::Charset->new("euc-jp");
Getting charset information:
    $benc = $charset->body_encoding;   # e.g. "Q"
    $cset = $charset->as_string;       # e.g. "US-ASCII"
    $henc = $charset->header_encoding; # e.g. "S"
    $cset = $charset->output_charset;  # e.g. "ISO-2022-JP"
Translating text data:
    # The returned charset name goes into $cset so that the $charset
    # object is not overwritten between calls.
    ($text, $cset, $encoding) =
        $charset->header_encode(
            "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
            "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
            Charset => 'euc-jp');
    # ...returns e.g. (<converted>, "ISO-2022-JP", "B").

    ($text, $cset, $encoding) =
        $charset->body_encode(
            "Collectioneur path\xe9tiquement ".
            "\xe9clectique de d\xe9chets",
            Charset => 'latin1');
    # ...returns e.g. (<original>, "ISO-8859-1", "QUOTED-PRINTABLE").

    $len = $charset->encoded_header_len(
        "Perl\xe8\xa8\x80\xe8\xaa\x9e",
        Charset => 'utf-8',
        Encoding => "b");
    # ...returns e.g. 28.
Manipulating module defaults:
    MIME::Charset::alias("csEUCKR", "euc-kr");
    MIME::Charset::default("iso-8859-1");
    MIME::Charset::fallback("us-ascii");
Non-OO functions (may be deprecated in the near future):
    use MIME::Charset qw(:info);

    $benc = body_encoding("iso-8859-2");         # "Q"
    $cset = canonical_charset("ANSI X3.4-1968"); # "US-ASCII"
    $henc = header_encoding("utf-8");            # "S"
    $cset = output_charset("shift_jis");         # "ISO-2022-JP"

    use MIME::Charset qw(:trans);

    ($text, $charset, $encoding) =
        header_encode(
            "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
            "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
            "euc-jp");
    # ...returns (<converted>, "ISO-2022-JP", "B");

    ($text, $charset, $encoding) =
        body_encode(
            "Collectioneur path\xe9tiquement ".
            "\xe9clectique de d\xe9chets",
            "latin1");
    # ...returns (<original>, "ISO-8859-1", "QUOTED-PRINTABLE");

    $len = encoded_header_len(
        "Perl\xe8\xa8\x80\xe8\xaa\x9e", "b", "utf-8"); # 28
In MIME, the encoding is a method of representing a body part or a header body as one or more sequences of printable US-ASCII characters.
OPTS may contain the following key-value pair. NOTE: When Unicode/multibyte support is disabled (see "USE_ENCODE"), no conversion is performed, so this option has no effect.
The returned value is one of "B" (BASE64), "Q" (QUOTED-PRINTABLE), "S" (the shorter of the two) or undef (the data might not be transfer-encoded: either 7BIT or 8BIT). This may differ from the encoding recommended for the message header.
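As a rough sketch of how a caller might act on this return value, the helper below maps it to a Content-Transfer-Encoding name. The name cte_label and its second argument are hypothetical and not part of MIME::Charset; for "S" the sketch leaves the choice to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MIME::Charset): turn the value returned
# by body_encoding into a Content-Transfer-Encoding name.
sub cte_label {
    my ($benc, $has_8bit) = @_;
    # undef: the data might not be transfer-encoded at all.
    return $has_8bit ? '8BIT' : '7BIT' unless defined $benc;
    return 'BASE64'           if $benc eq 'B';
    return 'QUOTED-PRINTABLE' if $benc eq 'Q';
    return;    # "S": the caller has to compare both encodings itself
}

# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    my $benc  = MIME::Charset->new('iso-8859-2')->body_encoding;
    my $label = cte_label($benc, 1);
    print defined $label ? "$label\n" : "choose the shorter of B and Q\n";
}
```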
If the optional CHARSET is specified, the encoder (and output charset name) of the $charset object is replaced with those of CHARSET, so that the $charset object becomes a converter between the original charset and the new CHARSET.
The returned value is one of "B", "Q", "S" (the shorter of the two) or undef (the data might not be encoded). This may differ from the encoding recommended for the message body.
When Unicode/multibyte support is disabled (see "USE_ENCODE"), this function simply returns the result of "canonical_charset".
OPTS may contain the following key-value pairs. NOTE: When Unicode/multibyte support is disabled (see "USE_ENCODE"), no conversion is performed, so these options have no effect.
A 3-item list of (converted string, charset for output, transfer-encoding) is returned. The transfer-encoding is one of "BASE64", "QUOTED-PRINTABLE", "7BIT" or "8BIT". If the charset for output cannot be determined and the converted string contains non-ASCII bytes, the charset for output is undef and the transfer-encoding is "BASE64". The charset for output is "US-ASCII" if and only if the string contains no non-ASCII bytes.
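The returned list can feed directly into message headers. The following is a minimal sketch, assuming a text/plain body; the helper body_headers is hypothetical and not part of MIME::Charset. Note that the first item is the converted string only, so applying the transfer-encoding itself is still left to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MIME::Charset): build body header lines
# from the last two items of the returned list.
sub body_headers {
    my ($out_cset, $cte) = @_;
    my $type = 'text/plain';    # assumption made for this sketch
    # An undef output charset means it could not be determined.
    $type .= "; charset=$out_cset" if defined $out_cset;
    return ("Content-Type: $type", "Content-Transfer-Encoding: $cte");
}

# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    my ($text, $out_cset, $cte) =
        MIME::Charset->new('latin1')->body_encode("d\xe9chets");
    # $text is converted but not yet transfer-encoded.
    print "$_\n" for body_headers($out_cset, $cte);
}
```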
Note: When Unicode/multibyte support is disabled (see "USE_ENCODE"), this function will die.
Note: When Unicode/multibyte support is disabled (see "USE_ENCODE"), this function will die.
ENCODING may be one of "B", "Q" or "S" (the shorter of "B" and "Q").
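The value 28 given for the "b" case in the synopsis can be reproduced by hand. This sketch assumes the length is that of a single unfolded encoded-word of the form =?charset?B?...?=; the helper b_encoded_word_len is hypothetical and uses only the core MIME::Base64 module, not MIME::Charset itself.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: length of one unfolded "B" encoded-word.
sub b_encoded_word_len {
    my ($octets, $cset) = @_;
    # The empty second argument suppresses the trailing newline.
    my $word = "=?$cset?B?" . encode_base64($octets, '') . '?=';
    return length $word;
}

# 10 octets give 16 base64 characters; "=?UTF-8?B?" adds 10 and "?=" adds 2.
my $len = b_encoded_word_len("Perl\xe8\xa8\x80\xe8\xaa\x9e", 'UTF-8');
print "$len\n";    # prints 28
```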
OPTS may contain the following key-value pairs. NOTE: When Unicode/multibyte support is disabled (see "USE_ENCODE"), no conversion is performed, so these options have no effect.
A 3-item list of (converted string, charset for output, encoding scheme) is returned. The encoding scheme is one of "B", "Q" or undef (the string might not be encoded). If the charset for output cannot be determined and the converted string contains non-ASCII bytes, the charset for output is "8BIT" (not a charset name but a special value representing unencodable data) and the encoding scheme is undef (the string should not be encoded). The charset for output is "US-ASCII" if and only if the string contains no non-ASCII bytes.
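Because "8BIT" is a marker rather than a charset name, callers probably want to check for it before building a header. A minimal sketch, in which the helper header_action and its return labels are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with the last two items of the
# list returned for a header.
sub header_action {
    my ($out_cset, $enc) = @_;
    return 'encoded-word' if defined $enc;    # "B" or "Q"
    return 'unencodable'  if defined $out_cset && $out_cset eq '8BIT';
    return 'as-is';                           # no encoding needed
}

# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    my ($text, $out_cset, $enc) =
        MIME::Charset->new('euc-jp')->header_encode(
            "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa");
    print header_action($out_cset, $enc), "\n";
}
```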
Note: When Unicode/multibyte support is disabled (see "USE_ENCODE"), this function will die.
If CHARSET is given and is not false, ALIAS is assigned as an alias of CHARSET. Otherwise, the alias is left unchanged. In both cases, the charset name to which ALIAS is currently assigned is returned.
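The two calling forms can be sketched as follows, reusing the alias pair from the synopsis. The exact spelling of the returned name is not stated here, so the sketch only prints it.

```perl
use strict;
use warnings;

my ($set, $got);
# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    # Two arguments: assign the alias; the current assignment is returned.
    $set = MIME::Charset::alias('csEUCKR', 'euc-kr');
    # One argument: query only; nothing is changed.
    $got = MIME::Charset::alias('csEUCKR');
    print "csEUCKR is an alias of $got\n";
}
```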
The default charset is used by this module when the charset context is unknown. Modules using this module are encouraged to use this charset when the charset context is unknown or an implicit default is expected. By default, it is "US-ASCII".
If CHARSET is given and is not false, it is set as the default charset. Otherwise, the default charset is left unchanged. In both cases, the current default charset is returned.
NOTE: The default charset should not be changed.
The fallback charset is used by this module when conversion with the given charset fails and the "FALLBACK" error handling scheme is specified. Modules using this module may use this charset as a last resort for conversion. By default, it is "UTF-8".
If CHARSET is given and is not false, it is set as the fallback charset. If CHARSET is "NONE", the fallback charset becomes undefined. Otherwise, the fallback charset is left unchanged. In all cases, the current fallback charset is returned.
NOTE: Specifying "US-ASCII" as the fallback charset is useful, since the result of conversion will be readable without charset information.
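A short sketch of the three cases (query, set, undefine), written so that the previous setting is put back afterwards. The variable names are illustrative only.

```perl
use strict;
use warnings;

my ($previous, $current, $cleared);
# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    $previous = MIME::Charset::fallback();              # query only
    $current  = MIME::Charset::fallback('us-ascii');    # set
    $cleared  = MIME::Charset::fallback('NONE');        # undefine
    # Put back whatever was configured before.
    MIME::Charset::fallback(defined $previous ? $previous : 'NONE');
    print "fallback was ",
        (defined $previous ? $previous : 'undefined'), "\n";
}
```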
If optional arguments are given and any of them is not false, the profiles for CHARSET are set from those arguments. Otherwise, the profiles are left unchanged. In both cases, the current profiles for CHARSET are returned as a 3-item list of (HEADERENC, BODYENC, ENCCHARSET).
HEADERENC is the recommended encoding scheme for the message header. It may be one of "B", "Q", "S" (the shorter of the two) or undef (the header might not be encoded).
BODYENC is the recommended transfer-encoding for the message body. It may be one of "B", "Q", "S" (the shorter of the two) or undef (the body might not be transfer-encoded).
ENCCHARSET is a charset that is compatible with the given CHARSET and is recommended for MIME messages on the Internet. If conversion is not needed (or this module does not know an appropriate charset), ENCCHARSET is undef.
NOTE: Future releases of this function may accept more optional arguments (for example, properties to handle character widths, line folding behavior, ...), so the format of the returned value may change. Use "header_encoding", "body_encoding" or "output_charset" to get a particular profile.
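The following sketch queries a profile without changing it and then reads the same information through the individual accessors, which is the form the note above suggests. It assumes the function is called by its fully qualified name, as the synopsis does for the other module-level functions.

```perl
use strict;
use warnings;

my (@profile, $henc, $benc, $out);
# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    # Query only: no optional arguments, so no profile is changed.
    @profile = MIME::Charset::recommended('euc-jp');

    # Accessors that do not depend on the shape of the returned list.
    my $cs = MIME::Charset->new('euc-jp');
    ($henc, $benc, $out) =
        ($cs->header_encoding, $cs->body_encoding, $cs->output_charset);
    printf "header=%s body=%s output=%s\n",
        map { defined $_ ? $_ : 'undef' } $henc, $benc, $out;
}
```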
If no error handling scheme is specified, or an unknown scheme is specified, "DEFAULT" is assumed.
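As an illustration, a scheme can be passed explicitly instead of relying on "DEFAULT". This sketch assumes the scheme is given through the Replacement option of body_encode, which is not shown in the text above; the result depends on the configured fallback charset, so it is printed rather than predicted.

```perl
use strict;
use warnings;

my ($text, $out_cset, $cte);
# Guarded so that the sketch still runs where the module is not installed.
if (eval { require MIME::Charset; 1 }) {
    my $cs = MIME::Charset->new('us-ascii');
    # "\x{263A}" cannot be represented in US-ASCII, so the scheme matters.
    # Replacement is assumed to be the option key carrying the scheme.
    ($text, $out_cset, $cte) =
        $cs->body_encode("smile \x{263A}", Replacement => 'FALLBACK');
    printf "charset=%s encoding=%s\n",
        (defined $out_cset ? $out_cset : 'undef'), $cte;
}
```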
Development versions of this module may be found at <http://hatuka.nezumi.nu/repos/MIME-Charset/>.