Unicode with Perl
These are some notes on my experiences with Perl and Unicode.
Strings in Perl 5.8 are tagged as either strings of bytes or strings of characters. Character strings are stored internally as UTF-8 but appear to the programmer as sequences of UCS characters.
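The difference between the two kinds of strings can be made visible with the built-in function utf8::is_utf8 and the module "Encode" (Perl 5.8). The following is only a minimal sketch; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
my $bytes = "\xC3\xA9";

# Decoding turns the octets into a character string
my $chars = decode( 'UTF-8', $bytes );

printf "bytes: length %d, character flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, character flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

The byte string has length 2, while the decoded character string has length 1 and holds the single character U+00E9.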
Script Encoding (Perl 5.6.1 and Perl 5.8)
The declaration "use utf8;" determines how the script source file containing it is interpreted. If this declaration is used, the source file (especially string literals within the source file) is interpreted as being encoded in UTF-8.
- The UTF-8-Declaration
use utf8;
If a file with this declaration includes another Perl file with "do", the "use utf8;" declaration needs to be repeated at the beginning of the included file, because the declaration only applies to the file that contains it.
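The effect of the declaration on string literals might be checked with a small sketch like the following. It assumes that the script file itself is saved as UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my ( $with, $without );
{
    use utf8;          # lexically scoped: applies only inside this block
    $with = "ä";       # interpreted as one character, U+00E4
}
{
    no utf8;
    $without = "ä";    # interpreted as two octets, 0xC3 0xA4
}
print length($with), " ", length($without), "\n";
```

Under that assumption, the literal counts as one character with the declaration and as two octets without it.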
Converting From ISO-8859-1 (Perl 5.6.1)
As long as the script contains no string literal whose interpretation depends on the source encoding (that is, no non-ASCII literal), it does not matter whether "use utf8;" is used.
To convert from ISO-8859-1 to UTF-8, the module "Unicode::String" can be used as shown in the following example.
- Converting From ISO-8859-1
use Unicode::String;
Unicode::String->stringify_as( 'utf8' ); # utf8 already is the default
my $string_iso_8859_1 = "This is latin text.";
my $string_utf8 = Unicode::String::latin1( $string_iso_8859_1 );
# $string_utf8 now stringifies to $string_iso_8859_1 encoded as UTF-8
print $string_utf8;
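With Perl 5.8, the same conversion might also be done with the core module "Encode" instead of "Unicode::String". This is a different technique from the one shown above, given here only as a sketch; the sample text is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as ISO-8859-1 octets (0xE9 is e with acute)
my $string_iso_8859_1 = "caf\xE9";

# Step 1: octets in ISO-8859-1 -> Perl character string
my $characters = decode( 'iso-8859-1', $string_iso_8859_1 );

# Step 2: Perl character string -> octets in UTF-8
my $string_utf8 = encode( 'UTF-8', $characters );

# Show the resulting octets in hexadecimal
print unpack( 'H*', $string_utf8 ), "\n";
```

The single octet 0xE9 becomes the two octets 0xC3 0xA9, so the hexadecimal dump is 636166c3a9.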
Reading or Writing UTF-8 Text Files (Perl 5.8)
When reading characters from a UTF-8 text file, the file has to be prepared with "binmode".
- Reading a Unicode Text File
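use IO::File;
my $source = IO::File->new( $filename, 'r' ) or die "Cannot open $filename: $!";
binmode( $source, ':utf8' );   # decode UTF-8 octets into characters while reading
my $ch = getc( $source );      # reads one character, not one octet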
The same approach applies to the writing of UTF-8 text files.
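A minimal sketch of the writing direction might look as follows. The file name is hypothetical, and the file is read back without a layer only to show which octets were written.

```perl
use strict;
use warnings;
use IO::File;

# Hypothetical file name, chosen for illustration only
my $filename = 'utf8-example.txt';

my $target = IO::File->new( $filename, 'w' ) or die "Cannot open $filename: $!";
binmode( $target, ':utf8' );     # encode characters as UTF-8 while writing
print $target "\x{263A}";        # U+263A WHITE SMILING FACE
$target->close;

# Read the file back as raw octets to inspect what was written
my $check = IO::File->new( $filename, 'r' ) or die "Cannot open $filename: $!";
binmode( $check );
my $octets = <$check>;
$check->close;
unlink $filename;

print length($octets), "\n";     # number of octets in the file
```

The single character U+263A is stored in the file as the three octets E2 98 BA.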
To get the UTF-8 representation of a Perl string as a string of octets, the following approach might be used:
use Encode;
$octets = Encode::encode( "utf8", $string );
A Perl string might already be represented internally using UTF-8, but this is hidden from the script, and a string appears to be a sequence of UCS characters. So, to get a visible octet representation of a string in UTF-8, the preceding code might be used.
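The difference between the character view and the octet view can be seen in a short sketch like this one. The sample character is an assumption made for illustration, and the encoding name "UTF-8" is used here instead of "utf8" (see the link on "UTF-8" vs. "utf8" under "See Also").

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\x{20AC}";                   # one character: U+20AC EURO SIGN
my $octets = encode( 'UTF-8', $string );   # its UTF-8 representation

print length($string), "\n";               # length in characters
print length($octets), "\n";               # length in octets

# List the octets in hexadecimal
print join( ' ', map { sprintf '%02X', ord } split //, $octets ), "\n";
```

The string has one character but three octets in UTF-8 (E2 82 AC), which matters wherever a size is measured in octets.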
Sometimes, it might be appropriate to adjust the standard streams to use UTF-8.
- Setting the standard streams
binmode( STDIN, ':utf8' );
binmode( STDOUT, ':utf8' );
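As a small sketch of the effect on output: once the layer is set, a character beyond U+00FF can be printed without the warning "Wide character in print". Setting the layer on STDERR as well is an assumption added here, not part of the lines above.

```perl
use strict;
use warnings;

binmode( STDOUT, ':utf8' );
binmode( STDERR, ':utf8' );   # assumption: diagnostics should be UTF-8 as well

# U+263A is written to STDOUT as the octets E2 98 BA
print "\x{263A}\n";

# The active layers can be listed with PerlIO::get_layers (Perl 5.8)
my @layers = PerlIO::get_layers(STDOUT);
print join( ' ', @layers ), "\n";
```

The list of layers depends on the platform, but it should contain "utf8" after the call to binmode.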
Reading a Unicode Jet 4.0 Database (Perl 5.8)
When reading characters from a Unicode Jet 4.0 database, it helps to use the "dbi:ADO" driver instead of the ODBC driver. (I saw one ODBC driver that issued the spurious message "out of memory" when it encountered Unicode data.)
Setting the property "LongReadLen" to a large value, such as 65535, consumes more memory but can prevent some problems when reading Unicode data from long text fields (so-called "memo fields"): with UTF-8, some characters require more than one octet, and the size set with the property "LongReadLen" is measured in octets.
Finally, as Jan Dubois explains (see the links under "See Also"), UTF-8 support for the module "Win32::OLE" needs to be enabled explicitly by putting the line "Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);" somewhere after the line "use DBD::ADO;". This is believed to work reliably only with Win32::OLE version 0.16 (as of 2002-10-18) and later, preferably with Perl 5.8.
- Reading a Unicode Database
use strict;
use warnings;
use DBI;
use DBD::ADO;
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);
my $dbh = DBI->connect( "dbi:ADO:Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\\tmp.mdb;" );
$dbh->{LongReadLen} = 65535;
my $sth = $dbh->prepare( "SELECT tmp FROM tmp" );
$sth->execute();
my $row = $sth->fetchrow_hashref;
print ord( substr( $row->{'tmp'}, 0, 1 ) ); # prints the code point of the first character, here "602"
$sth->finish();
$dbh->disconnect();
See Also
- Perl Documentation (5.6 and 5.8)
- perldoc perlunicode
- Perl Documentation (5.8)
- perldoc perluniintro
- perldoc encoding
- DBD::ADO, DBI and UTF-8
- Jan Dubois
- http://www.mail-archive.com/perl-unicode@perl.org/msg01954.html
- http://www.xray.mpe.mpg.de/mailing-lists/dbi/2003-12/msg00063.html
- adventures with mysql, perl and unicode
- Brigitte Jellinek
- http://perlwelt.horus.at/Beispiele/Magic/PerlUnicodeMysql/
- Unicode in Perl 5.6.1
- http://developers.sun.com/dev/gadc/unicode/perl/perl561.html
- Perl-XML Frequently Asked Questions (Section on encodings)
- http://perl-xml.sourceforge.net/faq/#encodings
- Perl, Unicode and i18N FAQ
- http://rf.net/~james/perli18n.html
- UTF-8 and Unicode FAQ (not directly perl-related)
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
- Perl 5.8 Documentation - Unicode Support
- http://perl.active-venture.com/pod/perlguts-unicode.html
- Unicode-processing issues in Perl and how to cope with it
- http://www.ahinea.com/en/tech/perl-unicode-struggle.html
- "UTF-8" vs. "utf8"
- http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8
- Perl Unicode Tutorial
- http://juerd.nl/files/slides/2007yapceu/unicodetutorial.html