Notes about using Unicode in Perl 5.6.1 and Perl 5.8.
https://www.purl.org/stefan_ram/pub/perl_unicode_en (permalink) is the canonical URI of this page.
Stefan Ram

Unicode with Perl

These are some notes on my experiences with Perl and Unicode.

Strings in Perl 5.8 are tagged as either strings of bytes or strings of characters. Character strings are stored internally as UTF-8 but appear to the programmer as sequences of UCS characters.

Script Encoding (Perl 5.6.1 and Perl 5.8)

The declaration "use utf8;" determines how the script source file containing it is interpreted. If this declaration is used, the source file (especially string literals within the source file) will be interpreted as being encoded in UTF-8.

The UTF-8-Declaration
use utf8;

If a file with this declaration includes another Perl file with "do", the declaration needs to be repeated at the beginning of the included file, because "use utf8;" only affects the file (more precisely, the lexical scope) in which it appears.
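The effect of the declaration on string literals, and its limited scope, might be sketched as follows. This is only an illustration and assumes that the script file itself is saved as UTF-8; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8, so the literal "ä"
# occupies two octets (0xC3 0xA4) in the file.

# Inside the scope of "use utf8;" the literal is one character.
my $with_utf8 = do { use utf8; length( "ä" ) };

# Outside of that scope the same literal is seen as two octets.
my $without_utf8 = do { no utf8; length( "ä" ) };

print "with \"use utf8;\":    $with_utf8\n";
print "without \"use utf8;\": $without_utf8\n";
```

With a UTF-8 source file, the first length should be 1 and the second 2, which is why an included file without its own declaration treats its literals as octets.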

Converting From ISO-8859-1 (Perl 5.6.1)

As long as the script contains no string literals with non-ASCII characters, it does not matter whether "use utf8;" is used.

To convert from ISO-8859-1 to UTF-8, the module "Unicode::String" can be used as shown in the following example.

Converting From ISO-8859-1

use Unicode::String;

Unicode::String->stringify_as( 'utf8' ); # utf8 already is the default

my $string_iso_8859_1 = "This is latin text.";

my $string_utf8 = Unicode::String::latin1( $string_iso_8859_1 );

# $string_utf8 now stringifies as $string_iso_8859_1 encoded in UTF-8

print $string_utf8;
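With Perl 5.8, the same conversion might also be done with the core module "Encode" instead of "Unicode::String". The following is a minimal sketch of that alternative, not the approach used above; the sample text "caf\xE9" is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as four ISO-8859-1 octets (0xE9 is e with acute accent).
my $string_iso_8859_1 = "caf\xE9";

# Step 1: interpret the octets as ISO-8859-1, giving a character string.
my $characters = decode( 'ISO-8859-1', $string_iso_8859_1 );

# Step 2: serialize the characters as UTF-8 octets.
my $string_utf8 = encode( 'UTF-8', $characters );

printf "%d octets in, %d octets out\n",
    length( $string_iso_8859_1 ), length( $string_utf8 );
```

The output string is one octet longer than the input, because the single ISO-8859-1 octet 0xE9 becomes the two UTF-8 octets 0xC3 0xA9.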

Reading or Writing UTF-8 Text Files (Perl 5.8)

When reading characters from a UTF-8 text file, the file has to be prepared with "binmode".

Reading a Unicode Text File

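use IO::File;

my $source = IO::File->new( $filename, 'r' ) or die "Cannot open $filename: $!";

binmode( $source, ':utf8' ); # decode the octets read from the file as UTF-8

my $ch = getc( $source ); # one character, not one octet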

The same approach applies to the writing of UTF-8 text files.

To get the UTF-8 representation of a Perl string into a string, the following approach might be used:
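Writing and reading might be combined into a small round trip like the following. This is a sketch under some assumptions: the temporary file created with "File::Temp" and the sample text are only there to make the example self-contained.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Assumption: a temporary file stands in for a real text file.
my ( $tmp, $filename ) = tempfile();
close $tmp;

# Write one non-ASCII character (U+263A) followed by two ASCII characters.
my $target = IO::File->new( $filename, 'w' ) or die "Cannot write $filename: $!";
binmode( $target, ':utf8' );
print {$target} "\x{263A}bc";
$target->close;

# U+263A takes three octets in UTF-8, so the file holds five octets.
my $size = -s $filename;

# Read the first character back; getc returns a character, not an octet.
my $source = IO::File->new( $filename, 'r' ) or die "Cannot read $filename: $!";
binmode( $source, ':utf8' );
my $ch = getc( $source );
$source->close;

unlink $filename;

printf "first character: U+%04X, file size: %d octets\n", ord( $ch ), $size;
```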

use Encode;

$octets = Encode::encode( "utf8", $string );

A Perl string might already be represented internally using UTF-8, but this is hidden from the script, and the string appears to be a sequence of UCS characters. So, to get a visible octet representation of a string in UTF-8, the preceding code might be used.
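The difference between the character view and the octet view might be made visible as follows. This is a small sketch; the euro sign U+20AC is just a sample character chosen for this example.

```perl
use strict;
use warnings;
use Encode;

# One character: the euro sign, U+20AC.
my $string = "\x{20AC}";

# The same character as UTF-8 octets: 0xE2 0x82 0xAC.
my $octets = Encode::encode( "utf8", $string );

# Decoding the octets gives the character string back.
my $round_trip = Encode::decode( "utf8", $octets );

print length( $string ), " character, ", length( $octets ), " octets\n";
print join( ' ', map { sprintf '%02X', ord } split //, $octets ), "\n";
```

Here "length" counts characters for $string and octets for $octets, so the two values differ although both variables hold the same text.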

Sometimes, it might be appropriate to adjust the standard streams to use UTF-8.

Setting the standard streams
binmode( STDIN,  ':utf8' );
binmode( STDOUT, ':utf8' );
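Whether the layer is actually in place might be checked with "PerlIO::get_layers", as in the following sketch. The check itself is an addition for illustration and is not needed for normal use.

```perl
use strict;
use warnings;

binmode( STDOUT, ':utf8' );

# List the layers on the output side of STDOUT; after the call to
# binmode above, the list should contain "utf8".
my @layers = PerlIO::get_layers( \*STDOUT, output => 1 );

print "STDOUT layers: @layers\n";

# With the layer set, a character above U+00FF can be printed
# without the warning "Wide character in print".
print "\x{263A}\n";
```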

Reading a Unicode Jet 4.0 Database (Perl 5.8)

When reading characters from a Unicode Jet 4.0 database, it helps to use the "dbi:ADO" driver instead of the ODBC driver. (I saw one ODBC driver that issued the spurious message "out of memory" when it encountered Unicode data.)

Setting the property "LongReadLen" to a large value, such as 65535, consumes more memory but can prevent some problems when reading Unicode data from some text fields (so-called "memo fields"): with UTF-8, some characters require more than one octet, and the size set with the property "LongReadLen" is measured in octets.

Finally, as Jan Dubois explains, UTF-8 support for the module "Win32::OLE" explicitly needs to be permitted by putting the line "Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);" somewhere after the line "use DBD::ADO". This is believed to work reliably only with Win32::OLE version 0.16 (as of 2002-10-18) and later, preferably with Perl 5.8.
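How the number of octets can exceed the number of characters might be estimated as in the following sketch. The helper "utf8_octet_length" and the sample memo text are hypothetical and only serve this example; how a particular driver behaves when the limit is exceeded is not shown here.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: number of octets a string needs in UTF-8.
sub utf8_octet_length {
    my ( $string ) = @_;
    return length( Encode::encode( "utf8", $string ) );
}

# Assumed sample: a memo field of 40000 characters, each U+025A,
# which takes two octets in UTF-8.
my $memo   = "\x{025A}" x 40_000;
my $octets = utf8_octet_length( $memo );

my $long_read_len = 65535;
my $verdict = $octets <= $long_read_len ? "fits into" : "exceeds";

printf "%d characters need %d octets, which %s a LongReadLen of %d\n",
    length( $memo ), $octets, $verdict, $long_read_len;
```

So a field that looks short enough when counted in characters can still be too long for a given "LongReadLen".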

Reading a Unicode Database

use strict;

use warnings;

use DBI;

use DBD::ADO;

Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);

my $dbh = DBI->connect( "dbi:ADO:Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\\tmp.mdb;" );

$dbh->{LongReadLen} = 65535;

my $sth = $dbh->prepare( "SELECT tmp FROM tmp" );

$sth->execute();

my $row = $sth->fetchrow_hashref;

print ord( substr( $row->{'tmp'}, 0, 1 )); # prints "602"

$sth->finish(); $dbh->disconnect();

See Also

Perl Documentation (5.6 and 5.8)
perldoc perlunicode
Perl Documentation (5.8)
perldoc perluniintro
perldoc encoding
DBD::ADO, DBI and UTF-8
Jan Dubois
http://www.mail-archive.com/perl-unicode@perl.org/msg01954.html
http://www.xray.mpe.mpg.de/mailing-lists/dbi/2003-12/msg00063.html
adventures with mysql, perl and unicode
Brigitte Jellinek
http://perlwelt.horus.at/Beispiele/Magic/PerlUnicodeMysql/
Unicode in Perl 5.6.1
http://developers.sun.com/dev/gadc/unicode/perl/perl561.html
Perl-XML Frequently Asked Questions (Section on encodings)
http://perl-xml.sourceforge.net/faq/#encodings
Perl, Unicode and i18N FAQ
http://rf.net/~james/perli18n.html
UTF-8 and Unicode FAQ (not directly Perl-related)
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Perl 5.8 Documentation - Unicode Support
http://perl.active-venture.com/pod/perlguts-unicode.html
Unicode-processing issues in Perl and how to cope with it
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
"UTF-8" vs. "utf8"
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8
Perl Unicode Tutorial
http://juerd.nl/files/slides/2007yapceu/unicodetutorial.html

"ram@zedat.fu-berlin.de" (without the quotation marks) is the email address of Stefan Ram.

Copyright 1998-2014 Stefan Ram, Berlin. All rights reserved. This page is a publication by Stefan Ram.