UTF-16 Processing Issue in Perl

| No Comments | No TrackBacks
Here is an interesting piece of information on how to avoid problems while editing UTF-16|32 files using Perl.
Some time ago I wanted to read an UTF-16LE-encoded XML file, modify some strings, and then write out the
updated version. I followed the guidelines to transform the encoding to Perl's internal format using PerlIO:

open my $in, "<:encoding(UTF-16LE)", $in_path;
# some code goes here
open my $out, ">:encoding(UTF-16LE)", $out_path;

but couldn't make it work right this way. The extended characters weren't displayed correctly. After doing some research on the Internet, I found a solution in this thread of the perl-unicode list. Dan Kogai mentions that BOMed UTF files are not suitable for streaming models, and they must be slurped as binary files instead:

# read file
open(my $in, '<:raw', $in_path) || die "Couldn't open file: $!";
my $text = do { local $/; <$in> };
close $in;

# now decode to use character semantics (no need to specify LE or BE when reading)
my $content = decode('UTF-16LE', $text);
my @lines = split /\n/, $content;

# some code that turns @lines into @processed goes here

# write file
my $output_str = join "\n", @processed;

open my $out, '>:raw', $out_path;
print $out encode("UTF-16LE", $output_str);
close $out;

and that did the trick.

No TrackBacks

TrackBack URL: http://www.haboogo.com/cgi-bin/mt/mt-tb.cgi/8

Leave a comment

About this Entry

This page contains a single entry by Enrique Nell published on January 28, 2009 2:07 AM.

Code Syntax Highlighting was the previous entry in this blog.

Hot Books Basket(s) is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Pages

Powered by Movable Type 4.23-en