Here is an interesting piece of information on how to avoid problems while editing UTF-16|32 files using Perl.
Some time ago I wanted to read an UTF-16LE-encoded XML file, modify some strings, and then write out the
updated version. I followed the guidelines to transform the encoding to Perl's internal format using PerlIO:
open my $in, "<:encoding(UTF-16LE)", $in_path;
# some code goes here
open my $out, ">:encoding(UTF-16LE)", $out_path;
but couldn't make it work right this way. The extended characters weren't displayed correctly. After doing some research on the Internet, I found a solution in this thread of the perl-unicode list. Dan Kogai mentions that BOMed UTF files are not suitable for streaming models, and they must be slurped as binary files instead:
# read file
open(my $in, '<:raw', $in_path) || die "Couldn't open file: $!";
my $text = do { local $/; <$in> };
close $in;
# now decode to use character semantics (no need to specify LE or BE when reading)
my $content = decode('UTF-16LE', $text);
my @lines = split /\n/, $content;
# some code that turns @lines into @processed goes here
# write file
my $output_str = join "\n", @processed;
open my $out, '>:raw', $out_path;
print $out encode("UTF-16LE", $output_str);
close $out;
and that did the trick.
Some time ago I wanted to read an UTF-16LE-encoded XML file, modify some strings, and then write out the
updated version. I followed the guidelines to transform the encoding to Perl's internal format using PerlIO:
open my $in, "<:encoding(UTF-16LE)", $in_path;
# some code goes here
open my $out, ">:encoding(UTF-16LE)", $out_path;
but couldn't make it work right this way. The extended characters weren't displayed correctly. After doing some research on the Internet, I found a solution in this thread of the perl-unicode list. Dan Kogai mentions that BOMed UTF files are not suitable for streaming models, and they must be slurped as binary files instead:
# read file
open(my $in, '<:raw', $in_path) || die "Couldn't open file: $!";
my $text = do { local $/; <$in> };
close $in;
# now decode to use character semantics (no need to specify LE or BE when reading)
my $content = decode('UTF-16LE', $text);
my @lines = split /\n/, $content;
# some code that turns @lines into @processed goes here
# write file
my $output_str = join "\n", @processed;
open my $out, '>:raw', $out_path;
print $out encode("UTF-16LE", $output_str);
close $out;
and that did the trick.

Leave a comment