Friday, September 16, 2005

Perl Script to Convert CSV to Tab Delimited

Converting commas to tabs is trivial: print s/,/\t/g;

Now, handling quoted commas makes it interesting, especially when the quotes are only present when necessary. I wrote this today as an exercise. Perl r0XX0r5!!!!


###
# csv2tab.pl
# Converts comma delimited text to tab delimited text
# Input from STDIN, output to STDOUT
# Note: any pre-existing tabs are converted to commas

while (<>) {
s/,/\t/g;
while (m/"[^\t"]+\t[^\t"]*"/g)
{ s/("[^\t"]+)\t/\1,/g; }
print;
}

###

4 comments:

michael said...
This comment has been removed by the author.
michael said...

Hi, thanks for sharing the code -- unfortunately I had issues with fields containing commas inside of quotes. So here's another version -- with a more complex regexp -- adapted from Mastering Regular Expression and the Perl Cookbook Recipe 1.15 -- that should handle all cases. Hope this saves someone an hour or two.

while (<>) {
my @fields = ();
push(@fields, $+) while $_ =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@fields, undef) if substr($_, -1,1) eq ',';
print join("\t", @fields);
}

ProfStevie said...

The script michael posted is closer, but still seems to have a couple of problems:

1) I don't know if there's a "standard" per se, but afaik know Excel and OpenOffice both expect in delimited text files that "" is used for a quote within quotes, not \". For example,

"My 7"" records", "My 12"" records", "My compact discs"

2) Maybe it's my version of perl or something, but the second line of michael's regex seems to match the newline character(s) at the end of each line, causing an extra tab at the end of each line of the output.

I propose:

#!/usr/bin/perl

while (<>) {
my @fields = ();
push(@fields, $+) while $_ =~ m{
("[^"]*(?:""[^"]*)*"),?
| ([^,\n\r]+),?
| ,
}gx;
push(@fields, undef) if substr($_, -1,1) eq ',';
print join("\t", @fields);
print "\n";
}

of course you can get the reverse effect (tab-delimited to csv) by switch every , with \t and vice versa, which is actually the version I'm using.

Anonymous said...

Thanks for this! Saved me hours of work, great articles