Converting commas to tabs is trivial: print s/,/\t/g;
Now, handling quoted commas makes it interesting, especially when the quotes are only present when necessary. I wrote this today as an exercise. Perl r0XX0r5!!!!
###
# csv2tab.pl
# Converts comma delimited text to tab delimited text
# Input from STDIN, output to STDOUT
# Note: any pre-existing tabs are converted to commas
while (<>) {
s/,/\t/g;
while (m/"[^\t"]+\t[^\t"]*"/g)
{ s/("[^\t"]+)\t/\1,/g; }
print;
}
###
Friday, September 16, 2005
Subscribe to:
Post Comments (Atom)
4 comments:
Hi, thanks for sharing the code -- unfortunately I had issues with fields containing commas inside of quotes. So here's another version -- with a more complex regexp -- adapted from Mastering Regular Expression and the Perl Cookbook Recipe 1.15 -- that should handle all cases. Hope this saves someone an hour or two.
while (<>) {
my @fields = ();
push(@fields, $+) while $_ =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@fields, undef) if substr($_, -1,1) eq ',';
print join("\t", @fields);
}
The script michael posted is closer, but still seems to have a couple of problems:
1) I don't know if there's a "standard" per se, but afaik know Excel and OpenOffice both expect in delimited text files that "" is used for a quote within quotes, not \". For example,
"My 7"" records", "My 12"" records", "My compact discs"
2) Maybe it's my version of perl or something, but the second line of michael's regex seems to match the newline character(s) at the end of each line, causing an extra tab at the end of each line of the output.
I propose:
#!/usr/bin/perl
while (<>) {
my @fields = ();
push(@fields, $+) while $_ =~ m{
("[^"]*(?:""[^"]*)*"),?
| ([^,\n\r]+),?
| ,
}gx;
push(@fields, undef) if substr($_, -1,1) eq ',';
print join("\t", @fields);
print "\n";
}
of course you can get the reverse effect (tab-delimited to csv) by switch every , with \t and vice versa, which is actually the version I'm using.
Thanks for this! Saved me hours of work, great articles
Post a Comment