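I am using the stream editor sed to convert a large set of text files (400 MB) into CSV format.
I am very close to finishing, but the outstanding problem is quotes within quotes, for data like this: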
    1,word1,"description for word1","another text",""text contains "double quotes" some more text"
    2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
    3,word3,"description for "word3"","more text and more"
The desired output is:
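    1,word1,"description for word1","another text","text contains double quotes some more text"
    2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
    3,word3,"description for word3","more text and more"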
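I have been searching for help, but I am not getting close to a solution. I have tried the following sed commands with these regex patterns: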
    sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
    sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
These came from the questions below, but they do not seem to work with sed:
Related question for perl
Related question for SISS
The original files are *.txt, and I am trying to edit them with sed.
Solution
Here's one way using GNU awk and the FPAT variable:
    gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" }
    {
        # strip every quote from quoted fields, then re-wrap them once
        for (i=1; i<=NF; i++)
            if ($i ~ /^\".*\"$/) { gsub(/\"/,"",$i); $i=N $i N }
    }
    1' file
Results:
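    1,word1,"description for word1","another text","text contains double quotes some more text"
    2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
    3,word3,"description for word3","more text and more"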
Explanation:
Using FPAT, a field is defined as either “anything that is not a comma” or “a double quote, anything that is not a double quote, and a closing double quote”. Then, on every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field. The trailing 1 is an always-true pattern, so each record is printed, rebuilt with OFS as the separator.
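FPAT is specific to GNU awk. As a rough sketch of the same steps for an awk that lacks FPAT, the loop below peels fields off each line with match(), relying on awk's leftmost-longest matching to prefer the quoted alternative when it is the longer one. The function name strip_inner_quotes is made up for illustration. The sketch rebuilds every line, so empty fields (adjacent commas) are dropped, and it assumes that no single field contains both commas and inner quotes.

```shell
# Hypothetical helper: reads CSV-like lines on stdin, writes cleaned lines to stdout.
strip_inner_quotes() {
    awk '
    {
        line = $0; out = ""; sep = ""
        # Take the leftmost, longest match of: a quoted run, or a run of non-commas.
        while (match(line, /"[^"]+"|[^,]+/)) {
            field = substr(line, RSTART, RLENGTH)
            line  = substr(line, RSTART + RLENGTH)
            # Field wrapped in quotes: drop every quote, then re-wrap it once.
            if (field ~ /^".*"$/) {
                gsub(/"/, "", field)
                field = "\"" field "\""
            }
            out = out sep field
            sep = ","
        }
        print out
    }'
}

# Usage on the sample rows from the question.
strip_inner_quotes <<'EOF'
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"
EOF
```

Each row should come out with only the outer quotes left on its quoted fields. To process real files, redirect a file into the function and write the result to a new file; nothing here edits in place.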