regex - sed - Remove quotes within quotes in large CSV files

I am using the stream editor sed to convert a large amount of text file data (400MB) into CSV format.

I am very close to finishing, but the outstanding problem is quotes within quotes, for data like this:

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"

The desired output is:

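1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"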

I have been looking for help but have not come close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These patterns came from other questions, but they do not seem to work with sed.

Solution

Here is one way using GNU awk and the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"",$i); $i=N $i N } }1' file

Result:

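1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"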

Explanation:

Using FPAT, a field is defined as either “anything that is not a comma” or “a double quote, anything that is not a double quote, and a closing double quote”. Then, on every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.
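
For readers who want to see the field rule spelled out step by step, here is a rough Python sketch of the same logic. It is not part of the original answer: the names `split_fields` and `clean_line` are made up for illustration, and the sketch imitates awk's leftmost-longest matching by trying both alternatives at each position and keeping the longer match. Treat it as an approximation of the gawk behaviour under those assumptions, not a drop-in replacement.

```python
import re

# The two alternatives from the FPAT pattern in the gawk command.
QUOTED = re.compile(r'"[^"]+"')    # quote, one or more non-quotes, quote
UNQUOTED = re.compile(r'[^,]+')    # one or more characters that are not a comma


def split_fields(line):
    """Split a line into fields, keeping the longer match at each position
    (an imitation of awk's leftmost-longest rule; hypothetical helper)."""
    fields = []
    pos = 0
    while pos < len(line):
        matches = [m for m in (QUOTED.match(line, pos), UNQUOTED.match(line, pos)) if m]
        if not matches:
            pos += 1  # neither alternative matches here: this is a separator comma
            continue
        best = max(matches, key=lambda m: m.end())
        fields.append(best.group())
        pos = best.end()
    return fields


def clean_line(line):
    """Strip every quote from fields that start and end with a quote,
    then wrap those fields in a single pair of quotes again."""
    cleaned = []
    for field in split_fields(line):
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            field = '"' + field.replace('"', '') + '"'
        cleaned.append(field)
    return ",".join(cleaned)


print(clean_line('3,word3,"description for "word3"","more text and more"'))
# prints: 3,word3,"description for word3","more text and more"
```

One limitation shared with the field rule described above: a field that contains both stray inner quotes and a comma would be split at the comma, so such rows need separate handling.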
