I have a CSV file, file.csv, that contains dates and times like this:
id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f
I want to split the file based on the date in column 2. Using the example above, it should create two files:
file_1.csv
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id5,2017-04-28T00:31:54,65
id7,2017-04-28T21:04:30,7f
and
file_2.csv
id0,2020-12-12T07:18:26,7f
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id6,2020-12-12T20:13:47,45
I tried using sort and awk for the job, but it splits the file into 8 files, based on the date and time:
sort -k2 -t, file.csv | awk -F, '!($2 in col) {col[$2]=++i} {print > ("file_" i ".csv")}'
How can I split the file based only on the date (not the date and time)?
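A minimal sketch of the smallest change to the command above: key the awk array on just the date portion of column 2 via substr($2,1,10) instead of the whole timestamp, so every line that shares a date goes to the same file. Note that it still keeps one file open per distinct date.

sort -t, -k2 file.csv |
awk -F, '
    { d = substr($2, 1, 10) }               # date part of column 2, e.g. 2017-04-28
    !(d in col) { col[d] = ++i }            # first time this date is seen: assign the next file number
    { print > ("file_" col[d] ".csv") }     # write the line to that date'"'"'s file
'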
Latest replies
- #2 (5 months ago)
$ cat tst.sh
#!/usr/bin/env bash

awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' "${@:--}" |
sort -k1,1n -k2,2n |
cut -f3- |
awk -F'[ -]' '
    { curr = $2$3 }
    curr != prev {
        close(out)
        out = "file_" (++cnt) ".csv"
        prev = curr
    }
    { print > out }
'
./tst.sh file
$ head file_*
==> file_1.csv <==
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id5, 2017-04-28T00:31:54, 65
id7, 2017-04-28T21:04:30, 7f

==> file_2.csv <==
id0, 2020-12-12T07:18:26, 7f
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id6, 2020-12-12T20:13:47, 45
The above works robustly, efficiently, and portably with any POSIX awk, sort, and cut, and it preserves the input order within each output file.
Here is how the first three steps rearrange the contents of the input file:
$ cat file
id0, 2020-12-12T07:18:26, 7f
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id5, 2017-04-28T00:31:54, 65
id6, 2020-12-12T20:13:47, 45
id7, 2017-04-28T21:04:30, 7f
So by the time the final awk script runs, the lines it reads are sorted on the year+month taken from $2, with the input order preserved among all lines that share the same year+month:
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file
202012  1       id0, 2020-12-12T07:18:26, 7f
201704  2       id1, 2017-04-28T19:59:00, 80
201704  3       id2, 2017-04-28T03:14:35, e4
202012  4       id3, 2020-12-12T23:45:09, ff
202012  5       id4, 2020-12-12T09:12:34, a1
201704  6       id5, 2017-04-28T00:31:54, 65
202012  7       id6, 2020-12-12T20:13:47, 45
201704  8       id7, 2017-04-28T21:04:30, 7f
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n
201704  2       id1, 2017-04-28T19:59:00, 80
201704  3       id2, 2017-04-28T03:14:35, e4
201704  6       id5, 2017-04-28T00:31:54, 65
201704  8       id7, 2017-04-28T21:04:30, 7f
202012  1       id0, 2020-12-12T07:18:26, 7f
202012  4       id3, 2020-12-12T23:45:09, ff
202012  5       id4, 2020-12-12T09:12:34, a1
202012  7       id6, 2020-12-12T20:13:47, 45
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n | cut -f3-
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id5, 2017-04-28T00:31:54, 65
id7, 2017-04-28T21:04:30, 7f
id0, 2020-12-12T07:18:26, 7f
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id6, 2020-12-12T20:13:47, 45
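Note that this answer's sample file has a space after each comma, which is what lets -F'[ -]' isolate the year and month; the question's file.csv has no spaces after the commas. Below is a sketch of the same decorate/sort/cut/split pipeline adapted to that exact format, using the full YYYY-MM-DD date from column 2 as the key (the substr-based key is an adaptation of mine, not code from this answer; file_N.csv naming is kept from the question):

awk -F, -v OFS='\t' '{print substr($2,1,10), NR, $0}' file.csv |   # prepend the date and the original line number
sort -k1,1 -k2,2n |                                                # group by date, keep original order within a date
cut -f3- |                                                         # drop the two helper columns again
awk -F, '
    { curr = substr($2,1,10) }
    curr != prev {                        # date changed: start the next output file
        if (out != "") close(out)
        out = "file_" (++cnt) ".csv"
        prev = curr
    }
    { print > out }
'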
- #3 (5 months ago)
Taking it as stated, i.e. doing the sort first and then splitting into the different files, while also avoiding an awk array:

<infile sort -t, -k2 \
| awk -F, '{
      substr($2,1,10) != prev && nxt++
      print >> ("file_" nxt ".csv")
      close("file_" nxt ".csv")
      prev = substr($2,1,10)
  }'
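For comparison, here is a sketch of the same sort-first idea that keeps at most one output file open and only closes it when the date changes, instead of reopening and closing the file for every single line (file.csv stands in for infile; the variable names are arbitrary):

sort -t, -k2 file.csv |
awk -F, '
    {
        d = substr($2, 1, 10)             # YYYY-MM-DD part of column 2
        if (d != prev) {                  # date changed: move on to the next output file
            if (out != "") close(out)
            out = "file_" (++n) ".csv"
            prev = d
        }
        print > out                       # ">" truncates each file once per run, then keeps appending
    }
'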
One more note: if the file is large and has many unique dates, you may need an approach that avoids a "too many open files" error:
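A minimal sketch of one common way to do that, assuming the question's file.csv input and file_N.csv output naming: append to and immediately close each output file, so awk never holds more than one file open no matter how many distinct dates there are. Because ">>" appends, remove any existing file_*.csv before running.

awk -F, '
    {
        d = substr($2, 1, 10)                 # date part of column 2
        if (!(d in num)) num[d] = ++count     # number the files in order of first appearance
        out = "file_" num[d] ".csv"
        print >> out                          # append the line ...
        close(out)                            # ... then close, so only one file is open at a time
    }
' file.csv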