I have a CSV file, file.csv, that contains dates and times like this:

id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f

I want to split the file based on the date in column 2. With the example above, it should create two files:

file_1.csv
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id5,2017-04-28T00:31:54,65
id7,2017-04-28T21:04:30,7f

file_2.csv
id0,2020-12-12T07:18:26,7f
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id6,2020-12-12T20:13:47,45

I tried to do this with sort and awk, but my attempt splits the file into 8 files because it groups on both date and time:

sort -k2 -t, file.csv | awk -F, '!($2 in col) {col[$2]=++i} {print > ("file_" i ".csv")}'

How can I split the file based on the date only (not the date and time)?

Latest replies
  • 5 months ago
    1 #

    How about:

    awk -F', *' '
      { date = substr($2,1,10) }
      !(date in outfile) { outfile[date] = "file_" (++numout) ".csv" }
      { print > outfile[date] }
    ' file.csv
    

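    If the file is large and has many unique dates, you may need to replace the final print action with the following to avoid a "too many open files" error: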

     { print >> outfile[date]; close(outfile[date]) }
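
Put together, the close() variant might look like the sketch below. It assumes the comma-only sample from the question, treats the separator as a comma optionally followed by spaces, and runs in a scratch directory created with mktemp (not POSIX, but widely available) because ">>" appends to any file_N.csv that already exists.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory so ">>" never appends to old output.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
cat > file.csv <<'EOF'
id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f
EOF

awk -F', *' '
  { date = substr($2, 1, 10) }                      # YYYY-MM-DD part of column 2
  !(date in outfile) { outfile[date] = "file_" (++numout) ".csv" }
  { print >> outfile[date]; close(outfile[date]) }  # append, then release the file
' file.csv
```

Files are numbered in order of first appearance, so with this input file_1.csv holds the 2020-12-12 rows and file_2.csv the 2017-04-28 rows, which is the reverse of the numbering shown in the question.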
    

  • 5 months ago
    2 #

    $ cat tst.sh
    #!/usr/bin/env bash
    awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' "${@:--}" |
    sort -k1,1n -k2,2n |
    cut -f3- |
    awk -F'[ -]' '
        { curr = $2$3 }
        curr != prev {
            close(out)
            out = "file_" (++cnt) ".csv"
            prev = curr
        }
        { print > out }
    '
    

    ./tst.sh file
    

    $ head file_*
    ==> file_1.csv <==
    id1, 2017-04-28T19:59:00, 80
    id2, 2017-04-28T03:14:35, e4
    id5, 2017-04-28T00:31:54, 65
    id7, 2017-04-28T21:04:30, 7f
    ==> file_2.csv <==
    id0, 2020-12-12T07:18:26, 7f
    id3, 2020-12-12T23:45:09, ff
    id4, 2020-12-12T09:12:34, a1
    id6, 2020-12-12T20:13:47, 45
    

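    The above works robustly, efficiently, and portably with any POSIX awk, sort, and cut, and preserves the input order within each output file. It assumes fields are separated by a comma and a space, as in the sample input below.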

    Here is how the first three steps rearrange the contents of the input file:

    $ cat file
    id0, 2020-12-12T07:18:26, 7f
    id1, 2017-04-28T19:59:00, 80
    id2, 2017-04-28T03:14:35, e4
    id3, 2020-12-12T23:45:09, ff
    id4, 2020-12-12T09:12:34, a1
    id5, 2017-04-28T00:31:54, 65
    id6, 2020-12-12T20:13:47, 45
    id7, 2017-04-28T21:04:30, 7f
    

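    So by the time the final awk script runs, the lines are sorted by the year+month taken from $2, and lines with the same year+month keep their input order. Because the key is year+month only, two different days in the same month would land in the same file; the sample data has one day per month, so the result here matches a split by date.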

    $ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file
    202012  1       id0, 2020-12-12T07:18:26, 7f
    201704  2       id1, 2017-04-28T19:59:00, 80
    201704  3       id2, 2017-04-28T03:14:35, e4
    202012  4       id3, 2020-12-12T23:45:09, ff
    202012  5       id4, 2020-12-12T09:12:34, a1
    201704  6       id5, 2017-04-28T00:31:54, 65
    202012  7       id6, 2020-12-12T20:13:47, 45
    201704  8       id7, 2017-04-28T21:04:30, 7f
    

    $ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n
    201704  2       id1, 2017-04-28T19:59:00, 80
    201704  3       id2, 2017-04-28T03:14:35, e4
    201704  6       id5, 2017-04-28T00:31:54, 65
    201704  8       id7, 2017-04-28T21:04:30, 7f
    202012  1       id0, 2020-12-12T07:18:26, 7f
    202012  4       id3, 2020-12-12T23:45:09, ff
    202012  5       id4, 2020-12-12T09:12:34, a1
    202012  7       id6, 2020-12-12T20:13:47, 45
    

    $ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n | cut -f3-
    id1, 2017-04-28T19:59:00, 80
    id2, 2017-04-28T03:14:35, e4
    id5, 2017-04-28T00:31:54, 65
    id7, 2017-04-28T21:04:30, 7f
    id0, 2020-12-12T07:18:26, 7f
    id3, 2020-12-12T23:45:09, ff
    id4, 2020-12-12T09:12:34, a1
    id6, 2020-12-12T20:13:47, 45
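
Since the question asks for a split by full date, the same decorate, sort, undecorate idea can be keyed on the first 10 characters of column 2 instead. This is a sketch rather than the script above: it assumes the comma-only file.csv from the question, accepts an optional space after each comma, and uses a mktemp scratch directory (not POSIX) only to keep the example self-contained.

```shell
#!/bin/sh
# Sketch only: run in a scratch directory to avoid touching existing files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
cat > file.csv <<'EOF'
id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f
EOF

# 1. Decorate: prefix each line with its date (YYYY-MM-DD) and its line number.
# 2. Sort by date, then by line number, so input order survives within a date.
# 3. Undecorate: drop the two helper columns.
# 4. Split: start a new output file whenever the date changes.
awk -F', *' -v OFS='\t' '{ print substr($2, 1, 10), NR, $0 }' file.csv |
LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
cut -f3- |
awk -F', *' '
    { curr = substr($2, 1, 10) }
    curr != prev {
        if (out != "") close(out)      # only one output file open at a time
        out = "file_" (++cnt) ".csv"
        prev = curr
    }
    { print > out }
'
```

ISO 8601 dates sort chronologically as plain strings, which is why the first sort key needs no numeric flag.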
    

  • 5 months ago
    3 #

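    Doing it the way you started, that is, sorting first and then splitting into separate files, which also avoids awk arrays. Note that sort -k2 compares the whole timestamp, so lines within each output file end up ordered by time rather than in input order: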

    <infile sort -t, -k2 \
    | awk -F, '{
         date = substr($2, 1, 10)        # YYYY-MM-DD part of column 2
         if (date != prev) nxt++         # new date: move on to the next output file
         print >> ("file_" nxt ".csv"); close("file_" nxt ".csv")
         prev = date
    }'
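
If input order within each date matters, one option is to restrict the sort key to the date characters and ask for a stable sort. The sketch below assumes a sort that supports -s (GNU, BSD, and busybox sort do, but it is not in POSIX), the comma-only sample from the question, and a mktemp scratch directory because ">>" appends to existing files.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory so ">>" never appends to old output.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
cat > file.csv <<'EOF'
id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f
EOF

# -k2.1,2.10 limits the key to characters 1-10 of field 2 (the date);
# -s keeps lines with equal keys in their original order.
LC_ALL=C sort -s -t, -k2.1,2.10 file.csv |
awk -F, '{
    date = substr($2, 1, 10)
    if (date != prev) nxt++          # new date: move on to the next output file
    print >> ("file_" nxt ".csv"); close("file_" nxt ".csv")
    prev = date
}'
```

With the sample data this yields file_1.csv for 2017-04-28 and file_2.csv for 2020-12-12, each in input order.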
    
