
How can I generate n-grams of a string like this:

String Input="This is my car."

From this input I want to generate n-grams:

Input Ngram size = 3

The output should be:

This
is
my
car
This is
is my
my car
This is my
is my car

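Please give me some ideas on how to implement this in Java, or tell me whether a library is available for it.

I am trying to use this NGramTokenizer, but it produces n-grams of character sequences, whereas I want n-grams of word sequences.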

Latest replies
  • 6 months ago
    1 #

    You are looking for ShingleFilter.

    Update: the link points to version 3.0.2. This class may live in a different package in later versions of Lucene.

  • 6 months ago
    2 #

    I believe this will do what you want:

    import java.util.*;
    public class Test {
        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }
        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append((i > start ? " " : "") + words[i]);
            return sb.toString();
        }
        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : ngrams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }
    

    Output:

    This
    is
    my
    car.
    This is
    is my
    my car.
    This is my
    is my car.
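
    The output above keeps the trailing period ("car."), whereas the question asks for "car". The following is a minimal sketch of one way to close that gap, assuming a word is a run of `\w` characters and that all sizes from 1 up to the maximum should be returned in a single list. The names `WordNgrams` and `ngramsUpTo` are hypothetical and not part of the reply above:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class WordNgrams {
        // Returns all word n-grams of sizes 1..maxN, grouped by size.
        static List<String> ngramsUpTo(int maxN, String text) {
            // Assumption: anything that is not a letter, digit, or underscore
            // (including the trailing period) acts as a separator.
            List<String> words = new ArrayList<>();
            for (String w : text.split("\\W+")) {
                if (!w.isEmpty()) words.add(w); // skip the empty token a leading separator produces
            }
            List<String> result = new ArrayList<>();
            for (int n = 1; n <= maxN; n++) {
                for (int i = 0; i + n <= words.size(); i++) {
                    result.add(String.join(" ", words.subList(i, i + n)));
                }
            }
            return result;
        }

        public static void main(String[] args) {
            for (String gram : ngramsUpTo(3, "This is my car.")) {
                System.out.println(gram);
            }
        }
    }
    ```

    For the input from the question this prints the nine lines listed there, from "This" through "is my car", without the period.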
    

    An "on-demand" solution implemented as an Iterator:

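    import java.util.Iterator;
    import java.util.NoSuchElementException;

    class NgramIterator implements Iterator<String> {
        String[] words;
        int pos = 0, n;
        public NgramIterator(int n, String str) {
            this.n = n;
            words = str.split(" ");
        }
        // True while at least n words remain from the current position.
        public boolean hasNext() {
            return pos < words.length - n + 1;
        }
        // Builds the n-gram starting at pos, then advances by one word.
        public String next() {
            if (!hasNext())
                throw new NoSuchElementException();
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++)
                sb.append((i > pos ? " " : "") + words[i]);
            pos++;
            return sb.toString();
        }
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }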
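
    An iterator like the one above can only be walked once. As a sketch of one possible variation (not the reply author's code), the same idea can be wrapped in an `Iterable` so it works directly in a for-each loop and can be traversed repeatedly. The class name `LazyNgrams` is hypothetical, and the sketch assumes Java 8 or later, where `Iterator.remove()` has a default implementation:

    ```java
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    class LazyNgrams implements Iterable<String> {
        private final String[] words;
        private final int n;

        LazyNgrams(int n, String str) {
            this.n = n;
            this.words = str.split(" ");
        }

        // Each call returns a fresh iterator that starts at the first word.
        @Override
        public Iterator<String> iterator() {
            return new Iterator<String>() {
                private int pos = 0;

                @Override
                public boolean hasNext() {
                    return pos + n <= words.length;
                }

                @Override
                public String next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    // Join the n words starting at pos, then advance by one word.
                    String gram = String.join(" ", Arrays.copyOfRange(words, pos, pos + n));
                    pos++;
                    return gram;
                }
            };
        }

        public static void main(String[] args) {
            for (String gram : new LazyNgrams(2, "This is my car.")) {
                System.out.println(gram);
            }
        }
    }
    ```

    Each n-gram is built only when `next()` is called, so nothing beyond the word array is held in memory.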
    

  • 6 months ago
    3 #

    This code returns an array of all n-grams of the given length:

    public static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // No n-grams exist when len exceeds the word count; avoid a negative array size.
        String[] result = new String[Math.max(0, parts.length - len + 1)];
        for(int i = 0; i < result.length; i++) {
           StringBuilder sb = new StringBuilder();
           for(int k = 0; k < len; k++) {
               if(k > 0) sb.append(' ');
               sb.append(parts[i+k]);
           }
           result[i] = sb.toString();
        }
        return result;
    }
    

    For example:

    System.out.println(Arrays.toString(ngrams("This is my car", 2)));
    //--> [This is, is my, my car]
    System.out.println(Arrays.toString(ngrams("This is my car", 3)));
    //--> [This is my, is my car]
    

  • 6 months ago
    4 #

    /**
     * Generates all contiguous word n-grams of sizes 1 to maxGramSize.
     *
     * @param str the sentence; should contain at least one word
     * @param maxGramSize should be at least 1
     * @return list of contiguous word n-grams up to maxGramSize from the sentence
     */
    public static List<String> generateNgramsUpto(String str, int maxGramSize) {
        // Split on runs of non-word characters so punctuation is dropped.
        List<String> sentence = Arrays.asList(str.split("\\W+"));
        List<String> ngrams = new ArrayList<String>();
        int ngramSize = 0;
        StringBuilder sb = null;
        //sentence becomes ngrams
        for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
            String word = it.next();
            //1- add the word itself
            sb = new StringBuilder(word);
            ngrams.add(word);
            ngramSize=1;
            it.previous();
            //2- insert prevs of the word and add those too
            while(it.hasPrevious() && ngramSize<maxGramSize){
                sb.insert(0,' ');
                sb.insert(0,it.previous());
                ngrams.add(sb.toString());
                ngramSize++;
            }
            //go back to initial position
            while(ngramSize>0){
                ngramSize--;
                it.next();
            }                   
        }
        return ngrams;
    }
    

    Call:

    long startTime = System.currentTimeMillis();
    ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
    long stopTime = System.currentTimeMillis();
    System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
    System.out.println(ngrams.toString());
    

    Output:

    My time = 1 ms with ngramsize = 9
    [This, is, This is, my, is my, This is my, car, my car, is my car]

    public static void CreateNgram(ArrayList<String> list, int cutoff) {
        try
        {
            NGramModel ngramModel = new NGramModel();
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            perfMon.start();
            for(int i = 0; i<list.size(); i++)
            {
                String inputString = list.get(i);
                ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
                String line;
                while ((line = lineStream.read()) != null) 
                {
                    String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                    String[] tags = tagger.tag(whitespaceTokenizerLine);
                    POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                    perfMon.incrementCounter();
                    String words[] = sample.getSentence();
                    if(words.length > 0)
                    {
                        for(int k = 2; k< 4; k++)
                        {
                            ngramModel.add(new StringList(words), k, k);
                        }
                    }
                }
            }
            ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
            Iterator<StringList> it = ngramModel.iterator();
            while(it.hasNext())
            {
                StringList strList = it.next();
                System.out.println(strList.toString());
            }
            perfMon.stopAndPrintFinalResult();
        }catch(Exception e)
        {
            System.out.println(e.toString());
        }
    }
    

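    This is my code for building n-grams. In this case, n = 2 and 3. N-gram word sequences whose count is below the cutoff value are dropped from the result set. The input is a list of sentences, which is then parsed with the OpenNLP tools.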
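
    The cutoff idea can also be illustrated without OpenNLP. The following is a rough sketch in plain Java that counts word n-grams across a list of sentences and drops those seen fewer times than the cutoff. It follows the description in this reply rather than OpenNLP's actual implementation, and the names `NgramCutoff` and `countNgrams` are hypothetical:

    ```java
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    class NgramCutoff {
        // Counts word n-grams of sizes minN..maxN and keeps those seen at least `cutoff` times.
        static Map<String, Integer> countNgrams(List<String> sentences, int minN, int maxN, int cutoff) {
            Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
            for (String sentence : sentences) {
                String trimmed = sentence.trim();
                if (trimmed.isEmpty()) continue;
                // Assumption: words are separated by whitespace only.
                String[] words = trimmed.split("\\s+");
                for (int n = minN; n <= maxN; n++) {
                    for (int i = 0; i + n <= words.length; i++) {
                        String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                        counts.merge(gram, 1, Integer::sum);
                    }
                }
            }
            // Drop every n-gram whose count is below the cutoff.
            counts.values().removeIf(c -> c < cutoff);
            return counts;
        }

        public static void main(String[] args) {
            List<String> sentences = Arrays.asList("this is my car", "this is my house");
            System.out.println(countNgrams(sentences, 2, 3, 2));
        }
    }
    ```

    With the two sample sentences and a cutoff of 2, only "this is", "is my", and "this is my" remain, since each of them occurs in both sentences.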
