Preliminary Use of the IK Word Segmenter (a Java Development Example)

Development steps

Add the dependency

// IK Chinese word-segmentation dependency (Gradle)
implementation 'com.github.magese:ik-analyzer:8.5.0'
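For Maven builds, the same library can be declared with the equivalent coordinates (a mechanical translation of the Gradle line above):

```xml
<!-- Maven equivalent of the Gradle dependency above -->
<dependency>
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.5.0</version>
</dependency>
```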

Configure the extension words and stop words you need (IK conventionally loads this configuration from a file named IKAnalyzer.cfg.xml at the root of the classpath):

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">extwords.dic;</entry>
    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopwords.dic;</entry>
</properties>
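The dictionary files referenced in the configuration are plain UTF-8 text files, one entry per line, placed on the classpath next to the configuration file. For example, a hypothetical extwords.dic (the entries below are illustrative, not from the original) might look like:

```
逆旅
行人
```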

Code example

package com.wywtime.toolbox;

import org.junit.jupiter.api.Test;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.dic.Dictionary;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WordAnalyzerTest {

    @Test
    public void testWordAnalyzer() throws IOException {
        String text = "人生如逆旅,我亦是行人";
        StringReader sr = new StringReader(text);
        IKSegmenter ik = new IKSegmenter(sr, true); // true = smart (coarse-grained) mode
        // Without the next line, the result is: 人|生如|逆旅|我|亦是|行人
        Dictionary.getSingleton().disableWords(Arrays.asList("生如", "亦是"));
        Lexeme lex;
        List<String> words = new ArrayList<>();
        while ((lex = ik.next()) != null) {
            words.add(lex.getLexemeText());
        }
        // Output: 人生|如|逆旅|我亦|是|行人
        System.out.println(words.stream().collect(Collectors.joining("|")));
    }
}
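The opposite adjustment is also possible: instead of disabling the entries that win the wrong way, you can register whole phrases at runtime so they outrank the shorter entries they overlap with. The sketch below assumes this fork of IK exposes Dictionary.getSingleton().addWords(Collection<String>) alongside disableWords; treat it as an unverified sketch, and note that the words passed to addWords here are chosen for illustration.

```java
package com.wywtime.toolbox;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.dic.Dictionary;

import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

public class AddWordsSketch {
    public static void main(String[] args) throws IOException {
        String text = "人生如逆旅,我亦是行人";
        // Constructing the segmenter initializes the dictionary singleton,
        // so Dictionary.getSingleton() is safe to call afterwards.
        IKSegmenter ik = new IKSegmenter(new StringReader(text), true);
        // Hypothetical runtime additions: register the full words we want
        // to win, rather than disabling the fragments we don't.
        Dictionary.getSingleton().addWords(Arrays.asList("人生", "逆旅"));
        Lexeme lex;
        StringBuilder sb = new StringBuilder();
        while ((lex = ik.next()) != null) {
            if (sb.length() > 0) sb.append('|');
            sb.append(lex.getLexemeText());
        }
        System.out.println(sb);
    }
}
```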

Problems encountered

At first, the segmentation result was never what I wanted. I then defined custom extension words, but that did not help either. Finally, after digging into the source code, I discovered the disableWords method on the Dictionary class. This shows that words have priorities relative to one another; in real applications you will certainly need to decide segmentation priorities according to the scenario at hand.
