ES系列十三、Elasticsearch Suggester API(自动补全)

2020-01-16 21:35栏目:操作系统
TAG:

Win7之家:Google Suggest 开始变“聪明”了

原文载于Elastic中文社区: https://elasticsearch.cn/article/142

bob体育平台,1.概念

Google Suggest越来越实用了,现在又加入了拼写检查。以前我们必须在输入完毕后,Google才会在搜索结果页面里提示“你要找的是不是xx?”,现在Google直接在搜索建议里就会给出纠错后的内容,如下图,当你输入this is a rlly的时候,Google会在搜索建议里问你“你要找的是不是this is a rally”。

现代的搜索引擎,一般会具备"Suggest As You Type"功能,即在用户输入搜索的过程中,进行自动补全或者纠错。 通过协助用户输入更精准的关键词,提高后续全文搜索阶段文档匹配的程度。例如在Google上输入部分关键词,甚至输入拼写错误的关键词时,它依然能够提示出用户想要输入的内容:

1.补全api主要分为四类

  1. Term Suggester(纠错补全,输入错误的情况下补全正确的单词)
  2. Phrase Suggester(自动补全短语,输入一个单词补全整个短语)
  3. Completion Suggester(完成补全单词,输出如前半部分,补全整个单词)
  4. Context Suggester

整体效果类似百度搜索,如图:

bob体育平台 1

bob体育app,  

最不可思议的是这里的纠错还会根据上下文关系给出最适合的内容,当你继续输入this is a rlly beautiful pce of art之后,Google会冰雪聪明的将纠错建议改成“你要找的是不是this is a really beautiful piece of art”。

bob体育平台 2

2.Term Suggester

如果你输入Why its so important too eat hole grains,一般浏览器和编辑工具的纠错都不会认为你写错了──因为这里没有拼写错误。不过Google则再一次冰雪聪明的跳出来为你纠错:“你要找的是不是Why it's so important to eat whole grains”。 这意味着Google现在已经在尝试理解你要表达的意思,Google真的正在变聪明。

bob体育平台 3

2.1.api

1.建立索引

PUT /book4{  "mappings": {    "english": {      "properties": {        "passage": {          "type": "text"        }      }    }  }}

如果自己亲手去试一试,可以看到Google在用户刚开始输入的时候是自动补全的,而当输入到一定长度,如果因为单词拼写错误无法补全,就开始尝试提示相似的词。

2.插入数据

curl -H "Content-Type: application/json" -XPOST 'http:localhost:9200/_bulk' -d'{ "index" : { "_index" : "book4", "_type" : "english" } }{ "passage": "Lucene is cool"}{ "index" : { "_index" : "book4", "_type" : "english" } }{ "passage": "Elasticsearch builds on top of lucene"}{ "index" : { "_index" : "book4", "_type" : "english" } }{ "passage": "Elasticsearch rocks"}{ "index" : { "_index" : "book4", "_type" : "english" } }{ "passage": "Elastic is the company behind ELK stack"}{ "index" : { "_index" : "book4", "_type" : "english" } }{ "passage": "elk rocks"}{ "index" : { "_index" : "book4", "_type" : "english" } }{  "passage": "elasticsearch is rock solid"}'

那么类似的功能在Elasticsearch里如何实现呢? 答案就在Suggesters API。 Suggesters基本的运作原理是将输入的文本分解为token,然后在索引的字典里查找相似的term并返回。 根据使用场景的不同,Elasticsearch里设计了4种类别的Suggester,分别是:

3.看下储存的分词有哪些

post /_analyze{  "text": [    "Lucene is cool",    "Elasticsearch builds on top of lucene",    "Elasticsearch rocks",    "Elastic is the company behind ELK stack",    "elk rocks",    "elasticsearch is rock solid"  ]}

结果:

bob体育平台 4bob体育平台 5

{    "tokens": [        {            "token": "lucene",            "start_offset": 0,            "end_offset": 6,            "type": "<ALPHANUM>",            "position": 0        },        {            "token": "is",            "start_offset": 7,            "end_offset": 9,            "type": "<ALPHANUM>",            "position": 1        },        {            "token": "cool",            "start_offset": 10,            "end_offset": 14,            "type": "<ALPHANUM>",            "position": 2        },        {            "token": "elasticsearch",            "start_offset": 15,            "end_offset": 28,            "type": "<ALPHANUM>",            "position": 103        },        {            "token": "builds",            "start_offset": 29,            "end_offset": 35,            "type": "<ALPHANUM>",            "position": 104        },        {            "token": "on",            "start_offset": 36,            "end_offset": 38,            "type": "<ALPHANUM>",            "position": 105        },        {            "token": "top",            "start_offset": 39,            "end_offset": 42,            "type": "<ALPHANUM>",            "position": 106        },        {            "token": "of",            "start_offset": 43,            "end_offset": 45,            "type": "<ALPHANUM>",            "position": 107        },        {            "token": "lucene",            "start_offset": 46,            "end_offset": 52,            "type": "<ALPHANUM>",            "position": 108        },        {            "token": "elasticsearch",            "start_offset": 53,            "end_offset": 66,            "type": "<ALPHANUM>",            "position": 209        },        {            "token": "rocks",            "start_offset": 67,            "end_offset": 72,            "type": "<ALPHANUM>",            "position": 210        },        {            "token": "elastic",            "start_offset": 73,            "end_offset": 80,            "type": "<ALPHANUM>",            "position": 311        },        {            "token": "is",            "start_offset": 81,            "end_offset": 83,            "type": "<ALPHANUM>",            "position": 312        },        {            "token": "the",            "start_offset": 84,            "end_offset": 87,            "type": "<ALPHANUM>",            "position": 313        },        {            "token": "company",            "start_offset": 88,            "end_offset": 95,            "type": "<ALPHANUM>",            "position": 314        },        {            "token": "behind",            "start_offset": 96,            "end_offset": 102,            "type": "<ALPHANUM>",            "position": 315        },        {            "token": "elk",            "start_offset": 103,            "end_offset": 106,            "type": "<ALPHANUM>",            "position": 316        },        {            "token": "stack",            "start_offset": 107,            "end_offset": 112,            "type": "<ALPHANUM>",            "position": 317        },        {            "token": "elk",            "start_offset": 113,            "end_offset": 116,            "type": "<ALPHANUM>",            "position": 418        },        {            "token": "rocks",            "start_offset": 117,            "end_offset": 122,            "type": "<ALPHANUM>",            "position": 419        },        {            "token": "elasticsearch",            "start_offset": 123,            "end_offset": 136,            "type": "<ALPHANUM>",            "position": 520        },        {            "token": "is",            "start_offset": 137,            "end_offset": 139,            "type": "<ALPHANUM>",            "position": 521        },        {            "token": "rock",            "start_offset": 140,            "end_offset": 144,            "type": "<ALPHANUM>",            "position": 522        },        {            "token": "solid",            "start_offset": 145,            "end_offset": 150,            "type": "<ALPHANUM>",            "position": 523        }    ]}

View Code

  • Term Suggester
  • Phrase Suggester
  • Completion Suggester
  • Context Suggester

4.term suggest api

搜索下试试,给出错误单词Elasticsearaach

POST /book4/_search{    "suggest" : {    "my-suggestion" : {      "text" : "Elasticsearaach",      "term" : {        "field" : "passage",
     "suggest_mode": "popular"      }    }  }}

response:

{    "took": 26,    "timed_out": false,    "_shards": {        "total": 5,        "successful": 5,        "skipped": 0,        "failed": 0    },    "hits": {        "total": 0,        "max_score": 0,        "hits": []    },    "suggest": {        "my-suggestion": [            {                "text": "elasticsearaach",                "offset": 0,                "length": 15,                "options": [                    {                        "text": "elasticsearch",                        "score": 0.84615386,                        "freq": 3                    }                ]            }        ]    }}

在官方的参考文档里,对这4种Suggester API都有比较详细的介绍,但苦于只有英文版,部分国内开发者看完文档后仍然难以理解其运作机制。 本文将在Elasticsearch 5.x上通过示例讲解Suggester的基础用法,希望能帮助部分国内开发者快速用于实际项目开发。限于篇幅,更为高级的Context Suggester会被略过。

5.搜索多个字段分别给出提示:

POST _search{  "suggest": {    "my-suggest-1" : {      "text" : "tring out Elasticsearch",      "term" : {        "field" : "message"      }    },    "my-suggest-2" : {      "text" : "kmichy",      "term" : {        "field" : "user"      }    }  }}

term建议者提出基于编辑距离条款。在建议术语之前分析提供的建议文本。建议的术语是根据分析的建议文本标记提供的。该term建议者不走查询到的是是的请求部分。

首先来看一个Term Suggester的示例:
准备一个叫做blogs的索引,配置一个text字段。

常见建议选项:

text

建议文字。建议文本是必需的选项,需要全局或按建议设置。

field

从中获取候选建议的字段。这是一个必需的选项,需要全局设置或根据建议设置。

analyzer

用于分析建议文本的分析器。默认为建议字段的搜索分析器。

size

每个建议文本标记返回的最大更正。

sort

定义如何根据建议文本术语对建议进行排序。两个可能的值:

  • score:先按分数排序,然后按文档频率排序,再按术语本身排序。
  • frequency:首先按文档频率排序,然后按相似性分数排序,然后按术语本身排序。

suggest_mode

建议模式控制包含哪些建议或控制建议的文本术语,建议。可以指定三个可能的值:

  • missing:仅提供不在索引词典中,但是在原文档中的词。这是默认值。
  • popular:仅提供在索引词典中出现的词语。
  • always:索引词典中出没出现的词语都要给出建议。
PUT /blogs/
{
  "mappings": {
    "tech": {
      "properties": {
        "body": {
          "type": "text"
        }
      }
    }
  }
}

其他术语建议选项:

lowercase_terms

在文本分析之后,建议文本术语小写。

max_edits

最大编辑距离候选建议可以具有以便被视为建议。只能是介于1和2之间的值。任何其他值都会导致抛出错误的请求错误。默认为2。

prefix_length

必须匹配的最小前缀字符的数量才是候选建议。默认为1.增加此数字可提高拼写检查性能。通常拼写错误不会出现在术语的开头。(旧名“prefix_len”已弃用)

min_word_length

建议文本术语必须具有的最小长度才能包含在内。默认为4.(旧名称“min_word_len”已弃用)

shard_size

设置从每个单独分片中检索的最大建议数。在减少阶段,仅根据size选项返回前N个建议。默认为该size选项。将此值设置为高于该值的值size可能非常有用,以便以性能为代价获得更准确的拼写更正文档频率。由于术语在分片之间被划分,因此拼写校正频率的分片级文档可能不准确。增加这些将使这些文档频率更精确。

max_inspections

用于乘以的因子,shards_size以便在碎片级别上检查更多候选拼写更正。可以以性能为代价提高准确性。默认为5。

min_doc_freq

建议应出现的文档数量的最小阈值。可以指定为绝对数字或文档数量的相对百分比。这可以仅通过建议高频项来提高质量。默认为0f且未启用。如果指定的值大于1,则该数字不能是小数。分片级文档频率用于此选项。

max_term_freq

建议文本令牌可以存在的文档数量的最大阈值,以便包括在内。可以是表示文档频率的相对百分比数或绝对数。如果指定的值大于1,则不能指定小数。默认为0.01f。这可用于排除高频术语的拼写检查。高频术语通常拼写正确,这也提高了拼写检查的性能。分片级文档频率用于此选项。

string_distance

用于比较类似建议术语的字符串距离实现。可以指定五个可能的值:internal- 默认值基于damerau_levenshtein,但高度优化用于比较索引中术语的字符串距离。damerau_levenshtein- 基于Damerau-Levenshtein算法的字符串距离算法。levenshtein- 基于Levenshtein编辑距离算法的字符串距离算法。jaro_winkler- 基于Jaro-Winkler算法的字符串距离算法。ngram- 基于字符n-gram的字符串距离算法。

官方api

通过bulk api写入几条文档

2.phase sguesster:短语纠错

phrase 短语建议,在term的基础上,会考量多个term之间的关系,比如是否同时出现在索引的原文里,相邻程度,以及词频等

示例1:

POST book4/_search

{  "suggest" : { "myss":{
"text": "Elasticsearch rock",
"phrase": {
"field": "passage"
}
}
}
}

{    "took": 11,    "timed_out": false,    "_shards": {        "total": 5,        "successful": 5,        "skipped": 0,        "failed": 0    },    "hits": {        "total": 0,        "max_score": 0,        "hits": []    },    "suggest": {        "myss": [            {                "text": "Elasticsearch rock",                "offset": 0,                "length": 18,                "options": [                    {                        "text": "elasticsearch rocks",                        "score": 0.3467123                    }                ]            }        ]    }}
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "elk rocks"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{  "body": "elasticsearch is rock solid"}

3. Completion suggester 自动补全

针对自动补全场景而设计的建议器。此场景下用户每输入一个字符的时候,就需要即时发送一次查询请求到后端查找匹配项,在用户输入速度较高的情况下对后端响应速度要求比较苛刻。因此实现上它和前面两个Suggester采用了不同的数据结构,索引并非通过倒排来完成,而是将analyze过的数据编码成FST和索引一起存放。对于一个open状态的索引,FST会被ES整个装载到内存里的,进行前缀查找速度极快。但是FST只能用于前缀查找,这也是Completion Suggester的局限所在

此时blogs索引里已经有一些文档了,可以进行下一步的探索。为帮助理解,我们先看看哪些term会存在于词典里。
将输入的文本分析一下:

1.建立索引

POST /book5{    "mappings": {        "music" : {            "properties" : {                "suggest" : {                     "type" : "completion"                },                "title" : {                    "type": "keyword"                }            }        }    }}

插入数据:

POST /book5/music{    "suggest":"test my book"}

Input 指定输入词 Weight 指定排序值

PUT music/music/5nupmmUBYLvVFwGWH3cu?refresh{    "suggest" : {        "input": [ "test", "book" ],        "weight" : 34    }}

指定不同的排序值:

PUT music/_doc/6Hu2mmUBYLvVFwGWxXef?refresh{    "suggest" : [        {            "input": "test",            "weight" : 10        },        {            "input": "good",            "weight" : 3        }    ]}
POST _analyze
{
  "text": [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid"
  ]
}

示例1:查询建议根据前缀查询

POST book5/_search?pretty{    "suggest": {        "song-suggest" : {            "prefix" : "te",             "completion" : {                 "field" : "suggest"             }        }    }}

{    "took": 8,    "timed_out": false,    "_shards": {        "total": 5,        "successful": 5,        "skipped": 0,        "failed": 0    },    "hits": {        "total": 0,        "max_score": 0,        "hits": []    },    "suggest": {        "song-suggest": [            {                "text": "te",                "offset": 0,                "length": 2,                "options": [                    {                        "text": "test my book1",                        "_index": "book5",                        "_type": "music",                        "_id": "6Xu6mmUBYLvVFwGWpXeL",                        "_score": 1,                        "_source": {                            "suggest": "test my book1"                        }                    },                    {                        "text": "test my book1",                        "_index": "book5",                        "_type": "music",                        "_id": "6nu8mmUBYLvVFwGWSndC",                        "_score": 1,                        "_source": {                            "suggest": "test my book1"                        }                    },                    {                        "text": "test my book1 english",                        "_index": "book5",                        "_type": "music",                        "_id": "63u8mmUBYLvVFwGWZHdC",                        "_score": 1,                        "_source": {                            "suggest": "test my book1 english"                        }                    }                ]            }        ]    }}

(由于结果太长,此处略去)

示例2:对建议查询结果去重

{    "suggest": {        "song-suggest" : {            "prefix" : "te",             "completion" : {                 "field" : "suggest" ,                 "skip_duplicates": true             }        }    }}

这些分出来的token都会成为词典里一个term,注意有些token会出现多次,因此在倒排索引里记录的词频会比较高,同时记录的还有这些token在原文档里的偏移量和相对位置信息。
执行一次suggester搜索看看效果:

示例3:查询建议文档存储短语

POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh{    "suggest" : {        "input": [ "book1 english", "test english" ],        "weight" : 20    }}

查询:

POST book5/_search?pretty{    "suggest": {        "song-suggest" : {            "prefix" : "test",             "completion" : {                 "field" : "suggest" ,                "skip_duplicates": true            }        }    }}

结果:

{    "took": 7,    "timed_out": false,    "_shards": {        "total": 5,        "successful": 5,        "skipped": 0,        "failed": 0    },    "hits": {        "total": 0,        "max_score": 0,        "hits": []    },    "suggest": {        "song-suggest": [            {                "text": "test",                "offset": 0,                "length": 4,                "options": [                    {                        "text": "test english",                        "_index": "book5",                        "_type": "music",                        "_id": "63u8mmUBYLvVFwGWZHdC",                        "_score": 20,                        "_source": {                            "suggest": {                                "input": [                                    "book1 english",                                    "test english"                                ],                                "weight": 20                            }                        }                    },                    {                        "text": "test my book1",                        "_index": "book5",                        "_type": "music",                        "_id": "6Xu6mmUBYLvVFwGWpXeL",                        "_score": 1,                        "_source": {                            "suggest": "test my book1"                        }                    }                ]            }        ]    }}
POST /blogs/_search
{ 
  "suggest": {
    "my-suggestion": {
      "text": "lucne rock",
      "term": {
        "suggest_mode": "missing",
        "field": "body"
      }
    }
  }
}

4. 总结和建议

因此用好Completion Sugester并不是一件容易的事,实际应用开发过程中,需要根据数据特性和业务需要,灵活搭配analyzer和mapping参数,反复调试才可能获得理想的补全效果。

回到篇首搜索框的补全/纠错功能,如果用ES怎么实现呢?我能想到的一个的实现方式:

  1. 在用户刚开始输入的过程中,使用Completion Suggester进行关键词前缀匹配,刚开始匹配项会比较多,随着用户输入字符增多,匹配项越来越少。如果用户输入比较精准,可能Completion Suggester的结果已经够好,用户已经可以看到理想的备选项了。
  2. 如果Completion Suggester已经到了零匹配,那么可以猜测是否用户有输入错误,这时候可以尝试一下Phrase Suggester。
  3. 如果Phrase Suggester没有找到任何option,开始尝试term Suggester。

精准程度上(Precision)看: Completion > Phrase > term, 而召回率上则反之。从性能上看,Completion Suggester是最快的,如果能满足业务需求,只用Completion Suggester做前缀匹配是最理想的。 Phrase和Term由于是做倒排索引的搜索,相比较而言性能应该要低不少,应尽量控制suggester用到的索引的数据量,最理想的状况是经过一定时间预热后,索引可以全量map到内存。

suggest就是一种特殊类型的搜索,DSL内部的text指的是api调用方提供的文本,也就是通常用户界面上用户输入的内容。这里的lucne是错误的拼写,模拟用户输入错误。 term表示这是一个term suggester。 field指定suggester针对的字段,另外有一个可选的suggest_mode。 范例里的"missing"实际上就是缺省值,它是什么意思?有点挠头... 还是先看看返回结果吧:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits":
  },
  "suggest": {
    "my-suggestion": [
      {
        "text": "lucne",
        "offset": 0,
        "length": 5,
        "options": [
          {
            "text": "lucene",
            "score": 0.8,
            "freq": 2
          }
        ]
      },
      {
        "text": "rock",
        "offset": 6,
        "length": 4,
        "options":
      }
    ]
  }
}

版权声明:本文由bob体育app发布于操作系统,转载请注明出处:ES系列十三、Elasticsearch Suggester API(自动补全)