
    [[shingles]]
    === Finding Associated Words

    As useful as phrase and proximity queries can be, they still have a downside.
    They are overly strict: all terms must be present for a phrase query to match,
    even when using `slop`.((("proximity matching", "finding associated words", range="startofrange", id="ix_proxmatchassoc")))

    The flexibility in word ordering that you gain with `slop` also comes at a
    price, because you lose the association between word pairs. While you can
    identify documents in which `sue`, `alligator`, and `ate` occur close together,
    you can’t tell whether _Sue ate_ or the _alligator ate_.

    When words are used in conjunction with each other, they express an idea that
    is bigger or more meaningful than each word in isolation. The two clauses
    _I’m not happy I’m working_ and _I’m happy I’m not working_ contain the same
    words, in close proximity, but have quite different meanings.

    If, instead of indexing each word independently, we were to index pairs of
    words, then we could retain more of the context in which the words were used.

    For the sentence `Sue ate the alligator`, we would not only index each word
    (or _unigram_) as((("unigrams"))) a term:

    ["sue", "ate", "the", "alligator"]

    but also each word and its neighbor as single terms:

    ["sue ate", "ate the", "the alligator"]

    These word ((("bigrams")))pairs (or _bigrams_) are ((("shingles")))known as _shingles_.

    [TIP]
    ==================================================
    Shingles are not restricted to being pairs of words; you could index word
    triplets (_trigrams_) as ((("trigrams")))well:

    ["sue ate the", "ate the alligator"]

    Trigrams give you a higher degree of precision, but greatly increase the
    number of unique terms in the index. Bigrams are sufficient for most use
    cases.

    ==================================================
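To make the idea concrete, here is a minimal sketch in Python of how word n-grams could be produced from a sentence. It is not how Elasticsearch implements shingling: the `shingles` function name and the naive lowercase-and-split tokenization are assumptions made purely for illustration.

```python
def shingles(text, size=2):
    """Return the word n-grams of `text` as space-joined strings."""
    # Naive stand-in for analysis: lowercase, then split on whitespace.
    tokens = text.lower().split()
    # Slide a window of `size` tokens across the token list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


sentence = "Sue ate the alligator"
unigrams = shingles(sentence, size=1)
bigrams = shingles(sentence, size=2)
trigrams = shingles(sentence, size=3)
```

A four-word sentence yields four unigrams, three bigrams, and two trigrams, matching the lists shown above.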

    Of course, shingles are useful only if the user enters the query in the same
    order as in the original document; a query for `sue alligator` would match
    the individual words but none of our shingles.

    Fortunately, users tend to express themselves using constructs similar to
    those that appear in the data they are searching. But this point is an
    important one: it is not enough to index just bigrams; we still need unigrams,
    but we can use matching bigrams as a signal to increase the relevance score.
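The order sensitivity described here can be checked with a small Python sketch. The `terms` helper and its whitespace tokenization are hypothetical simplifications, not part of Elasticsearch; the sketch only compares the sets of terms that a document and a query would produce.

```python
def terms(text, size):
    """Return the set of word n-grams of the given size (illustrative only)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


document = "Sue ate the alligator"
query = "sue alligator"

# Terms that the query and the document have in common.
shared_unigrams = terms(document, 1) & terms(query, 1)
shared_bigrams = terms(document, 2) & terms(query, 2)
```

The query shares the unigrams `sue` and `alligator` with the document but no bigrams, which is why the unigram field still has to carry the match.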

    ==== Producing Shingles

    Shingles need to be created at index time as part of the analysis process.((("shingles", "producing at index time"))) We
    could index both unigrams and bigrams into a single field, but it is cleaner
    to keep unigrams and bigrams in separate fields that can be queried
    independently. The unigram field would form the basis of our search, with the
    bigram field being used to boost relevance.

    First, we need to create an analyzer that uses the shingle token filter:

    [source,js]
    --------------------------------------------------
    DELETE /my_index

    PUT /my_index
    {
        "settings": {
            "number_of_shards": 1, <1>
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type":             "shingle",
                        "min_shingle_size": 2, <2>
                        "max_shingle_size": 2, <2>
                        "output_unigrams":  false <3>
                    }
                },
                "analyzer": {
                    "my_shingle_analyzer": {
                        "type":      "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "my_shingle_filter" <4>
                        ]
                    }
                }
            }
        }
    }
    --------------------------------------------------

    // SENSE: 120_Proximity_Matching/35_Shingles.json

    <1> See <>.

    <2> The default min/max shingle size is 2 so we don’t really need to set
    these.

    <3> The shingle token filter outputs unigrams by default, but we want to
    keep unigrams and bigrams separate.

    <4> The `my_shingle_analyzer` uses our custom `my_shingle_filter` token
    filter.

    Before going further, let’s test that our analyzer is working as expected
    with the `analyze` API:

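    [source,js]
    --------------------------------------------------
    GET /my_index/_analyze?analyzer=my_shingle_analyzer
    Sue ate the alligator
    --------------------------------------------------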

    Sure enough, we get back three terms:

    * `sue ate`
    * `ate the`
    * `the alligator`

    Now we can proceed to setting up a field to use the new analyzer.

    ==== Multifields

    We said that it is cleaner to index unigrams and bigrams separately, so we
    will create the `title` field ((("multifields")))as a multifield (see <>):

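    [source,js]
    --------------------------------------------------
    PUT /my_index/_mapping/my_type
    {
        "my_type": {
            "properties": {
                "title": {
                    "type": "string",
                    "fields": {
                        "shingles": {
                            "type":     "string",
                            "analyzer": "my_shingle_analyzer"
                        }
                    }
                }
            }
        }
    }
    --------------------------------------------------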

    With this mapping, values from our JSON document in the field title will be
    indexed both as unigrams (title) and as bigrams (title.shingles), meaning
    that we can query these fields independently.

    And finally, we can index our example documents:

    [source,js]
    --------------------------------------------------
    POST /my_index/my_type/_bulk
    { "index": { "_id": 1 }}
    { "title": "Sue ate the alligator" }
    { "index": { "_id": 2 }}
    { "title": "The alligator ate Sue" }
    { "index": { "_id": 3 }}
    { "title": "Sue never goes anywhere without her alligator skin purse" }
    --------------------------------------------------

    ==== Searching for Shingles

    To understand the benefit ((("shingles", "searching for")))that the `shingles` field adds, let’s first look at
    the results from a simple `match` query for ``The hungry alligator ate Sue'':

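    [source,js]
    --------------------------------------------------
    GET /my_index/my_type/_search
    {
        "query": {
            "match": {
                "title": "the hungry alligator ate sue"
            }
        }
    }
    --------------------------------------------------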

    This query returns all three documents, but note that documents 1 and 2
    have the same relevance score because they contain the same words:

    [source,js]
    --------------------------------------------------
    {
        "hits": [
            {
                "_id": "1",
                "_score": 0.44273707, <1>
                "_source": {
                    "title": "Sue ate the alligator"
                }
            },
            {
                "_id": "2",
                "_score": 0.44273707, <1>
                "_source": {
                    "title": "The alligator ate Sue"
                }
            },
            {
                "_id": "3", <2>
                "_score": 0.046571054,
                "_source": {
                    "title": "Sue never goes anywhere without her alligator skin purse"
                }
            }
        ]
    }
    --------------------------------------------------

    <1> Both documents contain `the`, `alligator`, and `ate`, and so have the
    same score.

    <2> We could have excluded document 3 by setting the minimum_should_match
    parameter. See <>.

    Now let’s add the shingles field into the query. Remember that we want
    matches on the shingles field to act as a signal—to increase the
    relevance score—so we still need to include the query on the main title
    field:

    [source,js]
    --------------------------------------------------
    GET /my_index/my_type/_search
    {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        "title": "the hungry alligator ate sue"
                    }
                },
                "should": {
                    "match": {
                        "title.shingles": "the hungry alligator ate sue"
                    }
                }
            }
        }
    }
    --------------------------------------------------

    We still match all three documents, but document 2 has now been bumped into
    first place because it matched the shingled terms `alligator ate` and `ate sue`.

    [source,js]
    --------------------------------------------------
    {
        "hits": [
            {
                "_id": "2",
                "_score": 0.4883322,
                "_source": {
                    "title": "The alligator ate Sue"
                }
            },
            {
                "_id": "1",
                "_score": 0.13422975,
                "_source": {
                    "title": "Sue ate the alligator"
                }
            },
            {
                "_id": "3",
                "_score": 0.014119488,
                "_source": {
                    "title": "Sue never goes anywhere without her alligator skin purse"
                }
            }
        ]
    }
    --------------------------------------------------

    Even though our query included the word hungry, which doesn’t appear in
    any of our documents, we still managed to use word proximity to return the
    most relevant document first.
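The reason for the new ordering can be traced with a rough Python sketch that counts, for each title, how many unigrams and which bigrams it shares with the query. The `terms` helper and the whitespace tokenization are assumptions for illustration only, and the counts are a stand-in for relevance rather than the `_score` values that Elasticsearch calculates.

```python
def terms(text, size):
    """Return the set of word n-grams of the given size (illustrative only)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


titles = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}
query = "the hungry alligator ate sue"

# Number of query words found in each title (the `title` field).
unigram_hits = {
    doc_id: len(terms(title, 1) & terms(query, 1))
    for doc_id, title in titles.items()
}

# Query bigrams found in each title (the `title.shingles` field).
shingle_hits = {
    doc_id: sorted(terms(title, 2) & terms(query, 2))
    for doc_id, title in titles.items()
}
```

Documents 1 and 2 share the same four words with the query, so the unigram field alone cannot separate them. Only document 2 shares any bigrams with the query, and that is the extra signal the `should` clause picks up.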

    ==== Performance

    Not only are shingles more flexible than phrase queries,((("shingles", "better performance than phrase queries"))) but they perform better
    as well. Instead of paying the price of a phrase query every time you search,
    queries for shingles are just as efficient as a simple `match` query. A small
    price is paid at index time, because more terms need to be indexed, which
    also means that fields with shingles use more disk space.
    However, most applications write once and read many times, so it makes sense
    to optimize for fast queries.

    This is a theme that you will encounter frequently in Elasticsearch: it
    enables you to achieve a lot at search time, without requiring any up-front
    setup. Once you understand your requirements more clearly, you can achieve
    better results with better performance by modeling your data correctly at
    index time.
    ((("proximity matching", "finding associated words", range="endofrange", startref="ix_proxmatchassoc")))