
Illustrated Elasticsearch: Four Ways to Find the Differences Between Two Indices

Author: 銘毅天下 | Updated: 2022-07-26 | Category: Programming Languages

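1. The real-world problem

"...I have two indices. Suppose index1 contains id1, id2, id3 and index2 contains id1, id3. My goal is to find the data for the missing id2, and to detect any gaps when id4, id5 are added later." (Source of the question: the 死磕 Elasticsearch Knowledge Planet community)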

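2. Interpreting the problem

Assume there are two indices, index1 and index2, that share a large amount of identical data.

In essence, the task is to perform something like the Linux diff command: find the data that exists in one index but not in the other.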

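3. Discussion of approaches

Elasticsearch has no built-in diff-like command for finding the differences between two indices.

Redis, however, has the sdiff command, which returns in one step the members that are in one set but not in another.

This leads to approach 1: use Redis.

So the question becomes: without Redis, can Elasticsearch handle this on its own?

It can. We search both indices together, aggregate on the primary-key field they share, and then count the distinct indices for each key. Any id with a count < 2 is one we are looking for, because an id that exists in both indices necessarily has a count >= 2 after aggregation. This is approach 2.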
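
The counting idea behind approach 2 can be sketched outside Elasticsearch. The snippet below is only an illustration in plain Python, not the article's implementation: the function name and the in-memory list of (index name, id) pairs are assumptions made for the example.

```python
from collections import defaultdict


def ids_missing_from_one_index(docs):
    """Return ids that appear in fewer than two distinct indices.

    `docs` is an iterable of (index_name, doc_id) pairs gathered from
    both indices (a hypothetical in-memory stand-in for a combined search).
    """
    indices_per_id = defaultdict(set)
    for index_name, doc_id in docs:
        # Group by id and remember which indices contain it.
        indices_per_id[doc_id].add(index_name)
    # Keep only the ids whose distinct-index count is < 2.
    return sorted(
        doc_id for doc_id, names in indices_per_id.items() if len(names) < 2
    )


docs = [
    ("index1", "id1"), ("index1", "id2"), ("index1", "id3"),
    ("index2", "id1"), ("index2", "id3"),
]
print(ids_missing_from_one_index(docs))  # ['id2']
```

Counting distinct index names rather than documents means that duplicate ids inside a single index do not hide a gap.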

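We can also use an Elasticsearch transform. This is approach 3.

Since this is a common problem across the industry, is there an open-source implementation? This is approach 4.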

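4. Implementing the approaches

4.1 Approach 1: Redis sdiff

Prerequisite: the documents in the Elasticsearch indices have a field that, like a MySQL primary key, uniquely identifies a record. If there is no such field, the _id field can be used, but this is not recommended for reasons explained below.

The steps are:

Step 1: Load the primary-key field uniq_1 of index1 (the larger, full index) into Redis.
Step 2: Load the primary-key field uniq_2 of index2 into Redis.
Step 3: Run the sdiff command; the result is the set of differing id values.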
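
The three steps above can be sketched without a running Redis server. The following is a minimal illustration that mimics the set-difference semantics of SDIFF with Python sets; the key names index1_keys and index2_keys and the sample ids are assumptions, and a real setup would load the values with SADD and call SDIFF instead.

```python
def sdiff(first, *others):
    """Mimic Redis SDIFF: members of `first` that are in none of `others`."""
    result = set(first)
    for other in others:
        result -= set(other)
    return result


# Step 1: primary keys of index1 (in Redis: SADD index1_keys id1 id2 id3)
index1_keys = {"id1", "id2", "id3"}
# Step 2: primary keys of index2 (in Redis: SADD index2_keys id1 id3)
index2_keys = {"id1", "id3"}
# Step 3: ids present in index1 but absent from index2
# (in Redis: SDIFF index1_keys index2_keys)
missing = sdiff(index1_keys, index2_keys)
print(sorted(missing))  # ['id2']
```

The argument order matters: the full index goes first, because the difference only reports members of the first set.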

4.2 Approach 2: Elasticsearch aggregations

We simulate this with the sample data that ships with Kibana.

4.2.1 Setting up with existing indices (easy to follow and reproduce)

POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "kibana_sample_data_flights_ext"
  }
}

GET kibana_sample_data_flights/_count

The following query matches 60 documents; these will serve as the differing records:
POST kibana_sample_data_flights_ext/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "OriginCountry.keyword": {
              "value": "US"
            }
          }
        },
        {
          "term": {
            "OriginWeather.keyword": {
              "value": "Rain"
            }
          }
        },
        {
          "term": {
            "DestWeather.keyword": {
              "value": "Rain"
            }
          }
        }
      ]
    }
  }
}

Deleting them removes 60 records (the response reports "deleted" : 60):
POST kibana_sample_data_flights_ext/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "OriginCountry.keyword": {
              "value": "US"
            }
          }
        },
        {
          "term": {
            "OriginWeather.keyword": {
              "value": "Rain"
            }
          }
        },
        {
          "term": {
            "DestWeather.keyword": {
              "value": "Rain"
            }
          }
        }
      ]
    }
  }
}

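After this, the _data_flights_ext index has 60 fewer documents than the _data_flights index.

How do we build the aggregation?

First, apply the following cluster-wide setting to avoid a possible error: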

PUT _cluster/settings
{
  "persistent": {
    "indices.id_field_data.enabled": true
  }
}

4.2.2 The aggregation and distinct-count DSL

POST kibana_sample_data_flights,kibana_sample_data_flights_ext/_search
{
  "size": 0,
  "aggs": {
    "group_by_uid": {
      "terms": {
        "field": "_id",
        "size": 1000000
      },
      "aggs": {
        "count_indices": {
          "cardinality": {
            "field": "_index"
          }
        },
        "values_bucket_filter_by_index_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "count_indices"
            },
            "script": "params.count < 2"
          }
        }
      }
    }
  }
}

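The size value is set high so that the terms aggregation returns a bucket for every id; with a smaller size, ids beyond the limit are dropped and the result is incomplete.

Without the cluster setting shown earlier, the request fails with:

"reason" : "Fielddata access on the _id field is disallowed, you can re-enable it by updating the dynamic cluster setting: indices.id_field_data.enabled"

In other words, 8.x discourages using _id as an aggregation field, which is why generating your own uniq_id field was recommended earlier.

The result of the request:

[Screenshot: aggregation result]

The buckets with a doc_count of 1 are the results we are after.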
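
To collect those ids programmatically, the aggregation response can be walked as an ordinary nested dictionary. The sketch below assumes the response has already been parsed from JSON and follows the aggregation names used in the DSL above (group_by_uid, count_indices); the function name and the sample response values are made up for illustration.

```python
def single_index_ids(response, agg_name="group_by_uid"):
    """Return the keys of buckets that were found in only one index."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # bucket_selector has already dropped ids present in both indices;
    # the check on count_indices is a defensive re-check.
    return [b["key"] for b in buckets if b["count_indices"]["value"] < 2]


# Hypothetical, trimmed-down response used only to exercise the function.
sample_response = {
    "aggregations": {
        "group_by_uid": {
            "buckets": [
                {"key": "doc-a", "doc_count": 1, "count_indices": {"value": 1}},
                {"key": "doc-b", "doc_count": 1, "count_indices": {"value": 1}},
            ]
        }
    }
}
print(single_index_ids(sample_response))  # ['doc-a', 'doc-b']
```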

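If the aggregation above is hard to follow, a simplified diagram helps:

[Figure: simplified diagram of the aggregation]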

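4.3 Approach 3: Elasticsearch transform

Transforms have rarely come up in earlier articles, so here is a brief introduction.

As the English word suggests, a transform "converts" or "reshapes": it turns existing indices into summarized indices that are convenient for later analysis.

The common transform APIs are listed here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-apis.html

Step 1: Create the index

Strictly speaking this step is optional, but because we use the _id field later, an error is reported unless the index is created with an explicit mapping first.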

PUT compare
{
  "mappings": {
    "_meta": {
      "_transform": {
        "transform": "index_compare",
        "version": {
          "created": "8.2.2"
        },
        "creation_date_in_millis": 1656279927899
      },
      "created_by": "transform"
    },
    "properties": {
      "unique-id": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "auto_expand_replicas": "0-1"
    }
  },
  "aliases": {}
}

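compare is the destination we generate: the summarized index.

Attentive readers will notice that this compare index looks system-generated. That is correct: it was produced with POST _transform/_preview ... and then partly edited by hand.

Step 2: Create the transform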

PUT _transform/index_compare
{
  "source": {
    "index": [
      "kibana_sample_data_flights",
      "kibana_sample_data_flights_ext"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "compare"
  },
  "pivot": {
    "group_by": {
      "unique-id": {
        "terms": {
          "field": "_id"
        }
      }
    },
    "aggregations": {
      "compare": {
        "scripted_metric": {
          "map_script": "state.doc = new HashMap(params['_source'])",
          "combine_script": "return state",
          "reduce_script": """ 
            if (states.size() != 2) {
              return "count_mismatch"
            }
            if (states.get(0).equals(states.get(1))) {
              return "match"
            } else {
              return "mismatch"
            }
            """
        }
      }
    }
  }
}

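source: specifies the two source indices to be compared.
pivot: the hub of the transform; all the core logic lives here. It first groups by _id and then processes each group. If the number of states is not 2 (so it must be 1), the id exists in only one index, which is what we are looking for, and the script returns count_mismatch. Otherwise it returns match when the two documents are equal and mismatch when they differ.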
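
The decision logic of the reduce_script can be restated in a few lines of Python to make the three outcomes explicit. This is only a plain-Python paraphrase of the Painless script above, with a made-up function name and sample states; it does not run inside Elasticsearch.

```python
def compare_states(states):
    """Mirror the reduce_script: classify the states collected for one _id."""
    if len(states) != 2:
        # The id was seen in only one index (or an unexpected number of times).
        return "count_mismatch"
    # Both indices have the id: compare the captured documents.
    return "match" if states[0] == states[1] else "mismatch"


# Hypothetical states; each one holds the _source captured by map_script.
only_one = [{"doc": {"Carrier": "ES-Air"}}]
same = [{"doc": {"Carrier": "ES-Air"}}, {"doc": {"Carrier": "ES-Air"}}]
different = [{"doc": {"Carrier": "ES-Air"}}, {"doc": {"Carrier": "Kibana Airlines"}}]

print(compare_states(only_one))   # count_mismatch
print(compare_states(same))       # match
print(compare_states(different))  # mismatch
```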

Step 3: Start the transform

POST _transform/index_compare/_start

Step 4: Run a targeted search against the destination index that the transform generated.

POST compare/_search
{
  "track_total_hits": true,
  "size": 1000,
  "query": {
    "term": {
      "compare.keyword": {
        "value": "count_mismatch"
      }
    }
  }
}

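The hits returned are the differing values we are after:

[Screenshot: search result]

4.4 Approach 4: Third-party open-source tools

A working assumption: whatever we regard as a problem has very likely been encountered by someone before us, and quite possibly they have already published a solution or even open-sourced one. After more than 10 years in the industry this is something I feel strongly about. In one sentence: "don't reinvent the wheel unless you have to."

Open-source option 1: https://github.com/Aconex/scrutineer/

It compares index data across different data sources, for example Elasticsearch vs. Elasticsearch or Elasticsearch vs. Solr.

Open-source option 2: https://github.com/olivere/esdiff

It reports the differences between documents in different indices.

Example usage: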

$ ./esdiff -u=true -d=false 'http://localhost:19200/index01/tweet' 'http://localhost:29200/index01/_doc'
Unchanged       1
Updated 3       {*diff.Document}.Source["message"]:
        -: "Playing the piano is fun as well"
        +: "Playing the guitar is fun as well"

Created 4       {*diff.Document}:
        -: (*diff.Document)(nil)
        +: &diff.Document{ID: "4", Source: map[string]interface {}{"message": "Climbed that mountain", "user": "sandrae"}}

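5. Summary

As long as you keep thinking clearly, there are always more solutions than problems.

Can you also solve this with your own program? Of course. "index1 is complete and can serve as the reference. Walk through it in insertion-time order (every record should carry a timestamp), and look up each id from index1 in index2. If it exists, all is well; if it does not, record the id in a set. That set of ids is the target set you are after."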
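
The do-it-yourself approach quoted above might look like the following minimal sketch. It deliberately leaves the Elasticsearch client out: the function name, the exists_in_target callback, and the sample ids are assumptions, and in practice the callback would wrap an existence check against index2.

```python
from typing import Callable, Iterable, List


def find_missing_ids(
    reference_ids: Iterable[str],
    exists_in_target: Callable[[str], bool],
) -> List[str]:
    """Collect ids from the reference index that the target index lacks."""
    missing = []
    for doc_id in reference_ids:
        # Look up each reference id in the target; record it if absent.
        if not exists_in_target(doc_id):
            missing.append(doc_id)
    return missing


# Hypothetical data: index1 ids in insertion-time order, index2 as a set.
index1_ids = ["id1", "id2", "id3", "id4"]
index2_ids = {"id1", "id3"}

missing = find_missing_ids(index1_ids, index2_ids.__contains__)
print(missing)  # ['id2', 'id4']
```

Issuing one lookup per id is simple but slow on large indices, which is part of why the set-based and aggregation-based approaches earlier in the article are attractive.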

Original link: https://blog.csdn.net/laoyang360/article/details/125494511
