Franta – Občasník malého ajťáka


III. Playing with Elastic: Transforms


Yesterday I was dealing with the fact that data keeps piling up in my Elastic at an incredible rate. I ingest around 250 GB of new data per day, and that is simply unsustainable in the long run. The thing is, I don't like deleting old data, because there is always something it would have been handy for. So the idea was clear: build aggregated statistics, store them for easy browsing, and delete the original index. Still, I couldn't leave it at that and started looking into how people generally handle aggregating old logs.

And I found a wonderful thing. Elastic supports so-called transforms, which let you define a rule (a query) that aggregates data from one or more indices into a new, automatically created index holding the result.

So I figured I would try it out in practice right away.

I currently have a simple index, zonefiles_2021-12-03, in which I store processed zone files (https://czds.icann.org/home). The index has the following mapping:

{
  "zonefiles_2021-12-03" : {
    "mappings" : {
      "dynamic_templates" : [ ],
      "properties" : {
        "domain" : {
          "type" : "keyword"
        },
        "host" : {
          "type" : "keyword"
        },
        "ns" : {
          "type" : "keyword"
        }
      }
    }
  }
}

The fields are named a bit clumsily, but domain holds the domain, ns holds its nameserver, and host holds the domain name of that nameserver. For example:

domain: franta.cz
ns: ns.gransy.com
host: gransy.com

Thanks to the data in this index I can find out not only how many domains there are, but also which domains use nameservers under the domain "gransy.com". I don't have to look up ns.gransy.com, ns2.gransy.com and so on one by one; I get a consolidated overview.

The main problem is performance when I want to pull statistics by number of domains. There are, after all, a few hundred million domains in the DB, and counting how often each nameserver domain occurs across all of them is an unnecessarily expensive operation. So it would be handy to run the aggregation once and store its result in a separate index:
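
The post does not say how host is derived from ns. As a rough sketch only, assuming the nameserver's domain is simply the last two labels of its hostname (an assumption that breaks for suffixes such as co.uk, where a public suffix list would be needed), the three fields could be built like this in Python. The helper names ns_to_host and make_record are hypothetical:

```python
def ns_to_host(ns: str) -> str:
    """Naively reduce a nameserver hostname to its last two labels.

    Assumption: the registrable domain is always two labels long.
    """
    labels = ns.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def make_record(domain: str, ns: str) -> dict:
    """Build one document in the shape of the mapping above."""
    return {"domain": domain, "ns": ns, "host": ns_to_host(ns)}


if __name__ == "__main__":
    # Both nameservers collapse to the same host value, gransy.com.
    print(make_record("franta.cz", "ns.gransy.com"))
    print(make_record("franta.cz", "ns2.gransy.com"))
```

Grouping on host rather than ns is what lets a single terms bucket cover every nameserver of one provider.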

{
  "aggs": {
    "domains": {
      "cardinality": {
        "field": "domain"
      }
    },
    "hosts": {
      "terms": {
        "field": "host",
        "size": 10,
        "order": {
          "total": "desc"
        }
      },
      "aggs": {
        "total": {
          "cardinality": {
            "field": "domain"
          }
        }
      }
    }
  },
  "size": 10
}

Result:

{
  "aggregations": {
    "hosts": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 41058241,
      "buckets": [
        {
          "key": "d***l.com",
          "doc_count": 10485087,
          "total": {
            "value": 5326191
          }
        },
        {
          "key": "c***e.com",
          "doc_count": 5500593,
          "total": {
            "value": 2756430
          }
        },
        {
          "key": "r***s.com",
          "doc_count": 3888264,
          "total": {
            "value": 1894877
          }
        },
        {
          "key": "h***a.com",
          "doc_count": 2984809,
          "total": {
            "value": 1486940
          }
        },
        {
          "key": "g***s.com",
          "doc_count": 3326168,
          "total": {
            "value": 843891
          }
        },
        {
          "key": "a***s.com",
          "doc_count": 1483922,
          "total": {
            "value": 742449
          }
        },
        {
          "key": "d***l.com",
          "doc_count": 1702775,
          "total": {
            "value": 569271
          }
        },
        {
          "key": "m***n.net",
          "doc_count": 1145293,
          "total": {
            "value": 413353
          }
        },
        {
          "key": "d***d.net",
          "doc_count": 779078,
          "total": {
            "value": 390456
          }
        },
        {
          "key": "d***s.com",
          "doc_count": 758784,
          "total": {
            "value": 381989
          }
        }
      ]
    },
    "domains": {
      "value": 30100694
    }
  }
}
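
To make the shape of this response easier to work with, here is a small Python sketch that flattens the hosts buckets into (host, unique domains) pairs and computes each host's share of the overall domains cardinality. The function names are hypothetical, and the sample only repeats two of the buckets shown above:

```python
def top_hosts(response: dict) -> list[tuple[str, int]]:
    """Return (nameserver domain, unique domain count) pairs in bucket order."""
    buckets = response["aggregations"]["hosts"]["buckets"]
    return [(b["key"], b["total"]["value"]) for b in buckets]


def share_of_all_domains(response: dict, host: str) -> float:
    """Fraction of the top-level 'domains' cardinality served by one host."""
    total = response["aggregations"]["domains"]["value"]
    counts = dict(top_hosts(response))
    return counts[host] / total


if __name__ == "__main__":
    sample = {
        "aggregations": {
            "hosts": {
                "buckets": [
                    {"key": "d***l.com", "doc_count": 10485087, "total": {"value": 5326191}},
                    {"key": "c***e.com", "doc_count": 5500593, "total": {"value": 2756430}},
                ]
            },
            "domains": {"value": 30100694},
        }
    }
    for host, count in top_hosts(sample):
        print(host, count, f"{share_of_all_domains(sample, host):.2%}")
```

Keep in mind that both numbers come from cardinality aggregations, which Elasticsearch computes approximately, so the shares are estimates.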

For any more serious work with the statistics, for example on a website, this is completely unusable, because with a large number of records (hundreds of millions) the aggregation takes terribly long. So you would have to cache it in some local .json file, which in turn raises a different problem: how many records do you cache? 10? 100? A million? The catch is that I can't even get more than a million, because my current ES configuration doesn't allow it. And what if there are more than a million of those domains and we still want to work with them in some reasonable way?

So I read up on transforms and came up with this:

PUT _transform/host_domains
{
  "source": {
    "index": "zonefiles_2021-12-03"
  },
  "dest": {
    "index": "zonefiles_2021-12-03_agg"
  },
  "pivot": {
    "group_by": {
      "domain": {
        "terms": {
          "field": "host"
        }
      }
    },
    "aggregations": {
      "domains": {
        "cardinality": {
          "field": "domain"
        }
      }
    }
  }
}

The example above creates a transform that aggregates data from the zonefiles_2021-12-03 index by "host", which it stores as domain, and for each value computes the cardinality of the "domain" field (the number of unique values, which Elasticsearch estimates approximately). In plain language: it builds a list of nameserver domains, each with the number of unique domains it serves as a nameserver, and stores the whole thing in the new index zonefiles_2021-12-03_agg.

The advantage of transforms is that they don't have to be one-off: once defined like this, they can also be run periodically, for example. In our test case once is enough, so we start the transform:
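
To illustrate what this pivot computes, here is a minimal in-memory Python sketch of the same grouping. It is not how Elasticsearch implements transforms: it counts exactly, whereas the cardinality aggregation is approximate, and the sample documents (apart from franta.cz and gransy.com) are made up for illustration:

```python
from collections import defaultdict


def pivot_by_host(docs: list[dict]) -> list[dict]:
    """Group documents by 'host' and count the unique 'domain' values per group.

    The output mirrors the destination documents: the group key is stored
    under 'domain' and the unique count under 'domains'.
    """
    domains_per_host: dict[str, set[str]] = defaultdict(set)
    for doc in docs:
        domains_per_host[doc["host"]].add(doc["domain"])
    return [
        {"domain": host, "domains": len(domains)}
        for host, domains in sorted(domains_per_host.items())
    ]


if __name__ == "__main__":
    docs = [
        {"domain": "franta.cz", "ns": "ns.gransy.com", "host": "gransy.com"},
        {"domain": "franta.cz", "ns": "ns2.gransy.com", "host": "gransy.com"},
        {"domain": "another.cz", "ns": "ns.gransy.com", "host": "gransy.com"},
        {"domain": "example.org", "ns": "ns1.example.net", "host": "example.net"},
    ]
    # franta.cz appears twice under gransy.com but is counted once.
    print(pivot_by_host(docs))
```

The source has one document per (domain, nameserver) pair, so a plain document count per host would overstate the number of domains; counting unique domain values avoids that.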

Req: POST _transform/host_domains/_start

Resp:
{
  "acknowledged" : true
}

We can then check its status:

Req: GET _transform/host_domains/_stats

Resp:
{
  "count" : 1,
  "transforms" : [
    {
      "id" : "host_domains",
      "state" : "started",
      "node" : {
        "id" : "zjhd101QTZStmA0TO4y9xw",
        "name" : "es_data_18",
        "ephemeral_id" : "mvttyZQnTZOorVzItNpB7Q",
        "transport_address" : "10.0.100.30:9300",
        "attributes" : { }
      },
      "stats" : {
        "pages_processed" : 577,
        "documents_processed" : 73113014,
        "documents_indexed" : 287774,
        "documents_deleted" : 0,
        "trigger_count" : 1,
        "index_time_in_ms" : 75412,
        "index_total" : 576,
        "index_failures" : 0,
        "search_time_in_ms" : 340420,
        "search_total" : 577,
        "search_failures" : 0,
        "processing_time_in_ms" : 1015,
        "processing_total" : 577,
        "delete_time_in_ms" : 0,
        "exponential_avg_checkpoint_duration_ms" : 420998.0,
        "exponential_avg_documents_indexed" : 287774.0,
        "exponential_avg_documents_processed" : 7.3113014E7
      },
      "checkpointing" : {
        "last" : {
          "checkpoint" : 1,
          "timestamp_millis" : 1638699836658
        }
      }
    }
  ]
}
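
The stats response is fairly verbose, so a small Python sketch like the following can boil it down to a few readable numbers. The function name summarize_transform is hypothetical, and the field selection is just one possible choice; the input is the parsed JSON of the _stats response shown above:

```python
from datetime import datetime, timezone


def summarize_transform(stats_response: dict) -> dict:
    """Pick a few headline numbers out of a GET _transform/<id>/_stats response."""
    transform = stats_response["transforms"][0]
    stats = transform["stats"]
    last = transform["checkpointing"]["last"]
    indexed = stats["documents_indexed"]
    return {
        "id": transform["id"],
        "state": transform["state"],
        # How many source documents were folded into one output document on average.
        "source_docs_per_output_doc": (
            stats["documents_processed"] / indexed if indexed else None
        ),
        "search_seconds": stats["search_time_in_ms"] / 1000,
        "index_seconds": stats["index_time_in_ms"] / 1000,
        "last_checkpoint_utc": datetime.fromtimestamp(
            last["timestamp_millis"] / 1000, tz=timezone.utc
        ).isoformat(timespec="seconds"),
    }


if __name__ == "__main__":
    response = {
        "count": 1,
        "transforms": [
            {
                "id": "host_domains",
                "state": "started",
                "stats": {
                    "documents_processed": 73113014,
                    "documents_indexed": 287774,
                    "index_time_in_ms": 75412,
                    "search_time_in_ms": 340420,
                },
                "checkpointing": {
                    "last": {"checkpoint": 1, "timestamp_millis": 1638699836658}
                },
            }
        ],
    }
    print(summarize_transform(response))
```

For the numbers above, roughly 254 source documents end up in each aggregated document, and most of the time is spent searching rather than indexing.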

The process can be stopped at any time with:

Req: POST _transform/host_domains/_stop

Resp:
{
  "acknowledged" : true
}

Once the transform finishes, we can look at the result:

GET /zonefiles_2021-12-03_agg/_search?rest_total_hits_as_int
{
  "sort": [
    {
      "domains": {
        "order": "desc"
      }
    }
  ]
}

Elastic responds instantly, as well as it can, and we get the result:

{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 287774,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "ZEzpbk8gJWupCglprZfdwK0AAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "d***l.com",
          "domains" : 5326191
        },
        "sort" : [
          5326191
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "Y_Wvdivf0zUDS6WV7O0FhDQAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "c***e.com",
          "domains" : 2756430
        },
        "sort" : [
          2756430
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "cg8PSRlmHfKnDIMQ-1fuLmgAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "r***s.com",
          "domains" : 1894877
        },
        "sort" : [
          1894877
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "aPhB5em17MA9s6v_Ax8gCisAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "h***a.com",
          "domains" : 1486940
        },
        "sort" : [
          1486940
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "Z_bBM3-nsPjwSuRAmRc757IAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "g***s.com",
          "domains" : 843891
        },
        "sort" : [
          843891
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "YQR9xR8ClI6nlecIopa03ZkAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "a***s.com",
          "domains" : 742449
        },
        "sort" : [
          742449
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "ZJxtUi9WboU45S_FbZA4SM0AAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "d***l.com",
          "domains" : 569271
        },
        "sort" : [
          569271
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "dWcvAbo0YmsNCUzVFbaqshMAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "u***s.com",
          "domains" : 431739
        },
        "sort" : [
          431739
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "dUbo0VYoMjnWt2JUmENSYNoAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "u***s.org",
          "domains" : 431658
        },
        "sort" : [
          431658
        ]
      },
      {
        "_index" : "zonefiles_2021-12-03_agg",
        "_type" : "_doc",
        "_id" : "dZxn3zUnGqxdXcDqsVrbLWwAAAAAAAAA",
        "_score" : null,
        "_source" : {
          "domain" : "u***s.biz",
          "domains" : 431572
        },
        "sort" : [
          431572
        ]
      }
    ]
  }
}
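
Since the aggregated index holds 287774 documents, reading all of it takes more than one request. One possible way to page through it, sketched here in Python, is the search_after parameter of the search API: each request repeats the sort and passes the sort values of the last hit from the previous page. The function name next_page_body, the page size, and the tiebreaker sort on the domain field are assumptions of this sketch (it assumes domain is mapped as keyword in the destination index):

```python
from typing import Optional


def next_page_body(previous_response: Optional[dict] = None,
                   page_size: int = 1000) -> Optional[dict]:
    """Build the search body for the next page, or None when there is nothing left.

    Sorts by 'domains' descending and uses 'domain' as a tiebreaker so that
    hosts with equal counts keep a stable order across pages.
    """
    body = {
        "size": page_size,
        "sort": [
            {"domains": {"order": "desc"}},
            {"domain": {"order": "asc"}},
        ],
    }
    if previous_response is not None:
        hits = previous_response["hits"]["hits"]
        if not hits:
            return None  # the previous page was empty, so we are done
        body["search_after"] = hits[-1]["sort"]
    return body


if __name__ == "__main__":
    first = next_page_body()
    print(first)
    # A made-up previous page with two hits, shaped like the response above.
    page = {"hits": {"hits": [
        {"sort": [5326191, "d***l.com"]},
        {"sort": [2756430, "c***e.com"]},
    ]}}
    print(next_page_body(page))
```

Each returned body would be sent to GET /zonefiles_2021-12-03_agg/_search with whatever client is at hand, and the loop ends when the function returns None.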
