Bash vs. Go: processing a 23 GB TXT => JSON

As part of loading zone files from CZDS into Elasticsearch, I also need to convert the .com zone file to JSON for import into ES. The problem is that it is 4.5 GB compressed and around 23 GB unpacked.
The file format is fixed. It is a gzipped text file:
0-10k.com. 172800 in ns ns1.dan.com.
0-10k.com. 172800 in ns ns2.dan.com.
0-10kamonth.com. 172800 in ns ns1.dnsimple.com.
0-10kamonth.com. 172800 in ns ns2.dnsimple.com.
0-10kamonth.com. 172800 in ns ns3.dnsimple.com.
0-10kamonth.com. 172800 in ns ns4.dnsimple.com.
0-10kmonth.com. 172800 in ns ns-cloud-c1.googledomains.com.
0-10kmonth.com. 172800 in ns ns-cloud-c2.googledomains.com.
0-10kmonth.com. 172800 in ns ns-cloud-c3.googledomains.com.
0-10kmonth.com. 172800 in ns ns-cloud-c4.googledomains.com.
0-10kmonth.com. 86400 in ds 57434 8 2 F582629A1BED36EA95175C1E3AA5BF731284E79C5CFE17286962AEF8752B13CE
0-10kmonth.com. 86400 in rrsig ds 8 2 86400 20211208054935 20211201043935 15549 com. B55CcMMqjyMdfs36sBboPstHTfrR53SGuQNaXKNsld/KdTMWfJvc4rGlU7j0l6k5/oaSJGitDbswNdpRJ07T3q6gyllvQ/RYqHCwr4Q40JsNSHFt4P19yCqolqSb6Ek74rtl0wSV6AUOkVtL8QrtY1nkg0PxXM/FRlz04W1L6oANPjcXOhNiRv1rbT+OkRhcn73/Ri84u+FN 9nKleiILVA==
0-10s.com. 172800 in ns ns1.tophostcloud.com.
0-10s.com. 172800 in ns ns2.tophostcloud.com.
0-10v.com. 172800 in ns ns31.domaincontrol.com.
0-10v.com. 172800 in ns ns32.domaincontrol.com.
0-10vdimmer.com. 172800 in ns ns03.domaincontrol.com.
0-10vdimmer.com. 172800 in ns ns04.domaincontrol.com.
0-11-0.com. 172800 in ns ns6381.hostgator.com.
And here is what has to be done to get to the desired state:
- unpack the file
- select only the lines that contain ns records (soa, rrsig, nsec, a, etc. are of no interest to us)
- keep only the domain and the nameserver (everything in between is of no interest)
- remove the trailing dot from both the domain and the nameserver
- determine the main (registrable) domain of the nameserver
- save the whole thing as JSON
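The first four steps can be sketched in Go using only the standard library. This is a minimal sketch, not the code used in this post: the names nsRecord, parseNS, readNS and sampleGzip are made up for illustration, and the sample data is a tiny in-memory stand-in for com.txt.gz. The registrable-domain step is left out here because it needs the public suffix list.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// nsRecord holds the two fields kept from an NS line (hypothetical type).
type nsRecord struct {
	Domain string
	NS     string
}

// parseNS returns the domain and nameserver of an NS record line with the
// trailing dots removed. ok is false for any other record type and for
// lines with fewer than five fields.
func parseNS(line string) (rec nsRecord, ok bool) {
	f := strings.Fields(strings.ToLower(line)) // splits on tabs and spaces
	if len(f) < 5 || f[2] != "in" || f[3] != "ns" {
		return nsRecord{}, false
	}
	return nsRecord{
		Domain: strings.TrimSuffix(f[0], "."),
		NS:     strings.TrimSuffix(f[4], "."),
	}, true
}

// readNS decompresses a gzip stream line by line and passes every NS record
// to emit, so the unpacked file never has to exist on disk.
func readNS(r io.Reader, emit func(nsRecord)) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer zr.Close()

	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		if rec, ok := parseNS(sc.Text()); ok {
			emit(rec)
		}
	}
	return sc.Err()
}

// sampleGzip builds a small gzip-compressed zone file in memory (demo data).
func sampleGzip() *bytes.Buffer {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	io.WriteString(zw, "0-10k.com.\t172800\tin\tns\tns1.dan.com.\n"+
		"abogado.\t172800\tin\tsoa\tdns1.nic.abogado. hostmaster.nominet.org.uk.\n"+
		"0-10s.com.\t172800\tin\tns\tns1.tophostcloud.com.\n")
	zw.Close()
	return &buf
}

func main() {
	err := readNS(sampleGzip(), func(r nsRecord) {
		fmt.Println(r.Domain, r.NS)
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

For the real file, readNS would be given an *os.File opened on com.txt.gz, and emit would do the registrable-domain lookup and write the JSON line.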
First I tried writing it in bash using standard commands plus the excellent jq utility for working with JSON:
cat com.txt.gz | gunzip | tr "\t" " " | grep "in ns " | sed "s/\. / /" | sed "s/\.$//" | awk {'cmd="getreg $5"; printf "{\"domain\":\""$1"\",\"ns\":\""$5"\",\"host\":\""; system("getreg " $5 "| tr -d \"\n\""); printf "\"}\n"'} > done/$1.json
Let's say it is not an ideal script, but it is the best I managed to come up with in terms of processing speed.
Unfortunately, it is incredibly slow 🙁 The main cause is the simple C application "getreg", which determines the domain according to the publicSuffix list and which awk launches via system() for every ns line. After 12 hours I had processed roughly 1 GB of data 🙁
Today I rewrote the application in Go, where I can additionally run the processing in many threads. The code is again not ideal, but it served its purpose:
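package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"sync"
	"time"

	"golang.org/x/net/publicsuffix"
)

var (
	lc   = int64(0) // lines still to be processed
	last = int64(0) // value of lc at the previous progress report
	wg   sync.WaitGroup
	sem  = make(chan struct{}, 500) // limits the number of concurrent goroutines
)

// writeRemain logs the number of remaining lines and a rough ETA every 2 seconds.
// Note: last is refreshed right before the next call, so div stays near zero
// and the fixed estimate of 80000 lines per second is what gets used.
// lc is read here without synchronization, so the figures are approximate.
func writeRemain() {
	div := (last - lc) / 2
	eta := float64(0)
	if div > 20000 {
		eta = float64(lc) / float64(div)
	} else {
		eta = float64(lc) / float64(80000)
	}
	log.Printf("Zbyva %v | ETA: %v seconds\n", lc, eta) // "Zbyva" = "remaining"
	time.Sleep(2 * time.Second)
	last = lc
	go writeRemain()
}

// lineCount counts the lines of f into the global lc.
func lineCount(f *os.File) (int64, error) {
	s := bufio.NewScanner(f)
	for s.Scan() {
		lc++
	}
	return lc, s.Err()
}

// process prints one zone file line as JSON if it is an NS record.
// It expects at least five fields per line, which holds for CZDS exports.
func process(s string) {
	defer wg.Done()
	defer func() { <-sem }() // release the semaphore slot
	radek := strings.Split(strings.ToLower(strings.Replace(s, "\t", " ", -1)), " ")
	if radek[3] == "ns" {
		domain := radek[0][:len(radek[0])-1]     // strip the trailing dot
		nameserver := radek[4][:len(radek[4])-1] // strip the trailing dot
		nsdomain, _ := publicsuffix.EffectiveTLDPlusOne(nameserver)
		fmt.Printf("{\"domain\":\"%s\",\"ns\":\"%s\",\"host\":\"%s\"}\n", domain, nameserver, nsdomain)
	}
}

// generateStats starts one goroutine per line, at most 500 at a time.
func generateStats(f *os.File, filename string) {
	s := bufio.NewScanner(f)
	for s.Scan() {
		wg.Add(1)
		sem <- struct{}{}
		go process(s.Text())
		lc--
	}
	wg.Wait()
	close(sem)
}

func main() {
	filename := os.Args[1]

	// First pass: count the lines for the progress report.
	f, err := os.Open(filename)
	if err != nil {
		log.Println(err)
		return
	}
	defer f.Close()
	lc, err := lineCount(f) // local copy of the global counter
	if err != nil {
		log.Println(err)
		return
	}
	log.Println(filename+" line count:", lc)
	go writeRemain()

	// Second pass: reopen the file and process it.
	f, err = os.Open(filename)
	if err != nil {
		log.Println(err)
		return
	}
	defer f.Close()
	generateStats(f, filename)
}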
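The listing above starts one goroutine per line and caps them with a 500-slot semaphore. A common alternative is a fixed pool of workers fed through a channel. The following is only a sketch of that pattern, not a drop-in replacement: runPool and work are hypothetical names, work is a placeholder for the real per-line processing, and workers is assumed to be at least 1. Results arrive in no particular order.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// work is a placeholder for the per-line processing (hypothetical).
func work(line string) string {
	return strings.ToLower(line)
}

// runPool feeds lines to a fixed number of workers and passes every result
// to emit. emit is only ever called from the calling goroutine, so it needs
// no locking of its own.
func runPool(lines []string, workers int, emit func(string)) {
	in := make(chan string, workers)
	out := make(chan string, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for l := range in {
				out <- work(l)
			}
		}()
	}

	// Producer: send all lines, then close out once the workers are done.
	go func() {
		for _, l := range lines {
			in <- l
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	for r := range out {
		emit(r)
	}
}

func main() {
	n := 0
	runPool([]string{"A.COM.", "B.COM.", "C.COM."}, 2, func(string) { n++ })
	fmt.Println("processed", n, "lines")
}
```

Because all output goes through a single emit call site, a progress counter can live there as well and does not have to be shared between goroutines.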
Log output after starting it:
2021/12/05 13:49:33 com.txt line count: 397682056
2021/12/05 13:49:33 Zbyva 397681921 | ETA: 4971.0240125 seconds
2021/12/05 13:49:35 Zbyva 397387870 | ETA: 4967.348375 seconds
2021/12/05 13:49:37 Zbyva 397100853 | ETA: 4963.7606625 seconds
2021/12/05 13:49:39 Zbyva 396666761 | ETA: 4958.3345125 seconds
2021/12/05 13:49:41 Zbyva 396400104 | ETA: 4955.0013 seconds
2021/12/05 13:49:43 Zbyva 396122213 | ETA: 4951.5276625 seconds
2021/12/05 13:49:45 Zbyva 395847359 | ETA: 4948.0919875 seconds
2021/12/05 13:49:47 Zbyva 395554344 | ETA: 4944.4293 seconds
2021/12/05 13:49:49 Zbyva 395277705 | ETA: 4940.9713125 seconds
2021/12/05 13:49:51 Zbyva 395023235 | ETA: 4937.7904375 seconds
2021/12/05 13:49:53 Zbyva 394705271 | ETA: 4933.8158875 seconds
2021/12/05 13:49:55 Zbyva 394269766 | ETA: 4928.372075 seconds
2021/12/05 13:49:57 Zbyva 393988374 | ETA: 4924.854675 seconds
2021/12/05 13:49:59 Zbyva 393682895 | ETA: 4921.0361875 seconds
2021/12/05 13:50:01 Zbyva 393375297 | ETA: 4917.1912125 seconds
So now I should have it processed in about 5000 s, i.e. just under 1.5 hours. But if you look closely, the time between log lines is exactly 2 s while the ETA drops by about 4 seconds. So the real time could be about half of that.
Of course it can be done better if one has the knowledge. And more cleanly too, for example by not writing the JSON out by hand as a string but generating it properly via encoding/json … and so on 🙂
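As an illustration of the encoding/json route, here is a minimal sketch. The struct name record and the helper encode are assumptions made for this example; the JSON keys are the ones the Printf in the program above produces.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record mirrors the three JSON keys written by the program (hypothetical type).
type record struct {
	Domain string `json:"domain"`
	NS     string `json:"ns"`
	Host   string `json:"host"`
}

// encode returns one JSON line for r. json.Marshal takes care of escaping,
// which a hand-built format string does not.
func encode(r record) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	line, err := encode(record{Domain: "0-10k.com", NS: "ns1.dan.com", Host: "dan.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line) // {"domain":"0-10k.com","ns":"ns1.dan.com","host":"dan.com"}
}
```

For bulk output, a json.Encoder wrapped around a buffered writer would avoid building an intermediate string for every line.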
Nevertheless, the point is that it does exactly what I want and I spent just a few minutes on it 🙂
someone from Slack -
in CZDS, is the BIND header/footer stripped from the data in the export?
because yours won't take the output of `dig @zonedata.iis.se se AXFR >zoneFile.se`
line 46 (the `radek[3] == "ns"` check) has to be changed to `if (len(line) >= 4) && (line[3] == "ns") {`
then it takes a complete export as well
ps: yeah, putting .json together like this is omfg :D
gransy -
Yep, in CZDS the first line is the SOA, all on one line:
abogado. 172800 in soa dns1.nic.abogado. hostmaster.nominet.org.uk. 2100004112 900 300 2419200 3600
So it takes it just fine. So far I have only done this for the CZDS data, I haven't dealt with ccTLD zones yet 🙂