Big Data Processing
    Tomáš Majer

    MONOGRAM Tech. Monday 25.7.2011


Monday, July 25, 11
What is big data?



        Tomaj's definition

              Data that does not fit on a single machine, or that cannot be
               processed in real time on a single machine




Why does it matter?

        There is more and more data

        Web 2.0 - the social side of the web generates a huge amount of
         usable data

        A simple example: Facebook

              135 billion messages per month

              20 billion events per day - about 200,000 per second


Facebook growth

        [Chart: new data per day (GB); vertical axis from 0 to 4,000; time points March 2008, April 2009, October 2009]


How to store big data



        SQL databases suffer from a fundamental scalability problem



        NoSQL - easy to scale - well suited to big data




NoSQL

        Several types

              document oriented, column oriented, graph oriented, key-value

        High performance

        Limited capabilities compared with SQL databases

        No standard for working with the data

        In practice, combining NoSQL with SQL has proven effective


Google
    MapReduce
    In 2004 Google published the paper
    "MapReduce: Simplified Data
    Processing on Large Clusters"




Goals of MapReduce

        Distribute a computation across
         multiple machines (nodes)

        A simple framework that makes
         such code easy to write

        Horizontal scalability




So how does MapReduce work?




    There are multiple nodes, and each can perform several kinds of tasks

        Two basic tasks

              Map job

                     input: a pair <key1, value1>

                     output: a list of pairs <key2, value2>

              Reduce job

                     input: a pair <key2, <list of values emitted by mappers under key2>>

                     output: a list of pairs <key3, value3>
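The data flow on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not how Hadoop or Google's implementation works internally. The names `run_mapreduce`, `max_mapper`, and `max_reducer`, as well as the sample readings, are hypothetical and exist only for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[Any, Any]


def run_mapreduce(
    inputs: Iterable[Pair],
    mapper: Callable[[Any, Any], Iterator[Pair]],
    reducer: Callable[[Any, List[Any]], Iterator[Pair]],
) -> List[Pair]:
    """Run map, shuffle, and reduce in one process (illustration only)."""
    intermediate = defaultdict(list)
    # Map phase: each <key1, value1> yields zero or more <key2, value2>.
    for key1, value1 in inputs:
        for key2, value2 in mapper(key1, value1):
            # Shuffle: collect every value emitted under the same key2.
            intermediate[key2].append(value2)
    # Reduce phase: each <key2, [value2, ...]> yields <key3, value3> pairs.
    output = []
    for key2 in sorted(intermediate):  # assumes the keys are comparable
        output.extend(reducer(key2, intermediate[key2]))
    return output


def max_mapper(file_name, line):
    # Hypothetical input format: "city,reading"
    city, reading = line.split(",")
    yield city, float(reading)


def max_reducer(city, readings):
    yield city, max(readings)


if __name__ == "__main__":
    records = [
        ("a.csv", "Bratislava,21.5"),
        ("a.csv", "Kosice,19.0"),
        ("b.csv", "Bratislava,24.0"),
    ]
    print(run_mapreduce(records, max_mapper, max_reducer))
```

In a real cluster the three phases run on different nodes and the shuffle moves data over the network; here a dictionary stands in for that step.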




Simple example - word count
        void map(String name, String document):
          // name: document name
          // document: document contents
          for each word w in document:
            EmitIntermediate(w, "1");

        void reduce(String word, Iterator partialCounts):
          // word: a word
          // partialCounts: a list of aggregated partial counts
          int sum = 0;
          for each pc in partialCounts:
            sum += ParseInt(pc);
          Emit(word, AsString(sum));
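For readers who want to run the pseudocode, here is a rough Python equivalent. It keeps the string-valued counts of the original and assumes that words are separated by whitespace. The function names `map_words`, `reduce_counts`, and `word_count` are made up for this sketch, and the grouping step is done in memory rather than by a framework.

```python
from collections import defaultdict


def map_words(name, document):
    # name: document name (unused here); document: document contents
    for word in document.split():
        yield word, "1"


def reduce_counts(word, partial_counts):
    # partial_counts: the "1" strings emitted for this word
    total = sum(int(pc) for pc in partial_counts)
    yield word, str(total)


def word_count(documents):
    """Apply map, group by word, then reduce; documents maps name -> text."""
    grouped = defaultdict(list)
    for name, document in documents.items():
        for word, one in map_words(name, document):
            grouped[word].append(one)
    result = {}
    for word, counts in grouped.items():
        for key, value in reduce_counts(word, counts):
            result[key] = value
    return result


if __name__ == "__main__":
    docs = {"d1": "big data is big", "d2": "data is data"}
    print(word_count(docs))
```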



Simple MapReduce examples

        Distributed grep

        Counting visits per URL

              mapper <URL, 1>

              reducer <URL, total number of visits>

        Page link graph

              mapper <target, source>

              reducer <target, list of sources>
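The three examples can be written as small mapper and reducer pairs. The sketch below is one possible reading of the slide, run through a tiny in-memory helper. All function names and the sample inputs are assumptions made for illustration. For grep the reducer simply passes matches through.

```python
import re
from collections import defaultdict


def run(records, mapper, reducer):
    """Tiny in-memory stand-in for the framework: map, group, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return [pair for k in sorted(groups) for pair in reducer(k, groups[k])]


def grep_mapper(pattern):
    regex = re.compile(pattern)

    def mapper(file_name, line):
        # Emit the line only if it matches the pattern.
        if regex.search(line):
            yield file_name, line

    return mapper


def identity_reducer(key, values):
    for value in values:
        yield key, value


def visit_mapper(log_name, url):
    yield url, 1  # <URL, 1>


def visit_reducer(url, ones):
    yield url, sum(ones)  # <URL, total number of visits>


def link_mapper(source, targets):
    for target in targets:
        yield target, source  # <target, source>


def link_reducer(target, sources):
    yield target, sorted(sources)  # <target, list of sources>


if __name__ == "__main__":
    lines = [("f.txt", "error: disk"), ("f.txt", "ok"), ("g.txt", "error: net")]
    print(run(lines, grep_mapper("error"), identity_reducer))
    print(run([("log", "/a"), ("log", "/b"), ("log", "/a")], visit_mapper, visit_reducer))
    print(run([("p1", ["p2", "p3"]), ("p2", ["p3"])], link_mapper, link_reducer))
```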

My experience with MapReduce?

Master's thesis


        Working with a Twitter dataset

              a text file of almost 30 GB

              plus several CSV files of a few hundred megabytes each

              implementation of several Mappers and Reducers that compute
               page ratings from microblog tweets




Apache HADOOP

    Open source MapReduce framework

        Written in Java

        Also supports other languages

        Today it is used by almost every major
         IT player except Google:

              Facebook, Twitter, LinkedIn, Adobe, Amazon, Apple,
               eBay, Hulu, IBM, Last.fm, Yahoo and a great many
               others
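One common route to other languages is Hadoop Streaming, which runs any executable as the mapper or reducer and exchanges records as lines of text, with key and value separated by a tab. The sketch below shows a word count written in that style and simulates the framework's sort step locally with `sorted`. The function names and the sample text are illustrative assumptions, and no Hadoop command line is shown.

```python
from itertools import groupby


def mapper(lines):
    """Turn input lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records that share a key.

    Relies on the input being sorted by key, which the framework
    guarantees between the map and reduce steps.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    text = ["big data is big", "data is data"]
    # Locally, sorted() plays the role of the shuffle and sort phase.
    for record in reducer(sorted(mapper(text))):
        print(record)
```

In an actual streaming job the two functions would live in separate scripts that read from standard input and write to standard output.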

Hadoop includes a whole ecosystem


HDFS


        Based on GFS - Google File
         System

        Distributed file system

        Decides what is stored where, and
         in how many copies

        Virtual file system
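To make "what is stored where, and in how many copies" concrete, here is a toy model: a file is cut into fixed-size blocks and each block is assigned to several distinct nodes. The round-robin rule, the function name `place_blocks`, and the node names are assumptions for illustration only. The real HDFS placement policy is more involved (for example, it is rack-aware).

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` consecutive nodes, wrapping around the list,
        # so the copies of one block always land on distinct nodes.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    layout = place_blocks(
        file_size=300, block_size=128, nodes=["n1", "n2", "n3", "n4"], replication=3
    )
    for block, holders in layout.items():
        print(block, holders)
```

With three copies of every block, losing any single node still leaves two copies of each block available.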



HIVE


        SQL over NoSQL data

              files, SQL databases...

        Supports SELECT, JOIN,
         GROUP BY...

        Developed by Facebook




HBase


        Column-oriented NoSQL
         database

        Based on Google BigTable

        In my opinion, probably the most
         enterprise-grade NoSQL store




Others

        Mahout - a library of MapReduce jobs for machine learning

        Pig - pig stuff ;-) its own language for easy work with data

        Chukwa - log collector

        ZooKeeper - holds everything together ;-) handles locking,
         synchronization, etc.

        Avro - serializer


Conclusion



        There is a lot of data - distributed computation is a necessity

        Storing the data is problematic - NoSQL

        Hadoop - a MapReduce-based framework with a whole ecosystem
         for distributed computation



