???        @yonghosee


'???' ?? ???. kth ??????? ??????? ??? ???
????. '??? ?? ???'??? ???? ???, 'iLock'?? ?? ?
? ?? ??? ???, ??? ??? ????? ??? ????. ???
?, ??????? ??? ????. ??? ??? ??? ?? ????
?. Data Scientist? ?????.




????? ???? ?? ???.
???? ?? ??? ?? ?? ??? ???? ?? ?? ????? ?
?? ?????. ??? ??? ??? ??? ?? ??? ???? ??
??, ????? ???? ? ???? ???? ? ???? ???? ?
????. ????? ????, ?? ?? ???? ??????,
Hadoop? ??? ????? ???? ?? ???? ????? ???
??.
?????
????
?? ???!
????Lab I ???
? ??
 ?   ?

         3
?? : ??? ??? ???
?? : ??? ??? ???(??? ??)
??? ??   ?????
         ????? ???
         ????? ???
         ????? ???.
2011/11/21/???
???? ??? ??




                 ?=_=?
??!
??!
?? ??!
??? ?????!!
2011 H3 ????-????? ???? ?? ???
??? ??? ????
??? ???? ????
??? ??? ????
?? ???
???!
???

?? ?? ????
  ???
??? !
   ??
?? ???
??? ?? ???.
     ????? ???^^;;
??? ?????.
?? ??
        ????
???      &
        ????
MULTI CORE
MANY MACHINES
CLOUD
???
???????? ?????.
to LEARN
Hard   to WRITE
       to RUN
???
????? ??? ???.

???? ?? ??? ?? ??.

????? ???? ???? ?? ??.

??? ?? ??? ?? ???.
??? ??? ?? ?? ??.
???
???
Md5 ???? ??? ????

  ?? ?? ???.
??? ???? ????
                    import hashlib
                    hashlib.md5(b"python is so powerful").hexdigest()



                    import gzip


  ?? ?? ???.
                    f = gzip.open('example.txt.gz', 'wb')
                    f.write(b'Contents of the example go here.\n')
                    f.close()


????? ???? ????
                    import urllib2

  ?? ?? ???.
                    response = urllib2.urlopen('http://google.com/')
                    data = response.read()
                    print data


REST ???? ??? ????
                    from flask import Flask
                    app = Flask(__name__)

  ?? ?? ???.
                    @app.route("/", methods=["GET"])
                    def hello():
                        return "Hello World!"

                    if __name__ == "__main__":
                        app.run()
???
               JAVA




               PYTHON
               Perl, Ruby




 ??? ??(???)
???
????? ??? ???.
             ?? ?? ???
???? ?? ??? ?? ??.
   ?? ??? ? ?????? ???
????? ???? ???? ?? ??.
      ??? ?? ?? ?????.
??? ?? ??? ?? ???.
        ?? ? ?? ?? ? ??
??? ??? ?? ?? ??.
     ????? ??? ? ???^^
to LEARN
Easy   to WRITE
       to RUN
????? ??
???????? ????(?)
? ???
????? ???!
?? ? ??? ??
???(Thread) !!!
  ????

         ???

         ???
??? ?? ????
start?? end?? ??? ??? 1?? ????.

from threading import Thread

def do_work(start, end, result):
    # Sum the integers in [start, end) and append the partial sum.
    total = 0
    for i in range(start, end):
        total += i
    result.append(total)

if __name__ == "__main__":
    START, END = 0, 20000000
    result = list()
    th1 = Thread(target=do_work, args=(START, END, result))
    th1.start()
    th1.join()
    print("Result : %d" % sum(result))
??? 1?
$ time python 1thread.py
Result : 199999990000000




real       0m3.523s
2? ?? ? ??????
start?? end?? ??? ??? 2?? ????.

from threading import Thread

def do_work(start, end, result):
    # Sum the integers in [start, end) and append the partial sum.
    total = 0
    for i in range(start, end):
        total += i
    result.append(total)

if __name__ == "__main__":
    START, END = 0, 20000000
    result = list()
    # // keeps the midpoint an integer on both Python 2 and Python 3.
    th1 = Thread(target=do_work, args=(START, END // 2, result))
    th2 = Thread(target=do_work, args=(END // 2, END, result))
    th1.start()
    th2.start()
    th1.join()
    th2.join()
    print("Result : %d" % sum(result))
??? 1?                     ??? 2?
$ time python 1thread.py   $ time python 2thread.py
Result : 199999990000000   Result : 199999990000000




real       0m3.523s        real 0m4.225s
                                        ?-_-?
? ?? ?? ???????
?
?????.
GIL !
(Global Interpreter Lock)
?????? ??? ???? ???.

   ???     ???   ???   ???
    #1      #2    #3    #4




         PYTHON VM
Coarse-
Grained
Lock
Coarse-Grained Lock

???     ???   ???   ???
 #1      #2    #3    #4




      PYTHON VM
Fine-
Grained
Lock
Fine-Grained Lock

???     ???   ???   ???
 #1      #2    #3    #4




      PYTHON VM
One Big Lock =
                   Global Interpreter Lock




                   ???? ??
guido van rossum
Global Interpreter Lock

???     ???   ???   ???
 #1      #2    #3    #4




      PYTHON VM
??? ?? ?
    ???




   ?? ???



   ?? ???
GIL??? ??? ?
???? ???? ? CPU? ?(??)??!

      ??         ???



      ???   ??   ???   ??



       ???       ??    ???
? ??? ??????

    ? ?????´
GIL? ????
?????? ??? ?????.

Garbage Collector ???? ????.

C/C++ ?? ?? ???? ?? ????.

? ??? ???? ?? ?? ? ?????.
GIL? ???? ???
1990??, ??? ????? ?? ?? ???

CPU? ? ? ?, ? ??? ???(Time-Sharing)

??? ?? ? ?? ???

???? ???? ??? ??? ?? ???!
??? ????
?? ??????
CPU                            READ/WRITE



????? I/O? ???? Python ???? ??!

  C
  P     READ/WRITE
  U
                                     ??? ??
                                     ?? ??????
                                     ??? I/O bound!!
  ?         C
  ?         P   READ/WRITE
  ?         U
                                     ??? ???~!
                C
      ???       P   READ/WRITE
                U
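
The I/O-bound case can be sketched roughly as follows. This is an illustration only, not code from the slides: `fake_read` is a hypothetical stand-in that sleeps instead of performing real I/O, and the sketch assumes Python 3 (`concurrent.futures`), whereas the slide code targets Python 2. A blocking call such as `time.sleep` or a socket read releases the GIL while it waits, so the waiting periods of several threads can overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(task_id, delay=0.1):
    """Hypothetical I/O call: sleeping releases the GIL, like a blocking read."""
    time.sleep(delay)
    return task_id


def run_sequential(n):
    """Perform n simulated reads one after another."""
    return [fake_read(i) for i in range(n)]


def run_threaded(n):
    """Perform n simulated reads on n threads so the waits overlap."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(fake_read, range(n)))


if __name__ == "__main__":
    for runner in (run_sequential, run_threaded):
        started = time.time()
        runner(4)
        print("%s: %.2fs" % (runner.__name__, time.time() - started))
```

The timings printed depend on the machine, so none are quoted here; the point is only that the threaded run does not have to wait for each simulated read in turn.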
?? ??? ?? ?????
   ??? ????
??? ?? ????? ????? ?????




Multiprocessing module
     http://docs.python.org/library/multiprocessing.html
?? ??? ????!
???? ???? ?????!
????
    ???     ???    ???


    ???     ???    ???




multiprocessing ???
??? ?? ????? ?????.
    ????
           ????.

    ????
           ????.
from multiprocessing import Process, Queue

def do_work(start, end, result):
    # Sum the integers in [start, end) and send the partial sum to the parent.
    total = 0
    for i in range(start, end):
        total += i
    result.put(total)

if __name__ == "__main__":
    START, END = 0, 20000000
    result = Queue()
    # // keeps the midpoint an integer on both Python 2 and Python 3.
    pr1 = Process(target=do_work, args=(START, END // 2, result))
    pr2 = Process(target=do_work, args=(END // 2, END, result))
    pr1.start()
    pr2.start()
    pr1.join()
    pr2.join()
    # The sentinel marks the end of the queued partial sums.
    result.put('STOP')
    total = 0
    while True:
        tmp = result.get()
        if tmp == 'STOP':
            break
        total += tmp
    print("Result : %d" % total)
??? 2? ???? 2?
$ time python 2thread.py      $ time python 2process.py
Result : 199999990000000      Result : 199999990000000


real 0m4.225s                 real 0m1.880s



                                          ??!
                       ????(sec)
                     4.2

                                 1.8



                   2 thread   2 process
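
The two-process script splits the range by hand. As a hedged sketch of the same idea for any number of workers, the block below uses `multiprocessing.Pool`. The helper names `chunk_bounds`, `partial_sum` and `parallel_sum` are assumptions made for this illustration and do not appear in the slides, and the code assumes Python 3.

```python
from multiprocessing import Pool


def chunk_bounds(start, end, parts):
    """Split [start, end) into `parts` contiguous (lo, hi) ranges."""
    step, extra = divmod(end - start, parts)
    bounds = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        hi = lo + step + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def partial_sum(bounds):
    """Sum the integers in one (lo, hi) range; runs inside a worker process."""
    lo, hi = bounds
    return sum(range(lo, hi))


def parallel_sum(start, end, workers=2):
    """Sum [start, end) using `workers` processes, one chunk per process."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum, chunk_bounds(start, end, workers)))
```

Call `parallel_sum(0, 20000000, 2)` from inside an `if __name__ == "__main__":` guard, as the slide code does, because some platforms start workers by re-importing the main module.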
Multiprocessing ?
     ??? ?? ????? ? ? ????.
th1 = Thread(target=do_work, args=(START, END, result))
th1.start()
th1.join()



pr1 = Process(target=do_work, args=(START, END, result))
pr1.start()
pr1.join()
Multiprocessing ?
         ??? ???? ?? ?????.
threading.Condition   multiprocessing.Condition
threading.Event       multiprocessing.Event
threading.Lock        multiprocessing.Lock
threading.RLock       multiprocessing.RLock
threading.Semaphore   multiprocessing.Semaphore


    ?? ?? ?? ??? ????? ?????.
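
A small sketch of that interface parity, under stated assumptions: `guarded_increment` is a hypothetical helper, and everything here runs inside a single process purely to show that both lock types accept the same calls. Sharing a counter across real processes would need something like `multiprocessing.Value`, which this sketch does not cover.

```python
import multiprocessing
import threading


def guarded_increment(lock, counter, times):
    """Increment counter[0] `times` times, holding `lock` for each update."""
    for _ in range(times):
        with lock:  # both lock types work as context managers
            counter[0] += 1
    return counter[0]


thread_lock = threading.Lock()
process_lock = multiprocessing.Lock()

# The same helper accepts either lock because both expose
# the same acquire()/release() interface.
print(guarded_increment(thread_lock, [0], 3))
print(guarded_increment(process_lock, [0], 3))
```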
?????
????!
? ? ? ????.
?? ?? ???!
???? ???? ???? ???

     ?? ???
????!
?? ??? ?? ????? ?
???

              ???




??? ???? ? ??? ?? ????.
??? ??? ?? ?? ? ?? ????? ?

               ????

               ????


               ????



               ????

               ????
???? ???
???? ????.
????? ?? ??
   ?? ???? ????
????? ????? ??? ?? ?????




Parallel Python
     http://www.parallelpython.com/
Parallel Python? ???
           ????? ?? ??


           ??? ?? ?? ??
"192.168.1.2"


import pp
ppservers = ("*",)
job_server = pp.Server(ppservers=ppservers)


                                     "192.168.1.3"      "192.168.1.4"
                                     $ ppserver.py -a   $ ppserver.py -a
f1 = job_server.submit(func, args)
f2 = job_server.submit(func, args)



result1 = f1()
result2 = f2()
0~160000000 ???

         ????(sec)
 9.3


            5.1
                      3.5




1 node     2 node    3 node
Parallel Python? ???

?? ??? ???? ???? ? ? ??.

Worker?? ?? ?? ??? ??.

?? callback??? ?? ?? ??.
?????
?? ??!
?? ? ? ????.
????? ???!
????
Hadoop, MapReduce
?? ??? ?????
?? ? ????
??? ???~~
???? ???? ???? ???? ???




  MapReduce
    http://en.wikipedia.org/wiki/MapReduce
OSDI 2004? ??
MapReduce? ??
- ?? ????-
??? ??
"??. ? ?? ? ????
  ? ?? ???? ?????"
?´
??? ?? ?? ?? ????.
                                map




KTH 1   ?? 1    KTH 1   ??? 1    ?? 1
??? 1   ?? 1    ?? 1             ?? 1
        ??? 1   ??? 1            ?? 1
? ?? ??? ?? ?????.
                                           reduce

   KTH 1       ?? 1        KTH 1       ??? 1    ?? 1
   ??? 1       ?? 1        ?? 1                 ?? 1
               ??? 1       ??? 1                ?? 1




?? A~Z? ???~           ? ?~?                    ? ?~???



               KTH 2           ?? 3     ??? 1
                               ??? 2    ?? 1 ?? 2
                                        ??? 1
MapReduce
      ? ???
   ?? ???(map)
  ???? ???(reduce)
? ???? ??? ????.
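
The word-count walk-through can be mimicked in a single process to make the two phases concrete. This is a sketch, not how Hadoop implements it: the function names are made up for this example, and `shuffle` stands in for the grouping by key that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["KTH python", "python hadoop python"]  # made-up input
    print(word_count(sample))
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different machine, which is why the two functions share no state.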
??? MapReduce?
? ??? ?????
???? ???? ???
??? Hadoop!!
Hadoop??? ????

???? ?? ?
  ?? ??? ???? ????
  ????? ??? ??? ?????
  ?? ?? ????? ????
  ??? ??? ??? ??? ???
  ?~ ?~ @.@
Hadoop??? ????

???? ?? ?
  Map? ??? ???
  Reduce? ??? ???
????? ?? ????? ??????.




Hadoop?? ???? ???   Hadoop??
????? ????(1398~1468)         ?? ?? ???
                              ???? ?? ??? ??
                              (?? ??? ??????)


Hadoop?? ??? start-up? ?? ??? ????? ? ? ?? ?
              ?? ????? ?? ??
??? Hadoop? JAVA??
                ?? ??? ?????


  HadoopStreaming
   STDIN/STDOUT? ??? ??? ????
   = Python, Ruby? ?? ???? ????
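
A hedged sketch of what a streaming word count might look like: the mapper writes tab-separated `word<TAB>1` records to STDOUT, and the reducer reads records that have already been sorted by key. The single-file layout and the `map`/`reduce` command-line argument are conventions invented for this example, not part of Hadoop Streaming itself.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Yield one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_records):
    """Sum the counts per word; input must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: pass "map" or "reduce" as the first argument.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `wc.py`, the pair can be tried locally with `cat data.txt | python wc.py map | sort | python wc.py reduce`, where `sort` plays the role of the sorting step Hadoop performs between the two phases.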
HadoopStreaming ? ??? ??



             ??? ?? ???

             ?? ??? ??? ??? ??

             ???? ?? ?? ???!!(???!)


             ?? ?? 100GB
? ?? ???? Python ???.

                   Hadoop? ?? ??.

                   Python? ?? ??.

                   ?? ??? ?~

Siri on iphone4s
HadoopStreaming? ?? ?? ?? Python Library




           mrJob
           https://github.com/Yelp/mrjob



                 by
??? ?? ?? ???? : mrjob? ??? ??


??? ?? ? (??, 1) ? ????
??? ?? ? ???

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # Emit (word, 1) for every word on the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # Sum all the 1s emitted for this word.
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
???? ???, ?? Hadoop? ??? ?? ???

$ python word_count.py data.txt --runner=local --output-dir=result




  Hadoop?? ??? runner? ???? ?
$ export HADOOP_HOME=~~~~~
$ python word_count.py data.txt --runner=hadoop --output-dir=result
??? ???
??? ????
??? ??????
??? 100?? ????
????? ??? ???? ?? ? ?? ????.
???? ??
?.. ????? ????
100?? ???? ???? ?
   ?.?
?? ??? ??
???? ???? Hadoop ??? ???




ElasticMapReduce
    http://aws.amazon.com/elasticmapreduce/
??? ??
OS? Hadoop?? ????
??? ????.



?? ??? ???.
   ? ??? ? ?? ?
         Large Instance ??
 7.5G ???, 4 EC2 Compute Units ??
      100?? 1????? ? $ 6



??? ???? ??!
??? mrjob? ??? ??

$ python word_count.py data.txt --runner=local --output-dir=result



$ export AWS_ACCESS_KEY_ID=~~~~
$ export AWS_SECRET_ACCESS_KEY=~~~~
$ python word_count.py data.txt --runner=emr --output-dir=s3://yongho/result


   runner? ??? ????? ???? ?? ?!!!
Elastic MapReduce?
?? ?? ??? ???
??? ??? ?? ??????
?: ???? ??? ?? ??.
1MB ??? ????
??? 5?, ??? 1?
??? ??? ???? ? ??
??? ?, ????? ?? ??

no configs found; falling back on auto-configuration
using existing scratch bucket mrjob-2a2aa23a8d6b1931
using s3://mrjob-2a2aa23a8d6b1931/tmp/ as our scratch dir on S3
Uploading input to s3://mrjob-2a2aa23a8d6b1931/tmp/word_count.yongho.20111104.053927.135643/input/
creating tmp directory /var/folders/v2/t63rb7x54f53_9mx4hmw_xqc0000gn/T/word_count.yongho.20111104.053927.135643
writing master bootstrap script to /var/folders/v2/t63rb7x54f53_9mx4hmw_xqc0000gn/T/word_count.yongho.20111104.053927.135643/b.py
Copying non-input files into s3://mrjob-2a2aa23a8d6b1931/tmp/word_count.yongho.20111104.053927.135643/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-3BPUCHHQ1T5PC
Job launched 32.2s ago, status STARTING: Starting instances
Job launched 63.5s ago, status STARTING: Starting instances
Job launched 94.4s ago, status STARTING: Starting instances
Job launched 125.3s ago, status STARTING: Starting instances
Job launched 156.4s ago, status STARTING: Starting instances
Job launched 187.2s ago, status STARTING: Starting instances
Job launched 218.1s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 249.1s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 280.0s ago, status RUNNING: Running step (word_count.yongho.20111104.053927.135643: Step 1 of 1)
Job launched 310.9s ago, status RUNNING: Running step (word_count.yongho.20111104.053927.135643: Step 1 of 1)
Job launched 342.2s ago, status RUNNING: Running step (word_count.yongho.20111104.053927.135643: Step 1 of 1)
Waiting 5.0s for S3 eventual consistency
Job completed.
Running time was 68.0s (not counting time spent waiting for the EC2 instances)
Fetching counters...
counters: [{'File Systems': {'Local bytes read': 341091,
                             'Local bytes written': 682264,
                             'S3N bytes read': 170755,
                             'S3N bytes written': 70720},
            'Job Counters ': {'Launched map tasks': 2,
                              'Launched reduce tasks': 1,
                              'Rack-local map tasks': 2},
            'Map-Reduce Framework': {'Combine input records': 0,
                                     'Combine output records': 0,
                                     'Map input bytes': 167508,
                                     'Map input records': 3735,
                                     'Map output bytes': 279513,
                                     'Map output records': 29460,
                                     'Reduce input groups': 6017,
                                     'Reduce input records': 29460,
                                     'Reduce output records': 6017}}]
removing tmp directory /var/folders/v2/t63rb7x54f53_9mx4hmw_xqc0000gn/T/word_count.yongho.20111104.053927.135643
Removing all files in s3://mrjob-2a2aa23a8d6b1931/tmp/word_count.yongho.20111104.053927.135643/
Removing all files in s3://mrjob-2a2aa23a8d6b1931/tmp/logs/j-3BPUCHHQ1T5PC/
Terminating job flow: j-3BPUCHHQ1T5PC
REVIEW
Multiprocessing module
                             MultiCore
Parallel Python
MapReduce                Many Machines

mrJob
ElasticMapReduce               Cloud
??? ??? ? ????
??? ?? ??? ???
??? ??? ?????
???
?? ???
???? ?????.
?????.
??????? / ???? Lab / ???
    wizfromnorth@paran.com
          @yonghosee
