
1. Large-scale data processing [at SARA] [with Apache Hadoop] Evert Lammerts February 9, 2012, Netherlands Hadoop User Group

2. Who's who?

3. Who's who? Who has worked on scale? e.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes? >= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?

4. In this talk Why large-scale data processing?

5. An introduction to scale @ SARA

6. An introduction to Hadoop & MapReduce

7. Hadoop @ SARA

8. Why large-scale data processing? An introduction to scale @ SARA An introduction to Hadoop & MapReduce Hadoop @ SARA

9. (Jimmy Lin, University of Maryland / Twitter, 2011)

10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)

11. s/knowledge/data/g* HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more You already have your data (*Jimmy Lin, University of Maryland / Twitter, 2011)

12. Data-processing as a commodity Cheap Clusters

13. Simple programming models

14. Easy-to-learn scripting

15. Anybody with the know-how can generate insights!

16. Note: “the know-how” = Data Science DevOps Programming algorithms Domain knowledge

18. SARA, the national center for scientific computing. Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

19. Large-scale data != new

20. Different types of computing Parallelism Data parallelism

21. Task parallelism Architectures SIMD: Single Instruction Multiple Data

22. MIMD: Multiple Instruction Multiple Data

23. MISD: Multiple Instruction Single Data

24. SISD: Single Instruction Single Data (Von Neumann)

25. Parallelism: Amdahl's law
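As a reminder of what Amdahl's law says, here is a minimal Python sketch. It is not taken from the slides: the function name `amdahl_speedup` and the example fraction of 0.95 are assumptions chosen for illustration. The law bounds the speedup of a job in which only a fraction p of the work can be parallelized over n workers: speedup = 1 / ((1 - p) + p / n).

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a job runs in parallel.

    parallel_fraction: share of the work that can be parallelized (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job in which 95% of the work is parallelizable.
    for workers in (1, 10, 100, 1000):
        print(workers, round(amdahl_speedup(0.95, workers), 2))
```

However many workers are added, the serial 5% in this example caps the speedup just below 20x, which is why the serial part of a job matters so much at scale.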

26. Data parallelism

27. Compute @ SARA

28. What's different about Hadoop? No more do-it-yourself parallelism – it's hard! But rather linearly scalable data parallelism Separating the what from the how (NYT, 14/06/2006)

30. A bit of history 2002: Nutch* 2004: MR/GFS** 2006: Hadoop * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html

31. http://wiki.apache.org/hadoop/PoweredBy 2010 - 2012: A Hype in Production

32. Core principles Scale out, not up

33. Move processing to the data

34. Process data sequentially, avoid random reads

35. Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)

36. A typical data-parallel problem in abstraction Iterate over a large number of records

37. Extract something of interest

38. Create an ordering in intermediate results

39. Aggregate intermediate results

40. Generate output MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)

41. MapReduce Programmer specifies two functions map (k, v) -> <k', v'>*

42. reduce (k', v') -> <k', v'>* All values associated with a single key are sent to the same reducer The framework handles the rest

43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
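The map and reduce signatures above can be illustrated with a word count. The following is a single-process Python sketch, not the Hadoop API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the in-memory grouping only stands in for the shuffle and ordering that the framework performs across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """map (k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """reduce: receives all values for one key and emits their sum."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, sort, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # All values for the same key end up in the same group,
            # mirroring "sent to the same reducer".
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # ordering of intermediate results
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the quick fox"), (1, "the lazy dog the end")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

Scheduling, data distribution, synchronization, and error handling are absent from this sketch; on a real cluster those are the parts the framework takes over.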

44. An overview of a Hadoop cluster

45. The ecosystem HBase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...

47. Timeline. 2009: Piloting Hadoop on Cloud. 2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me!). 2011: Funding granted for production service. 2012: Production cluster available (~March) (72 machines * 8 cores / 8 TB storage / 64 GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants).

48. Architecture

49. Components Hadoop, Hive, Pig, HBase, HCatalog - others?

50. What are scientists doing? Information Retrieval

51. Natural Language Processing

52. Machine Learning

53. Econometrics

54. Bioinformatics

55. Computational Ecology / Ecoinformatics

56. Machine learning: Infrawatch, Hollandse Brug

57. Structural health monitoring 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
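To make the multiplication on this slide concrete, the short Python sketch below works out the number of sensor readings per year. The function and constant names are hypothetical, and the size of a single reading in bytes is not given in the slides, so only the count is computed.

```python
SENSORS = 145                             # sensors on the bridge (from the slide)
SAMPLES_PER_SECOND = 100                  # 100 Hz sampling rate (from the slide)
SECONDS_PER_YEAR = 60 * 60 * 24 * 365     # 31,536,000 seconds


def readings_per_year(sensors: int = SENSORS, hz: int = SAMPLES_PER_SECOND) -> int:
    """Number of individual sensor readings collected in one year."""
    return sensors * hz * SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{readings_per_year():,} readings per year")
```

That comes to 457,272,000,000 readings per year, or roughly 4.6 x 10^11.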

58. And others: NLP & IR e.g. ClueWeb: a ~13.4 TB webcrawl

59. e.g. Twitter gardenhose data

60. e.g. Wikipedia dumps

61. e.g. del.icio.us & flickr tags

62. Finding named entities: [person company place] names

63. Creating inverted indexes

64. Piloting real-time search

65. Personalization

66. Semantic web

67. Interest from industry We're opening shop. Come and pilot.

68. Final thoughts The tide rises, data is not getting less, let's ride that wave!

69. Hadoop is the first to provide commodity computing. Hadoop is not the only

70. Hadoop is probably not the best

71. Hadoop has momentum

72. And how many infrastructures do we need? MapReduce fits surprisingly well as a programming model for data-parallelism

73. The data center is your computer

74. Where is the data scientist? Much to learn & teach!

75. Any questions? [email_address] @eevrt @sara_nl


First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
