Art of Distributed
Part 0 - Answering an Important Question: 'Why Distributed Computing?'
Haytham ElFadeel - Hfadeel@acm.org




Abstract
Day after day, distributed systems appear in more and more mainstream systems. In spite of the maturity of distributed frameworks and tools, many distributed technologies are still ambiguous to a lot of people, especially when it comes to complicated or new topics.
Lately I have been getting a lot of questions about distributed computing, especially about distributed computing models and MapReduce, such as: What is MapReduce? Can MapReduce fit every situation? How does it compare with other technologies such as Grid Computing? And what is the best solution for our situation? So I decided to write this article on distributed computing in two parts. The first part covers distributed computing models and the differences between them. In the second part, I will discuss the reliability of distributed systems and distributed storage systems. So let's start.

Introduction:
After publishing Part 1 of this article, entitled 'Rethinking about the Distributed Computing', I received a few messages expressing doubts about distributed computing. So I decided to go back and write Part 0 - Answering an Important Question: 'Why Distributed Computing?'.

Why Distributed Computing:
A lot of challenges require dealing with huge amounts of data. Imagine you want to crawl the whole Web, run a Monte Carlo simulation, or render an animated movie. Such challenges, and many others, simply cannot be handled on a single PC.
Usually we resort to distributed computing when we want to deal with a huge amount of data. So the question is: what is 'big data'?




Big Data - Experience with a Single Machine:
Today's computers have several gigabytes to a few terabytes of storage capacity. But what is 'big data' anyway? Gigabytes? Terabytes? Petabytes?
A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store. The U.S. Census database includes many different datasets of varying sizes, but let's simplify a bit: 100 gigabytes is enough to store at least the basic demographic information (age, sex, income, ethnicity, language, religion, housing status, and location), packed into a 128-bit record, for every living human being on this planet. This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be considered 'big data'? It depends on what you're trying to do with it. More importantly, any competent programmer could, in a few hours, write a simple, unoptimized application on a $500 desktop PC with minimal CPU and RAM that could crunch through that dataset and return answers to simple aggregation queries such as 'What is the median age by sex for each country?' with perfectly reasonable performance.
To demonstrate this, I tried it with fake data: a file consisting of 6.75 billion 16-byte (i.e., 128-bit) records containing uniformly distributed random data. Since a seven-bit age field allows a maximum of 128 possible values, one bit for sex allows only two, and eight bits for country allows up to 256 options, we can calculate the median age by using a counting strategy: simply create 65,536 buckets, one for each combination of age, sex, and country, and count how many records fall into each. We then find the median age by computing, for each sex and country group, the cumulative count over the 128 age buckets: the median is the bucket where the count reaches half of the total. [See figure 1].
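To make the counting strategy concrete, here is a minimal sketch in Python (not the code used in the original experiment). The file path and the exact bit layout of each record (age in the low seven bits of the first byte, sex in its top bit, country code in the second byte) are assumptions made for illustration; only the bucket-counting logic matters.

    from collections import defaultdict

    RECORD_SIZE = 16  # 128-bit records, as described above

    def median_age_by_sex_and_country(path):
        """Scan the record file once, counting into 2 * 256 * 128 = 65,536 buckets."""
        # counts[(sex, country)] is a 128-slot histogram over ages 0..127
        counts = defaultdict(lambda: [0] * 128)
        with open(path, "rb") as f:
            while True:
                record = f.read(RECORD_SIZE)
                if len(record) < RECORD_SIZE:
                    break
                # Assumed packing: byte 0 holds age (7 bits) and sex (1 bit),
                # byte 1 holds the country code; the remaining bytes are ignored here.
                age = record[0] & 0x7F
                sex = record[0] >> 7
                country = record[1]
                counts[(sex, country)][age] += 1

        medians = {}
        for (sex, country), hist in counts.items():
            total = sum(hist)
            running = 0
            for age, n in enumerate(hist):
                running += n
                if running * 2 >= total:  # cumulative count reaches half the total
                    medians[(sex, country)] = age
                    break
        return medians

A single sequential pass over the file plus a tiny in-memory table is all this needs, which is exactly why disk throughput, not CPU, ends up dominating the running time.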




In my tests, this algorithm was limited primarily by the speed of disk reads: at an average sustained read speed of 90 megabytes per second, it shamefully underutilized the CPU the whole time.
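A back-of-the-envelope calculation, using the dataset size and the 90 MB/s figure quoted above (the exact numbers are approximate), shows why a single pass over the file already takes this long:

    records = 6_750_000_000                 # one 128-bit record per living person
    record_size = 16                        # bytes per record
    dataset_bytes = records * record_size   # = 108,000,000,000 bytes, roughly 108 GB

    read_rate = 90 * 10**6                  # ~90 MB/s sustained sequential read
    seconds = dataset_bytes / read_rate
    print(round(seconds))                   # ~1200 seconds, i.e. about 20 minutes of pure I/O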



In fact, our table will fit in the memory of a single $15,000 Dell server with 128 GB of RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this 'big data'. Running from disk, the application needs on average between 725 and 1,000 seconds to perform the above query. But what if the data is bigger than this, what if you want to perform more complex queries, or what if the data is stored in a database system rather than a tightly packed 128-bit-per-record file? In practice, if you add more data or make the queries more complex, the running time is likely to grow dramatically, or at least multiply!
So the answer to the question is: it depends on what you're trying to do with the data.

Hard Limits:
Hardware and software failures are another important limit: there is no way to fully predict failures or to protect your system from them. Even if you use a high-end server or expensive hardware, failures are still likely to occur. On a single machine, you are therefore likely to lose your whole work! But if you distribute the work across many machines, you lose only one part of it, and that part can be resumed on another machine. Modern distributed strategies have shown their ability to handle hardware and software failures reliably: systems and computing models such as MapReduce, Dryad, Hadoop, and Dynamo are all designed to cope with unexpected failures.
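To illustrate the idea of resuming lost work on another machine, here is a toy retry-on-failure sketch. It is not the actual scheduling logic of MapReduce, Dryad, Hadoop, or Dynamo (those add heartbeats, replication, data locality, speculative execution, and much more); the worker names and the simulated failure rate are invented for the example.

    import random

    def run_on(worker, task):
        """Stand-in for remote execution that sometimes fails (simulated failures)."""
        if random.random() < 0.1:
            raise RuntimeError(f"{worker} died while running {task}")
        return f"result of {task}"

    def run_with_retries(tasks, workers, max_attempts=3):
        """Run independent tasks, re-running only a failed piece on another machine."""
        results = {}
        for task in tasks:
            for attempt in range(max_attempts):
                worker = workers[(hash(task) + attempt) % len(workers)]
                try:
                    results[task] = run_on(worker, task)
                    break                # this task is done; the rest of the job is untouched
                except RuntimeError:
                    continue             # resume the lost piece on the next machine
            else:
                raise RuntimeError(f"{task} failed on {max_attempts} different workers")
        return results

    # Example: eight independent chunks of work spread over three machines.
    print(run_with_retries([f"chunk-{i}" for i in range(8)],
                           ["node-a", "node-b", "node-c"]))

The point of the sketch is simply that a failure costs one chunk of work, not the whole job, which is the reliability argument made above.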
Of course, hardware (chiefly memory and CPU limitations) is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R. Memory-bound applications naturally exhibit higher performance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you're out of luck. In that case, will you simply wait for the next generation of hardware to arrive? On most hardware platforms, there's a much harder limit on memory expansion than on disk expansion: the motherboard has only so many slots to fill.
The problem often goes further than this, however. Like most other aspects of computer hardware, maximum memory capacities increase with time; 32 GB is no longer a rare configuration for a desktop workstation, and servers are frequently configured with far more than that. There is no guarantee, however, that a memory-bound application will be able to use all installed RAM. Even under modern 64-bit operating systems, many applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces, which often translates into a 2- or 3-GB working-set limitation.
Finally, even where a 64-bit binary is available (removing the absolute address-space limitation), all too often relics from the age of 32-bit code still pervade software, particularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 2^31 - 1, or about 2 billion, rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world-census example ends up being too big for R to handle.
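The arithmetic behind that limit is easy to check; the snippet below simply restates the figures from the text:

    max_rows = 2**31 - 1          # largest signed 32-bit integer: 2,147,483,647
    census_rows = 6_750_000_000   # the 6.75-billion-row table from the earlier example

    print(max_rows)               # 2147483647 -> "about 2 billion rows"
    print(census_rows > max_rows) # True: the table cannot be indexed with 32-bit lengths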



