Part 0 - Answering an Important Argument
Art of Distributed
'Why Distributed Computing'
Haytham ElFadeel - Hfadeel@acm.org
Abstract
Day after day, distributed systems appear in more and more mainstream systems. In spite of the maturity of distributed frameworks and tools, a lot of distributed technologies remain ambiguous to many people, especially when we talk about complicated or new topics.
Lately I have been getting a lot of questions about distributed computing, especially about distributed computing models and MapReduce, such as: What is MapReduce? Can MapReduce fit all situations? How does it compare with other technologies such as Grid Computing? And what is the best solution for our situation? So I decided to write an article about distributed computing in two parts. The first is about distributed computing models and the differences between them. In the second part, I will discuss the reliability of distributed systems and distributed storage systems. So let's start...
Introduction:
After publishing Part 1 of this article, entitled 'Rethinking about the Distributed Computing', I received a few messages expressing doubts about distributed computing. So I decided to go back and write Part 0 - answering the important argument 'Why Distributed Computing'.
Why Distributed Computing:
A lot of challenges require dealing with huge amounts of data. Imagine that you want to crawl the whole Web, perform a Monte Carlo simulation, or render an animated movie. These challenges, and many others, cannot be handled on a single PC.
Usually we resort to distributed computing when we want to deal with huge amounts of data. So the question is: what is big data?
The Big Data - Experience with a Single Machine:
Today's computers have several gigabytes or a few terabytes of storage capacity. But what is 'big data' anyway? Gigabytes? Terabytes? Petabytes?
A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store. The U.S. Census database included many different datasets of varying sizes, but let's simplify a bit: 100 gigabytes is enough to store at least the basic demographic information (age, sex, income, ethnicity, language, religion, housing status, and location, packed into a 128-bit record) for every living human being on this planet. This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be considered 'big data'? It depends on what you're trying to do with it. More importantly, any competent programmer could, in a few hours, write a simple, unoptimized application on a $500 desktop PC with minimal CPU and RAM that could crunch through that dataset and return answers to simple aggregation queries such as 'what is the median age by sex for each country?' with perfectly reasonable performance.
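As a rough back-of-the-envelope check (my own arithmetic, not part of the original experiment), the storage claim works out to roughly the figure quoted above:

    rows = 6_750_000_000        # roughly one record per living person
    record_bytes = 16           # a packed 128-bit record, as described above
    total_gb = rows * record_bytes / 1e9
    print(total_gb)             # ~108 GB, i.e. on the order of the 100 GB figure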
To demonstrate this, I tried it with fake data: a file consisting of 6.75 billion 16-byte (i.e. 128-bit) records containing uniformly distributed random data. Since a seven-bit age field allows a maximum of 128 possible values, one bit for sex allows only two, and eight bits for country allows up to 256 options, we can calculate the median age by using a counting strategy: simply create 65,536 buckets, one for each combination of age, sex, and country, and count how many records fall into each. We find the median age by determining, for each sex and country group, the cumulative count over the 128 age buckets: the median is the bucket where the count reaches half of the total. [See figure 1].
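A minimal sketch of this counting strategy in Python is shown below. It assumes a hypothetical record layout in which the first byte packs the 7-bit age and the 1-bit sex and the second byte holds the country code; the exact layout used in the original test is not specified.

    NUM_AGES, NUM_SEXES, NUM_COUNTRIES = 128, 2, 256

    def median_age_by_group(path):
        # One counter per (country, sex, age) combination: 256 * 2 * 128 = 65,536 buckets.
        counts = [[[0] * NUM_AGES for _ in range(NUM_SEXES)] for _ in range(NUM_COUNTRIES)]

        with open(path, "rb") as f:
            while True:
                record = f.read(16)      # a real implementation would read large blocks
                if len(record) < 16:
                    break
                age = record[0] >> 1     # top 7 bits of byte 0 (hypothetical layout)
                sex = record[0] & 1      # low bit of byte 0
                country = record[1]      # byte 1
                counts[country][sex][age] += 1

        # For each (country, sex) group, walk the cumulative count over the 128 age
        # buckets; the median is the bucket where the count reaches half the total.
        medians = {}
        for country in range(NUM_COUNTRIES):
            for sex in range(NUM_SEXES):
                total = sum(counts[country][sex])
                if total == 0:
                    continue
                running = 0
                for age in range(NUM_AGES):
                    running += counts[country][sex][age]
                    if running * 2 >= total:
                        medians[(country, sex)] = age
                        break
        return medians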
In my tests, this algorithm was limited primarily by the speed of fetching data from disk: it averaged a 90-megabyte-per-second sustained read speed, shamefully underutilizing the CPU the whole time.
In fact, our table will fit in the memory of a single $15,000 Dell server with 128 GB of RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this 'big data'. Reading from disk, this application needed on average between 725 and 1,000 seconds to perform the above query. But what if the data is bigger than this, you want to perform more complex queries, or the data is stored in a database system, which means every record will no longer take only 128 bits? Actually, if you add more data or make the query more complex, the time is likely to grow exponentially, or at least to multiply!
So the answer to the question is: it depends on what you're trying to do with the data.
Hard Limit:
Hardware and software failures are an important limit: there is no way to fully predict or protect your system from failures. Even if you use a high-end server or expensive hardware, failures are likely to occur. So on a single machine you are likely to lose your whole work! But if you distribute the work across many machines, you will lose only one part of the work, and you can resume it on another machine. Modern distributed strategies have shown their ability and reliability in handling hardware and software failures; systems and computing models such as MapReduce, Dryad, Hadoop, and Dynamo are all designed to handle unexpected failures.
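To illustrate the idea (a hypothetical sketch of my own, not the actual recovery protocol of MapReduce, Dryad, Hadoop, or Dynamo), a scheduler can simply re-run a failed piece of work on another machine, so a single failure costs only one chunk rather than the whole job:

    import random

    # Hypothetical sketch: 'workers' and 'process_chunk' are illustrative stand-ins,
    # not the API of any real framework.
    def run_with_retries(chunks, workers, process_chunk, max_attempts=3):
        results = {}
        for chunk in chunks:
            for attempt in range(max_attempts):
                worker = random.choice(workers)        # pick some machine for this chunk
                try:
                    results[chunk] = process_chunk(worker, chunk)
                    break                              # this chunk succeeded; move on
                except Exception:
                    continue                           # machine failed: retry only this
                                                       # chunk, the rest is unaffected
            else:
                raise RuntimeError("chunk %r failed on every attempt" % (chunk,))
        return results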
Of course, hardware, chiefly memory and CPU limitations, is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R. Memory-bound applications naturally exhibit higher performance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you're out of luck. So in this case, will you wait until the next generation of hardware arrives?! On most hardware platforms, there's a much harder limit on memory expansion than on disk expansion: the motherboard has only so many slots to fill.
The problem often goes further than this, however. Like most other aspects of computer hardware, maximum memory capacities increase with time; 32 GB is no longer a rare
configuration for a desktop workstation, and servers are frequently configured with far
more than that. There is no guarantee, however, that a memory-bound application will
be able to use all installed RAM. Even under modern 64-bit operating systems, many
applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces; this often translates into a 2- or 3-GB working-set limitation.
Finally, even where a 64-bit binary is available (removing the absolute address-space limitation), all too often relics from the age of 32-bit code still pervade software, particularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 2^31 - 1, or about 2 billion, rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world census example ends up being too big for R to handle.
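A quick numeric check (my own arithmetic, not taken from the R documentation) makes the gap concrete:

    max_len = 2**31 - 1            # largest length a signed 32-bit integer can represent
    census_rows = 6_750_000_000    # rows in the world-census example above
    print(max_len)                 # 2147483647, i.e. about 2.1 billion
    print(census_rows / max_len)   # ~3.1: the table is roughly three times too long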