
New Directions for
Power Law Research

Michael Mitzenmacher
Harvard University


Internet Mathematics
Articles Related to This Talk

The Future of Power Law Research

Dynamic Models for File Sizes
and Double Pareto Distributions

A Brief History of Generative
Models for Power Law and
Lognormal Distributions


Motivation: General
• Power laws (and/or scale-free networks) are now everywhere.
  - See the popular texts Linked by Barabási or Six Degrees by Watts.
  - In computer science: file sizes, download times, Internet topology, Web graph, etc.
  - Other sciences: economics, physics, ecology, linguistics, etc.
• What has been and what should be the research agenda?


My (Biased) View
• There are 5 stages of power law network research.
  1) Observe: Gather data to demonstrate power law behavior in a system.
  2) Interpret: Explain the importance of this observation in the system context.
  3) Model: Propose an underlying model for the observed behavior of the system.
  4) Validate: Find data to validate (and if necessary specialize or modify) the model.
  5) Control: Design ways to control and modify the underlying behavior of the system based on the model.

My (Biased) View
• In networks, we have spent a lot of time observing and interpreting power laws.
• We are currently in the modeling stage.
  - Many, many possible models.
  - I'll talk about some of my favorites later on.
• We now need to put much more focus on validation and control.
  - And these are specific areas where computer science has much to contribute!


Models
• After observation, the natural step is to explain/model the behavior.
• Outcome: lots of modeling papers.
  - And many models rediscovered.
• Lots of history...


History
• In the 1990s, the abundance of observed power laws in networks surprised the community.
  - Perhaps it shouldn't have: power laws appear frequently throughout the sciences.
    • Pareto: income distribution, 1897
    • Zipf-Auerbach: city sizes, 1913/1940s
    • Zipf-Estoup: word frequency, 1916/1940s
    • Lotka: bibliometrics, 1926
    • Yule: species and genera, 1924
    • Mandelbrot: economics/information theory, 1950s+
• Observation/interpretation were/are key to initial understanding.
• My claim: but now the mere existence of power laws should not be surprising, or necessarily even noteworthy.
• My (biased) opinion: The bar should now be very high for observation/interpretation.

Power Law Distribution
• A power law distribution satisfies
    $\Pr[X \ge x] \sim c x^{-\alpha}$
• Pareto distribution
    $\Pr[X \ge x] = (x/k)^{-\alpha}$, for $x \ge k$
  - Log-complementary cumulative distribution function (ccdf) is exactly linear.
    $\ln \Pr[X \ge x] = -\alpha \ln x + \alpha \ln k$
• Properties
  - Infinite mean/variance possible
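
As a worked check of the last property, here is a short sketch that assumes the Pareto form above with cutoff $k$ and exponent $\alpha$. The density $f$ is obtained by differentiating the ccdf. The mean is infinite once $\alpha \le 1$, and the variance is infinite once $\alpha \le 2$.

```latex
\begin{align*}
f(x) &= \alpha k^{\alpha} x^{-\alpha - 1}, \qquad x \ge k \\
E[X] &= \int_k^{\infty} x\, f(x)\, dx =
  \begin{cases} \dfrac{\alpha k}{\alpha - 1} & \alpha > 1 \\[1ex] \infty & \alpha \le 1 \end{cases} \\
E[X^2] &= \int_k^{\infty} x^2 f(x)\, dx =
  \begin{cases} \dfrac{\alpha k^2}{\alpha - 2} & \alpha > 2 \\[1ex] \infty & \alpha \le 2 \end{cases}
\end{align*}
```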


Lognormal Distribution
• X is lognormally distributed if Y = ln X is normally distributed.
• Density function:
    $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\ln x - \mu)^2 / 2\sigma^2}$
• Properties:
  - Finite mean/variance.
  - Skewed: mean > median > mode
  - Multiplicative: X1 lognormal, X2 lognormal (independent) implies X1X2 lognormal.
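
The skewness and multiplicative properties can be made concrete with the standard closed forms, sketched here under the assumption that $\ln X$ is normal with mean $\mu$ and variance $\sigma^2$ (and that $X_1$, $X_2$ are independent):

```latex
\begin{align*}
\text{mean} &= e^{\mu + \sigma^2/2}, \qquad
\text{median} = e^{\mu}, \qquad
\text{mode} = e^{\mu - \sigma^2} \\
\ln(X_1 X_2) &= \ln X_1 + \ln X_2 \sim N\!\left(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2\right)
\end{align*}
```

Since $\sigma^2 > 0$, the ordering mean > median > mode follows directly from the exponents.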


Similarity
• Easily seen by looking at log-densities.
• Pareto has linear log-density.
    $\ln f(x) = -(\alpha + 1) \ln x + \alpha \ln k + \ln \alpha$
• For large σ, lognormal has nearly linear log-density.
    $\ln f(x) = -\ln x - \ln(\sqrt{2\pi}\,\sigma) - \frac{(\ln x - \mu)^2}{2\sigma^2}$
• Similarly, both have near linear log-ccdfs.
  - Log-ccdfs usually used for empirical, visual tests of power law behavior.
• Question: how to differentiate them empirically?
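
One simple starting point for the empirical question is to fit both families by maximum likelihood and compare log-likelihoods. The sketch below is only an illustration, not a method from the talk: the function names are hypothetical, the Pareto cutoff k is assumed to be the sample minimum, and a raw likelihood comparison is a crude stand-in for a proper goodness-of-fit test.

```python
import math
import random


def pareto_fit(xs):
    """MLE for Pr[X >= x] = (x/k)^(-alpha); returns (k, alpha, log-likelihood)."""
    n = len(xs)
    k = min(xs)  # assumption: the cutoff k is estimated by the sample minimum
    logs = [math.log(x) for x in xs]
    alpha = n / sum(lx - math.log(k) for lx in logs)
    # ln f(x) = ln(alpha) + alpha*ln(k) - (alpha + 1)*ln(x)
    loglik = n * math.log(alpha) + n * alpha * math.log(k) - (alpha + 1) * sum(logs)
    return k, alpha, loglik


def lognormal_fit(xs):
    """MLE for a lognormal; returns (mu, sigma, log-likelihood)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    sigma = math.sqrt(sum((lx - mu) ** 2 for lx in logs) / n)
    # ln f(x) = -ln(x) - ln(sigma*sqrt(2*pi)) - (ln(x) - mu)^2 / (2*sigma^2)
    loglik = -sum(logs) - n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - 0.5 * n
    return mu, sigma, loglik


def preferred_model(xs):
    """Name of the family with the higher maximized log-likelihood."""
    pareto_ll = pareto_fit(xs)[2]
    lognormal_ll = lognormal_fit(xs)[2]
    return "pareto" if pareto_ll > lognormal_ll else "lognormal"
```

For example, `preferred_model([random.paretovariate(1.5) for _ in range(5000)])` feeds synthetic Pareto data to the comparison. Real traces are rarely this clean, which is part of why the question is hard.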


Lognormal vs. Power Law
• Question: Is this distribution lognormal or a power law?
  - Reasonable follow-up: Does it matter?
• Primarily in economics
  - Income distribution.
  - Stock prices. (Black-Scholes model.)
• But also papers in ecology, biology, astronomy, etc.


Preferential Attachment
• Consider dynamic Web graph.
  - Pages join one at a time.
  - Each page has one outlink.
• Let X_j(t) be the number of pages of indegree j at time t.
• New page links:
  - With probability α, link to a random page.
  - With probability (1 - α), link to a page chosen proportionally to indegree. (Copy a link.)
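
The link rule above can be simulated directly. This is a minimal sketch under stated assumptions: the function name is hypothetical, page 0 starts with a self-link so that there is always a link to copy, and "chosen proportionally to indegree" is implemented by copying the destination of a uniformly random existing link.

```python
import random
from collections import Counter


def simulate_indegrees(n_pages, alpha, seed=0):
    """Return the indegree of every page after n_pages pages have joined."""
    rng = random.Random(seed)
    targets = [0]  # assumption: page 0 begins with a self-link
    for t in range(1, n_pages):
        if rng.random() < alpha:
            dest = rng.randrange(t)           # link to a uniformly random page
        else:
            dest = targets[rng.randrange(t)]  # copy a random existing link
        targets.append(dest)
    indegree = Counter(targets)
    return [indegree.get(page, 0) for page in range(n_pages)]


def indegree_histogram(indegrees):
    """Map j -> number of pages with indegree j (the X_j of the model)."""
    return Counter(indegrees)
```

Plotting `indegree_histogram(simulate_indegrees(100_000, 0.5))` on log-log axes should show a roughly straight tail. As a quick sanity check, the rate equation dX_0/dt = 1 - αX_0/t suggests that about a fraction 1/(1 + α) of pages end up with indegree 0.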

Preferential Attachment History
• This model (without the graphs) was derived in the 1950s by Herbert Simon.
  - ... who won a Nobel Prize in economics for entirely different work.
  - His analysis was not for Web graphs, but for other preferential attachment problems.


Optimization Model: Power Law
• Mandelbrot experiment: design a language over a d-ary alphabet to optimize information per character.
  - Probability of jth most frequently used word is p_j.
  - Length of jth most frequently used word is c_j.
• Average information per word:
    $H = -\sum_j p_j \log_2 p_j$
• Average characters per word:
    $C = \sum_j p_j c_j$
• Optimization leads to power law.
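
A sketch of why the optimization gives a power law, under two assumptions that are not stated on the slide: each $p_j$ is treated as a free variable (the constraint $\sum_j p_j = 1$ is ignored), and the jth most frequent word is assigned the jth shortest string, so $c_j \approx \log_d j$. Maximizing information per character is the same as minimizing $A = C/H$.

```latex
\begin{align*}
\frac{\partial H}{\partial p_j} &= -\log_2 p_j - \log_2 e = -\log_2 (e\, p_j),
  \qquad \frac{\partial C}{\partial p_j} = c_j \\
\frac{\partial A}{\partial p_j} &= \frac{c_j H + C \log_2 (e\, p_j)}{H^2} = 0
  \;\Longrightarrow\; p_j = \frac{1}{e}\, 2^{-H c_j / C} \\
c_j \approx \log_d j &\;\Longrightarrow\; p_j \propto j^{-H / (C \log_2 d)}
\end{align*}
```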


Monkeys Typing Randomly
• Miller (psychologist, 1957) suggests the following: monkeys type randomly at a keyboard.
  - Hit each of n characters with probability p.
  - Hit space bar with probability 1 - np > 0.
  - A word is a sequence of characters separated by a space.
• Resulting distribution of word frequencies follows a power law.
• Conclusion: Mandelbrot's "optimization" is not required for languages to have a power law.
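
The monkey experiment is easy to try out. The sketch below is illustrative only: the function names, the default parameters, and the choice to skip empty words produced by consecutive spaces are all assumptions. A specific word of length L appears with probability proportional to p^L, and there are n^L such words, so frequency should fall off roughly like rank^(log_n p).

```python
import random
from collections import Counter


def monkey_word_counts(n_chars=4, p=0.2, n_keystrokes=200_000, seed=0):
    """Count the words typed when each of n_chars keys has probability p."""
    if not 0 < n_chars * p < 1:
        raise ValueError("need 0 < n_chars * p < 1 so the space bar has positive probability")
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:n_chars]
    counts = Counter()
    word = []
    for _ in range(n_keystrokes):
        u = rng.random()
        if u < n_chars * p:
            # each letter owns an interval of width p inside [0, n_chars * p)
            word.append(alphabet[min(int(u / p), n_chars - 1)])
        else:
            if word:  # assumption: empty words (repeated spaces) are skipped
                counts["".join(word)] += 1
            word = []
    return counts


def rank_frequency(counts):
    """Return (rank, count) pairs, most frequent word first."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))
```

Plotting `rank_frequency(monkey_word_counts())` on log-log axes gives a staircase whose envelope is close to a straight line.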


Generative Models: Lognormal
• Start with an organism of size X_0.
• At each time step, size changes by a random multiplicative factor.
    $X_t = F_{t-1} X_{t-1}$
• If F_t is taken from a lognormal distribution, each X_t is lognormal.
• If the F_t are independent and identically distributed, then (by the CLT) X_t converges to a lognormal distribution.
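
The CLT argument works in log space: ln X_t is ln X_0 plus a sum of i.i.d. terms ln F. A small simulation illustrates this with factors that are not lognormal themselves. The function names, the uniform factor distribution on [0.5, 1.5], and X_0 = 1 are all assumptions made for the sketch.

```python
import math
import random


def simulate_log_sizes(n_steps, n_samples, seed=0, low=0.5, high=1.5):
    """Return ln X_t for n_samples independent runs of X_t = F * X_{t-1}, X_0 = 1."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        log_x = 0.0  # ln X_0 with X_0 = 1
        for _ in range(n_steps):
            # assumption: factors are uniform on [low, high], not lognormal
            log_x += math.log(rng.uniform(low, high))
        results.append(log_x)
    return results


def sample_skewness(values):
    """Third standardized moment; close to 0 for a normal sample."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    third = sum((v - m) ** 3 for v in values) / n
    return third / var ** 1.5
```

If the CLT picture holds, `sample_skewness(simulate_log_sizes(100, 3000))` should be near zero, i.e. ln X_t looks approximately normal even though each factor is uniform.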


BUT!
• If there exists a lower bound:
    $X_t = \max(\epsilon, F_{t-1} X_{t-1})$
  then X_t converges to a power law distribution. (Champernowne, 1953)
• Lognormal model easily pushed to a power law model.
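
The effect of the lower bound can be seen by adding a single max() to the multiplicative process. This is a hedged sketch: the function names and parameter values are illustrative, the factors are assumed lognormal, and ln F is assumed to have negative mean so that the process keeps returning to the bound and settles into a stationary distribution.

```python
import random


def simulate_bounded(n_steps, n_samples, eps=1.0, mu=-0.1, sigma=0.4, seed=0):
    """Final sizes of X_t = max(eps, F * X_{t-1}) with ln F ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_samples):
        x = eps  # assumption: start at the lower bound
        for _ in range(n_steps):
            x = max(eps, rng.lognormvariate(mu, sigma) * x)
        finals.append(x)
    return finals


def empirical_ccdf(values, x):
    """Fraction of samples that are >= x."""
    return sum(v >= x for v in values) / len(values)
```

Evaluating `empirical_ccdf` at x = 2, 4, 8, 16 and plotting on log-log axes should give points close to a straight line, in contrast to the curved log-ccdf of the unbounded process.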


Double Pareto Distributions

• Consider continuous version of lognormal generative model.
  - At time t, log X_t is normal with mean $\mu t$ and variance $\sigma^2 t$.
• Suppose observation time is distributed exponentially.
  - E.g., when Web size doubles every year.
• Resulting distribution is Double Pareto.
  - Between lognormal and Pareto.
  - Linear tail on a log-log chart, but a lognormal body.
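
The construction above amounts to a two-step sampler: draw an exponential observation time T, then draw log X as a normal with mean μT and variance σ²T. The sketch below assumes hypothetical names and default parameters. In the special case μ = 0, log X follows a Laplace (double exponential) distribution with rate sqrt(2·rate)/σ, which is what gives power-law behavior in X on both sides.

```python
import math
import random


def sample_double_pareto_logs(n_samples, rate=2.0, mu=0.0, sigma=1.0, seed=0):
    """Sample ln X where ln X | T ~ N(mu*T, sigma^2*T) and T ~ Exponential(rate)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_samples):
        t = rng.expovariate(rate)  # exponentially distributed observation time
        logs.append(rng.gauss(mu * t, sigma * math.sqrt(t)))
    return logs


def log_tail_fraction(logs, y):
    """Fraction of samples with ln X > y, i.e. the ccdf of X at e^y."""
    return sum(v > y for v in logs) / len(logs)
```

With the defaults (rate 2, μ = 0, σ = 1), the Laplace form predicts `log_tail_fraction(logs, y)` to be about 0.5·exp(-2y), so the log-ccdf is linear in y = ln x.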


Lognormal vs. Double Pareto


And So Many More...
• New variations coming up all of the time.
• Question: What makes a new power law model sufficiently interesting to merit attention and/or publication?
  - Strong connection to an observed process.
    • Many models claim this, but few demonstrate it convincingly.
  - Theory perspective: new mathematical insight or sophistication.
• My (biased) opinion: the bar should start being raised on model papers.

Validation: The Current Stage
• We now have so many models.
• It may be important to know the right model, to extrapolate and control future behavior.
• Given a proposed underlying model, we need tools to help us validate it.
• We appear to be entering the validation stage of research... BUT the first steps have focused on invalidation rather than validation.


Examples: Invalidation
• Lakhina, Byers, Crovella, Xie
  - Show that the observed power law of Internet topology might be due to biases in traceroute sampling.
• Chen, Chang, Govindan, Jamin, Shenker, Willinger
  - Show that Internet topology has characteristics that do not match preferential-attachment graphs.
  - Suggest an alternative mechanism.
    • But does this alternative match all characteristics, or are we still missing some?


My (Biased) View
• Invalidation is an important part of the process! BUT it is inherently different from validating a model.
• Validating seems much harder.
• Indeed, it is arguable what constitutes a validation.
• Question: what should it mean to say "This model is consistent with observed data"?


Time-Series/Trace Analysis
• Many models posit some sort of actions.
  - New pages linking to pages in the Web.
  - New routers joining the network.
  - New files appearing in a file system.
• A validation approach: gather traces and see if the traces suitably match the model.
  - Trace gathering can be a challenging systems problem.
  - Checking the match against the model requires appropriate statistical techniques and tests.
  - May lead to new, improved, better justified models.


Sampling and Trace Analysis
• Often, cannot record all actions.
  - Internet is too big!
• Sampling
  - Global: snapshots of entire system at various times.
  - Local: record actions of sample agents in a system.
• Examples:
  - Snapshots of file systems: full systems vs. actions of individual users.
  - Router topology: Internet maps vs. changes at subset of routers.
• Question: how much/what kind of sampling is sufficient to validate a model appropriately?
  - Does this differ among models?

To Control
• In many systems, intervention can impact the outcome.
  - Maybe not for earthquakes, but for computer networks!
  - Typical setting: individual agents acting in their own best interest, giving a global power law. Agents can be given incentives to change behavior.
• General problem: given a good model, determine how to change system behavior to optimize a global performance function.
  - Distributed algorithmic mechanism design.
  - Mix of economics/game theory and computer science.

Possible Control Approaches
• Adding constraints: local or global
  - Example: total space in a file system.
  - Example: preferential attachment but links limited by an underlying metric.
• Add incentives or costs
  - Example: charges for exceeding soft disk quotas.
  - Example: payments for certain AS level connections.
• Limiting information
  - Impact decisions by not letting everyone have a true view of the system.


Conclusion: My (Biased) View
• There are 5 stages of power law research.
  1) Observe: Gather data to demonstrate power law behavior in a system.
  2) Interpret: Explain the import of this observation in the system context.
  3) Model: Propose an underlying model for the observed behavior of the system.
  4) Validate: Find data to validate (and if necessary specialize or modify) the model.
  5) Control: Design ways to control and modify the underlying behavior of the system based on the model.
• We need to focus on validation and control.
  - Lots of open research problems.

A Chance for Collaboration
• The observe/interpret stages of research are dominated by systems; modeling is dominated by theory.
  - And we need new insights from statistics, control theory, economics!
• Validation and control require a strong theoretical foundation.
  - Need universal ideas and methods that span different types of systems.
  - Need understanding of underlying mathematical models.
• But also a large systems buy-in.
  - Getting/analyzing/understanding data.
  - Find avenues for real impact.
• Good area for future systems/theory/others collaboration and interaction.

