I built a multi-agent reinforcement learning (RL) environment to explore cooperative/competitive behavior. Agents learn decentralized policies in a simulation with partial observability and private communication channels. A Rust simulation engine connects to PyTorch agents via gRPC. I implemented REINFORCE, PPO, and A2C with exploration bonuses.
Communicating Agents Seeking Information
David James
April 2019
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
1 Motivation
Personal Motivations
I am fascinated by:
simulation as a tool
building agents that learn
understanding multi-agent systems
understanding how communication affects behavior
My educational background:
Liberal Arts + ECE (undergraduate)
Public Policy + CS (graduate)
Research Question
How do decentralized* communicating agents behave?
What behavior patterns emerge?
How do incentives affect behavior?
Do agents reach global optima?
What are the algorithmic challenges?
Design Process
What kind of simulation can explore these questions?
Let's walk through some design questions:
How does communication work?
How does learning (optimization) work?
Communication Type
[Diagram: Shared Communication Channel, with all agents reading from and writing to one shared channel]
[Diagram: Private Communication Channels, with each pair of agents communicating over its own channel]
Learning Type
[Diagram: Centralized Learning, where the agents' observations and actions all pass through a single centralized learner interacting with the Simulator(t)]
[Diagram: Decentralized Learning, where each agent receives its own observation from the Simulator(t), chooses its own action, and learns independently]
2 Simulation
Simulation Overview
You can think of the CASI simulation as a game:
There are n_a agents.
Each episode consists of T steps.
If an agent correctly guesses an unsolved goal attribute, it gains points.
If an agent guesses all of its n_g goal attributes, it gains many points and wins.
Various actions are penalized (more on this later); for example, asking a question has a small cost.
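To make the reward side concrete, here is a hedged sketch of how a single guess might be scored; the function and its arguments are hypothetical, but the names mirror the Incentives struct shown later.

/// Illustrative only: a hypothetical helper scoring one guess, using
/// incentive names from the Incentives struct shown later.
fn guess_reward(
    correct: bool,           // did the guess match the hidden value?
    solved_last_goal: bool,  // did this guess complete all n_g goal attributes?
    reward_correct: f64,
    reward_win: f64,
    cost_incorrect_guess: f64,
) -> f64 {
    if correct {
        // A correct guess on an unsolved goal attribute gains points;
        // completing the final goal attribute also wins the episode.
        reward_correct + if solved_last_goal { reward_win } else { 0.0 }
    } else {
        -cost_incorrect_guess
    }
}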
Simulation Attributes
Each simulation has a configurable number of attributes.
Here is an example with three agents and six attributes:
[Figure: Attribute Vectors, showing one six-element vector for each of agent 1, agent 2, and agent 3, plus the global vector; cells are marked as known, unknown, or goal (see key).]
Agent's Perspective
Each agent sees an observation and must choose an action.
An observation consists of:
the agent's internal state
interactions with other agents (answers received, questions asked)
An action consists of:
intentions on how to respond to questions
choosing what to do next, one of: do nothing, guess an attribute, or ask a question
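To make this concrete, here is a minimal Rust sketch of these two shapes; all names and field layouts are my own guesses, not the simulator's actual types.

/// What an agent knows about one attribute (illustrative).
enum AttrView {
    Unknown,
    Known(u8),
}

type AgentId = u16;

/// Hypothetical observation shape: internal state plus recent interactions.
struct Observation {
    /// The agent's internal state: its current view of each attribute.
    attrs: Vec<AttrView>,
    /// Answers received since the last step: (from, attribute index, reported value).
    answers: Vec<(AgentId, u16, Option<u8>)>,
    /// Questions other agents asked of this agent: (from, attribute index).
    questions: Vec<(AgentId, u16)>,
}

/// How to respond to a pending question (expanded into five cases two slides later).
enum ReplyIntention {
    Lie,
    Ignore,
    TellTruth,
}

/// What to do next.
enum Move {
    DoNothing,
    Guess { attr_idx: u16, value: u8 },
    Ask { to: AgentId, attr_idx: u16 },
}

/// Hypothetical action shape: reply intentions plus the next move.
struct Action {
    reply_intentions: Vec<ReplyIntention>,
    next: Move,
}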
Communication Protocol
agent i says to agent j: "tell me about attribute a"
agent j chooses an intention (i.e., to lie, ignore, or tell the truth)
the simulation converts the intention to an answer and delivers it to agent i
Agent Intentions
Intention          | Response if unknown (i.e. None) | Response if known (i.e. Some(v))
specific lie       | Value(r)                        | Value(r), where r ≠ v
general lie        | Known                           | Unknown
ignore             | None                            | None
incomplete truth   | Unknown                         | Known
complete truth     | Unknown                         | Value(v)
where r is a random value
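Read as a function, the table maps (intention, what the answerer actually knows) to a response. A hedged Rust sketch follows; the type and variant names are guesses, apart from Answer::Value, which appears in the consistent-lying code later.

/// Possible responses delivered to the questioner (illustrative variants).
enum Answer {
    None,      // no reply at all
    Unknown,   // "I don't know that attribute"
    Known,     // "I know it" (without revealing the value)
    Value(u8), // a concrete value, truthful or not
}

/// The five intentions from the table above.
enum Intention {
    SpecificLie,
    GeneralLie,
    Ignore,
    IncompleteTruth,
    CompleteTruth,
}

/// Map an intention and the answerer's actual knowledge to a response.
/// `lie_value` stands in for the deterministic lie shown later (specific_lie).
fn respond(
    intention: Intention,
    known: Option<u8>,
    lie_value: impl Fn(Option<u8>) -> u8,
) -> Answer {
    use Intention::*;
    match (intention, known) {
        (SpecificLie, v) => Answer::Value(lie_value(v)), // r != v when the value is known
        (GeneralLie, None) => Answer::Known,
        (GeneralLie, Some(_)) => Answer::Unknown,
        (Ignore, _) => Answer::None,
        (IncompleteTruth, None) => Answer::Unknown,
        (IncompleteTruth, Some(_)) => Answer::Known,
        (CompleteTruth, None) => Answer::Unknown,
        (CompleteTruth, Some(v)) => Answer::Value(v),
    }
}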
Incentives
let incentives = Incentives { // Experiment #1
cost_ask_question: 2.0,
cost_ask_self_question: 20.0,
cost_excessive_guessing: 20.0,
cost_incorrect_guess: 20.0,
cost_known_guess: 20.0,
cost_non_goal_guess: 20.0,
cost_solved_guess: 20.0,
cost_unnecessary_reply: 20.0,
max_guesses_per_attr: 1,
reward_correct: 100.0,
reward_win: 200.0,
};
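To make the scale concrete: with these Experiment #1 values, an agent that asks one question and then correctly guesses that attribute nets 100.0 - 2.0 = 98.0 for the exchange, while any of the penalized guesses costs 20.0.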
CASI Architecture
[Diagram: three Agents, each connected to the Simulator over gRPC]
Simulator (written in Rust)
Learning Agents (written with PyTorch)
Interprocess communication (uses gRPC)
Why Rust?
Rust helps me write (and refactor) fast and correct software, quickly.
Performance: no runtime, no garbage collector
Reliability: an expressive type system gives memory safety and thread safety
Productivity: package manager, libraries, community
Why PyTorch?
As you might expect, PyTorch:
manages a computation graph and auto-differentiation
includes reusable modules for neural networks
has CPU and GPU support
In comparison to TensorFlow, I find PyTorch is:
more natural (has a more intuitive API)
easier to debug
Why gRPC?
Simple service definition:
easy refactoring
Works across languages and platforms:
conveniently bridges Rust and Python
public and private use cases (RPC, APIs)
Start quickly and scale:
works on one machine
viable for large multi-agent systems
Neural Network gRPC Definition
Here is how I use gRPC to define a neural network model for a learning agent:
// Neural network for action prefs and value function
service Net {
rpc Init (NetCfg) returns (Empty) {}
rpc InitHidden (Empty) returns (Empty) {}
rpc Predict (Observations) returns (Predictions) {}
rpc ResetOptimizer (OptimCfg) returns (Empty) {}
rpc Train (Sequences) returns (Loss) {}
rpc GetParams (Empty) returns (Params) {}
rpc SetParams (Params) returns (Empty) {}
}
From this file, gRPC tooling can generate client and server code in a wide variety of languages.
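For example, the simulator can drive a generated client for this service. The deck does not say which Rust gRPC crate is used, so the sketch below assumes the tonic crate, a build script that compiles the .proto, and a hypothetical package name casi; the address is illustrative.

// Sketch only: assumes tonic-generated bindings produced by a build.rs
// compiling the .proto above; the package name `casi` is hypothetical.
pub mod casi {
    tonic::include_proto!("casi");
}
use casi::net_client::NetClient;
use casi::Observations;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to one learning agent's Net service (illustrative address).
    let mut client = NetClient::connect("http://127.0.0.1:50051").await?;

    // Ask the agent's network for action preferences and value estimates.
    let request = tonic::Request::new(Observations::default());
    let predictions = client.predict(request).await?.into_inner();
    println!("predictions: {:?}", predictions);
    Ok(())
}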
3 Algorithms
Reinforcement Learning
"Reinforcement learning (RL) is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal." - Sutton & Barto
[Figure 3.1 from Sutton & Barto: the agent-environment interaction in a Markov decision process. At each step the agent takes action A_t, and the environment returns reward R_{t+1} and state S_{t+1}.]
Decentralized RL
You may remember this diagram from before:
[Diagram: decentralized learning, with each agent exchanging observations and actions with the Simulator(t) and learning independently]
Can you think about some pros and cons?
RL Dimensions
Environmental considerations:
states: fully-observed vs. partially-observed
actions: discrete vs. continuous
Algorithmic considerations:
policies: deterministic vs. stochastic
on-policy vs. off-policy
model-based vs. model-free *
Models in RL
I briefly wanted to mention what "model" means in the context of RL, since it can be confusing.
Ben Recht says it well in A Tour of Reinforcement Learning: The
View from Continuous Control (2018):
The term model-free almost always means no model
of the state transition function when casually claimed in
reinforcement learning research. However, this does not
mean that modeling is not heavily built into the
assumptions of model-free RL algorithms.
Algorithm Experimentation
I've implemented these algorithms from scratch in Rust and PyTorch:
REINFORCE (vanilla)
REINFORCE with baseline
Proximal Policy Optimization (PPO)
Advantage Actor Critic (A2C)
I'm currently using A2C + PPO with exploration bonuses.
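One ingredient shared by all of these variants is the discounted return-to-go that scales the log-probability gradient. A small self-contained sketch (my own illustrative code, not the project's):

/// Discounted returns-to-go: G_t = sum over k from t to T-1 of gamma^(k-t) * r_k.
/// REINFORCE scales grad log pi(A_t|S_t) by G_t (minus a baseline, if one is used).
fn returns_to_go(rewards: &[f64], gamma: f64) -> Vec<f64> {
    let mut returns = vec![0.0; rewards.len()];
    let mut acc = 0.0;
    for t in (0..rewards.len()).rev() {
        acc = rewards[t] + gamma * acc;
        returns[t] = acc;
    }
    returns
}

fn main() {
    // Example: three steps with rewards -2.0 (ask), -2.0 (ask), 100.0 (correct guess).
    let g = returns_to_go(&[-2.0, -2.0, 100.0], 0.99);
    println!("{:?}", g); // approximately [94.03, 97.0, 100.0]
}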
REINFORCE
REINFORCE: Monte-Carlo Policy-Gradient Control (episodic), for π_*

Input: a differentiable policy parameterization π(a|s, θ)
Algorithm parameter: step size α > 0
Initialize policy parameter θ ∈ R^d' (e.g., to 0)

Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following π(·|·, θ)
    Loop for each step of the episode t = 0, 1, ..., T-1:
        G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k        (G_t)
        θ ← θ + α γ^t G ∇ ln π(A_t|S_t, θ)
Reinforcement Learning: An Introduction by Sutton & Barto, 2018
PPO Actor-Critic
The advantage estimator (which reduces to Equation (10) of the paper when λ = 1):

    Â_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1},
    where δ_t = r_t + γ V(s_{t+1}) - V(s_t)

"A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collect T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD (or usually for better performance, Adam [KB14]), for K epochs."

Algorithm 1 PPO, Actor-Critic Style
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_{θ_old} in environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
end for
Proximal Policy Optimization Algorithms by Schulman et al. 2017
Exploration
There are many kinds of exploration in RL:
ε-greedy exploration *
a result of stochastic policies
entropy-based exploration
seeking novel (state, action) pairs
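As a small illustration of the first item, an ε-greedy selector over action preferences might look like this (illustrative only; assumes the rand crate and a non-empty preference slice):

use rand::Rng;

/// epsilon-greedy: with probability epsilon pick a uniformly random action,
/// otherwise pick the action with the highest preference.
fn epsilon_greedy(prefs: &[f64], epsilon: f64) -> usize {
    let mut rng = rand::thread_rng();
    if rng.gen::<f64>() < epsilon {
        rng.gen_range(0..prefs.len())
    } else {
        prefs
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }
}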
Count-Based Exploration with Hashing
From #Exploration by Tang et al., 2017:
"a simple generalization of the classic count-based approach [...]. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory."
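A compact sketch of that idea in Rust (illustrative; the bonus coefficient beta and the choice of hash function are assumptions, not values from this project):

use std::collections::HashMap;

/// Count-based exploration bonus in the spirit of Tang et al., 2017:
/// hash the state, count occurrences of each hash code, and grant a
/// bonus of beta / sqrt(n(hash)).
struct CountBonus {
    counts: HashMap<u64, u64>,
    beta: f64,
}

impl CountBonus {
    fn new(beta: f64) -> Self {
        CountBonus { counts: HashMap::new(), beta }
    }

    /// Record one visit to the state with this hash code and return the bonus.
    fn bonus(&mut self, state_hash: u64) -> f64 {
        let n = self.counts.entry(state_hash).or_insert(0);
        *n += 1;
        self.beta / (*n as f64).sqrt()
    }
}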
Locality Sensitive Hashing (LSH)
From Wikipedia:
"LSH hashes input items so that similar items map to the same buckets with high probability [...]. LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a collision for similar items."
Arguably, the key criterion for choosing an LSH algorithm is the relevant distance metric.
LSH for Hamming Distance
Early LSH papers on Hamming distances suggest using single-column sampling techniques: each output bit is created simply by sampling a single column from the input vector.
However, sampling only a fraction of the input vector seems suboptimal. I would expect benefits from drawing information from all columns.
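For reference, the classic single-column (bit-sampling) scheme is tiny; here is an illustrative sketch:

/// Bit-sampling LSH for Hamming distance: each output bit is one sampled
/// column of the binary input vector. `sampled_columns` holds up to 64
/// column indices, chosen once, uniformly at random.
fn bit_sample_hash(input: &[u8], sampled_columns: &[usize]) -> u64 {
    sampled_columns
        .iter()
        .take(64)
        .enumerate()
        .fold(0u64, |code, (i, &col)| code | (((input[col] & 1) as u64) << i))
}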
Hamming Distance LSH with Super-Bits
I blended two approaches from the literature as follows:
Sample from each column, but only once. Otherwise, differences can be accentuated.
Designate groups of the output bits. This allows trading off the number of anchor points for the distance calculation against the amount of compression. (Inspired by Super-Bit LSH by Jie et al., 2012.)
Combine columns using the Hamming distance. Store the result in each super-bit group.
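The slide leaves the details open, so the following Rust sketch is only one possible reading, not necessarily the exact scheme used in CASI: partition the input columns into super-bit groups and store, per group, the Hamming distance between the input and a fixed random anchor vector.

/// One possible reading of the super-bit Hamming hash (hypothetical details):
/// each input column is used exactly once, the columns are partitioned into
/// `groups`, and each group stores the Hamming distance between the input
/// and a fixed anchor vector restricted to that group's columns.
fn super_bit_hash(input: &[u8], anchor: &[u8], groups: &[Vec<usize>]) -> Vec<u32> {
    groups
        .iter()
        .map(|cols| {
            cols.iter()
                .filter(|&&c| (input[c] & 1) != (anchor[c] & 1))
                .count() as u32
        })
        .collect()
}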
4 Experiments
Attributes
Here is an example of a Rust data structure to configure attributes for an experiment.
/// Attribute configuration
pub struct AttrCfg {
/// Total number of attributes in simulation
pub count: u16,
/// Attributes known from direct experience (per agent)
pub direct: u16,
/// To win, an agent must solve this many attributes
pub goal: u16,
/// Number of possible values for each attribute
pub values: u8,
}
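For example, the six-attribute setup pictured earlier might be configured like this (the count matches that figure; the other values are illustrative guesses):

let attr_cfg = AttrCfg {
    count: 6,   // six attributes in the simulation, as in the earlier figure
    direct: 2,  // attributes known from direct experience (illustrative)
    goal: 2,    // attributes an agent must solve to win (illustrative)
    values: 3,  // possible values per attribute (illustrative)
};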
Experiment 0 Description
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: Can this agent learn to win?
Answer: Perhaps surprisingly, it can, by exploiting the lack of random initialization. You might call it "Groundhog Day" learning.
Experiment 0 Output 1
Experiment 0 Output 2
Experiment 0 Output 3
Experiment 1 Description
Same as Experiment 0 but with random experiment initialization:
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: How can this agent learn to win?
Answer: It can't; communication would be required.
Experiment 2 Description
Two Agents:
One non-learning completely honest agent
One learning agent, with both:
a feedforward neural network
an LSTM
Question: How can this agent learn to win?
Answer: By asking questions of the honest agent.
Discuss: Is the LSTM necessary in this case?
5 Lessons Learned
Problem: Diverse Purposes for Runs
Since the simulation engine is a platform for experimentation, it has different purposes:
Verifying new algorithms
Inspecting policy changes over time
Exploring hyperparameters
Monitoring optimizer performance
Getting the verbosity level right is tricky: too little misses out on useful information, but too much creates clutter.
Solution: Granular Logging
One solution is to use fine-grained logging:
let log_cfg = LogCfg { // subset
log_action_probs: false,
log_actions_rewards: false,
log_attrs: false,
log_encodings: false,
log_experiment_start: true,
log_interactions: false,
log_iteration_csv: true,
log_learning: false,
log_obs_summary: true,
log_trajectories: false,
};
Problem: When to Stop Optimization?
Policy gradient (PG) methods aim to improve a policy over many iterations.
However, the policy's performance does not necessarily improve monotonically.
Empirically, PG methods, even with improvements such as PPO, can sometimes drift away from good performance and never recover.
See: Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods by Henderson et al., 2018.
Why do Policy Gradient Methods Degrade?
The Policy Gradient Theorem gives the correct gradient, so what could go wrong in practice?
Gradient ascent takes many iterations. How many? What does research say about the bounds as applied to PG?
The underlying function approximator* may lack sufficient representational power.
The network parameters may not be able to efficiently respond to the latest distributional shift.
Solution: Backtracking Algorithm
I implemented a three-level backtracking algorithm that requires three hyper-parameters:
/// Backtracking configuration
pub struct BacktrackCfg {
/// Snapshot pops before stopping early
pub pop_limit: u32,
/// Consecutive resets before snapshot pop
pub restart_limit: u32,
/// Consecutive unimproved iterations before reset
pub unimproved_limit: u32,
}
The algorithm is shown on the next slide.
Backtracking Algorithm
Algorithm 1: Three-level backtracking algorithm

snapshots ← []; pops ← 0; restarts ← 0; unimproved ← 0
for iteration ← 0 to max_iterations do
    if improved performance then
        snapshots.push(model); restarts ← 0; unimproved ← 0
    else unimproved ← unimproved + 1
    if unimproved > max_unimproved then
        restarts ← restarts + 1; unimproved ← 0
        if restarts > max_restarts then
            pops ← pops + 1
            if pops > max_pops then stop_early()
            else snapshots.pop(); restarts ← 0
        model ← snapshots.last()
    else
        model.optimization_step()
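For concreteness, here is a hedged Rust sketch of that loop written against the BacktrackCfg struct above; Model and its methods are hypothetical stand-ins for the real training interfaces.

/// Hypothetical stand-in for the learning agent's model and training calls.
#[derive(Clone)]
struct Model;

impl Model {
    /// Did the latest evaluation improve on the best seen so far? (stub)
    fn improved(&self) -> bool { false }
    /// Take one policy-gradient optimization step. (stub)
    fn optimization_step(&mut self) {}
}

/// Sketch of the three-level backtracking loop, driven by BacktrackCfg.
fn train_with_backtracking(cfg: &BacktrackCfg, model: &mut Model, max_iterations: u32) {
    let mut snapshots: Vec<Model> = vec![model.clone()];
    let (mut pops, mut restarts, mut unimproved) = (0u32, 0u32, 0u32);

    for _iteration in 0..max_iterations {
        if model.improved() {
            snapshots.push(model.clone());
            restarts = 0;
            unimproved = 0;
        } else {
            unimproved += 1;
        }

        if unimproved > cfg.unimproved_limit {
            restarts += 1;
            unimproved = 0;
            if restarts > cfg.restart_limit {
                pops += 1;
                if pops > cfg.pop_limit {
                    return; // stop early
                }
                snapshots.pop(); // discard the most recent snapshot
                restarts = 0;
            }
            // Restore the most recent remaining snapshot.
            if let Some(last) = snapshots.last() {
                *model = last.clone();
            }
        } else {
            model.optimization_step();
        }
    }
}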
Problem: Consistency of Responses
An agent reports an intention to the simulation; the simulation converts it to a response that another agent observes.
[Image omitted; source: http://factmyth.com]
Claim: inconsistent lies are easier to discover.
How can the simulation achieve consistent lying?
Consistent Lying
/// Return a deterministic lie that depends only on:
/// 1. the function arguments
/// 2. the agent's private salt
/// Note: 'q' / 'a' = agent id of questioner / answerer
fn specific_lie(
    &self, q: AgentId, attr_idx: u16, attr_val: Option<u8>
) -> Answer {
    let m = self.attr_cfg.values as u64;
    let hash = self.calc_hash(q, attr_idx);
    let lie = match attr_val {
        // The +1 offset keeps the result in 1..m away from the true value,
        // so the lie never equals the truth.
        Some(true_val) =>
            (true_val as u64 + 1 + (hash % (m - 1))) % m,
        None => hash % m,
    };
    Answer::Value(lie as u8)
}
6 Next Steps
LSTM architecture & parameter tuning
Create experiments with multiple learning agents
Improve visualizations
Try various multi-agent reward structures
Use unsupervised learning to find patterns in policies
Improve efficiency by using parallelism in Rust
Discussion
I look forward to your questions and comments.