I built a multi-agent reinforcement learning (RL) environment to explore cooperative/competitive behavior. Agents learn decentralized policies in a simulation with partial observability and private communication channels. A Rust simulation engine connects to PyTorch agents via gRPC. I implemented REINFORCE, PPO, and A2C with exploration bonuses.
Communicating Agents Seeking Information
David James
April 2019
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
1 Motivation
Personal Motivations
I am fascinated by:
simulation as a tool
building agents that learn
understanding multi-agent systems
understanding how communication affects behavior
My educational background:
Liberal Arts + ECE (undergraduate)
Public Policy + CS (graduate)
Research Question
How do decentralized* communicating agents behave?
What behavior patterns emerge?
How do incentives affect behavior?
Do agents reach global optima?
What are the algorithmic challenges?
Design Process
What kind of simulation can explore these questions?
Let's walk through some design questions:
How does communication work?
How does learning (optimization) work?
Communication Type
[Diagram: Shared Communication Channel, with all agents reading from and writing to one shared channel]
[Diagram: Private Communication Channels, with each pair of agents communicating over its own channel]
Learning Type
[Diagram: Centralized Learning, where the agents' observations and actions all pass through a single centralized learner interacting with the Simulator(t)]
[Diagram: Decentralized Learning, where each agent receives its own observation from the Simulator(t), chooses its own action, and learns independently]
2 Simulation
Simulation Overview
You can think of the CASI simulation as a game:
There are n_a agents.
Each episode consists of T steps.
If an agent correctly guesses an unsolved goal attribute, it gains points.
If an agent guesses all of its n_g goal attributes, it gains many points and wins.
Various actions are penalized (more on this later); for example, asking a question has a small cost.
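To make the reward side concrete, here is a hedged sketch of how a single guess might be scored; the function and its arguments are hypothetical, but the names mirror the Incentives struct shown later.

/// Illustrative only: a hypothetical helper scoring one guess, using
/// incentive names from the Incentives struct shown later.
fn guess_reward(
    correct: bool,           // did the guess match the hidden value?
    solved_last_goal: bool,  // did this guess complete all n_g goal attributes?
    reward_correct: f64,
    reward_win: f64,
    cost_incorrect_guess: f64,
) -> f64 {
    if correct {
        // A correct guess on an unsolved goal attribute gains points;
        // completing the final goal attribute also wins the episode.
        reward_correct + if solved_last_goal { reward_win } else { 0.0 }
    } else {
        -cost_incorrect_guess
    }
}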
Simulation Attributes
Each simulation has a configurable number of attributes.
Here is an example with three agents and six attributes:
[Figure: Attribute Vectors, showing one six-element vector for each of agent 1, agent 2, and agent 3, plus the global vector; cells are marked as known, unknown, or goal (see key).]
Agent's Perspective
Each agent sees an observation and must choose an action.
An observation consists of:
the agent's internal state
interactions with other agents (answers received, questions asked)
An action consists of:
intentions on how to respond to questions
choosing what to do next, one of: do nothing, guess an attribute, or ask a question
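To make this concrete, here is a minimal Rust sketch of these two shapes; all names and field layouts are my own guesses, not the simulator's actual types.

/// What an agent knows about one attribute (illustrative).
enum AttrView {
    Unknown,
    Known(u8),
}

type AgentId = u16;

/// Hypothetical observation shape: internal state plus recent interactions.
struct Observation {
    /// The agent's internal state: its current view of each attribute.
    attrs: Vec<AttrView>,
    /// Answers received since the last step: (from, attribute index, reported value).
    answers: Vec<(AgentId, u16, Option<u8>)>,
    /// Questions other agents asked of this agent: (from, attribute index).
    questions: Vec<(AgentId, u16)>,
}

/// How to respond to a pending question (expanded into five cases two slides later).
enum ReplyIntention {
    Lie,
    Ignore,
    TellTruth,
}

/// What to do next.
enum Move {
    DoNothing,
    Guess { attr_idx: u16, value: u8 },
    Ask { to: AgentId, attr_idx: u16 },
}

/// Hypothetical action shape: reply intentions plus the next move.
struct Action {
    reply_intentions: Vec<ReplyIntention>,
    next: Move,
}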
Communication Protocol
agent i says to agent j: "tell me about attribute a"
agent j chooses an intention (i.e., to lie, ignore, or tell the truth)
the simulation converts the intention to an answer and delivers it to agent i
Agent Intentions
Intention          | Response if unknown (i.e. None) | Response if known (i.e. Some(v))
specific lie       | Value(r)                        | Value(r), where r ≠ v
general lie        | Known                           | Unknown
ignore             | None                            | None
incomplete truth   | Unknown                         | Known
complete truth     | Unknown                         | Value(v)
where r is a random value
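Read as a function, the table maps (intention, what the answerer actually knows) to a response. A hedged Rust sketch follows; the type and variant names are guesses, apart from Answer::Value, which appears in the consistent-lying code later.

/// Possible responses delivered to the questioner (illustrative variants).
enum Answer {
    None,      // no reply at all
    Unknown,   // "I don't know that attribute"
    Known,     // "I know it" (without revealing the value)
    Value(u8), // a concrete value, truthful or not
}

/// The five intentions from the table above.
enum Intention {
    SpecificLie,
    GeneralLie,
    Ignore,
    IncompleteTruth,
    CompleteTruth,
}

/// Map an intention and the answerer's actual knowledge to a response.
/// `lie_value` stands in for the deterministic lie shown later (specific_lie).
fn respond(
    intention: Intention,
    known: Option<u8>,
    lie_value: impl Fn(Option<u8>) -> u8,
) -> Answer {
    use Intention::*;
    match (intention, known) {
        (SpecificLie, v) => Answer::Value(lie_value(v)), // r != v when the value is known
        (GeneralLie, None) => Answer::Known,
        (GeneralLie, Some(_)) => Answer::Unknown,
        (Ignore, _) => Answer::None,
        (IncompleteTruth, None) => Answer::Unknown,
        (IncompleteTruth, Some(_)) => Answer::Known,
        (CompleteTruth, None) => Answer::Unknown,
        (CompleteTruth, Some(v)) => Answer::Value(v),
    }
}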
Incentives
let incentives = Incentives { // Experiment #1
cost_ask_question: 2.0,
cost_ask_self_question: 20.0,
cost_excessive_guessing: 20.0,
cost_incorrect_guess: 20.0,
cost_known_guess: 20.0,
cost_non_goal_guess: 20.0,
cost_solved_guess: 20.0,
cost_unnecessary_reply: 20.0,
max_guesses_per_attr: 1,
reward_correct: 100.0,
reward_win: 200.0,
};
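To make the scale concrete: with these Experiment #1 values, an agent that asks one question and then correctly guesses that attribute nets 100.0 - 2.0 = 98.0 for the exchange, while any of the penalized guesses costs 20.0.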
CASI Architecture
[Diagram: three Agents, each connected to the Simulator over gRPC]
Simulator (written in Rust)
Learning Agents (written with PyTorch)
Interprocess communication (uses gRPC)
Why Rust?
Rust helps me write (and refactor) fast and correct software, quickly.
Performance: no runtime, no garbage collector
Reliability: an expressive type system gives memory safety and thread safety
Productivity: package manager, libraries, community
Why PyTorch?
As you might expect, PyTorch:
manages a computation graph and auto-differentiation
includes reusable modules for neural networks
has CPU and GPU support
In comparison to TensorFlow, I find PyTorch is:
more natural (has a more intuitive API)
easier to debug
Why gRPC?
Simple service definition:
easy refactoring
Works across languages and platforms:
conveniently bridges Rust and Python
public and private use cases (RPC, APIs)
Start quickly and scale:
works on one machine
viable for large multi-agent systems
Neural Network gRPC Definition
Here is how I use gRPC to define a neural network model for a learning agent:
// Neural network for action prefs and value function
service Net {
rpc Init (NetCfg) returns (Empty) {}
rpc InitHidden (Empty) returns (Empty) {}
rpc Predict (Observations) returns (Predictions) {}
rpc ResetOptimizer (OptimCfg) returns (Empty) {}
rpc Train (Sequences) returns (Loss) {}
rpc GetParams (Empty) returns (Params) {}
rpc SetParams (Params) returns (Empty) {}
}
From this file, gRPC tooling can generate client and server code in a wide variety of languages.
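For example, the simulator can drive a generated client for this service. The deck does not say which Rust gRPC crate is used, so the sketch below assumes the tonic crate, a build script that compiles the .proto, and a hypothetical package name casi; the address is illustrative.

// Sketch only: assumes tonic-generated bindings produced by a build.rs
// compiling the .proto above; the package name `casi` is hypothetical.
pub mod casi {
    tonic::include_proto!("casi");
}
use casi::net_client::NetClient;
use casi::Observations;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to one learning agent's Net service (illustrative address).
    let mut client = NetClient::connect("http://127.0.0.1:50051").await?;

    // Ask the agent's network for action preferences and value estimates.
    let request = tonic::Request::new(Observations::default());
    let predictions = client.predict(request).await?.into_inner();
    println!("predictions: {:?}", predictions);
    Ok(())
}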
3 Algorithms
Reinforcement Learning
"Reinforcement learning (RL) is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal." - Sutton & Barto
[Figure 3.1 from Sutton & Barto: the agent-environment interaction in a Markov decision process. At each step the agent takes action A_t, and the environment returns reward R_{t+1} and state S_{t+1}.]
Decentralized RL
You may remember this diagram from before:
[Diagram: decentralized learning, with each agent exchanging observations and actions with the Simulator(t) and learning independently]
Can you think about some pros and cons?
RL Dimensions
Environmental considerations:
states: fully-observed vs. partially-observed
actions: discrete vs. continuous
Algorithmic considerations:
policies: deterministic vs. stochastic
on-policy vs. off-policy
model-based vs. model-free *
Models in RL
I briefly wanted to mention what "model" means in the context of RL, since it can be confusing.
Ben Recht says it well in A Tour of Reinforcement Learning: The
View from Continuous Control (2018):
The term model-free almost always means no model
of the state transition function when casually claimed in
reinforcement learning research. However, this does not
mean that modeling is not heavily built into the
assumptions of model-free RL algorithms.
Algorithm Experimentation
I've implemented these algorithms from scratch in Rust and PyTorch:
REINFORCE (vanilla)
REINFORCE with baseline
Proximal Policy Optimization (PPO)
Advantage Actor Critic (A2C)
I'm currently using A2C + PPO with exploration bonuses.
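One ingredient shared by all of these variants is the discounted return-to-go that scales the log-probability gradient. A small self-contained sketch (my own illustrative code, not the project's):

/// Discounted returns-to-go: G_t = sum over k from t to T-1 of gamma^(k-t) * r_k.
/// REINFORCE scales grad log pi(A_t|S_t) by G_t (minus a baseline, if one is used).
fn returns_to_go(rewards: &[f64], gamma: f64) -> Vec<f64> {
    let mut returns = vec![0.0; rewards.len()];
    let mut acc = 0.0;
    for t in (0..rewards.len()).rev() {
        acc = rewards[t] + gamma * acc;
        returns[t] = acc;
    }
    returns
}

fn main() {
    // Example: three steps with rewards -2.0 (ask), -2.0 (ask), 100.0 (correct guess).
    let g = returns_to_go(&[-2.0, -2.0, 100.0], 0.99);
    println!("{:?}", g); // approximately [94.03, 97.0, 100.0]
}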
REINFORCE
REINFORCE: Monte-Carlo Policy-Gradient Control (episodic), for π_*

Input: a differentiable policy parameterization π(a|s, θ)
Algorithm parameter: step size α > 0
Initialize policy parameter θ ∈ R^d' (e.g., to 0)

Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following π(·|·, θ)
    Loop for each step of the episode t = 0, 1, ..., T-1:
        G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k        (G_t)
        θ ← θ + α γ^t G ∇ ln π(A_t|S_t, θ)
Reinforcement Learning: An Introduction by Sutton & Barto, 2018
PPO Actor-Critic
The advantage estimator (which reduces to Equation (10) of the paper when λ = 1):

    Â_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1},
    where δ_t = r_t + γ V(s_{t+1}) - V(s_t)

"A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collect T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD (or usually for better performance, Adam [KB14]), for K epochs."

Algorithm 1 PPO, Actor-Critic Style
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_{θ_old} in environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
end for
Proximal Policy Optimization Algorithms by Schulman et al. 2017
Exploration
There are many kinds of exploration in RL:
ε-greedy exploration *
a result of stochastic policies
entropy-based exploration
seeking novel (state, action) pairs
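As a small illustration of the first item, an ε-greedy selector over action preferences might look like this (illustrative only; assumes the rand crate and a non-empty preference slice):

use rand::Rng;

/// epsilon-greedy: with probability epsilon pick a uniformly random action,
/// otherwise pick the action with the highest preference.
fn epsilon_greedy(prefs: &[f64], epsilon: f64) -> usize {
    let mut rng = rand::thread_rng();
    if rng.gen::<f64>() < epsilon {
        rng.gen_range(0..prefs.len())
    } else {
        prefs
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }
}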
Count-Based Exploration with Hashing
From #Exploration by Tang et al., 2017:
"a simple generalization of the classic count-based approach [...]. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory."
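A compact sketch of that idea in Rust (illustrative; the bonus coefficient beta and the choice of hash function are assumptions, not values from this project):

use std::collections::HashMap;

/// Count-based exploration bonus in the spirit of Tang et al., 2017:
/// hash the state, count occurrences of each hash code, and grant a
/// bonus of beta / sqrt(n(hash)).
struct CountBonus {
    counts: HashMap<u64, u64>,
    beta: f64,
}

impl CountBonus {
    fn new(beta: f64) -> Self {
        CountBonus { counts: HashMap::new(), beta }
    }

    /// Record one visit to the state with this hash code and return the bonus.
    fn bonus(&mut self, state_hash: u64) -> f64 {
        let n = self.counts.entry(state_hash).or_insert(0);
        *n += 1;
        self.beta / (*n as f64).sqrt()
    }
}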
Locality Sensitive Hashing (LSH)
From Wikipedia:
"LSH hashes input items so that similar items map to the same buckets with high probability [...]. LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a collision for similar items."
Arguably, the key criterion for choosing an LSH algorithm is the relevant distance metric.
LSH for Hamming Distance
Early LSH papers on Hamming distances suggest using single-column sampling techniques: each output bit is created simply by sampling a single column from the input vector.
However, sampling only a fraction of the input vector seems suboptimal. I would expect benefits from drawing information from all columns.
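For reference, the classic single-column (bit-sampling) scheme is tiny; here is an illustrative sketch:

/// Bit-sampling LSH for Hamming distance: each output bit is one sampled
/// column of the binary input vector. `sampled_columns` holds up to 64
/// column indices, chosen once, uniformly at random.
fn bit_sample_hash(input: &[u8], sampled_columns: &[usize]) -> u64 {
    sampled_columns
        .iter()
        .take(64)
        .enumerate()
        .fold(0u64, |code, (i, &col)| code | (((input[col] & 1) as u64) << i))
}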
Hamming Distance LSH with Super-Bits
I blended two approaches from the literature as follows:
Sample from each column, but only once. Otherwise, differences can be accentuated.
Designate groups of the output bits. This allows trading off the number of anchor points for the distance calculation against the amount of compression. (Inspired by Super-Bit LSH by Jie et al., 2012.)
Combine columns using the Hamming distance. Store the result in each super-bit group.
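The slide leaves the details open, so the following Rust sketch is only one possible reading, not necessarily the exact scheme used in CASI: partition the input columns into super-bit groups and store, per group, the Hamming distance between the input and a fixed random anchor vector.

/// One possible reading of the super-bit Hamming hash (hypothetical details):
/// each input column is used exactly once, the columns are partitioned into
/// `groups`, and each group stores the Hamming distance between the input
/// and a fixed anchor vector restricted to that group's columns.
fn super_bit_hash(input: &[u8], anchor: &[u8], groups: &[Vec<usize>]) -> Vec<u32> {
    groups
        .iter()
        .map(|cols| {
            cols.iter()
                .filter(|&&c| (input[c] & 1) != (anchor[c] & 1))
                .count() as u32
        })
        .collect()
}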
4 Experiments
Attributes
Here is an example of a Rust data structure to configure attributes for an experiment.
/// Attribute configuration
pub struct AttrCfg {
/// Total number of attributes in simulation
pub count: u16,
/// Attributes known from direct experience (per agent)
pub direct: u16,
/// To win, an agent must solve this many attributes
pub goal: u16,
/// Number of possible values for each attribute
pub values: u8,
}
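For example, the six-attribute setup pictured earlier might be configured like this (the count matches that figure; the other values are illustrative guesses):

let attr_cfg = AttrCfg {
    count: 6,   // six attributes in the simulation, as in the earlier figure
    direct: 2,  // attributes known from direct experience (illustrative)
    goal: 2,    // attributes an agent must solve to win (illustrative)
    values: 3,  // possible values per attribute (illustrative)
};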
Experiment 0 Description
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: Can this agent learn to win?
Answer: Perhaps surprisingly, it can, by exploiting the lack of random initialization. You might call it "Groundhog Day" learning.
Experiment 0 Output 1
Experiment 0 Output 2
Experiment 0 Output 3
Experiment 1 Description
Same as Experiment 0 but with random experiment initialization:
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: How can this agent learn to win?
Answer: It can't; communication would be required.
Experiment 2 Description
Two Agents:
One non-learning completely honest agent
One learning agent, with both:
a feedforward neural network
an LSTM
Question: How can this agent learn to win?
Answer: By asking questions of the honest agent.
Discuss: Is the LSTM necessary in this case?
5 Lessons Learned
Problem: Diverse Purposes for Runs
Since the simulation engine is a platform for experimentation, it has different purposes:
Verifying new algorithms
Inspecting policy changes over time
Exploring hyperparameters
Monitoring optimizer performance
Getting the verbosity level right is tricky: too little misses out on useful information, but too much creates clutter.
Solution: Granular Logging
One solution is to use fine-grained logging:
let log_cfg = LogCfg { // subset
log_action_probs: false,
log_actions_rewards: false,
log_attrs: false,
log_encodings: false,
log_experiment_start: true,
log_interactions: false,
log_iteration_csv: true,
log_learning: false,
log_obs_summary: true,
log_trajectories: false,
};
Problem: When to Stop Optimization?
Policy gradient (PG) methods aim to improve a policy over many iterations.
However, the policy's performance does not necessarily improve monotonically.
Empirically, PG methods, even with improvements such as PPO, can sometimes drift away from good performance and never recover.
See: Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods by Henderson et al., 2018.
Why do Policy Gradient Methods Degrade?
The Policy Gradient Theorem gives the correct gradient, so what could go wrong in practice?
Gradient ascent takes many iterations. How many? What does research say about the bounds as applied to PG?
The underlying function approximator* may lack sufficient representational power.
The network parameters may not be able to efficiently respond to the latest distributional shift.
Solution: Backtracking Algorithm
I implemented a three-level backtracking algorithm that requires three hyper-parameters:
/// Backtracking configuration
pub struct BacktrackCfg {
/// Snapshot pops before stopping early
pub pop_limit: u32,
/// Consecutive resets before snapshot pop
pub restart_limit: u32,
/// Consecutive unimproved iterations before reset
pub unimproved_limit: u32,
}
The algorithm is shown on the next slide.
Backtracking Algorithm
Algorithm 1: Three-level backtracking algorithm

snapshots ← []; pops ← 0; restarts ← 0; unimproved ← 0
for iteration ← 0 to max_iterations do
    if improved performance then
        snapshots.push(model); restarts ← 0; unimproved ← 0
    else unimproved ← unimproved + 1
    if unimproved > max_unimproved then
        restarts ← restarts + 1; unimproved ← 0
        if restarts > max_restarts then
            pops ← pops + 1
            if pops > max_pops then stop_early()
            else snapshots.pop(); restarts ← 0
        model ← snapshots.last()
    else
        model.optimization_step()
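For concreteness, here is a hedged Rust sketch of that loop written against the BacktrackCfg struct above; Model and its methods are hypothetical stand-ins for the real training interfaces.

/// Hypothetical stand-in for the learning agent's model and training calls.
#[derive(Clone)]
struct Model;

impl Model {
    /// Did the latest evaluation improve on the best seen so far? (stub)
    fn improved(&self) -> bool { false }
    /// Take one policy-gradient optimization step. (stub)
    fn optimization_step(&mut self) {}
}

/// Sketch of the three-level backtracking loop, driven by BacktrackCfg.
fn train_with_backtracking(cfg: &BacktrackCfg, model: &mut Model, max_iterations: u32) {
    let mut snapshots: Vec<Model> = vec![model.clone()];
    let (mut pops, mut restarts, mut unimproved) = (0u32, 0u32, 0u32);

    for _iteration in 0..max_iterations {
        if model.improved() {
            snapshots.push(model.clone());
            restarts = 0;
            unimproved = 0;
        } else {
            unimproved += 1;
        }

        if unimproved > cfg.unimproved_limit {
            restarts += 1;
            unimproved = 0;
            if restarts > cfg.restart_limit {
                pops += 1;
                if pops > cfg.pop_limit {
                    return; // stop early
                }
                snapshots.pop(); // discard the most recent snapshot
                restarts = 0;
            }
            // Restore the most recent remaining snapshot.
            if let Some(last) = snapshots.last() {
                *model = last.clone();
            }
        } else {
            model.optimization_step();
        }
    }
}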
Problem: Consistency of Responses
An agent reports an intention to the simulation; the simulation converts it to a response that another agent observes.
[Image omitted; source: http://factmyth.com]
Claim: inconsistent lies are easier to discover.
How can the simulation achieve consistent lying?
Consistent Lying
/// Return a deterministic lie that depends only on:
/// 1. the function arguments
/// 2. the agent's private salt
/// Note: 'q' / 'a' = agent id of questioner / answerer
fn specific_lie(
    &self, q: AgentId, attr_idx: u16, attr_val: Option<u8>
) -> Answer {
    let m = self.attr_cfg.values as u64;
    let hash = self.calc_hash(q, attr_idx);
    let lie = match attr_val {
        // The +1 offset keeps the result in 1..m away from the true value,
        // so the lie never equals the truth.
        Some(true_val) =>
            (true_val as u64 + 1 + (hash % (m - 1))) % m,
        None => hash % m,
    };
    Answer::Value(lie as u8)
}
6 Next Steps
LSTM architecture & parameter tuning
Create experiments with multiple learning agents
Improve visualizations
Try various multi-agent reward structures
Use unsupervised learning to find patterns in policies
Improve efficiency by using parallelism in Rust
Discussion
I look forward to your questions and comments.