Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Communicating Agents Seeking Information
David James
April 2019
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Personal Motivations
I am fascinated by:
simulation as a tool
building agents that learn
understanding multi-agent systems
understanding how communication affects ...
My educational background:
Liberal Arts + ECE (undergraduate)
Public Policy + CS (graduate)
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Research Question
How do "decentralized"* communicating agents behave?
What behavior patterns emerge?
How do incentives affect behavior?
Do agents reach global optima?
What are the algorithmic challenges?
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Design Process
What kind of simulation can explore these questions?
Let's walk through some design questions:
How does communication work?
How does learning (optimization) work?
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Communication Type
[Diagram: Shared Communication Channel (agents connected through one shared channel)]
[Diagram: Private Communication Channels (agents connected by separate pairwise channels)]
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Learning Type
[Diagram: Centralized Learning (agents exchange observations and actions with Simulator(t); learning is centralized)]
[Diagram: Decentralized Learning (agents exchange observations and actions with Simulator(t); each agent learns independently)]
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Simulation Overview
You can think of the CASI simulation as a "game":
There are n_a agents.
Each episode consists of T steps.
If an agent correctly guesses an unsolved goal attribute, it
gains points.
If an agent guesses all of its n_g goal attributes, it gains many
points and wins.
Various actions are penalized (more on this later).
For example, asking a question has a small cost.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Simulation Attributes
Each simulation has a configurable number of attributes.
Here is an example with three agents and six attributes:
[Figure: Attribute Vectors for agent 1, agent 2, agent 3, and the global vector; each cell is either known (with its value), unknown, or marked as a goal]
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Agent's Perspective
Each agent sees an observation and must choose an action.
An observation consists of:
the agent's internal state
interactions with other agents
(answers received, questions asked)
An action consists of:
intentions on how to respond to questions
choosing what to do next, one of:
(do nothing, guess an attribute, ask a question)
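To make this concrete, here is a minimal Rust sketch of how an observation and the "what to do next" part of an action might be modeled. The type and field names are illustrative assumptions, not the simulator's actual definitions (the intention half of the action is covered under Agent Intentions).
// Illustrative sketch only: names and shapes are assumptions,
// not the simulator's actual types.
pub type AgentId = u32;

/// What an agent sees at each step.
pub struct Observation {
    /// Internal state: the value (if any) the agent knows for each attribute.
    pub known: Vec<Option<u8>>,
    /// Questions other agents asked this agent (questioner, attribute index).
    pub questions_received: Vec<(AgentId, u16)>,
    /// Answers received (answerer, attribute index, reported value if any).
    pub answers_received: Vec<(AgentId, u16, Option<u8>)>,
}

/// The "what to do next" part of an action.
pub enum Move {
    DoNothing,
    /// Guess that attribute `attr_idx` has value `value`.
    Guess { attr_idx: u16, value: u8 },
    /// Ask agent `to` about attribute `attr_idx`.
    Ask { to: AgentId, attr_idx: u16 },
}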
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Communication Protocol
agent i says to agent j:
"tell me about attribute a"
agent j chooses an intention
(i.e. to lie, ignore, or tell the truth)
the simulation converts the intention to an answer
and delivers it to agent i
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Agent Intentions
Intention        | Response if unknown (i.e. None) | Response if known (i.e. Some(v))
specific lie     | Value(r)                        | Value(r), where r ≠ v
general lie      | Known                           | Unknown
ignore           | None                            | None
incomplete truth | Unknown                         | Known
complete truth   | Unknown                         | Value(v)
(r is a random value)
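Here is a hedged Rust sketch of how the simulation might turn an intention plus the answerer's own knowledge into a response, following the table above. Except for Answer::Value (which appears in the lying code later), the names are assumptions.
// Sketch of the intention-to-response mapping in the table above.
// Variant names other than Answer::Value are assumptions.
pub enum Intention {
    SpecificLie,
    GeneralLie,
    Ignore,
    IncompleteTruth,
    CompleteTruth,
}

pub enum Answer {
    /// A concrete attribute value.
    Value(u8),
    /// Claim to know the value without revealing it.
    Known,
    /// Claim not to know the value.
    Unknown,
    /// No reply at all.
    None,
}

/// `known` is the answerer's knowledge of the attribute;
/// `r` is a random value chosen to differ from the true one.
pub fn respond(intention: Intention, known: Option<u8>, r: u8) -> Answer {
    match (intention, known) {
        (Intention::SpecificLie, _) => Answer::Value(r),
        (Intention::GeneralLie, None) => Answer::Known,
        (Intention::GeneralLie, Some(_)) => Answer::Unknown,
        (Intention::Ignore, _) => Answer::None,
        (Intention::IncompleteTruth, None) => Answer::Unknown,
        (Intention::IncompleteTruth, Some(_)) => Answer::Known,
        (Intention::CompleteTruth, None) => Answer::Unknown,
        (Intention::CompleteTruth, Some(v)) => Answer::Value(v),
    }
}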
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Incentives
let incentives = Incentives { // Experiment #1
cost_ask_question: 2.0,
cost_ask_self_question: 20.0,
cost_excessive_guessing: 20.0,
cost_incorrect_guess: 20.0,
cost_known_guess: 20.0,
cost_non_goal_guess: 20.0,
cost_solved_guess: 20.0,
cost_unnecessary_reply: 20.0,
max_guesses_per_attr: 1,
reward_correct: 100.0,
reward_win: 200.0,
};
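For illustration, a sketch of how these incentive values might enter an agent's per-step reward. The Outcome enum and the scoring logic below are assumptions about the simulator, not its actual code; only a few incentive fields are shown.
// Hypothetical sketch: how a few of the incentive values above might be applied.
pub struct Incentives {
    pub cost_ask_question: f64,
    pub cost_incorrect_guess: f64,
    pub reward_correct: f64,
    pub reward_win: f64,
    // ... remaining fields elided
}

/// What happened to the agent on this step (assumed, for illustration).
pub enum Outcome {
    AskedQuestion,
    IncorrectGuess,
    CorrectGuess { solved_all_goals: bool },
}

pub fn step_reward(inc: &Incentives, outcome: Outcome) -> f64 {
    match outcome {
        Outcome::AskedQuestion => -inc.cost_ask_question,
        Outcome::IncorrectGuess => -inc.cost_incorrect_guess,
        Outcome::CorrectGuess { solved_all_goals: false } => inc.reward_correct,
        // Winning stacks the win reward on top of the correct-guess reward.
        Outcome::CorrectGuess { solved_all_goals: true } => {
            inc.reward_correct + inc.reward_win
        }
    }
}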
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
CASI Architecture
[Diagram: three Agents, each connected to the Simulator over gRPC]
Simulator (written in Rust)
Learning Agents (written with PyTorch)
Interprocess communication (uses gRPC)
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Why Rust?
Rust helps me write (and refactor) fast and correct software,
quickly.
Performance: no runtime, no garbage collector
Reliability: expressive type system gives memory-safety,
thread-safety
Productivity: package manager, libraries, community
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Why PyTorch?
As you might expect, PyTorch:
manages a computation graph and auto-differentiation
includes reusable modules for neural networks
has CPU and GPU support
In comparison to TensorFlow, I find PyTorch is:
more "natural" (has a more intuitive API)
easier to debug
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Why gRPC?
Simple service definition:
easy refactoring
Works across languages and platforms:
conveniently bridges Rust and Python
public and private use cases (RPC, APIs)
Start quickly and scale:
works on one machine
viable for large multi-agent systems
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Neural Network gRPC Definition
Here is how I use gRPC to define a neural network model for a
learning agent:
// Neural network for action prefs and value function
service Net {
rpc Init (NetCfg) returns (Empty) {}
rpc InitHidden (Empty) returns (Empty) {}
rpc Predict (Observations) returns (Predictions) {}
rpc ResetOptimizer (OptimCfg) returns (Empty) {}
rpc Train (Sequences) returns (Loss) {}
rpc GetParams (Empty) returns (Params) {}
rpc SetParams (Params) returns (Empty) {}
}
From this file, gRPC tooling can generate executable code in a wide
variety of languages.
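For example, on the Rust side a build script along these lines could generate client and server stubs from the .proto file. This assumes the tonic-build crate and an illustrative file path, not necessarily what this project uses.
// build.rs -- sketch assuming the `tonic-build` crate; the path is illustrative.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Generates Rust client/server code for the `Net` service at compile time.
    tonic_build::compile_protos("proto/net.proto")?;
    Ok(())
}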
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Reinforcement Learning
"Reinforcement learning (RL) is learning what to do -- how to map
situations to actions -- so as to maximize a numerical reward
signal." -- Sutton & Barto
[Excerpt from Sutton & Barto, Chapter 3 (Finite Markov Decision Processes). Figure 3.1: The agent-environment interaction in a Markov decision process: the agent receives state S_t and reward R_t and emits action A_t; the environment returns S_{t+1} and R_{t+1}.]
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Decentralized RL
You may remember this diagram from before:
[Diagram: Decentralized Learning (each agent exchanges observations and actions with Simulator(t) and learns independently)]
Can you think about some pros and cons?
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
RL Dimensions
Environmental considerations:
states: fully-observed vs. partially-observed
actions: discrete vs. continuous
Algorithmic considerations:
policies: deterministic vs. stochastic
on-policy vs. off-policy
model-based vs. model-free *
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Models in RL
I briefly wanted to mention what "model" means in the context of
RL, since it can be confusing.
Ben Recht says it well in A Tour of Reinforcement Learning: The
View from Continuous Control (2018):
"The term "model-free" almost always means "no model
of the state transition function" when casually claimed in
reinforcement learning research. However, this does not
mean that modeling is not heavily built into the
assumptions of model-free RL algorithms."
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Algorithm Experimentation
I've implemented these algorithms from scratch in Rust and
PyTorch:
REINFORCE (vanilla)
REINFORCE with baseline
Proximal Policy Optimization (PPO)
Advantage Actor Critic (A2C)
I'm currently using A2C + PPO with exploration bonuses.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
REINFORCE
REINFORCE: Monte-Carlo Policy-Gradient Control (episodic), for estimating an optimal policy π*

Input: a differentiable policy parameterization π(a | s, θ)
Algorithm parameter: step size α > 0
Initialize policy parameter θ ∈ R^d' (e.g., to 0)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following π(· | ·, θ)
    Loop for each step of the episode t = 0, 1, ..., T-1:
        G ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k        (= G_t)
        θ ← θ + α γ^t G ∇ ln π(A_t | S_t, θ)

"Reinforcement Learning: An Introduction" by Sutton & Barto, 2018
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
PPO Actor-Critic
The truncated generalized advantage estimator (which reduces to Equation (10) of the paper when λ = 1):

    Â_t = δ_t + (γλ) δ_{t+1} + ··· + (γλ)^{T-t+1} δ_{T-1},
    where δ_t = r_t + γ V(s_{t+1}) − V(s_t)

A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collects T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data and optimize it with minibatch SGD (or, usually for better performance, Adam) for K epochs.

Algorithm 1: PPO, Actor-Critic Style
    for iteration = 1, 2, ... do
        for actor = 1, 2, ..., N do
            Run policy π_{θ_old} in environment for T timesteps
            Compute advantage estimates Â_1, ..., Â_T
        end for
        Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
        θ_old ← θ
    end for

"Proximal Policy Optimization Algorithms" by Schulman et al., 2017
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Exploration
There are many kinds of exploration in RL:
ε-greedy exploration *
a result of stochastic policies
entropy-based exploration
seeking novel (state, action) pairs
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Count-Based Exploration with Hashing
From "#Exploration" by Tang et al., 2017:
"a simple generalization of the classic count-based
approach [...]. States are mapped to hash codes, which
allows to count their occurrences with a hash table.
These counts are then used to compute a reward bonus
according to the classic count-based exploration theory."
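A minimal Rust sketch of that idea, assuming states have already been reduced to hash codes; the β/√n bonus form follows the count-based exploration literature, and the names are illustrative.
use std::collections::HashMap;

/// Count-based exploration bonus over hashed states (sketch).
pub struct CountBonus {
    counts: HashMap<u64, u64>,
    beta: f64,
}

impl CountBonus {
    pub fn new(beta: f64) -> Self {
        CountBonus { counts: HashMap::new(), beta }
    }

    /// Record a visit to `state_hash` and return the bonus beta / sqrt(n),
    /// where n is the updated visit count for that hash code.
    pub fn bonus(&mut self, state_hash: u64) -> f64 {
        let n = self.counts.entry(state_hash).or_insert(0);
        *n += 1;
        self.beta / (*n as f64).sqrt()
    }
}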
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Locality Sensitive Hashing (LSH)
From Wikipedia:
"LSH hashes input items so that similar items map to the
same buckets with high probability [...]. LSH differs
from conventional and cryptographic hash functions
because it aims to maximize the probability of a
collision for similar items."
Arguably, the key criterion for choosing an LSH algorithm is the
relevant distance metric.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
LSH for Hamming Distance
Early LSH papers on Hamming distances suggest using
single-column sampling techniques.
Each output bit is created simply by sampling a single column
from the input vector.
However, sampling only a fraction of the input vector seems
suboptimal. I would expect benefits from drawing information
from all columns.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Hamming Distance LSH with Super-Bits
I blended two approaches from the literature as follows:
Sample from each column but only once. Otherwise,
differences can be accentuated.
Designate groups of the output bits. This allows trading off
between the number of anchor points for the distance
calculation vs. the amount of compression. (Inspired by
"Super-Bit LSH" by Ji et al., 2012.)
Combine columns using the Hamming distance. Store the
result in each super-bit group.
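One possible reading of this blend, as a hedged Rust sketch: partition the input columns into groups, fix a random anchor bit per column, and store each group's Hamming distance to its anchors as one super-bit value. The grouping and anchor scheme here are my assumptions, not necessarily the author's implementation.
/// Hedged sketch of a Hamming-distance LSH with super-bit groups.
pub struct SuperBitHasher {
    /// Each group lists the input columns it covers; each column appears once.
    groups: Vec<Vec<usize>>,
    /// One fixed random anchor bit per input column.
    anchors: Vec<bool>,
}

impl SuperBitHasher {
    /// For each group, store the Hamming distance between the input's
    /// columns and the anchor bits for those columns.
    pub fn hash(&self, input: &[bool]) -> Vec<u8> {
        self.groups
            .iter()
            .map(|cols| {
                cols.iter()
                    .filter(|&&c| input[c] != self.anchors[c])
                    .count() as u8
            })
            .collect()
    }
}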
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Attributes
Here is an example of a Rust data structure to configure attributes
for an experiment.
/// Attribute configuration
pub struct AttrCfg {
/// Total number of attributes in simulation
pub count: u16,
/// Attributes known from direct experience (per agent)
pub direct: u16,
/// To win, an agent must solve this many attributes
pub goal: u16,
/// Number of possible values for each attribute
pub values: u8,
}
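For example, a configuration in the spirit of the earlier attribute-vector illustration; the specific numbers below are arbitrary.
// Illustrative values only.
let attr_cfg = AttrCfg {
    count: 6,   // six attributes in the simulation
    direct: 2,  // each agent knows two of them from direct experience
    goal: 2,    // an agent wins after solving two goal attributes
    values: 3,  // each attribute takes one of three possible values
};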
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 0 Description
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: "Can this agent learn to win?"
Answer: "Perhaps surprisingly, it can, by
exploiting the lack of random initialization."
You might call it "Groundhog Day" learning.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 0 Output 1
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 0 Output 2
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 0 Output 3
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 1 Description
Same as Experiment 0 but with random experiment initialization:
Two Agents:
One non-learning random agent
One learning agent
a feedforward neural network
Question: "How can this agent learn to win?"
Answer: "It can't; communication would be required."
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Experiment 2 Description
Two Agents:
One non-learning completely honest agent
One learning agent, with both:
a feedforward neural network
an LSTM
Question: "How can this agent learn to win?"
Answer: "By asking questions of the honest agent."
Discuss: "Is the LSTM necessary in this case?"
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Problem: Diverse Purposes for Runs
Since the simulation engine is a platform for experimentation, it
has different purposes:
Verifying new algorithms
Inspecting policy changes over time
Exploring hyperparameters
Monitoring optimizer performance
Getting the verbosity level right is tricky: too little misses out, but
too much creates clutter...
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Solution: Granular Logging
One solution is to use fine-grained logging:
let log_cfg = LogCfg { // subset
log_action_probs: false,
log_actions_rewards: false,
log_attrs: false,
log_encodings: false,
log_experiment_start: true,
log_interactions: false,
log_iteration_csv: true,
log_learning: false,
log_obs_summary: true,
log_trajectories: false,
};
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Problem: When to Stop Optimization?
Policy gradient (PG) methods aim to improve a policy over many
iterations.
However, policy performance does not necessarily improve
monotonically.
Empirically, PG methods, even with improvements such as PPO,
can sometimes drift away from good performance and never
recover.
See: "Where Did My Optimum Go?: An Empirical Analysis of
Gradient Descent Optimization in Policy Gradient Methods" by
Henderson et al., 2018.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Why do Policy Gradient Methods Degrade?
The Policy Gradient Theorem gives the correct gradient, so what
could go wrong in practice?
Gradient ascent takes many iterations. How many? What
does research say about the bounds as applied to PG?
The underlying function approximator* may lack sufficient
representational power.
The network parameters may not be able to efficiently
respond to the latest distributional shift.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Solution: Backtracking Algorithm
I implemented a three-level backtracking algorithm that requires
three hyper-parameters:
/// Backtracking configuration
pub struct BacktrackCfg {
/// Snapshot pops before stopping early
pub pop_limit: u32,
/// Consecutive resets before snapshot pop
pub restart_limit: u32,
/// Consecutive unimproved iterations before reset
pub unimproved_limit: u32,
}
The algorithm is shown on the next slide.
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Backtracking Algorithm
Algorithm 1: Three-level backtracking algorithm
snapshots ← []; pops ← 0; restarts ← 0; unimproved ← 0
for iteration ← 0 to max_iterations do
    if improved performance then
        snapshots.push(model); restarts ← 0; unimproved ← 0
    else unimproved ← unimproved + 1
    if unimproved > max_unimproved then
        restarts ← restarts + 1; unimproved ← 0
        if restarts > max_restarts then
            pops ← pops + 1
            if pops > max_pops then stop_early()
            else snapshots.pop(); restarts ← 0
        model ← snapshots.last()
    else
        model.optimization_step()
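For reference, a hedged Rust sketch of the same control flow, using the BacktrackCfg from the previous slide. The Model type and its two methods are placeholders; the real implementation presumably snapshots network parameters rather than cloning a whole model.
// Sketch only: `Model` and its methods are placeholders.
#[derive(Clone)]
struct Model;
impl Model {
    fn improved_performance(&self) -> bool { unimplemented!() }
    fn optimization_step(&mut self) { unimplemented!() }
}

fn backtracking_loop(cfg: &BacktrackCfg, model: &mut Model, max_iterations: u32) {
    let mut snapshots: Vec<Model> = vec![];
    let (mut pops, mut restarts, mut unimproved) = (0u32, 0u32, 0u32);

    for _iteration in 0..max_iterations {
        if model.improved_performance() {
            snapshots.push(model.clone());
            restarts = 0;
            unimproved = 0;
        } else {
            unimproved += 1;
        }

        if unimproved > cfg.unimproved_limit {
            restarts += 1;
            unimproved = 0;
            if restarts > cfg.restart_limit {
                pops += 1;
                if pops > cfg.pop_limit {
                    return; // stop early
                }
                snapshots.pop();
                restarts = 0;
            }
            // Roll back to the most recent surviving snapshot.
            if let Some(best) = snapshots.last() {
                *model = best.clone();
            }
        } else {
            model.optimization_step();
        }
    }
}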
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Problem: Consistency of Responses
An agent reports an intention to the simulation; the simulation
converts it to a response that another agent observes.
[Image omitted; source: http://factmyth.com]
Claim: inconsistent lies are easier to discover.
How can the simulation achieve "consistent lying"?
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Consistent Lying
/// Return a deterministic lie that depends only on:
/// 1. the function arguments
/// 2. the agent's private salt
/// Note: 'q' / 'a' = agent id of questioner / answerer
fn specific_lie(
    &self, q: AgentId, attr_idx: u16, attr_val: Option<u8>
) -> Answer {
    let m = self.attr_cfg.values as u64;
    let hash = self.calc_hash(q, attr_idx);
    let lie = match attr_val {
        // The offset lies in 1..=m-1, so the lie never equals the true value.
        Some(true_val) =>
            (true_val as u64 + 1 + (hash % (m - 1))) % m,
        None => hash % m,
    };
    Answer::Value(lie as u8)
}
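The consistency comes from calc_hash being a deterministic function of the questioner, the attribute index, and the answerer's private salt. A minimal sketch of such a hash, assuming the salt is simply a stored u64 (free-function form; the real method and hasher may differ):
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch: a deterministic hash of (private salt, questioner, attribute index).
// DefaultHasher::new() uses fixed keys, so the result is stable within a run.
fn calc_hash(salt: u64, questioner: u32, attr_idx: u16) -> u64 {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h);
    questioner.hash(&mut h);
    attr_idx.hash(&mut h);
    h.finish()
}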
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Table of Contents
1 Motivation
2 Simulation
3 Algorithms
4 Experiments
5 Lessons Learned
6 Next Steps
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Next Steps
LSTM architecture & parameter tuning
Create experiments with multiple learning agents
Improve visualizations
Try various multi-agent reward structures
Use unsupervised learning to find patterns in policies
Improve efficiency by using parallelism in Rust
David James Communicating Agents Seeking Information
Motivation Simulation Algorithms Experiments Lessons Learned Next Steps
Discussion
I look forward to your questions and comments.
David James Communicating Agents Seeking Information
