This document describes applying reinforcement learning to stock trading with Python. It covers implementing Q-learning with a deep Q-network that represents the action-value function. The trading environment defines the states, actions, and rewards: states are windows of past stock prices, actions are buy, sell, or sit, and rewards incentivize profit. The agent trains on collected experience samples to learn a trading policy.
5. Action value function (Q-value)
- Expected discounted cumulative future reward given the current state and action, then following the optimal policy (mapping from state to action):
  Q(s_t, a_t) = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s = s_t, a = a_t \right]
  \gamma: discount factor
  t: time step
  s_t: state
  a_t: action
  r_t: (immediate) reward
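As a small illustration (not part of the original code), the discounted sum above can be computed for a made-up reward sequence; the function name discounted_return and the example rewards are hypothetical.

# Sketch: discounted cumulative reward sum_k gamma^k * r_{t+k}
# for a hypothetical list of future rewards (illustration only).
def discounted_return(rewards, gamma=0.995):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: future rewards 0, 0, 5 give 0.995**2 * 5 ≈ 4.95
print(discounted_return([0, 0, 5]))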
6. Q-learning
- A reinforcement learning algorithm for learning the optimal Q-value:
  Q_{\mathrm{new}}(s_t, a_t) = (1 - \alpha) Q_{\mathrm{old}}(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right)
  \alpha \in (0, 1): learning rate
- To apply Q-learning, collect samples from episodes, represented as tuples (s_t, a_t, r_t, s_{t+1})
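A minimal tabular sketch of this update rule (hypothetical array layout and names; the project itself represents Q with a neural network, as described in the next section):

import numpy as np

# Tabular Q-learning update sketch: Q is a (num_states, num_actions) array,
# alpha the learning rate, gamma the discount factor (illustration only).
def q_update(Q, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.995):
    target = r_t + gamma * np.max(Q[s_next])
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
    return Q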
7. Deep Q-learning
- Represent the action value function with a deep network and minimize the loss function
  L = \sum_{t \in D} \left( Q(s_t, a_t) - y_t \right)^2
  y_t = r_t + \gamma \max_a Q(s_{t+1}, a)
  D: minibatch
Note
- In contrast to supervised learning, the target value involves the current network outputs, so the network parameters should be updated gradually
- The order of samples in minibatches can be randomized to decorrelate sequential samples (replay buffer)
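The target and loss can be sketched with NumPy as follows (q_net and the arrays are hypothetical names; the actual Keras implementation appears later in expReplay):

import numpy as np

# Sketch of the deep Q-learning target y_t and squared loss over a minibatch D.
# q_net(states) is assumed to return Q-values of shape (batch, actions).
def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.995):
    q_pred = q_net(states)[np.arange(len(actions)), actions]         # Q(s_t, a_t)
    q_next = np.where(dones, 0.0, np.max(q_net(next_states), axis=1))
    targets = rewards + gamma * q_next                                # y_t
    return np.sum((q_pred - targets) ** 2)                            # L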
8. Trading model
- Based on closing prices
- Three actions: buy (1 unit), sell (1 unit), sit
- Ignoring transaction fees
- Immediate transaction execution
- Limited number of units we can keep
9. State
- State: n-step time window of the 1-step differences of past stock prices
- Sigmoid function to deal with the input scaling issue
  s_t = (d_{t-n+1}, \cdots, d_{t-1}, d_t)
  d_t = \mathrm{sigmoid}(p_t - p_{t-1})
  p_t: price at time t (discrete)
  n: time window size (steps)
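For reference, the sigmoid maps each raw price difference into the interval (0, 1), which is how it addresses the input scaling issue:

d_t = \sigma(p_t - p_{t-1}), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}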
10. Reward
- Depends on the action:
  - Sit: 0
  - Buy: a negative constant (configurable)
  - Sell: profit (sell price - bought price)
- The reward needs to be designed appropriately
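Putting the trading model (section 8) and this reward scheme together, one environment step might look like the sketch below. The function and the action numbering are hypothetical, since environment.py itself is not listed in this document; inventory_max and reward_for_buy are taken from the configuration shown later.

# Hypothetical sketch of one environment step under the rules above:
# action 0 = sit, 1 = buy (1 unit), 2 = sell (1 unit); no transaction fee,
# immediate execution, at most inventory_max units held.
def step_sketch(price, action, inventory, inventory_max=10, reward_for_buy=-10):
    reward = 0                                    # sit
    if action == 1 and len(inventory) < inventory_max:
        inventory.append(price)                   # buy: remember the purchase price
        reward = reward_for_buy                   # configurable negative constant
    elif action == 2 and len(inventory) > 0:
        bought_price = inventory.pop(0)           # sell the oldest unit
        reward = price - bought_price             # profit = sell price - bought price
    return reward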
11. Implementation
- Training (train.py)
  Initialization, sample collection, training loop
- Environment (environment.py)
  Loading stock data, making state transitions with rewards
- Agent (agent.py)
  Feature extraction, Q-learning, ε-greedy exploration, saving the model
- Results examination (learning_curve.py, plot_learning_curve.py, evaluate.py)
  Computing the learning curve for given data, plotting
12. Configuration
- For comparing various experimental settings
  stock_name: ^GSPC
  window_size: 10
  episode_count: 500
  result_dir: ./models/config
  batch_size: 32
  clip_reward: True
  reward_for_buy: -10
  gamma: 0.995
  learning_rate: 0.001
  optimizer: Adam
  inventory_max: 10
13. Parsing the config file

# Load the experiment parameters from the YAML file given on the command line
import sys
from ruamel.yaml import YAML

with open(sys.argv[1]) as f:
    yaml = YAML()
    config = yaml.load(f)

stock_name = config["stock_name"]
window_size = config["window_size"]
episode_count = config["episode_count"]
result_dir = config["result_dir"]
batch_size = config["batch_size"]

- Get the parameters as a dict
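The remaining parameters (gamma, learning_rate, inventory_max, and so on) can be read the same way; since the loaded object behaves like a dict, optional keys can also be given defaults (a hypothetical example):

# Hypothetical: dict-style access with defaults for optional keys
gamma = config.get("gamma", 0.995)
learning_rate = config.get("learning_rate", 0.001)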
15. Training loop

# Loop over episodes
for e in range(episode_count + 1):
    # Initialization before starting an episode
    state = env.reset()
    agent.inventory = []
    done = False
    # Loop within an episode
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.memory.append((state, action, reward, next_state, done))
        state = next_state
        # Train the model once enough samples have been collected
        if len(agent.memory) > batch_size:
            agent.expReplay(batch_size)
    # Save the model every 10 episodes
    if e % 10 == 0:
        agent.model.save("models/model_ep" + str(e))

- Loop over episodes
- Collect samples and train the model (expReplay)
- Save the model every 10 episodes
20. Computing the state (getState)

# Returns an n-day state representation ending at time t
def getState(data, t, n, agent):
    d = t - n + 1
    # Pad with the first price (t0) if the window starts before the data
    block = data[d:t + 1] if d >= 0 else -d * [data[0]] + data[0:t + 1]
    res = []
    for i in range(n - 1):
        res.append(sigmoid(block[i + 1] - block[i]))
    return agent.modify_state(np.array([res]))

- Computes the 1-step differences of the stock price time series
- Adds information on whether the agent has bought (holds stock)

def modify_state(self, state):
    # Append a flag indicating whether the inventory is non-empty
    if len(self.inventory) > 0:
        state = np.hstack((state, [[1]]))
    else:
        state = np.hstack((state, [[0]]))
    return state

(in agent.py)
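As a hypothetical usage example (prices, t, and agent are made-up names), for window_size n = 10 the resulting state is a (1, 10) array: nine sigmoid-scaled differences plus the inventory flag appended by modify_state.

# Hypothetical call: prices is a list of closing prices, agent holds the inventory
state = getState(prices, t=50, n=10, agent=agent)
# state.shape == (1, 10): nine scaled 1-step differences + one inventory flag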
22. Q-network and action selection (agent.py)

def _model(self):
    # Neural network representing the action value function:
    # state vector in, one Q-value per action out
    model = Sequential()
    model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
    model.add(Dense(units=32, activation="relu"))
    model.add(Dense(units=8, activation="relu"))
    model.add(Dense(self.action_size, activation="linear"))
    model.compile(loss="mse", optimizer=Adam(lr=0.001))
    return model

def act(self, state):
    # Epsilon-greedy exploration during training; greedy action in evaluation
    if not self.is_eval and np.random.rand() <= self.epsilon:
        return random.randrange(self.action_size)
    options = self.model.predict(state)
    return np.argmax(options[0])

- Implements the action value function with a neural network
- Accepts a state vector as input and outputs a value for each action
- Uses squared loss (MSE) and the Adam optimizer
- Takes the argmax action, with ε-greedy exploration during training
23. Experience replay (expReplay)

- Compute the target value with the current network output
- Update the parameters once (epochs=1)

def expReplay(self, batch_size):
    # Replay the stored samples in random order to decorrelate them
    subsamples = random.sample(list(self.memory), len(self.memory))
    states, targets = [], []
    for state, action, reward, next_state, done in subsamples:
        # Target y_t = r_t + gamma * max_a Q(s_{t+1}, a); just r_t at episode end
        target = reward
        if not done:
            target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
        # Replace only the value of the taken action with the target
        target_f = self.model.predict(state)
        target_f[0][action] = target
        states.append(state)
        targets.append(target_f)
    self.model.fit(np.vstack(states), np.vstack(targets), epochs=1, verbose=0,
                   batch_size=batch_size)
    # Decay epsilon to gradually reduce exploration
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay
24. Running the scripts

# Training with Q-learning
$ python train.py config/config2.yaml
- Training

# Learning curve on the training data
$ python learning_curve.py config/config2.yaml ^GSPC
- Learning curve
25. Plotting the learning curve

# Plotting the learning curve
$ python plot_learning_curve.py config/config2.yaml ^GSPC
- Specifies the data on which profits are computed with the trained model
26. Training data (^GSPC) vs. test data (^GSPC_2011)
[figure: learning curves on the training and test data]
- Not working well: profit increases on the test data but decreases on the training data
- We usually see overfitting to the training data
27. Plotting trading behavior
- Specify the data and the model (filename)

# Plotting trading on a model
$ python evaluate.py config/config2.yaml ^GSPC_2011 model_ep500