You may want to create your own reinforcement learning algorithm by modifying an existing one. As an example, we create a copy of the one-file soft actor-critic (SAC) implementation `cleansac` and modify it.
### copy cleansac
- copy the `cleansac` folder `src/custom_algorithms/cleansac`
- paste it into the `src/custom_algorithms` folder and name it something other than "cleansac", e.g. "cleansac_mod"
- rename the python file `cleansac.py` ➡ `cleansac_mod.py`, the class name (e.g. to `CLEANSAC_MOD`), the docstring, `super()` calls etc.
- adjust the `__init__.py` in the `cleansac_mod` folder to import your custom version of CLEANSAC (see the sketch after this list)
- create an algorithm configuration file `cleansac_mod.yaml` in `conf/algorithm` by copy-pasting `cleansac.yaml` and changing `name: 'cleansac'` to `name: 'cleansac_mod'`
- start the algorithm with `python src/main.py algorithm=cleansac_mod`
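As a rough sketch, the adjusted `__init__.py` of the new folder might look like the following. This assumes the original `cleansac/__init__.py` simply re-exports the algorithm class, so adapt the import to whatever the original file actually contains:

```python
# src/custom_algorithms/cleansac_mod/__init__.py
# Sketch, assuming the original __init__.py just re-exports the algorithm class.
# Adapt the module and class names to your renamed file and class.
from .cleansac_mod import CLEANSAC_MOD
```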
### start modifying it
CLEANSAC uses multiple critic networks that assess the value of the actions proposed by the actor network. Empirically, it is best to take the minimal value of these critics (the most pessimistic one) as the Q-value for the actor. Let's say you have the hypothesis that it would be better to take the maximum of these Q-values (the most optimistic value estimate). You can modify the code as follows to test your hypothesis:
```python
# original: take the minimum (most pessimistic) critic estimate
min_crit_next_target = torch.min(crit_next_targets, dim=0).values
min_crit_next_target -= ent_coef * next_state_log_pi
next_q_value = replay_data.rewards.flatten() + \
    (1 - replay_data.dones.flatten()) * self.gamma * min_crit_next_target.flatten()
...
min_crit_pi = torch.min(self.critic(observations, pi), dim=0).values
actor_loss = ((ent_coef * log_pi) - min_crit_pi).mean()

# modified: take the maximum (most optimistic) critic estimate
max_crit_next_target = torch.max(crit_next_targets, dim=0).values
max_crit_next_target -= ent_coef * next_state_log_pi
next_q_value = replay_data.rewards.flatten() + \
    (1 - replay_data.dones.flatten()) * self.gamma * max_crit_next_target.flatten()
...
max_crit_pi = torch.max(self.critic(observations, pi), dim=0).values
actor_loss = ((ent_coef * log_pi) - max_crit_pi).mean()
```
It would be more convenient to have a command-line parameter with which we can choose between min and max.

- add the line `critic_select: 'min' # allows values ['min', 'max']` to `cleansac_mod.yaml`
- add the new parameter to the constructor of `CLEANSAC_MOD`: `critic_select: str = 'min',`. You could also add other constructor parameters like the `learning_rate` to the config, or just use hydra's override syntax (e.g. `++algorithm.learning_rate=0.007`, see also the hydra tutorial). A small constructor sketch follows the code block below.
- store the parameter on the `CLEANSAC_MOD` instance with `self.critic_select = critic_select`
- now you can modify the code accordingly:
```python
if self.critic_select == 'min':
    min_crit_next_target = torch.min(crit_next_targets, dim=0).values
    min_crit_next_target -= ent_coef * next_state_log_pi
    next_q_value = replay_data.rewards.flatten() + \
        (1 - replay_data.dones.flatten()) * self.gamma * min_crit_next_target.flatten()
elif self.critic_select == 'max':
    max_crit_next_target = torch.max(crit_next_targets, dim=0).values
    max_crit_next_target -= ent_coef * next_state_log_pi
    next_q_value = replay_data.rewards.flatten() + \
        (1 - replay_data.dones.flatten()) * self.gamma * max_crit_next_target.flatten()
else:
    raise ValueError(f"invalid value for critic_select: {self.critic_select}")
...
if self.critic_select == 'min':
    min_crit_pi = torch.min(self.critic(observations, pi), dim=0).values
    actor_loss = ((ent_coef * log_pi) - min_crit_pi).mean()
elif self.critic_select == 'max':
    max_crit_pi = torch.max(self.critic(observations, pi), dim=0).values
    actor_loss = ((ent_coef * log_pi) - max_crit_pi).mean()
else:
    raise ValueError(f"invalid value for critic_select: {self.critic_select}")
```
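For reference, the constructor changes from the list above could look roughly like this. It is a minimal sketch: the real CLEANSAC constructor takes many more arguments (learning rate, buffer size, etc.), which are omitted here.

```python
class CLEANSAC_MOD:
    # Sketch only: all other constructor arguments of CLEANSAC are omitted.
    def __init__(self, critic_select: str = 'min'):
        # Validate early so a typo in the config fails at startup,
        # not in the middle of training.
        if critic_select not in ('min', 'max'):
            raise ValueError(f"invalid value for critic_select: {critic_select}")
        self.critic_select = critic_select
```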
You can now choose min/max via the command-line parameter `algorithm.critic_select` and run experiments to find out which is more successful. You could either start a number of runs with either configuration, as in the Weights and Biases tutorial, or perform hyperparameter optimization (recommended).
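For example, two runs comparing the settings could be started like this (any additional overrides, e.g. for the environment, are up to you):

```
python src/main.py algorithm=cleansac_mod algorithm.critic_select=min
python src/main.py algorithm=cleansac_mod algorithm.critic_select=max
```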
### integrate into framework
If you want to add your algorithm to the framework, create a merge request. GitLab will automatically run a pipeline of smoke tests to check whether your changes broke anything; your new algorithm is included in these smoke tests.
### Add new Python packages
If your new algorithm requires new Python packages, you must add them to `requirements.txt`.
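Each dependency is one line, preferably with a pinned version; the package name below is purely hypothetical:

```
# excerpt from requirements.txt; "some-extra-package" is a hypothetical example
some-extra-package==1.2.3
```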