# Integration of feedback and feed-forward techniques in reinforcement learning

### The problems

In the last decade, machine learning (ML) techniques such as reinforcement learning (RL) have demonstrated a remarkable capability to solve hard technical and mathematical problems. A famous example is the game of Go, where the human world champion was defeated by an algorithm trained by playing against itself. Other technical fields in which ML excels include the wide variety of reported pattern recognition applications, for example in the medical sciences.

A closer look at the reported applications does, however, begin to show some cracks. For self-driving cars, the predictions of technical and commercial success have been over-optimistic, and manufacturers have been forced to recall fleets of vehicles due to more or less trivial incidents where the algorithms have misinterpreted stop signs and continued across intersections. Simple parts of the ML-based automotive software for normal driver support, like automatic forward light beam control, have also shown erratic behavior and have been outperformed by conventional logic.

When the effects listed above are analyzed from a classical control and estimation point of view, a few important facts may be observed:

- The training of gaming algorithms has access to full and exact state information, enabling exact state feedback in RL training.
- The training of pattern recognition algorithms treats static cases, i.e. function approximation problems, although sometimes with noise in images and labels.
- Self-driving algorithms for cars need to handle noisy inputs, deal with highly nonlinear dynamic systems with multiple targets or set-points, and cope with a large number of unmodelled effects as well as measurement disturbances that may be hard to model as zero-mean noise.

There therefore seems to be a need to improve ML and RL algorithms so that they deal better with feedback control problems. A main theme of this research project is thus to find ways to integrate basic features from the field of automatic control into ML and RL algorithms.

The above bullets indicate that it could be beneficial to study improvements of the training of ML and RL algorithms by integration of automatic control technology for

- Set-point control of nonlinear systems, since conventional RL deals with a single target.
- Output feedback control of nonlinear systems, since full state feedback control is static.
- Regulation of nonlinear systems, handling unmeasurable nonzero mean disturbances.
- Regulation of nonlinear systems, accounting for measurable disturbances.

### Exploration aiding stabilizing feedback control

The excitation needed for an adaptive controller to converge and stabilize a feedback control system has been studied extensively. The problem is well understood, at least for linear systems, where the frequency properties of the exciting signals are key. In adaptive set-point control of nonlinear systems, amplitude excitation is crucial as well, since the controller needs to learn how to handle different reference levels, which represent different operating points in the state space.

Set-point control is related to so-called multi-goal RL algorithms. One problem for multi-goal RL when applied to set-point control of nonlinear systems is that it begins learning without any prior knowledge. In addition, RL algorithms typically exploit a zero-mean stochastic excitation (or exploration) signal added to the input. Hence, initially the RL controller uses more or less white noise as its input signal. The data collected in this way will typically not excite amplitudes sufficiently over the whole range of the control objective, so it can take a long time for the RL controller to learn even how to move towards the current reference level. There is also a risk that the controller ends up in a saturated state, where the white probing noise has no effect on the output, which again leads to insufficient excitation.

The first of the four ideas presented here is to add a stabilizing aiding controller, designed by conventional feedback control methods, to force the state of the system to change towards the operating point of the varying reference value. The stabilizing controller could, for example, be a basic proportional controller. The aiding control concept introduced in paper 1 is illustrated in Fig. 1.

*Figure 1. Addition of a stabilizing aiding controller to a multi-goal RL algorithm.*
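The aiding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of paper 1: the first-order tank dynamics, the gain `kp`, the actuator limits and the noise level are all hypothetical, and the untrained RL policy is represented by zero-mean Gaussian noise alone.

```python
import numpy as np


def aided_action(rl_action, reference, level, kp=2.0, u_min=0.0, u_max=1.0):
    """Add a proportional aiding term to the RL action, then saturate."""
    aiding = kp * (reference - level)  # conventional stabilizing P-controller
    return float(np.clip(rl_action + aiding, u_min, u_max))


def simulate_level(reference, use_aiding, steps=400, seed=0):
    """Toy tank (hypothetical dynamics) driven by an untrained 'policy'."""
    rng = np.random.default_rng(seed)
    level = 0.0
    for _ in range(steps):
        exploration = rng.normal(0.0, 0.1)  # zero-mean exploration noise
        if use_aiding:
            u = aided_action(exploration, reference, level)
        else:
            u = float(np.clip(exploration, 0.0, 1.0))
        level += 0.05 * (u - 0.5 * level)  # inflow minus level-dependent outflow
    return level


if __name__ == "__main__":
    print("final level with aiding:   ", simulate_level(0.8, use_aiding=True))
    print("final level without aiding:", simulate_level(0.8, use_aiding=False))
```

In this toy setting the unaided input stays close to the lower actuator limit, so the data never covers levels near the reference, whereas the aided loop is driven into the neighbourhood of the requested operating point (with the usual offset of pure proportional control).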

The effect of the aiding controller was tested using the double tank laboratory process depicted in Fig. 2, where the objective was to follow a varying water level set point for the lower tank.

*Figure 2. The double water tank laboratory system.*

The multi-goal RL controller was implemented using the PPO algorithm, and a stabilizing P-controller was used for aiding. The live data experiments for the PPO algorithm with aiding appear in Fig. 3, while the live data experiment without aiding failed, as shown in Fig. 4. In this case, aiding the exploration thus has a dramatic effect on closed loop performance. One possible reason for this is that only output feedback from the lower tank level is available, i.e. full state feedback is not available, which makes the adaptation problem for the un-aided RL algorithm hard. See 1 for a more detailed discussion. The different curves of Fig. 3 also depict the effect of varying the training time.

*Figure 3. The performance of the aiding controller, and the RL controller with aiding for different training cases.*

*Figure 4. The performance of the PPO-controller without aiding.*

### Integrating control and ensemble training

Unmodelled disturbances with non-zero mean are a fundamental problem in feedback control. The problem was solved more than 200 years ago (at least in industrial time) by integrating control of steam engines. Yet this concept has not been used when RL is applied to feedback control of dynamic systems, a fact that makes RL inherently non-robust to such effects. Compensation of a non-zero mean disturbance then requires re-training, which may take a long time.
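The classical argument for integrating control can be checked numerically. The sketch below is a textbook-style illustration, not taken from the cited papers: the first-order plant `dx/dt = -x + u + d`, the gains and the disturbance level are assumed values chosen only to make the effect visible.

```python
def remaining_error(ki, kp=1.0, disturbance=0.5, reference=1.0,
                    steps=2000, dt=0.01):
    """Simulate dx/dt = -x + u + d under P (ki = 0) or PI control.

    Returns the control error left at the end of the simulation.
    The plant and all numbers are illustrative assumptions.
    """
    x, integral = 0.0, 0.0
    for _ in range(steps):
        error = reference - x
        integral += error * dt              # integrator of the control error
        u = kp * error + ki * integral      # P part plus integrating part
        x += dt * (-x + u + disturbance)    # forward Euler step of the plant
    return reference - x


if __name__ == "__main__":
    print("P only :", remaining_error(ki=0.0))
    print("P and I:", remaining_error(ki=2.0))
```

With proportional feedback only, a constant unmeasured disturbance leaves a persistent offset, while the integrator drives the error to zero without any knowledge of the disturbance level.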

The papers and report 3, 4 and 5 address the problem with an obvious second idea, namely augmenting the neural network (NN) typically applied in RL with an integrator of the control error, and injecting the integrator state into the NN for learning. References 3, 4 and 5 first performed off-line training on simulated data from one reference model with a static disturbance. However, the integrator then had only a very small effect when the trained model was applied to another static disturbance. The reason was found to be that the NN model structure is so flexible that it does not need the integrator to arrive at a controller for a nonlinear system; it simply adapts to the correct gain for multiple levels of the system and disturbance.
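One way to picture the injection of the integrator state is as an extra entry in the observation vector handed to the NN. The following sketch assumes a hypothetical observation layout `[reference, output, integrated error]` and adds a simple clamp on the integrator; neither detail is taken from references 3, 4 or 5.

```python
import numpy as np


class IntegralErrorObservation:
    """Builds the NN input [reference, output, integrated control error].

    The layout and the clamp limit are illustrative assumptions.
    """

    def __init__(self, dt, limit=10.0):
        self.dt = dt
        self.limit = limit      # clamp that keeps the integrator state bounded
        self.integral = 0.0

    def reset(self):
        """Clear the integrator at the start of each episode."""
        self.integral = 0.0

    def __call__(self, reference, output):
        error = reference - output
        self.integral = float(np.clip(self.integral + error * self.dt,
                                      -self.limit, self.limit))
        return np.array([reference, output, self.integral], dtype=np.float32)


if __name__ == "__main__":
    observe = IntegralErrorObservation(dt=0.5)
    print(observe(1.0, 0.0))
    print(observe(1.0, 0.0))
```

The policy network then sees the accumulated error as an ordinary input and is free to use it or ignore it, which is exactly why training on a single model may leave it almost unused.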

To avoid this problem, a third idea is needed. The proposed technique extends the off-line training from being based on one model to simulated training based on multiple randomly drawn models in an ensemble. By forcing the off-line training of the RL algorithm to account for all simulated models in the ensemble, it becomes favorable for the RL algorithm to make use of the integration mechanism to maximize the reward. Therefore, when the algorithm is trained on an ensemble of simulated models, a sufficient gain from the integrated feedback control error results.

With the RL controller equipped with sufficient integrating gain, it can be successfully deployed on a real world system with partly unknown dynamics, and unmeasurable non-zero mean disturbances.
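The incentive created by ensemble training can be illustrated without any learning at all. The sketch below draws random first-order models (the parameter ranges, the quadratic reward and the two fixed policies are hypothetical choices, and no PPO update is performed) and compares the average return of a policy that ignores the integrator state with one that uses it.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PlantModel:
    gain: float           # hypothetical static gain
    time_constant: float  # hypothetical time constant [s]
    disturbance: float    # constant, unmeasured input disturbance


def sample_model(rng):
    """Draw one member of the ensemble (ranges are assumptions)."""
    return PlantModel(gain=rng.uniform(0.5, 2.0),
                      time_constant=rng.uniform(0.5, 2.0),
                      disturbance=rng.uniform(-0.3, 0.3))


def rollout(model, policy, reference=1.0, steps=300, dt=0.1):
    """Return the accumulated reward (negative squared error) of one episode."""
    y, integral, total_reward = 0.0, 0.0, 0.0
    for _ in range(steps):
        error = reference - y
        integral += error * dt
        u = policy(reference, y, integral)
        y += dt * (-y + model.gain * (u + model.disturbance)) / model.time_constant
        total_reward -= error ** 2 * dt
    return total_reward


def ensemble_return(policy, n_models=20, seed=0):
    """Average return over randomly drawn models (same draw for a given seed)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([rollout(sample_model(rng), policy)
                          for _ in range(n_models)]))


def p_policy(reference, y, integral):
    return 2.0 * (reference - y)                   # ignores the integrator


def pi_policy(reference, y, integral):
    return 2.0 * (reference - y) + 1.0 * integral  # uses the integrator


if __name__ == "__main__":
    print("without integrator:", ensemble_return(p_policy))
    print("with integrator:   ", ensemble_return(pi_policy))
```

Because no single static gain fits every model and disturbance in the ensemble, the policy that uses the integrated error collects the higher average reward, which is the pressure ensemble training is meant to put on the RL algorithm.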

The methodology was tested both on the double tank process and on a highly nonlinear pH control system model from 6. The system consists of linear mixing dynamics with an additional unmeasurable disturbance signal, in cascade with a static nonlinear transformation to the pH-value. The pH-curve of Fig. 5 is highly nonlinear, making it hard or even impossible to achieve acceptable performance with linear methods like PI control. As shown in Fig. 6, the PPO controller with proportional aiding, integration and ensemble training (PIME-PPO) performs best in this example.

Note also the very large training improvement that results from the aiding of the stabilizing P-controller.

*Figure 5. Illustration of the static nonlinear block of the pH-control dynamics.*

*Figure 6. Comparison of linear PI control, and ensemble trained PPO-control with and without aiding.*
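The cascade structure of the pH system described above (linear dynamics followed by a static nonlinearity) can be sketched as follows. The model of 6 and the curve of Fig. 5 are not reproduced here; as a stand-in, the static block uses the textbook charge balance for a strong acid mixed with a strong base, and the first-order mixing dynamics, time constant and input values are assumptions.

```python
import math

KW = 1e-14  # ion product of water at 25 degrees C, in (mol/L)^2


def ph_from_net_acid(x):
    """Static nonlinearity: pH for net strong-acid concentration x [mol/L].

    Solves h - KW / h = x for the hydrogen ion concentration h.
    Stand-in for the pH-curve of the source, not the model from 6.
    """
    s = math.sqrt(x * x + 4.0 * KW)
    # Two algebraically equal forms; each avoids cancellation on its side.
    h = 0.5 * (x + s) if x >= 0.0 else 2.0 * KW / (s - x)
    return -math.log10(h)


def simulate_ph(u_sequence, disturbance=0.0, tau=5.0, dt=0.1):
    """Linear first-order mixing followed by the static pH map."""
    x, ph_trace = 0.0, []
    for u in u_sequence:
        x += dt / tau * (u + disturbance - x)  # linear mixing dynamics
        ph_trace.append(ph_from_net_acid(x))   # static output nonlinearity
    return ph_trace


if __name__ == "__main__":
    for x in (-1e-2, -1e-6, 0.0, 1e-6, 1e-2):
        print(f"net acid {x:+.0e} mol/L -> pH {ph_from_net_acid(x):.2f}")
```

Near neutrality a change of about one micromole per litre moves the pH by a full unit, while far from neutrality the same change is barely visible, which is the kind of gain variation that makes a fixed linear controller struggle.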

### Structured feedback and feed-forward inspired neural networks

In automatic control, structure is used extensively. Optimal control of linear systems is one example, where it is common to use the separation principle for output feedback control. The separation principle breaks the optimal control problem into one dynamic optimal state estimation problem and one static full state feedback problem, by proving that the optimal solution is to apply full state feedback to the optimally estimated states. The structural implication for RL is that the optimal solution would also be obtained if the normal unstructured NN were divided into an observer NN followed in cascade by a state feedback control NN. Since the state observer is dynamic, a recurrent neural network (RNN) should be used for the observer part, while a conventional forward NN can be selected for the state feedback.
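A minimal NumPy sketch of such a cascade is given below. It shows only the forward pass of an untrained network with random weights; the layer sizes, the Elman-type recurrence and the input layout `[reference, measured output]` are hypothetical choices and not the architecture of paper 2.

```python
import numpy as np


class ObserverFeedbackPolicy:
    """Recurrent observer in cascade with a static state feedback network."""

    def __init__(self, n_inputs, n_states, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Observer part (dynamic): x_hat <- tanh(A x_hat + B z)
        self.A = scale * rng.standard_normal((n_states, n_states))
        self.B = scale * rng.standard_normal((n_states, n_inputs))
        # State feedback part (static): u = w2 . tanh(W1 x_hat + b1) + b2
        self.W1 = scale * rng.standard_normal((n_hidden, n_states))
        self.b1 = np.zeros(n_hidden)
        self.w2 = scale * rng.standard_normal(n_hidden)
        self.b2 = 0.0
        self.x_hat = np.zeros(n_states)

    def reset(self):
        """Clear the observer memory at the start of an episode."""
        self.x_hat = np.zeros_like(self.x_hat)

    def act(self, z):
        """Update the state estimate from z, then apply static feedback."""
        self.x_hat = np.tanh(self.A @ self.x_hat + self.B @ z)
        hidden = np.tanh(self.W1 @ self.x_hat + self.b1)
        return float(self.w2 @ hidden + self.b2)


if __name__ == "__main__":
    policy = ObserverFeedbackPolicy(n_inputs=2, n_states=4, n_hidden=8)
    z = np.array([1.0, 0.2])  # assumed layout: [reference, measured output]
    print(policy.act(z), policy.act(z))
```

Only the observer carries memory between time steps; the feedback network is a pure function of the current state estimate, mirroring the dynamic and static halves of the separation principle.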

In addition, automatic control theory advocates the use of feed-forward control whenever the disturbance is measurable. This is advantageous, provided that a model can be learned stating how the measurable disturbance affects the output signal. In that case, a feed-forward component of the control signal can be computed that reduces the control error, which in turn relaxes the requirements on the feedback part of the controller. As discussed in paper 2, a cascade of an RNN and a forward NN is preferably used to capture the feed-forward dynamics.
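The benefit of feed-forward from a measurable disturbance can be seen in a small simulation. The sketch assumes a hypothetical first-order plant `dx/dt = -x + b*u + g*d` whose parameters `b` and `g` are known exactly, so the feed-forward term is computed rather than learned as in the RL setting.

```python
def worst_case_error(use_feedforward, kp=2.0, b=1.0, g=0.8,
                     steps=400, dt=0.05):
    """Regulate x to zero under a switching, measurable disturbance d.

    Plant and parameters are illustrative assumptions with a perfect model.
    Returns the largest absolute control error seen during the run.
    """
    x, worst = 0.0, 0.0
    for k in range(steps):
        d = 1.0 if (k // 50) % 2 == 0 else -1.0  # switching disturbance
        u = kp * (0.0 - x)                        # feedback part, reference 0
        if use_feedforward:
            u += -(g / b) * d                     # cancels the modelled effect of d
        x += dt * (-x + b * u + g * d)            # forward Euler step
        worst = max(worst, abs(x))
    return worst


if __name__ == "__main__":
    print("feedback only:            ", worst_case_error(False))
    print("feedback and feed-forward:", worst_case_error(True))
```

With a perfect model the feed-forward term removes the disturbance before it reaches the output, so the feedback loop has nothing left to correct; with a learned, imperfect model the cancellation is only partial and feedback handles the remainder.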

Paper 2 therefore proposes a fourth idea, namely a PPO algorithm for RL based on the NN structure of Fig. 7.

*Figure 7. Structured NN inspired by automatic control theory. Structure 2 is shown. Structure 1 merges feedback and feed-forward into one RNN and one forward NN.*

But is there really a structure problem for RL-based control? The answer is not obvious. It is clear, however, that a completely unstructured NN would have to learn both the observer-controller structure and the separation of feedback and feed-forward. In theory, this should require a longer learning time, a larger NN and a corresponding increase in computational complexity.

To investigate the potential problem, RL control was simulated for the double tank system, using the PPO algorithm with P-controller aiding but without integrating control. The simulation now included a measurable disturbance, allowing for direct flow from tank 1 into the water basin, see Fig. 8.

*Figure 8. Block diagram of the double tank with a measurable disturbance.*

The structural effects are depicted in Fig. 9 and Fig. 10. Fig. 9 shows the case when combined feedback and feed-forward control is applied. Structure 2 clearly performs best, with far better disturbance rejection than structure 1 and the unstructured alternatives, primarily for a switching disturbance but also for constant ones at the minimum and maximum disturbance levels. In fact, structure 2 is the only one that gets the level compensation right. The unstructured alternative performs worst. Comparing the input signals, the unstructured alternative does not settle when the measurable disturbance is constant, as the structured control systems do. The remaining oscillations make it questionable whether the unstructured controller has any feed-forward effect at all; it rather appears to converge to a controller that induces a limit cycle, i.e. one with too high a gain. The input signal of the controller based on structure 2 also appears to have a smaller amplitude than that of structure 1, which points to better robustness.

Fig. 10 shows the effect when no feed-forward control is applied. That is, even though the disturbance is affecting the system, a zero feed-forward signal is fed into the RNNs. The effect of the disturbance is significant, and both the input and output signals support the conclusion that the unstructured controller is ill-tuned. For constant disturbance levels, the feedback controller of structure 2 achieves the best results.

Finally, the high frequency measurable disturbance is rejected in Fig. 9 but not in Fig. 10. This illustrates that the overall controller of structure 2 uses low bandwidth feedback together with high bandwidth feed-forward, as it should according to classical control theory.

*Figure 9. Combined feedback and feed-forward control using PPO with different NN structures.*

*Figure 10. Feedback control using PPO with different NN structures.*

### References

1. R. Zhang, P. Mattsson and T. Wigren, "Aiding reinforcement learning for set point control", Proc. IFAC World Congress, Yokohama, Japan, pp. 2748-2754, July 9-14, 2023.

2. R. Zhang, P. Mattsson and T. Wigren, "Observer-feedback-feedforward structures in reinforcement learning", Proc. IFAC World Congress, Yokohama, Japan, pp. 6807-6812, July 9-14, 2023.

3. R. Zhang, P. Mattsson and T. Wigren, "Robust nonlinear set-point control with reinforcement learning", Proc. ACC, San Diego, USA, pp. 84-91, May 31-June 2, 2023.

4. R. Zhang, P. Mattsson and T. Wigren, "A Robust Multi-Goal Exploration Aided Tracking Policy", Technical Reports from the Department of Information Technology, 2022-006, Uppsala University, Uppsala, Sweden, August, 2022.

5. R. Zhang, P. Mattsson and T. Wigren, "State observation and feedback in deep reinforcement learning", Reglermöte 2022, Luleå, Sweden, June 7-9, 2022.

6. T. Wigren, "Recursive identification based on the nonlinear Wiener model", Ph.D. thesis, Acta Universitatis Upsaliensis, Uppsala Dissertations from the Faculty of Science 31, Uppsala University, Uppsala, Sweden, December, 1990. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-118290