Computational reinforcement learning using rewards from human feedback

Publication Type: Thesis
Issue Date: 2018
A promising method of learning from human feedback is reward shaping, in which a robot is trained via instantaneous human-delivered rewards. The existing approach, which requires numerous reward signals from the human trainer about the quality of the agent's actions, rests on several assumptions about human capabilities: for example, that humans can provide precisely correct feedback on an agent's action, that they always prefer to train an agent by means of reward signals, and that they can assess an agent's actions for any length of training. In this thesis, we relax these assumptions and address two important issues that the existing approach does not handle. First, how to compute a potential function from human feedback that indicates the correctness of an action in terms of increasing or decreasing potential. Second, how to design training methods that cater to human preferences. Furthermore, we identify two important preferences of a human trainer in the application of reward shaping: (a) a preference to transfer knowledge by providing demonstrations, and (b) a preference for short training durations. To address these issues, we introduce three new methods of computing rewards from human feedback.

The first method, named rewards from state preferences, takes human feedback as preferences over states in terms of their distance to the goal state, removing the assumption that the user provides highly accurate evaluative feedback. It computes a high-quality potential function for potential-based reward shaping from only a few human feedback signals. Using the feedback as state preferences, a ranking model is learned that produces a complete ranking of states, and these state rankings define a potential function for potential-based reward shaping. This method learns a policy much faster than a reinforcement learner trained without human feedback.

The second method, named rewards from action labels, replaces the traditional evaluative-style feedback with demonstration-style feedback, catering to the human preference for providing demonstrations. It takes human feedback as an action label for the current state, which is similar to providing a demonstration, while the agent continues to act using its own policy. A reward function is computed by comparing the agent's action with the action label. We found that this method can be more favorable to a naïve user than the traditional evaluative-style feedback method.

Finally, the third method, named rewards from part-time trainers, is designed to reduce the load on a single dedicated trainer by curtailing the length of a training session. A policy is taught by a number of trainers, each of whom provides reward signals for a small number of steps. Experiments with an online crowd showed that random part-time trainers can collectively train a good policy, and in a survey conducted for this method, people overwhelmingly voted in favor of training for a short duration.

Overall, this thesis further extends the application scope of reward shaping by developing three efficient techniques for conducting reward shaping using human feedback of different types.
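For illustration only (this sketch is not taken from the thesis, and the function names and reward values are hypothetical), the snippet below shows how a potential function derived from a learned state ranking could drive potential-based reward shaping, using the standard shaping term F(s, s') = gamma * Phi(s') - Phi(s), which is known to preserve the optimal policy of the underlying task.

    def shaped_reward(env_reward, state, next_state, ranking_score, gamma=0.99):
        """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).
        Here Phi is taken to be a ranking score over states; in the thesis the
        ranking would be learned from a few human state-preference feedbacks,
        whereas this sketch uses a hand-written stand-in."""
        return env_reward + gamma * ranking_score(next_state) - ranking_score(state)

    # Toy stand-in for a learned ranking (assumption for illustration):
    # grid states nearer the goal at (9, 9) receive a higher score
    # (negative Manhattan distance to the goal).
    def toy_ranking_score(state):
        x, y = state
        return -(abs(9 - x) + abs(9 - y))

    # Moving one step toward the goal yields a positive shaping bonus.
    r_shaped = shaped_reward(env_reward=0.0, state=(2, 3), next_state=(3, 3),
                             ranking_score=toy_ranking_score)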
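As a further illustrative sketch (again not drawn from the thesis itself), a reward signal from action labels could be computed by comparing the agent's chosen action with the human's label for the current state; the +1/-1 reward values below are assumptions made purely for illustration.

    def reward_from_action_label(agent_action, labeled_action):
        """Illustrative reward from an action label (hypothetical values):
        reward the agent when its own action matches the human's label for
        the current state, and penalize it otherwise. The agent still acts
        from its own policy; the label only shapes the reward."""
        return 1.0 if agent_action == labeled_action else -1.0

    # Example: the human labels "left" for the current state and the agent
    # also chose "left", so the comparison yields a positive reward.
    r = reward_from_action_label(agent_action="left", labeled_action="left")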