Computational reinforcement learning using rewards from human feedback

Publication Type: Thesis
Issue Date: 2018
A promising method of learning from human feedback is reward shaping, in which a robot is trained via instantaneous human-delivered rewards. The existing approach, which requires numerous reward signals from the human trainer about the quality of the agent's actions, rests on several assumptions about human capabilities: for example, that humans can provide precisely correct feedback on an agent's action, that they always prefer to train an agent by means of reward signals, and that they can assess an agent's actions for any length of training. In this thesis, we relax these assumptions and address two important issues that the existing approach does not handle. First, how to compute a potential function from human feedback that indicates the correctness of an action in terms of increasing or decreasing potential. Second, how to design training methods that cater to human preferences. Furthermore, we identify two important preferences of a human trainer in the application of reward shaping: (a) a preference to transfer knowledge by providing demonstrations, and (b) a preference for short training durations. To address these issues, we introduce three new methods of computing rewards from human feedback.

The first method, named rewards from state preferences, takes human feedback as preferences over states in terms of their distance to the goal state, removing the assumption that the user provides highly accurate evaluative feedback. It computes a high-quality potential function for potential-based reward shaping from only a few human feedback signals. Using the feedback as state preferences, a ranking model is learned that produces a complete ranking of states, and these state rankings define a potential function for potential-based reward shaping. This method learns a policy much faster than a reinforcement learner trained without human feedback.

The second method, named rewards from action labels, replaces the traditional evaluative-style feedback with demonstration-style feedback, catering to the human preference for providing demonstrations. It takes human feedback as an action label for the current state, which is similar to providing a demonstration, while the agent continues to act using its own policy. A reward function is computed by comparing the agent's action with the action label. We found that this method can be more favorable to a naïve user than the traditional evaluative-style feedback method.

Finally, the third method, named rewards from part-time trainers, is designed to reduce the load on a single dedicated trainer by curtailing the length of a training session. A policy is taught by a number of trainers, each of whom provides reward signals for a small number of steps. Experiments with an online crowd showed that random part-time trainers can collectively train a good policy, and in a survey conducted for this method, people overwhelmingly voted in favor of training for a short duration.

Overall, this thesis further extends the application scope of reward shaping by developing three efficient techniques for conducting reward shaping using human feedback of different types.
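For illustration only (this sketch is not taken from the thesis, and the function names and reward values are hypothetical), the snippet below shows how a potential function derived from a learned state ranking could drive potential-based reward shaping, using the standard shaping term F(s, s') = gamma * Phi(s') - Phi(s), which is known to preserve the optimal policy of the underlying task.

    def shaped_reward(env_reward, state, next_state, ranking_score, gamma=0.99):
        """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).
        Here Phi is taken to be a ranking score over states; in the thesis the
        ranking would be learned from a few human state-preference feedbacks,
        whereas this sketch uses a hand-written stand-in."""
        return env_reward + gamma * ranking_score(next_state) - ranking_score(state)

    # Toy stand-in for a learned ranking (assumption for illustration):
    # grid states nearer the goal at (9, 9) receive a higher score
    # (negative Manhattan distance to the goal).
    def toy_ranking_score(state):
        x, y = state
        return -(abs(9 - x) + abs(9 - y))

    # Moving one step toward the goal yields a positive shaping bonus.
    r_shaped = shaped_reward(env_reward=0.0, state=(2, 3), next_state=(3, 3),
                             ranking_score=toy_ranking_score)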
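As a further illustrative sketch (again not drawn from the thesis itself), a reward signal from action labels could be computed by comparing the agent's chosen action with the human's label for the current state; the +1/-1 reward values below are assumptions made purely for illustration.

    def reward_from_action_label(agent_action, labeled_action):
        """Illustrative reward from an action label (hypothetical values):
        reward the agent when its own action matches the human's label for
        the current state, and penalize it otherwise. The agent still acts
        from its own policy; the label only shapes the reward."""
        return 1.0 if agent_action == labeled_action else -1.0

    # Example: the human labels "left" for the current state and the agent
    # also chose "left", so the comparison yields a positive reward.
    r = reward_from_action_label(agent_action="left", labeled_action="left")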