Title Sinteza funkcije nagrade iz opisa u prirodnom jeziku transformerom
Title (english) Synthesis of a reward function from a natural language description by a transformer
Author Jan Corazza
Mentor Luka Grubišić (mentor)
Committee member Luka Grubišić (committee chair)
Committee member Zvonimir Bujanović (committee member)
Committee member Nikola Sandrić (committee member)
Committee member Marcela Hanzer (committee member)
Granter University of Zagreb, Faculty of Science (Department of Mathematics), Zagreb
Defense date and country 2023-07-18, Croatia
Scientific / art field, discipline and subdiscipline NATURAL SCIENCES, Mathematics
Abstract Zadaci koji od agenta zahtijevaju ostvarivanje kompleksnog niza događaja u okruženju često se javljaju u praksi, no funkcije nagrade koje obuhvaćaju takve zadatke ne zadovoljavaju Markovljevo svojstvo, koje je nužno za osiguravanje točnosti i konvergencije klasičnih algoritama za učenje s potporom. U radu su uvedeni pojam strojeva nagrade i poseban algoritam za učenje s potporom, Q-learning for Reward Machines (QRM), namijenjen radu sa strojevima nagrade. Strojevi nagrade vrsta su konačnih automata kojima je moguće zapisati funkcije nagrade koje ne zadovoljavaju Markovljevo svojstvo, te se mogu shvatiti kao oblik konačne memorije u Markovljevom procesu odlučivanja. Nažalost, ručno pisanje takvih formalizama mukotrpan je posao, pa se javila potreba da se on olakša. Neke metode tome pristupaju učenjem automata paralelno s učenjem ponašanja agenta, no moraju pretpostaviti da je signal nagrade unaprijed definiran i da se podvrgava strukturi konačnog automata. U diplomskom radu istražen je drugi pristup tom problemu: korištenje novih tehnologija iz područja razumijevanja prirodnog jezika, s ciljem da se inženjerima i istraživačima olakša sam postupak definiranja signala nagrade. U tu svrhu dotreniran je jezični model temeljen na poznatom modelu GPT-2, a detaljno je opisan i postupak proširenja skupa podataka za treniranje. Krajnji rezultat omogućuje sljedeće korake. 1. Korisnik opisuje zadatak na engleskom jeziku, na primjer “patrol the forest and the factory, but avoid traps”. 2. Poziva se jezični model koji prevodi korisnikov opis u formalizam stroja nagrade. 3. Koriste se QRM i stroj nagrade iz koraka 2 kako bi agent naučio obavljati zadatak koji je korisnik opisao u koraku 1. 4. Rezultati se korisniku demonstriraju vizualno na malom broju epizoda. Rad predstavlja i analizira rezultate treniranja jezičnog modela, kao i rezultate QRM algoritma uparenog sa strojevima nagrade pronađenima pomoću modela. Također je razvijeno grafičko korisničko sučelje kojim se demonstriraju uporaba jezičnog modela i rezultati treniranja agenta.
Abstract (english) Tasks that require an agent to accomplish a complex sequence of events in the environment are common in practice. However, reward functions that capture such tasks do not satisfy the Markov property, which is necessary to ensure the correctness and convergence guarantees of classical reinforcement learning algorithms. This thesis introduces the concept of reward machines and a reinforcement learning algorithm designed to work with them: Q-learning for Reward Machines (QRM). Reward machines are a type of finite automaton that can express reward functions that do not satisfy the Markov property, and they can be understood as a form of finite memory added to a Markov decision process. Unfortunately, writing such formalisms by hand is laborious, so there is a need to simplify the process. Some methods approach this by learning the automaton jointly with the agent's behaviour, but they must assume a predefined reward signal that conforms to a finite-state structure. This thesis explores a different approach: using recent natural language understanding technology to make it easier for engineers and researchers to define the reward signal in the first place. For this purpose, a language model based on the well-known GPT-2 model was fine-tuned, and the data augmentation process used to expand the training dataset is described in detail. The end result enables the following steps. 1. The user describes the task in English, for example, “patrol the forest and the factory, but avoid traps”. 2. The language model is invoked to translate the user's description into the reward machine formalism. 3. QRM and the reward machine from Step 2 are used to train the agent to perform the task described by the user in Step 1. 4. The results are demonstrated to the user visually over a small number of episodes. The thesis presents and analyzes the results of training the language model, as well as the results of the QRM algorithm paired with the reward machines discovered by the model. Additionally, a graphical user interface was developed to demonstrate the use of the language model and the results of training the agent.
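To make the central construction concrete, here is a minimal Python sketch of a reward machine for the abstract's example task "patrol the forest and the factory, but avoid traps". It follows the common formulation of a reward machine as a tuple <U, u0, delta_u, delta_r> driven by sets of propositions observed in the environment; the state names, propositions, and reward values below are hypothetical illustrations, not the thesis's actual formalism or model output encoding.

    from dataclasses import dataclass
    from typing import Dict, FrozenSet, Tuple

    @dataclass
    class RewardMachine:
        """Minimal reward machine: a finite automaton over proposition sets.

        delta_u maps (state, observed propositions) to the next state, and
        delta_r maps (state, next state) to a scalar reward, mirroring the
        usual <U, u0, delta_u, delta_r> formulation from the literature.
        """
        u0: str
        delta_u: Dict[Tuple[str, FrozenSet[str]], str]
        delta_r: Dict[Tuple[str, str], float]

        def step(self, u: str, props: FrozenSet[str]) -> Tuple[str, float]:
            u_next = self.delta_u.get((u, props), u)  # self-loop on unlisted labels
            return u_next, self.delta_r.get((u, u_next), 0.0)

    # Hypothetical machine for "patrol the forest and the factory, but avoid
    # traps": reach the forest, then the factory (a completed lap is rewarded);
    # stepping on a trap leads to an absorbing failure state.
    rm = RewardMachine(
        u0="u0",
        delta_u={
            ("u0", frozenset({"forest"})): "u1",
            ("u1", frozenset({"factory"})): "u0",
            ("u0", frozenset({"trap"})): "fail",
            ("u1", frozenset({"trap"})): "fail",
        },
        delta_r={("u1", "u0"): 1.0},  # reward only when a full lap is finished
    )

    u = rm.u0
    for props in [frozenset({"forest"}), frozenset(), frozenset({"factory"})]:
        u, r = rm.step(u, props)
        print(u, r)  # prints: u1 0.0, then u1 0.0, then u0 1.0

The patrol task itself is non-Markovian: the reward for reaching the factory depends on whether the forest was visited first. The pair (environment state, machine state), however, is Markovian, which is what lets QRM learn a separate Q-function per machine state while the task's memory lives entirely in the automaton.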
Keywords
algoritmi za učenje s potporom
strojevi nagrade
Q-learning for Reward Machines (QRM)
konačni automati
Keywords (english)
reinforcement learning algorithms
reward machines
Q-learning for Reward Machines (QRM)
finite automata
Language Croatian
URN:NBN urn:nbn:hr:217:612839
Study programme Title: Computer Science and Mathematics Study programme type: university Study level: graduate Academic / professional title: sveučilišni magistar računarstva i matematike (university master of computer science and mathematics)
Type of resource Text
File origin Born digital
Access conditions Open access
Terms of use
Repository Repository of the Faculty of Science
Created on 2024-02-08 12:51:57