Q-learning and policy iteration algorithms for stochastic shortest path problems
Abstract: We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
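Since the abstract outlines an algorithmic scheme, a small illustration may help. The sketch below is a synchronous, model-based, lookup-table caricature of the idea described above, not the authors' asynchronous or stochastic implementations: it alternates one greedy minimization over all controls, which freezes the stopping values J(j) = min_v Q(j, v) and a greedy policy mu, with a few cheap value-iteration sweeps on the associated optimal stopping problem, where each successor state either "stops" at cost J(j) or "continues" with Q(j, mu(j)). All names (pi_like_q_learning, P, g, eval_sweeps) and the toy model are illustrative assumptions, not taken from the record or the paper.

```python
# Minimal, assumption-laden sketch of policy-iteration-like Q-learning
# for a stochastic shortest path (SSP) problem with a termination state.
import numpy as np


def pi_like_q_learning(P, g, num_cycles=50, eval_sweeps=5):
    """P: (U, S, S) substochastic transition matrices among non-terminal states
       (each row may sum to less than 1; the remainder is the probability of
       moving to the cost-free termination state).
    g: (U, S) expected one-stage costs.
    Returns a Q-table of shape (S, U)."""
    U, S, _ = P.shape
    Q = np.zeros((S, U))
    for _ in range(num_cycles):
        # Improvement-like step: the only minimization over all controls.
        J = Q.min(axis=1)        # stopping values J(j) = min_v Q(j, v)
        mu = Q.argmin(axis=1)    # greedy policy
        # Inexact evaluation: a finite number of value iterations on the
        # optimal stopping problem with J and mu held fixed.
        for _ in range(eval_sweeps):
            w = np.minimum(J, Q[np.arange(S), mu])  # stop vs. continue at each successor
            Q = np.stack([g[u] + P[u] @ w for u in range(U)], axis=1)  # (S, U)
    return Q


if __name__ == "__main__":
    # Toy random SSP: 2 controls, 4 states, 10% termination probability per step.
    rng = np.random.default_rng(0)
    U, S = 2, 4
    P = rng.random((U, S, S))
    P = 0.9 * P / P.sum(axis=2, keepdims=True)
    g = rng.random((U, S))
    Q = pi_like_q_learning(P, g)
    print("Greedy policy:", Q.argmin(axis=1))
    print("Cost-to-go:   ", Q.min(axis=1))
```

In this sketch only one sweep per cycle minimizes over all controls; the eval_sweeps inner updates reuse the frozen J and mu, which is the "lower overhead" point the abstract makes in comparison with classical Q-learning.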
Detailed description

Author: Yu, Huizhen [author]
Format: Article
Language: English
Published: 2012
Subjects: Markov decision processes; Q-learning; Approximate dynamic programming; Value iteration; Policy iteration; Stochastic shortest paths; Stochastic approximation
Note: © Springer Science+Business Media, LLC 2012
Contained in: Annals of operations research - Springer US, 1984, 208(2012), 1, 18 Apr., pages 95-132
Contained in: volume:208 ; year:2012 ; number:1 ; day:18 ; month:04 ; pages:95-132
Link: https://doi.org/10.1007/s10479-012-1128-z (full text; license required)
DOI / URN: 10.1007/s10479-012-1128-z
Catalog ID: OLC2111156647
LEADER  01000naa a22002652 4500
001     OLC2111156647
003     DE-627
005     20230502202753.0
007     tu
008     230502s2012 xx ||||| 00| ||eng c
024 7_  |a 10.1007/s10479-012-1128-z |2 doi
035 __  |a (DE-627)OLC2111156647
035 __  |a (DE-He213)s10479-012-1128-z-p
040 __  |a DE-627 |b ger |c DE-627 |e rakwb
041 __  |a eng
082 04  |a 004 |q VZ
084 __  |a 3,2 |2 ssgn
100 1_  |a Yu, Huizhen |e verfasserin |4 aut
245 10  |a Q-learning and policy iteration algorithms for stochastic shortest path problems
264 _1  |c 2012
336 __  |a Text |b txt |2 rdacontent
337 __  |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia
338 __  |a Band |b nc |2 rdacarrier
500 __  |a © Springer Science+Business Media, LLC 2012
520 __  |a Abstract We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
650 _4  |a Markov decision processes
650 _4  |a Q-learning
650 _4  |a Approximate dynamic programming
650 _4  |a Value iteration
650 _4  |a Policy iteration
650 _4  |a Stochastic shortest paths
650 _4  |a Stochastic approximation
700 1_  |a Bertsekas, Dimitri P. |4 aut
773 08  |i Enthalten in |t Annals of operations research |d Springer US, 1984 |g 208(2012), 1 vom: 18. Apr., Seite 95-132 |w (DE-627)12964370X |w (DE-600)252629-3 |w (DE-576)018141862 |x 0254-5330
773 18  |g volume:208 |g year:2012 |g number:1 |g day:18 |g month:04 |g pages:95-132
856 41  |u https://doi.org/10.1007/s10479-012-1128-z |z lizenzpflichtig |3 Volltext
912 __  |a GBV_USEFLAG_A
912 __  |a SYSFLAG_A
912 __  |a GBV_OLC
912 __  |a SSG-OLC-WIW
912 __  |a SSG-OLC-MAT
912 __  |a GBV_ILN_26
912 __  |a GBV_ILN_4029
951 __  |a AR
952 __  |d 208 |j 2012 |e 1 |b 18 |c 04 |h 95-132