Q-learning and policy iteration algorithms for stochastic shortest path problems
Abstract: We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
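Since the abstract outlines an algorithmic scheme, a small illustration may help. The sketch below is a synchronous, model-based, lookup-table caricature of the idea described above, not the authors' asynchronous or stochastic implementations: it alternates one greedy minimization over all controls, which freezes the stopping values J(j) = min_v Q(j, v) and a greedy policy mu, with a few cheap value-iteration sweeps on the associated optimal stopping problem, where each successor state either "stops" at cost J(j) or "continues" with Q(j, mu(j)). All names (pi_like_q_learning, P, g, eval_sweeps) and the toy model are illustrative assumptions, not taken from the record or the paper.

```python
# Minimal, assumption-laden sketch of policy-iteration-like Q-learning
# for a stochastic shortest path (SSP) problem with a termination state.
import numpy as np


def pi_like_q_learning(P, g, num_cycles=50, eval_sweeps=5):
    """P: (U, S, S) substochastic transition matrices among non-terminal states
       (each row may sum to less than 1; the remainder is the probability of
       moving to the cost-free termination state).
    g: (U, S) expected one-stage costs.
    Returns a Q-table of shape (S, U)."""
    U, S, _ = P.shape
    Q = np.zeros((S, U))
    for _ in range(num_cycles):
        # Improvement-like step: the only minimization over all controls.
        J = Q.min(axis=1)        # stopping values J(j) = min_v Q(j, v)
        mu = Q.argmin(axis=1)    # greedy policy
        # Inexact evaluation: a finite number of value iterations on the
        # optimal stopping problem with J and mu held fixed.
        for _ in range(eval_sweeps):
            w = np.minimum(J, Q[np.arange(S), mu])  # stop vs. continue at each successor
            Q = np.stack([g[u] + P[u] @ w for u in range(U)], axis=1)  # (S, U)
    return Q


if __name__ == "__main__":
    # Toy random SSP: 2 controls, 4 states, 10% termination probability per step.
    rng = np.random.default_rng(0)
    U, S = 2, 4
    P = rng.random((U, S, S))
    P = 0.9 * P / P.sum(axis=2, keepdims=True)
    g = rng.random((U, S))
    Q = pi_like_q_learning(P, g)
    print("Greedy policy:", Q.argmin(axis=1))
    print("Cost-to-go:   ", Q.min(axis=1))
```

In this sketch only one sweep per cycle minimizes over all controls; the eval_sweeps inner updates reuse the frozen J and mu, which is the "lower overhead" point the abstract makes in comparison with classical Q-learning.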
Detailed description

Author: Yu, Huizhen [author]
Format: Article
Language: English
Published: 2012
Subjects: Markov decision processes; Q-learning; Approximate dynamic programming; Value iteration; Policy iteration; Stochastic shortest paths; Stochastic approximation
Note: © Springer Science+Business Media, LLC 2012
Contained in: Annals of operations research - Springer US, 1984, 208(2012), 1, 18 Apr., pages 95-132
Contained in: volume:208 ; year:2012 ; number:1 ; day:18 ; month:04 ; pages:95-132
Link: https://doi.org/10.1007/s10479-012-1128-z (full text; license required)
DOI / URN: 10.1007/s10479-012-1128-z
Catalog ID: OLC2111156647
LEADER  01000naa a22002652 4500
001     OLC2111156647
003     DE-627
005     20230502202753.0
007     tu
008     230502s2012 xx ||||| 00| ||eng c
024 7_  |a 10.1007/s10479-012-1128-z |2 doi
035 __  |a (DE-627)OLC2111156647
035 __  |a (DE-He213)s10479-012-1128-z-p
040 __  |a DE-627 |b ger |c DE-627 |e rakwb
041 __  |a eng
082 04  |a 004 |q VZ
084 __  |a 3,2 |2 ssgn
100 1_  |a Yu, Huizhen |e verfasserin |4 aut
245 10  |a Q-learning and policy iteration algorithms for stochastic shortest path problems
264 _1  |c 2012
336 __  |a Text |b txt |2 rdacontent
337 __  |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia
338 __  |a Band |b nc |2 rdacarrier
500 __  |a © Springer Science+Business Media, LLC 2012
520 __  |a Abstract We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
650 _4  |a Markov decision processes
650 _4  |a Q-learning
650 _4  |a Approximate dynamic programming
650 _4  |a Value iteration
650 _4  |a Policy iteration
650 _4  |a Stochastic shortest paths
650 _4  |a Stochastic approximation
700 1_  |a Bertsekas, Dimitri P. |4 aut
773 08  |i Enthalten in |t Annals of operations research |d Springer US, 1984 |g 208(2012), 1 vom: 18. Apr., Seite 95-132 |w (DE-627)12964370X |w (DE-600)252629-3 |w (DE-576)018141862 |x 0254-5330
773 18  |g volume:208 |g year:2012 |g number:1 |g day:18 |g month:04 |g pages:95-132
856 41  |u https://doi.org/10.1007/s10479-012-1128-z |z lizenzpflichtig |3 Volltext
912 __  |a GBV_USEFLAG_A
912 __  |a SYSFLAG_A
912 __  |a GBV_OLC
912 __  |a SSG-OLC-WIW
912 __  |a SSG-OLC-MAT
912 __  |a GBV_ILN_26
912 __  |a GBV_ILN_4029
951 __  |a AR
952 __  |d 208 |j 2012 |e 1 |b 18 |c 04 |h 95-132