Evaluation of different approaches for missing data imputation on features associated to genomic data
Abstract. Background: Missing data are a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias subsequent analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data po...
Detailed description
Author: |
Ben Omega Petrazzini [author] Hugo Naya [author] Fernando Lopez-Bello [author] Gustavo Vazquez [author] Lucía Spangenberg [author] |
---|
Format: |
E-article |
---|---|
Language: |
English |
Published: |
2021 |
---|
Keywords: |
---|
Contained in: |
In: BioData Mining - BMC, 2010, 14(2021), 1, page 13 |
---|---|
Contained in: |
volume:14 ; year:2021 ; number:1 ; pages:13 |
Links: |
---|
DOI / URN: |
10.1186/s13040-021-00274-7 |
---|
Catalog ID: |
DOAJ002593157 |
---|
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | DOAJ002593157 | ||
003 | DE-627 | ||
005 | 20230309171209.0 | ||
007 | cr uuu---uuuuu | ||
008 | 230225s2021 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1186/s13040-021-00274-7 |2 doi | |
035 | |a (DE-627)DOAJ002593157 | ||
035 | |a (DE-599)DOAJ386db40a09e84f86bf7e17a7ddf0eb05 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
050 | 0 | |a R858-859.7 | |
050 | 0 | |a QA299.6-433 | |
100 | 0 | |a Ben Omega Petrazzini |e verfasserin |4 aut | |
245 | 1 | 0 | |a Evaluation of different approaches for missing data imputation on features associated to genomic data |
264 | 1 | |c 2021 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
520 | |a Abstract. Background: Missing data are a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias subsequent analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points. These can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the latter three, and in the context of genomic data (especially non-coding data), we discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results: Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is the best for a particular data set. Conclusions: We found that Random Forest and kNN are the best imputation methods for genomics data, including non-coding variants. Since Random Forest is computationally more demanding, kNN remains a more realistic approach. Future work on variant prioritization through genomic screening tests could largely profit from this methodology. | ||
650 | 4 | |a Machine learning | |
650 | 4 | |a imputation | |
650 | 4 | |a missing data | |
650 | 4 | |a genomics | |
650 | 4 | |a pathogenic variants | |
653 | 0 | |a Computer applications to medicine. Medical informatics | |
653 | 0 | |a Analysis | |
700 | 0 | |a Hugo Naya |e verfasserin |4 aut | |
700 | 0 | |a Fernando Lopez-Bello |e verfasserin |4 aut | |
700 | 0 | |a Gustavo Vazquez |e verfasserin |4 aut | |
700 | 0 | |a Lucía Spangenberg |e verfasserin |4 aut | |
773 | 0 | 8 | |i In |t BioData Mining |d BMC, 2010 |g 14(2021), 1, Seite 13 |w (DE-627)572421893 |w (DE-600)2438773-3 |x 17560381 |7 nnns |
773 | 1 | 8 | |g volume:14 |g year:2021 |g number:1 |g pages:13 |
856 | 4 | 0 | |u https://doi.org/10.1186/s13040-021-00274-7 |z kostenfrei |
856 | 4 | 0 | |u https://doaj.org/article/386db40a09e84f86bf7e17a7ddf0eb05 |z kostenfrei |
856 | 4 | 0 | |u https://doi.org/10.1186/s13040-021-00274-7 |z kostenfrei |
856 | 4 | 2 | |u https://doaj.org/toc/1756-0381 |y Journal toc |z kostenfrei |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_DOAJ | ||
912 | |a GBV_ILN_11 | ||
912 | |a GBV_ILN_20 | ||
912 | |a GBV_ILN_22 | ||
912 | |a GBV_ILN_23 | ||
912 | |a GBV_ILN_24 | ||
912 | |a GBV_ILN_31 | ||
912 | |a GBV_ILN_39 | ||
912 | |a GBV_ILN_40 | ||
912 | |a GBV_ILN_60 | ||
912 | |a GBV_ILN_62 | ||
912 | |a GBV_ILN_63 | ||
912 | |a GBV_ILN_65 | ||
912 | |a GBV_ILN_69 | ||
912 | |a GBV_ILN_70 | ||
912 | |a GBV_ILN_73 | ||
912 | |a GBV_ILN_74 | ||
912 | |a GBV_ILN_95 | ||
912 | |a GBV_ILN_105 | ||
912 | |a GBV_ILN_110 | ||
912 | |a GBV_ILN_151 | ||
912 | |a GBV_ILN_161 | ||
912 | |a GBV_ILN_170 | ||
912 | |a GBV_ILN_206 | ||
912 | |a GBV_ILN_213 | ||
912 | |a GBV_ILN_230 | ||
912 | |a GBV_ILN_285 | ||
912 | |a GBV_ILN_293 | ||
912 | |a GBV_ILN_602 | ||
912 | |a GBV_ILN_2003 | ||
912 | |a GBV_ILN_2005 | ||
912 | |a GBV_ILN_2009 | ||
912 | |a GBV_ILN_2011 | ||
912 | |a GBV_ILN_2014 | ||
912 | |a GBV_ILN_2055 | ||
912 | |a GBV_ILN_2111 | ||
912 | |a GBV_ILN_4012 | ||
912 | |a GBV_ILN_4037 | ||
912 | |a GBV_ILN_4112 | ||
912 | |a GBV_ILN_4125 | ||
912 | |a GBV_ILN_4126 | ||
912 | |a GBV_ILN_4249 | ||
912 | |a GBV_ILN_4305 | ||
912 | |a GBV_ILN_4306 | ||
912 | |a GBV_ILN_4307 | ||
912 | |a GBV_ILN_4313 | ||
912 | |a GBV_ILN_4322 | ||
912 | |a GBV_ILN_4323 | ||
912 | |a GBV_ILN_4324 | ||
912 | |a GBV_ILN_4325 | ||
912 | |a GBV_ILN_4326 | ||
912 | |a GBV_ILN_4335 | ||
912 | |a GBV_ILN_4338 | ||
912 | |a GBV_ILN_4367 | ||
912 | |a GBV_ILN_4700 | ||
951 | |a AR | ||
952 | |d 14 |j 2021 |e 1 |h 13 |
author_variant |
b o p bop h n hn f l b flb g v gv l s ls |
---|---|
matchkey_str |
article:17560381:2021----::vlainfifrnapocefrisndtipttoofauea |
hierarchy_sort_str |
2021 |
callnumber-subject-code |
R |
publishDate |
2021 |
allfields |
10.1186/s13040-021-00274-7 doi (DE-627)DOAJ002593157 (DE-599)DOAJ386db40a09e84f86bf7e17a7ddf0eb05 DE-627 ger DE-627 rakwb eng R858-859.7 QA299.6-433 Ben Omega Petrazzini verfasserin aut Evaluation of different approaches for missing data imputation on features associated to genomic data 2021 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier Abstract Background Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias the posterior analysis. The data collection process can lead to different distribution, frequency, and structure of missing data points. They can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the three later, and in the context of genomic data (especially non-coding data), we will discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is the best for a particular data set. Conclusions We found that Random Forest and kNN are the best imputation method for genomics data, including non-coding variants. Since Random Forest is computationally more challenging, kNN remains a more realistic approach. Future work on variant prioritization thru genomic screening tests could largely profit from this methodology. Machine learning imputation missing data genomics pathogenic variants Computer applications to medicine. 
Medical informatics Analysis Hugo Naya verfasserin aut Fernando Lopez-Bello verfasserin aut Gustavo Vazquez verfasserin aut Lucía Spangenberg verfasserin aut In BioData Mining BMC, 2010 14(2021), 1, Seite 13 (DE-627)572421893 (DE-600)2438773-3 17560381 nnns volume:14 year:2021 number:1 pages:13 https://doi.org/10.1186/s13040-021-00274-7 kostenfrei https://doaj.org/article/386db40a09e84f86bf7e17a7ddf0eb05 kostenfrei https://doi.org/10.1186/s13040-021-00274-7 kostenfrei https://doaj.org/toc/1756-0381 Journal toc kostenfrei GBV_USEFLAG_A SYSFLAG_A GBV_DOAJ GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_65 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_95 GBV_ILN_105 GBV_ILN_110 GBV_ILN_151 GBV_ILN_161 GBV_ILN_170 GBV_ILN_206 GBV_ILN_213 GBV_ILN_230 GBV_ILN_285 GBV_ILN_293 GBV_ILN_602 GBV_ILN_2003 GBV_ILN_2005 GBV_ILN_2009 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2055 GBV_ILN_2111 GBV_ILN_4012 GBV_ILN_4037 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4249 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4335 GBV_ILN_4338 GBV_ILN_4367 GBV_ILN_4700 AR 14 2021 1 13 |
language |
English |
source |
In BioData Mining 14(2021), 1, Seite 13 volume:14 year:2021 number:1 pages:13 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Machine learning imputation missing data genomics pathogenic variants Computer applications to medicine. Medical informatics Analysis |
isfreeaccess_bool |
true |
container_title |
BioData Mining |
authorswithroles_txt_mv |
Ben Omega Petrazzini @@aut@@ Hugo Naya @@aut@@ Fernando Lopez-Bello @@aut@@ Gustavo Vazquez @@aut@@ Lucía Spangenberg @@aut@@ |
publishDateDaySort_date |
2021-01-01T00:00:00Z |
hierarchy_top_id |
572421893 |
id |
DOAJ002593157 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">DOAJ002593157</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230309171209.0</controlfield><controlfield tag="007">cr uuu---uuuuu</controlfield><controlfield tag="008">230225s2021 xx |||||o 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1186/s13040-021-00274-7</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)DOAJ002593157</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)DOAJ386db40a09e84f86bf7e17a7ddf0eb05</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">R858-859.7</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA299.6-433</subfield></datafield><datafield tag="100" ind1="0" ind2=" "><subfield code="a">Ben Omega Petrazzini</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Evaluation of different approaches for missing data imputation on features associated to genomic data</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2021</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">Computermedien</subfield><subfield code="b">c</subfield><subfield 
code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Online-Ressource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Background Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias the posterior analysis. The data collection process can lead to different distribution, frequency, and structure of missing data points. They can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the three later, and in the context of genomic data (especially non-coding data), we will discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is the best for a particular data set. Conclusions We found that Random Forest and kNN are the best imputation method for genomics data, including non-coding variants. Since Random Forest is computationally more challenging, kNN remains a more realistic approach. 
Future work on variant prioritization thru genomic screening tests could largely profit from this methodology.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Machine learning</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">imputation</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">missing data</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">genomics</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">pathogenic variants</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Computer applications to medicine. Medical informatics</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Analysis</subfield></datafield><datafield tag="700" ind1="0" ind2=" "><subfield code="a">Hugo Naya</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="0" ind2=" "><subfield code="a">Fernando Lopez-Bello</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="0" ind2=" "><subfield code="a">Gustavo Vazquez</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="0" ind2=" "><subfield code="a">Lucía Spangenberg</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">In</subfield><subfield code="t">BioData Mining</subfield><subfield code="d">BMC, 2010</subfield><subfield code="g">14(2021), 1, Seite 13</subfield><subfield code="w">(DE-627)572421893</subfield><subfield code="w">(DE-600)2438773-3</subfield><subfield code="x">17560381</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:14</subfield><subfield 
code="g">year:2021</subfield><subfield code="g">number:1</subfield><subfield code="g">pages:13</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://doi.org/10.1186/s13040-021-00274-7</subfield><subfield code="z">kostenfrei</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://doaj.org/article/386db40a09e84f86bf7e17a7ddf0eb05</subfield><subfield code="z">kostenfrei</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://doi.org/10.1186/s13040-021-00274-7</subfield><subfield code="z">kostenfrei</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="u">https://doaj.org/toc/1756-0381</subfield><subfield code="y">Journal toc</subfield><subfield code="z">kostenfrei</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_DOAJ</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_11</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_20</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_22</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_23</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_24</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_31</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_39</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_40</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_60</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield 
code="a">GBV_ILN_62</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_63</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_65</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_69</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_73</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_74</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_95</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_105</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_110</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_151</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_161</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_170</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_206</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_213</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_230</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_285</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_293</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_602</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2003</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2005</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2009</subfield></datafield><datafield tag="912" ind1=" " 
ind2=" "><subfield code="a">GBV_ILN_2011</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2014</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2055</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2111</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4012</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4037</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4112</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4125</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4126</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4249</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4305</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4306</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4307</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4313</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4322</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4323</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4324</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4325</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4326</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4335</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4338</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield 
code="a">GBV_ILN_4367</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4700</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">14</subfield><subfield code="j">2021</subfield><subfield code="e">1</subfield><subfield code="h">13</subfield></datafield></record></collection>
|
callnumber-first |
R - Medicine |
author |
Ben Omega Petrazzini |
spellingShingle |
Ben Omega Petrazzini misc R858-859.7 misc QA299.6-433 misc Machine learning misc imputation misc missing data misc genomics misc pathogenic variants misc Computer applications to medicine. Medical informatics misc Analysis Evaluation of different approaches for missing data imputation on features associated to genomic data |
authorStr |
Ben Omega Petrazzini |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)572421893 |
format |
electronic Article |
delete_txt_mv |
keep |
author_role |
aut aut aut aut aut |
collection |
DOAJ |
remote_str |
true |
callnumber-label |
R858-859 |
illustrated |
Not Illustrated |
issn |
17560381 |
topic_title |
R858-859.7 QA299.6-433 Evaluation of different approaches for missing data imputation on features associated to genomic data Machine learning imputation missing data genomics pathogenic variants |
topic |
misc R858-859.7 misc QA299.6-433 misc Machine learning misc imputation misc missing data misc genomics misc pathogenic variants misc Computer applications to medicine. Medical informatics misc Analysis |
format_facet |
Electronic articles Articles Electronic resource |
format_main_str_mv |
Text Journal/Article |
carriertype_str_mv |
cr |
hierarchy_parent_title |
BioData Mining |
hierarchy_parent_id |
572421893 |
hierarchy_top_title |
BioData Mining |
isfreeaccess_txt |
true |
familylinks_str_mv |
(DE-627)572421893 (DE-600)2438773-3 |
title |
Evaluation of different approaches for missing data imputation on features associated to genomic data |
ctrlnum |
(DE-627)DOAJ002593157 (DE-599)DOAJ386db40a09e84f86bf7e17a7ddf0eb05 |
title_full |
Evaluation of different approaches for missing data imputation on features associated to genomic data |
author_sort |
Ben Omega Petrazzini |
journal |
BioData Mining |
journalStr |
BioData Mining |
callnumber-first-code |
R |
lang_code |
eng |
isOA_bool |
true |
recordtype |
marc |
publishDateSort |
2021 |
contenttype_str_mv |
txt |
container_start_page |
13 |
author_browse |
Ben Omega Petrazzini Hugo Naya Fernando Lopez-Bello Gustavo Vazquez Lucía Spangenberg |
container_volume |
14 |
class |
R858-859.7 QA299.6-433 |
format_se |
Electronic articles |
author-letter |
Ben Omega Petrazzini |
doi_str_mv |
10.1186/s13040-021-00274-7 |
author2-role |
verfasserin |
title_sort |
evaluation of different approaches for missing data imputation on features associated to genomic data |
callnumber |
R858-859.7 |
title_auth |
Evaluation of different approaches for missing data imputation on features associated to genomic data |
abstract |
Abstract Background Missing data are a common issue in many fields, such as electronics, image processing, medical records and genomics. They can limit or even bias subsequent analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points. Missing data can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the latter three, and in the context of genomic data (especially non-coding data), we discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results Random Forest and kNN algorithms showed the best performance on the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is best for a particular data set. Conclusions We found that Random Forest and kNN are the best imputation methods for genomics data, including non-coding variants. Since Random Forest is computationally more demanding, kNN remains a more realistic approach. Future work on variant prioritization through genomic screening tests could profit substantially from this methodology. |
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_DOAJ GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_65 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_95 GBV_ILN_105 GBV_ILN_110 GBV_ILN_151 GBV_ILN_161 GBV_ILN_170 GBV_ILN_206 GBV_ILN_213 GBV_ILN_230 GBV_ILN_285 GBV_ILN_293 GBV_ILN_602 GBV_ILN_2003 GBV_ILN_2005 GBV_ILN_2009 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2055 GBV_ILN_2111 GBV_ILN_4012 GBV_ILN_4037 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4249 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4335 GBV_ILN_4338 GBV_ILN_4367 GBV_ILN_4700 |
container_issue |
1 |
title_short |
Evaluation of different approaches for missing data imputation on features associated to genomic data |
url |
https://doi.org/10.1186/s13040-021-00274-7 https://doaj.org/article/386db40a09e84f86bf7e17a7ddf0eb05 https://doaj.org/toc/1756-0381 |
remote_bool |
true |
author2 |
Hugo Naya Fernando Lopez-Bello Gustavo Vazquez Lucía Spangenberg |
author2Str |
Hugo Naya Fernando Lopez-Bello Gustavo Vazquez Lucía Spangenberg |
ppnlink |
572421893 |
callnumber-subject |
R - General Medicine |
mediatype_str_mv |
c |
isOA_txt |
true |
hochschulschrift_bool |
false |
doi_str |
10.1186/s13040-021-00274-7 |
callnumber-a |
R858-859.7 |
up_date |
2024-07-04T01:47:28.052Z |
_version_ |
1803611168365871104 |
score |
7.4000416 |