Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data
Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed fo...
Detailed description
Author: |
Koufakou, Anna [author] |
---|
Format: |
Article |
---|---|
Language: |
English |
Published: |
2010 |
---|
Keywords: |
---|
Note: |
© Springer-Verlag London Limited 2010 |
---|
Parent work: |
Contained in: Knowledge and information systems - Springer-Verlag, 2000, 29(2010), no. 3, 8 Dec., pages 697-725 |
---|---|
Parent work: |
volume:29 ; year:2010 ; number:3 ; day:08 ; month:12 ; pages:697-725 |
Links: |
---|
DOI / URN: |
10.1007/s10115-010-0343-7 |
---|
Catalog ID: |
OLC2063375380 |
---|
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | OLC2063375380 | ||
003 | DE-627 | ||
005 | 20230502172711.0 | ||
007 | tu | ||
008 | 200820s2010 xx ||||| 00| ||eng c | ||
024 | 7 | |a 10.1007/s10115-010-0343-7 |2 doi | |
035 | |a (DE-627)OLC2063375380 | ||
035 | |a (DE-He213)s10115-010-0343-7-p | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
082 | 0 | 4 | |a 004 |q VZ |
082 | 0 | 4 | |a 004 |q VZ |
084 | |a 06.74$jInformationssysteme |2 bkl | ||
084 | |a 54.64$jDatenbanken |2 bkl | ||
100 | 1 | |a Koufakou, Anna |e verfasserin |4 aut | |
245 | 1 | 0 | |a Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
264 | 1 | |c 2010 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia | ||
338 | |a Band |b nc |2 rdacarrier | ||
500 | |a © Springer-Verlag London Limited 2010 | ||
520 | |a Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various δ values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values. | ||
650 | 4 | |a Outlier detection | |
650 | 4 | |a Anomaly detection | |
650 | 4 | |a Frequent itemset mining | |
650 | 4 | |a Non-Derivable itemsets | |
650 | 4 | |a Categorical datasets | |
700 | 1 | |a Secretan, Jimmy |4 aut | |
700 | 1 | |a Georgiopoulos, Michael |4 aut | |
773 | 0 | 8 | |i Enthalten in |t Knowledge and information systems |d Springer-Verlag, 2000 |g 29(2010), 3 vom: 08. Dez., Seite 697-725 |w (DE-627)323971725 |w (DE-600)2036569-X |w (DE-576)9323971723 |x 0219-1377 |7 nnns |
773 | 1 | 8 | |g volume:29 |g year:2010 |g number:3 |g day:08 |g month:12 |g pages:697-725 |
856 | 4 | 1 | |u https://doi.org/10.1007/s10115-010-0343-7 |z lizenzpflichtig |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_OLC | ||
912 | |a SSG-OLC-MAT | ||
912 | |a SSG-OLC-BUB | ||
912 | |a GBV_ILN_26 | ||
912 | |a GBV_ILN_70 | ||
912 | |a GBV_ILN_100 | ||
912 | |a GBV_ILN_267 | ||
912 | |a GBV_ILN_4277 | ||
936 | b | k | |a 06.74$jInformationssysteme |q VZ |0 106415212 |0 (DE-625)106415212 |
936 | b | k | |a 54.64$jDatenbanken |q VZ |0 106410865 |0 (DE-625)106410865 |
951 | |a AR | ||
952 | |d 29 |j 2010 |e 3 |b 08 |c 12 |h 697-725 |
author_variant |
a k ak j s js m g mg |
---|---|
matchkey_str |
article:02191377:2010----::odrvbetmesofsotireetoilreihies |
hierarchy_sort_str |
2010 |
bklnumber |
06.74$jInformationssysteme 54.64$jDatenbanken |
publishDate |
2010 |
allfields |
10.1007/s10115-010-0343-7 doi (DE-627)OLC2063375380 (DE-He213)s10115-010-0343-7-p DE-627 ger DE-627 rakwb eng 004 VZ 004 VZ 06.74$jInformationssysteme bkl 54.64$jDatenbanken bkl Koufakou, Anna verfasserin aut Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data 2010 Text txt rdacontent ohne Hilfsmittel zu benutzen n rdamedia Band nc rdacarrier © Springer-Verlag London Limited 2010 Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. 
At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various δ values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values. Outlier detection Anomaly detection Frequent itemset mining Non-Derivable itemsets Categorical datasets Secretan, Jimmy aut Georgiopoulos, Michael aut Enthalten in Knowledge and information systems Springer-Verlag, 2000 29(2010), 3 vom: 08. Dez., Seite 697-725 (DE-627)323971725 (DE-600)2036569-X (DE-576)9323971723 0219-1377 nnns volume:29 year:2010 number:3 day:08 month:12 pages:697-725 https://doi.org/10.1007/s10115-010-0343-7 lizenzpflichtig Volltext GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-MAT SSG-OLC-BUB GBV_ILN_26 GBV_ILN_70 GBV_ILN_100 GBV_ILN_267 GBV_ILN_4277 06.74$jInformationssysteme VZ 106415212 (DE-625)106415212 54.64$jDatenbanken VZ 106410865 (DE-625)106410865 AR 29 2010 3 08 12 697-725 |
language |
English |
source |
Enthalten in Knowledge and information systems 29(2010), 3 vom: 08. Dez., Seite 697-725 volume:29 year:2010 number:3 day:08 month:12 pages:697-725 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Outlier detection Anomaly detection Frequent itemset mining Non-Derivable itemsets Categorical datasets |
dewey-raw |
004 |
isfreeaccess_bool |
false |
container_title |
Knowledge and information systems |
authorswithroles_txt_mv |
Koufakou, Anna @@aut@@ Secretan, Jimmy @@aut@@ Georgiopoulos, Michael @@aut@@ |
publishDateDaySort_date |
2010-12-08T00:00:00Z |
hierarchy_top_id |
323971725 |
dewey-sort |
14 |
id |
OLC2063375380 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">OLC2063375380</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230502172711.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">200820s2010 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s10115-010-0343-7</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC2063375380</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s10115-010-0343-7-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">06.74$jInformationssysteme</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">54.64$jDatenbanken</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Koufakou, Anna</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2010</subfield></datafield><datafield tag="336" ind1=" " ind2=" 
"><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© Springer-Verlag London Limited 2010</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. 
What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various δ values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Outlier detection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Anomaly detection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Frequent itemset mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Non-Derivable itemsets</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Categorical datasets</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Secretan, Jimmy</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Georgiopoulos, Michael</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">Knowledge and information systems</subfield><subfield code="d">Springer-Verlag, 2000</subfield><subfield code="g">29(2010), 3 vom: 08. 
Dez., Seite 697-725</subfield><subfield code="w">(DE-627)323971725</subfield><subfield code="w">(DE-600)2036569-X</subfield><subfield code="w">(DE-576)9323971723</subfield><subfield code="x">0219-1377</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:29</subfield><subfield code="g">year:2010</subfield><subfield code="g">number:3</subfield><subfield code="g">day:08</subfield><subfield code="g">month:12</subfield><subfield code="g">pages:697-725</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s10115-010-0343-7</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-BUB</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_26</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_100</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_267</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4277</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">06.74$jInformationssysteme</subfield><subfield code="q">VZ</subfield><subfield code="0">106415212</subfield><subfield code="0">(DE-625)106415212</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">54.64$jDatenbanken</subfield><subfield 
code="q">VZ</subfield><subfield code="0">106410865</subfield><subfield code="0">(DE-625)106410865</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">29</subfield><subfield code="j">2010</subfield><subfield code="e">3</subfield><subfield code="b">08</subfield><subfield code="c">12</subfield><subfield code="h">697-725</subfield></datafield></record></collection>
|
author |
Koufakou, Anna |
spellingShingle |
Koufakou, Anna ddc 004 bkl 06.74$jInformationssysteme bkl 54.64$jDatenbanken misc Outlier detection misc Anomaly detection misc Frequent itemset mining misc Non-Derivable itemsets misc Categorical datasets Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
authorStr |
Koufakou, Anna |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)323971725 |
format |
Article |
dewey-ones |
004 - Data processing & computer science |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
0219-1377 |
topic_title |
004 VZ 06.74$jInformationssysteme bkl 54.64$jDatenbanken bkl Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data Outlier detection Anomaly detection Frequent itemset mining Non-Derivable itemsets Categorical datasets |
topic |
ddc 004 bkl 06.74$jInformationssysteme bkl 54.64$jDatenbanken misc Outlier detection misc Anomaly detection misc Frequent itemset mining misc Non-Derivable itemsets misc Categorical datasets |
format_facet |
Aufsätze Gedruckte Aufsätze |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
nc |
hierarchy_parent_title |
Knowledge and information systems |
hierarchy_parent_id |
323971725 |
dewey-tens |
000 - Computer science, knowledge & systems |
hierarchy_top_title |
Knowledge and information systems |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)323971725 (DE-600)2036569-X (DE-576)9323971723 |
title |
Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
ctrlnum |
(DE-627)OLC2063375380 (DE-He213)s10115-010-0343-7-p |
title_full |
Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
author_sort |
Koufakou, Anna |
journal |
Knowledge and information systems |
journalStr |
Knowledge and information systems |
lang_code |
eng |
isOA_bool |
false |
dewey-hundreds |
000 - Computer science, information & general works |
recordtype |
marc |
publishDateSort |
2010 |
contenttype_str_mv |
txt |
container_start_page |
697 |
author_browse |
Koufakou, Anna Secretan, Jimmy Georgiopoulos, Michael |
container_volume |
29 |
class |
004 VZ 06.74$jInformationssysteme bkl 54.64$jDatenbanken bkl |
format_se |
Articles |
author-letter |
Koufakou, Anna |
doi_str_mv |
10.1007/s10115-010-0343-7 |
normlink |
106415212 106410865 |
normlink_prefix_str_mv |
106415212 (DE-625)106415212 106410865 (DE-625)106410865 |
dewey-full |
004 |
title_sort |
non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
title_auth |
Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
abstract |
Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based outlier detection methods. What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit similar accuracy to the NDI-based method for various δ values, along with further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values. © Springer-Verlag London Limited 2010 |
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-MAT SSG-OLC-BUB GBV_ILN_26 GBV_ILN_70 GBV_ILN_100 GBV_ILN_267 GBV_ILN_4277 |
container_issue |
3 |
title_short |
Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data |
url |
https://doi.org/10.1007/s10115-010-0343-7 |
remote_bool |
false |
author2 |
Secretan, Jimmy Georgiopoulos, Michael |
author2Str |
Secretan, Jimmy Georgiopoulos, Michael |
ppnlink |
323971725 |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
doi_str |
10.1007/s10115-010-0343-7 |
up_date |
2024-07-03T18:48:15.229Z |
_version_ |
1803584793719341057 |
fullrecord_marcxml |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">OLC2063375380</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230502172711.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">200820s2010 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s10115-010-0343-7</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC2063375380</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s10115-010-0343-7-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">06.74$jInformationssysteme</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">54.64$jDatenbanken</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Koufakou, Anna</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2010</subfield></datafield><datafield tag="336" ind1=" " ind2=" 
"><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© Springer-Verlag London Limited 2010</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in terms of speed and scalability over FI-based Outlier Detection method. 
What is significant is that NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs is shown to exhibit similar accuracy to the NDI-based method for various δ values and further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Outlier detection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Anomaly detection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Frequent itemset mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Non-Derivable itemsets</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Categorical datasets</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Secretan, Jimmy</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Georgiopoulos, Michael</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">Knowledge and information systems</subfield><subfield code="d">Springer-Verlag, 2000</subfield><subfield code="g">29(2010), 3 vom: 08. 
Dez., Seite 697-725</subfield><subfield code="w">(DE-627)323971725</subfield><subfield code="w">(DE-600)2036569-X</subfield><subfield code="w">(DE-576)9323971723</subfield><subfield code="x">0219-1377</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:29</subfield><subfield code="g">year:2010</subfield><subfield code="g">number:3</subfield><subfield code="g">day:08</subfield><subfield code="g">month:12</subfield><subfield code="g">pages:697-725</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s10115-010-0343-7</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-BUB</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_26</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_100</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_267</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4277</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">06.74$jInformationssysteme</subfield><subfield code="q">VZ</subfield><subfield code="0">106415212</subfield><subfield code="0">(DE-625)106415212</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">54.64$jDatenbanken</subfield><subfield 
code="q">VZ</subfield><subfield code="0">106410865</subfield><subfield code="0">(DE-625)106410865</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">29</subfield><subfield code="j">2010</subfield><subfield code="e">3</subfield><subfield code="b">08</subfield><subfield code="c">12</subfield><subfield code="h">697-725</subfield></datafield></record></collection>
|
score |
7.399473 |