A Dirichlet process biterm-based mixture model for short text stream clustering
Abstract: Short text stream clustering has become an important problem for mining textual data on diverse social media platforms (e.g., Twitter). However, most existing clustering methods (e.g., LDA and PLSA) were developed under the assumption of a static corpus of long texts, and little attention has been paid to short text streams. Unlike long texts, short texts are more challenging to cluster because their word co-occurrence patterns suffer from sparsity. This paper proposes a Dirichlet process biterm-based mixture model (DP-BMM) that addresses both the topic drift problem and the sparsity problem in short text stream clustering. Its major advantages are that (1) DP-BMM explicitly exploits the word pairs (biterms) constructed from each document to enrich the word co-occurrence pattern of short texts, and (2) DP-BMM handles topic drift in short text streams naturally. An improved variant with a forgetting property, DP-BMM-FP, efficiently discards the biterms of outdated documents by deleting the clusters of outdated batches. Inference is performed with an online Gibbs sampling method for parameter estimation. Extensive experiments on real-world datasets show that DP-BMM and DP-BMM-FP outperform state-of-the-art methods in terms of NMI.
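The biterm representation at the heart of DP-BMM can be illustrated with a minimal sketch (the function name and preprocessing are illustrative, not taken from the paper): each short document is expanded into all unordered word pairs, which densifies the co-occurrence evidence that sparse short texts lack on their own.

```python
from itertools import combinations

def biterms(tokens):
    """Expand one short document into its unordered word pairs (biterms).

    A document with n tokens yields n*(n-1)/2 biterms, giving the model
    far more co-occurrence evidence than the raw token list provides.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

doc = ["dirichlet", "process", "mixture", "model"]
print(biterms(doc))  # 6 word pairs from a 4-token document
```

In the model these biterms, rather than individual words, are what each mixture component generates; the forgetting variant DP-BMM-FP then drops a batch's biterms wholesale when its clusters expire.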
Detailed description

Author: Chen, Junyang [author]
Format: Article
Language: English
Published: 2020
Subjects: Data mining; Stream clustering; Topic modeling
Note: © Springer Science+Business Media, LLC, part of Springer Nature 2020
Contained in: Applied intelligence - Springer US, 1991, 50(2020), 5, 01 Feb., pages 1609-1619
Contained in: volume:50 ; year:2020 ; number:5 ; day:01 ; month:02 ; pages:1609-1619
DOI: 10.1007/s10489-019-01606-1
Catalog ID: OLC2066109754
LEADER 01000caa a22002652 4500
001    OLC2066109754
003    DE-627
005    20230504131332.0
007    tu
008    200820s2020 xx ||||| 00| ||eng c
024 7  |a 10.1007/s10489-019-01606-1 |2 doi
035    |a (DE-627)OLC2066109754
035    |a (DE-He213)s10489-019-01606-1-p
040    |a DE-627 |b ger |c DE-627 |e rakwb
041    |a eng
082 04 |a 004 |q VZ
100 1  |a Chen, Junyang |e verfasserin |0 (orcid)0000-0002-1139-8654 |4 aut
245 10 |a A Dirichlet process biterm-based mixture model for short text stream clustering
264  1 |c 2020
336    |a Text |b txt |2 rdacontent
337    |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia
338    |a Band |b nc |2 rdacarrier
500    |a © Springer Science+Business Media, LLC, part of Springer Nature 2020
520    |a Abstract Short text stream clustering has become an important problem for mining textual data in diverse social media platforms (e.g., Twitter). However, most of the existing clustering methods (e.g., LDA and PLSA) are developed based on the assumption of a static corpus of long texts, while little attention has been given to short text streams. Different from the long texts, the clustering of short texts is more challenging since their word co-occurrence pattern easily suffers from a sparsity problem. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which can deal with the topic drift problem and the sparsity problem in short text stream clustering. The major advantages of DP-BMM include (1) DP-BMM explicitly exploits the word-pairs constructed from each document to enhance the word co-occurrence pattern in short texts; (2) DP-BMM can deal with the topic drift problem of short text streams naturally. Moreover, we further propose an improved algorithm of DP-BMM with forgetting property called DP-BMM-FP, which can efficiently delete biterms of outdated documents by deleting clusters of outdated batches. To perform inference, we adopt an online Gibbs sampling method for parameter estimation. Our extensive experimental results on real-world datasets show that DP-BMM and DP-BMM-FP can achieve a better performance than the state-of-the-art methods in terms of NMI metrics.
650  4 |a Data mining
650  4 |a Stream clustering
650  4 |a Topic modeling
700 1  |a Gong, Zhiguo |4 aut
700 1  |a Liu, Weiwen |4 aut
773 08 |i Enthalten in |t Applied intelligence |d Springer US, 1991 |g 50(2020), 5 vom: 01. Feb., Seite 1609-1619 |w (DE-627)130990515 |w (DE-600)1080229-0 |w (DE-576)029154286 |x 0924-669X |7 nnns
773 18 |g volume:50 |g year:2020 |g number:5 |g day:01 |g month:02 |g pages:1609-1619
856 41 |u https://doi.org/10.1007/s10489-019-01606-1 |z lizenzpflichtig |3 Volltext
912    |a GBV_USEFLAG_A
912    |a SYSFLAG_A
912    |a GBV_OLC
912    |a SSG-OLC-MAT
951    |a AR
952    |d 50 |j 2020 |e 5 |b 01 |c 02 |h 1609-1619
author_variant |
j c jc z g zg w l wl |
matchkey_str |
article:0924669X:2020----::drcltrcsbtrbsditrmdlosote |
hierarchy_sort_str |
2020 |
publishDate |
2020 |
language |
English |
source |
Enthalten in Applied intelligence 50(2020), 5 vom: 01. Feb., Seite 1609-1619 volume:50 year:2020 number:5 day:01 month:02 pages:1609-1619 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Data mining Stream clustering Topic modeling |
dewey-raw |
004 |
isfreeaccess_bool |
false |
container_title |
Applied intelligence |
authorswithroles_txt_mv |
Chen, Junyang @@aut@@ Gong, Zhiguo @@aut@@ Liu, Weiwen @@aut@@ |
publishDateDaySort_date |
2020-02-01T00:00:00Z |
hierarchy_top_id |
130990515 |
dewey-sort |
14 |
id |
OLC2066109754 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">OLC2066109754</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230504131332.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">200820s2020 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s10489-019-01606-1</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC2066109754</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s10489-019-01606-1-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Chen, Junyang</subfield><subfield code="e">verfasserin</subfield><subfield code="0">(orcid)0000-0002-1139-8654</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">A Dirichlet process biterm-based mixture model for short text stream clustering</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2020</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield 
code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© Springer Science+Business Media, LLC, part of Springer Nature 2020</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Short text stream clustering has become an important problem for mining textual data in diverse social media platforms (e.g., Twitter). However, most of the existing clustering methods (e.g., LDA and PLSA) are developed based on the assumption of a static corpus of long texts, while little attention has been given to short text streams. Different from the long texts, the clustering of short texts is more challenging since their word co-occurrence pattern easily suffers from a sparsity problem. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which can deal with the topic drift problem and the sparsity problem in short text stream clustering. The major advantages of DP-BMM include (1) DP-BMM explicitly exploits the word-pairs constructed from each document to enhance the word co-occurrence pattern in short texts; (2) DP-BMM can deal with the topic drift problem of short text streams naturally. Moreover, we further propose an improved algorithm of DP-BMM with forgetting property called DP-BMM-FP, which can efficiently delete biterms of outdated documents by deleting clusters of outdated batches. To perform inference, we adopt an online Gibbs sampling method for parameter estimation. 
Our extensive experimental results on real-world datasets show that DP-BMM and DP-BMM-FP can achieve a better performance than the state-of-the-art methods in terms of NMI metrics.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Stream clustering</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Topic modeling</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Gong, Zhiguo</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Liu, Weiwen</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">Applied intelligence</subfield><subfield code="d">Springer US, 1991</subfield><subfield code="g">50(2020), 5 vom: 01. Feb., Seite 1609-1619</subfield><subfield code="w">(DE-627)130990515</subfield><subfield code="w">(DE-600)1080229-0</subfield><subfield code="w">(DE-576)029154286</subfield><subfield code="x">0924-669X</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:50</subfield><subfield code="g">year:2020</subfield><subfield code="g">number:5</subfield><subfield code="g">day:01</subfield><subfield code="g">month:02</subfield><subfield code="g">pages:1609-1619</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s10489-019-01606-1</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" 
ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">50</subfield><subfield code="j">2020</subfield><subfield code="e">5</subfield><subfield code="b">01</subfield><subfield code="c">02</subfield><subfield code="h">1609-1619</subfield></datafield></record></collection>
|
author |
Chen, Junyang |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)130990515 |
format |
Article |
dewey-ones |
004 - Data processing & computer science |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
0924-669X |
topic_title |
004 VZ A Dirichlet process biterm-based mixture model for short text stream clustering Data mining Stream clustering Topic modeling |
topic |
ddc 004 misc Data mining misc Stream clustering misc Topic modeling |
format_facet |
Aufsätze Gedruckte Aufsätze |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
nc |
hierarchy_parent_title |
Applied intelligence |
hierarchy_parent_id |
130990515 |
dewey-tens |
000 - Computer science, knowledge & systems |
hierarchy_top_title |
Applied intelligence |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)130990515 (DE-600)1080229-0 (DE-576)029154286 |
title |
A Dirichlet process biterm-based mixture model for short text stream clustering |
ctrlnum |
(DE-627)OLC2066109754 (DE-He213)s10489-019-01606-1-p |
journal |
Applied intelligence |
lang_code |
eng |
isOA_bool |
false |
dewey-hundreds |
000 - Computer science, information & general works |
recordtype |
marc |
publishDateSort |
2020 |
contenttype_str_mv |
txt |
container_start_page |
1609 |
author_browse |
Chen, Junyang Gong, Zhiguo Liu, Weiwen |
container_volume |
50 |
class |
004 VZ |
format_se |
Aufsätze |
doi_str_mv |
10.1007/s10489-019-01606-1 |
normlink |
(ORCID)0000-0002-1139-8654 |
dewey-full |
004 |
abstract |
Abstract Short text stream clustering has become an important problem for mining textual data in diverse social media platforms (e.g., Twitter). However, most of the existing clustering methods (e.g., LDA and PLSA) are developed based on the assumption of a static corpus of long texts, while little attention has been given to short text streams. Different from the long texts, the clustering of short texts is more challenging since their word co-occurrence pattern easily suffers from a sparsity problem. In this paper, we propose a Dirichlet process biterm-based mixture model (DP-BMM), which can deal with the topic drift problem and the sparsity problem in short text stream clustering. The major advantages of DP-BMM include (1) DP-BMM explicitly exploits the word-pairs constructed from each document to enhance the word co-occurrence pattern in short texts; (2) DP-BMM can deal with the topic drift problem of short text streams naturally. Moreover, we further propose an improved algorithm of DP-BMM with forgetting property called DP-BMM-FP, which can efficiently delete biterms of outdated documents by deleting clusters of outdated batches. To perform inference, we adopt an online Gibbs sampling method for parameter estimation. Our extensive experimental results on real-world datasets show that DP-BMM and DP-BMM-FP can achieve a better performance than the state-of-the-art methods in terms of NMI metrics. © Springer Science+Business Media, LLC, part of Springer Nature 2020 |
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-MAT |
container_issue |
5 |
url |
https://doi.org/10.1007/s10489-019-01606-1 |
remote_bool |
false |
author2 |
Gong, Zhiguo Liu, Weiwen |
ppnlink |
130990515 |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
up_date |
2024-07-04T03:47:19.992Z |
_version_ |
1803618709668888576 |