Trie-join: a trie-based method for efficient string similarity joins
Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constrain...
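The join described in the abstract pairs strings whose edit distance lies within a threshold. As a rough illustration only, the Python sketch below indexes strings in a trie and, for each string, walks the trie while keeping one row of the edit-distance table per node. It skips any subtrie whose row minimum already exceeds the threshold. This is the common per-query trie search, not the trie-join algorithm of the paper, which this record does not describe in detail. It also performs a self-join on one collection rather than a join between two. The names `TrieNode`, `build_trie`, `search`, `self_join` and the threshold parameter `tau` are assumptions introduced for the example.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    """One trie node; `ids` holds the indices of strings that end here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []


def build_trie(strings: List[str]) -> TrieNode:
    """Index every string character by character."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> Set[int]:
    """Return ids of indexed strings within edit distance `tau` of `query`."""
    found: Set[int] = set()
    # Row for the empty prefix: distance from "" to each prefix of the query.
    first_row = list(range(len(query) + 1))
    if first_row[-1] <= tau:
        found.update(root.ids)
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the dynamic-programming table by one row for character `ch`.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= tau:
            found.update(node.ids)
        # Row minima never decrease deeper in the trie, so a subtrie whose
        # row minimum exceeds tau cannot contain a match and is skipped.
        if min(row) <= tau:
            stack.extend((c, k, row) for k, c in node.children.items())
    return found


def self_join(strings: List[str], tau: int) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, with edit distance at most `tau`."""
    root = build_trie(strings)
    pairs: Set[Tuple[int, int]] = set()
    for i, s in enumerate(strings):
        for j in search(root, s, tau):
            if i < j:
                pairs.add((i, j))
    return pairs
```

For example, `self_join(["vldb", "pvldb", "sigmod", "sigir", "icde"], 1)` yields the single index pair `(0, 1)`, since only "vldb" and "pvldb" are one edit apart.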
Detailed description
Author: | Feng, Jianhua [author] |
---|---|
Format: | Article |
Language: | English |
Published: | 2011 |
Keywords: | String similarity joins; Data integration and cleaning; Edit distance; Trie index; Subtrie pruning |
Note: | © Springer-Verlag 2011 |
Parent work: | Contained in: The VLDB journal - Springer-Verlag, 1992, 21(2011), 4, dated 4 Oct., pages 437-461 |
Parent work: | volume:21 ; year:2011 ; number:4 ; day:04 ; month:10 ; pages:437-461 |
DOI / URN: | 10.1007/s00778-011-0252-8 |
Catalog ID: | OLC205135927X |
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | OLC205135927X | ||
003 | DE-627 | ||
005 | 20230502151958.0 | ||
007 | tu | ||
008 | 200819s2011 xx ||||| 00| ||eng c | ||
024 | 7 | |a 10.1007/s00778-011-0252-8 |2 doi | |
035 | |a (DE-627)OLC205135927X | ||
035 | |a (DE-He213)s00778-011-0252-8-p | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
082 | 0 | 4 | |a 004 |q VZ |
100 | 1 | |a Feng, Jianhua |e verfasserin |4 aut | |
245 | 1 | 0 | |a Trie-join: a trie-based method for efficient string similarity joins |
264 | 1 | |c 2011 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia | ||
338 | |a Band |b nc |2 rdacarrier | ||
500 | |a © Springer-Verlag 2011 | ||
520 | |a Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings. | ||
650 | 4 | |a String similarity joins | |
650 | 4 | |a Data integration and cleaning | |
650 | 4 | |a Edit distance | |
650 | 4 | |a Trie index | |
650 | 4 | |a Subtrie pruning | |
700 | 1 | |a Wang, Jiannan |4 aut | |
700 | 1 | |a Li, Guoliang |4 aut | |
773 | 0 | 8 | |i Enthalten in |t The VLDB journal |d Springer-Verlag, 1992 |g 21(2011), 4 vom: 04. Okt., Seite 437-461 |w (DE-627)170933059 |w (DE-600)1129061-4 |w (DE-576)032856466 |x 1066-8888 |7 nnns |
773 | 1 | 8 | |g volume:21 |g year:2011 |g number:4 |g day:04 |g month:10 |g pages:437-461 |
856 | 4 | 1 | |u https://doi.org/10.1007/s00778-011-0252-8 |z lizenzpflichtig |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_OLC | ||
912 | |a SSG-OLC-MAT | ||
912 | |a GBV_ILN_22 | ||
912 | |a GBV_ILN_24 | ||
912 | |a GBV_ILN_30 | ||
912 | |a GBV_ILN_32 | ||
912 | |a GBV_ILN_65 | ||
912 | |a GBV_ILN_70 | ||
912 | |a GBV_ILN_2005 | ||
912 | |a GBV_ILN_2006 | ||
912 | |a GBV_ILN_2018 | ||
912 | |a GBV_ILN_4116 | ||
912 | |a GBV_ILN_4266 | ||
912 | |a GBV_ILN_4277 | ||
912 | |a GBV_ILN_4305 | ||
912 | |a GBV_ILN_4311 | ||
951 | |a AR | ||
952 | |d 21 |j 2011 |e 4 |b 04 |c 10 |h 437-461 |
author_variant |
j f jf j w jw g l gl |
---|---|
matchkey_str |
article:10668888:2011----::reontibsdehdoefcettig |
hierarchy_sort_str |
2011 |
publishDate |
2011 |
allfields |
10.1007/s00778-011-0252-8 doi (DE-627)OLC205135927X (DE-He213)s00778-011-0252-8-p DE-627 ger DE-627 rakwb eng 004 VZ Feng, Jianhua verfasserin aut Trie-join: a trie-based method for efficient string similarity joins 2011 Text txt rdacontent ohne Hilfsmittel zu benutzen n rdamedia Band nc rdacarrier © Springer-Verlag 2011 Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings. String similarity joins Data integration and cleaning Edit distance Trie index Subtrie pruning Wang, Jiannan aut Li, Guoliang aut Enthalten in The VLDB journal Springer-Verlag, 1992 21(2011), 4 vom: 04. 
Okt., Seite 437-461 (DE-627)170933059 (DE-600)1129061-4 (DE-576)032856466 1066-8888 nnns volume:21 year:2011 number:4 day:04 month:10 pages:437-461 https://doi.org/10.1007/s00778-011-0252-8 lizenzpflichtig Volltext GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-MAT GBV_ILN_22 GBV_ILN_24 GBV_ILN_30 GBV_ILN_32 GBV_ILN_65 GBV_ILN_70 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2018 GBV_ILN_4116 GBV_ILN_4266 GBV_ILN_4277 GBV_ILN_4305 GBV_ILN_4311 AR 21 2011 4 04 10 437-461 |
language |
English |
source |
Enthalten in The VLDB journal 21(2011), 4 vom: 04. Okt., Seite 437-461 volume:21 year:2011 number:4 day:04 month:10 pages:437-461 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
String similarity joins Data integration and cleaning Edit distance Trie index Subtrie pruning |
dewey-raw |
004 |
isfreeaccess_bool |
false |
container_title |
The VLDB journal |
authorswithroles_txt_mv |
Feng, Jianhua @@aut@@ Wang, Jiannan @@aut@@ Li, Guoliang @@aut@@ |
publishDateDaySort_date |
2011-10-04T00:00:00Z |
hierarchy_top_id |
170933059 |
dewey-sort |
14 |
id |
OLC205135927X |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">OLC205135927X</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230502151958.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">200819s2011 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s00778-011-0252-8</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC205135927X</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s00778-011-0252-8-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Feng, Jianhua</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Trie-join: a trie-based method for efficient string similarity joins</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2011</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield 
code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© Springer-Verlag 2011</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. 
Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">String similarity joins</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data integration and cleaning</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Edit distance</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Trie index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Subtrie pruning</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Wang, Jiannan</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Li, Guoliang</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">The VLDB journal</subfield><subfield code="d">Springer-Verlag, 1992</subfield><subfield code="g">21(2011), 4 vom: 04. 
Okt., Seite 437-461</subfield><subfield code="w">(DE-627)170933059</subfield><subfield code="w">(DE-600)1129061-4</subfield><subfield code="w">(DE-576)032856466</subfield><subfield code="x">1066-8888</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:21</subfield><subfield code="g">year:2011</subfield><subfield code="g">number:4</subfield><subfield code="g">day:04</subfield><subfield code="g">month:10</subfield><subfield code="g">pages:437-461</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s00778-011-0252-8</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_22</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_24</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_30</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_32</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_65</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2005</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2006</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2018</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield 
code="a">GBV_ILN_4116</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4266</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4277</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4305</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4311</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">21</subfield><subfield code="j">2011</subfield><subfield code="e">4</subfield><subfield code="b">04</subfield><subfield code="c">10</subfield><subfield code="h">437-461</subfield></datafield></record></collection>
|
author |
Feng, Jianhua |
spellingShingle |
Feng, Jianhua ddc 004 misc String similarity joins misc Data integration and cleaning misc Edit distance misc Trie index misc Subtrie pruning Trie-join: a trie-based method for efficient string similarity joins |
authorStr |
Feng, Jianhua |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)170933059 |
format |
Article |
dewey-ones |
004 - Data processing & computer science |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
1066-8888 |
topic_title |
004 VZ Trie-join: a trie-based method for efficient string similarity joins String similarity joins Data integration and cleaning Edit distance Trie index Subtrie pruning |
topic |
ddc 004 misc String similarity joins misc Data integration and cleaning misc Edit distance misc Trie index misc Subtrie pruning |
format_facet |
Aufsätze Gedruckte Aufsätze |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
nc |
hierarchy_parent_title |
The VLDB journal |
hierarchy_parent_id |
170933059 |
dewey-tens |
000 - Computer science, knowledge & systems |
hierarchy_top_title |
The VLDB journal |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)170933059 (DE-600)1129061-4 (DE-576)032856466 |
title |
Trie-join: a trie-based method for efficient string similarity joins |
ctrlnum |
(DE-627)OLC205135927X (DE-He213)s00778-011-0252-8-p |
title_full |
Trie-join: a trie-based method for efficient string similarity joins |
author_sort |
Feng, Jianhua |
journal |
The VLDB journal |
journalStr |
The VLDB journal |
lang_code |
eng |
isOA_bool |
false |
dewey-hundreds |
000 - Computer science, information & general works |
recordtype |
marc |
publishDateSort |
2011 |
contenttype_str_mv |
txt |
container_start_page |
437 |
author_browse |
Feng, Jianhua Wang, Jiannan Li, Guoliang |
container_volume |
21 |
class |
004 VZ |
format_se |
Aufsätze |
author-letter |
Feng, Jianhua |
doi_str_mv |
10.1007/s00778-011-0252-8 |
dewey-full |
004 |
title_sort |
trie-join: a trie-based method for efficient string similarity joins |
title_auth |
Trie-join: a trie-based method for efficient string similarity joins |
abstract |
Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings. © Springer-Verlag 2011 |
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-MAT GBV_ILN_22 GBV_ILN_24 GBV_ILN_30 GBV_ILN_32 GBV_ILN_65 GBV_ILN_70 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2018 GBV_ILN_4116 GBV_ILN_4266 GBV_ILN_4277 GBV_ILN_4305 GBV_ILN_4311 |
container_issue |
4 |
title_short |
Trie-join: a trie-based method for efficient string similarity joins |
url |
https://doi.org/10.1007/s00778-011-0252-8 |
remote_bool |
false |
author2 |
Wang, Jiannan Li, Guoliang |
author2Str |
Wang, Jiannan Li, Guoliang |
ppnlink |
170933059 |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
doi_str |
10.1007/s00778-011-0252-8 |
up_date |
2024-07-04T04:12:40.839Z |
_version_ |
1803620304393601024 |
fullrecord_marcxml |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">OLC205135927X</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230502151958.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">200819s2011 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s00778-011-0252-8</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC205135927X</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s00778-011-0252-8-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Feng, Jianhua</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Trie-join: a trie-based method for efficient string similarity joins</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2011</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield 
code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© Springer-Verlag 2011</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. 
Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">String similarity joins</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data integration and cleaning</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Edit distance</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Trie index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Subtrie pruning</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Wang, Jiannan</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Li, Guoliang</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">The VLDB journal</subfield><subfield code="d">Springer-Verlag, 1992</subfield><subfield code="g">21(2011), 4 vom: 04. 
Okt., Seite 437-461</subfield><subfield code="w">(DE-627)170933059</subfield><subfield code="w">(DE-600)1129061-4</subfield><subfield code="w">(DE-576)032856466</subfield><subfield code="x">1066-8888</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:21</subfield><subfield code="g">year:2011</subfield><subfield code="g">number:4</subfield><subfield code="g">day:04</subfield><subfield code="g">month:10</subfield><subfield code="g">pages:437-461</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s00778-011-0252-8</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_22</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_24</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_30</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_32</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_65</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2005</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2006</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2018</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield 
code="a">GBV_ILN_4116</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4266</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4277</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4305</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4311</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">21</subfield><subfield code="j">2011</subfield><subfield code="e">4</subfield><subfield code="b">04</subfield><subfield code="c">10</subfield><subfield code="h">437-461</subfield></datafield></record></collection>
|