CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data
Wang, Longyue; Derek F. Wong; Lidia S. Chao; Junwen Xing
2012
Conference Name: Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing
Source Publication: The Second CIPS-SIGHAN Joint Conference on Chinese Language Processing
Pages: 51–57
Conference Date: 20-21 December 2012
Conference Place: Tianjin, China
Publisher: Association for Computational Linguistics
Abstract

In this paper, we propose a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have already been used for word segmentation, this is the first time they have been applied to segmentation in the Chinese micro-blog domain. Unlike common article genres, micro-blog text has gradually become a new literary genre with the development of the Internet. However, the lack of micro-blog training data has been an obstacle to developing a good segmenter based on trainable models. Considering the linguistic characteristics of the text, we propose several methods that make CRFs models suitable for segmentation in the micro-blog domain. We conducted several experiments with different settings and, based on the results, designed an optimal tagging method and feature templates. The proposed model was implemented for the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing Bakeoff (Bakeoff-2012) and achieves a high F-measure of 93.38% on the test set of 5,000 micro-blog sentences. One of our main contributions is the online version of the toolkit, which provides a segmentation service for Chinese micro-blog text.
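
The abstract does not state which tag set or feature templates the authors chose. As an illustration only, the sketch below assumes the widely used four-tag BMES scheme (Begin, Middle, End, Single) to show how segmentation becomes a per-character labelling task of the kind a CRF can learn. It also shows how a word-level F-measure can be computed from gold and predicted segmentations. All function names are hypothetical and are not taken from the paper or its toolkit.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs.

    Assumes the BMES tag set; the paper may use a different one.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))          # single-character word
        else:
            pairs.append((word[0], "B"))       # first character
            for ch in word[1:-1]:
                pairs.append((ch, "M"))        # interior characters
            pairs.append((word[-1], "E"))      # last character
    return pairs


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from per-character tags (the decoding direction)."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):                  # a word closes on E or S
            words.append(current)
            current = ""
    if current:                                # tolerate a dangling B/M run
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) character offset pair."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(gold: List[str], predicted: List[str]) -> float:
    """Word-level F-measure: harmonic mean of precision and recall."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)     # words with matching boundaries
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "微博"]
predicted = ["我", "喜", "欢", "微博"]
print(words_to_tags(gold))
print(f_measure(gold, predicted))
```

In this toy example the prediction gets two of its four words right and recovers two of the three gold words, so precision is 1/2, recall is 2/3, and the F-measure is 4/7. The reported 93.38% is the same kind of score computed over the Bakeoff-2012 test set.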

Language: English
Document Type: Conference paper
Collection: Faculty of Science and Technology, Department of Computer and Information Science
Affiliation: Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau S.A.R., China
First Author Affiliation: University of Macau
Recommended Citation
GB/T 7714
Wang, Longyue,Derek F. Wong,Lidia S. Chao,et al. CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data[C]:Association for Computational Linguistics,2012:51–57.