- Title
- Identifying high-value social entities from Twitter with machine learning and multilingual analysis
- Creator
- Lo, Siaw Ling
- Relation
- University of Newcastle Research Higher Degree Thesis
- Resource Type
- thesis
- Date
- 2017
- Description
- Research Doctorate - Doctor of Philosophy (PhD)
- Description
- Given the sheer volume and the multilingual, real-time nature of social media data, it is challenging to extract relevant and useful information for individuals, companies and organisations. It is of interest to assess whether the content shared and its multilingual expressions can help a company differentiate prospective customers from a general audience, or help individuals and organisations detect and identify important topics that may otherwise go unnoticed within the mass of social media data. In this research, various methods and approaches have been investigated to identify high-value social entities, in the form of social audiences and topics, with minimal manual annotation effort. These include supervised machine learning methods such as the Support Vector Machine (SVM) ensemble, unsupervised clustering methods such as Latent Dirichlet Allocation (LDA), and text mining methods including latent semantic analysis and association rules. In addition, a hybrid framework has been developed for multilingual analysis by leveraging the strengths of both knowledge-based learning and machine learning. Twitter data, which is openly available, was used for validation and testing purposes. Even though the aim of identifying high-value social audiences may seem different from that of identifying high-value topics, the underlying framework for identifying these social entities remains the same. The first step is to earmark definitive contents that can provide information for constructing training or evaluation data with minimal annotation effort. This step is crucial in order to avoid the alternative: the labour-intensive process of manually annotating the data that make up large online datasets. The second step is to employ methods that are suitable for extracting contents of interest. Both supervised and unsupervised methods, such as the SVM ensemble and Twitter LDA, have been used in this research to extract relevant social audiences. The SVM ensemble works well in this regard, as the contents of Twitter account owners are typically well-defined and can be used as training datasets for high-value target audience classification. For topic detection, on the other hand, the number of classes or topics is not known in advance, so the unsupervised Dirichlet Process Mixture Model is preferred. The third and last step is to assess the strengths and weaknesses of each method used in order to develop a hybrid approach. It is found that a combination or joint approach of various methods can often improve the recall and precision values and enable the identification of high-value social entities across datasets of differing nature. This is supported by the promising results of a unique index devised for ranking high-value social audiences, called the high-value social audience (HVSA) index, on three different datasets, as well as by the consistently higher precision and recall values from a ‘Joint’ ranking method for identifying high-value topics with their sentiments in a huge set of multilingual tweets. Methods and findings generated from this research have the potential to be adopted for addressing real-world problems. The HVSA index, for example, can be used to identify online customers who are highly likely to be interested in the content shared on social media by a business account owner. This can be useful in identifying prospective customers or improving engagement with current customers. The ability to identify social media followers in a ‘ranked’ manner can support better decision making, so that a (small) marketing budget can be spent more effectively. Likewise, being able to detect high-value topics with their associated sentiments enables policy makers or organisations to understand issues of concern on the ground and uncover possible actionable insights for better community or customer reach.
- Subject
- text mining; machine learning; multilingual analysis; target audience profiling; topic identification
- Identifier
- http://hdl.handle.net/1959.13/1335881
- Identifier
- uon:27505
- Rights
- Copyright 2017 Siaw Ling Lo
- Language
- eng
- Full Text
File | Description | Size | Format
---|---|---|---
ATTACHMENT01 | Thesis | 3 MB | Adobe Acrobat PDF
ATTACHMENT02 | Abstract | 253 KB | Adobe Acrobat PDF