ENQUIRE PROJECT DETAILS BY GENERAL PUBLIC

Project Details

Funding Scheme :

General Research Fund

Project Number :

11505119

Project Title(English) :

Assessing conceptual and empirical contributions of social media research based on knowledge graph

Project Title(Chinese) :

基於知識圖譜評估社交媒體研究的理論與實證貢獻

Principal Investigator(English) :

Prof Zhu, Jonathan Jian-hua

Principal Investigator(Chinese) :

Department :

Dept of Media and Communication & Dept of Data Science

Institution :

City University of Hong Kong

E-mail Address :

j.zhu@cityu.edu.hk

Tel :

3442 7186

Co - Investigator(s) :

Dr Peng, Tai-Quan

Dr Zhao, Wayne Xin

Panel :

Business Studies

Subject Area :

Business Studies

Exercise Year :

2019 / 20

Fund Approved :

434,764

Project Status :

Completed

Completion Date :

31-1-2022

Project Objectives :

To develop a methodological framework plus relevant methods, models, and algorithms for scientific knowledge graphs by modifying the existing knowledge graph technology for commercial knowledge graphs.

To use the above methodology to construct a Social Media Knowledge Graph (SMKG) with social media research publications on marketing communication.

To assess the state of the art of social media research on marketing communication based on SMKG, focusing on the assessment of the nature (conceptual, operational, or empirical contributions) and the degree (brand new, mixed, or mere replication) of social media publications against SMKG.

Abstract as per original application
(English/Chinese):

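The quality of research papers is one of the issues of greatest concern to academia and its stakeholders. However, there has long been no consensus on how to measure research quality. The most common approach is the citation-based "impact factor." Although the impact factor is easy to obtain and analyze, it reflects only the popularity of a work rather than its quality. Other scholars have tried to sidestep the true core of quality by relying on expert opinion. As a result, research quality remains an academic black box. In this study we take a different path and propose using the recently emerged knowledge graph approach to measure research contribution (rather than research quality itself). Our general idea is to compare the "new knowledge" offered by each paper with the existing "old knowledge" and thereby determine the paper's degree of contribution. To this end, we plan to measure knowledge along three dimensions: evidence, method, and theory. This three-dimensional framework makes it possible to compare new and old knowledge in the literature. For example, if a paper reports a known phenomenon, the knowledge it generates is merely replicated "old wine." In contrast, if the paper provides new evidence for an existing hypothesis, or a new explanation for a known phenomenon, the knowledge it generates is "half-new wine" that mixes new and old. If the paper discovers a new phenomenon and offers a new explanation, the knowledge it generates is "brand-new wine." In short, this knowledge-based measurement approach promises to resolve the long-standing dilemma between research quality (a latent concept that cannot be measured directly) and the impact factor (easy to obtain but inaccurate). Methodologically, we will design and construct a "Social Media Knowledge Graph" (SMKG) to measure and assess knowledge contributions in social media research. We will first collect relevant empirical studies on social media from the WOS database, then extract "knowledge elements" (such as theoretical concepts, methodological or empirical features, and hypothesized and tested relations) using a combination of supervised and unsupervised machine learning, and clean these knowledge elements through means such as co-citation verification and concept disambiguation. The cleaned elements will be integrated into a multi-layered, dynamic knowledge base in which theoretical concepts are linked by level of abstraction and by the chronological order of the original works. This knowledge base ultimately yields the ontology of the SMKG. With the SMKG we can assess how much of the knowledge produced by social media papers in the literature is replicated old wine, how much is half-new wine (new evidence for old hypotheses or new explanations for old phenomena), and how much is brand-new wine (new phenomena with new explanations). After the project is completed, the SMKG can be continuously updated, making it a real-time assessment system for tracking progress in social media research. We also plan to extend the basic methods developed in this study to the assessment of other business or social science research fields.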

Realisation of objectives:

All the objectives have been completed through three key components: I. data model, II. data source, and III. extraction framework. We outline the components below, with detailed information available in item #15 of Part C Research Output (“#15” hereafter).
I. Data Model
Reviewing the core concepts and relations in social media publications requires modeling complex features, such as the meaning of the concepts and the test results of the relations. This calls for a data model that captures the semantic meaning of the core concepts and relations. Knowledge graph language provides computable syntax and expressive semantics for such a data model. In particular, our data model rests on the following principles: 1) Triple design for organizing the data. 2) Entity class for concept label. 3) Ontology for synthesizing the concept and relation information. We aim to build an ontology for the core concepts and relations in social media publications. Leveraging the knowledge graph language, we model the six kinds of information on core concepts and relations as an ontology consisting of different triples and build a knowledge graph (the concept and relation graph) based on the ontology.
A variable term is represented as vt_i. That two variable terms vt_i and vt_j have a relation is represented as a triple (vt_i, has a relation with, vt_j). That a property of a relation r takes a given value is also represented as a triple. That the test result of relation r in the study is value v1 is denoted as a triple (v1, is the test result of, r). That the hypothesized polarity of relation r is value v2 and its empirically supported polarity is v3 are denoted as triples (v2, is the hypothesized polarity of, r) and (v3, is the empirically supported polarity of, r), respectively.
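The triple notation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the `Triple` type, the `relation_triples` helper, the relation key `"r1"`, and the example values are all hypothetical, and the report does not say how a relation r is tied to its pair of variable terms.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One (subject, predicate, object) statement in the graph."""
    subject: str
    predicate: str
    obj: str


def relation_triples(
    rel_id: str,                      # hypothetical key standing in for relation r
    vt_i: str,
    vt_j: str,
    test_result: str,                 # v1 in the text
    hypothesized_polarity: str,       # v2 in the text
    supported_polarity: Optional[str] = None,  # v3 in the text, if any
) -> List[Triple]:
    """Build the triples described in the text for one relation."""
    triples = [
        Triple(vt_i, "has a relation with", vt_j),
        Triple(test_result, "is the test result of", rel_id),
        Triple(hypothesized_polarity, "is the hypothesized polarity of", rel_id),
    ]
    if supported_polarity is not None:
        triples.append(
            Triple(supported_polarity,
                   "is the empirically supported polarity of", rel_id)
        )
    return triples


# Illustrative values only; they are not taken from the project's data.
example = relation_triples(
    "r1",
    "information sharing motive",
    "employees' usage of enterprise social media",
    test_result="supported",
    hypothesized_polarity="positive",
    supported_polarity="positive",
)
```

Keeping every statement in the same three-field shape means the test result and both polarities can be stored and queried alongside the relation itself.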
Figure 1 illustrates the relation information in the ontology (see #15).
We constructed the concept tree based on the classic 5W model of communication (Lasswell, 1948), i.e., “Who?”, “Says what?”, “In which channel?”, “To whom?”, and “With what effect?” This model serves as the first (most generic) level of the concept tree. The classification at the second level distinguishes user characteristics, perceptions, behaviors, and benefits. We added more concepts to the initial concept tree as concept labels were assigned to variable terms in a sample of publications. The extended concept structure is shown in Figures A1–A6 of #15. The concept tree is therefore generated through both a top-down approach based on the 5W model and a bottom-up approach based on a publication sample. The appendix of #15 provides more detailed explanations and descriptions of the concept tree.
II. Data Source
We focused on research hypothesis statements (sentences in a publication that present the hypothetical relations between concepts) as the main data source, for two reasons. First, many publications of quantitative empirical studies on social media that investigate relations among concepts present all focal relations as hypothesis statements. Therefore, if we extract the relations mentioned in all hypothesis statements, we can guarantee the recall of the relation extraction. Second, hypothesis statements are written specifically to describe the relations investigated in a study, and these sentences contain little redundant information. The precision of the relation information is therefore high, which enables automatic extraction.
Identifiers (IDs) often appear in hypothesis statements. In terms of data collection, the ID is a strong signal that a sentence is a hypothesis statement. Using only regular expressions based on the ID, we can identify most hypothesis statements. Importantly, the IDs also enable the detection of the test results of the hypothesis statements.
Therefore, the ID is an important identifier that connects a hypothesis statement and its test result.
In a hypothesis statement, a variable term is described using a span of words, usually a phrase of variable length. For example, the variable terms in the hypothesis statement “Information sharing motive has a positive impact on employees’ usage of enterprise social media” are “information sharing motive” and “employees’ usage of enterprise social media.” Such a phrase communicates both the meaning of the variable in the study and its conceptual meaning under the 5W model, which is what we aim to obtain. Formally, three steps are necessary to obtain the 5W conceptual meaning from a variable term in a hypothesis statement. First, we identify which span of words is a variable term. Second, we determine the meaning of the variable term in the study. Third, we ascertain the conceptual meaning of the variable term using the 5W model.
In a hypothesis statement, the main relation between two variables is described using typical terms, such as “affect,” “have an effect on,” and “influence.” In addition to these basic relation terms, there are terms that indicate the properties of the relation. For instance, terms such as “increase” and “positive” indicate the polarity of the relation (positive or negative).
The test results of relations cannot be detected from hypothesis statements. This information is instead provided in result statements in the results section of a publication. Most result statements contain an ID. By matching the ID, we can link a result statement with its corresponding hypothesis statement. A result statement such as “H2a is hence supported” does not typically contain variable and relation information. Therefore, it is more effective to link hypothesis and result statements through the ID than through variable terms. The most important information in the test result is whether the hypothesis statement is supported.
Typically, this information is indicated by terms such as “is rejected,” “is supported,” and “not significant” in the result sentence. The relation can also be partially supported, that is, more than one group of data is tested for the relation and mixed results are obtained. The result sentence also provides the confirmed polarity of the relation, such as whether the relation is positive or negative, using relevant terms.
III. Extraction Framework
Extracting the information included in the data model from unstructured publications and statements is a challenging problem. To the best of our knowledge, no method of domain knowledge graph construction from scientific publications can be directly applied to our problem of synthesizing the core concepts and relations in social media publications. Therefore, we devised a new extraction framework consisting of seven extraction steps to extract the six kinds of information with NLP techniques.
First, we extracted variable terms, relations, and concepts from the hypothesis statements (steps 1-4):
1. Extract hypothesis statements from each research publication.
2. Extract two phrases that represent the variable terms and the relation between the two variable terms from a hypothesis statement.
3. Extract the ID and polarity of the relation from a hypothesis statement.
4. Extract the concept label of each variable term. Build a hierarchy of the concepts.
Second, we extracted result information for the hypothesis statements (steps 5-7):
5. Extract result sentences from each research publication.
6. Extract from result statements the IDs, which are used to match the corresponding hypothesis statements, and the results (whether the hypothesis statement is supported, partially supported, or not supported).
7. Extract the polarity of the relation from each supported hypothesis statement.
We finally matched the hypothesis statements and their results according to the IDs (#15).
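The ID-based matching of hypothesis and result statements can be illustrated with a small Python sketch. It is a deliberately simplified stand-in that uses a regular expression and keyword rules, not the NLP models used in the project. The ID pattern (an "H", digits, and an optional lowercase letter), the function names, the keyword lists, and the "H1"/"H2a" labels on the sample sentences are assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional

# Assumed ID shape: "H" + digits + optional lowercase letter, e.g. "H1", "H2a".
ID_PATTERN = re.compile(r"\bH\d+[a-z]?\b")


def find_ids(sentence: str) -> List[str]:
    """Return every hypothesis ID mentioned in a sentence."""
    return ID_PATTERN.findall(sentence)


def classify_result(sentence: str) -> Optional[str]:
    """Map a result sentence to a coarse status using keyword rules."""
    text = sentence.lower()
    # Check the more specific phrases first, since all contain "supported".
    if "partially supported" in text:
        return "partially supported"
    if "not supported" in text or "rejected" in text or "not significant" in text:
        return "not supported"
    if "supported" in text:
        return "supported"
    return None


def link_results(hypothesis_sentences: List[str],
                 result_sentences: List[str]) -> Dict[str, dict]:
    """Index hypothesis statements by ID and attach matching results."""
    linked: Dict[str, dict] = {}
    for sentence in hypothesis_sentences:
        for hyp_id in find_ids(sentence):
            linked[hyp_id] = {"statement": sentence, "result": None}
    for sentence in result_sentences:
        status = classify_result(sentence)
        if status is None:
            continue
        for hyp_id in find_ids(sentence):
            if hyp_id in linked:
                linked[hyp_id]["result"] = status
    return linked


# Sample sentences; the IDs attached to them are invented for this sketch.
hypotheses = [
    "H1: Information sharing motive has a positive impact on "
    "employees' usage of enterprise social media.",
    "H2a: Perceived usefulness influences continued use.",
]
results = ["H2a is hence supported.", "H1 is rejected."]
linked = link_results(hypotheses, results)
```

Because a result sentence such as "H2a is hence supported" carries no variable terms, the ID is the only join key the sketch needs.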

Summary of objectives addressed:

	Objectives	Addressed	Percentage achieved
1.	To develop a methodological framework plus relevant methods, models, and algorithms for scientific knowledge graphs by modifying the existing knowledge graph technology for commercial knowledge graphs.	Yes	100%
2.	To use the above methodology to construct a Social Media Knowledge Graph (SMKG) with social media research publications on marketing communication.	Yes	100%
3.	To assess the state of the art of social media research on marketing communication based on SMKG, focusing on the assessment of the nature (conceptual, operational, or empirical contributions) and the degree (brand new, mixed, or mere replication) of social media publications against SMKG.	Yes	100%

Research Outcome

Major findings and research outcome:

Synthesizing concepts and relations is a fundamental task for understanding the concept and relation knowledge in a research area. Accomplishing this task can help us detect hidden relations between concepts and evaluate publications’ contributions to the overall knowledge in an area. In this study, we employed a knowledge graph approach to synthesize and evaluate the concept and relation knowledge on social media in social science publications. In particular, we extracted the concepts and relations studied in more than 2,000 social science publications on the topic of social media (#15). The extracted results yield several new insights.
(1) “Whom/user” is the most frequently studied of the 5Ws in publications. Moreover, the most frequently hypothesized relations are between “whom/user” and “whom/user” rather than between other pairs of Ws in the 5W model.
(2) The number of hypotheses on relations between concepts proposed each year increases. However, the number of new concepts proposed each year decreases in most years. The number of new relations (both hypothesized and confirmed) first increases, then remains unchanged, and finally decreases.
(3) The average number of new concepts proposed per publication decreases from 0.65 to 0.04, and the average number of new relations decreases from 2.06 to 0.61. This indicates that the contribution of each publication, measured by the new concepts and relations it proposes, decreases.
We further applied the methods and algorithms developed and tested in the project to all aspects of the 5W model for social media, including Who-communicator (e.g., #7, #9), Whom-user (#1, #8, #13, #14), What-content (#3, #4), Which-channel (#2, #6, #11), and What-effects (#5, #10, #12).
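A per-publication average of new concepts or relations, as reported in finding (3), could be computed along the following lines. This is a sketch under an assumption the report does not spell out: an item counts as "new" the first time it appears when publications are processed in chronological order. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def average_new_items_per_year(
    publications: Iterable[Tuple[int, List[Hashable]]],
) -> Dict[int, float]:
    """For each year, average number of first-seen items per publication.

    Each publication is (year, items); items may be concept labels or
    relation tuples, as long as they are hashable.
    """
    seen = set()
    new_counts = defaultdict(int)
    pub_counts = defaultdict(int)
    # Process in chronological order; ties keep their input order.
    for year, items in sorted(publications, key=lambda pub: pub[0]):
        pub_counts[year] += 1
        for item in set(items):
            if item not in seen:
                seen.add(item)
                new_counts[year] += 1
    return {year: new_counts[year] / pub_counts[year]
            for year in sorted(pub_counts)}


# Toy data for illustration only; not taken from the project's corpus.
toy_publications = [
    (2010, ["trust", "usage"]),
    (2010, ["trust"]),
    (2011, ["usage", "privacy"]),
    (2011, ["trust"]),
]
averages = average_new_items_per_year(toy_publications)
```

On the toy data, 2010 introduces two new concepts across two publications and 2011 introduces one across two, so the averages are 1.0 and 0.5.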

Potential for further development of the research
and the proposed course of action:

The current knowledge-graph-based approach to literature review can be generalized to other domains of marketing communication research, such as advertising research, health communication research, and political communication research. Our approach integrates a data model (knowledge graph), a data source (hypothesis statements in publications), and extraction models (NLP). The information our approach aims to extract, namely the concepts and their relations, is also core information in other domains of communication research. The data sources of our approach, the hypothesis statements and corresponding test results, also exist in publications in other marketing and communication domains. The information extraction steps are likewise generalizable, and our extraction models remain applicable. One adaptation is needed in Step 4: a new concept tree must be built for the new domain. In addition, a publication typically provides explications of concepts and theoretical explanations of the proposed relations among concepts. However, these explanations and arguments are hidden in unstructured text. Extracting the explanation and argument structure to enrich the theoretical background of the extracted relations is an important and interesting problem for future work. We plan to identify the argument structure in the theoretical explanations in publications and employ NLP reasoning models to learn and extract the arguments.

Layman's Summary of
Completion Report:

How to synthesize existing research, identify new insights, and evaluate research quality are core challenges for marketing communication. Manual review methods allow detailed analysis but are confined to small sets of documents. Computational methods such as topic modeling can review large numbers of publications, but the results are superficial: complex but essential information, such as the meaning of concepts and the relations among them, cannot be synthesized. Hence, we developed a novel literature review method that leverages recent advances in knowledge graph construction and information extraction to synthesize the concepts and their relations studied in a large number of publications in a research area. The approach consists of a knowledge graph data model to represent the concepts and relations, a data source of hypothesis statements in publications to provide them, and an extraction framework of seven deep learning models to extract them. We demonstrate the effectiveness of this approach by extracting the concepts and relations studied in more than 2,000 social science publications on the topic of social media. New insights, such as which relations between concepts are hypothesized or confirmed most often in publications and how the relations accumulate over time, can be detected from the extracted results.

Research Output

Peer-reviewed journal publication(s)
arising directly from this research project :
(* denotes the corresponding author)

Year of Publication	Author(s)	Title and Journal/Book	Accessible from Institution Repository
2020	Peng, T. Q.*, Zhou, Y., & Zhu, J. J. H.	From filled to empty time intervals: Quantifying online behaviors with digital traces.	Yes
2020	Zhang, Y., Wang, L., Zhu, J. J. H., & Wang, X.	Viral vs. broadcast: Characterizing the virality and growth of cascades.	Yes
2020	Zhao, W. X., Hou, Y., Chen, J., Zhu, J. J. H., Yin, E. J., Su, H., & Wen, J. R.*	Learning semantic representations from directed social links to tag microblog users at scale.	Yes
2021	Zhang, Y., Wang, L., Zhu, J. J. H., & Wang, X.	Conspiracy vs science: A large-scale analysis of online discussion cascades.	Yes
2021	Wang, C. J.*, & Zhu, J. J. H.	Jumping over the network threshold of information diffusion: testing the threshold hypothesis of social influence.	Yes
2021	Yafei Zhang, Lin Wang, Jonathan J. H. Zhu, Xiaofan Wang, and Alex ‘Sandy’ Pentland	The Strength of Structural Diversity in Online Social Networks	Yes
2021	Wang, Y., Peng, T. Q., Lu, H., Wang, H., Xie, X., Qu, H., & Wu, Y.*	Seek for success: a visualization approach for understanding the dynamics of academic careers.	No
2022	Guan, L., Zhang, Y., & Zhu, J. J. H.	Predicting information exposure and continuous consumption: self-level interest similarity, peer-level interest similarity and global popularity.	Yes
2021	Lei Hou*, Yueling Pan, Jonathan J.H. Zhu	Impact of scientific, economic, geopolitical, and cultural factors on international research collaboration	Yes
2022	Yixin Zhou and Jonathan J. H. Zhu*	How online health groups help you lose weight: The role of group composition and social contact	Yes
2022	Peng, T. Q.*, & Zhu, J. J. H.	Competition, Cooperation, and Coexistence: An Ecological Approach to Public Agenda Dynamics in the United States (1958–2020)	No
2022	Zhou Y. X. & Zhu J. J. H.*	The Impact of Digital Media on Daily Rhythms: Intrapersonal Diversification and Interpersonal Differentiation.	Yes
	Zhou, Y. X., Peng, T. Q., & Zhu, J. J. H.*	Will time matter with cognitive load and retention in online news consumption?	No
	Hou, L., Guan, L., Zhou, Y. X., Shen, A. Q., Wang, W., Luo, A., Lu, H., & Zhu, J. J. H.*	Staying, switching, and multiplatforming of user-generated content activities: a 12-year panel study	Yes
	Lan, J. & Zhu, J. J. H.*	Synthesizing the Concepts and Relations in Social Media Research: A Knowledge Graph Approach based on Hypothesis Statements	No

Recognized international conference(s)
in which paper(s) related to this research
project was/were delivered :

Other impact
(e.g. award of patents or prizes,
collaboration with other research institutions,
technology transfer, etc.):
