Project Details
Funding Scheme : General Research Fund
Project Number : 16210722
Project Title(English) : An Integrated Framework for Extracting and Utilizing Information from Data Visualizations in Digital Documents 
Project Title(Chinese) : 從數字文檔中的數據可視化中提取和利用信息的集成框架 
Principal Investigator(English) : Dr Qu, Huamin 
Principal Investigator(Chinese) :  
Department : Dept of Computer Science & Engineering
Institution : The Hong Kong University of Science and Technology
Co-Investigator(s) :
Panel : Engineering
Subject Area : Computing Science & Information Technology
Exercise Year : 2022 / 23
Fund Approved : 1,039,978
Project Status : On-going
Completion Date : 31-12-2025
Abstract as per original application
Data visualizations, such as charts and graphs, are increasingly created and shared on the Web, appearing in articles, documents, scientific literature, and social media. Google Trends data show that the search interest in “chart” has been increasing rapidly in recent years, surpassing that of “image” in February 2020. Data visualizations convey vast amounts of quantitative information, from which people can easily interpret data trends and differences. Therefore, visualizations often serve as the main entry point for the general public to access data, especially in high-impact domains, such as politics, public health, and climate. However, unlike humans, machines have no direct access to the data inside visualizations. Massive numerical information and knowledge remain locked in visualizations and are inaccessible via search engines. Addressing this problem can make it possible to access and utilize numerical information in visualizations in similar ways to textual information in documents. The broad goal of the proposed project will be to make visualizations a “first-class citizen” on the Web and to develop techniques for automatically interpreting, retrieving, and analyzing visualizations at the Internet scale. If successful, the resulting tools arising from the proposed project will affect a wide range of visualization users, such as data scientists, journalists, and designers. We will first develop techniques for extracting data and visual encodings in visualizations. We will focus on SVG-based visualizations, which are becoming increasingly popular on the Web, and propose an end-to-end framework for translating SVG-based visualizations into visualization specifications with the underlying data. Next, we will propose methods for measuring the relationships, such as similarities and relevance, between visualizations. 
Such measurements will serve as building blocks for processing and analyzing visualization collections, such as grouping visualizations by their content. Finally, we will integrate the proposed techniques into an interactive interface for pilot users to analyze visualization collections using real-world datasets. We will conduct user studies to test the viability and effectiveness of the approach and propose at least two usage scenarios, namely search engines for visualizations and mining design knowledge for recommending visualizations. The core content of our proposed project will be disseminated in academic publications, and the dataset and code will be made open source to benefit future research.
數據可視化(例如圖表和圖形)越來越多地在網上創建和共享,廣泛出現在文章、文檔、科學文獻和社交媒體中。谷歌趨勢數據顯示,近年來“圖表”的搜索熱度快速增長,在 2020 年 2 月超過了“圖像”的搜索。數據可視化傳達了大量的量化信息,人們可以從中輕鬆解讀數據趨勢和差異。因此,可視化通常作為公眾訪問數據的主要入口,尤其是在政治、公共衛生和氣候等高影響領域。然而,與人類不同,機器無法直接訪問可視化中的數據。海量數字信息和知識仍被鎖定在可視化中,無法通過搜索引擎訪問。如果這個問題可以解決,可視化中的數字信息就可以像文檔中的文本信息一樣通過搜索引擎訪問。本研究項目的廣泛目標是使可視化成為 Web 上的“一等公民”,並開發用於在互聯網上可以自動解釋、檢索和分析可視化的技術。如果成功,該項目產生的工具將影響廣泛的可視化用戶,例如數據科學家、記者和設計師。我們將首先開發用於在可視化中提取數據和視覺編碼的技術。我們將專注於在互聯網上變得越來越流行的基於 SVG 的可視化,並提出一個端到端的框架,用於將基於 SVG 的可視化轉換為帶有底層數據的可視化規範。接下來,我們將研究測量可視化之間關係的方法,例如相似性和相關性。此類測量關係的方法將用作處理和分析可視化集合的構建塊,例如按內容對可視化進行分組。最後,我們將把提出的技術集成到一個交互式界面中,供試點用戶分析真實世界的數據可視化集合。我們將進行用戶研究以測試該方法的可行性和有效性,並提出至少兩種使用場景,即用於可視化搜索的引擎設計和用於可視化推薦的設計知識挖掘。該項目的核心內容將在學術出版物中傳播,數據集和代碼將開源,以利於未來的研究。
Research Outcome
Layman's Summary of Completion Report : Not yet submitted