ENQUIRE PROJECT DETAILS BY GENERAL PUBLIC

Project Details

Funding Scheme :

General Research Fund

Project Number :

611811

Project Title(English) :

Effective Regression Fault Detection for the Reuse of Spreadsheets

Project Title(Chinese) :

電子表格程序復用中錯誤的有效檢測

Principal Investigator(English) :

Prof Cheung, Shing-chi

Principal Investigator(Chinese) :

Department :

Dept of Computer Science & Engineering

Institution :

The Hong Kong University of Science and Technology

E-mail Address :

scc@cse.ust.hk

Tel :

2358 7016

Co - Investigator(s) :

Panel :

Engineering

Subject Area :

Computing Science & Information Technology

Exercise Year :

2011 / 12

Fund Approved :

1,230,500

Project Status :

Completed

Completion Date :

30-6-2015

Project Objectives :

To investigate the current practice of spreadsheet reuse and identify the common patterns of regression faults that occur in this process.

To derive a framework that supports the testing and debugging of regression faults in spreadsheet reuse.

To implement the proposed framework and evaluate it based on simulated and real-life regression faults in spreadsheets from various application domains.

Abstract as per original application
(English/Chinese):

Spreadsheets are widely used by domain experts in organizations to perform critical business tasks such as accounting, resource management and financial modeling. As many of these tasks are similar to each other, domain experts often perform new tasks by reusing a group of cells from some earlier tasks’ spreadsheets and adapt them as needed. While such reuse has an obvious benefit in improving productivity, it is highly vulnerable to regression faults. This is because spreadsheets lack high-level abstraction commonly found in conventional programs, so that spreadsheet reuse typically involves low-level adaptations, which are tedious and error-prone. Furthermore, such adaptations can trigger silent changes to the cell formulas as side-effects. Such side-effects might deviate from the intention of the users and create regression faults. There is empirical evidence that regression faults in spreadsheet reuse are common in practice. How to detect these faults effectively poses an important research challenge. In the spreadsheet paradigm, the input/output interface of a group of cells is implicitly defined by cell coordinates. Such an implicit interface is prone to various differences between spreadsheets. This makes it difficult to apply conventional regression testing techniques, which assume that the two program versions have mostly the same input/output interface such that their executions are comparable. In this project, we propose to study the following research questions arising from the reuse of spreadsheets: What are the common changes arising from spreadsheet reuse? What the common patterns of regression faults induced by these changes? Can we expose these regression faults by comparing the execution behaviors of two spreadsheet versions? How to summarize behavioral difference in debugging aids? How to generate behavior-differentiating test case to supplement existing test suite? The study of these questions will help develop an effective framework that enables domain experts to reuse spreadsheets safely for a new application. Prototype tools will be developed to support the proposed framework. To evaluate the applicability of our research results, we will conduct experiments using real world spreadsheet applications from various domains. This project will benefit Hong Kong as a competitive financial and service center.

電子表格在各種行業中有著極為廣泛的應用。為了完成會計，資源管理和財務建模等任務，業務人員經常需要開發大量複雜的電子報表。這些任務往往彼此重復，所以一種常見的做法是復用已有的表格中的一組或多組表格單元。這種復用可以極大的提高開發效率。然而，它極易引入回歸錯誤。這是因為電子報表作為一種程序開發語言缺乏抽象，信息封裝等高級語言具備的特性，這就導致表格單元復用時往往需要做大量的修改，例如逐個調整表格單元引用。這些修改非常繁瑣而且極易出錯。另外，當表格單元被拷貝粘貼的時候，常用的開發環境(如Microsoft Excel)會對其中的公式進行自動修改。這種修改作為拷貝粘貼的副作用很有可能出乎業務人員的意料，引入回歸錯誤。有經驗証據表明，復用錯誤是電子報表中出現錯誤的一個常見原因。在一些已知案例中，復用錯誤直接導致企業蒙受經濟損失。因此，如何有效地自動檢測出這種錯誤是一個重要的研究問題。這個問題的難點在於在電子表格程序的輸入/輸出出接口被隱式的定義為公式中引用的表格單元地址。這種方式定義的接口極易被兩個電子表格之間的排版，輸入數據位置等差異而改變。這就導致了傳統的回歸測試技術很難在這裡被直接應用，因為這些傳統技術能應用一個前提條件就是程序的接口基本保持不變，從而使得原始程序和復用后的程序之間的執行結果可以被比較。在電子表格中，這個前提條件是很難成立的。在這個項目中，我們建議研究以下幾個電子報表復用中產生的問題：在復用電子報表的時候，有那些對復用單元的修改是常見的？這些修改又會帶來何種可能的回歸錯誤模式？我們可否改進傳統的回歸測試技術以檢測到這些錯誤？如何可視化檢測到的錯誤以使得業務人員更加容易調試？如何自動生成測試數據以使得潛在的復用錯誤更加容易被檢測出來？這些問題的研究結果將有助於建立一個有效的框架，幫助業務人員更加可靠的復用電子表格。我們也將開發這個框架的原型系統。為了評估我們研究結果的實用性，我們將使用實際的電子表格進行試驗。這個項目將在各個方面幫助香港成為一個有競爭力的金融和服務中心。

Realisation of objectives:

Spreadsheets are a popular computation paradigm used for data storage and analysis, decision support, financial reporting, and quality control. Despite their popularity, spreadsheets are subject to errors. However, conventional quality assurance measures for software such as unit tests and assert statements are generally inapplicable to spreadsheets. To understand the quality issues arising from spreadsheet, we approached the first objective by collecting spreadsheets over the web as well as acquiring the EUSES corpus used by a few research works. The EUSES corpus contains 4,037 spreadsheets over 11 categories. A Java tool based on Apache POI was developed to parse and process the collected spreadsheets. A research assistant was recruited to carry out these tasks. We encountered several major challenges. First, the spreadsheets collected are mostly unrelated to one another. Second, there are no documentation or specification of the collected spreadsheets. As such, we need to inspect each spreadsheet manually and infer its likely objectives. Third, these spreadsheets come with no test cases, bug reports, patches or comments. We need to identify potential spreadsheet issues by checking the cell formulas of a spreadsheet against its inferred objectives. We find that the values of some contiguous cells are not necessarily consistently evaluated by formulas. For example, some values are directly typed while some are evaluated by formulas. Even for those evaluated by formulas resulting in the same results, discrepancy can occur in the formula expressions. Some expressions include the references of blank cells but some do not. We term these issues as ambiguous computation smells instead of errors because these issues can induce errors when referenced cells in formulas are updated as the spreadsheet evolves. We made a systematic study of these computation smells based on the EUSES corpus and captured their patterns. Our experiments revealed that 27.3% of spreadsheets with formulas suffer from the ambiguous computation smells. We also identified the conditions under which these smells can manifest as errors. The work was conducted together with a PhD student from the Chinese Academy of Sciences when he visited HKUST under the PI's supervision in 2013. To achieve the second objective, we developed an automated framework AmCheck to detect and repair ambiguous computation smells in spreadsheets. First, AmCheck clusters contiguous cells with the same intended computation based on some heuristics. The heuristics manages to identify clusters in 44.7% of the experimented spreadsheets. Second, AmCheck deploys an algorithm to detect ambiguous computation smells among the identified cell clusters. Third, AmCheck tries to synthesize the correct formulas by mapping the problem into a constraint-solving problem, which can be handled by existing constraint solvers. AmCheck then repairs the smelly cells using the synthesized formulas. We also studied the constraint checking methodology. To achieve the third objective, we implemented AmCheck using Java on top of the Apache POI. Experiments using the EUSES corpus confirmed that AmCheck can repair 99.1% of the smelly clusters. AmCheck considers detected smells are errors if they cannot be repaired using the synthesized formulas. To evaluate the usefulness of AmCheck, we conducted a user study by applying it to 10 actively used spreadsheets provided by the finance office in an institute and invited the users of these spreadsheets for an interview to validate the detected smells as well as their root causes. As a result, AmCheck found 23 confirmed errors in 4 spreadsheets. Most users committed these errors due to inconsistent updates to cells created by copy-and-paste during spreadsheet evolution. CACheck extends AmCheck to handle in-homogeneous cell arrays where the cells referenced by a formula need not fall into the same row or column. An MPhil student was recruited to enhance the work of AmCheck. We found that the clustering heuristics used by AmCheck can be further improved. In particular, cells that are subject to the same computation are not necessarily contiguous. Further, tabulation styles vary significantly across users. Hence, a universal heuristics does not achieve a good balance of precision and recall across spreadsheets. In view of that, we clustered cells with the same computation based on their strong and weak features. Strong features can be used to cluster those cells that carry out the same computation. We then identify the set of weak features that are applicable to each cell cluster. Another cycle of clustering can be conducted based on the identified weak features. Note that these weak features can vary across cell clusters due to diversified tabulation styles. We implemented the enhanced technique as CUSTODES in Java and applied it to the EUSES corpus. Promising results were obtained. We release the tool and our experimental data set. In 2015, the Enron corpus was released based on the emails in the Enron court case. The corpus contains 15,929 spreadsheets used earlier by Enron employees. However, these spreadsheets were not maintained by a version management system. The corpus is not amenable to the study of regression faults in spreadsheets. We developed a methodology and successfully applied to the Enron corpus to recover the versioning information of the spreadsheets. We call it Versioned Enron corpus (VEnron), and release it to the public. At the end of our project, we conducted a systematic study to compare the effectiveness of existing spreadsheet checking methodologies and tools. We also explored how to carry out table clone detection to improve the effectiveness of revealing spreadsheet errors.

Summary of objectives addressed:

	Objectives	Addressed	Percentage achieved
1.	To investigate the current practice of spreadsheet reuse and identify the common patterns of regression faults that occur in this process.	Yes	100%
2.	To derive a framework that supports the testing and debugging of regression faults in spreadsheet reuse.	Yes	100%
3.	To implement the proposed framework and evaluate it based on simulated and real-life regression faults in spreadsheets from various application domains.	Yes	100%

Research Outcome

Major findings and research outcome:

The findings of this project were/will be disseminated by three journal (TSE, JSS and IST) and four conference publications at ICSE and FSE. Two research students were trained. We made several major findings. 1. Ambiguous computation smells are a common type of regression spreadsheet faults. Experiments found that such smells occur in 27.3% of spreadsheets. 2. Ambiguous computation smells mostly arise from improper updates to cells created using autofill or copy-and-paste of cells from other spreadsheets. 3. Ambiguous computation smells can be effectively detected by an automated technique. Most of these smells can be automatically repaired using constraint solving. 4. Smells that cannot be repaired are likely errors. 5. Effective constraint handling requires knowledge of context consistency. 6. Users tabulate spreadsheet in diversified styles. It is difficult to formulate a universal heuristics to cluster cells that are subject to the same intended computation. 7. Users tabulate the cells subject to the same intended computation consistently using the same style. This enables cells to be effectively clustered based on strong and weak features. 8. Smells usually occur in minority. It is possible to detect smells in each cluster as outliers under a feature space. 9. Versioning information can be effectively recovered from an industrial spreadsheet corpus based on features and email contents. This allows recovery of a versioned spreadsheet corpus. We release this corpus as VEnron, which is the first versioned industry-based spreadsheet corpus. 10. The performances of existing spreadsheet detection tools vary a lot. Tools developed by this project are the most effective ones based on the evaluation using three public spreadsheet corpora: EUSES, Enron and Hawaii Kooker. Findings 1-4 were disseminated by an ICSE conference paper C1, an FSE conference paper C4 and a TSE journal paper [J3]. Finding 5 was disseminated by an IST journal paper J1. Findings 6-8 will be disseminated by an ICSE conference paper C2. Finding 9 will be disseminated by another ICSE conference paper C3. Finding 10 will be disseminated by a JSS journal paper J2.

Potential for further development of the research
and the proposed course of action:

Our project findings can lead to the following future work. 1. Our tool implements intra-worksheet smell detection. It can be further enhanced to support inter-worksheet and inter-spreadsheet smell detection. 2. The current methodology focuses on computational smell. Extension of the methodology can focuses on data smells of those cells referenced by formulas. 3. The current version of VEnron recovers only the versioning information from user emails. We may re-examine these emails and recover fixes and update dependency from the emails. 4. The current project evaluates only the effectiveness of the methodology. It is possible to evaluate the methodology’s usability through more user studies.

Layman's Summary of
Completion Report:

Spreadsheets are a popular computation paradigm used for data storage and analysis, decision support, financial reporting, and quality control. Despite their popularity, spreadsheets are found to be error-prone. Spreadsheet users rarely apply rigorous software engineering practice in maintaining the evolution of spreadsheets and keeping track of spreadsheet bugs. Conventional quality assurance measures for software such as unit tests and assert statements are generally inapplicable to spreadsheets. We made an empirical study of the spreadsheet issues. The study identified a type of smell commonly occurring when users prepare/maintain spreadsheets using copy-and-paste and autofill. The smell happens when a set of related cells is evaluated using different computation. We developed automated techniques to address the quality issue of spreadsheet in the absence of specification, test cases and bug management. The techniques were implemented as Java tools and applied successfully to detect spreadsheet smells and errors. Evaluations on spreadsheet corpus confirm that our techniques outperform the existing ones. We released our experimental results and data set publicly. In addition, we prepared and released the first spreadsheet corpus with versioning information based on the 15,929 spreadsheets used earlier by the Enron employees. The versioned spreadsheets greatly facilitate future studies on spreadsheet regression faults.

Research Output

Peer-reviewed journal publication(s)
arising directly from this research project :
(* denotes the corresponding author)

Year of Publication	Author(s)	Title and Journal/Book	Accessible from Institution Repository
2016	[J1] Jun Sui, Chang Xu*, S.C. Cheung, Wang Xi, Yanyan Jiang, Chun Cao, Xiaoxing Ma, Jian Lu	Hybrid CPU–GPU constraint checking: Towards efficient context consistency, Information and Software Technology, volume 74, June 2016, pp. 230-242.	No
2016	[J2] Ruiqing Zhang, Chang Xu*, S.C. Cheung, Ping Yu, Xiaoxing Ma and Jian Lu	How Effective can Spreadsheet Anomalies be Detected: An Empirical Study, The Journal of Systems and Software, 2016, accepted to appear.	No
	[J3] Wensheng Dou*, Chang Xu, S.C. Cheung, Jun Wei	CACheck: Detecting and Repairing Cell Arrays in Spreadsheets, IEEE Transactions on Software Engineering, accepted to appear.	No

Recognized international conference(s)
in which paper(s) related to this research
project was/were delivered :

Month/Year/City	Title	Conference Name
[C1] May/2014/Hyderabad	Is Spreadsheet Ambiguity Harmful? Detecting and Repairing Spreadsheet Smells due to Ambiguous Computation	The 36th International Conference on Software Engineering (ICSE 2014), pp. 848-858.
[C2] May/2016/Austin	Automatic Spreadsheet Cell Clustering and Smell Detection Using Strong and Weak Features	The 38th International Conference on Software Engineering (ICSE 2016), 12 pages
[C3] May/2016/Austin	VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis	The 38th International Conference on Software Engineering (ICSE 2016 - SEIP), 10 pages
[C4] November/2016/Seattle	Detecting Table Clones and Smells in Spreadsheets	The 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016).

Other impact
(e.g. award of patents or prizes,
collaboration with other research institutions,
technology transfer, etc.):

- Research collaboration with the Institute of Software, Chinese Academy of Sciences. - Research collaboration with Nanjing University. - Research collaboration with Sun Yat-Sen University. - Release of an industry corpus VEnron (http://sccpu2.cse.ust.hk/venron/) - Release of a research corpus and tools (http://sccpu2.cse.ust.hk/custodes/)

SCREEN ID: SCRRM00542