Machine Learning-based Analysis Heuristic for Vulnerability Detection on Configurable Systems
Software Product Lines, Vulnerability Detection, Secure Coding, Machine Learning
Configurable software systems offer a variety of benefits, such as easy configuration of custom behaviors for distinct needs. However, the presence of configuration options in source code is known to complicate maintenance tasks and to require additional effort from developers when adding or editing code statements, because they must consider multiple configurations when executing tests or performing static analysis to detect vulnerabilities. As a result, vulnerabilities have been widely reported in configurable software systems. The effectiveness of vulnerability detection depends on how these configurations (i.e., sample sets) are selected. In this work, we tackle the challenge of generating more adequate system configuration samples by considering the intrinsic characteristics of security vulnerabilities. We propose a new sampling heuristic based on Machine Learning that recommends the subset of configurations that should be analyzed individually. We collected 53 metrics from 11 projects written in C, covering software complexity, probability of vulnerability incidence, evolution history, and developer contributions. We evaluated models trained on these data in different scenarios, such as cross-validation and cross-project validation, with the goal of reducing the number of variants recommended by the LSA (Linear Sampling Algorithm) heuristic. Our results show that we can achieve high vulnerability-detection effectiveness with a smaller sample size.
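To make the evaluation setup concrete, the following is a minimal sketch of two ideas mentioned above: a cross-project validation split, in which each project is held out once while a model is trained on the others, and the reduction of a configuration sample to the variants that include code flagged as vulnerability-prone. It is an illustration under stated assumptions, not the implementation used in this work. The names `cross_project_splits`, `reduce_sample`, and `files_enabled_by`, the row layout, and the toy data are all hypothetical, and the filtering rule is only one plausible way a classifier's output could shrink a sample.

```python
from collections import defaultdict


def cross_project_splits(rows):
    """Yield (held_out_project, train_rows, test_rows), one split per project.

    Each row is a dict with at least a "project" key (hypothetical layout).
    """
    by_project = defaultdict(list)
    for row in rows:
        by_project[row["project"]].append(row)
    for held_out in sorted(by_project):
        # Train on every project except the one being held out.
        train = [
            row
            for name in sorted(by_project)
            if name != held_out
            for row in by_project[name]
        ]
        yield held_out, train, by_project[held_out]


def reduce_sample(sample, flagged_files, files_enabled_by):
    """Keep configurations that enable at least one flagged file.

    Assumption: a classifier has already produced `flagged_files`, and
    `files_enabled_by` maps each configuration to the set of files it includes.
    """
    return [cfg for cfg in sample if files_enabled_by[cfg] & flagged_files]


# Toy data, invented for illustration only.
rows = [
    {"project": "p1", "file": "a.c", "vulnerable": 1},
    {"project": "p1", "file": "b.c", "vulnerable": 0},
    {"project": "p2", "file": "c.c", "vulnerable": 0},
    {"project": "p3", "file": "d.c", "vulnerable": 1},
]
for held_out, train, test in cross_project_splits(rows):
    print(held_out, len(train), len(test))

sample = ["cfg_A", "cfg_B", "cfg_C"]
files_enabled_by = {
    "cfg_A": {"a.c"},
    "cfg_B": {"b.c"},
    "cfg_C": {"a.c", "d.c"},
}
print(reduce_sample(sample, {"a.c"}, files_enabled_by))
```

Splitting by project rather than by random rows keeps data from the held-out project out of training, which is the property that separates cross-project validation from ordinary cross-validation.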