Challenges of Exploit Prediction
The methodological issues of exploit prediction, along with the need to balance evaluation under real-world conditions against an adequate sample size, pose three challenges:
Severe Class Imbalance.
As mentioned earlier, evidence of real-world exploits is found for only around 2.4% of the reported vulnerabilities. This skews the distribution in the prediction problem heavily towards one class (i.e., not exploited). In such cases, standard machine learning approaches favor the majority class, leading to poor performance on the minority class. Some prior work on predicting the likelihood of exploitation treats the existence of PoCs as an indicator of real-world exploitation, which substantially increases the number of exploited vulnerabilities in studies adopting this assumption. However, of the PoC exploits that are identified, only a small fraction are ever used in real-world attacks, a result confirmed in this chapter (e.g., only about 4.5% of the vulnerabilities having PoCs were subsequently exploited in the wild).
Other prior work applies class balancing techniques to both training and testing datasets and reports performance using metrics such as TPR, FPR, and accuracy. Resampling the data to balance both classes leads to training the classifier on a data distribution that differs greatly from the underlying one. The impact of this manipulation, whether positive or negative, cannot be observed when the same classifier is tested on a manipulated dataset, e.g., a testing set with the same rebalancing ratio. Hence, the prediction results of such models in deployed, real-world settings are debatable. To assess the impact of the highly imbalanced dataset on the machine learning models, oversampling techniques (in particular, SMOTE) are examined. Note that the testing dataset is not manipulated, since the aim is to observe performance that can be reproduced in the settings of a model running in a real-world deployment (e.g., streaming predictions). Doing so, only a marginal improvement is observed for some classifiers, while other classifiers show a slightly negative impact when trained on an oversampled dataset.
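As a minimal sketch of the evaluation protocol described above (not the chapter's actual pipeline), the following oversamples only the training split while leaving the test set at its natural distribution; simple random duplication of minority samples stands in for SMOTE here, and the toy data and split are illustrative:

```python
import random

random.seed(0)

def oversample_minority(X, y):
    """Randomly duplicate minority-class samples until both classes are the
    same size. Applied to the TRAINING split only, never to the test split."""
    pos = [(x, l) for x, l in zip(X, y) if l == 1]
    neg = [(x, l) for x, l in zip(X, y) if l == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    resampled = majority + minority + extra
    random.shuffle(resampled)
    Xr, yr = zip(*resampled)
    return list(Xr), list(yr)

# Toy data roughly mimicking the ~2.4% exploited rate: 3 exploited, 97 not.
X = [[i] for i in range(100)]
y = [1] * 3 + [0] * 97

X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]        # kept at its natural distribution
X_bal, y_bal = oversample_minority(X_train, y_train)

print(sum(y_bal), len(y_bal) - sum(y_bal))  # training classes now equal
print(sum(y_test), len(y_test))             # test set still imbalanced
```

Evaluating on the untouched test set is what lets the reported metrics transfer to a streaming, deployed setting.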
Evaluating Models on Temporal Data.
Machine learning models are evaluated by training the model on one set of data and then testing it on another set that is assumed to be drawn from the same distribution. The data split can be done randomly or in a stratified manner, where the class ratio is maintained in both training and testing. Exploit prediction, however, is a time-dependent prediction problem; splitting the data randomly violates the temporal aspect of the data, as events that happen in the future would then be used to predict events that happened in the past. Prior research efforts ignored this aspect when designing experiments. In this work, such temporal mixing is avoided in most experiments. However, experiments with a very small sample size, in which this is not controlled, are included, because one of the ground truth sources used does not have date/time information. It is explicitly noted when this is the case.
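A minimal sketch of such a temporal split, with hypothetical dates and fields: records are partitioned at a cutoff date so that nothing disclosed after the cutoff can leak into training.

```python
from datetime import date

# Hypothetical records: (disclosure_date, features, exploited_label).
records = [
    (date(2015, 3, 1),   {"cvss": 9.8}, 1),
    (date(2014, 7, 9),   {"cvss": 5.0}, 0),
    (date(2016, 1, 4),   {"cvss": 7.5}, 0),
    (date(2015, 11, 20), {"cvss": 6.1}, 1),
    (date(2014, 2, 2),   {"cvss": 4.3}, 0),
]

def temporal_split(records, cutoff):
    """Train on everything disclosed before the cutoff date and test on
    everything disclosed on or after it, so no future information is used
    to predict past events."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

train, test = temporal_split(records, date(2015, 6, 1))
# Every training record predates every testing record.
assert max(r[0] for r in train) < min(r[0] for r in test)
print(len(train), len(test))  # 3 2
```

A random or stratified split of the same records would mix 2016 disclosures into training while 2014 ones land in testing, which is exactly the leakage the text warns against.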
Limitations of Ground Truth.
As mentioned, attack signatures reported by Symantec are used as the ground truth of exploited vulnerabilities, as in previous works [4, 8]. This ground truth is not comprehensive, because the distribution of exploited vulnerabilities over software vendors is found to differ from that of overall vulnerabilities (i.e., vulnerabilities affecting Microsoft products have good coverage compared to products of other OS vendors). Although this source is limited in coverage, it is still the most reliable source of exploited vulnerabilities, because it reports attack signatures of exploits detected in the wild for known vulnerabilities. Other sources either report whether a piece of software is malicious without a proper mapping to the exploited CVE-ID (e.g., VirusTotal),24 or rely on online blogs and social media sites to identify exploited vulnerabilities (e.g., SecurityFocus).25 In this chapter, Symantec data is used while taking the false negatives into account. To avoid overfitting the machine learning model on this not fully representative ground truth, the software vendor is omitted from the set of examined features.
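The feature-omission step described above can be sketched as follows; the record layout and field names are hypothetical, not the chapter's actual schema:

```python
# Hypothetical vulnerability record; field names are illustrative only.
record = {
    "cve_id": "CVE-YYYY-NNNN",  # placeholder identifier
    "cvss_score": 7.5,
    "has_poc": True,
    "vendor": "microsoft",      # correlated with the ground truth's vendor skew
}

# Drop identifiers and the bias-prone vendor field before training.
EXCLUDED = {"cve_id", "vendor"}

def to_features(record, excluded=EXCLUDED):
    """Build the feature dict, omitting fields that would let the model
    overfit to the vendor imbalance of the Symantec-based ground truth."""
    return {k: v for k, v in record.items() if k not in excluded}

print(to_features(record))  # {'cvss_score': 7.5, 'has_poc': True}
```

Dropping the vendor field prevents the classifier from simply learning "Microsoft products get exploited," an artifact of where the ground truth has the best coverage rather than of true exploitation likelihood.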