LSTM-Based Information Extraction for Cybersecurity Vulnerability Management

PhD student: 
Director(s): 
Co-supervisor(s): 
Starting date: February 2017
Host institution: 

We studied the suitability of Long Short-Term Memory (LSTM) deep neural networks for extracting information from cybersecurity vulnerability descriptions. Information extraction is a sub-field of Natural Language Processing (NLP) concerned with recognizing semantic content in natural language text. Its two most common tasks are Named Entity Recognition (NER) and Relation Extraction (REx). Previous work has shown that off-the-shelf NLP tools cannot extract security-related entities and their relations, and that the mainstream NER tools that give the best results rely on manual feature engineering, which suffers from several limitations. LSTM-based methods, which have become able to handle real-world problems in recent years, offer a promising alternative to traditional information extraction methods. Their main promise is the elimination of manual feature engineering: neural networks can automatically learn non-linear combinations of features.

The results showed a remarkable improvement in the NER task over the traditional statistical Conditional Random Fields (CRF) model, which we used as a benchmark. The LSTM models used for relation extraction varied in their performance in this domain; among them, the Shortest Dependency Path (SDP) model achieved the highest accuracy. One strength of the studied LSTM models is that they are domain agnostic and can be applied to other domains. Traditional methods require extensive feature engineering, which makes them time-consuming and labour-intensive; with the LSTM approach, the need for domain-specific tools is alleviated. Consequently, the training corpus is much simpler and requires much simpler preprocessing. Finally, the LSTM models were integrated into a unified framework that converts textual descriptions of software vulnerabilities into information used to populate a vulnerability management ontology. This ontology opens the door to systems that provide timely intelligence and awareness of these vulnerabilities and threats.

Keywords: Cybersecurity, Named Entity Recognition, Relation Extraction, Vulnerability Management, LSTM