DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection (2024)

[orcid=0000-0002-3263-1275]

Yanjing Yang, Xin Zhou (zhouxin@nju.edu.cn), Runfeng Mao, Jinwei Xu, Lanxin Yang, Yu Zhang, Haifeng Shen, He Zhang — Software Institute, Nanjing University, China; Faculty of Science and Engineering, Southern Cross University, Australia

Abstract

Software vulnerability detection is generally supported by automated static analysis tools, which have recently been reinforced by deep learning (DL) models. However, despite the superior performance of DL-based approaches over rule-based ones in research, applying DL approaches to software vulnerability detection in practice remains a challenge due to the complex structure of source code, the black-box nature of DL, and the domain knowledge required to understand and validate the black-box results for addressing tasks after detection. Conventional DL models are trained on specific projects and hence excel at identifying vulnerabilities in those projects but not in others. Models with poor vulnerability detection performance impair downstream tasks such as localization and repair. More importantly, these models do not provide explanations for developers to comprehend detection results. In contrast, Large Language Models (LLMs) have made much progress in addressing these issues by leveraging prompting techniques, but their performance in identifying vulnerabilities remains unsatisfactory. This paper contributes DLAP, a Deep Learning Augmented LLM Prompting framework that combines the best of both DL models and LLMs to achieve exceptional vulnerability detection performance. Experimental evaluation results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts, as well as fine-tuning, on multiple metrics.

keywords:

Vulnerability Detection · Large Language Model · Prompt Engineering · Framework

1 Introduction

Software vulnerability detection is paramount for safeguarding system security and individual privacy. As the cyber environment grows increasingly complex and attack techniques evolve rapidly, various threats to software systems have long puzzled software organizations[26, 36]. In particular, vulnerabilities are among the critical threats, potentially resulting in information leakage, data tampering, and system breakdowns[37]. Vulnerability detection aims to identify vulnerabilities, mitigate their impact, and prevent malicious attacks[14]. Moreover, it helps to enhance software quality, usability, and trustworthiness. Vulnerability detection has thus become a must-have in modern software development.

Many automated static analysis tools (ASATs) have been applied to vulnerability detection[24, 26]. However, on the one hand, the outputs of ASATs are difficult to validate, as doing so requires developers to have considerable expertise and experience in vulnerability detection[31]; on the other hand, the performance of ASATs is poor (e.g., high false positive rates) because they rely on string pattern matching[9, 19]. In recent years, the advancement of deep learning (DL) in natural language processing has inspired researchers to integrate DL models (in this paper, 'DL model' refers to conventional deep learning models other than large language models such as GPT, Copilot, and Llama) into ASATs. These modern ASATs generally outperform their conventional counterparts in vulnerability detection[24, 36]. However, DL models that perform well on experimental datasets may suffer severe performance degradation in real-world projects, mainly because of the complexity of source code structure and the concealment of vulnerability characteristics[4]. Using ASATs with DL models to detect vulnerabilities affects a collection of downstream tasks, including but not limited to vulnerability validation, localization, and repair. Moreover, checking the vulnerabilities indicated by DL models is challenging for the developers responsible for it[39].

In recent years, Large Language Models (LLMs) such as ChatGPT[3] and Copilot[5] have shown prominent performance in various tasks[46, 22, 18]. However, LLMs have not achieved satisfactory results in vulnerability detection[34]. Dai et al.[11] indicated that one of the main reasons is the inappropriate use of LLMs: LLMs are pre-trained on vast amounts of data, but not all of it benefits downstream tasks such as vulnerability detection[16, 29]. There are two techniques to address this problem: fine-tuning and prompt engineering. Fine-tuning is commonly used but requires significant computational resources and time. Prompts, in contrast, allow users to interact with LLMs iteratively to produce bespoke results[42, 11]. However, as detection performance is highly susceptible to prompts, a generic prompting framework cannot achieve satisfactory performance[1]. Prompt engineering adapts LLMs to a specific downstream task and generates customized outputs[42, 11]. Moreover, prompt engineering can work jointly with fine-tuning, making it a cost-effective and promising technique for vulnerability detection[35].

Previous work has utilized LLMs for vulnerability detection using various prompting frameworks[45, 11, 34, 30]. However, existing prompts provide limited information to LLMs, offering little help in improving LLM performance on real-world projects. To address this problem, we propose a bespoke prompting framework, DLAP (data and materials: https://github.com/Yang-Yanjing/DLAP.git). Although DL models may not achieve satisfactory performance across multiple projects, they perform well within a single project. The core idea of DLAP is to use DL models pre-trained for the target project to stimulate adaptive implicit fine-tuning of LLMs. We select the most suitable DL model among three categories as the plugin to augment DLAP. This is implemented through two state-of-the-art prompt types: the In-Context Learning (ICL) prompt and the Chain-of-Thought (COT) prompt. On the one hand, the ICL prompt uses locality-sensitive hashing (LSH) to sample candidate code fragments that are similar to the input code; a pre-trained DL model then provides the prediction probabilities of these candidates, and the combination of the candidate fragments and their corresponding probabilities forms the ICL prompt for the input code. On the other hand, the COT prompt synthesizes the results from static scanning tools and pre-trained DL models as queries. Following these queries, DLAP locates the corresponding COT templates within the detection step template library we constructed based on the Common Weakness Enumeration (CWE, https://cwe.mitre.org), and uses these templates to generate customized detection COT prompts for the input codes. This stimulates LLMs to conduct implicit fine-tuning, achieving better performance in vulnerability detection and providing supplementary information that facilitates the inspection and comprehension of detection results.

We conduct experiments using four large-scale projects with more than 40,000 examples to evaluate DLAP. We first run experiments to select the most suitable DL model for DLAP: we integrate various DL models into DLAP and compare their results. The results show that combining Linevul with an LLM outperforms the other DL models by 15% across all evaluation metrics. We then use Linevul to generate prompts within DLAP and compare DLAP against state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts. The results show that DLAP surpasses the baselines across all metrics for each project, achieving a 10% higher F1 score and a 20% higher Matthews Correlation Coefficient (MCC), indicating that DLAP is more effective for vulnerability detection. Finally, we compare DLAP with the most prevalent fine-tuning techniques to explore the effectiveness of DLAP versus fine-tuning. The results reveal that DLAP achieves 90% of the performance of an extensive fine-tuning process at a lower cost and even outperforms fine-tuning on some metrics. Moreover, the DLAP-driven LLM generates more explanatory text than a fine-tuned model, which is important in helping developers use ASATs for vulnerability detection tasks.

The main contributions of this paper are as follows.

  • We propose DLAP, a bespoke LLM prompting framework for vulnerability detection. DLAP combines the advantages of DL models and LLMs while overcoming their respective shortcomings. Additionally, DLAP has the potential to be adapted for other ASAT tasks.

  • We conduct rigorous experiments to demonstrate the effectiveness of selecting appropriate DL models for DLAP and showcase the exceptional vulnerability detection performance over state-of-the-art prompting frameworks.

  • We empirically demonstrate the advantages of prompting over fine-tuning for vulnerability detection in terms of detection accuracy, cost-effectiveness, and explanations.

The rest of the paper is organized as follows. Section 2 reviews the background and related work. Section 3 delineates the design of DLAP. Section 4 presents the experimental design and parameter settings of DLAP, followed by results and analysis in Section 5. Section 6 discusses DLAP's generalization capability and DL model selection. Finally, we present threats to validity in Section 7 and conclude this paper in Section 8.

2 Background and Related Work

This section describes the related work on vulnerability detection and the background of prompt engineering for LLMs.

2.1 Vulnerability Detection

While there is a plethora of work on this topic, we focus on vulnerability detection enhanced by DL and LLMs.

2.1.1 Deep Learning for Vulnerability Detection

Vulnerability detection has received considerable attention in recent years. Lin et al.[27] proposed a framework that incorporates one project and 12 DL models for slice-level vulnerability detection. Zhou et al.[49] proposed Devign, which uses graph representations of the input and performed better than approaches using code tokens directly. Li et al. developed a series of DL-based approaches, including VulDeePecker, μVulDeePecker, and SySeVR[24, 25, 50], completing the construction of a DL framework for vulnerability detection. Despite achieving advanced results in experimental setups, generalization issues remain in practical applications. With the advent of transformer-based networks and language models, researchers have started applying these advanced NLP techniques to vulnerability detection. Fu and Tantithamthavorn[13] applied RoBERTa as a pre-training model and fine-tuned it on subsequent vulnerability detection tasks, achieving the best experimental performance in both function-level and line-level vulnerability prediction.

Chakraborty et al.[4] found that the performance of several DL-based approaches dropped by an average of 73% on datasets built from multiple real-world projects, highlighting the need for further research into cross-project vulnerability detection. Steenhoek et al.[36] conducted an empirical study to demonstrate the variability between runs of a model and the low agreement among DL models’ outputs and studied interpretation models to guide future studies.

2.1.2 Large Language Model for Vulnerability Detection

The outstanding performance in dialogue, code generation, and machine translation of LLMs has sparked the interest of researchers and practitioners in applying LLMs to software security. Katsadouros et al.[20] highlighted the potential of LLMs to predict software vulnerabilities, emphasizing their advantage over traditional static methods. Thapa et al.[38] discovered that transformer-based LLMs outperformed conventional DL-based models. Zhang et al.[45] enhanced the effectiveness of ChatGPT in software vulnerability detection through innovative prompt designs and leveraging the model’s ability to memorize multi-round dialogue.

However, according to Cheshkov et al.[7], ChatGPT and GPT-3 do not outperform current tools in Java code vulnerability detection. Meanwhile, Liu et al.[29] emphasized that ChatGPT cannot replace professional security engineers in vulnerability analysis, indicating that closed-source LLMs are not the end of the story. These findings suggest that the performance of LLMs in vulnerability detection leaves much to be desired. The potential false positives and hallucinations generated by LLMs in specific applications[47] are attributable to the extensive unconstrained training data and the multitude of training parameters. Consequently, it is essential to fine-tune an LLM before deploying it for specific tasks. Lu et al.[30] proposed GRACE, which processes code structure using CodeT5, combines semantic with syntactic features to conduct similarity searches, and utilizes in-context learning prompts to drive the LLM beyond all baseline DL models on complex real-world datasets.

As indicated above, the performance of LLMs remains unsatisfactory, accompanied by a high false positive rate. In this study, GRACE[30], along with other evaluated prompts[45], serves as the baseline for comparison and evaluation. To achieve better performance in vulnerability detection, we select SySeVR[24], Devign[49], and Linevul[13] as the augmenting components of our framework.

2.2 Prompt Engineering for Large Language Models

The increase in the number of parameters in LLMs raises the cost of fine-tuning them. Low-cost approaches such as Low-Rank Adaptation of Large Language Models (LoRA)[17] and P-tuning[28] have significantly reduced this cost, but for some applications it is still considerable: for example, fine-tuning an LLM with 33B parameters requires two high-precision 40G GPUs[17]. The LLMs reported to achieve excellent performance all have more than one hundred billion parameters, which demands substantial computing resources. Research[3, 8, 2] shows that LLMs are transformer-based models: different inputs change the behavior of the attention layers within their architecture, so constructing high-quality prompts can help LLMs provide satisfactory answers for specific target tasks. In contrast, inappropriate prompts impinge on the models' attention and may mislead LLMs into producing hallucinations[47]. COT[42] and ICL[11] prompting are currently the most effective approaches to prompt engineering: COT prompting decomposes a target problem into steps to prompt LLMs to provide answers, while ICL prompts LLMs to deliver correct answers by referring to similar questions. In this paper, DLAP integrates both approaches to provide appropriate prompts that drive LLMs for vulnerability detection.

[Figure 1: Overview of the DLAP framework]

3 The Proposed Framework: DLAP

This section describes the design of DLAP, which consists of two prompt techniques.

3.1 Motivation

Vulnerability detection can be formulated as a binary classification problem. Given a vulnerability dataset $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$, where $x_i$ is the source code of a function and $y_i$ is the ground truth (0 – not vulnerable, 1 – vulnerable), detection models are expected to learn this mapping automatically. One of the main processes of detection models is input (source code) representation. Specifically, source code can be represented as semantic tokens, abstract syntax trees (ASTs), data/control flow graphs (DFGs/CFGs), or other formats. Conventional DL models use only a single format as input, which may miss useful information. LLMs are capable of combining multiple representations, making them a more promising technique for vulnerability detection. When leveraging LLMs for vulnerability detection, it is fundamental to make them understand domain and task knowledge, as LLMs are trained on general corpora; using LLMs for vulnerability detection directly can thus be expected to yield unsatisfactory results. Therefore, LLMs require fine-tuning or prompt engineering to address this task.

Fine-tuning is an intuitive technique to adapt the parameters $\mathcal{W}$ of LLMs to a downstream task (i.e., vulnerability detection) on $\mathcal{D}_o^E$ to achieve better results, which can be described as Equation 1.

$\mathcal{W} = \operatorname{argmin} \mathcal{L}\left(Y - \mathcal{M}_{\mathcal{W}}(\mathcal{P}(X))\right)$   (1)

where $Y$ is the ground truth (label) for function-level code and $\mathcal{L}$ is the loss function. By fine-tuning the weight parameters $\mathcal{W}$ of LLMs, the predictive probabilities are expected to move closer to the ground truth. However, fine-tuning is cost-intensive[17, 44]. For example, even with LoRA, one of the most efficient fine-tuning techniques, fine-tuning an LLM with only 13B parameters requires approximately 80G of GPU memory and a substantial amount of time.

Prompt engineering is a newer technique to augment LLMs. LLMs can incorporate various inputs and generate answers accordingly; therefore, they can be prompted[3, 8, 2]. Technically, we use $\mathcal{D}_s^T$ and $\mathcal{D}_o^E$ to denote the pretraining and testing sets, respectively. When using prompt engineering $\mathcal{P}(\cdot)$ for vulnerability detection, LLMs accept a set of $\mathcal{P}(X)$ as inputs and output the estimated probabilities, where $X$ is a collection of examples from $\mathcal{D}_o^E$. One cost-effective prompting template $\mathcal{P}$ for vulnerability detection can be described as Equation 2.

$\mathcal{P} = \operatorname{argmin} \mathcal{L}\left(Y - \mathcal{M}_{\mathcal{W}}(\mathcal{P}(X))\right)$   (2)

According to Liu et al.[28], Equation 2 can achieve the same effects as Equation 1. That is, both prompt engineering and fine-tuning can bring the predictive probabilities close to the ground truths. In the following subsections, we elaborate on DLAP, which leverages prompt engineering for vulnerability detection.

3.2 Framework Overview

DLAP leverages the attention mechanisms within LLMs, incorporating selectively trained DL models as enhancements. Through in-context learning (ICL), this subtly refines LLMs, making them more adept at specific projects; in addition, DLAP's use of chain-of-thought (COT) prompts enables LLMs to discard incorrect generative paths effectively. Consequently, DLAP enhances LLMs' capabilities in detection tasks, ensuring robust performance without incurring significant costs. ICL can stimulate the attention layer of an LLM to adapt to the downstream detection task, which is defined by [11] as implicit fine-tuning. As with general fine-tuning, implicit fine-tuning can drive LLMs to adapt to downstream tasks and achieve better performance, and well-designed prompts stimulate LLMs to perform better in downstream detection tasks. The idea behind the proposed DLAP framework is to use DL models to augment LLMs by constructing appropriate prompts that stimulate implicit fine-tuning of the LLMs, adapting them to vulnerability detection tasks. In this way, DLAP reduces the performance degradation caused by hallucination and data distribution differences.

As shown in Figure 1, DLAP is composed of two main parts: (1) Part I, the construction of in-context learning prompts augmented by DL models, and (2) Part II, the generation of bespoke COT prompts to augment LLMs. In Part I, we employ DL models to generate detection probabilities for input codes and select candidate codes based on similarity; the combination of candidate codes and their corresponding probabilities forms the ICL prompt for detection. In Part II, we combine the results of DL models and static tools as key-value pairs to query pre-defined templates in a preset COT template library. Based on the characteristics of each input sample, we complete the chain of thought, generating COT prompts for detection. These two parts are introduced in Section 3.3 and Section 3.4, respectively. In Section 3.5, we show an example of synergizing the two prompt types to generate the final DLAP prompts.

3.3 In-Context Learning Prompts Construction

According to the earlier point, LLMs encapsulate vast knowledge through their expansive weight structures. Firstly, we select a pre-trained DL model using the training sets. For new projects, DL models can also be built from newly collected samples based on existing research[13, 24, 49]. Then, to create the appropriate in-context examples, the most similar code candidates are retrieved from the training set using locality-sensitive hashing (LSH), an efficient similarity search algorithm used in Retrieval-Augmented Generation (RAG). Although LSH only captures the surface similarity of code segments, a prompting framework cannot afford to spend much time generating prompts, so an efficient way to sample multiple codes as a candidate set is necessary.
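The candidate retrieval step above can be sketched with a minimal, self-contained MinHash-based LSH. This is an illustrative implementation, not DLAP's exact algorithm: the shingle size, permutation count, and band count are assumed values chosen for the example.

```python
import hashlib
import random

def shingles(code, k=3):
    """Split code into overlapping k-token shingles (illustrative tokenization)."""
    tokens = code.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(shingle_set, num_perm=32, seed=42):
    """MinHash signature: the minimum of a salted hash per 'permutation'."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [
        min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for salt in salts
    ]

def lsh_candidates(query_code, corpus, num_perm=32, bands=8):
    """Return corpus entries whose signature collides with the query in any band."""
    rows = num_perm // bands
    q_sig = minhash_signature(shingles(query_code), num_perm)
    buckets = {}  # (band index, band slice) -> corpus indices hashed there
    for idx, code in enumerate(corpus):
        sig = minhash_signature(shingles(code), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    hits = set()
    for b in range(bands):
        key = (b, tuple(q_sig[b * rows:(b + 1) * rows]))
        hits.update(buckets.get(key, []))
    return [corpus[i] for i in sorted(hits)]
```

In practice the corpus signatures would be indexed once offline, so each query costs only one signature computation plus a handful of bucket lookups, which is what makes LSH fast enough for prompt-time retrieval.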

Following Dai et al.[11], we use in reverse the dual form of transformer attention that they derived. The adaptive implicit fine-tuning on the attention layer $\widetilde{\mathcal{A}}$ of LLMs, stimulated by DLAP for a specific project, can therefore be written as Equation 3; please refer to the Appendix for details.

$\widetilde{\mathcal{A}}(\mathbf{q}) = (W_{\text{init}} + \Delta W_{ICL}(x))\,\mathbf{q}$   (3)

We train the DL model $\mathcal{M}$ with the training data information $\text{info}$, which captures the relationship between the project data and the labels. The DL model then generates a detection probability $\operatorname{Probs_{ICL}}$ for a detection object $x$, as shown in Equation 4.

$\operatorname{Probs_{ICL}}(Obj_{\text{info}}) = \mathcal{M}(Obj_{\text{info}})(x)$   (4)

The probabilities output by the DL model represent characteristics of the input codes. DLAP uses these probabilities to construct ICL prompts that augment LLMs. We then obtain the relaxed attention representation $\widetilde{\mathcal{A}}$ of the LLM in Equation 5.

$\widetilde{\mathcal{A}}(\mathbf{q}) = \left(W_{\text{init}} + \Delta W(\operatorname{func}(\operatorname{Probs_{ICL}}(Obj_{\text{info}})))\right)\mathbf{q}$   (5)

Equation 5 indicates that the relaxed attention of the LLM is related to the probabilities output by the selected DL model, and these probabilities are in turn related to the training project data. This results in an implicit fine-tuning of the LLM to adapt to project-specific information[11]. We explain this adaptive process in more detail in the result analysis section through a comparative experiment.

Compared to conventional fine-tuning methods, DLAP does not require excessive resource consumption to update the parameters of LLMs; the ICL prompts instead update the output of the attention layer in the LLMs. As the example in Figure 2 shows, the ICL prompts of DLAP stimulate implicit fine-tuning of the LLMs toward the characteristics of the projects to be detected.

[Figure 2: An example of the ICL prompts constructed by DLAP]

By adding the results generated by the DL model to the prompts of the LLM, DLAP can include more in-context information. The trained weights of the DL model encode the complex relationship between the predicted probability and the input code text; therefore, the DL model's output carries the characteristics of the training sets. An ICL prompt built in this way contains more information than a normal ICL prompt, which stimulates LLMs to perform better in downstream tasks.

As shown in the upper part of Figure 1, after obtaining the detection probabilities of the candidate codes from the DL model, we treat each candidate code and its corresponding probability as a question-and-answer pair. These combinations (an example is shown in Figure 2) constitute the in-context learning (ICL) prompt, which, together with Part II in Section 3.4, forms the final DLAP-augmented prompts.
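A minimal sketch of how such question-and-answer combinations could be assembled into an ICL prompt follows. The prompt wording, the 0.5 decision threshold, and the function name are illustrative assumptions, not DLAP's exact format.

```python
def build_icl_prompt(candidates, probs, target_code):
    """Format (similar code, DL probability) pairs as few-shot Q/A context,
    then append the code under test as the final open question."""
    parts = []
    for code, p in zip(candidates, probs):
        # Turn the DL model's probability into an in-context "answer".
        label = "VULNERABLE" if p >= 0.5 else "SAFE"
        parts.append(
            f"Q: Is the following function vulnerable?\n{code}\n"
            f"A: {label} (detection-model probability: {p:.2f})\n")
    parts.append(
        f"Q: Is the following function vulnerable?\n{target_code}\nA:")
    return "\n".join(parts)
```

The trailing open "A:" leaves the final answer for the LLM to complete, while the preceding pairs carry the project-specific signal from the DL model.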

3.4 Chain-Of-Thought Prompts Generation

The second part of DLAP generates specific prompts for each tested sample, in the following stages. Firstly, because the characteristics and detection steps of vulnerabilities vary, we pre-set different detection templates in a COT library. According to existing peer-reviewed vulnerability taxonomies (i.e., [43, 23]) and reliable grey literature (i.e., [41]), we construct a hierarchical detection COT library with the following six major categories.

  • SFE (Security Features Errors): Errors induced by imperfect security features

  • LOG (Logistics Errors): Errors induced by program execution

  • MEM (Memory Errors): Errors related to memory resources

  • NUM (Numeric Errors): Errors induced by numerical computations

  • IDN (Improper Data Neutralization): Errors induced by non-standardization (verification, restriction) of exchanged data

  • UNT (Unknown Taxonomy Errors): Unknown errors

Subsequently, according to the parent-child relationships described in the CWE research concepts view (https://cwe.mitre.org/data/definitions/1000.html), some of the above categories are refined into a total of 45 subcategories. In addition, by referring to relevant research[29, 45, 32] on step-by-step vulnerability detection with COT-driven LLMs, we establish a general paradigm for COT generation as follows.

  • Semantics: Comprehending the function of the code.

  • Logic: Analyzing the structure of the code.

  • Internal risks: Identifying components that may introduce vulnerabilities.

  • External risks: Inspecting for unsafe functions that could potentially lead to vulnerabilities.

  • Generating the COT: Integrating the information acquired above and generating a COT to ask step by step whether there are potential vulnerabilities.
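The five-step paradigm above could be rendered as a prompt template along the following lines. The wording of each step is a hypothetical paraphrase for illustration, not the paper's actual template text.

```python
# Hypothetical rendering of the five-step COT generation paradigm.
GENERIC_COT_STEPS = [
    "Semantics: summarize what this function does.",
    "Logic: outline its control and data flow.",
    "Internal risks: list components that could introduce a vulnerability.",
    "External risks: check calls to potentially unsafe functions.",
    "Conclusion: combining the steps above, decide step by step whether "
    "the function contains a potential vulnerability.",
]

def build_cot_prompt(code, steps=GENERIC_COT_STEPS):
    """Number the paradigm steps and attach the code under test."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"Analyze the function below step by step:\n{numbered}\n\nCode:\n{code}"
```

For a specific CWE subcategory, the generic steps would be replaced by the refined guidance retrieved from the COT library.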

The specific COT refines the generation paradigm into corresponding COT guidance for different categories; each subcategory is associated with a specific detection COT template. DLAP employs two open-source function-level static vulnerability detection tools, Flawfinder (https://dwheeler.com/flawfinder) and Cppcheck (http://cppcheck.net), to generate static scanning results. It parses the output text of the static tools and maps it to the corresponding categories in the taxonomy tree. The results are then scored and recorded for each tool, and the K highest-scoring categories are added to the query key. DLAP uses the same DL model as in Section 3.3 because of its better performance according to the studies[49, 13, 24]. DLAP combines the detection results of the DL model and the scanning results of the static tools into the key of a query. Using this key, DLAP obtains customized COT generation guidance templates from the COT library for the test codes. The key is a dictionary containing the static tools' output class and the DL model's judgment, as shown in Figure 3.

[Figure 3: An example of the query key formed from static tool outputs and the DL model's judgment]
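The key-to-template lookup described above can be sketched as a dictionary query. The category names follow the paper's taxonomy, but the template texts, the fallback behavior, and the function signature are illustrative assumptions.

```python
# Hypothetical COT library: (taxonomy category, DL verdict) -> guidance template.
COT_LIBRARY = {
    ("MEM", True): "Trace each allocation; check frees, bounds, and lifetimes...",
    ("IDN", True): "Follow each external input; check validation and sanitization...",
    ("UNT", False): "No strong signal; walk through the generic detection steps...",
}

def query_cot_template(static_categories, dl_probability, k=1, threshold=0.5):
    """Build the query key from the top-k static-tool categories and the
    DL model's verdict, falling back to the UNT (unknown) template."""
    verdict = dl_probability >= threshold
    for category in static_categories[:k]:
        template = COT_LIBRARY.get((category, verdict))
        if template is not None:
            return template
    return COT_LIBRARY[("UNT", False)]
```

In DLAP the library is a hierarchical tree with 45 subcategories rather than a flat dictionary, but the query mechanics are analogous.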

For instance, if the key is a null pointer dereference, which falls under the IDN category, DLAP retrieves the refined COT guidance from the taxonomy tree as shown in Figure 4. The library of COT guidance templates is publicly available on GitHub (https://github.com/Yang-Yanjing/DLAP.git, COTTree).

[Figure 4: The refined COT guidance retrieved from the taxonomy tree for the example key]

Through the key generation process described earlier, the COT guidance and the results of the DL model are combined to generate the final COT prompts for the specific detection samples.

[Figure 5: An example of the final synergized DLAP prompts]

3.5 Prompts Synergy

Figure 5 shows an example of our algorithm in which the final prompts are assembled in this format. The process by which DLAP generates prompts is described in Algorithm 1, which takes as input the selected DL model $\mathcal{M}$, the training/history data of the target detection project $\mathcal{X}=\{(x_i, y_i)\}_{i=1}^{N}$, the selected static tools $\mathcal{S}$, and the preset COT library $\mathcal{K}$.

Algorithm 1: DLAP prompt generation.
Input: $\mathcal{M}$; $\mathcal{X}=\{(x_i,y_i)\}_{i=1}^{N}$; $\mathcal{S}$; $\mathcal{K}$.
1: $\mathcal{X}_o^T \leftarrow \operatorname{Sample}(\mathcal{X})$  # the training set for constructing the DL-based model
   $Y \leftarrow \operatorname{Label}(\mathcal{X}_o^T)$; $\mathcal{X}_o^E = \mathcal{X} - \mathcal{X}_o^T$
2: $\mathcal{M} = \operatorname{argmin} \mathcal{L}(\mathcal{X}_o^T, Y)$
3: for $x_i$, $i = 1$ to $N$ in $\mathcal{X}_o^E$ do
4:     $[Candidates] = \operatorname{LSH}(x_i)$
5:     $[Probabilities] = \mathcal{M}([Candidates])$
6:     ICL prompt$(x_i) = \{[Probabilities], [Candidates]\}$
7:     $Result_{\mathcal{S}}(x_i) \leftarrow \operatorname{Ranking}(\mathcal{S}(\mathcal{X}_o^E))$
8:     $\mathcal{P}(x_i) = Prediction_{\mathcal{M}}(x_i) + Result_{\mathcal{S}}(x_i)$
9:     $G(x_i) = \mathcal{K}(\mathcal{P}(x_i))$  # use $\mathcal{P}(x_i)$ as the key to retrieve COT prompt generation guidance
10:    $COT(x_i) = GPT(G(x_i))$  # use GPT to complete $G(x_i)$
11:    DLAP prompts$(x_i)$ = ICL prompt$(x_i)$ + COT prompt$(x_i)$
12: end for
Output: specific COT prompts of all detection samples.

First, the target detection project $\mathcal{X}$ is sampled to construct the training set $\mathcal{X}_o^T$, which is labeled as $Y$ (labels may be collected directly if open-source vulnerability information is available). The remaining part of the project forms the test set $\mathcal{X}_o^E$. $\mathcal{X}_o^T$ and the labels $Y$ are used to train $\mathcal{M}$ by minimizing the loss function $\mathcal{L}$. Next, for each input code in the test set $\mathcal{X}_o^E$, the LSH algorithm retrieves the most similar code $[Candidates]$ from this set, and the DL model $\mathcal{M}$ produces detection probabilities $[Probabilities]$. Using $[Candidates]$ and $[Probabilities]$, DLAP constructs question-answer combinations to form the DL-augmented ICL prompts.
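The LSH retrieval step can be illustrated with a minimal MinHash-based sketch; the shingle size, hash count, and banding scheme below are illustrative choices, not the paper's exact configuration:

```python
import hashlib

def minhash_signature(code, num_hashes=32, shingle=3):
    """MinHash signature over word 3-gram shingles of a code snippet."""
    tokens = code.split()
    shingles = {" ".join(tokens[i:i + shingle])
                for i in range(max(1, len(tokens) - shingle + 1))}
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
            for seed in range(num_hashes)]

def lsh_candidates(query, corpus, bands=8):
    """Return corpus snippets sharing at least one signature band with the query."""
    qsig = minhash_signature(query)
    rows = len(qsig) // bands
    qbands = {tuple(qsig[b * rows:(b + 1) * rows]) for b in range(bands)}
    results = []
    for snippet in corpus:
        sig = minhash_signature(snippet)
        sbands = {tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)}
        if qbands & sbands:  # any matching band => candidate
            results.append(snippet)
    return results

corpus = ["int a = b + c ;", "free ( ptr ) ; ptr = 0 ;"]
candidates = lsh_candidates("int a = b + c ;", corpus)
```

Snippets that collide with the query in at least one band become the $[Candidates]$ fed to the DL model.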

DLAP carries out the bespoke process shown in Figure 5 to obtain the specific COT prompt. Static tools are used to produce analysis results $\mathcal{S}(\mathcal{X}_o^E)=\{\mathcal{S}_1:\text{scores},\ \mathcal{S}_2:\text{scores},\ \mathcal{S}_3:\text{scores},\ldots\}$, and $Result_{\mathcal{S}}(x_i)$ is obtained by ranking $\mathcal{S}(\mathcal{X}_o^E)$. DLAP then generates the DL model's prediction result. The results $Predictions_{\mathcal{M}}$ are combined into a query, which serves as the key to retrieve the COT prompt generation guidance $G(x_i)$ from the COT library. GPT is used to complete $G(x_i)$, generating $COT(x_i)$ for each detection sample.
Finally, the prompts of DLAP are composed of the COT prompts and the ICL prompts. Each specific prompt drives the LLM for detection, and the generated COT prompts lead the LLM to produce understandable vulnerability detection results in a prescribed answer format.
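Algorithm 1's per-sample loop can be condensed into the following sketch, with every component (DL model, candidate retrieval, static tool, COT library, LLM) stubbed as a plain callable; all names and stub behaviors are illustrative assumptions:

```python
def dlap_prompt(x, model, candidates_fn, static_tool, cot_library, llm):
    """One iteration of Algorithm 1 (lines 4-11), with stubbed components."""
    candidates = candidates_fn(x)                    # LSH retrieval of similar code
    probabilities = [model(c) for c in candidates]   # DL detection probabilities
    icl_prompt = {"candidates": candidates, "probabilities": probabilities}
    # Query key: static-tool category plus the DL model's binary judgment.
    key = (static_tool(x), model(x) >= 0.5)
    guidance = cot_library.get(key, "generic COT guidance")
    cot_prompt = llm(guidance)                       # the LLM completes the guidance
    return {"icl": icl_prompt, "cot": cot_prompt}

# Stub components (illustrative only).
model = lambda code: 0.9 if "strcpy" in code else 0.1
candidates_fn = lambda x: ["strcpy(dst, src);"]
static_tool = lambda x: "BUF"
cot_library = {("BUF", True): "reason step-by-step about buffer bounds"}
llm = lambda guidance: "COT: " + guidance

prompts = dlap_prompt("strcpy(dst, src);", model, candidates_fn,
                      static_tool, cot_library, llm)
```

The returned ICL and COT parts are concatenated into the final DLAP prompt for each sample.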

4 Experimental Design

This section details research questions, datasets, DL models, baseline LLM prompts, and evaluation metrics.

4.1 Research Questions

The experimental evaluation of DLAP is structured with three research questions (RQs).

  • RQ1:

    Which category of DL models is the most effective for DLAP?

Motivation & Setup: An important driver of DLAP is the DL model, which injects the information stored during DL model training into the prompting process of LLMs through the ICL approach. As little is known about which category of DL models is suitable for augmenting LLMs in the context of vulnerability detection, we propose RQ1 to compare three representative DL models: Sysevr, Devign, and Linevul (cf. Section 4.3 for rationales).

  • RQ2:

    How effective is DLAP compared to existing prompting frameworks?

Motivation & Setup: Previous research has shown that the performance of LLMs is sensitive to prompts, and inappropriate prompts lead to unsatisfactory performance. In this paper, we design DLAP as a prompt-augmentation framework for vulnerability detection. In RQ2, we compare DLAP against four existing prompting frameworks, P_Rol, P_Aux, P_Cot, and GRACE (cf. Section 4.4 for rationales), to evaluate its effectiveness.

  • RQ3:

    How effective is DLAP compared to LoRA fine-tuning?

Motivation & Setup: Previous research has shown that fine-tuning is helpful for augmenting LLMs. In this paper, we use prompt engineering rather than fine-tuning to develop DLAP because of its lower cost. In RQ3, we compare DLAP against a fine-tuned LLM (Llama-13B) [40] to see whether it matches the performance of fine-tuning. Specifically, we select LoRA [17], a state-of-the-art LLM fine-tuning technique, for comparison.

4.2 Datasets

According to Croft et al. [10], common vulnerability detection datasets suffer from labeling bias. To develop an appropriate experimental dataset, we set three criteria for selecting projects: (1) it has been studied by related work [4, 49, 12] (to ensure external validity); (2) it has accumulated more than 3,000 functions (to exclude inactive projects); and (3) it is traceable (to exclude projects whose vulnerability information is incorrect or even unknown). As a result, our experimental dataset consists of four open-source projects: Chrome, Linux, Android, and Qemu. The selected projects are of good open-source quality and have high-quality vulnerability fix records for traceability.

The basic information of the selected projects is shown in Table 1, from which we can observe that the number of vulnerable and non-vulnerable functions in each project is imbalanced. To mitigate the impact of data imbalance on training the DL augmentation model, we first performed random undersampling on the non-vulnerable samples of the four projects. Then we divided the dataset into training and testing sets in an 8:2 proportion. The training set was used to build the DL models, while the testing set was used to evaluate the performance of DLAP.
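The undersampling and 8:2 split described above can be sketched as follows; the helper name and sampling seed are our own, not the paper's exact procedure:

```python
import random

def undersample_and_split(samples, train_ratio=0.8, seed=0):
    """samples: list of (function, label) pairs, where label 1 = vulnerable."""
    rng = random.Random(seed)
    vulnerable = [s for s in samples if s[1] == 1]
    benign = [s for s in samples if s[1] == 0]
    # Random undersampling: keep as many benign functions as vulnerable ones.
    benign = rng.sample(benign, min(len(benign), len(vulnerable)))
    data = vulnerable + benign
    rng.shuffle(data)
    cut = int(len(data) * train_ratio)   # 8:2 train/test proportion
    return data[:cut], data[cut:]

# 10 vulnerable vs. 90 benign functions -> balanced 20 samples, split 16/4.
train, test = undersample_and_split([("vul_fn", 1)] * 10 + [("ok_fn", 0)] * 90)
```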

Table 1: Basic information of the selected projects.

Project  | #Functions | #Vulnerabilities | Used by
Chrome   | 77,173     | 3,939            | Chakraborty et al. [4]
Linux    | 46,855     | 1,961            | Fan et al. [12]
Android  | 8,691      | 1,277            | Fan et al. [12]
Qemu     | 3,096      | 125              | Zhou et al. [49]

4.3 DLAP Refinement

To address RQ1, we select three DL models for vulnerability detection to refine DLAP. Each of the three represents one type of DL model. Their rationales and hyperparameter settings are as follows.

  • Sysevr [24] represents the category that uses code features including syntactic and semantic information in vector representations. It filters the code into slice inputs through static analysis of semantics and syntax.

  • Devign [49] represents the category that introduces richer graph structures and graph neural networks into vulnerability detection models.

  • Linevul [13] represents the category that utilizes pre-trained deep learning models. This detection model is based on the Transformer architecture.

Table 2: Preset hyperparameters of the three DL models.

DL Model | Hyperparameter       | Selection
Sysevr   | Java version         | Java 8
         | Static tools         | Joern 0.3.1
         | Graph database       | Neo4j
         | Data preprocessing   | Slice
         | Embedding algorithm  | Word2vec
         | - sampling algorithm | CBOW
         | - sampling window    | 5
         | - min_count          | 5
         | Network architecture | BiLSTM
         | - epoch              | 100
         | - batch_size         | 32
         | - optimizer          | sgd
         | - loss function      | binary cross-entropy
Devign   | Java version         | Java 8
         | Static tools         | Joern 2.0.157
         | Data preprocessing   | Graph
         | Embedding algorithm  | Word2vec
         | - vector_size        | 100
         | - epoch              | 10
         | - min_count          | 1
         | Network architecture | CNN
         | - epoch              | 200
         | - batch_size         | 128
         | - input_channels     | 115
         | - hidden_channels    | 200
         | - num_of_layers      | 6
         | - optimizer          | adam
         | - loss function      | binary cross-entropy
Linevul  | Data preprocessing   | Slice
         | Embedding algorithm  | BPE+Transformer
         | Pretrained model     | codeBERT
         | - batch_size         | 256
         | - num_attention_head | 12
         | - optimizer          | Adam
         | - loss function      | binary cross-entropy

We selected these three DL models so that each represents a range of similar models in its category. As part of our model selection process, we adopted the hyperparameters reported to achieve the best performance in the respective research papers of these DL models; they are listed in Table 2 as the preset hyperparameters of our framework. By doing so, we aim to replicate the optimal performance achieved by these models and ensure consistency in our evaluation and comparison.

4.4 Baselines

We compare DLAP against four prompting frameworks [45, 34, 30, 44] that leverage LLMs to detect vulnerabilities.

  1. P_Rol (Role-based prompts): According to White et al. [44], providing GPT with a clear role greatly alleviates its hallucination problem. Our first baseline, proposed by Zhang et al. [45], casts GPT as a vulnerability detection system.

  2. P_Aux (Auxiliary information prompts): Following the view of Zhang et al. [45], providing LLMs with more semantic information about the code improves their vulnerability detection performance. In baseline 2, we therefore provide data flow as auxiliary information in the prompts.

  3. P_Cot (Chain-of-thought prompts): According to Wei et al. [42], owing to the multi-turn dialogue capabilities of LLMs, constructing a COT better assists them in reasoning. In baseline 3, we therefore construct a two-step thinking chain to drive the LLM through vulnerability detection. Step 1: make the LLM understand the purpose of the code exactly; we design first-step prompts for detecting the intent of the code. Step 2: building on the first step, prompt the LLM to detect vulnerabilities in the input.

  4. GRACE: GRACE is a prompting framework that enhances the capabilities of LLMs for software vulnerability detection by incorporating graph structural information from the code. It employs codeT5 and ICL techniques to exploit graph information.
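For illustration, the two-step chain of the P_Cot baseline can be sketched as follows; the prompt wording is paraphrased, not the exact text used in the experiments:

```python
def cot_baseline_prompts(code):
    """Two-step chain-of-thought prompts for the P_Cot baseline (paraphrased)."""
    # Step 1: drive the LLM to state the intent of the code.
    step1 = ("You are a vulnerability detection system. "
             "First, describe the intent of the following function:\n" + code)
    # Step 2: building on the intent, ask for the vulnerability verdict.
    step2 = ("Based on your description of the code's intent, "
             "determine whether the function contains a vulnerability. "
             "Answer 'yes' or 'no' with a brief justification.")
    return [step1, step2]

prompts = cot_baseline_prompts("void f(char *s) { char b[8]; strcpy(b, s); }")
```

The two messages are sent to the LLM in sequence, so the second turn conditions on the intent summary produced by the first.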

4.5 Evaluation Metrics

As vulnerability detection is formulated as a binary classification problem in this paper, we use precision (P_vul), recall (R_vul), and F1-score (F1) to measure the performance of each framework. Considering that vulnerabilities form a minor class of great severity, we also use the false positive rate (FPR) as a metric. FPR focuses on false positives, since mistakes on them would cause more serious outcomes than mistakes on false negatives. In this paper, the minor (positive) class is vulnerability, and it occupies a very small portion of the data. The definition of FPR is shown in Equation 6. Moreover, the Matthews correlation coefficient (MCC), a.k.a. the phi coefficient, is also used as an evaluation metric; it measures the performance of binary classifiers on imbalanced datasets and is a more comprehensive metric than FPR. The definition of MCC is shown in Equation 7.

$\text{FPR} = \dfrac{\text{FP}}{\text{FP} + \text{TN}}$  (6)

$\text{MCC} = \dfrac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$  (7)

where TP represents correctly detected vulnerabilities, TN represents correctly detected non-vulnerabilities, FP represents incorrectly detected vulnerabilities, and FN represents incorrectly detected non-vulnerabilities.

The Coefficient of Variation (CV) is a statistical measure used to determine the dispersion of data points in a dataset relative to its mean. It is particularly valuable when comparing the variability of datasets with different means. The CV is calculated using the equation:

$\text{CV} = \dfrac{\sigma}{\mu}$  (8)

where $\sigma$ represents the standard deviation and $\mu$ denotes the mean of the dataset.

A higher CV indicates greater dispersion within the data distribution, reflecting more variability relative to the mean. P_vul, R_vul, F1, and FPR range from 0 to 1; higher precision, recall, and F1 indicate better performance, whereas a lower FPR indicates better performance. MCC ranges from -1 to +1, with higher values indicating better performance of a classifier. We use percentage values (%) to highlight the differences between results.
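The metrics above can be computed directly from confusion-matrix counts; a minimal sketch:

```python
import math

def fpr(tp, tn, fp, fn):
    """False positive rate (Equation 6)."""
    return fp / (fp + tn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (Equation 7)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cv(values):
    """Coefficient of variation (Equation 8): population std over mean."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return sigma / mu
```

A perfect classifier yields FPR = 0 and MCC = 1, while a constant-valued series yields CV = 0.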

5 Results and Analysis

This section analyzes the experimental results to address the research questions.

Table 3: Performance (%) of DLAP with different DL models (P_vul / R_vul / F1 / FPR / MCC).

Project | Linevul                  | Devign                   | Sysevr
Chrome  | 40.4/73.3/52.1/28.4/37.6 | 29.3/85.5/43.7/54.0/26.1 | 27.7/56.8/37.2/39.0/14.6
Android | 34.6/86.2/49.3/41.4/36.1 | 31.7/85.5/46.2/46.7/31.3 | 29.4/80.3/43.1/48.7/25.5
Linux   | 57.1/76.4/65.4/13.9/56.4 | 48.8/66.3/56.3/16.9/44.4 | 27.7/22.6/24.9/14.4/8.8
Qemu    | 84.2/55.1/66.7/1.9/63.9  | 52.8/65.5/58.5/10.7/50.3 | 28.6/10.0/14.8/4.4/9.0

5.1 RQ1: Selection of DL Models

We conducted experiments on four large-scale projects to investigate which category of DL model is suitable for DLAP. The results in Table 3 reveal that Linevul outperforms the others on most datasets and metrics. For instance, on the Chrome dataset, DLAP with Linevul achieves the highest MCC of 37.6%, surpassing Devign's 26.1% and Sysevr's 14.6%. This finding is consistent on the Linux dataset, where it secures an MCC of 56.4%, compared to 44.4% and 8.8% for Devign and Sysevr, respectively. Furthermore, Linevul's precision and F1 scores are notably higher across the datasets, underscoring its robustness in identifying vulnerabilities with greater accuracy and fewer false positives, as evidenced by its lower FPR. Overall, using Linevul surpasses using Devign by an average of 7.2% and 10.5% on the comprehensive evaluation metrics F1 and MCC, respectively, and outperforms Sysevr by an average of 28.4% and 34.0% on the same metrics. This demonstrates that Linevul has superior adaptability and generalizability when integrated into LLMs compared to the other DL models. These results indicate the effectiveness of integrating Linevul into DLAP for detecting vulnerabilities: its superior F1 implies a higher likelihood of detecting actual vulnerabilities, and its MCC, a critical indicator of the quality of binary classification, shows the ability of DLAP with Linevul to handle extremely imbalanced datasets.

[Figure 6: probability density distributions of the detection probabilities of the three DL models.]
Table 4: CV of the DL models' detection probabilities on each project.

DL model | Chrome | Android | Linux | Qemu | Average
Sysevr   | 0.1    | 1.2     | 0.4   | 0.02 | 0.43
Devign   | 0.5    | 1.2     | 2.0   | 2.0  | 1.4
Linevul  | 2.4    | 2.6     | 2.5   | 3.3  | 2.7

To further distinguish which DL model is more suitable as a plug-in for DLAP, we also analyze the intermediate output (detection probability) of the DL models. Table 4 presents the variability of the different DL models across the projects; the Linevul model displays the highest CV. By comparing the probability density distribution plots (Figure 6) and the CV values (Table 4) on the largest project dataset (Google's Chrome), we notice that Linevul yields a more discrete distribution of detection probabilities than the other models. This discrete detection distribution facilitates LLM generation with implicit fine-tuning for downstream detection tasks more effectively.

5.2 RQ2: Comparison with Other Prompting Frameworks

Table 5: Performance (%) of DLAP and baseline prompting frameworks (P_vul / R_vul / F1 / FPR / MCC).

Framework | Chrome                   | Android                  | Linux                    | Qemu
P_Rol     | 24.4/7.2/11.1/5.8/2.3    | 22.5/6.4/10.0/5.6/1.3    | 22.4/6.6/10.2/5.6/1.7    | 22.2/6.9/10.5/4.4/4.2
P_Aux     | 22.7/54.6/32.1/48.6/4.8  | 21.8/63.4/32.5/58.3/4.2  | 24.6/70.2/36.5/52.6/14.1 | 19.3/55.2/28.6/42.1/9.5
P_Cot     | 16.8/5.4/8.1/7.0/2.6     | 31.6/3.1/5.7/1.7/4.0     | 30.7/8.0/12.7/4.4/6.5    | 64.7/38.0/47.8/3.8/43.0
GRACE     | 32.6/37.5/32.6/80.2/11.2 | 25.0/82.6/38.4/74.0/8.5  | 25.0/76.0/37.6/76.0/2.0  | 17.1/93.1/28.9/82.4/10.6
DLAP      | 40.4/73.3/52.1/28.4/37.6 | 34.6/86.2/49.3/41.4/36.1 | 57.1/76.4/65.4/13.9/56.4 | 84.2/55.1/66.7/1.9/63.9

Due to cost constraints associated with OpenAI API calls, we employed the GPT-3.5-turbo-0125 model for vulnerability detection. Table 5 compares the GPT model under the baseline prompting frameworks and under DLAP. The performance of each framework is evaluated on five metrics: precision (P_vul), recall (R_vul), F1-score (F1), false positive rate (FPR), and Matthews correlation coefficient (MCC). DLAP consistently outperforms the other frameworks across nearly all metrics and datasets. Specifically, DLAP achieves the highest precision, recall, F1-score, and MCC values, showcasing its superior ability to identify vulnerabilities accurately with minimal false positives. For instance, on the Chrome dataset, DLAP's precision of 40.4% and recall of 73.3% significantly surpass those of the next best framework, GRACE. Furthermore, DLAP's F1-score reaches 52.1% on Chrome, 49.3% on Android, 65.4% on Linux, and an impressive 66.7% on Qemu, all higher than the baseline frameworks. In terms of FPR, DLAP is moderate across the datasets: although its FPR on the Chrome, Android, and Linux datasets is not as low as that of P_Rol and P_Cot, DLAP is far superior to them on F1 and MCC. Therefore, DLAP's overall effectiveness exceeds the baseline frameworks.

In particular, DLAP's MCC values, which indicate the quality of binary classification, significantly exceed those of the other methods, e.g., 37.6% on Chrome and 63.9% on Qemu, further establishing its superior performance in LLM-based vulnerability detection. DLAP consistently surpasses the best baseline on MCC; given the nature of this correlation coefficient, this suggests that its predictions more accurately reflect the actual distribution and that DLAP generalizes better than the baselines on large datasets.

Overall, the analysis reveals that DLAP not only excels in identifying vulnerabilities with high precision and recall but also maintains a low false positive rate and achieves outstanding overall performance as evidenced by its F1 Scores and MCC values. This demonstrates DLAP’s exceptional effectiveness in harnessing the power of LLM for the critical task of vulnerability detection, which outperforms the capabilities of other prompting frameworks.

5.3 RQ3: Prompting vs. Fine-tuning

Table 6 shows that fine-tuning an LLM on a large project yields a higher F1 than DLAP. However, on small projects with imbalanced data, DLAP performs better. In particular, LLMs cannot be effectively fine-tuned on Qemu because the project has little data, whereas DLAP captures the distribution characteristics of small samples and hence achieves better performance. In addition, fine-tuning an LLM requires taking the model offline and retraining it before use, whereas DLAP needs no retraining downtime: it works as a plug-in that accesses the LLM in real time to augment its vulnerability detection capability. Besides, the computational costs of DLAP and LoRA fine-tuning are compared in Table 7. Fine-tuning a 13B LLM requires close to 40 GB of graphics memory and considerable time; in contrast, DLAP can select a small DL model and train it to fit the target data in less than one hour.

Table 6: Performance (%) of LoRA fine-tuning vs. DLAP (P_vul / R_vul / F1 / FPR / MCC).

Dataset | Fine-tuned Vicuna-13B    | DLAP
Chrome  | 91.4/74.4/82.0/1.8/78.6  | 40.4/73.3/52.1/28.4/37.6
Android | 67.0/35.8/46.7/4.5/40.4  | 34.6/86.2/49.3/41.4/36.0
Linux   | 96.4/55.4/70.3/0.5/68.9  | 57.1/76.4/65.4/14.0/56.4
Qemu    | 99.9/6.7/12.1/0.1/23.4   | 84.2/55.2/66.7/1.9/63.9
Total   | 88.7/43.0/52.8/1.2/52.8  | 54.1/72.8/58.4/21.4/48.5
Table 7: Computational cost of LoRA fine-tuning vs. DLAP (M = memory, T = time, GPU = graphics memory).

Dataset | Fine-Tuning M(MB)/T(h)/GPU(GB) | DLAP M(MB)/T(h)/GPU(GB)
Chrome  | 5.1/11.1/31.2                  | 3.6/0.8/6.3
Android | 4.9/4.2/30.3                   | 4.3/0.5/5.5
Linux   | 4.9/5.5/30.3                   | 3.8/0.4/5.5
Qemu    | 4.8/1.3/28.7                   | 0.9/0.3/2.8

Equation 5 (cf. Section 3.3) indicates that the DL model's training information changes the relaxed attention of the LLM. This results in an implicit fine-tuning that adapts the LLM to a specific detection task [11]. Whether through fine-tuning or in-context learning (ICL), the extent to which a model is adapted to the target task is a crucial factor in stimulating LLMs to perform well. According to the detection results shown in Table 6, DLAP approximates fine-tuning well on the performance evaluation metrics.

To further explain what mechanism induces the LLM to perform implicit fine-tuning and achieve good performance on the target task, we extract the attention layer from the fine-tuned local LLM to calculate the probability of each detection category. Subsequently, we gather the ICL outputs of the LLM with DLAP and calculate the probability of each detection category. The probability distribution over the classes indicates the degree of fine-tuning of the model. Figure 7 shows that the probability distributions of fine-tuning and DLAP are similar, which explains how DLAP enables implicit fine-tuning at a reduced cost.

[Figure 7: probability distributions over detection categories for the fine-tuned LLM and for DLAP-driven ICL.]

In comparison with fine-tuning, Figure8 shows a real example of using DLAP to detect vulnerabilities in Linux. The outcomes, which are easily understandable to developers, closely match the records from the actual issue fix commit. In contrast, the output from a fine-tuned LLM is limited to simple ‘yes’ or ‘no’ responses. DLAP’s results are more comprehensible to developers than those from fine-tuning alone.

[Figure 8: a real example of DLAP detecting a vulnerability in Linux.]

6 Discussion

In this section, we discuss the DL model selection for DLAP and DLAP’s potential generalization capability.

6.1 DL Model Selection for DLAP

Based on the insights gained from RQ1, DL models whose predictive probability distributions over the data are discrete are more suitable as integrated plug-ins for DLAP. We observe that a DL model's utility as an LLM prompt model improves significantly when its predictions are discrete, with a high coefficient of variation ($CV$). Moreover, our experiments highlight the exceptional performance of Transformer-based models in driving the LLM. This advantage could be attributed to the architectural resemblance between Transformer models and the LLM itself: the similarity of their structures allows for seamless integration, enabling the attention-layer parameters derived from Transformer models to play a pivotal role in facilitating implicit fine-tuning within the LLM.
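This selection criterion is easy to compute. The sketch below (the probability values are made up for illustration) ranks two candidate DL models by the coefficient of variation of their predicted probabilities:

```python
import numpy as np

def coefficient_of_variation(probs):
    """CV = std / mean of a model's predicted probabilities; a higher CV
    indicates a more discrete (spread-out) predictive distribution."""
    probs = np.asarray(probs, dtype=float)
    return float(probs.std() / probs.mean())

# Made-up predicted vulnerability probabilities from two candidate DL models.
clustered = [0.48, 0.52, 0.50, 0.49, 0.51]  # indecisive around 0.5: low CV
discrete  = [0.03, 0.97, 0.91, 0.08, 0.95]  # confident, near 0 or 1: high CV
best_plugin = max([clustered, discrete], key=coefficient_of_variation)  # -> discrete
```

Under this criterion, the model with the discrete, confident predictions would be preferred as the DLAP plug-in.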

By leveraging these attention-layer parameters, the LLM dynamically adjusts and refines its internal mechanisms, implicitly adapting itself to the nuances and intricacies of different downstream tasks. This implicit fine-tuning process empowers the LLM to generate more accurate and contextually relevant responses, thereby enhancing its overall performance in various application scenarios.

In summary, our experiments reveal the crucial roles played by both the variation in the DL model's predictive distribution and the architectural resemblance between Transformer models and LLMs. These factors, combined with the implicit fine-tuning facilitated by attention-layer parameters, enable the LLM to adapt to and excel at diverse downstream tasks.

6.2 Generalization Capability of DLAP

The DLAP framework effectively stimulates LLMs to implicitly fine-tune themselves, and this mechanism extends to other software development tasks. By integrating existing static analysis tools and deep learning models, DLAP can be applied to a variety of ASAT tasks, which simplifies the process of adopting it for new challenges. We introduce two scenarios that may extend the applicability of DLAP.

Automated identification of affected libraries from vulnerability data is an ASAT task that determines which libraries in software are related to each reported vulnerability in open vulnerability report sets (e.g., NVD, CVE). The task is formulated as extreme multi-label learning [15, 6]. First, DLAP constructs a sufficient vulnerability description database and combines the descriptions with libraries known to be affected by the reported vulnerabilities to form a COT template library for affected-library identification; the known affected libraries are also used to train a DL model. Then, an existing static tool (fastXML, https://github.com/fastXML/fastXML) and the DL model are used to generate preliminary results for the candidate library list of the project. Finally, by combining these results as a key to query the COT template library, COT prompts can be built to augment the LLMs, potentially making the identification of affected libraries more accurate.
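The retrieval step described above could look like the following sketch. All names, the toy template library, and the report text are hypothetical; fastXML is not invoked here, and its label output (like the DL model's) is simply simulated:

```python
def build_cot_prompt(description, static_labels, dl_labels, cot_library):
    """Union the preliminary label sets from the static tool and the DL model,
    then use them as keys to pull matching COT exemplars into one prompt."""
    candidates = sorted(set(static_labels) | set(dl_labels))
    steps = "\n".join(cot_library[lib] for lib in candidates if lib in cot_library)
    return f"{steps}\nVulnerability report: {description}\nWhich libraries are affected?"

cot_library = {  # toy COT template library keyed by library name
    "log4j":   "Step: the report mentions JNDI lookups, so inspect log4j.",
    "openssl": "Step: the report mentions TLS handshakes, so inspect openssl.",
}
prompt = build_cot_prompt(
    "Remote code execution via JNDI lookup in the logging layer",
    static_labels=["log4j"],         # simulated fastXML output
    dl_labels=["log4j", "openssl"],  # simulated DL model output
    cot_library=cot_library,
)
```

The preliminary labels thus act purely as retrieval keys; the LLM performs the final identification over the assembled prompt.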

Code smell detection is an ASAT task that protects software from technical debt. DL-based code smell detection is a multi-class detection task comprised of several binary classification models, each designed to detect a specific category of code smell [21, 33]. Utilizing DLAP requires the creation of a comprehensive reference library of code smells and a high-quality coding standards library. DLAP then uses static tools (e.g., checkstyle, https://checkstyle.org) and DL models for the specific projects under detection. Like the process described in this paper, the DL model augments the LLMs with prompts for project-specific code smell detection. In the same way, DLAP can be applied to other ASAT tasks that need to combine DL models with LLMs to improve LLM performance on the target tasks.
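The per-category structure could be wired up as in the sketch below. The detectors here are trivial stand-ins for trained binary classifiers, and every name and threshold is hypothetical:

```python
def smell_prompts(code, detectors, reference_library):
    """Run each binary smell detector; for every positive hit, attach the
    matching reference entry to build a smell-specific LLM prompt."""
    return [
        f"{reference_library[smell]}\nCheck this code for {smell}:\n{code}"
        for smell, detect in detectors.items()
        if detect(code)
    ]

# Toy stand-ins for per-category binary classifiers.
detectors = {
    "long method":  lambda code: code.count("\n") > 50,
    "magic number": lambda code: any(ch.isdigit() for ch in code),
}
reference_library = {
    "long method":  "A long method exceeds ~50 lines and should be split.",
    "magic number": "A magic number is an unexplained literal constant.",
}
hits = smell_prompts("timeout = 42\n", detectors, reference_library)  # one hit
```

Each positive detector gates one targeted prompt, mirroring how DLAP's DL plug-in gates project-specific vulnerability prompts.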

7 Threats to Validity

This section analyzes possible threats to validity [48] and our efforts to mitigate their impacts.

Internal Validity. The efficacy of DLAP relies on its core component, the DL models. While it is tolerable for these DL models to introduce judgment biases for input code, a completely erroneous DL model that is irrelevant or detrimental to the task can severely undermine DLAP's performance. Therefore, when employing DLAP, it is crucial to select DL models capable of addressing the specific objectives of the task so as to effectively augment the performance of LLMs. Besides, due to the closed-source nature of certain LLMs (e.g., GPT-3.5-turbo), their internal structures and the specific fine-tuning methods they employ remain unknown. Therefore, for our experiments, we use an open-source LLM (Llama-13b) for the comparative fine-tuning studies.

Construct Validity. The relaxed attention of the LLMs changes under the stimulation of DLAP according to Equation 5. We define this stimulation as the implicit fine-tuning of LLMs caused by DLAP to adapt to the features of the target project. Because of the limitations of observing the internal states of LLMs, we cannot strictly demonstrate that the stimulation produces gradient-descent loss optimization on the target classification task. Instead of a mathematical demonstration, we present our results alongside intermediate outputs from the fine-tuned contrasts, validating the existence of the implicit fine-tuning mechanism through experimental data. These visualized experimental results mitigate the construct validity threat to some extent.

External Validity. To verify DLAP, our templates must drive an LLM to complete vulnerability detection, and performance on this task is used to measure the effectiveness of our method. Consequently, when the LLM differs from the one selected in this experiment, the results of using DLAP will differ. We therefore identify the choice of LLM as an external validity threat to this work. Considering both cost and model performance, we chose the least expensive model among the current state-of-the-art LLMs, GPT-3.5-turbo-0125. By using the best model we make the best use of DLAP, and by specifying the exact model we enable other work to reproduce the same level of improvement when using DLAP.

8 Conclusion

In this paper, we propose DLAP, a bespoke prompting framework for ASAT tasks that delivers superior and stable performance in software vulnerability detection with results easily understandable to developers. Experiments show the effectiveness of augmenting LLMs with DL models to stimulate adaptive implicit fine-tuning. This enables LLMs to exceed both state-of-the-art DL solutions and LLMs with alternative prompting frameworks in vulnerability detection. Through experiments, we also find that the pre-trained knowledge of LLMs combines the outputs of all parts of DLAP to achieve good performance. In the future, we will apply DLAP to more ASAT tasks to explore how it generalizes.

References

  • Arakelyan etal. [2023]Arakelyan, S., Das, R., Mao, Y., Ren, X., 2023.Exploring distributional shifts in large language models for code analysis, in: Proceedings of the 22nd Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL. pp. 16298–16314.
  • Bai etal. [2022]Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., etal., 2022.Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073 .
  • Brown etal. [2020]Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., etal., 2020.Language models are few-shot learners.Advances in Neural Information Processing Systems 33, 1877–1901.
  • Chakraborty etal. [2022]Chakraborty, S., Krishna, R., Ding, Y., Ray, B., 2022.Deep learning based vulnerability detection: Are we there yet.IEEE Transactions on Software Engineering 48, 3280–3296.
  • Chen etal. [2021]Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., etal., 2021.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374 .
  • Chen etal. [2020]Chen, Y., Santosa, A.E., Sharma, A., Lo, D., 2020.Automated identification of libraries from vulnerability data, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (SEIP), ACM. pp. 90–99.
  • Cheshkov etal. [2023]Cheshkov, A., Zadorozhny, P., Levichev, R., 2023.Evaluation of chatgpt model for vulnerability detection.arXiv preprint arXiv:2304.07232 .
  • Chowdhery etal. [2023]Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., etal., 2023.Palm: Scaling language modeling with pathways.Journal of Machine Learning Research 24, 1–113.
  • Christakis and Bird [2016]Christakis, M., Bird, C., 2016.What developers want and need from program analysis: An empirical study, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), ACM. pp. 332–343.
  • Croft etal. [2023]Croft, R., Babar, M.A., Kholoosi, M.M., 2023.Data quality for software vulnerability datasets, in: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), IEEE. pp. 121–133.
  • Dai etal. [2023]Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., Wei, F., 2023.Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers, in: Proceedings of the 2023 ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo), Association for Computational Linguistics. pp. 4005–4019.
  • Fan etal. [2020]Fan, J., Li, Y., Wang, S., Nguyen, T.N., 2020.A c/c++ code vulnerability dataset with code changes and cve summaries, in: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories (MSR), ACM. pp. 508–512.
  • Fu and Tantithamthavorn [2022]Fu, M., Tantithamthavorn, C., 2022.Linevul: A transformer-based line-level vulnerability prediction, in: Proceedings of the 19th IEEE/ACM International Conference on on Mining Software Repositories (MSR), ACM. pp. 608–620.
  • Gonzalez etal. [2021]Gonzalez, D., Zimmermann, T., Godefroid, P., Schäfer, M., 2021.Anomalicious: Automated detection of anomalous and potentially malicious commits on github, in: Proceedings of 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE. pp. 258–267.
  • Haryono etal. [2022]Haryono, S.A., Kang, H.J., Sharma, A., Sharma, A., Santosa, A., Yi, A.M., Lo, D., 2022.Automated identification of libraries from vulnerability data: Can we do better?, in: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (ICPC), ACM. pp. 178–189.
  • Hsieh etal. [2019]Hsieh, Y.G., Niu, G., Sugiyama, M., 2019.Classification from positive, unlabeled and biased negative data, in: Proceedings of the 36th ACM International Conference on Machine Learning (ICML), PMLR. pp. 2820–2829.
  • Hu etal. [2022]Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., etal., 2022.Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations (ICLR).
  • Jin etal. [2023]Jin, M., Shahriar, S., Tufano, M., Shi, X., Lu, S., Sundaresan, N., Svyatkovskiy, A., 2023.Inferfix: End-to-end program repair with llms.arXiv preprint arXiv:2303.07263 .
  • Kang etal. [2022]Kang, H.J., Aw, K.L., Lo, D., 2022.Detecting false alarms from automatic static analysis tools: How far are we?, in: Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE), ACM. pp. 698–709.
  • Katsadouros etal. [2023]Katsadouros, E., Patrikakis, C.Z., Hurlburt, G., 2023.Can large language models better predict software vulnerability?IT Professional 25, 4–8.
  • Lewowski and Madeyski [2022]Lewowski, T., Madeyski, L., 2022.How far are we from reproducible research on code smell detection? a systematic literature review.Information and Software Technology 144, 106783.
  • Li etal. [2023]Li, J., Li, G., Li, Y., Jin, Z., 2023.Enabling programming thinking in large language models toward code generation.arXiv preprint arXiv:2305.06599 .
  • Li etal. [2017]Li, X., Chang, X., Board, J.A., Trivedi, K.S., 2017.A novel approach for software vulnerability classification, in: Proceedings of the 64th IEEE/ACM International Conference on Reliability and Maintainability Symposium (RAMS), IEEE. pp. 1–7.
  • Li etal. [2021]Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z., 2021.Sysevr: A framework for using deep learning to detect software vulnerabilities.IEEE Transactions on Dependable and Secure Computing 19, 2244–2258.
  • Li etal. [2018]Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S., Deng, Z., Zhong, Y., 2018.Vuldeepecker: A deep learning-based system for vulnerability detection, in: Proceedings of the 25th ACM Annual Network and Distributed System Security Symposium (NDSS), The Internet Society.
  • Lin etal. [2020a]Lin, G., Wen, S., Han, Q.L., Zhang, J., Xiang, Y., 2020a.Software vulnerability detection using deep neural networks: a survey.Proceedings of the IEEE 108, 1825–1848.
  • Lin etal. [2020b]Lin, G., Xiao, W., Zhang, J., Xiang, Y., 2020b.Deep learning-based vulnerable function detection: A benchmark, in: Proceedings of the 21st ACM Information and Communications Security: International Conference (ICICS), Springer. pp. 219–232.
  • Liu etal. [2022]Liu, X., Ji, K., Fu, Y., Tam, W., Du, Z., Yang, Z., Tang, J., 2022.P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Association for computational Linguistics. pp. 61–68.
  • Liu etal. [2023]Liu, X., Tan, Y., Xiao, Z., Zhuge, J., Zhou, R., 2023.Not the end of story: An evaluation of chatgpt-driven vulnerability description mappings, in: Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics. pp. 3724–3731.
  • Lu etal. [2024]Lu, G., Ju, X., Chen, X., Pei, W., Cai, Z., 2024.Grace: Empowering llm-based software vulnerability detection with graph structure and in-context learning.Journal of Systems and Software 212, 112–236.
  • Nachtigall etal. [2022]Nachtigall, M., Schlichtig, M., Bodden, E., 2022.A large-scale study of usability criteria addressed by static analysis tools, in: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), ACM. pp. 532–543.
  • Ozturk etal. [2023]Ozturk, O.S., Ekmekcioglu, E., Cetin, O., Arief, B., Hernandez-Castro, J., 2023.New tricks to old codes: Can ai chatbots replace static code analysis tools?, in: Proceedings of the 7th ACM European Interdisciplinary Cybersecurity Conference (EICC), ACM. pp. 13–18.
  • Pecorelli etal. [2019]Pecorelli, F., DiNucci, D., DeRoover, C., DeLucia, A., 2019.On the role of data balancing for machine learning-based code smell detection, in: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), ACM. pp. 19–24.
  • Purba etal. [2023]Purba, M.D., Ghosh, A., Radford, B.J., Chu, B., 2023.Software vulnerability detection using large language models, in: Proceedings of the 34th IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), IEEE. pp. 112–119.
  • Shi etal. [2023]Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E.H., Schärli, N., Zhou, D., 2023.Large language models can be easily distracted by irrelevant context, in: Proceedings of the 40th ACM International Conference on Machine Learning (ICML), PMLR. pp. 31210–31227.
  • Steenhoek etal. [2023]Steenhoek, B., Rahman, M.M., Jiles, R., Le, W., 2023.An empirical study of deep learning models for vulnerability detection, in: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), pp. 2237–2248.
  • Telang and Wattal [2007]Telang, R., Wattal, S., 2007.An empirical analysis of the impact of software vulnerability announcements on firm stock price.IEEE Transactions on Software engineering 33, 544–557.
  • Thapa etal. [2022]Thapa, C., Jang, S.I., Ahmed, M.E., Camtepe, S., Pieprzyk, J., Nepal, S., 2022.Transformer-based language models for software vulnerability detection, in: Proceedings of the 38th ACM Annual Computer Security Applications Conference (ACSAC), ACM. pp. 481–496.
  • Tomas etal. [2019]Tomas, N., Li, J., Huang, H., 2019.An empirical study on culture, automation, measurement, and sharing of devsecops, in: 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), IEEE. pp. 1–8.
  • Touvron etal. [2023]Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., etal., 2023.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971 .
  • Tsipenyuk etal. [2005]Tsipenyuk, K., Chess, B., McGraw, G., 2005.Seven pernicious kingdoms: A taxonomy of software security errors.IEEE Security & Privacy 3, 81–84.
  • Wei etal. [2022]Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., etal., 2022.Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems 35, 24824–24837.
  • Wei etal. [2021]Wei, Y., Sun, X., Bo, L., Cao, S., Xia, X., Li, B., 2021.A comprehensive study on security bug characteristics.Journal of Software: Evolution and Process 33, e2376.
  • White etal. [2023]White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C., 2023.A prompt pattern catalog to enhance prompt engineering with chatgpt.arXiv preprint arXiv:2302.11382 .
  • Zhang etal. [2023a]Zhang, C., Liu, H., Zeng, J., Yang, K., Li, Y., Li, H., 2023a.Prompt-enhanced software vulnerability detection using chatgpt.arXiv preprint arXiv:2308.12697 .
  • Zhang etal. [2023b]Zhang, K., Li, Z., Li, J., Li, G., Jin, Z., 2023b.Self-edit: Fault-aware code editor for code generation.arXiv preprint arXiv:2305.04087 .
  • Zhang etal. [2023c]Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., etal., 2023c.Siren’s song in the ai ocean: A survey on hallucination in large language models.arXiv preprint arXiv:2309.01219 .
  • Zhou etal. [2016]Zhou, X., Jin, Y., Zhang, H., Li, S., Huang, X., 2016.A map of threats to validity of systematic literature reviews in software engineering, in: Proceedings of the 23rd IEEE International Conference on Asia-Pacific Software Engineering Conference (APSEC), IEEE. pp. 153–160.
  • Zhou etal. [2019]Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y., 2019.Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks.Advances in neural information processing systems 32, 10197–10207.
  • Zou et al. [2019]Zou, D., Wang, S., Xu, S., Li, Z., Jin, H., 2019. μvuldeepecker: A deep learning-based system for multiclass vulnerability detection. IEEE Transactions on Dependable and Secure Computing 18, 2224–2236.

Appendix

The appendix analyzes the composition and causes of implicit fine-tuning. The implicit fine-tuning process is described as follows. We set $\mathcal{C}(x) = COT(x)$ as the input representation of the context, $x$ as the target task data input, and $\mathcal{P}(x) = ICL(x)$ as the input representation of the DLAP prompt query. $W_Q, W_K, W_V$ are the projection matrices for computing the attention queries, keys, and values, and $\mathbf{q} = W_Q x$ is the attention query vector. In the fine-tuning process of DLAP, the attention of the LLM is represented as Equation 9:

$$\mathcal{A}(\mathbf{q}) = \operatorname{Attention}(V, K, \mathbf{q}) = W_V[\mathcal{P}(x); \mathcal{C}(x)]\,\operatorname{softmax}\!\left(\frac{\left(W_K[\mathcal{P}(x); \mathcal{C}(x)]\right)^{T}\mathbf{q}}{\sqrt{d}}\right) \tag{9}$$

To simplify this process, we eliminate the nonlinear function $\operatorname{softmax}$ and the related scaling factor $\sqrt{d}$, which facilitates the analysis of the attention changes themselves. We obtain the approximate relaxed linear attention in Equation 10.

$$\begin{aligned}
\mathcal{A}(\mathbf{q}) &\approx W_V[\mathcal{P}(x); \mathcal{C}(x)]\left(W_K[\mathcal{P}(x); \mathcal{C}(x)]\right)^{T}\mathbf{q} \\
&= W_V\mathcal{C}(x)\left(W_K\mathcal{C}(x)\right)^{T}\mathbf{q} + W_V\mathcal{P}(x)\left(W_K\mathcal{P}(x)\right)^{T}\mathbf{q} \\
&= \widetilde{\mathcal{A}}(\mathbf{q})
\end{aligned} \tag{10}$$

We define the context prompts (information from the COT library) as the initial parameters $W_{\text{init}}$ to be updated by the attention layer, given in Equation 11:

$$W_{\text{init}} = W_V[\mathcal{C}(x)]\left(W_K[\mathcal{C}(x)]\right)^{T} \tag{11}$$

Following prior research [11], we use the dual form of Transformer attention derived there in reverse. The adaptive implicit fine-tuning of the LLM stimulated by DLAP for a specific project can thus be written as Equation 12:

$$\begin{aligned}
\widetilde{\mathcal{A}}(\mathbf{q}) &= W_{\text{init}}\mathbf{q} + W_V[\mathcal{P}(x)]\left(W_K[\mathcal{P}(x)]\right)^{T}\mathbf{q} \\
&= W_{\text{init}}\mathbf{q} + \operatorname{LinearAttn}\left(W_V[\mathcal{P}(x)], W_K[\mathcal{P}(x)], \mathbf{q}\right) \\
&= W_{\text{init}}\mathbf{q} + \sum_i\left(\left(W_V[\mathcal{P}(x)]_i\right)\otimes\left(W_K[\mathcal{P}(x)]_i\right)\right)\mathbf{q} \\
&= W_{\text{init}}\mathbf{q} + \Delta W_{\mathcal{P}(x)}\mathbf{q} \\
&= \left(W_{\text{init}} + \Delta W_{\mathcal{P}(x)}\right)\mathbf{q}
\end{aligned} \tag{12}$$

Through Equation 12, we conclude that the relaxed attention mechanism is influenced by the prompt $\mathcal{P}(x)$, which acts as an implicit parameter update $\Delta W_{\mathcal{P}(x)}$ applied on top of $W_{\text{init}}$.
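Once the softmax and scaling factor are dropped, this decomposition is exact, which can be verified numerically. The sketch below uses random matrices with assumed toy dimensions:

```python
import numpy as np

# Check Equations 10-12: relaxed linear attention over [P(x); C(x)] splits
# exactly into W_init q (context part) plus Delta W_P q (implicit update).
rng = np.random.default_rng(0)
d, n_ctx, n_prompt = 8, 4, 3
W_V, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
C = rng.normal(size=(d, n_ctx))     # context representations C(x)
P = rng.normal(size=(d, n_prompt))  # DLAP prompt representations P(x)
q = rng.normal(size=d)              # attention query vector

H = np.hstack([P, C])
full = W_V @ H @ (W_K @ H).T @ q            # Equation 10, first line
W_init  = W_V @ C @ (W_K @ C).T             # Equation 11
delta_W = W_V @ P @ (W_K @ P).T             # Delta W_P(x) in Equation 12
assert np.allclose(full, (W_init + delta_W) @ q)  # decomposition is exact
```

The split works because $HH^{T} = \mathcal{P}(x)\mathcal{P}(x)^{T} + \mathcal{C}(x)\mathcal{C}(x)^{T}$ for the concatenated representation $H = [\mathcal{P}(x); \mathcal{C}(x)]$.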
