DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection (2024)

[orcid=0000-0002-3263-1275]

Yanjing Yang, Xin Zhou (zhouxin@nju.edu.cn), Runfeng Mao, Jinwei Xu, Lanxin Yang, Yu Zhang, Haifeng Shen, He Zhang — Software Institute, Nanjing University, China; Faculty of Science and Engineering, Southern Cross University, Australia

Abstract

Software vulnerability detection is generally supported by automated static analysis tools, which have recently been reinforced by deep learning (DL) models. However, despite the superior performance of DL-based approaches over rule-based ones in research, applying DL approaches to software vulnerability detection in practice remains a challenge due to the complex structure of source code, the black-box nature of DL, and the domain knowledge required to understand and validate the black-box results for addressing tasks after detection. Conventional DL models are trained on specific projects and hence excel at identifying vulnerabilities in those projects but not in others. Models with poor vulnerability detection performance impair downstream tasks such as localization and repair. More importantly, these models do not provide explanations for developers to comprehend detection results. In contrast, Large Language Models (LLMs) have made much progress in addressing these issues by leveraging prompting techniques, but their performance in identifying vulnerabilities remains unsatisfactory. This paper contributes DLAP, a Deep Learning Augmented LLM Prompting framework that combines the best of both DL models and LLMs to achieve exceptional vulnerability detection performance. Experimental evaluation results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts, as well as fine-tuning, on multiple metrics.

keywords:

Vulnerability Detection · Large Language Model · Prompt Engineering · Framework

1 Introduction

Software vulnerability detection is paramount for safeguarding system security and individual privacy. As the cyber environment grows increasingly complex and attack techniques evolve rapidly, various threats to software systems have long puzzled software organizations[26, 36]. In particular, vulnerabilities are among the critical threats, potentially resulting in information leakage, data tampering, and system breakdowns[37]. Vulnerability detection aims to identify vulnerabilities, mitigate their impact, and prevent malicious attacks[14]. Moreover, it helps to enhance software quality, usability, and trustworthiness. Vulnerability detection has thus become a must-have in modern software development.

Many automated static analysis tools (ASATs) have been applied to vulnerability detection[24, 26]. However, on the one hand, the outputs of ASATs are difficult to validate, as doing so requires developers to have considerable expertise and experience in vulnerability detection[31]; on the other hand, the performance of ASATs is poor (e.g., high false positive rates) because they rely on string pattern matching[9, 19]. In recent years, the advancement of deep learning (DL) in natural language processing has inspired researchers to integrate DL models (in this paper, 'DL model' refers to conventional deep learning models other than large language models such as GPT, Copilot, and Llama) into ASATs. These modern ASATs generally outperform their conventional counterparts in vulnerability detection[24, 36]. However, DL models that perform well on experimental datasets may suffer severe performance degradation in real-world projects, mainly because of the complexity of source code structure and the concealment of vulnerability characteristics[4]. Using ASATs with DL models to detect vulnerabilities affects a collection of downstream tasks, including but not limited to vulnerability validation, localization, and repair. Moreover, checking the vulnerabilities indicated by DL models is challenging for the developers responsible for it[39].

In recent years, Large Language Models (LLMs) such as ChatGPT[3] and Copilot[5] have shown prominent performance in various tasks[46, 22, 18]. However, LLMs have not achieved satisfactory results in vulnerability detection[34]. Dai et al.[11] indicated that one of the main reasons is the inappropriate use of LLMs: LLMs are pre-trained on vast amounts of data, but not all of it benefits downstream tasks such as vulnerability detection[16, 29]. There are two techniques to address this problem: fine-tuning and prompt engineering. Fine-tuning is commonly used but requires significant computational resources and time. Prompts, in contrast, allow users to interact with LLMs iteratively to produce bespoke results[42, 11]. However, as detection performance is highly susceptible to prompts, a generic prompting framework cannot achieve satisfactory performance[1]. Prompt engineering adapts LLMs to a specific downstream task and generates customized outputs[42, 11]. Moreover, prompt engineering can work jointly with fine-tuning, making it a cost-effective and promising technique for vulnerability detection[35].

Previous work has utilized LLMs for vulnerability detection using various prompting frameworks[45, 11, 34, 30]. However, existing prompts provide limited information to LLMs, offering little help in improving LLM performance on real-world projects. To address this problem, we propose a bespoke prompting framework, DLAP (data and materials: https://github.com/Yang-Yanjing/DLAP.git). Although DL models may not achieve satisfactory performance across multiple projects, they perform well within a single project. The core idea of DLAP is to use DL models pre-trained for the target project to stimulate adaptive implicit fine-tuning of LLMs. We select the most suitable DL model among three categories as the plugin to augment DLAP. This is implemented through two state-of-the-art prompt types: the In-Context Learning (ICL) prompt and the Chain-of-Thought (COT) prompt. On the one hand, the ICL prompt uses locality-sensitive hashing (LSH) to sample candidate code fragments that are similar to the input code; a pre-trained DL model then provides the prediction probabilities of these candidates, and the combination of the candidate fragments and their corresponding probabilities forms the ICL prompt for the input code. On the other hand, the COT prompt synthesizes the results from static scanning tools and pre-trained DL models as queries. Following these queries, DLAP locates the corresponding COT templates within the detection step template library we constructed based on the Common Weakness Enumeration (CWE, https://cwe.mitre.org), and uses these templates to generate customized detection COT prompts for the input codes. This stimulates LLMs to conduct implicit fine-tuning, achieving better performance in vulnerability detection and providing supplementary information that facilitates the inspection and comprehension of detection results.

We conduct experiments using four large-scale projects with more than 40,000 examples to evaluate DLAP. We first run experiments to select the most suitable DL model for DLAP: we integrate various DL models into DLAP and compare their results. The results show that combining Linevul with an LLM outperforms the other DL models by 15% across all evaluation metrics. We then use Linevul to generate prompts within DLAP and compare DLAP against state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts. The results show that DLAP surpasses the baselines across all metrics for each project, achieving a 10% higher F1 score and a 20% higher Matthews Correlation Coefficient (MCC), indicating that DLAP is more effective for vulnerability detection. Finally, we compare DLAP with the most prevalent fine-tuning techniques to explore the effectiveness of DLAP versus fine-tuning. The results reveal that DLAP achieves 90% of the performance of an extensive fine-tuning process at a lower cost and even outperforms fine-tuning on some metrics. Moreover, the DLAP-driven LLM generates more explanatory text than a fine-tuned model, which is important in helping developers use ASATs for vulnerability detection tasks.

The main contributions of this paper are as follows.

  • We propose DLAP, a bespoke LLM prompting framework for vulnerability detection. DLAP combines the advantages of DL models and LLMs while overcoming their respective shortcomings. Additionally, DLAP has the potential to be adapted for other ASAT tasks.

  • We conduct rigorous experiments to demonstrate the effectiveness of selecting appropriate DL models for DLAP and showcase the exceptional vulnerability detection performance over state-of-the-art prompting frameworks.

  • We empirically demonstrate the advantages of prompting over fine-tuning for vulnerability detection in terms of detection accuracy, cost-effectiveness, and explanations.

The rest of the paper is organized as follows. Section 2 reviews the background and related work. Section 3 delineates the design of DLAP. Section 4 presents the experimental design and parameter settings of DLAP, followed by results and analysis in Section 5. Section 6 discusses DLAP's generalization capability and DL model selection. Finally, we present threats to validity in Section 7 and conclude this paper in Section 8.

2 Background and Related Work

This section describes the related work on vulnerability detection and the background of prompt engineering for LLMs.

2.1 Vulnerability Detection

While there is a plethora of work on this topic, we focus on vulnerability detection enhanced by DL and LLMs.

2.1.1 Deep Learning for Vulnerability Detection

Vulnerability detection has received considerable attention in recent years. Lin et al.[27] proposed a framework that incorporates one project and 12 DL models for slice-level vulnerability detection. Zhou et al.[49] proposed Devign, which uses graph representations of the input and performed better than approaches using code tokens directly. Li et al. developed a series of DL-based approaches, including VulDeePecker, μVulDeePecker, and SySeVR[24, 25, 50], completing the construction of a DL framework for vulnerability detection. Despite achieving advanced results in experimental setups, generalization issues remain in practical applications. With the advent of transformer-based networks and language models, researchers have started applying these advanced NLP techniques to vulnerability detection. Fu and Tantithamthavorn[13] applied RoBERTa as a pre-training model and fine-tuned it on subsequent vulnerability detection tasks, achieving the best experimental performance in both function-level and line-level vulnerability prediction.

Chakraborty et al.[4] found that the performance of several DL-based approaches dropped by an average of 73% on datasets built from multiple real-world projects, highlighting the need for further research into cross-project vulnerability detection. Steenhoek et al.[36] conducted an empirical study to demonstrate the variability between runs of a model and the low agreement among DL models’ outputs and studied interpretation models to guide future studies.

2.1.2 Large Language Model for Vulnerability Detection

The outstanding performance in dialogue, code generation, and machine translation of LLMs has sparked the interest of researchers and practitioners in applying LLMs to software security. Katsadouros et al.[20] highlighted the potential of LLMs to predict software vulnerabilities, emphasizing their advantage over traditional static methods. Thapa et al.[38] discovered that transformer-based LLMs outperformed conventional DL-based models. Zhang et al.[45] enhanced the effectiveness of ChatGPT in software vulnerability detection through innovative prompt designs and leveraging the model’s ability to memorize multi-round dialogue.

However, according to Cheshkov et al.[7], ChatGPT and GPT-3 do not outperform current tools in Java code vulnerability detection. Meanwhile, Liu et al.[29] emphasized that ChatGPT cannot replace professional security engineers in vulnerability analysis, indicating that closed-source LLMs are not the end of the story. These findings suggest that the performance of LLMs in vulnerability detection leaves much to be desired. The potential false positives and hallucinations generated by LLMs in specific applications[47] are attributable to the extensive unconstrained training data and the multitude of training parameters. Consequently, it is essential to fine-tune an LLM before deploying it for specific tasks. Lu et al.[30] proposed GRACE, which processes code structure using CodeT5, combines semantic with syntactic features to conduct similarity searches, and utilizes in-context learning prompts to drive the LLM beyond all baseline DL models on complex real-world datasets.

As indicated above, the performance of LLMs remains unsatisfactory, accompanied by a high false positive rate. In this study, GRACE[30], along with other evaluated prompts[45], serves as the baseline for comparison and evaluation. To achieve better performance in vulnerability detection, we select SySeVR[24], Devign[49], and Linevul[13] as the augmenting components of our framework.

2.2 Prompt Engineering for Large Language Models

The increase in the number of parameters in LLMs raises the cost of fine-tuning them. Low-cost approaches such as Low-Rank Adaptation of Large Language Models (LoRA)[17] and P-tuning[28] have significantly reduced this cost, but for some applications it is still considerable: for example, fine-tuning an LLM with 33B parameters requires two high-precision 40G GPUs[17]. The LLMs reported to achieve excellent performance all have more than one hundred billion parameters, which demands substantial computing resources. Research[3, 8, 2] shows that LLMs are transformer-based models: different inputs change the behavior of the attention layers within their architecture, so constructing high-quality prompts can help LLMs provide satisfactory answers for specific target tasks. In contrast, inappropriate prompts impinge on the models' attention and may mislead LLMs into producing hallucinations[47]. COT[42] and ICL[11] prompting are currently the most effective approaches to prompt engineering: COT prompting decomposes a target problem into steps to prompt LLMs to provide answers, while ICL prompts LLMs to deliver correct answers by referring to similar questions. In this paper, DLAP integrates both approaches to provide appropriate prompts that drive LLMs for vulnerability detection.

[Figure 1: Overview of the DLAP framework]

3 The Proposed Framework: DLAP

This section describes the design of DLAP, which consists of two prompt techniques.

3.1 Motivation

Vulnerability detection can be formulated as a binary classification problem. Given a vulnerability dataset $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$, where $x_i$ is the source code of a function and $y_i$ is the ground truth (0 – not vulnerable, 1 – vulnerable), detection models are expected to learn this mapping automatically. One of the main processes of detection models is input (source code) representation. Specifically, source code can be represented as semantic tokens, abstract syntax trees (ASTs), data/control flow graphs (DFGs/CFGs), or other formats. Conventional DL models use only a single format as input, which may miss useful information. LLMs are capable of combining multiple representations, making them a more promising technique for vulnerability detection. When leveraging LLMs for vulnerability detection, it is fundamental to make them understand domain and task knowledge, as LLMs are trained on general corpora; using LLMs for vulnerability detection directly can thus be expected to yield unsatisfactory results. Therefore, LLMs require fine-tuning or prompt engineering to address this task.

Fine-tuning is an intuitive technique to adapt the parameters $\mathcal{W}$ of LLMs to a downstream task (i.e., vulnerability detection) on $\mathcal{D}_o^E$ to achieve better results, which can be described as Equation 1.

$\mathcal{W} = \operatorname{argmin} \mathcal{L}\left(Y - \mathcal{M}_{\mathcal{W}}(\mathcal{P}(X))\right)$   (1)

where $Y$ is the ground truth (label) for function-level code and $\mathcal{L}$ is the loss function. By fine-tuning the weight parameters $\mathcal{W}$ of LLMs, the predictive probabilities are expected to move closer to the ground truth. However, fine-tuning is cost-intensive[17, 44]. For example, even with LoRA, one of the most efficient fine-tuning techniques, fine-tuning an LLM with only 13B parameters requires approximately 80G of GPU memory and a substantial amount of time.

Prompt engineering is a newer technique to augment LLMs. LLMs can incorporate various inputs and generate answers accordingly; therefore, they can be prompted[3, 8, 2]. Technically, we use $\mathcal{D}_s^T$ and $\mathcal{D}_o^E$ to denote the pretraining and testing sets, respectively. When using prompt engineering $\mathcal{P}(\cdot)$ for vulnerability detection, LLMs accept a set of $\mathcal{P}(X)$ as inputs and output the estimated probabilities, where $X$ is a collection of examples from $\mathcal{D}_o^E$. One cost-effective prompting template $\mathcal{P}$ for vulnerability detection can be described as Equation 2.

$\mathcal{P} = \operatorname{argmin} \mathcal{L}\left(Y - \mathcal{M}_{\mathcal{W}}(\mathcal{P}(X))\right)$   (2)

According to Liu et al.[28], Equation 2 can achieve the same effects as Equation 1. That is, both prompt engineering and fine-tuning can bring the predictive probabilities close to the ground truths. In the following subsections, we elaborate on DLAP, which leverages prompt engineering for vulnerability detection.

3.2 Framework Overview

DLAP leverages the attention mechanisms within LLMs, incorporating selectively trained DL models as enhancements. Through in-context learning (ICL), this subtly refines LLMs, making them more adept at specific projects; in addition, DLAP's use of chain-of-thought (COT) prompts enables LLMs to discard incorrect generative paths effectively. Consequently, DLAP enhances LLMs' capabilities in detection tasks, ensuring robust performance without incurring significant costs. ICL can stimulate the attention layer of an LLM to adapt to the downstream detection task, which is defined by [11] as implicit fine-tuning. As with general fine-tuning, implicit fine-tuning can drive LLMs to adapt to downstream tasks and achieve better performance, and well-designed prompts stimulate LLMs to perform better in downstream detection tasks. The idea behind the proposed DLAP framework is to use DL models to augment LLMs by constructing appropriate prompts that stimulate implicit fine-tuning of the LLMs, adapting them to vulnerability detection tasks. In this way, DLAP reduces the performance degradation caused by hallucination and data distribution differences.

As shown in Figure 1, DLAP is composed of two main parts: (1) Part I, the construction of in-context learning prompts augmented by DL models, and (2) Part II, the generation of bespoke COT prompts to augment LLMs. In Part I, we employ DL models to generate detection probabilities for input codes and select candidate codes based on similarity; the combination of candidate codes and their corresponding probabilities forms the ICL prompt for detection. In Part II, we combine the results of DL models and static tools as key-value pairs to query pre-defined templates in a preset COT template library. Based on the characteristics of each input sample, we complete the chain of thought, generating COT prompts for detection. These two parts are introduced in Section 3.3 and Section 3.4, respectively. In Section 3.5, we show an example of synergizing the two prompt types to generate the final DLAP prompts.

3.3 In-Context Learning Prompts Construction

According to the earlier point, LLMs encapsulate vast knowledge through their expansive weight structures. Firstly, we select a pre-trained DL model using the training sets. For new projects, DL models can also be built from newly collected samples based on existing research[13, 24, 49]. Then, to create the appropriate in-context examples, the most similar code candidates are retrieved from the training set using locality-sensitive hashing (LSH), an efficient similarity search algorithm used in Retrieval-Augmented Generation (RAG). Although LSH only captures the surface similarity of code segments, a prompting framework cannot afford to spend much time generating prompts, so an efficient way to sample multiple codes as a candidate set is necessary.
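The candidate retrieval step above can be sketched with a minimal, self-contained MinHash-based LSH. This is an illustrative implementation, not DLAP's exact algorithm: the shingle size, permutation count, and band count are assumed values chosen for the example.

```python
import hashlib
import random

def shingles(code, k=3):
    """Split code into overlapping k-token shingles (illustrative tokenization)."""
    tokens = code.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(shingle_set, num_perm=32, seed=42):
    """MinHash signature: the minimum of a salted hash per 'permutation'."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [
        min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for salt in salts
    ]

def lsh_candidates(query_code, corpus, num_perm=32, bands=8):
    """Return corpus entries whose signature collides with the query in any band."""
    rows = num_perm // bands
    q_sig = minhash_signature(shingles(query_code), num_perm)
    buckets = {}  # (band index, band slice) -> corpus indices hashed there
    for idx, code in enumerate(corpus):
        sig = minhash_signature(shingles(code), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    hits = set()
    for b in range(bands):
        key = (b, tuple(q_sig[b * rows:(b + 1) * rows]))
        hits.update(buckets.get(key, []))
    return [corpus[i] for i in sorted(hits)]
```

In practice the corpus signatures would be indexed once offline, so each query costs only one signature computation plus a handful of bucket lookups, which is what makes LSH fast enough for prompt-time retrieval.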

Following Dai et al.[11], we use in reverse the dual form of transformer attention that they derived. The adaptive implicit fine-tuning on the attention layer $\widetilde{\mathcal{A}}$ of LLMs, stimulated by DLAP for a specific project, can therefore be written as Equation 3; please refer to the Appendix for details.

$\widetilde{\mathcal{A}}(\mathbf{q}) = (W_{\text{init}} + \Delta W_{ICL}(x))\,\mathbf{q}$   (3)

We train the DL model $\mathcal{M}$ with the training data information $\text{info}$, which captures the relationship between the project data and the labels. The DL model then generates a detection probability $\operatorname{Probs_{ICL}}$ for a detection object $x$, as shown in Equation 4.

$\operatorname{Probs_{ICL}}(Obj_{\text{info}}) = \mathcal{M}(Obj_{\text{info}})(x)$   (4)

The probabilities output by the DL model represent characteristics of the input codes. DLAP uses these probabilities to construct ICL prompts that augment LLMs. We then obtain the relaxed attention representation $\widetilde{\mathcal{A}}$ of the LLM in Equation 5.

$\widetilde{\mathcal{A}}(\mathbf{q}) = \left(W_{\text{init}} + \Delta W(\operatorname{func}(\operatorname{Probs_{ICL}}(Obj_{\text{info}})))\right)\mathbf{q}$   (5)

Equation 5 indicates that the relaxed attention of the LLM is related to the probabilities output by the selected DL model, and these probabilities are in turn related to the training project data. This results in an implicit fine-tuning of the LLM to adapt to project-specific information[11]. We explain this adaptive process in more detail in the result analysis section through a comparative experiment.

Compared to conventional fine-tuning methods, DLAP does not require excessive resource consumption to update the parameters of LLMs; the ICL prompts instead update the output of the attention layer in the LLMs. As the example in Figure 2 shows, the ICL prompts of DLAP stimulate implicit fine-tuning of the LLMs toward the characteristics of the projects to be detected.

[Figure 2: An example of the ICL prompts constructed by DLAP]

By adding the results generated by the DL model to the prompts of the LLM, DLAP can include more in-context information. The trained weights of the DL model encode the complex relationship between the predicted probability and the input code text; therefore, the DL model's output carries the characteristics of the training sets. An ICL prompt built in this way contains more information than a normal ICL prompt, which stimulates LLMs to perform better in downstream tasks.

As shown in the upper part of Figure 1, after obtaining the detection probabilities of the candidate codes from the DL model, we treat each candidate code and its corresponding probability as a question-and-answer pair. These combinations (an example is shown in Figure 2) constitute the in-context learning (ICL) prompt, which, together with Part II in Section 3.4, forms the final DLAP-augmented prompts.
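A minimal sketch of how such question-and-answer combinations could be assembled into an ICL prompt follows. The prompt wording, the 0.5 decision threshold, and the function name are illustrative assumptions, not DLAP's exact format.

```python
def build_icl_prompt(candidates, probs, target_code):
    """Format (similar code, DL probability) pairs as few-shot Q/A context,
    then append the code under test as the final open question."""
    parts = []
    for code, p in zip(candidates, probs):
        # Turn the DL model's probability into an in-context "answer".
        label = "VULNERABLE" if p >= 0.5 else "SAFE"
        parts.append(
            f"Q: Is the following function vulnerable?\n{code}\n"
            f"A: {label} (detection-model probability: {p:.2f})\n")
    parts.append(
        f"Q: Is the following function vulnerable?\n{target_code}\nA:")
    return "\n".join(parts)
```

The trailing open "A:" leaves the final answer for the LLM to complete, while the preceding pairs carry the project-specific signal from the DL model.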

3.4 Chain-Of-Thought Prompts Generation

The second part of DLAP generates specific prompts for each tested sample, in the following stages. Firstly, because the characteristics and detection steps of vulnerabilities vary, we pre-set different detection templates in a COT library. According to existing peer-reviewed vulnerability taxonomies (i.e., [43, 23]) and reliable grey literature (i.e., [41]), we construct a hierarchical detection COT library with the following six major categories.

  • SFE (Security Features Errors): Errors induced by imperfect security features

  • LOG (Logistics Errors): Errors induced by program execution

  • MEM (Memory Errors): Errors related to memory resources

  • NUM (Numeric Errors): Errors induced by numerical computations

  • IDN (Improper Data Neutralization): Errors induced by non-standardization (verification, restriction) of exchanged data

  • UNT (Unknown Taxonomy Errors): Unknown errors

Subsequently, according to the parent-child relationships described in the CWE research concepts view (https://cwe.mitre.org/data/definitions/1000.html), some of the above categories are refined into a total of 45 subcategories. In addition, by referring to relevant research[29, 45, 32] on step-by-step vulnerability detection with COT-driven LLMs, we establish a general paradigm for COT generation as follows.

  • Semantics: Comprehending the function of the code.

  • Logic: Analyzing the structure of the code.

  • Internal risks: Identifying components that may introduce vulnerabilities.

  • External risks: Inspecting for unsafe functions that could potentially lead to vulnerabilities.

  • Generating the COT: Integrating the information acquired above and generating a COT to ask step by step whether there are potential vulnerabilities.
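The five-step paradigm above could be rendered as a prompt template along the following lines. The wording of each step is a hypothetical paraphrase for illustration, not the paper's actual template text.

```python
# Hypothetical rendering of the five-step COT generation paradigm.
GENERIC_COT_STEPS = [
    "Semantics: summarize what this function does.",
    "Logic: outline its control and data flow.",
    "Internal risks: list components that could introduce a vulnerability.",
    "External risks: check calls to potentially unsafe functions.",
    "Conclusion: combining the steps above, decide step by step whether "
    "the function contains a potential vulnerability.",
]

def build_cot_prompt(code, steps=GENERIC_COT_STEPS):
    """Number the paradigm steps and attach the code under test."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"Analyze the function below step by step:\n{numbered}\n\nCode:\n{code}"
```

For a specific CWE subcategory, the generic steps would be replaced by the refined guidance retrieved from the COT library.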

The specific COT refines the generation paradigm into corresponding COT guidance for different categories; each subcategory is associated with a specific detection COT template. DLAP employs two open-source function-level static vulnerability detection tools, Flawfinder (https://dwheeler.com/flawfinder) and Cppcheck (http://cppcheck.net), to generate static scanning results. It parses the output text of the static tools and maps it to the corresponding categories in the taxonomy tree. The results are then scored and recorded for each tool, and the K highest-scoring categories are added to the query key. DLAP uses the same DL model as in Section 3.3 because of its better performance according to the studies[49, 13, 24]. DLAP combines the detection results of the DL model and the scanning results of the static tools into the key of a query. Using this key, DLAP obtains customized COT generation guidance templates from the COT library for the test codes. The key is a dictionary containing the static tools' output class and the DL model's judgment, as shown in Figure 3.

[Figure 3: An example of the query key formed from static tool outputs and the DL model's judgment]
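The key-to-template lookup described above can be sketched as a dictionary query. The category names follow the paper's taxonomy, but the template texts, the fallback behavior, and the function signature are illustrative assumptions.

```python
# Hypothetical COT library: (taxonomy category, DL verdict) -> guidance template.
COT_LIBRARY = {
    ("MEM", True): "Trace each allocation; check frees, bounds, and lifetimes...",
    ("IDN", True): "Follow each external input; check validation and sanitization...",
    ("UNT", False): "No strong signal; walk through the generic detection steps...",
}

def query_cot_template(static_categories, dl_probability, k=1, threshold=0.5):
    """Build the query key from the top-k static-tool categories and the
    DL model's verdict, falling back to the UNT (unknown) template."""
    verdict = dl_probability >= threshold
    for category in static_categories[:k]:
        template = COT_LIBRARY.get((category, verdict))
        if template is not None:
            return template
    return COT_LIBRARY[("UNT", False)]
```

In DLAP the library is a hierarchical tree with 45 subcategories rather than a flat dictionary, but the query mechanics are analogous.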

For instance, if the key is a null pointer dereference, which falls under the IDN category, DLAP retrieves the refined COT guidance from the taxonomy tree as shown in Figure 4. The library of COT guidance templates is publicly available on GitHub (https://github.com/Yang-Yanjing/DLAP.git, COTTree).

[Figure 4: The refined COT guidance retrieved from the taxonomy tree for the example key]

Through the key generation process described earlier, the COT guidance and the results of the DL model are combined to generate the final COT prompts for the specific detection samples.

[Figure 5: An example of the final synergized DLAP prompts]

3.5 Prompts Synergy

Figure 5 shows an example of our algorithm in which the final prompts are assembled in this format. The process by which DLAP generates prompts is described in Algorithm 1, which takes as input the selected DL model $\mathcal{M}$, the training/history data of the target detection project $\mathcal{X}=\{(x_i, y_i)\}_{i=1}^{N}$, the selected static tools $\mathcal{S}$, and the preset COT library $\mathcal{K}$.

Algorithm 1: DLAP prompt generation.
Input: $\mathcal{M}$; $\mathcal{X}=\{(x_i,y_i)\}_{i=1}^{N}$; $\mathcal{S}$; $\mathcal{K}$.
1: $\mathcal{X}_o^T \leftarrow \operatorname{Sample}(\mathcal{X})$  # the training set for constructing the DL-based model
   $Y \leftarrow \operatorname{Label}(\mathcal{X}_o^T)$; $\mathcal{X}_o^E = \mathcal{X} - \mathcal{X}_o^T$
2: $\mathcal{M} = \operatorname{argmin} \mathcal{L}(\mathcal{X}_o^T, Y)$
3: for $x_i$, $i = 1$ to $N$ in $\mathcal{X}_o^E$ do
4:     $[Candidates] = \operatorname{LSH}(x_i)$
5:     $[Probabilities] = \mathcal{M}([Candidates])$
6:     ICL prompt$(x_i) = \{[Probabilities], [Candidates]\}$
7:     $Result_{\mathcal{S}}(x_i) \leftarrow \operatorname{Ranking}(\mathcal{S}(\mathcal{X}_o^E))$
8:     $\mathcal{P}(x_i) = Prediction_{\mathcal{M}}(x_i) + Result_{\mathcal{S}}(x_i)$
9:     $G(x_i) = \mathcal{K}(\mathcal{P}(x_i))$  # use $\mathcal{P}(x_i)$ as the key to retrieve COT prompt generation guidance
10:    $COT(x_i) = GPT(G(x_i))$  # use GPT to complete $G(x_i)$
11:    DLAP prompts$(x_i)$ = ICL prompt$(x_i)$ + COT prompt$(x_i)$
12: end for
Output: specific COT prompts of all detection samples.

First, the target detection project $\mathcal{X}$ is sampled to construct the training set $\mathcal{X}_o^T$, which is labeled as $Y$ (labels may be collected directly if open-source vulnerability information is available). The remaining part of the project forms the test set $\mathcal{X}_o^E$. $\mathcal{X}_o^T$ and the labels $Y$ are used to train $\mathcal{M}$ by minimizing the loss function $\mathcal{L}$. Next, for each input code in the test set $\mathcal{X}_o^E$, the LSH algorithm retrieves the most similar code $[Candidates]$ from this set, and the DL model $\mathcal{M}$ produces detection probabilities $[Probabilities]$. Using $[Candidates]$ and $[Probabilities]$, DLAP constructs question-answer combinations to form the DL-augmented ICL prompts.
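The LSH retrieval step can be illustrated with a minimal MinHash-based sketch; the shingle size, hash count, and banding scheme below are illustrative choices, not the paper's exact configuration:

```python
import hashlib

def minhash_signature(code, num_hashes=32, shingle=3):
    """MinHash signature over word 3-gram shingles of a code snippet."""
    tokens = code.split()
    shingles = {" ".join(tokens[i:i + shingle])
                for i in range(max(1, len(tokens) - shingle + 1))}
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
            for seed in range(num_hashes)]

def lsh_candidates(query, corpus, bands=8):
    """Return corpus snippets sharing at least one signature band with the query."""
    qsig = minhash_signature(query)
    rows = len(qsig) // bands
    qbands = {tuple(qsig[b * rows:(b + 1) * rows]) for b in range(bands)}
    results = []
    for snippet in corpus:
        sig = minhash_signature(snippet)
        sbands = {tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)}
        if qbands & sbands:  # any matching band => candidate
            results.append(snippet)
    return results

corpus = ["int a = b + c ;", "free ( ptr ) ; ptr = 0 ;"]
candidates = lsh_candidates("int a = b + c ;", corpus)
```

Snippets that collide with the query in at least one band become the $[Candidates]$ fed to the DL model.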

DLAP carries out the bespoke process shown in Figure 5 to obtain the specific COT prompt. Static tools are used to produce analysis results $\mathcal{S}(\mathcal{X}_o^E)=\{\mathcal{S}_1:\text{scores},\ \mathcal{S}_2:\text{scores},\ \mathcal{S}_3:\text{scores},\ldots\}$, and $Result_{\mathcal{S}}(x_i)$ is obtained by ranking $\mathcal{S}(\mathcal{X}_o^E)$. DLAP then generates the DL model's prediction result. The results $Predictions_{\mathcal{M}}$ are combined into a query, which serves as the key to retrieve the COT prompt generation guidance $G(x_i)$ from the COT library. GPT is used to complete $G(x_i)$, generating $COT(x_i)$ for each detection sample.
Finally, the prompts of DLAP are composed of the COT prompts and the ICL prompts. Each specific prompt drives the LLM for detection, and the generated COT prompts lead the LLM to produce understandable vulnerability detection results in a prescribed answer format.
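Algorithm 1's per-sample loop can be condensed into the following sketch, with every component (DL model, candidate retrieval, static tool, COT library, LLM) stubbed as a plain callable; all names and stub behaviors are illustrative assumptions:

```python
def dlap_prompt(x, model, candidates_fn, static_tool, cot_library, llm):
    """One iteration of Algorithm 1 (lines 4-11), with stubbed components."""
    candidates = candidates_fn(x)                    # LSH retrieval of similar code
    probabilities = [model(c) for c in candidates]   # DL detection probabilities
    icl_prompt = {"candidates": candidates, "probabilities": probabilities}
    # Query key: static-tool category plus the DL model's binary judgment.
    key = (static_tool(x), model(x) >= 0.5)
    guidance = cot_library.get(key, "generic COT guidance")
    cot_prompt = llm(guidance)                       # the LLM completes the guidance
    return {"icl": icl_prompt, "cot": cot_prompt}

# Stub components (illustrative only).
model = lambda code: 0.9 if "strcpy" in code else 0.1
candidates_fn = lambda x: ["strcpy(dst, src);"]
static_tool = lambda x: "BUF"
cot_library = {("BUF", True): "reason step-by-step about buffer bounds"}
llm = lambda guidance: "COT: " + guidance

prompts = dlap_prompt("strcpy(dst, src);", model, candidates_fn,
                      static_tool, cot_library, llm)
```

The returned ICL and COT parts are concatenated into the final DLAP prompt for each sample.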

4 Experimental Design

This section details research questions, datasets, DL models, baseline LLM prompts, and evaluation metrics.

4.1 Research Questions

The experimental evaluation of DLAP is structured with three research questions (RQs).

  • RQ1:

    Which category of DL models is the most effective for DLAP?

Motivation & Setup: An important driver of DLAP is the DL model, which injects the information stored during DL model training into the prompting process of LLMs through the ICL approach. As little is known about which category of DL models is suitable for augmenting LLMs in the context of vulnerability detection, we propose RQ1 to compare three representative DL models: Sysevr, Devign, and Linevul (cf. Section 4.3 for rationales).

  • RQ2:

    How effective is DLAP compared to existing prompting frameworks?

Motivation & Setup: Previous research has shown that the performance of LLMs is sensitive to prompts, and inappropriate prompts lead to unsatisfactory performance. In this paper, we design DLAP as a prompt-augmentation framework for vulnerability detection. In RQ2, we compare DLAP against four existing prompting frameworks, P_Rol, P_Aux, P_Cot, and GRACE (cf. Section 4.4 for rationales), to evaluate its effectiveness.

  • RQ3:

    How effective is DLAP compared to LoRA fine-tuning?

Motivation & Setup: Previous research has shown that fine-tuning is helpful for augmenting LLMs. In this paper, we use prompt engineering rather than fine-tuning to develop DLAP because of its lower cost. In RQ3, we compare DLAP against a fine-tuned LLM (Llama-13B) [40] to see whether it matches the performance of fine-tuning. Specifically, we select LoRA [17], a state-of-the-art LLM fine-tuning technique, for comparison.

4.2 Datasets

According to Croft et al. [10], common vulnerability detection datasets suffer from labeling bias. To develop an appropriate experimental dataset, we set three criteria for selecting projects: (1) it has been studied by related work [4, 49, 12] (to ensure external validity); (2) it has accumulated more than 3,000 functions (to exclude inactive projects); and (3) it is traceable (to exclude projects whose vulnerability information is incorrect or even unknown). As a result, our experimental dataset consists of four open-source projects: Chrome, Linux, Android, and Qemu. The selected projects are of good open-source quality and have high-quality vulnerability fix records for traceability.

The basic information of the selected projects is shown in Table 1, from which we can observe that the number of vulnerable and non-vulnerable functions in each project is imbalanced. To mitigate the impact of data imbalance on training the DL augmentation model, we first performed random undersampling on the non-vulnerable samples of the four projects. Then we divided the dataset into training and testing sets in an 8:2 proportion. The training set was used to build the DL models, while the testing set was used to evaluate the performance of DLAP.
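The undersampling and 8:2 split described above can be sketched as follows; the helper name and sampling seed are our own, not the paper's exact procedure:

```python
import random

def undersample_and_split(samples, train_ratio=0.8, seed=0):
    """samples: list of (function, label) pairs, where label 1 = vulnerable."""
    rng = random.Random(seed)
    vulnerable = [s for s in samples if s[1] == 1]
    benign = [s for s in samples if s[1] == 0]
    # Random undersampling: keep as many benign functions as vulnerable ones.
    benign = rng.sample(benign, min(len(benign), len(vulnerable)))
    data = vulnerable + benign
    rng.shuffle(data)
    cut = int(len(data) * train_ratio)   # 8:2 train/test proportion
    return data[:cut], data[cut:]

# 10 vulnerable vs. 90 benign functions -> balanced 20 samples, split 16/4.
train, test = undersample_and_split([("vul_fn", 1)] * 10 + [("ok_fn", 0)] * 90)
```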

Table 1: Basic information of the selected projects.

Project  | #Functions | #Vulnerabilities | Used by
Chrome   | 77,173     | 3,939            | Chakraborty et al. [4]
Linux    | 46,855     | 1,961            | Fan et al. [12]
Android  | 8,691      | 1,277            | Fan et al. [12]
Qemu     | 3,096      | 125              | Zhou et al. [49]

4.3 DLAP Refinement

To address RQ1, we select three DL models for vulnerability detection to refine DLAP. Each of the three represents one type of DL model. Their rationales and hyperparameter settings are as follows.

  • Sysevr [24] represents the category that uses code features including syntactic and semantic information in vector representations. It filters the code into slice inputs through static analysis of semantics and syntax.

  • Devign [49] represents the category that introduces richer graph structures and graph neural networks into vulnerability detection models.

  • Linevul [13] represents the category that utilizes pre-trained deep learning models. This detection model is based on the Transformer architecture.

Table 2: Preset hyperparameters of the three DL models.

DL Model | Hyperparameter       | Selection
Sysevr   | Java version         | Java 8
         | Static tools         | Joern 0.3.1
         | Graph database       | Neo4j
         | Data preprocessing   | Slice
         | Embedding algorithm  | Word2vec
         | - sampling algorithm | CBOW
         | - sampling window    | 5
         | - min_count          | 5
         | Network architecture | BiLSTM
         | - epoch              | 100
         | - batch_size         | 32
         | - optimizer          | sgd
         | - loss function      | binary cross-entropy
Devign   | Java version         | Java 8
         | Static tools         | Joern 2.0.157
         | Data preprocessing   | Graph
         | Embedding algorithm  | Word2vec
         | - vector_size        | 100
         | - epoch              | 10
         | - min_count          | 1
         | Network architecture | CNN
         | - epoch              | 200
         | - batch_size         | 128
         | - input_channels     | 115
         | - hidden_channels    | 200
         | - num_of_layers      | 6
         | - optimizer          | adam
         | - loss function      | binary cross-entropy
Linevul  | Data preprocessing   | Slice
         | Embedding algorithm  | BPE+Transformer
         | Pretrained model     | codeBERT
         | - batch_size         | 256
         | - num_attention_head | 12
         | - optimizer          | Adam
         | - loss function      | binary cross-entropy

We selected these three DL models so that each represents a range of similar models in its category. As part of our model selection process, we adopted the hyperparameters reported to achieve the best performance in the respective research papers of these DL models; they are listed in Table 2 as the preset hyperparameters of our framework. By doing so, we aim to replicate the optimal performance achieved by these models and ensure consistency in our evaluation and comparison.

4.4 Baselines

We compare DLAP against four prompting frameworks [45, 34, 30, 44] that leverage LLMs to detect vulnerabilities.

  1. P_Rol (Role-based prompts): According to White et al. [44], providing GPT with a clear role greatly alleviates its hallucination problem. Our first baseline, proposed by Zhang et al. [45], casts GPT as a vulnerability detection system.

  2. P_Aux (Auxiliary information prompts): Following the view of Zhang et al. [45], providing LLMs with more semantic information about the code improves their vulnerability detection performance. In baseline 2, we therefore provide data flow as auxiliary information in the prompts.

  3. P_Cot (Chain-of-thought prompts): According to Wei et al. [42], owing to the multi-turn dialogue capabilities of LLMs, constructing a COT better assists them in reasoning. In baseline 3, we therefore construct a two-step thinking chain to drive the LLM through vulnerability detection. Step 1: make the LLM understand the purpose of the code exactly; we design first-step prompts for detecting the intent of the code. Step 2: building on the first step, prompt the LLM to detect vulnerabilities in the input.

  4. GRACE: GRACE is a prompting framework that enhances the capabilities of LLMs for software vulnerability detection by incorporating graph structural information from the code. It employs codeT5 and ICL techniques to exploit graph information.
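For illustration, the two-step chain of the P_Cot baseline can be sketched as follows; the prompt wording is paraphrased, not the exact text used in the experiments:

```python
def cot_baseline_prompts(code):
    """Two-step chain-of-thought prompts for the P_Cot baseline (paraphrased)."""
    # Step 1: drive the LLM to state the intent of the code.
    step1 = ("You are a vulnerability detection system. "
             "First, describe the intent of the following function:\n" + code)
    # Step 2: building on the intent, ask for the vulnerability verdict.
    step2 = ("Based on your description of the code's intent, "
             "determine whether the function contains a vulnerability. "
             "Answer 'yes' or 'no' with a brief justification.")
    return [step1, step2]

prompts = cot_baseline_prompts("void f(char *s) { char b[8]; strcpy(b, s); }")
```

The two messages are sent to the LLM in sequence, so the second turn conditions on the intent summary produced by the first.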

4.5 Evaluation Metrics

As vulnerability detection is formulated as a binary classification problem in this paper, we use precision (P_vul), recall (R_vul), and F1-score (F1) to measure the performance of each framework. Considering that vulnerabilities form a minor class of great severity, we also use the false positive rate (FPR) as a metric. FPR focuses on false positives, since mistakes on them would cause more serious outcomes than mistakes on false negatives. In this paper, the minor (positive) class is vulnerability, and it occupies a very small portion of the data. The definition of FPR is shown in Equation 6. Moreover, the Matthews correlation coefficient (MCC), a.k.a. the phi coefficient, is also used as an evaluation metric; it measures the performance of binary classifiers on imbalanced datasets and is a more comprehensive metric than FPR. The definition of MCC is shown in Equation 7.

$\text{FPR} = \dfrac{\text{FP}}{\text{FP} + \text{TN}}$  (6)

$\text{MCC} = \dfrac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$  (7)

where TP represents correctly detected vulnerabilities, TN represents correctly detected non-vulnerabilities, FP represents incorrectly detected vulnerabilities, and FN represents incorrectly detected non-vulnerabilities.

The Coefficient of Variation (CV) is a statistical measure used to determine the dispersion of data points in a dataset relative to its mean. It is particularly valuable when comparing the variability of datasets with different means. The CV is calculated using the equation:

$\text{CV} = \dfrac{\sigma}{\mu}$  (8)

where $\sigma$ represents the standard deviation and $\mu$ denotes the mean of the dataset.

A higher CV indicates greater dispersion within the data distribution, reflecting more variability relative to the mean. P_vul, R_vul, F1, and FPR range from 0 to 1; higher precision, recall, and F1 indicate better performance, whereas a lower FPR indicates better performance. MCC ranges from -1 to +1, with higher values indicating better performance of a classifier. We use percentage values (%) to highlight the differences between results.
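The metrics above can be computed directly from confusion-matrix counts; a minimal sketch:

```python
import math

def fpr(tp, tn, fp, fn):
    """False positive rate (Equation 6)."""
    return fp / (fp + tn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (Equation 7)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cv(values):
    """Coefficient of variation (Equation 8): population std over mean."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return sigma / mu
```

A perfect classifier yields FPR = 0 and MCC = 1, while a constant-valued series yields CV = 0.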

5 Results and Analysis

This section analyzes the experimental results to address the research questions.

Table 3: Performance (%) of DLAP with different DL models (P_vul / R_vul / F1 / FPR / MCC).

Project | Linevul                  | Devign                   | Sysevr
Chrome  | 40.4/73.3/52.1/28.4/37.6 | 29.3/85.5/43.7/54.0/26.1 | 27.7/56.8/37.2/39.0/14.6
Android | 34.6/86.2/49.3/41.4/36.1 | 31.7/85.5/46.2/46.7/31.3 | 29.4/80.3/43.1/48.7/25.5
Linux   | 57.1/76.4/65.4/13.9/56.4 | 48.8/66.3/56.3/16.9/44.4 | 27.7/22.6/24.9/14.4/8.8
Qemu    | 84.2/55.1/66.7/1.9/63.9  | 52.8/65.5/58.5/10.7/50.3 | 28.6/10.0/14.8/4.4/9.0

5.1 RQ1: Selection of DL Models

We conducted experiments on four large-scale projects to investigate which category of DL model is suitable for DLAP. The results in Table 3 reveal that Linevul outperforms the others on most datasets and metrics. For instance, on the Chrome dataset, DLAP with Linevul achieves the highest MCC of 37.6%, surpassing Devign's 26.1% and Sysevr's 14.6%. This finding is consistent on the Linux dataset, where it secures an MCC of 56.4%, compared to 44.4% and 8.8% for Devign and Sysevr, respectively. Furthermore, Linevul's precision and F1 scores are notably higher across the datasets, underscoring its robustness in identifying vulnerabilities with greater accuracy and fewer false positives, as evidenced by its lower FPR. Overall, using Linevul surpasses using Devign by an average of 7.2% and 10.5% on the comprehensive evaluation metrics F1 and MCC, respectively, and outperforms Sysevr by an average of 28.4% and 34.0% on the same metrics. This demonstrates that Linevul has superior adaptability and generalizability when integrated into LLMs compared to the other DL models. These results indicate the effectiveness of integrating Linevul into DLAP for detecting vulnerabilities: its superior F1 implies a higher likelihood of detecting actual vulnerabilities, and its MCC, a critical indicator of the quality of binary classification, shows the ability of DLAP with Linevul to handle extremely imbalanced datasets.

[Figure 6: probability density distributions of the detection probabilities of the three DL models.]
Table 4: CV of the DL models' detection probabilities on each project.

DL model | Chrome | Android | Linux | Qemu | Average
Sysevr   | 0.1    | 1.2     | 0.4   | 0.02 | 0.43
Devign   | 0.5    | 1.2     | 2.0   | 2.0  | 1.4
Linevul  | 2.4    | 2.6     | 2.5   | 3.3  | 2.7

To further distinguish which DL model is more suitable as a plug-in for DLAP, we also analyze the intermediate output (detection probability) of the DL models. Table 4 presents the variability of the different DL models across the projects; the Linevul model displays the highest CV. By comparing the probability density distribution plots (Figure 6) and the CV values (Table 4) on the largest project dataset (Google's Chrome), we notice that Linevul yields a more discrete distribution of detection probabilities than the other models. This discrete detection distribution facilitates LLM generation with implicit fine-tuning for downstream detection tasks more effectively.

5.2 RQ2: Comparison with Other Prompting Frameworks

Table 5: Performance (%) of DLAP and baseline prompting frameworks (P_vul / R_vul / F1 / FPR / MCC).

Framework | Chrome                   | Android                  | Linux                    | Qemu
P_Rol     | 24.4/7.2/11.1/5.8/2.3    | 22.5/6.4/10.0/5.6/1.3    | 22.4/6.6/10.2/5.6/1.7    | 22.2/6.9/10.5/4.4/4.2
P_Aux     | 22.7/54.6/32.1/48.6/4.8  | 21.8/63.4/32.5/58.3/4.2  | 24.6/70.2/36.5/52.6/14.1 | 19.3/55.2/28.6/42.1/9.5
P_Cot     | 16.8/5.4/8.1/7.0/2.6     | 31.6/3.1/5.7/1.7/4.0     | 30.7/8.0/12.7/4.4/6.5    | 64.7/38.0/47.8/3.8/43.0
GRACE     | 32.6/37.5/32.6/80.2/11.2 | 25.0/82.6/38.4/74.0/8.5  | 25.0/76.0/37.6/76.0/2.0  | 17.1/93.1/28.9/82.4/10.6
DLAP      | 40.4/73.3/52.1/28.4/37.6 | 34.6/86.2/49.3/41.4/36.1 | 57.1/76.4/65.4/13.9/56.4 | 84.2/55.1/66.7/1.9/63.9

Due to cost constraints associated with OpenAI API calls, we employed the GPT-3.5-turbo-0125 model for vulnerability detection. Table 5 compares the GPT model under the baseline prompting frameworks and under DLAP. The performance of each framework is evaluated on five metrics: precision (P_vul), recall (R_vul), F1-score (F1), false positive rate (FPR), and Matthews correlation coefficient (MCC). DLAP consistently outperforms the other frameworks across nearly all metrics and datasets. Specifically, DLAP achieves the highest precision, recall, F1-score, and MCC values, showcasing its superior ability to identify vulnerabilities accurately with minimal false positives. For instance, on the Chrome dataset, DLAP's precision of 40.4% and recall of 73.3% significantly surpass those of the next best framework, GRACE. Furthermore, DLAP's F1-score reaches 52.1% on Chrome, 49.3% on Android, 65.4% on Linux, and an impressive 66.7% on Qemu, all higher than the baseline frameworks. In terms of FPR, DLAP is moderate across the datasets: although its FPR on the Chrome, Android, and Linux datasets is not as low as that of P_Rol and P_Cot, DLAP is far superior to them on F1 and MCC. Therefore, DLAP's overall effectiveness exceeds the baseline frameworks.

In particular, DLAP's MCC values, which indicate the quality of binary classification, significantly exceed those of the other methods, e.g., 37.6% on Chrome and 63.9% on Qemu, further establishing its superior performance in LLM-based vulnerability detection. DLAP consistently surpasses the best baseline on MCC; given the nature of this correlation coefficient, this suggests that its predictions more accurately reflect the actual distribution and that DLAP generalizes better than the baselines on large datasets.

Overall, the analysis reveals that DLAP not only excels in identifying vulnerabilities with high precision and recall but also maintains a low false positive rate and achieves outstanding overall performance as evidenced by its F1 Scores and MCC values. This demonstrates DLAP’s exceptional effectiveness in harnessing the power of LLM for the critical task of vulnerability detection, which outperforms the capabilities of other prompting frameworks.

5.3 RQ3: Prompting vs. Fine-tuning

Table 6 shows that fine-tuning an LLM on a large project yields a higher F1 than DLAP. However, on small projects with imbalanced data, DLAP performs better. In particular, LLMs cannot be effectively fine-tuned on Qemu because the project has little data, whereas DLAP captures the distribution characteristics of small samples and hence achieves better performance. In addition, fine-tuning an LLM requires taking the model offline and retraining it before use, whereas DLAP needs no retraining downtime: it works as a plug-in that accesses the LLM in real time to augment its vulnerability detection capability. Besides, the computational costs of DLAP and LoRA fine-tuning are compared in Table 7. Fine-tuning a 13B LLM requires close to 40 GB of graphics memory and considerable time; in contrast, DLAP can select a small DL model and train it to fit the target data in less than one hour.

Table 6: Performance (%) of LoRA fine-tuning vs. DLAP (P_vul / R_vul / F1 / FPR / MCC).

Dataset | Fine-tuned Vicuna-13B    | DLAP
Chrome  | 91.4/74.4/82.0/1.8/78.6  | 40.4/73.3/52.1/28.4/37.6
Android | 67.0/35.8/46.7/4.5/40.4  | 34.6/86.2/49.3/41.4/36.0
Linux   | 96.4/55.4/70.3/0.5/68.9  | 57.1/76.4/65.4/14.0/56.4
Qemu    | 99.9/6.7/12.1/0.1/23.4   | 84.2/55.2/66.7/1.9/63.9
Total   | 88.7/43.0/52.8/1.2/52.8  | 54.1/72.8/58.4/21.4/48.5
Table 7: Computational cost of LoRA fine-tuning vs. DLAP (M = memory, T = time, GPU = graphics memory).

Dataset | Fine-Tuning M(MB)/T(h)/GPU(GB) | DLAP M(MB)/T(h)/GPU(GB)
Chrome  | 5.1/11.1/31.2                  | 3.6/0.8/6.3
Android | 4.9/4.2/30.3                   | 4.3/0.5/5.5
Linux   | 4.9/5.5/30.3                   | 3.8/0.4/5.5
Qemu    | 4.8/1.3/28.7                   | 0.9/0.3/2.8

Equation 5 (cf. Section 3.3) indicates that the DL model's training information changes the relaxed attention of the LLM. This results in an implicit fine-tuning that adapts the LLM to a specific detection task [11]. Whether through fine-tuning or in-context learning (ICL), the extent to which a model is adapted to the target task is a crucial factor in stimulating LLMs to perform well. According to the detection results shown in Table 6, DLAP approximates fine-tuning well on the performance evaluation metrics.

To further explain what mechanism induces the LLM to perform implicit fine-tuning and achieve good performance on the target task, we extract the attention layer from the fine-tuned local LLM to calculate the probability of each detection category. Subsequently, we gather the ICL outputs of the LLM with DLAP and calculate the probability of each detection category. The probability distribution over the classes indicates the degree of fine-tuning of the model. Figure 7 shows that the probability distributions of fine-tuning and DLAP are similar, which explains how DLAP enables implicit fine-tuning at a reduced cost.

[Figure 7: probability distributions over detection categories for the fine-tuned LLM and for DLAP-driven ICL.]

In comparison with fine-tuning, Figure8 shows a real example of using DLAP to detect vulnerabilities in Linux. The outcomes, which are easily understandable to developers, closely match the records from the actual issue fix commit. In contrast, the output from a fine-tuned LLM is limited to simple ‘yes’ or ‘no’ responses. DLAP’s results are more comprehensible to developers than those from fine-tuning alone.

[Figure 8: a real example of DLAP detecting a vulnerability in Linux.]

6 Discussion

In this section, we discuss the DL model selection for DLAP and DLAP’s potential generalization capability.

6.1 DL Model Selection for DLAP

Based on the insights gained from RQ1, DL models whose predictive probability distributions over the data are discrete are more suitable as integrated plug-ins for DLAP. We observe that a DL model's utility as an LLM prompt model improves significantly when its predictions are discrete, with a high coefficient of variation ($CV$). Moreover, our experiments highlight the exceptional performance of Transformer-based models in driving the LLM. This advantage could be attributed to the architectural resemblance between Transformer models and the LLM itself: the similarity of their structures allows for seamless integration, enabling the attention-layer parameters derived from Transformer models to play a pivotal role in facilitating implicit fine-tuning within the LLM.
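This selection criterion is easy to compute. The sketch below (the probability values are made up for illustration) ranks two candidate DL models by the coefficient of variation of their predicted probabilities:

```python
import numpy as np

def coefficient_of_variation(probs):
    """CV = std / mean of a model's predicted probabilities; a higher CV
    indicates a more discrete (spread-out) predictive distribution."""
    probs = np.asarray(probs, dtype=float)
    return float(probs.std() / probs.mean())

# Made-up predicted vulnerability probabilities from two candidate DL models.
clustered = [0.48, 0.52, 0.50, 0.49, 0.51]  # indecisive around 0.5: low CV
discrete  = [0.03, 0.97, 0.91, 0.08, 0.95]  # confident, near 0 or 1: high CV
best_plugin = max([clustered, discrete], key=coefficient_of_variation)  # -> discrete
```

Under this criterion, the model with the discrete, confident predictions would be preferred as the DLAP plug-in.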

By leveraging these attention-layer parameters, the LLM dynamically adjusts and refines its internal mechanisms, implicitly adapting itself to the nuances and intricacies of different downstream tasks. This implicit fine-tuning process empowers the LLM to generate more accurate and contextually relevant responses, thereby enhancing its overall performance in various application scenarios.

In summary, our experiments reveal the crucial roles played by both the variation in the DL model's predictive distribution and the architectural resemblance between Transformer models and LLMs. These factors, combined with the implicit fine-tuning facilitated by attention-layer parameters, enable the LLM to adapt to and excel at diverse downstream tasks.

6.2 Generalization Capability of DLAP

The DLAP framework effectively stimulates LLMs to implicitly fine-tune themselves, and this mechanism extends to other software development tasks. By integrating existing static analysis tools and deep learning models, DLAP can be applied to a variety of ASAT tasks, which simplifies the process of adopting it for new challenges. We introduce two scenarios that may extend the applicability of DLAP.

Automated identification of affected libraries from vulnerability data is an ASAT task that determines which libraries in software are related to each reported vulnerability in open vulnerability report sets (e.g., NVD, CVE). The task is formulated as extreme multi-label learning [15, 6]. First, DLAP constructs a sufficient vulnerability description database and combines the descriptions with libraries known to be affected by the reported vulnerabilities to form a COT template library for affected-library identification; the known affected libraries are also used to train a DL model. Then, an existing static tool (fastXML, https://github.com/fastXML/fastXML) and the DL model are used to generate preliminary results for the candidate library list of the project. Finally, by combining these results as a key to query the COT template library, COT prompts can be built to augment the LLMs, potentially making the identification of affected libraries more accurate.
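The retrieval step described above could look like the following sketch. All names, the toy template library, and the report text are hypothetical; fastXML is not invoked here, and its label output (like the DL model's) is simply simulated:

```python
def build_cot_prompt(description, static_labels, dl_labels, cot_library):
    """Union the preliminary label sets from the static tool and the DL model,
    then use them as keys to pull matching COT exemplars into one prompt."""
    candidates = sorted(set(static_labels) | set(dl_labels))
    steps = "\n".join(cot_library[lib] for lib in candidates if lib in cot_library)
    return f"{steps}\nVulnerability report: {description}\nWhich libraries are affected?"

cot_library = {  # toy COT template library keyed by library name
    "log4j":   "Step: the report mentions JNDI lookups, so inspect log4j.",
    "openssl": "Step: the report mentions TLS handshakes, so inspect openssl.",
}
prompt = build_cot_prompt(
    "Remote code execution via JNDI lookup in the logging layer",
    static_labels=["log4j"],         # simulated fastXML output
    dl_labels=["log4j", "openssl"],  # simulated DL model output
    cot_library=cot_library,
)
```

The preliminary labels thus act purely as retrieval keys; the LLM performs the final identification over the assembled prompt.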

Code smell detection is an ASAT task that protects software from technical debt. DL-based code smell detection is a multi-class detection task comprised of several binary classification models, each designed to detect a specific category of code smell [21, 33]. Utilizing DLAP requires the creation of a comprehensive reference library of code smells and a high-quality coding standards library. DLAP then uses static tools (e.g., checkstyle, https://checkstyle.org) and DL models for the specific projects under detection. Like the process described in this paper, the DL model augments the LLMs with prompts for project-specific code smell detection. In the same way, DLAP can be applied to other ASAT tasks that need to combine DL models with LLMs to improve LLM performance on the target tasks.
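The per-category structure could be wired up as in the sketch below. The detectors here are trivial stand-ins for trained binary classifiers, and every name and threshold is hypothetical:

```python
def smell_prompts(code, detectors, reference_library):
    """Run each binary smell detector; for every positive hit, attach the
    matching reference entry to build a smell-specific LLM prompt."""
    return [
        f"{reference_library[smell]}\nCheck this code for {smell}:\n{code}"
        for smell, detect in detectors.items()
        if detect(code)
    ]

# Toy stand-ins for per-category binary classifiers.
detectors = {
    "long method":  lambda code: code.count("\n") > 50,
    "magic number": lambda code: any(ch.isdigit() for ch in code),
}
reference_library = {
    "long method":  "A long method exceeds ~50 lines and should be split.",
    "magic number": "A magic number is an unexplained literal constant.",
}
hits = smell_prompts("timeout = 42\n", detectors, reference_library)  # one hit
```

Each positive detector gates one targeted prompt, mirroring how DLAP's DL plug-in gates project-specific vulnerability prompts.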

7 Threats to Validity

This section analyzes possible threats to validity [48] and our efforts to mitigate their impacts.

Internal Validity. The efficacy of DLAP relies on its core component, the DL models. While it is tolerable for these DL models to introduce judgment biases for input code, a completely erroneous DL model that is irrelevant or detrimental to the task can severely undermine DLAP's performance. Therefore, when employing DLAP, it is crucial to select DL models capable of addressing the specific objectives of the task so as to effectively augment the performance of LLMs. Besides, due to the closed-source nature of certain LLMs (e.g., GPT-3.5-turbo), their internal structures and the specific fine-tuning methods they employ remain unknown. Therefore, for our experiments, we use an open-source LLM (Llama-13b) for the comparative fine-tuning studies.

Construct Validity. The relaxed attention of the LLMs changes under the stimulation of DLAP according to Equation 5. We define this stimulation as the implicit fine-tuning of LLMs caused by DLAP to adapt to the features of the target project. Because of the limitations of observing the internal states of LLMs, we cannot strictly demonstrate that the stimulation produces gradient-descent loss optimization on the target classification task. Instead of a mathematical demonstration, we present our results alongside intermediate outputs from the fine-tuned contrasts, validating the existence of the implicit fine-tuning mechanism through experimental data. These visualized experimental results mitigate the construct validity threat to some extent.

External Validity. To verify DLAP, our templates must drive an LLM to complete vulnerability detection, and performance on this task is used to measure the effectiveness of our method. Consequently, when the LLM differs from the one selected in this experiment, the results of using DLAP will differ. We therefore identify the choice of LLM as an external validity threat to this work. Considering both cost and model performance, we chose the least expensive model among the current state-of-the-art LLMs, GPT-3.5-turbo-0125. By using the best model we make the best use of DLAP, and by specifying the exact model we enable other work to reproduce the same level of improvement when using DLAP.

8 Conclusion

In this paper, we propose DLAP, a bespoke prompting framework for ASAT tasks that delivers superior and stable performance in software vulnerability detection with results easily understandable to developers. Experiments show the effectiveness of augmenting LLMs with DL models to stimulate adaptive implicit fine-tuning. This enables LLMs to exceed both state-of-the-art DL solutions and LLMs with alternative prompting frameworks in vulnerability detection. Through experiments, we also find that the pre-trained knowledge of LLMs combines the outputs of all parts of DLAP to achieve good performance. In the future, we will apply DLAP to more ASAT tasks to explore how it generalizes.

References

  • Arakelyan etal. [2023]Arakelyan, S., Das, R., Mao, Y., Ren, X., 2023.Exploring distributional shifts in large language models for code analysis, in: Proceedings of the 22nd Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL. pp. 16298–16314.
  • Bai etal. [2022]Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., etal., 2022.Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073 .
  • Brown etal. [2020]Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., etal., 2020.Language models are few-shot learners.Advances in Neural Information Processing Systems 33, 1877–1901.
  • Chakraborty etal. [2022]Chakraborty, S., Krishna, R., Ding, Y., Ray, B., 2022.Deep learning based vulnerability detection: Are we there yet.IEEE Transactions on Software Engineering 48, 3280–3296.
  • Chen etal. [2021]Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., etal., 2021.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374 .
  • Chen etal. [2020]Chen, Y., Santosa, A.E., Sharma, A., Lo, D., 2020.Automated identification of libraries from vulnerability data, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (SEIP), ACM. pp. 90–99.
  • Cheshkov etal. [2023]Cheshkov, A., Zadorozhny, P., Levichev, R., 2023.Evaluation of chatgpt model for vulnerability detection.arXiv preprint arXiv:2304.07232 .
  • Chowdhery etal. [2023]Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., etal., 2023.Palm: Scaling language modeling with pathways.Journal of Machine Learning Research 24, 1–113.
  • Christakis and Bird [2016]Christakis, M., Bird, C., 2016.What developers want and need from program analysis: An empirical study, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), ACM. pp. 332–343.
  • Croft etal. [2023]Croft, R., Babar, M.A., Kholoosi, M.M., 2023.Data quality for software vulnerability datasets, in: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), IEEE. pp. 121–133.
  • Dai etal. [2023]Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., Wei, F., 2023.Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers, in: Proceedings of the 2023 ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo), Association for Computational Linguistics. pp. 4005–4019.
  • Fan etal. [2020]Fan, J., Li, Y., Wang, S., Nguyen, T.N., 2020.A c/c++ code vulnerability dataset with code changes and cve summaries, in: Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories (MSR), ACM. pp. 508–512.
  • Fu and Tantithamthavorn [2022]Fu, M., Tantithamthavorn, C., 2022.Linevul: A transformer-based line-level vulnerability prediction, in: Proceedings of the 19th IEEE/ACM International Conference on on Mining Software Repositories (MSR), ACM. pp. 608–620.
  • Gonzalez etal. [2021]Gonzalez, D., Zimmermann, T., Godefroid, P., Schäfer, M., 2021.Anomalicious: Automated detection of anomalous and potentially malicious commits on github, in: Proceedings of 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE. pp. 258–267.
  • Haryono etal. [2022]Haryono, S.A., Kang, H.J., Sharma, A., Sharma, A., Santosa, A., Yi, A.M., Lo, D., 2022.Automated identification of libraries from vulnerability data: Can we do better?, in: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (ICPC), ACM. pp. 178–189.
  • Hsieh etal. [2019]Hsieh, Y.G., Niu, G., Sugiyama, M., 2019.Classification from positive, unlabeled and biased negative data, in: Proceedings of the 36th ACM International Conference on Machine Learning (ICML), PMLR. pp. 2820–2829.
  • Hu etal. [2022]Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., etal., 2022.Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations (ICLR).
  • Jin etal. [2023]Jin, M., Shahriar, S., Tufano, M., Shi, X., Lu, S., Sundaresan, N., Svyatkovskiy, A., 2023.Inferfix: End-to-end program repair with llms.arXiv preprint arXiv:2303.07263 .
  • Kang etal. [2022]Kang, H.J., Aw, K.L., Lo, D., 2022.Detecting false alarms from automatic static analysis tools: How far are we?, in: Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE), ACM. pp. 698–709.
  • Katsadouros etal. [2023]Katsadouros, E., Patrikakis, C.Z., Hurlburt, G., 2023.Can large language models better predict software vulnerability?IT Professional 25, 4–8.
  • Lewowski and Madeyski [2022]Lewowski, T., Madeyski, L., 2022.How far are we from reproducible research on code smell detection? a systematic literature review.Information and Software Technology 144, 106783.
  • Li etal. [2023]Li, J., Li, G., Li, Y., Jin, Z., 2023.Enabling programming thinking in large language models toward code generation.arXiv preprint arXiv:2305.06599 .
  • Li etal. [2017]Li, X., Chang, X., Board, J.A., Trivedi, K.S., 2017.A novel approach for software vulnerability classification, in: Proceedings of the 64th IEEE/ACM International Conference on Reliability and Maintainability Symposium (RAMS), IEEE. pp. 1–7.
  • Li etal. [2021]Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z., 2021.Sysevr: A framework for using deep learning to detect software vulnerabilities.IEEE Transactions on Dependable and Secure Computing 19, 2244–2258.
  • Li etal. [2018]Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S., Deng, Z., Zhong, Y., 2018.Vuldeepecker: A deep learning-based system for vulnerability detection, in: Proceedings of the 25th ACM Annual Network and Distributed System Security Symposium (NDSS), The Internet Society.
  • Lin etal. [2020a]Lin, G., Wen, S., Han, Q.L., Zhang, J., Xiang, Y., 2020a.Software vulnerability detection using deep neural networks: a survey.Proceedings of the IEEE 108, 1825–1848.
  • Lin etal. [2020b]Lin, G., Xiao, W., Zhang, J., Xiang, Y., 2020b.Deep learning-based vulnerable function detection: A benchmark, in: Proceedings of the 21st ACM Information and Communications Security: International Conference (ICICS), Springer. pp. 219–232.
  • Liu etal. [2022]Liu, X., Ji, K., Fu, Y., Tam, W., Du, Z., Yang, Z., Tang, J., 2022.P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Association for computational Linguistics. pp. 61–68.
  • Liu etal. [2023]Liu, X., Tan, Y., Xiao, Z., Zhuge, J., Zhou, R., 2023.Not the end of story: An evaluation of chatgpt-driven vulnerability description mappings, in: Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics. pp. 3724–3731.
  • Lu etal. [2024]Lu, G., Ju, X., Chen, X., Pei, W., Cai, Z., 2024.Grace: Empowering llm-based software vulnerability detection with graph structure and in-context learning.Journal of Systems and Software 212, 112–236.
  • Nachtigall etal. [2022]Nachtigall, M., Schlichtig, M., Bodden, E., 2022.A large-scale study of usability criteria addressed by static analysis tools, in: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), ACM. pp. 532–543.
  • Ozturk etal. [2023]Ozturk, O.S., Ekmekcioglu, E., Cetin, O., Arief, B., Hernandez-Castro, J., 2023.New tricks to old codes: Can ai chatbots replace static code analysis tools?, in: Proceedings of the 7th ACM European Interdisciplinary Cybersecurity Conference (EICC), ACM. pp. 13–18.
  • Pecorelli etal. [2019]Pecorelli, F., DiNucci, D., DeRoover, C., DeLucia, A., 2019.On the role of data balancing for machine learning-based code smell detection, in: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), ACM. pp. 19–24.
  • Purba etal. [2023]Purba, M.D., Ghosh, A., Radford, B.J., Chu, B., 2023.Software vulnerability detection using large language models, in: Proceedings of the 34th IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), IEEE. pp. 112–119.
  • Shi etal. [2023]Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E.H., Schärli, N., Zhou, D., 2023.Large language models can be easily distracted by irrelevant context, in: Proceedings of the 40th ACM International Conference on Machine Learning (ICML), PMLR. pp. 31210–31227.
  • Steenhoek etal. [2023]Steenhoek, B., Rahman, M.M., Jiles, R., Le, W., 2023.An empirical study of deep learning models for vulnerability detection, in: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), pp. 2237–2248.
  • Telang and Wattal [2007]Telang, R., Wattal, S., 2007.An empirical analysis of the impact of software vulnerability announcements on firm stock price.IEEE Transactions on Software engineering 33, 544–557.
  • Thapa etal. [2022]Thapa, C., Jang, S.I., Ahmed, M.E., Camtepe, S., Pieprzyk, J., Nepal, S., 2022.Transformer-based language models for software vulnerability detection, in: Proceedings of the 38th ACM Annual Computer Security Applications Conference (ACSAC), ACM. pp. 481–496.
  • Tomas etal. [2019]Tomas, N., Li, J., Huang, H., 2019.An empirical study on culture, automation, measurement, and sharing of devsecops, in: 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), IEEE. pp. 1–8.
  • Touvron etal. [2023]Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., etal., 2023.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971 .
  • Tsipenyuk etal. [2005]Tsipenyuk, K., Chess, B., McGraw, G., 2005.Seven pernicious kingdoms: A taxonomy of software security errors.IEEE Security & Privacy 3, 81–84.
  • Wei etal. [2022]Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., etal., 2022.Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems 35, 24824–24837.
  • Wei etal. [2021]Wei, Y., Sun, X., Bo, L., Cao, S., Xia, X., Li, B., 2021.A comprehensive study on security bug characteristics.Journal of Software: Evolution and Process 33, e2376.
  • White etal. [2023]White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C., 2023.A prompt pattern catalog to enhance prompt engineering with chatgpt.arXiv preprint arXiv:2302.11382 .
  • Zhang etal. [2023a]Zhang, C., Liu, H., Zeng, J., Yang, K., Li, Y., Li, H., 2023a.Prompt-enhanced software vulnerability detection using chatgpt.arXiv preprint arXiv:2308.12697 .
  • Zhang etal. [2023b]Zhang, K., Li, Z., Li, J., Li, G., Jin, Z., 2023b.Self-edit: Fault-aware code editor for code generation.arXiv preprint arXiv:2305.04087 .
  • Zhang etal. [2023c]Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., etal., 2023c.Siren’s song in the ai ocean: A survey on hallucination in large language models.arXiv preprint arXiv:2309.01219 .
  • Zhou etal. [2016]Zhou, X., Jin, Y., Zhang, H., Li, S., Huang, X., 2016.A map of threats to validity of systematic literature reviews in software engineering, in: Proceedings of the 23rd IEEE International Conference on Asia-Pacific Software Engineering Conference (APSEC), IEEE. pp. 153–160.
  • Zhou etal. [2019]Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y., 2019.Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks.Advances in neural information processing systems 32, 10197–10207.
  • Zou et al. [2019]Zou, D., Wang, S., Xu, S., Li, Z., Jin, H., 2019. μvuldeepecker: A deep learning-based system for multiclass vulnerability detection. IEEE Transactions on Dependable and Secure Computing 18, 2224–2236.

Appendix

The appendix analyzes the composition and causes of implicit fine-tuning. The implicit fine-tuning process is described as follows. We set $\mathcal{C}(x) = COT(x)$ as the input representation of the context, $x$ as the target task data input, and $\mathcal{P}(x) = ICL(x)$ as the input representation of the DLAP prompt query. $W_Q, W_K, W_V$ are the projection matrices for computing the attention queries, keys, and values, and $\mathbf{q} = W_Q x$ is the attention query vector. In the fine-tuning process of DLAP, the attention of the LLM is represented as Equation 9:

$$\mathcal{A}(\mathbf{q}) = \operatorname{Attention}(V, K, \mathbf{q}) = W_V[\mathcal{P}(x); \mathcal{C}(x)]\,\operatorname{softmax}\!\left(\frac{\left(W_K[\mathcal{P}(x); \mathcal{C}(x)]\right)^{T}\mathbf{q}}{\sqrt{d}}\right) \tag{9}$$

To simplify this process, we eliminate the nonlinear function $\operatorname{softmax}$ and the related scaling factor $\sqrt{d}$, which facilitates the analysis of the attention changes themselves. We obtain the approximate relaxed linear attention in Equation 10.

$$\begin{aligned}
\mathcal{A}(\mathbf{q}) &\approx W_V[\mathcal{P}(x); \mathcal{C}(x)]\left(W_K[\mathcal{P}(x); \mathcal{C}(x)]\right)^{T}\mathbf{q} \\
&= W_V\mathcal{C}(x)\left(W_K\mathcal{C}(x)\right)^{T}\mathbf{q} + W_V\mathcal{P}(x)\left(W_K\mathcal{P}(x)\right)^{T}\mathbf{q} \\
&= \widetilde{\mathcal{A}}(\mathbf{q})
\end{aligned} \tag{10}$$

We define the context prompts (information from the COT library) as the initial parameters $W_{\text{init}}$ to be updated by the attention layer, given in Equation 11:

$$W_{\text{init}} = W_V[\mathcal{C}(x)]\left(W_K[\mathcal{C}(x)]\right)^{T} \tag{11}$$

Following prior research [11], we use the dual form of Transformer attention derived there in reverse. The adaptive implicit fine-tuning of the LLM stimulated by DLAP for a specific project can thus be written as Equation 12:

$$\begin{aligned}
\widetilde{\mathcal{A}}(\mathbf{q}) &= W_{\text{init}}\mathbf{q} + W_V[\mathcal{P}(x)]\left(W_K[\mathcal{P}(x)]\right)^{T}\mathbf{q} \\
&= W_{\text{init}}\mathbf{q} + \operatorname{LinearAttn}\left(W_V[\mathcal{P}(x)], W_K[\mathcal{P}(x)], \mathbf{q}\right) \\
&= W_{\text{init}}\mathbf{q} + \sum_i\left(\left(W_V[\mathcal{P}(x)]_i\right)\otimes\left(W_K[\mathcal{P}(x)]_i\right)\right)\mathbf{q} \\
&= W_{\text{init}}\mathbf{q} + \Delta W_{\mathcal{P}(x)}\mathbf{q} \\
&= \left(W_{\text{init}} + \Delta W_{\mathcal{P}(x)}\right)\mathbf{q}
\end{aligned} \tag{12}$$

Through Equation 12, we conclude that the relaxed attention mechanism is influenced by the prompt $\mathcal{P}(x)$, which acts as an implicit parameter update $\Delta W_{\mathcal{P}(x)}$ applied on top of $W_{\text{init}}$.
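Once the softmax and scaling factor are dropped, this decomposition is exact, which can be verified numerically. The sketch below uses random matrices with assumed toy dimensions:

```python
import numpy as np

# Check Equations 10-12: relaxed linear attention over [P(x); C(x)] splits
# exactly into W_init q (context part) plus Delta W_P q (implicit update).
rng = np.random.default_rng(0)
d, n_ctx, n_prompt = 8, 4, 3
W_V, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
C = rng.normal(size=(d, n_ctx))     # context representations C(x)
P = rng.normal(size=(d, n_prompt))  # DLAP prompt representations P(x)
q = rng.normal(size=d)              # attention query vector

H = np.hstack([P, C])
full = W_V @ H @ (W_K @ H).T @ q            # Equation 10, first line
W_init  = W_V @ C @ (W_K @ C).T             # Equation 11
delta_W = W_V @ P @ (W_K @ P).T             # Delta W_P(x) in Equation 12
assert np.allclose(full, (W_init + delta_W) @ q)  # decomposition is exact
```

The split works because $HH^{T} = \mathcal{P}(x)\mathcal{P}(x)^{T} + \mathcal{C}(x)\mathcal{C}(x)^{T}$ for the concatenated representation $H = [\mathcal{P}(x); \mathcal{C}(x)]$.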
