20 Apr 2023

Prompt Based Learning

Summary of the survey paper “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing”. Here’s the link.

Prompt-based Learning

In many downstream NLP tasks, it’s difficult to gather a large amount of annotated data to train a neural network that generalizes and predicts well. At the same time, LLMs are pretrained on huge amounts of data, so the knowledge or ability required to solve downstream tasks may already be in the model. Prompt-based learning therefore aims to make use of the internal knowledge of LLMs to solve downstream tasks, avoiding over-fitted training on small, domain-specific datasets. All the information is expressed in text, including the original input data, the task description, etc.

Prompt-based methods make predictions in 3 steps, namely prompt addition, answer search, and answer mapping. Prompt addition integrates the input data with a prompt template. There are 2 types of prompts, cloze and prefix. A cloze prompt fills in blanks, e.g. “I love this movie. It’s a [MASK] movie”. A prefix prompt continues a string prefix, e.g. “I love this movie. What’s the sentiment of this review? [MASK]”. Prompt template engineering is worth exploring, such as hand-crafted prompts and automated prompts (discrete or continuous). Answer search restricts the label space: the label space of an LLM is usually the whole vocabulary, so it’s necessary to restrict it for downstream tasks. Answer mapping transforms the textual answer into the prediction. This is trivial in cases where the output of the LLM is directly the answer, but it’s necessary when, for example, a sentiment analysis method uses sentiment-bearing words to represent sentiment; the words generated by the LLM then need to be mapped to a sentiment label. Prompt answer engineering can be explored in this step. A minimal end-to-end sketch of the three steps follows.
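Here is a minimal sketch of the three steps using a cloze prompt and a masked LM, assuming the HuggingFace transformers library; the model name and the verbalizer words “great”/“terrible” are illustrative choices of mine, not prescribed by the survey.

```python
# Sketch: prompt addition -> answer search -> answer mapping with a masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Prompt addition: wrap the input in a cloze template.
review = "I love this movie."
prompt = f"{review} It's a {tokenizer.mask_token} movie."

# 2. Answer search: restrict the label space to a few answer tokens
#    instead of the whole vocabulary (illustrative verbalizer).
answer_to_label = {"great": "positive", "terrible": "negative"}

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

scores = {ans: logits[tokenizer.convert_tokens_to_ids(ans)].item()
          for ans in answer_to_label}

# 3. Answer mapping: map the best-scoring answer word to the final label.
best_answer = max(scores, key=scores.get)
print(best_answer, "->", answer_to_label[best_answer])  # e.g. great -> positive
```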

Prompt template engineering

Hand-crafted templates are an intuitive choice, but even experts sometimes fail to design optimal templates. Hence, some researchers put effort into automated prompts. Automated prompts can further be divided into discrete prompts, where the prompt is an actual text string, and continuous prompts, where the prompt is represented directly in the embedding space.

Discrete prompt.

Discrete prompts take the token as the basic unit; prompt engineering is performed at the token level.

  • Prompt mining. Scrape a large text corpus, e.g. Wikipedia, that contains inputs x and outputs y (labels), and find either the middle words or the dependency paths between input and output. Frequent middle words or dependency paths can serve as templates.
  • Prompt paraphrasing. Paraphrasing can be done with back-translation, synonym replacement, or by learning a neural rewriter.
  • Gradient-based search. AutoPrompt is an example: it formulates classification problems as language modeling and proposes methods for automatic prompt template engineering and answer engineering. More specifically, for template engineering it takes the first-order approximation (Taylor expansion) of the change in log-likelihood produced by swapping the \(j\)th trigger token \(x^{(j)}_{trig}\) with another token \(w_{in}\in V\) (see the sketch after this list):
\[V_{cand} = \underset{w_{in}\in V}{\mathrm{top}\text{-}k}\left[\, w_{in}^{\top}\, \nabla \log p(y \mid x_{prompt}) \,\right]\]
  • Prompt generation. Adopt the T5 model to generate prompts in a seq2seq manner.
  • Prompt scoring. Use a unidirectional LM to score the manually filled prompts and select the one with the highest probability.
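Below is a hedged sketch of the gradient-based candidate-selection step: score every vocabulary embedding against the gradient of the log-likelihood with respect to one trigger-token embedding and keep the top-k. It shows only this first-order scoring step, not the full AutoPrompt search loop; the function name and shapes are my own illustration.

```python
import torch

def candidate_tokens(embedding_matrix: torch.Tensor,
                     grad_at_trigger: torch.Tensor,
                     k: int = 10) -> torch.Tensor:
    """
    embedding_matrix: (vocab_size, dim) input embedding table of the LM.
    grad_at_trigger:  (dim,) gradient of log p(y | x_prompt) w.r.t. the
                      embedding of the j-th trigger token x_trig^(j).
    Returns the indices of the k tokens whose first-order approximation
    w_in^T * grad of the change in log-likelihood is largest.
    """
    scores = embedding_matrix @ grad_at_trigger   # (vocab_size,)
    return torch.topk(scores, k).indices

# Usage: grad_at_trigger is obtained by back-propagating the log-likelihood
# and reading the gradient at the trigger token's embedding position.
```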

Continuous prompt (soft prompt).

Soft prompts relax 2 constraints: (1) the embeddings of the template no longer have to be the embeddings of natural-language words; (2) the template is no longer parameterized by the pretrained LLM’s parameters. Instead, the template has its own parameters, which can be tuned on training data.

  • Prefix tuning. Prepend a sequence of continuous task-specific vectors to the input while keeping the LM parameters frozen (a minimal sketch follows this list).
  • Tuning initialized with discrete prompt. A discrete prompt is a good starting point for the continuous prompt.
  • Hard-soft prompt hybrid tuning. Insert tunable embeddings into a hard prompt template.
  • Prompt tuning with rules. Use manually crafted sub-templates, composed into a complete template with logical rules.
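The following is a minimal sketch of fixed-LM soft-prompt tuning, assuming a HuggingFace masked LM: a few learnable “virtual token” embeddings are prepended to the input embeddings while the pretrained LM stays frozen. The class name, model name, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM

class SoftPromptModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_prompt_tokens=10):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(model_name)
        for p in self.lm.parameters():      # freeze the pretrained LM
            p.requires_grad = False
        dim = self.lm.config.hidden_size
        # The soft prompt has its own parameters, tuned on training data.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

    def forward(self, input_ids, attention_mask):
        tok_emb = self.lm.get_input_embeddings()(input_ids)        # (B, T, D)
        batch = input_ids.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)        # (B, P+T, D)
        prompt_mask = torch.ones(batch, prompt.size(1),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```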

Prompt answer engineering

Answer shape, answer space design, and answer mapping are the key parts of prompt answer engineering. Answer shape characterizes the granularity of the answer. Answer space design determines which answers are allowed and how an answer is mapped to the actual prediction; answer space design and mapping are usually tightly coupled.

Answer shape

Common choices include tokens, spans, and sentences. Tokens: a single token from the LLM’s vocabulary, or a subset of the vocabulary. Span: a short multi-token span, usually used together with cloze prompts. Sentence: a sentence or document, usually used together with prefix prompts. In practice, the choice of answer shape depends on the task. Tokens and spans are widely used in classification (e.g. sentiment classification) as well as other tasks (e.g. relation extraction). Longer phrasal or sentential answers are often used in language generation and in multiple-choice QA (where the scores of several phrases are compared with each other) or abstractive QA.

Answer space design.

The key is to obtain the answer space, either from the whole vocabulary of the LLM, from manual effort, from external resources or toolkits, or from training data.

  • Manual design
    • Unconstrained space. In many cases, the answer space is the space of all tokens, fixed-length spans, or token sequences. An identity mapping then turns the answer directly into the final prediction.
    • Constrained space. In many other cases, the answer space is limited. For each label, an answer space or answer list can be designed manually. However, manual design requires expertise and may yield sub-optimal answer spaces. One example fills manually crafted templates with potential answers for the LM to rank: “Template-Based Named Entity Recognition Using BART” manually constructs templates such as “Bangkok is a location” and lets the LM rank them, approaching NER as a ranking problem.
  • Discrete answer search
    • Answer paraphrasing. First define an initial answer space, then obtain a paraphrased answer set to expand the coverage of answers (a mapping from answers to labels is still needed). Finally, the label probability is the marginal probability over all paraphrased answers (see the sketch after this list).
    • Prune-then-search. Initialize with a pruned answer space (which can be the real label space) and then search for more answers. In AutoPrompt, a logistic classifier is first trained that takes the contextual representation of [MASK] as input and predicts the real label. The LLM’s output word embeddings are then fed into this logistic classifier to obtain, for each word, a score for its association with each label. Because the mapping from the contextual representation of [MASK] to the label captures the association between a particular context and a label, a word that scores highly for a label is strongly associated with that label and can be absorbed into the answer space.
    • Label decomposition. When performing relation extraction, “KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction” automatically decomposes each relation label into its constituent words and uses them as answer.
  • Continuous answer search. Very little work explores this direction. WARP (“Word-level Adversarial ReProgramming”) assigns a virtual token embedding to each label and optimizes these virtual embeddings directly, together with the prompt token embeddings.
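Below is a hedged sketch of the answer-paraphrasing idea: the probability of a label is the marginal (sum) of the masked-LM probabilities of all answer words mapped to that label. The answer sets shown are illustrative, not taken from the survey, and the logits are assumed to come from a masked-LM forward pass such as the one in the first sketch.

```python
import torch

def label_probabilities(mask_logits: torch.Tensor, tokenizer, answer_sets: dict):
    """
    mask_logits: (vocab_size,) logits at the [MASK] position.
    answer_sets: label -> list of single-token answer strings (paraphrases).
    Returns label -> marginal probability over its paraphrased answers.
    """
    probs = torch.softmax(mask_logits, dim=-1)
    out = {}
    for label, answers in answer_sets.items():
        ids = tokenizer.convert_tokens_to_ids(answers)
        out[label] = probs[ids].sum().item()   # marginalize over paraphrases
    return out

# Usage with the mask logits and tokenizer from the earlier sketch:
# label_probabilities(logits, tokenizer,
#                     {"positive": ["great", "good", "wonderful"],
#                      "negative": ["terrible", "bad", "awful"]})
```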

Multi-Prompt Learning

  • Prompt ensemble, which uses multiple prompts for the same input and aggregates the multiple outputs (a sketch follows this list). To aggregate the multiple predictions, the following methods can be applied.
    • Uniform average
    • Weighted average
    • Majority voting
    • Knowledge distillation
  • Prompt augmentation (demonstration learning, similar to GPT-3’s in-context learning), which provides additional answered demonstrations in the input text.
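Here is a hedged sketch of prompt ensembling with uniform averaging: the same input is run through several cloze templates and the per-label probabilities are averaged. The templates and the `predict_label_probs` helper (e.g. built from the scoring code in the earlier sketches) are hypothetical placeholders.

```python
def ensemble_predict(review: str, templates, predict_label_probs):
    """
    templates:           e.g. ["{x} It's a [MASK] movie.",
                               "{x} The movie is [MASK]."]
    predict_label_probs: function mapping a filled prompt to {label: prob}.
    Returns the best label and the uniformly averaged label probabilities.
    """
    agg = {}
    for t in templates:
        probs = predict_label_probs(t.format(x=review))
        for label, p in probs.items():
            agg[label] = agg.get(label, 0.0) + p / len(templates)  # uniform average
    return max(agg, key=agg.get), agg
```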

Training strategy

  • Prompt-less finetuning: finetune the LLM on downstream data without any prompt (the classic pretrain–finetune paradigm).
  • Tuning-free prompting: use prompts with a frozen LLM and tune nothing (e.g. in-context learning).
  • Fixed-LLM prompt tuning: keep the LLM frozen and tune only the prompt parameters (e.g. prefix tuning).
  • Fixed-prompt LLM tuning: keep the prompt fixed and finetune the LLM parameters.
  • LLM+prompt tuning: tune both the prompt parameters and the LLM parameters.
