Fine-tuning a pretrained BART (Bidirectional and Auto-Regressive Transformers) model on the task of Keyphrase Generation
Model structure:
BART is a denoising autoencoder built as a sequence-to-sequence model that is applicable to a very wide range of end tasks. Pretraining has two stages: (1) text is corrupted with an arbitrary noising function, and (2) a sequence-to-sequence model is learned to reconstruct the original text. BART uses a standard Transformer-based neural machine translation (seq2seq) architecture with a bidirectional encoder (like BERT) and a left-to-right autoregressive decoder (like GPT). This means the encoder’s attention mask is fully visible, as in BERT, while the decoder’s attention mask is causal, as in GPT-2. (https://arxiv.org/pdf/1910.13461v1.pdf)
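As an illustration of this encoder-decoder setup, below is a minimal fine-tuning sketch using the Hugging Face transformers library; the checkpoint name ("facebook/bart-base"), the example document, and the ";"-separated target format are assumptions for illustration, not details taken from the experiment itself.

    from transformers import BartTokenizer, BartForConditionalGeneration

    # Load a pretrained BART checkpoint (assumed checkpoint name).
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # One source document and its gold keyphrases joined into a single target string.
    document = "We study keyphrase generation with a sequence-to-sequence model ..."
    keyphrases = "keyphrase generation; sequence-to-sequence; bart"

    inputs = tokenizer(document, max_length=512, truncation=True, return_tensors="pt")
    labels = tokenizer(keyphrases, max_length=130, truncation=True, return_tensors="pt").input_ids

    # One training step: the bidirectional encoder reads the document and the
    # left-to-right decoder is trained to reconstruct the keyphrase sequence.
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()

In practice an optimizer step and a dataloader loop would follow; the snippet only shows how the encoded document and the target keyphrase sequence are wired together.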
Parameters (a generation call using these values is sketched after this list):
— bad words ids: None, List of token ids that are not allowed to be generated; these are the token ids of words that should not appear in the generated text.
— bos token id: 0, The id of the ’beginning-of-sequence’ token.
— decoder start token id: 2, If an encoder-decoder model starts decoding with a different token than ’bos’, the id of that token.
— diversity penalty: 0.0, This value is subtracted from a beam’s score if it generates a token that has already been generated by a beam from another group at the same time step. Note that ’diversity penalty’ is only effective if ’group beam search’ is enabled.
— do sample: False, Whether or not to use sampling; use greedy decoding otherwise.
— early stopping: True, Whether to stop the beam search when at least ’num beams’ sentences are finished per batch or not.
— eos token id: 2, The id of the ’end-of-sequence’ token.
— length penalty: 2.0, Exponential penalty to the length. 1.0 means no penalty. Set to a value < 1.0 in order to encourage the model to generate shorter sequences, or to a value > 1.0 in order to encourage it to produce longer ones.
— max length: 130, The maximum length of the sequence to be generated.
— min length: 30, The minimum length of the sequence to be generated.
— no repeat ngram size: 3, If set to an int > 0, all ngrams of that size can only occur once.
— num beam groups: 1, Number of groups to divide ’num beams’ into in order to ensure diversity among different groups of beams. (https://arxiv.org/pdf/1610.02424.pdf)
— num beams: 7, Number of beams for beam search. 1 means no beam search.
— num return sequences: 1, The number of independently computed returned sequences for each element in the batch.
— pad token id: 1, The id of the ’padding’ token.
— repetition penalty: 1.0, The parameter for repetition penalty. 1.0 means no penalty. (https://arxiv.org/pdf/1909.05858.pdf)
— temperature: 1.0, The value used to modulate the next token probabilities.
— top k: 50, The number of highest probability vocabulary tokens to keep for top-k filtering.
— top p: 1.0, If set to a float < 1, only the most probable tokens with probabilities that add up to ’top p’ or higher are kept for generation.
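The following is a sketch of how these decoding settings translate into a single generate() call; it reuses the model, tokenizer, and inputs from the fine-tuning sketch above, so the input text is a placeholder rather than a document from the experiment.

    # Generate keyphrases with the decoding parameters listed above.
    generated = model.generate(
        inputs["input_ids"],
        num_beams=7,
        num_beam_groups=1,
        do_sample=False,
        early_stopping=True,
        min_length=30,
        max_length=130,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
        repetition_penalty=1.0,
        num_return_sequences=1,
    )
    predicted_keyphrases = tokenizer.decode(generated[0], skip_special_tokens=True)

With do_sample=False and num_beams=7, temperature, top k, and top p have no effect; they only matter when sampling is enabled.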
Experimental results:
Three evaluation metrics, precision, recall, and F-measure (F1), are employed to measure the algorithm’s performance. Following the standard definitions, precision is the number of correctly predicted keyphrases over the number of all predicted keyphrases, and recall is the number of correctly predicted keyphrases over the total number of ground-truth keyphrases. Note that, when determining whether two keyphrases match, we apply the Porter Stemmer as a preprocessing step. The dataset used in the experiment is a subset of ’kp20k’: 10000 documents were used for training (8000 documents) and validation (2000 documents), and a further 2000 documents were used for testing.
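A sketch of this evaluation protocol is given below: both gold and predicted keyphrases are stemmed with the Porter Stemmer before exact matching, and precision, recall, and F1 are computed over the top-k predictions. The example lists are a truncated version of one of the good-result cases shown below, not a new result.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def normalize(phrase):
        # Stem every word so that e.g. "power dissipations" matches "power dissipation".
        return " ".join(stemmer.stem(w) for w in phrase.lower().split())

    def precision_recall_f1_at_k(gold, predicted, k=5):
        gold_set = {normalize(p) for p in gold}
        pred = [normalize(p) for p in predicted[:k]]
        correct = sum(1 for p in pred if p in gold_set)
        precision = correct / len(pred) if pred else 0.0
        recall = correct / len(gold_set) if gold_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        return precision, recall, f1

    gold = ["heart rate variability", "respiratory phase", "respiratory sinus arrhythmia"]
    pred = ["heart rate variability", "respiratory phase", "respiratory-phase analysis",
            "respiratory sinus arrhythmia", "swallowing"]
    print(precision_recall_f1_at_k(gold, pred, k=5))  # (0.6, 1.0, 0.75)

Prediction @ m in the examples below uses the same computation with k set to the number of phrases the model actually produced instead of a fixed cutoff.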
Analysis of result:
Good results:
— prediction @ 5:
• Ground truth: human-machine interaction; composite model; automation surprises; mode confusion; refusal state; blocking state
• Predicted: automation surprises; human-machine interaction; mode confusion; refusal state; blocking state; reachability problem; composite model; machine-user interaction
• Ground truth: heart rate variability; respiratory phase; respiratory sinus arrhythmia
• Predicted: heart rate variability; respiratory phase; respiratory-phase analysis; respiratory sinus arrhythmia; swallowing; coughing; vocalization; reading aloud; conversation
— prediction @ m:
• Ground truth: field programmable gate array; fpga; function decomposition; mapping; power dissipation; delay; area; power; power minimization
• Predicted: network; fpgas; boolean decomposition techniques; configurable logic blocks; function decomposition; power dissipations; use; obdds; addressing; map; paper; field programmable gate arrays; boolean functions; decomposition; delay; method; functional; area; fpga synthesis; mapping algorithm; configurability; power; logic; logic cad; synthesis; algorithm
• Ground truth: augmented reality; interactive modeling; multiple object tracking; structure from motion; image retrieval; online structure from motion; 3d modeling; image-based tracking
• Predicted: augmented reality; structure from motion; multiple object tracking; interactive modeling
Bad results:
— prediction @ 5:
• Ground truth: drought indices; monthly rainfall time series; spatsim
• Predicted: meteorological drought; monthly rainfall data; drought index; precipitation index; drought maps; spell/run analysis; computer-aided process planning; drought assessment
• Ground truth: standard cell design; low power circuit design; near threshold operation; pn ratio optimization
• Predicted: vlsi; standard cell library; near-threshold voltage; p/n well boundaries; power consumption reduction; low-power design
— prediction @ m:
• Ground truth: motif; atomset; backward pruning; bitonic partitioning; recursive fuzzy hashing; self-adaptive expansion
• Predicted: data mining; structural motifs; parallel algorithms; cluster computing; cluster clustering; clustering strategies; cluster environments; cluster parallelization; cluster scheduling; cluster communication
• Ground truth: breast cancer; ais; fuzzy rules induction; acgh; data mining; ifrais
• Predicted: fuzzy rule induction system (ifrais); artificial immune system (ais); breast cancer characterization; genomic aberrations; gene Ontology (gol)