Probing of Contextualized Embeddings

WHAT DO YOU LEARN FROM CONTEXT? PROBING FOR SENTENCE STRUCTURE IN CONTEXTUALIZED WORD REPRESENTATIONS

AIM

Investigate how word-level contextual representations encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena.

EDGE PROBING

Building on recent token-level probing work, the paper introduces a novel edge probing task design and constructs a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline.

Probing Model Architecture

All parameters inside the dashed line (the pretrained encoder) are fixed, while the span pooling and MLP classifiers are trained to extract information from the contextual vectors. The example shown is for semantic role labeling, where \(s^{(1)}\) = [1, 2) corresponds to the predicate (“eat”) and \(s^{(2)}\) = [2, 5) to the argument (“strawberry ice cream”), and the label A1 is predicted as positive while all others are negative. For entity and constituent labeling, only a single span is used.
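As a rough illustration (not the authors' code), freezing the encoder and training only the probing head might look like this in PyTorch; `encoder` and `probe_head` are hypothetical placeholders:

```python
import torch
import torch.nn as nn

def freeze_encoder(encoder: nn.Module) -> None:
    """Fix all encoder parameters (everything 'inside the dashed line')."""
    for p in encoder.parameters():
        p.requires_grad = False

# Only the probe head's parameters are handed to the optimizer, e.g.:
# optimizer = torch.optim.Adam(probe_head.parameters(), lr=1e-4)
```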

Formulation

A sentence is represented as a list of tokens \(T = [t_0, t_1, \ldots, t_n]\), and

a labeled edge as \(\{s^{(1)}, s^{(2)}, L\}\).

We treat \(s^{(1)} = [i^{(1)}, j^{(1)})\) and, optionally, \(s^{(2)} = [i^{(2)}, j^{(2)})\) as (end-exclusive) spans.

\(L\) is a set of zero or more labels from a task-specific label set.

Example: The important thing about Disney is that it is a global \([brand]_1\)

Span \(s^{(1)}\) = [brand], label = NN (noun).

Metric: binary \(F_1\) score.
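For concreteness, here is a hedged illustration of how one labeled-edge example could be represented as plain data. The field names (`tokens`, `span1`, `span2`, `labels`) are hypothetical, and spans are end-exclusive token indices as defined above:

```python
# Illustrative only: one edge-probing example encoded as a plain dict.
example = {
    "tokens": ["The", "important", "thing", "about", "Disney", "is",
               "that", "it", "is", "a", "global", "brand", "."],
    "targets": [
        {
            "span1": [11, 12],   # s^(1) = "brand"
            "span2": None,       # single-span task (e.g. POS labeling)
            "labels": ["NN"],    # L: zero or more labels from the task's label set
        }
    ],
}
```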

TASKS

Part-of-speech tagging, constituent labeling, dependency labeling, named-entity labeling, semantic role labeling, coreference, semantic proto-role labeling (SPR), and relation classification.

PROBING MODEL:

The probing model takes the contextual embedding vectors and integer spans as inputs. A projection layer followed by the self-attention pooling operator of Lee et al. (2017) computes a fixed-length representation for each span; pooling is performed only within the bounds of a span.

The span representations are then fed into a two-layer MLP followed by a sigmoid output layer.

The model is trained by minimizing binary cross-entropy against the target label set \(L\).
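Below is a minimal PyTorch sketch of a probing head under this description: a projection, self-attention span pooling restricted to the span bounds (after Lee et al., 2017), a two-layer MLP, and a sigmoid / binary cross-entropy objective. The class name, dimensions, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EdgeProbe(nn.Module):
    def __init__(self, enc_dim: int, proj_dim: int, n_labels: int, two_span: bool = True):
        super().__init__()
        self.proj = nn.Linear(enc_dim, proj_dim)
        self.attn = nn.Linear(proj_dim, 1)          # scalar attention score per token
        in_dim = proj_dim * (2 if two_span else 1)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, n_labels),
        )
        self.loss = nn.BCEWithLogitsLoss()          # sigmoid + binary cross-entropy

    def pool(self, h: torch.Tensor, span: tuple) -> torch.Tensor:
        """Self-attention pooling over tokens inside the span [i, j) only."""
        i, j = span
        x = self.proj(h[i:j])                       # (j - i, proj_dim)
        w = torch.softmax(self.attn(x), dim=0)      # attention weights within the span
        return (w * x).sum(dim=0)                   # fixed-length span representation

    def forward(self, h, span1, span2=None):
        reps = [self.pool(h, span1)]
        if span2 is not None:                       # pass span2 iff two_span=True
            reps.append(self.pool(h, span2))
        return self.mlp(torch.cat(reps, dim=-1))    # logits over the label set

# Usage (SRL example): logits = probe(contextual_vectors, (1, 2), (2, 5))
#                      loss = probe.loss(logits, multi_hot_target)
```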

Models: four recent contextual encoder models are probed: CoVe, ELMo, OpenAI GPT, and BERT.

How are input embeddings obtained for the edge probing models?

For CoVe and ELMo, the contextual vectors produced by the models are used directly.

For BERT and GPT (which operate on subword units), two methods are used to yield contextual vectors for each token: cat, which concatenates the top-layer activation with the lexical (subword) embedding, and mix, which takes a learned, task-specific scalar mixture of all layers, as in ELMo (see the sketch below).
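The following sketch illustrates the two layer-combination strategies. The function and class names are illustrative; `layers` is assumed to be a list of per-layer activations of shape (seq_len, dim), with `layers[0]` being the lexical (subword embedding) layer.

```python
import torch
import torch.nn as nn

def cat_vectors(layers):
    """'cat': concatenate the top-layer activation with the lexical layer."""
    return torch.cat([layers[-1], layers[0]], dim=-1)

class ScalarMix(nn.Module):
    """'mix': learned, task-specific scalar mixture over all layers (ELMo-style)."""
    def __init__(self, n_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):
        s = torch.softmax(self.weights, dim=0)      # normalized per-layer weights
        return self.gamma * sum(w * h for w, h in zip(s, layers))
```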

Experiment

Lexical baseline: train the probing model directly on the most closely related context-independent word representations (the non-contextual embedding layer of each encoder).

Randomized ELMo (the orthonormal encoder): the recurrent weights of ELMo are re-initialized with random orthonormal matrices, so only the architecture, not the pretrained weights, is retained.

Word-level CNN: extends the lexical baseline with a fixed-width convolutional layer on top of the word representations, giving each token access to a small window of local context (a sketch follows below).
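A minimal sketch of such a word-level CNN baseline; the window size, names, and dimensions are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

class WordLevelCNN(nn.Module):
    def __init__(self, emb_dim: int, out_dim: int, window: int = 1):
        super().__init__()
        # kernel_size = 2 * window + 1 covers `window` tokens on each side
        self.conv = nn.Conv1d(emb_dim, out_dim,
                              kernel_size=2 * window + 1, padding=window)

    def forward(self, word_embs):                   # (seq_len, emb_dim)
        x = word_embs.t().unsqueeze(0)              # (1, emb_dim, seq_len)
        return self.conv(x).squeeze(0).t()          # (seq_len, out_dim)
```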

Results

Encoding of syntactic vs. semantic information

The fact that relevant information is contained in intermediate layers suggests that the top layers may be overly specialized for next-word prediction.

The contextual representations are strictly more expressive than the lexical baselines, since they include access to the lexical representations either by concatenation or by scalar mixing.

The largest gains are on tasks considered to be largely syntactic, such as dependency and constituent labeling, and smaller gains on tasks considered to require more semantic reasoning, such as SPR and Winograd coreference.

Effect of Architecture

The orthonormal encoder improves significantly on the lexical baseline, but overall the learned weights account for over 70% of the improvement of full ELMo.

Encoding of non-local context

The ELMo encoder propagates a large amount of information about constituents, but most of it is local in nature.

However, the improvements that full ELMo does bring are largely due to long-range information.

Paper

Code