Semantics with LMs
Linguist 230B (Advanced Semantics)
Semantic Analysis with Deep Learning¶
In this notebook, we will experiment with large language models to see whether they capture our semantic understanding of English lexical items, and whether we can establish new semantic understanding!
Set-Up¶
Import modules, set up utility functions.
import pickle
import pandas as pd
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoTokenizer, AutoModel
def rank_by_distance(df, index, repr='break_rep_layer24', dist_fn=cosine_similarity):
row = df.iloc[index]
representation = row[repr]
distances = df.apply(lambda r: dist_fn(representation, r[repr]), axis=1)
sorted_df = df.copy()
sorted_df['distance'] = distances
sorted_df = sorted_df.sort_values(by='distance', ascending=False)
return sorted_df
Analysis of the Verb Break¶
Analysis of the verb break
, following Petersen et al., 2022.
# load Petersen's dataset
weights_name = 'roberta-large'
with open(f"reps/{weights_name.replace('/', '_')}_df.pickle", "rb") as f:
df = pickle.load(f)
Linguist-Annotated Meanings¶
Explore the meanings of break
that linguists have annotated.
df[['sentence', 'meaning', 'construction']].head()
sentence | meaning | construction | |
---|---|---|---|
txt file | |||
wlp_mag_1990.txt | Is this picture too bleak? Well, it is just possible that all the imponderable factors will break right. | happen | unaccusative |
wlp_acad_2005.txt | Opponents thus engaged in many of the strategies of resistance that feminists have suggested are necessary to" break mothering free of ideological encapsulation." (n54) Despite these attempts to challenge mandatory testing and maternal ideology simultaneously, opponents were not able to break free of maternal ideology completely. | break_free_escape | causative |
wlp_acad_2001.txt | Networking with other music teachers, both new and experienced, can be a professional lifeline for beginning music teachers and can help to break the isolation of the first years of teaching. | end | causative |
wlp_mag_2007.txt | "Am I good enough to be No. 1? Sure, but who's gon na break Tiger's legs? Laughs I want to be the best. | separate_into_parts | causative |
wlp_fic_2008.txt | "When are you going to break down and say yes?" Josie's cell phone pealed and vibrated simultaneously in her pocket. | break_down_succumb | unaccusative |
sample = df.sample().iloc[0]
print('Sentence:', sample['sentence'])
print('Meaning:', sample['meaning'])
print('Construction:', sample['construction'])
Sentence: When, for example, she learned on Aug. 21 that the phone-harassment story was about to break, Di took the extraordinary step of summoning Richard Kay, the Daily Mail correspondent who has become her champion. Meaning: reveal Construction: unaccusative
probe_subset = df['meaning'].value_counts()
probe_subset = probe_subset[probe_subset >= 10]
print(len(probe_subset))
probe_subset
27
separate_into_parts 150 end 126 decipher 62 break_down_separate_into_parts 61 violate 59 break_up_separate_into_parts 35 surpass 34 break_down_destroy 31 break_into_intrude 28 reveal 26 appear 25 break_through_pass_through 24 render_inoperable 23 unclassified 21 break_down_render_inoperable 21 break_free_escape 19 break_down_succumb 18 cause_to_fail 17 break_up_end_relationship 17 break_up_end 16 break_out_escape 15 break_even_profit=loss 14 succumb 13 break_out_start 12 experience_sorrow 11 break_away_detach 10 break_off_end 10 Name: meaning, dtype: int64
Model Representations¶
Explore what model representations look like. Can we make any sense of these?
sample = df.sample().iloc[0]
sample['break_rep_layer24']
tensor([[-0.0964, -0.1548, -0.1657, ..., 0.1349, 0.0659, 0.1067]])
sample['break_rep_layer24'].shape
torch.Size([1, 1024])
Do These Representations Capture Meaning?¶
While we can't make any sense of these representations directly, we can use vector distances between the representations to see that they are context-modulated, and capture semanticists' understandings of the senses of break
!
# let's pick a random sentence
index = 12
sample = df.iloc[index]
print('Sentence:', sample['sentence'])
print('Meaning:', sample['meaning'])
print('Construction:', sample['construction'])
Sentence: I had the impression the scumbag was already seeing someone else so I had to break it off. Meaning: break_off_end Construction: causative
# what are the closest sentences?
ranked_df = rank_by_distance(df, index)
ranked_df[['sentence', 'meaning', 'construction']].head()
sentence | meaning | construction | |
---|---|---|---|
txt file | |||
wlp_news_1993.txt | I had the impression the scumbag was already seeing someone else so I had to break it off. | break_off_end | causative |
wlp_fic_2011.txt | I told Hanne to break things off with Todd a long time ago. | break_off_end | causative |
wlp_fic_1995.txt | We did have an affair, and I did break it off kind of abruptly. | break_off_end | causative |
wlp_fic_1991.txt | we must n't break it off or do anything rash until we've explored every avenue," she said ambiguously. | break_off_end | causative |
wlp_news_2005.txt | You can break it off, give the ring back and free yourself.Source: Judith Sherven and James Sniechowski, co-authors of" The Smart Couple's Guide to the Wedding of Your Dreams," or just go to www.smart weddingcouples.com##3045682 Piled on a table are plastic bags of drugs, medication worth an estimated $3,000. | break_off_end | causative |
print(ranked_df.iloc[:5]['sentence'].values)
['I had the impression the scumbag was already seeing someone else so I had to break it off.' 'I told Hanne to break things off with Todd a long time ago.' 'We did have an affair, and I did break it off kind of abruptly.' 'we must n\'t break it off or do anything rash until we\'ve explored every avenue," she said ambiguously.' 'You can break it off, give the ring back and free yourself.Source: Judith Sherven and James Sniechowski, co-authors of" The Smart Couple\'s Guide to the Wedding of Your Dreams," or just go to www.smart weddingcouples.com##3045682 Piled on a table are plastic bags of drugs, medication worth an estimated $3,000.']
Our Own Analysis!¶
Petersen et al. used a somewhat large hand-annotated dataset, but we don't need this in order to explore the semantics of these neural networks!
sentences = [
# physical manifestation (tome)
'This book has a flexible cover',
'The book you gave me was full of stains',
# text
'This book has an extremely controversial point to make',
'The book you gave me is full of great insights',
# linguistic realization of content
'This book has been translated into 5 languages',
'The book is popular for its bold word choice and literary style',
# the characters in the text
'This book you gave me is full of misprints',
'The book has a bunch of typos',
# metaphors/idioms
'This is one for the books',
'They got a subpeona to examine our books'
]
key_words = ['book'] * 8 + ['books'] * 2
df = pd.DataFrame({'sentence': sentences, 'key_word': key_words})
# load our model
tokenizer = AutoTokenizer.from_pretrained(weights_name)
model = AutoModel.from_pretrained(weights_name)
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias'] - This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
from break_utils import hf_encode, get_indices, hf_represent, get_break_reps
df['key_ids'] = df['key_word'].apply(lambda w: hf_encode(f" {w}", tokenizer)[0])
df['ids'] = df['sentence'].apply(lambda x: get_indices(x, tokenizer))
df[f'reps'] = df['ids'].apply(lambda x: hf_represent(x, model))
layer = 24
df[f'rep_layer{layer}'] = df.apply(
lambda row: get_break_reps(row, row['key_ids'], layer=layer), axis=1)
index = -1
sample = df.iloc[index]
print('Sentence:', sample['sentence'])
Sentence: They got a subpeona to examine our books
ranked_df = rank_by_distance(df, index, repr=f'rep_layer{layer}')
ranked_df[['sentence', 'distance']].head()
sentence | distance | |
---|---|---|
9 | They got a subpeona to examine our books | [tensor(1.0000)] |
0 | This book has a flexible cover | [tensor(0.9915)] |
1 | The book you gave me was full of stains | [tensor(0.9913)] |
6 | This book you gave me is full of misprints | [tensor(0.9910)] |
4 | This book has been translated into 5 languages | [tensor(0.9908)] |
Compositional Analysis?¶
We likely won't have time for this, but I was trying to experiment with negation to see whether we can detect any "lessening" effect on intense adjectives like "terrible."
weights = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(weights)
model = AutoModel.from_pretrained(weights)
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias'] - This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
text = 'This is not terrible'
tokens = tokenizer(text)
import pandas as pd
sentences = [
'This is terrible',
'This is awful',
'This is not terrible',
'This is bad',
'This is not bad',
'This is okay',
'This is good',
'This is not good',
'This is very terrible',
'This is the worst',
"This isn't great",
'This is great'
]
key_words = [
'terrible',
'awful',
'terrible',
'bad',
'bad',
'okay',
'good',
'good',
'terrible',
'worst',
'great',
'great'
]
df = pd.DataFrame({'sentence': sentences, 'key_word': key_words})
from break_utils import hf_encode, get_indices, hf_represent, get_break_reps
df['key_ids'] = df['key_word'].apply(lambda w: hf_encode(f" {w}", tokenizer)[0])
df['ids'] = df['sentence'].apply(lambda x: get_indices(x, tokenizer))
df[f'reps'] = df['ids'].apply(lambda x: hf_represent(x, model))
layer = 12
df[f'rep_layer{layer}'] = df.apply(
lambda row: get_break_reps(row, row['key_ids'], layer=layer), axis=1)
from torch.nn.functional import cosine_similarity, pairwise_distance
index = 0
row = df.iloc[index]
representation = row[f'rep_layer{layer}']
distances = df.apply(lambda r: cosine_similarity(representation, r[f'rep_layer{layer}']), axis=1)
sorted_df = df.copy()
sorted_df['distances'] = distances
sorted_df = sorted_df.sort_values(by='distances', ascending=False)
sorted_df[['sentence', 'distances']]
sentence | distances | |
---|---|---|
2 | This is not terrible | [tensor(1.)] |
0 | This is terrible | [tensor(0.9426)] |
8 | This is very terrible | [tensor(0.9358)] |
4 | This is not bad | [tensor(0.9163)] |
1 | This is awful | [tensor(0.9132)] |
10 | This isn't great | [tensor(0.8949)] |
3 | This is bad | [tensor(0.8888)] |
9 | This is the worst | [tensor(0.8711)] |
6 | This is good | [tensor(0.8663)] |
5 | This is okay | [tensor(0.8652)] |
11 | This is great | [tensor(0.8636)] |
7 | This is not good | [tensor(0.8601)] |
sorted_df[['sentence', 'distances']]
sentence | distances | |
---|---|---|
0 | This is terrible | [tensor(1.0000)] |
8 | This is very terrible | [tensor(0.9854)] |
1 | This is awful | [tensor(0.9566)] |
2 | This is not terrible | [tensor(0.9426)] |
3 | This is bad | [tensor(0.9290)] |
11 | This is great | [tensor(0.9149)] |
9 | This is the worst | [tensor(0.8989)] |
6 | This is good | [tensor(0.8979)] |
5 | This is okay | [tensor(0.8908)] |
10 | This isn't great | [tensor(0.8889)] |
4 | This is not bad | [tensor(0.8852)] |
7 | This is not good | [tensor(0.8651)] |
Quantifier Scope and Ambiguities?¶
This is slightly more ambitious, but I was hoping to see whether visualizing attention can give us any insights into scope ambiguities...
# Load model and retrieve attention weights
from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
sentence = "Not every boy said that some person has a dog"
inputs = tokenizer.encode_plus(sentence, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
head_view(attention, tokens)