playground

Semantic Analysis with Deep Learning¶

In this notebook, we will experiment with large language models to see whether they capture our semantic understanding of English lexical items, and whether we can establish new semantic understanding!

Set-Up¶

Import modules, set up utility functions.

In [44]:

import pickle

import pandas as pd

import torch
from torch.nn.functional import cosine_similarity

from transformers import AutoTokenizer, AutoModel

In [46]:

def rank_by_distance(df, index, repr='break_rep_layer24', dist_fn=cosine_similarity):
    row = df.iloc[index]
    representation = row[repr]
    distances = df.apply(lambda r: dist_fn(representation, r[repr]), axis=1)

    sorted_df = df.copy()
    sorted_df['distance'] = distances

    sorted_df = sorted_df.sort_values(by='distance', ascending=False)

    return sorted_df

Analysis of the Verb Break¶

Analysis of the verb break, following Petersen et al., 2022.

In [47]:

# load Petersen's dataset
weights_name = 'roberta-large'
with open(f"reps/{weights_name.replace('/', '_')}_df.pickle", "rb") as f:
    df = pickle.load(f)

Linguist-Annotated Meanings¶

Explore the meanings of break that linguists have annotated.

In [48]:

df[['sentence', 'meaning', 'construction']].head()

Out[48]:

	sentence	meaning	construction
txt file
wlp_mag_1990.txt	Is this picture too bleak? Well, it is just possible that all the imponderable factors will break right.	happen	unaccusative
wlp_acad_2005.txt	Opponents thus engaged in many of the strategies of resistance that feminists have suggested are necessary to" break mothering free of ideological encapsulation." (n54) Despite these attempts to challenge mandatory testing and maternal ideology simultaneously, opponents were not able to break free of maternal ideology completely.	break_free_escape	causative
wlp_acad_2001.txt	Networking with other music teachers, both new and experienced, can be a professional lifeline for beginning music teachers and can help to break the isolation of the first years of teaching.	end	causative
wlp_mag_2007.txt	"Am I good enough to be No. 1? Sure, but who's gon na break Tiger's legs? Laughs I want to be the best.	separate_into_parts	causative
wlp_fic_2008.txt	"When are you going to break down and say yes?" Josie's cell phone pealed and vibrated simultaneously in her pocket.	break_down_succumb	unaccusative

In [49]:

sample = df.sample().iloc[0]
print('Sentence:', sample['sentence'])
print('Meaning:', sample['meaning'])
print('Construction:', sample['construction'])

Sentence: When, for example, she learned on Aug. 21 that the phone-harassment story was about to break, Di took the extraordinary step of summoning Richard Kay, the Daily Mail correspondent who has become her champion.
Meaning: reveal
Construction: unaccusative

In [50]:

probe_subset = df['meaning'].value_counts()
probe_subset = probe_subset[probe_subset >= 10]

print(len(probe_subset))
probe_subset

Out[50]:

separate_into_parts               150
end                               126
decipher                           62
break_down_separate_into_parts     61
violate                            59
break_up_separate_into_parts       35
surpass                            34
break_down_destroy                 31
break_into_intrude                 28
reveal                             26
appear                             25
break_through_pass_through         24
render_inoperable                  23
unclassified                       21
break_down_render_inoperable       21
break_free_escape                  19
break_down_succumb                 18
cause_to_fail                      17
break_up_end_relationship          17
break_up_end                       16
break_out_escape                   15
break_even_profit=loss             14
succumb                            13
break_out_start                    12
experience_sorrow                  11
break_away_detach                  10
break_off_end                      10
Name: meaning, dtype: int64

Model Representations¶

Explore what model representations look like. Can we make any sense of these?

In [13]:

sample = df.sample().iloc[0]

sample['break_rep_layer24']

Out[13]:

tensor([[-0.0964, -0.1548, -0.1657,  ...,  0.1349,  0.0659,  0.1067]])

In [14]:

sample['break_rep_layer24'].shape

Out[14]:

torch.Size([1, 1024])

Do These Representations Capture Meaning?¶

While we can't make any sense of these representations directly, we can use vector distances between the representations to see that they are context-modulated, and capture semanticists' understandings of the senses of break!

In [51]:

# let's pick a random sentence
index = 12
sample = df.iloc[index]

print('Sentence:', sample['sentence'])
print('Meaning:', sample['meaning'])
print('Construction:', sample['construction'])

Sentence: I had the impression the scumbag was already seeing someone else so I had to break it off.
Meaning: break_off_end
Construction: causative

In [54]:

# what are the closest sentences?
ranked_df = rank_by_distance(df, index)
ranked_df[['sentence', 'meaning', 'construction']].head()

Out[54]:

	sentence	meaning	construction
txt file
wlp_news_1993.txt	I had the impression the scumbag was already seeing someone else so I had to break it off.	break_off_end	causative
wlp_fic_2011.txt	I told Hanne to break things off with Todd a long time ago.	break_off_end	causative
wlp_fic_1995.txt	We did have an affair, and I did break it off kind of abruptly.	break_off_end	causative
wlp_fic_1991.txt	we must n't break it off or do anything rash until we've explored every avenue," she said ambiguously.	break_off_end	causative
wlp_news_2005.txt	You can break it off, give the ring back and free yourself.Source: Judith Sherven and James Sniechowski, co-authors of" The Smart Couple's Guide to the Wedding of Your Dreams," or just go to www.smart weddingcouples.com##3045682 Piled on a table are plastic bags of drugs, medication worth an estimated $3,000.	break_off_end	causative

In [55]:

print(ranked_df.iloc[:5]['sentence'].values)

['I had the impression the scumbag was already seeing someone else so I had to break it off.'
 'I told Hanne to break things off with Todd a long time ago.'
 'We did have an affair, and I did break it off kind of abruptly.'
 'we must n\'t break it off or do anything rash until we\'ve explored every avenue," she said ambiguously.'
 'You can break it off, give the ring back and free yourself.Source: Judith Sherven and James Sniechowski, co-authors of" The Smart Couple\'s Guide to the Wedding of Your Dreams," or just go to www.smart weddingcouples.com##3045682 Piled on a table are plastic bags of drugs, medication worth an estimated $3,000.']

Our Own Analysis!¶

Petersen et al. used a somewhat large hand-annotated dataset, but we don't need this in order to explore the semantics of these neural networks!

In [56]:

sentences = [
    # physical manifestation (tome)
    'This book has a flexible cover',
    'The book you gave me was full of stains',
    # text
    'This book has an extremely controversial point to make',
    'The book you gave me is full of great insights',
    # linguistic realization of content
    'This book has been translated into 5 languages',
    'The book is popular for its bold word choice and literary style',
    # the characters in the text
    'This book you gave me is full of misprints',
    'The book has a bunch of typos',
    # metaphors/idioms
    'This is one for the books',
    'They got a subpeona to examine our books'
]

key_words = ['book'] * 8 + ['books'] * 2

df = pd.DataFrame({'sentence': sentences, 'key_word': key_words})

In [57]:

# load our model
tokenizer = AutoTokenizer.from_pretrained(weights_name)
model = AutoModel.from_pretrained(weights_name)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [58]:

from break_utils import hf_encode, get_indices, hf_represent, get_break_reps

df['key_ids'] = df['key_word'].apply(lambda w: hf_encode(f" {w}", tokenizer)[0]) 
df['ids'] = df['sentence'].apply(lambda x: get_indices(x, tokenizer))
df[f'reps'] = df['ids'].apply(lambda x: hf_represent(x, model))

In [62]:

layer = 24
df[f'rep_layer{layer}'] = df.apply(
                lambda row: get_break_reps(row, row['key_ids'], layer=layer), axis=1)

In [63]:

index = -1
sample = df.iloc[index]

print('Sentence:', sample['sentence'])

Sentence: They got a subpeona to examine our books

In [64]:

ranked_df = rank_by_distance(df, index, repr=f'rep_layer{layer}')
ranked_df[['sentence', 'distance']].head()

Out[64]:

	sentence	distance
9	They got a subpeona to examine our books	[tensor(1.0000)]
0	This book has a flexible cover	[tensor(0.9915)]
1	The book you gave me was full of stains	[tensor(0.9913)]
6	This book you gave me is full of misprints	[tensor(0.9910)]
4	This book has been translated into 5 languages	[tensor(0.9908)]

Compositional Analysis?¶

We likely won't have time for this, but I was trying to experiment with negation to see whether we can detect any "lessening" effect on intense adjectives like "terrible."

In [39]:

weights = 'roberta-base'

tokenizer = AutoTokenizer.from_pretrained(weights)
model = AutoModel.from_pretrained(weights)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [40]:

text = 'This is not terrible'

tokens = tokenizer(text)

In [158]:

import pandas as pd

sentences = [
    'This is terrible',
    'This is awful',
    'This is not terrible',
    'This is bad',
    'This is not bad',
    'This is okay',
    'This is good',
    'This is not good',
    'This is very terrible',
    'This is the worst',
    "This isn't great",
    'This is great'
]

key_words = [
    'terrible',
    'awful',
    'terrible',
    'bad',
    'bad',
    'okay',
    'good',
    'good',
    'terrible',
    'worst',
    'great',
    'great'
]

df = pd.DataFrame({'sentence': sentences, 'key_word': key_words})

In [159]:

from break_utils import hf_encode, get_indices, hf_represent, get_break_reps

df['key_ids'] = df['key_word'].apply(lambda w: hf_encode(f" {w}", tokenizer)[0]) 

In [160]:

df['ids'] = df['sentence'].apply(lambda x: get_indices(x, tokenizer))

In [161]:

df[f'reps'] = df['ids'].apply(lambda x: hf_represent(x, model))

In [162]:

layer = 12
df[f'rep_layer{layer}'] = df.apply(
                lambda row: get_break_reps(row, row['key_ids'], layer=layer), axis=1)

In [169]:

from torch.nn.functional import cosine_similarity, pairwise_distance

index = 0

row = df.iloc[index]

representation = row[f'rep_layer{layer}']

distances = df.apply(lambda r: cosine_similarity(representation, r[f'rep_layer{layer}']), axis=1)

In [170]:

sorted_df = df.copy()
sorted_df['distances'] = distances

sorted_df = sorted_df.sort_values(by='distances', ascending=False)

In [165]:

sorted_df[['sentence', 'distances']]

Out[165]:

	sentence	distances
2	This is not terrible	[tensor(1.)]
0	This is terrible	[tensor(0.9426)]
8	This is very terrible	[tensor(0.9358)]
4	This is not bad	[tensor(0.9163)]
1	This is awful	[tensor(0.9132)]
10	This isn't great	[tensor(0.8949)]
3	This is bad	[tensor(0.8888)]
9	This is the worst	[tensor(0.8711)]
6	This is good	[tensor(0.8663)]
5	This is okay	[tensor(0.8652)]
11	This is great	[tensor(0.8636)]
7	This is not good	[tensor(0.8601)]

In [171]:

sorted_df[['sentence', 'distances']]

Out[171]:

	sentence	distances
0	This is terrible	[tensor(1.0000)]
8	This is very terrible	[tensor(0.9854)]
1	This is awful	[tensor(0.9566)]
2	This is not terrible	[tensor(0.9426)]
3	This is bad	[tensor(0.9290)]
11	This is great	[tensor(0.9149)]
9	This is the worst	[tensor(0.8989)]
6	This is good	[tensor(0.8979)]
5	This is okay	[tensor(0.8908)]
10	This isn't great	[tensor(0.8889)]
4	This is not bad	[tensor(0.8852)]
7	This is not good	[tensor(0.8651)]

Quantifier Scope and Ambiguities?¶

This is slightly more ambitious, but I was hoping to see whether visualizing attention can give us any insights into scope ambiguities...

In [67]:

# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [72]:

sentence = "Not every boy said that some person has a dog"

inputs = tokenizer.encode_plus(sentence, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list) 

In [73]:

head_view(attention, tokens)

Layer: