Evaluating and Understanding Generative LLMs
Unpublished project completed at the Pr(Ai)2R Group
Introduction
Here I chronicle my research endeavors over the past nine months. Where did they start? By trying to evaluate gender representation bias in short stories generated by language models. Where did they end? With preliminary evidence of high-level decision-making by language models when writing restaurant reviews. How did I get there? I’m not entirely sure myself…
This document is meant to compile my methods, experiments, and conclusions in order to illuminate the general challenges I faced and insights I gained from analyzing the behavior of language models during text generation.
First Steps: Bias in Short Story Generation
What inspires ChatGPT to tell a story? When prompted to write a children’s book, ChatGPT draws from its massive collection of training data to personify animals and concoct imaginative adventures – yet studies show that it also tends to perpetuate stereotypes and biases, associating male characters with war and politics and female characters with family and caregiving (Lucy 2023).
My initial hope was to understand representation bias in a story-telling language model and discover its source – in the training data and in the model’s parameters. To this end, I followed existing works that investigate bias in fictional human-written stories (Fast 2016) and in GPT-generated short stories (Lucy 2023). I found that a language model pre-trained on short stories (Eldan 2023) indeed reflects social biases in the adjectives/verbs associated with gendered pronouns and in the topics of stories with male/female characters. However, my findings left me with questions about the overall meaning of aggregated bias metrics.
Bias Finding I: Context Window Analysis
For my analyses, I chose to use the TinyStories model, a 100M parameter model that learned English solely from GPT-generated short stories (Eldan 2023). With a small enough training dataset, I can understand how biases in the training data are reflected – or potentially amplified – by the model when it generates its own short stories.
Figure 1: Proportion of words that appear in the context window of male pronouns (he/him) on the x-axis and female pronouns (she/her) on the y-axis. Dashed line represents a fully de-biased model.
Following Fast (2016), I evaluated bias in the model by comparing words that appear within an n-word context window around gendered pronouns (“he/him” and “she/her”). Though generally symmetric, I found that certain words such as “gun”, “fish”, and “officer” are more strongly associated with male pronouns, whereas words such as “dress”, “pink”, and “pretty” are more strongly associated with female pronouns in generated stories (see Figure 1).
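To make the measurement concrete, here is a rough sketch of the context-window counting; the pronoun lists, whitespace tokenization, and window size are my own simplifications, and turning the counts into the proportions plotted in Figure 1 is omitted.

```python
# A rough sketch of the context-window association counts (not the exact code
# used here): count the words that appear within n words of a gendered pronoun.
from collections import Counter

MALE, FEMALE = {"he", "him"}, {"she", "her"}

def window_counts(stories: list[str], n: int = 5) -> tuple[Counter, Counter]:
    male, female = Counter(), Counter()
    for story in stories:
        words = story.lower().split()
        for i, w in enumerate(words):
            window = words[max(0, i - n): i] + words[i + 1: i + 1 + n]
            if w in MALE:
                male.update(window)
            elif w in FEMALE:
                female.update(window)
    return male, female
```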
Figure 2: Word clouds of words associated with male pronouns (left, blue) and female pronouns (right, pink). Larger words correspond to greater correlation with the associated pronoun.
Figure 2 shows word clouds of verbs that have strong correlations with male and female pronouns. Again, stereotypes appear to be perpetuated: male characters are associated with “joking” and “flapping” (perhaps again related to fishing?) while female characters are associated with “kissing” and “adoring”.
So okay, the context window analysis reveals some biases in text generated by the TinyStories model. But what does it actually tell us about the short stories themselves? To understand this further, I analyzed the topics of the generated stories.
Bias Finding II: Topic Modeling Analysis
Lucy (2023) employs Latent Dirichlet Allocation (LDA) to explore the topics of GPT-generated stories and finds evidence of gender representation bias in these stories. Following their analysis, I hoped to compare the bias latent in the stories generated by the TinyStories model and the bias in the dataset on which it was trained. My preliminary results corroborate the bias discovered by my earlier experiments, and even suggest some evidence of bias amplification.
LDA is an unsupervised learning method that clusters a series of documents (in our case, short stories) into topics based on shared keywords. Following the analysis conducted in Lucy (2023), I discovered what seems to be representation bias (I recommend using the Mallet LDA model for this analysis; a minimal sketch of the pipeline follows Table 1). Stories with female protagonists follow topics that relate to cooking or going on a picnic, whereas stories with male protagonists follow topics that relate to fishing adventures or playing with toy cars.
| Topic | Topic words |
|---|---|
| cooking | cake, eat, happy, picnic, kitchen |
| mothering? | bird, fly, baby, butterfly, safe |
| fishing | water, fish, lake, swim, frog |
| toy cars | toy, play, car, magnet, open |
Table 1: Main topics analyzed by LDA, including the top 5 topic words, for stories prompted with female protagonists (top half, purple) and male protagonists (bottom half, blue).
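Here is the minimal sketch of the LDA pipeline referenced above. It uses gensim’s built-in LdaModel as a stand-in for the Mallet model, and the placeholder stories and topic count are purely illustrative.

```python
# Sketch of the LDA topic-modeling step, assuming a list of generated stories.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Placeholder stories; in practice, these are the model-generated TinyStories.
stories = [
    "Lily baked a cake and took it to the picnic with her mom.",
    "Tom went to the lake to catch a fish and saw a big frog in the water.",
]
tokens = [simple_preprocess(s) for s in stories]

dictionary = Dictionary(tokens)
# On a real corpus, filter very rare and very common words, e.g.:
# dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(t) for t in tokens]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, top_words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, [w for w, _ in top_words])
```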
Going a step further, I wanted to compare the topics in generated stories to those in the original dataset. To this end, I took two topics each from the LDA results for male and female protagonist stories, and computed the correlation between the main character’s gender (detected by keyword matching to common names and pronouns) and the generated topic (detected by keyword matching from the topic words from the LDA results). Doing this, I found some evidence of bias amplification: the correlation between the gender of the main character and the topic of the story was greater in the model-generated stories than in the pre-training dataset (see Table 2).
| Topic | Training dataset (male) | Training dataset (female) | Model generations (male) | Model generations (female) |
|---|---|---|---|---|
| adventure | 0.64 | 0.46 | 0.68 | 0.40 |
| cars | 0.77 | 0.61 | 0.83 | 0.25 |
| cooking | 0.50 | 0.69 | 0.50 | 0.65 |
| schoolwork | 0.52 | 0.79 | 0.21 | 0.79 |
Table 2: Ratio of stories containing male/female pronouns, broken down by topic. Upper two topics (blue) correspond to stereotypically male topics, and lower two topics (purple) correspond to stereotypically female topics.
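For reference, here is one way to compute the topic–gender co-occurrence in Table 2 via keyword matching; the keyword lists and topic definitions below are illustrative placeholders rather than the exact ones I used.

```python
# Sketch: fraction of stories on each topic that mention male / female keywords.
import re

MALE = {"he", "him", "his", "tom", "ben", "tim"}        # assumed keyword lists
FEMALE = {"she", "her", "hers", "lily", "lucy", "sue"}
TOPICS = {
    "cars": {"toy", "car", "magnet"},
    "cooking": {"cake", "picnic", "kitchen"},
}

def words(story: str) -> set[str]:
    return set(re.findall(r"[a-z']+", story.lower()))

def topic_gender_ratios(stories: list[str]) -> dict[str, tuple[float, float]]:
    ratios = {}
    for topic, keywords in TOPICS.items():
        matched = [w for w in map(words, stories) if w & keywords]
        if not matched:
            continue
        male = sum(bool(w & MALE) for w in matched) / len(matched)
        female = sum(bool(w & FEMALE) for w in matched) / len(matched)
        ratios[topic] = (male, female)
    return ratios
```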
Takeaways
I’d like to point out that my bias evaluations were preliminary and limited – gender is not a binary construct, and a holistic bias evaluation would consider more gender representations (e.g., nonbinary characters referred to with they/them pronouns) as well as a fuller range of story topics. Nevertheless, evidence emerged of gender bias in the TinyStories model (and dataset) that is worth studying further. I think that this is an especially interesting setting in which to explore bias amplification: does gender representation bias (in terms of topic-character correlation) increase between training data and model generations? If so, why?
Here are some challenges I faced when thinking about bias amplification in this context:
- Topic classification: There is a wide variety of story topics in the training dataset, and potentially unlimited topics that the model might generate. To conduct the proposed bias amplification study, we need to be sure that we cover the relevant story topics, and that we classify them with high accuracy.
- Identifying stereotypes: The stereotypes I relied on were self-identified and hence arguable. To conduct a proper analysis, we would need to understand which story topics are stereotypically associated with male characters, female characters, or nonbinary characters.
- Individual bias: Our bias measurement is an aggregate statistic, meaning that the model tends to associate certain topics with certain characters. But what can we say about the model when it’s generating a particular story? There is a similar distinction in the algorithmic fairness literature between group fairness, which relies on aggregate statistics, and individual fairness (Dwork 2012) or counterfactual fairness (Kusner 2017), which apply to individual examples.
The latter challenge prompted me to shift my focus towards interpreting a model’s computation when it tells a short story. When the main character is male, is the model more likely to describe an outdoor adventure? Is the model more likely to assign characteristics such as strong or brave? And is it possible to detect and control these topics and character traits before the story is generated?
Explaining Model Behavior: Counterfactual Short Stories
At this fork, my research pivoted from bias evaluation towards interpretability. After all, when I ask a model to generate a story, it will decide upon such things as whether the main character is male or female, whether they are strong or smart or friendly, and where their adventures lead them. Whereas my original research question was to find the patterns in these generations, I became curious about understanding the decision-making process itself. How does a model decide on a character’s attribute? And how can we use this information to control the high-level features of the short story that it generates?
The Plan: Interchange Intervention Analysis
To analyze the decision-making process of a model, it is helpful to have counterfactual examples across which only a single high-level feature is edited. For example, what would a story look like if its main character were strong instead of smart, while everything else stayed the same? With such counterfactual pairs, we can isolate the components of the model responsible for the high-level feature we are interested in – by comparing the model activations that correspond to a story where the main character is strong instead of smart, we may be able to piece together how a model assigns character traits to its dramatis personae. An ideal end goal would be a table of characteristics for every fictional character in a story, with the ability for a user to tweak those attributes in order to get a new story.
Figure 3: An interchange intervention from a source prompt (orange) to a base prompt (blue), resulting in a counterfactual short story (pink).
So, what does this sort of analysis look like? Imagine that I directly ask a model to generate a story about a character named Lily who is fast. Our goal is to find the mechanism in the model responsible for adhering to the prompt and, at some point in the story, causing Lily to display her speed.
We can isolate this mechanism by performing an interchange intervention between the keyword “fast” and some other prompt about a different character with a different trait (say, strong Tom). Swapping Tom’s trait with “fast” should result in a new, counterfactual story where Tom – and not Lily! – displays his speed. Given these (source, base, counterfactual) triplets, we should be able to try out different swaps until we locate the component responsible for the “fast” trait and only the “fast” trait.
Performing interchange interventions across many character attributes, we can identify the activation component that stores a character’s attribute. Swapping values for that component from the activations of different character traits, we can then steer the model’s generation to give the user transparency and control over the characteristics of the story’s protagonists.
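To make this concrete, here is a simplified sketch of a single interchange intervention using forward hooks on a HuggingFace causal LM. The TinyStories checkpoint, the layer index, and the token positions of the trait words are assumptions for illustration, not the exact setup.

```python
# Sketch of one interchange intervention: cache the activation over the source
# trait token ("fast"), patch it into the base run ("strong"), then generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "roneneldan/TinyStories-33M"   # stand-in storytelling model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

source = "Lily is a fast girl. One day,"
base = "Tom is a strong boy. One day,"
layer = model.transformer.h[4]              # mid layer (architecture-dependent)
src_pos = base_pos = 3                      # positions of "fast" / "strong" (check the tokenization!)

cache = {}

def save_hook(module, inputs, output):
    cache["h"] = output[0][:, src_pos, :].detach().clone()

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > base_pos:          # only on the prompt pass, not later generation steps
        hidden[:, base_pos, :] = cache["h"]
    return output

# 1) Source run: record the trait-token activation.
handle = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(source, return_tensors="pt"))
handle.remove()

# 2) Base run: patch it in and continue the story.
handle = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**tok(base, return_tensors="pt"), max_new_tokens=60)
handle.remove()
print(tok.decode(out[0]))
```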
The Execution: Creating a Counterfactual Story Dataset
For the proposed analysis to work, I needed to create a dataset consisting of counterfactual stories, which differ in only the character trait of their protagonist. This led me on a quest to generate short stories with ChatGPT.
Figure 4: Narrative arc, from here.
To keep the stories consistent, I adopted a five-sentence structure following the narrative arc. I grouped a few pieces together to create the following template:
- Exposition: introduce character and scene
- Inciting incident: introduce problem that character must solve
- Climax: solve the problem using the character’s trait
- Resolution: describe how the problem is solved
- Denouement: conclude the story
I also considered two variants: (1) explicit prompt, where I explicitly describe the trait of the character to ChatGPT (e.g., “Tom is a strong boy”); (2) implicit prompt, where I provide a “previous chapter” in which the character applies their trait to solve a problem (e.g., “In a previous chapter, Tom lifted a heavy stone to find hidden treasure in a pyramid”).
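For concreteness, the two variants might look something like the following; the exact wording is hypothetical, not the prompt I actually used.

```python
# Hypothetical prompt templates for the explicit and implicit variants.
EXPLICIT = (
    "Write a five-sentence story with an exposition, inciting incident, "
    "climax, resolution, and denouement. {name} is a {trait} {role}."
)
IMPLICIT = (
    "In a previous chapter, {prior_event}. Write the next five-sentence "
    "chapter about {name}, following the same narrative arc."
)

print(EXPLICIT.format(name="Tom", trait="strong", role="boy"))
print(IMPLICIT.format(
    prior_event="Tom lifted a heavy stone to find hidden treasure in a pyramid",
    name="Tom",
))
```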
Whereas an explicit prompt might lead to more direct analysis (we can swap activations directly over this token), I found that it also led to more boring and clear-cut stories – ChatGPT would repeatedly describe Tom as “strong” and “powerful” and so on. Meanwhile, implicit prompts tended to result in more interesting stories where the character demonstrates their strength (sort of a “show don’t tell” scenario).
I also think that implicit prompts are somewhat more natural: the training datasets of models like ChatGPT contain lots of fictional characters whose traits may not have been explicitly stated, but demonstrated throughout their narratives. An implicit prompt analysis might give us the opportunity to understand the attributes assigned to popular fictional characters such as Hermione Granger or Albus Dumbledore.
Takeaways
In the end, after trying out several story variants with several language models, I decided not to move forward with the counterfactual short story dataset.
A few challenges I faced and my reflections on them are listed below.
- Explicit / Implicit Stories: I found it difficult to get ChatGPT to be consistent with the character’s main trait, without being super obvious about it. I tended to get lots of “tell not show” stories from ChatGPT, although implicit prompts helped address this problem.
- Dataset Verification: How can I validate the stories generated by ChatGPT? For each story, I needed to make sure that the character name and trait, the story setting and problem, and the story resolution were consistent with the prompt. The larger the dataset, the more difficult this sort of verification becomes.
- Counterfactual Stories: The idea of a counterfactual in this setting might be a bit vague. Is it reasonable to modify the character’s trait without changing anything else? In my experiments I preserved the protagonist’s name, the story setting, and the problem faced – but other factors, such as potential side characters or whether the character even succeeds in resolving the problem, could also influence the generated story.
Ultimately, I hope to revisit the idea of a counterfactual short story dataset. I think that it has the potential to shed light not only on representation bias in character traits, but also on the concept of time and grounded logic in current models (Piper 2021). See Qin 2019 and Zi 2021 for very cool examples of this.
Explaining Model Behavior: Counterfactual Restaurant Reviews
At this point in my scientific journey, I was no longer looking into gender representation bias and no longer working with short stories – at least, not directly. However, my research question broadly remained the same: how can we understand and control the high-level decision making process of a language model during freeform generation?
An apt dataset to help answer this question is the CEBaB dataset (Abraham 2022), which was previously used to benchmark interpretability methods on classification models. CEBaB consists of human-edited counterfactual restaurant reviews which differ in their sentiment towards a single high-level aspect such as food, noise, service, or ambiance in the restaurant.
Figure 5: Interchange intervention from a source review (blue) to a base review (orange), resulting in a counterfactual restaurant review (pink).
Prior analyses employed these counterfactual reviews as inputs to a sentiment classifier in order to identify the model component responsible for sentiment towards a certain aspect. But what would happen if we flipped things around, and used the counterfactual reviews as outputs which differ only in the high-level sentiment? This resolves the challenge we faced with creating our short story dataset! Now, we have human-verified counterfactual generations that we can use to isolate the components that cause a model to decide on whether or not it likes the food in its fictional restaurant review.
Figure 5 illustrates the new interchange intervention setup. It’s nearly identical to the proposed short story analysis: we process a source prompt, and swap its activations into a separate run on a base prompt for a different restaurant. After the swap, we would expect the model to be more likely to output the counterfactual review where the same details of the base example are kept, but the sentiment towards service is drawn from the source example. For my analysis, I use the Phi-3 mini 4k instruct model, a 3.8B parameter model which I figured was good enough at following directions but not so large as to be unwieldy for interpretability analysis.
Note that we don’t really expect the model to output the exact counterfactual – there are many different possible counterfactual reviews, since there are many different ways of expressing negative sentiment towards service. Our hope is that there’s just enough signal to isolate the high-level aspect that we’re interested in: the model should be more likely to be positive towards service than negative towards service after the interchange intervention is performed.
Does the Model Really Decide? Ensuring Model Consistency
There is a hidden assumption in my proposed analysis, which made itself apparent almost immediately: I want to understand how a model makes high-level decisions during generation, but who says that the model really makes these decisions? In the worst-case scenario (for my analysis), the model is just as likely to output positive reviews as it is to output negative reviews, and all decision-making is delegated to sampling the next token!
One way to get around this obstacle is to explicitly prompt the model: “please write a restaurant review where you are happy with the service but unhappy with the food”. But then the analysis might be somewhat less interesting – we’re no longer studying how the decision gets made, only how it gets applied. Hence, I opted for a version of the implicit short-story prompts – I can provide the model a “previous chapter” restaurant review, and ask it to reflect the same sentiments in a new review about a different restaurant. To encourage the model to follow the prompt, I instruct-tuned the model on sampled restaurant reviews that match the previous review in aspect-level sentiment.
Table 3: Accuracy and consistency of an instruct-tuned Phi-3-mini model on generating reviews that match CEBaB reviews in terms of aspect-level sentiment (specifically service).
I evaluated the model on a held-out validation set of one-shot restaurant reviews. For each example, I generated 10 different reviews. I then classified the sentiment of each review using a RoBERTa model that I trained on the CEBaB dataset. This let me compute the consistency of the model: how often do the generated reviews share the same aspect-level sentiment, such as towards service? I could also directly compute the accuracy of the model with regards to my one-shot training: how often does the aspect-level sentiment of a generated review agree with the sentiment of the one-shot example in the prompt?
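A rough sketch of one way to operationalize this consistency and accuracy computation is below; the `classify` function stands in for the RoBERTa sentiment classifier, and the 10-sample setup follows the description above.

```python
# Sketch: consistency = size of the majority sentiment class among the samples;
# accuracy = whether that majority matches the one-shot example's sentiment.
from collections import Counter

def consistency_and_accuracy(samples, one_shot_label, classify):
    labels = [classify(review) for review in samples]
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    consistency = majority_count / len(labels)
    accurate = majority_label == one_shot_label
    return consistency, accurate
```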
Figure 6: Count plot, where the x-axis is the size of the majority sentiment class out of 10 generated reviews for a given one-shot example. Accuracy is determined by whether the majority class is the same as the sentiment of the one-shot example.
My instruct-tuning certainly improved model consistency, but still did not bring it to 100% – perhaps more rigorous fine-tuning, or taking the explicit approach of asking the model to generate a positive/negative review, would result in more consistent generations. Perhaps unsurprisingly, there was a strong correlation between accuracy and consistency: the more a model’s output matched that of the one-shot prompt in sentiment, the more consistent the sentiment was across multiple model generations.
Now that I have a partially-consistent model, my original research question makes sense: given that the model makes a decision as to the high-level sentiment of its generated review – and we know this because (1) its reviews are consistent in their sentiments and (2) we don’t “give away” the sentiment by explicitly stating it in our prompt – can we understand where and how this decision gets made?
A Glimpse of Interpretability: Probing and Distributed Alignment Search
When I ask a model (more specifically, my instruction-tuned, consistent model) to generate a restaurant review, which components are responsible for the sentiment that the model chooses to express towards the service in this fictional restaurant?
To begin to answer this question, it’s helpful to understand where the relevant information is stored in the model. To this end, I trained a small linear probe that takes as input the hidden state of the language model when prompted to generate a restaurant review, and learns to predict the service-level sentiment of the generated review. If this information exists in the model’s activations, then we might have hope that we can identify the mechanism responsible for the generated sentiment.
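A minimal sketch of the probing setup is below, assuming the relevant hidden states and service-sentiment labels have already been cached to disk (the file names and label encoding are placeholders); logistic regression is one simple choice of linear probe.

```python
# Sketch of a linear probe on cached hidden states at a fixed (layer, token) position.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden_states = np.load("service_hidden_states.npy")  # (n_examples, hidden_dim), placeholder path
labels = np.load("service_labels.npy")                 # e.g. 0=negative, 1=positive, 2=unknown

X_train, X_val, y_train, y_val = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_val, y_val))
```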
Figure 7: Accuracy of a linear classifier that predicts the service-level sentiment of a language model given the activation over a particular layer and token, evaluated on a held-out validation set.
And indeed, the probing results seem promising! In agreement with previous interpretability analyses, the relevant sentiment information appears to be “moved” to the last token position around the middle layers.
However, it’s important to note that probing is not a causal analysis. Just because the sentiment information is there, it doesn’t mean that the model actually uses it. In order to be sure that the components we identified through probing have a causal effect on the sentiment outputted by our model, we must check that interchange interventions change the sentiment in the direction that we would expect. The higher the interchange intervention accuracy (IIA) is, the more we can be sure that the component we identified has the effect we predicted.
Figure 8: Interchange intervention loss over training of DAS. Loss is going down, which is more than I expected!
Fortunately, my initial experiments found components with (some) causal effect on the outputted sentiment as well! I ran Distributed Alignment Search (DAS) (Geiger 2024), an automated search over model activations that minimizes an interchange intervention loss – in particular, to compute the loss for a component found by DAS, we swap this component from a source run to a separate base run, and compute the cross-entropy between the resulting model logits and the counterfactual example from CEBaB.
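To give a sense of the objective, here is a conceptual sketch of the DAS swap (not the actual implementation from Geiger 2024 that I ran): learn an orthogonal rotation of a layer’s hidden space, interchange the first k rotated coordinates from the source run into the base run, and train the rotation to minimize the loss on the counterfactual review. The hidden size and subspace size k below are assumptions.

```python
# Conceptual sketch of the DAS intervention: swap a learned k-dimensional
# subspace of the hidden state from the source run into the base run.
import torch
import torch.nn as nn

hidden_dim, k = 3072, 64   # assumed hidden size (Phi-3-mini) and subspace size

rotation = nn.utils.parametrizations.orthogonal(
    nn.Linear(hidden_dim, hidden_dim, bias=False)
)

def das_swap(h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
    """Interchange the first k coordinates of the rotated representation."""
    R = rotation.weight                       # orthogonal, so R.T is its inverse
    z_base, z_source = h_base @ R.T, h_source @ R.T
    z_swapped = torch.cat([z_source[..., :k], z_base[..., k:]], dim=-1)
    return z_swapped @ R                      # rotate back to the original basis

# Training (schematic): patch das_swap(...) into the base forward pass at the chosen
# layer/position, compute the LM cross-entropy on the counterfactual review, and
# backpropagate through the rotation parameters only.
```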
Table 4: (Left) Interchange intervention accuracy on a held-out validation set. 3-way includes an “unknown” category, where service sentiment is not expressed. (Right) Sampled example of original-model review, source review, and DAS-generated review.
After running DAS, I evaluated the interchange intervention accuracy based on how often an interchange intervention over the discovered component resulted in an output with the intended service-level sentiment (as classified by a RoBERTa model trained on CEBaB). The accuracy over only positive/negative reviews (ignoring ones where the service-level sentiment is “unknown”) is 40% – perhaps far from explaining the entire behavior of the model, but also not too shabby, especially considering that the model is only consistent 60% of the time to begin with!
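For completeness, the IIA computation itself is simple; the sketch below assumes an `intervened_generate` helper that performs the DAS swap and samples a review, and a `classify` helper for the RoBERTa sentiment classifier (both hypothetical names).

```python
# Sketch: fraction of interventions whose generated review carries the
# service-level sentiment of the source example.
def interchange_intervention_accuracy(examples, intervened_generate, classify):
    correct = 0
    for source_prompt, base_prompt, target_sentiment in examples:
        review = intervened_generate(source_prompt, base_prompt)
        correct += classify(review) == target_sentiment
    return correct / len(examples)
```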
Takeaways
The fact that we can identify the component responsible for sentiment generated in a restaurant review is encouraging: even without a fully-consistent model, in a setting where we don’t have perfect counterfactual examples, and where we cannot rely on a single logit to determine whether the model did/did not express positive/negative sentiment towards service, current interpretability methods can still make headway in understanding where the decision about the service-level sentiment gets made.
Future work that integrates the rest of the aspects in CEBaB (food, noise, and ambiance), and understands the interplay between these aspects in model-generated reviews, can shed more light on interpretability in the context of free-form generation. I think it’s also worthwhile to consider additions to the interchange-intervention loss that incorporate contrastive examples, as in DPO. For example, we know that after an interchange intervention, the sentiment towards service should become more positive than negative – perhaps this is best conveyed through a DPO-style loss between a positive and a negative example, instead of only the example of positive sentiment.
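As one possible shape for such a loss, here is a small sketch loosely inspired by the DPO objective; it drops DPO’s reference-model term and simply prefers the positive-sentiment counterfactual over a negative-sentiment one after the intervention, with `logp_pos` and `logp_neg` standing for the summed token log-probabilities of each review under the intervened model.

```python
# Sketch of a contrastive, DPO-flavored interchange-intervention loss.
import torch
import torch.nn.functional as F

def contrastive_intervention_loss(
    logp_pos: torch.Tensor, logp_neg: torch.Tensor, beta: float = 0.1
) -> torch.Tensor:
    return -F.logsigmoid(beta * (logp_pos - logp_neg)).mean()
```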
Conclusion
I hope that this document illuminates a bit of my research pathway, and highlights the connections and challenges of bias analysis and interpretability research with generative models.
There are still many open questions I have about understanding model mechanisms during free-form generation.
- How can we set up contrastive outputs that give us the right signal with which to interpret a model? Can we compare logits across sample generations in the same way that we compare logits over a single token such as in IOI?
- How can we understand mechanisms that might get applied at different token positions? Is there a single mechanism responsible for generating positive sentiment, and if so, does it get activated at each new token generated by the model?
- How many of the decisions made during a model’s generation can even be attributed to its internal computations, instead of just the “luck of the draw” from the tokens we sampled? Perhaps methods for search over the sample space can complement mechanistic interpretability research that focuses on the model activations.
However, I believe that these questions are answerable and may provide valuable insights into related questions on bias and model evaluation. If we can make transparent the connection that a model makes between a restaurant’s cuisine and its food/noise/service/ambiance, or the connection that it makes between a character’s name and their personality trait, then we can give users a deeper understanding and greater control over the model’s internal biases.