notes

Catching Up: Backlogged Impressions of NAACL 2024

research

August 23, 2024 5-minute read

Recollecting Mexico

NAACL 2024 🇲🇽 took place over a month ago, but its heat is here to stay.

NAACL 2024 featured 565 Main Conference and 304 Findings papers, selected from 2434 papers that were mostly submitted during the December 2023 ARR cycle. The acceptance rate for the Main Conference was 23%.

Below, I've listed 10 🔥 papers from the event.

10 🔥 Papers

In no particular order, these papers were fire:

Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation (Yoon et al.)

Applying LLMs for content recommendation seems to be of increasing interest, and this work provides additional evidence of two limitations of LLM-augmented recommendation: (1) their outputs can lack diversity, and (2) they may be overly optimistic. The ramifications for search are intruiging. The paper also discusses prompting as a possible solution to address those issues, but it is also clear that future efforts could benefit from going beyond prompting.

Can LLM’s Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis (Dorbala et al.)

When LLM/VLMs fill in for human annotators in embodied navigation, 83.3% of participants mark synthetic annotations as greater than "decent" accuracy---promising news, considering that annotations for embodied navigation are particularly costly. Plus, embodied agents perform almost as well when following synthetic instructions. The catch: 43% of participants described that synthetic instructions were very different from what they wrote. These insights indicate a promising opportunity for co-annotation in embodied navigation.

MacGyver: Are Large Language Models Creative Problem Solvers? (Tian et al.)

This dataset helps evaluate LLMs for everyday innovation, an ability that humans use for daily tasks but one that is understudied in LLMs. For example, "You need to separate the egg whites from the yolks. You have a turkey baster, a bowl, a slotted spoon, and a chopstick ... How do you separate the egg whites from the yolks with only these items?" (Wow, what a stressful situation!) While both humans and LLMs struggle, humans tend to either succeed or fail---whereas LLMs usually provide responses that combine elements of both right and wrong. Interestingly, humans excel in domains they know well (e.g., household tasks), whereas LLMs are more successful in other domains (e.g., gardening/fishing). Maybe humans and LLMs can combine their strengths?

Adjusting Interpretable Dimensions in Embedding Space with Human Judgements (Erk and Apidianaki)

Interpreting latent spaces is difficult, but informative, for CSS, NLP, and ++ applications. For word embeddings, the standard way to obtain an interpretable dimensions is via seed pair of antonyms (e.g., safe vs calm), but choosing seeds is difficult and may require trial and error. This work uses human annotations by ranking words (e.g., on a "danger" scale: dolphin=2.1, tiger=4.9) to help obtain interpretable dimensions that align closely with human judgements. With more ranked annotations, we can get better aligned dimensions.

NLP Systems That Can’t Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps (Gligoric et al.)

Like earlier bag-of-words models, LLMs still underperform at interpreting words in context... Specifically, they conflate the meaning of words when they are used vs mentioned. For example, the meaning of "banana" is different in the following: "Bananas have peels" vs "Bananas has 7 letters." Lacking nuance can lead to blanket judgements and outcomes (e.g., all online posts containing "bananas are born greedy" are censored). The suggested remedy is to explicitly define the use-mention distinction in the prompt and use CoT, which can help LLMs at use-mention distinctions and making judgements in content moderation and counterspeech applications.

Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering (Srikanth et al.)

QA systems should address surface questions and the involved subtext, such as false assumptions posed in the question. Understanding the subtext through pragmatic inference is paramount, especially in the high-risk domain of maternal health: for the question, "Can I color my hair after giving birth?", it is necessary to answer both the surface question (e.g., "It is safe for you to color your hair") and the subtext (e.g., "Hair dye chemicals cannot pass through breast milk from mother to child"). In this work, experts rated QA systems with induced pragmatic behavior as more helpful and informative.

Crafting In-context Examples according to LMs' Parametric Knowledge (Lee et al.)

How helpful are in-context examples when they are encoded vs not encoded in the LLM's parametric knowledge? In fact, a mixture of both "known" and "unknown" exemplars is most helpful---and this trend generalizes to multiple tasks. Further, sorting questions by increasing difficulty (i.e., asking easier questions first) can improve performance. This seems to align well with what we have seen from decomposed QA; awesome!

HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM (Wang et al.)

Longer responses are commonly, but sometimes mistakenly, judged as being more helpful. This well-known bias of reward models can hurt the response quality of the LLMs they train. HelpSteer introduces a dataset with five attributes of quality annotations to increase annotation granularity: helpfulness, correctness, coherence, complexity, and verbosity. With this dataset, the resulting LLM can be steered to adapt to the desired style at inference time.

Finding Replicable Human Evaluations via Stable Ranking Probability (Riley et al.)

While these recommendations for stability in human evaluation studies are intuitive, it is interesting to see empirical support. In essence: (1) raters should be given examples where the dependent variable is only system A vs B, (2) raters should be given equal workload, (3) scores should be Z-score normalized per rater, and (4) each pair should only be rated once.

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong (Si et al.)

This investigation of LLMs vs search engines reveals interactions across search efficiency, explanation accuracy, and explanation format for the user's ability to verify a given claim. Usually, LLMs can make users more efficient than search engines by providing tailored explanations. However, users become inaccurate at verifying the claim when the explanations are wrong. When dual explanations are given for the possibility of a claim being true or false, users are accurate, but they are less efficient. These dynamics reveal that LLMs, by themselves, are currently unable to achieve an overall improvement over traditional search engines.

Takeaways

As the wave of prompting research continues to surge, NAACL had truly intriguing advancements. I'm excited to see what developments occur next.

Footnotes

As the average yearly temperature in Mexico City rises, so does the number of papers at each consecutive NAACL ◡̈ .

Previous notes

What is up w/ Japanese Zoos?! April 26, 2024 A First-Timer's Impressions of AAAI 2024 (w/ Friends) March 03, 2024