Less Knowledge Than Llama 3.3 70b?
I haven't run a full test, but even Llama 4 Maverick, with its 402b parameters, appears to have less broad English knowledge, and a higher hallucination rate, than Llama 3.3 70b.
For example, when asked twice "Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?", Maverick responded both times with "The character Debra Barone, Robert Barone's wife, was played by actress Doris Roberts on the TV show Everybody Loves Raymond."
To start with, Llama 3.3 70b, DeepSeek V3, Gemini, GPT-4, Sonnet... all get this one right, which isn't surprising since it's a basic question about the main cast of a very popular TV show that appears multiple times on Wikipedia and hundreds of times across major websites.
And not only did the massive Maverick get it wrong, it got it wrong in a way only very small models ever did. Firstly, Debra is married to Robert's brother Ray, and the two of them are the show's leading characters; it's called Everybody Loves RAYMOND, after all. Secondly, Debra is portrayed by the notable actress Patricia Heaton, yet Maverick claimed she was played by Doris Roberts, the much older actress who portrayed Ray and Robert's mother on the show.
And this isn't an isolated incident. Maverick consistently struggled to get basic questions right about the main characters of popular movies and TV shows (e.g. Two and a Half Men). And even when you don't target specific characters and just ask for the main cast, it reliably makes major mistakes that other similarly sized models don't make.
For example, when asked "What are the 6 main characters, and the actors who portrayed them, on the TV show Corner Gas? Don't add details, just list them. And what year did the show first air?", Maverick responded with the following, and it was still wrong even after taking the time to try to correct itself. Note: Corner Gas is the most-watched Canadian TV show, yet Maverick got even the main character, Brent Leroy, portrayed by Brent Butt, horribly wrong, plus made numerous other major mistakes. How is this even possible? Tiny Llama 3.2 3b didn't make these kinds of mistakes.
Here is the corrected list:
Brent Fitzpatrick - Eric Tagg
Lacey Burrows - Gabrielle Miller
Wanda Dollard - Janet Wright
Emma Leroux - Erin Karpluk
Oscar Leroux - Brent Briscoe
Hank Yarbo - Eric Peterson
Maverick's issues extend beyond factual mistakes. Its responses can also be incoherent, self-contradictory, and grammatically incorrect, as illustrated by the following response, in which it attempts to correct its earlier mistake.
"You are correct that Debra Barone is Ray Barone's wife, played by Patricia Heaton, not Robert Barone's wife. Robert Barone's wife is Marie Barone's character is actually played by Doris Roberts, and she is Ray and Robert's mother. Robert's wife, Amy McDougall Barone, is played by Monica Horan."
I ran this past other models (e.g. Gemini) and they identified the same errors I did. And the sentence "Robert Barone's wife is Marie Barone's character is actually played by Doris Roberts, and she is Ray and Robert's mother." is so bad that it's entertaining. And then it doesn't even segue into the correct answer.
But it did eventually get the correct answer. Plus I can reliably get correct answers out of Maverick, despite its egregious factual errors, by simply changing the wording of the prompt, including with the above cast of Corner Gas. This isn't normal: I can rarely get correct answers out of other LLMs that make comparably egregious factual errors just by rewording the prompt. This makes me suspect that the model is innately healthy, but that at some point a rectifiable error was made that ended up scrambling the weights.
My suggestion would be to use the base model and let it adapt to the conversation from context. Maybe some difference between finetuning and pretraining scrambled the weights?
That's a good idea. I'm definitely going to test the base model when I get more RAM and can run it locally.
However, the weight scrambling in both the official release and the talkative LMSYS Experimental versions is so severe and pervasive that it's hard to imagine fine-tuning alone could have done it.