Push tokenizer again

#7
by tomaarsen HF Staff - opened
Sentence Transformers - Cross-Encoders org

In this PR, I'm repushing the tokenizer with a newer version of transformers, with the goal of also generating the tokenizer.json used by the tokenizers-backed fast tokenizers. There should not be any change in the tokenizer's behavior; the only difference is that Transformers and Sentence Transformers can now load the fast tokenizer directly, without having to convert it from the slow tokenizer. This should make loading the model faster.
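For anyone who wants to verify this locally, a minimal check (assuming a recent transformers release) is to load the tokenizer and confirm that the fast backend is picked up:

from transformers import AutoTokenizer

# With tokenizer.json present in the repo, AutoTokenizer loads the Rust-backed
# fast tokenizer directly instead of converting it from the slow tokenizer.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L12-v2")
print(tokenizer.is_fast)  # expected: True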

  • Tom Aarsen
tomaarsen changed pull request status to merged

@tomaarsen With the new version, the scores I'm getting are vastly different compared to those from the previous versions. Are the scores normalized?

Sentence Transformers - Cross-Encoders org

Previous versions of Sentence Transformers, or previous revisions of this model?

I haven't updated ST, so it has to be the repo update; I noticed when I re-loaded the model to test something.

Sentence Transformers - Cross-Encoders org

Thanks for letting me know, I'll try and reproduce this.

Sentence Transformers - Cross-Encoders org

I'm not able to reproduce it right now; could you share a few more details?


from sentence_transformers import CrossEncoder

model_name = "cross-encoder/ms-marco-MiniLM-L12-v2"
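# Compare the commit from before the tokenizer update (pinned by revision hash)
# with the current revision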
model_old = CrossEncoder(model_name, revision="da094b5a1ec84ff5e21eafb6c7141c037b167efb")
model_new = CrossEncoder(model_name)

query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model_old.predict([(query, passage) for passage in passages])
print(scores)
scores = model_new.predict([(query, passage) for passage in passages])
print(scores)
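# Output: both revisions produce identical scores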
[-7.1846724  9.501249   5.721795   6.847083 ]
[-7.1846724  9.501249   5.721795   6.847083 ]
  • Tom Aarsen
Sentence Transformers - Cross-Encoders org

I can reproduce it with older versions now; I'll roll back the changes.

The difference is in the default activation function that gets applied. With an older Sentence Transformers version, this model used to return raw values without an activation function (i.e. not bound to [0, 1]), whereas it now applies a Sigmoid.
My apologies for this regression; I'll make sure this is reverted by the end of tomorrow.
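
If you need the old behavior right away, a stopgap sketch is to pin the activation explicitly; note that the default_activation_function parameter name below is from older Sentence Transformers releases and may be spelled differently in newer ones:

from sentence_transformers import CrossEncoder
import torch

# Pin the activation so scores stay comparable across versions:
# Identity() returns the raw logits (unbounded); use torch.nn.Sigmoid()
# instead if you want scores mapped into [0, 1].
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L12-v2",
    default_activation_function=torch.nn.Identity(),
)
scores = model.predict([("Which planet is known as the Red Planet?",
                         "Mars, known for its reddish appearance, is often referred to as the Red Planet.")])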

  • Tom Aarsen

@tomaarsen ok, that makes sense.

thanks for looking into it!

Sentence Transformers - Cross-Encoders org

I've reverted it! When I updated the models under the https://huggingface.co/cross-encoder organization, I accidentally removed, for about 70% of them, the changes I had made to my code to prevent the config, tokenizer, etc. from being reuploaded. As a consequence, their configuration was updated to the modern v4+ format, which old Sentence Transformers versions don't recognize. For some of the models, this meant that the non-default activation function stored in the new v4+ format was not recognized, and they fell back to the default Sigmoid activation function.

I've resolved it for all models now. Thanks a bunch for reporting this!
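
If you want to double-check on your side, older Sentence Transformers releases expose the resolved activation on the model (the attribute name below is from those releases and may differ in newer versions):

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L12-v2")
# After the revert, this should report Identity() rather than Sigmoid()
print(model.default_activation_function)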

@tomaarsen thanks for the quick response. glad it got sorted. cheers!
