Push tokenizer again

#7
by tomaarsen HF Staff - opened
Sentence Transformers - Cross-Encoders org

In this PR, I'm repushing the tokenizer with a newer version of transformers, with the goal of also generating the tokenizer.json used by the tokenizers-backed fast tokenizers. There should not be any change in the tokenizer's behavior; the only difference is that Transformers and Sentence Transformers can now load the fast tokenizer directly, without having to convert it from the slow tokenizer. This should make loading the model faster.
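For anyone who wants to verify this locally, a minimal check (assuming a recent transformers release) is to load the tokenizer and confirm that the fast backend is picked up:

from transformers import AutoTokenizer

# With tokenizer.json present in the repo, AutoTokenizer loads the Rust-backed
# fast tokenizer directly instead of converting it from the slow tokenizer.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L12-v2")
print(tokenizer.is_fast)  # expected: True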

  • Tom Aarsen
tomaarsen changed pull request status to merged

@tomaarsen With the new version, the scores I'm getting are vastly different compared to those from the previous versions. Are the scores normalized?

Sentence Transformers - Cross-Encoders org

Previous versions of Sentence Transformers, or previous revisions of this model?

I haven't updated ST, so it has to be the repo update; I noticed when I re-loaded the model to test something.

Sentence Transformers - Cross-Encoders org

Thanks for letting me know, I'll try and reproduce this.

Sentence Transformers - Cross-Encoders org

I'm not able to reproduce it right now; could you share a few more details?


from sentence_transformers import CrossEncoder

model_name = "cross-encoder/ms-marco-MiniLM-L12-v2"
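# Compare the commit from before the tokenizer update (pinned by revision hash)
# with the current revision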
model_old = CrossEncoder(model_name, revision="da094b5a1ec84ff5e21eafb6c7141c037b167efb")
model_new = CrossEncoder(model_name)

query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model_old.predict([(query, passage) for passage in passages])
print(scores)
scores = model_new.predict([(query, passage) for passage in passages])
print(scores)
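# Output: both revisions produce identical scores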
[-7.1846724  9.501249   5.721795   6.847083 ]
[-7.1846724  9.501249   5.721795   6.847083 ]
  • Tom Aarsen
Sentence Transformers - Cross-Encoders org

I can reproduce it with older versions now; I'll roll back the changes.

The difference is in the default activation function that gets applied. With an older Sentence Transformers version, this model used to return raw values without an activation function (i.e. not bound to [0, 1]), whereas it now applies a Sigmoid.
My apologies for this regression; I'll make sure this is reverted by the end of tomorrow.
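
If you need the old behavior right away, a stopgap sketch is to pin the activation explicitly; note that the default_activation_function parameter name below is from older Sentence Transformers releases and may be spelled differently in newer ones:

from sentence_transformers import CrossEncoder
import torch

# Pin the activation so scores stay comparable across versions:
# Identity() returns the raw logits (unbounded); use torch.nn.Sigmoid()
# instead if you want scores mapped into [0, 1].
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L12-v2",
    default_activation_function=torch.nn.Identity(),
)
scores = model.predict([("Which planet is known as the Red Planet?",
                         "Mars, known for its reddish appearance, is often referred to as the Red Planet.")])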

  • Tom Aarsen

@tomaarsen ok, that makes sense.

thanks for looking into it!

Sentence Transformers - Cross-Encoders org

I've reverted it! When I updated the models under the https://huggingface.co/cross-encoder organization, I accidentally removed, for about 70% of them, the changes I had made to my code to prevent the config, tokenizer, etc. from being reuploaded. As a consequence, their configuration was updated to the modern v4+ format, which old Sentence Transformers versions don't recognize. For some of the models, this meant that the non-default activation function stored in the new v4+ format was not recognized, and they fell back to the default Sigmoid activation function.

I've resolved it for all models now. Thanks a bunch for reporting this!
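
If you want to double-check on your side, older Sentence Transformers releases expose the resolved activation on the model (the attribute name below is from those releases and may differ in newer versions):

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L12-v2")
# After the revert, this should report Identity() rather than Sigmoid()
print(model.default_activation_function)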

@tomaarsen thanks for the quick response. glad it got sorted. cheers!
