Model Description

Keyphrase extraction is a text-analysis task that identifies the words and phrases that best summarize a passage. For example, from the paragraph about vision language models used below, an extractor might return phrases such as "computer vision" and "natural language processing".

The tinyBert-keyword model is a fine-tuned version of the huawei-noah/TinyBERT_General_4L_312D model, tailored specifically for keyphrase extraction.

huawei-noah/TinyBERT_General_4L_312D is a distilled version of BERT, specifically designed to be smaller and faster for general NLP tasks.

  • Finetuned from: huawei-noah/TinyBERT_General_4L_312D
  • Model size: 14.3M parameters (F32, safetensors)

How to use

import torch
import difflib
from transformers import AutoTokenizer, AutoModelForTokenClassification

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the fine-tuned tokenizer and token-classification model
tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword")
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device)
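# Note (assumption): the decoding loop below expects BIO-style labels ("B-…"/"I-…").
# You can confirm the exact label names of this checkpoint with:
#   print(model.config.id2label)   # e.g. {0: 'O', 1: 'B-KEY', 2: 'I-KEY'} — names may differ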
text = """
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information.
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language.
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner.
Types of Vision Language Models:

Visual-BERT: VisualBERT is a popular VLM that uses a combination of visual feature extractors and language models.
LXMERT: LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a VLM designed for visual reasoning and question answering tasks.
VL-BERT: VL-BERT (Visual-Linguistic BERT) is a VLM that uses a transformer-based architecture to model visual and textual representations.
"""
id2label = model.config.id2label  # mapping from class index to label string

# Tokenize the input text
tokenized = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_offsets_mapping=True,
    return_tensors="pt",
)
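# Note: return_offsets_mapping=True also returns "offset_mapping", character spans
# that could map predictions straight back to the original text; the fuzzy-matching
# approach used further below works from the token strings instead.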

input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)

# Run inference and take the highest-scoring label for each token
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs.logits, dim=2)

# Map token ids back to token strings and predicted ids to label strings
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
token_predictions = [id2label[pred.item()] for pred in predictions[0]]
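# tokens and token_predictions are now parallel lists, e.g. (illustrative values only):
#   tokens:            ['[CLS]', 'computer', 'vision', ':', ...]
#   token_predictions: ['O',     'B-KEY',    'I-KEY',  'O',  ...]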
entities = []
current_entity = None

# Group consecutive B-/I- labels into keyphrase spans (BIO decoding)
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)):
    if pred.startswith("B-"):
        # A new keyphrase begins; flush any phrase still in progress
        if current_entity:
            entities.append(current_entity)
        current_entity = {"type": pred[2:], "start": idx, "text": token}
    elif pred.startswith("I-") and current_entity:
        # Continuation of the current keyphrase
        current_entity["text"] += f" {token}"
    elif current_entity:
        # An "O" label ends the current keyphrase
        entities.append(current_entity)
        current_entity = None

# Flush the last keyphrase if the text ends inside one
if current_entity:
    entities.append(current_entity)
# Collect the raw keyphrase strings
keywords = [entity["text"] for entity in entities]

def clean_keyword(keyword):
    # Strip WordPiece continuation markers, e.g. "trans ##form ##ers" -> "transformers"
    return keyword.replace(" ##", "")

def find_closest_word(keyword, word_positions):
    # Fuzzy-match a cleaned keyword back to a word from the original text
    keyword_cleaned = clean_keyword(keyword)
    best_match = None
    best_score = 0.0

    for pos, word in word_positions.items():
        # Compare case-insensitively: the tokenizer lowercases, the source text may not
        score = difflib.SequenceMatcher(None, keyword_cleaned.lower(), word.lower()).ratio()
        if score > 0.8 and score > best_score:
            best_match = word
            best_score = score

    return best_match or keyword_cleaned
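
# Illustrative example: a keyword decoded as "trans ##form ##ers" is cleaned to
# "transformers", then compared against every source word; if no word clears the
# 0.8 similarity threshold, the cleaned keyword itself is returned.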
# Map word index -> source word with trailing punctuation stripped
words = text.split()
word_positions = {i: word.strip(".,:") for i, word in enumerate(words)}

# Snap each predicted keyword back to the closest word in the source text
cleaned_keywords = []
for keyword in keywords:
    closest_word = find_closest_word(keyword, word_positions)
    cleaned_keywords.append({'text': closest_word})

# Deduplicate case-insensitively, keeping the first occurrence
# (a distinct variable name avoids shadowing the `text` passage above)
unique_keywords = {}
for item in cleaned_keywords:
    key = item['text'].lower()
    if key not in unique_keywords:
        unique_keywords[key] = item

cleaned_keywords_unique = list(unique_keywords.values())

# Keep at most the first five keyphrases
final_keywords = cleaned_keywords_unique[:5]

text_values = [item['text'] for item in final_keywords]
print(text_values)
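
For quick experiments, the same checkpoint can usually be driven through the transformers token-classification pipeline, which handles tokenization and merges B-/I- spans itself. A minimal sketch, assuming the checkpoint's labels follow the standard BIO scheme that the pipeline's aggregation expects:

from transformers import pipeline

extractor = pipeline(
    "token-classification",
    model="nirusanan/tinyBert-keyword",
    aggregation_strategy="simple",  # merge B-/I- tokens into whole phrases
)

results = extractor("Vision language models combine computer vision and natural language processing.")
print([r["word"] for r in results])

This skips the manual BIO decoding and WordPiece cleanup above, at the cost of less control over how phrases are snapped back to the source text.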