Model Description
Keyphrase extraction is a text analysis technique that identifies the most important phrases in a passage of text.
The tinyBert-keyword model is a fine-tuned version of huawei-noah/TinyBERT_General_4L_312D, tailored for keyphrase extraction.
huawei-noah/TinyBERT_General_4L_312D is a distilled version of BERT, designed to be smaller and faster for general NLP tasks.
- Fine-tuned from: huawei-noah/TinyBERT_General_4L_312D
How to use
import difflib

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword")
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device)
text = """
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information.
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language.
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner.
Types of Vision Language Models:
Visual-Bert: Visual-BERT (Bilinear Pooling for Visual Question Answering) is a popular VLM that uses a combination of visual feature extractors and language models.
LXMERT: LXMERT (Large Scale Instance and Instance-Specific Multimodal Representation Learning) is a VLM designed for visual reasoning and question answering tasks.
VL-BERT: VL-BERT (Visual Large Language Bert) is a VLM that uses a transformer-based architecture to model visual and textual representations.
"""
id2label = model.config.id2label

# Tokenize the text; truncation keeps the input within the model's maximum length.
tokenized = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_offsets_mapping=True,
    return_tensors="pt"
)
input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# Pick the highest-scoring label for every token.
predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
token_predictions = [id2label[pred.item()] for pred in predictions[0]]
# Group consecutive B-/I- tagged tokens into entities.
entities = []
current_entity = None
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)):
    if pred.startswith("B-"):
        # A new phrase starts; close the previous one first.
        if current_entity:
            entities.append(current_entity)
        current_entity = {"type": pred[2:], "start": idx, "text": token}
    elif pred.startswith("I-") and current_entity:
        current_entity["text"] += f" {token}"
    elif current_entity:
        # Any other label ends the current phrase.
        entities.append(current_entity)
        current_entity = None
if current_entity:
    entities.append(current_entity)
keywords = [entity['text'] for entity in entities]
def clean_keyword(keyword):
    # Merge WordPiece continuation tokens ("key ##phrase" -> "keyphrase").
    return keyword.replace(" ##", "")

def find_closest_word(keyword, word_positions):
    # Map a reconstructed keyword back to the most similar word in the original text.
    keyword_cleaned = clean_keyword(keyword)
    best_match = None
    best_score = 0.0
    for pos, word in word_positions.items():
        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()
        # Accept only close matches (similarity above 0.8) and keep the best one.
        if score > 0.8 and score > best_score:
            best_match = word
            best_score = score
    return best_match or keyword_cleaned
words = text.split()
word_positions = {i: word.strip(".,") for i, word in enumerate(words)}
cleaned_keywords = []
for keyword in keywords:
    closest_word = find_closest_word(keyword, word_positions)
    cleaned_keywords.append({'text': closest_word})
# Deduplicate case-insensitively, keeping the first occurrence of each keyword.
unique_keywords = {}
for item in cleaned_keywords:
    key = item['text'].lower()  # separate name so the input `text` is not overwritten
    if key not in unique_keywords:
        unique_keywords[key] = item
cleaned_keywords_unique = list(unique_keywords.values())

# Keep at most the first five keywords.
final_keywords = cleaned_keywords_unique[:5]
text_values = [item['text'] for item in final_keywords]
print(text_values)
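The script above requests return_offsets_mapping=True but recovers the original wording through fuzzy matching with difflib. As an alternative, the character offsets can be used to slice each phrase straight out of the input text. The following is a minimal sketch, not the author's method: the helper name spans_from_offsets is hypothetical, and the label names B-KEY / I-KEY are assumed for illustration only (the real names come from model.config.id2label).

```python
def spans_from_offsets(text, offsets, labels):
    """Group BIO-labelled tokens into phrases by slicing the original text.

    offsets: one (start, end) character span per token.
    labels:  one BIO label per token (label names here are assumed).
    """
    phrases = []
    start = end = None  # character span of the phrase being built
    for (tok_start, tok_end), label in zip(offsets, labels):
        if tok_start == tok_end:
            # Special and padding tokens carry an empty span; skip them.
            continue
        if label.startswith("B-"):
            # A new phrase begins; flush the one in progress, if any.
            if start is not None:
                phrases.append(text[start:end])
            start, end = tok_start, tok_end
        elif label.startswith("I-") and start is not None:
            # Extend the current phrase to cover this token.
            end = tok_end
        else:
            # Any other label closes the current phrase.
            if start is not None:
                phrases.append(text[start:end])
            start = end = None
    if start is not None:
        phrases.append(text[start:end])
    return phrases


# Hand-written demo data, so the helper can be tried without loading the model.
demo_text = "Vision language models combine computer vision and NLP."
demo_offsets = [(0, 0), (0, 6), (7, 15), (16, 22), (23, 30),
                (31, 39), (40, 46), (47, 50), (51, 54), (54, 55), (0, 0)]
demo_labels = ["O", "B-KEY", "I-KEY", "I-KEY", "O",
               "B-KEY", "I-KEY", "O", "B-KEY", "O", "O"]
demo_phrases = spans_from_offsets(demo_text, demo_offsets, demo_labels)
print(demo_phrases)
```

With the model loaded as in the script above, the inputs would presumably be tokenized["offset_mapping"][0].tolist() and token_predictions. Because the phrases are sliced from the input, subword pieces are rejoined with their original casing and spacing and no similarity threshold is needed.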