Model Card for gibranlynardi/catboost_prediksi_aplikasi_merek


🧠 Model Details

Description

KeloraAI is an open-source multimodal classification system designed to predict the success of trademark applications in Indonesia. It integrates visual and textual embeddings to assess the likelihood of an application being registered (Didaftar) or rejected (Ditolak).

It consists of three CatBoost classification models:

  • trademark_catboost_model_image.cbm: Logo-based image embedding model using DINOv2
  • trademark_catboost_model_textdense.cbm: Dense text embedding model using Multilingual E5
  • trademark_catboost_model_textsparse.cbm: Sparse text embedding model using OpenSearch Neural Sparse

Authors

  • Gibran Tegar Ramadhan Putra Lynardi (Universitas Indonesia)
  • Harish Azka Firdaus (Universitas Indonesia)
  • Daffa Syafitra (Universitas Indonesia)

Languages

Multilingual (supports Bahasa Indonesia and English)

Finetuned From

  • Text (Dense): intfloat/multilingual-e5-large-instruct
  • Text (Sparse): opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1
  • Image: facebook/dinov2-with-registers-large

✅ Uses

Direct Use

  • Predicts trademark application outcome ("Didaftar" or "Ditolak")
  • Inputs: brand name, class description, and optionally logo image
  • Target users: SMEs, legal consultants, regulators

Downstream Use

  • Can be integrated into brand registration portals and legal screening tools

Out-of-Scope Use

  • Should not be used as a sole legal decision-making tool
  • Limited generalization due to dataset scope

⚠️ Bias, Risks, and Limitations

  • Class Imbalance: Majority of data are successful registrations
  • Visual Limitation: Logos only available for generic brands
  • Data Coverage: Historical data subset only
  • Interpretability: SHAP used but human validation still essential

πŸ› οΈ Getting Started

Example: Text Dense Model

```python
import re

import catboost as cb
from sentence_transformers import SentenceTransformer

# Load the trained dense-text CatBoost model
model = cb.CatBoostClassifier()
model.load_model("trademark_catboost_model_textdense.cbm")

# Embed the cleaned brand name with Multilingual E5
embedding_model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
brand = "LOGO+SUPER"
# Strip "logo" tokens and any surrounding +/& separators from the brand name
brand_clean = re.sub(r"\s*[+&]*\s*logo\s*[+&]*\s*", "", brand.lower())
brand_emb = embedding_model.encode([brand_clean])
```

Note: the brand embedding alone is not a valid model input. The final feature vector must also include the one-hot-encoded NICE class, the cosine-similarity features, and the class-based statistics described under Feature Engineering.
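The assembly of that final feature vector can be sketched as follows. The function name, argument names, and feature ordering here are illustrative assumptions; the real model expects the exact column layout it was trained on.

```python
import numpy as np

def assemble_features(brand_emb, sim_registered, sim_rejected,
                      year_gap_registered, year_gap_rejected,
                      class_popularity, class_rejection_rate, nice_ohe):
    """Concatenate the text embedding with the engineered features.

    The feature order used here is an assumption; it must match the
    column order the CatBoost model was actually fitted on.
    """
    engineered = np.array([sim_registered, sim_rejected,
                           year_gap_registered, year_gap_rejected,
                           class_popularity, class_rejection_rate],
                          dtype=np.float64)
    return np.concatenate([np.ravel(brand_emb), engineered, np.ravel(nice_ohe)])
```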


📊 Training Details

  • Data Source: Scraped from pdki-indonesia.dgip.go.id
  • Samples: 337,334 total (1988–2024)
  • With Logos: 87,424
  • Split:
    • Train: 215,965
    • Valid: 54,043
    • Test: 67,326
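The counts above correspond to roughly a 64/16/20 split. A two-stage stratified split reproducing those proportions could look like the sketch below; the card does not state whether stratification (or a temporal split) was actually used, so treat this as an assumption.

```python
from sklearn.model_selection import train_test_split

def three_way_split(df, valid_frac=0.16, test_frac=0.20, seed=42):
    """64/16/20 stratified split; fractions inferred from the card's counts."""
    # First carve off the test set, stratified on the application outcome
    rest, test = train_test_split(
        df, test_size=test_frac, stratify=df["status"], random_state=seed)
    # Then split the remainder into train and validation
    train, valid = train_test_split(
        rest, test_size=valid_frac / (1 - test_frac),
        stratify=rest["status"], random_state=seed)
    return train, valid, test
```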

Feature Engineering

  • Cosine similarity to most similar rejected/registered brand and logo
  • Gap in years to those most similar samples
  • Popularity and rejection rate per NICE class
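A minimal sketch of the features listed above; the function names are illustrative, and the card does not specify the exact implementation:

```python
import numpy as np
import pandas as pd

def max_sim_and_year_gap(query_emb, query_year, ref_embs, ref_years):
    """Cosine similarity of a query against a reference pool (e.g. all
    previously registered brands) and the year gap to the most similar one."""
    q = query_emb / np.linalg.norm(query_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = refs @ q
    best = int(np.argmax(sims))
    return float(sims[best]), int(query_year) - int(ref_years[best])

def nice_class_stats(df):
    """Popularity (application count) and rejection rate per NICE class."""
    return df.groupby("nice_class")["status"].agg(
        popularity="size",
        rejection_rate=lambda s: (s == "Ditolak").mean(),
    )
```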

Training

  • Model: CatBoostClassifier
  • Objective: Logloss
  • Metric: F1-score (especially for class "Ditolak")
  • Imbalance Handling: auto_class_weights=Balanced
  • Early Stopping: 50–75 rounds

🧪 Evaluation

Results

| Model                   | F1-Score |
|-------------------------|----------|
| DINOv2 (image)          | 0.9072   |
| Multilingual E5 (dense) | 0.8520   |
| OpenSearch (sparse)     | 0.8511   |

Insights

  • Visual similarity is highly predictive (DINOv2 performs best)
  • Text similarity and temporal gaps are key factors across all models
  • Sparse model provides better interpretability via token-level SHAP

πŸ” Explainability

  • SHAP explains impact of each feature
  • Dominant features:
    • max_sim_to_registered_brand_train
    • max_sim_to_rejected_brand_train
    • year_gap_to_most_sim_registered

🌍 Environmental Impact

  • Trained on Kaggle GPUs (Tesla T4)
  • Embedding time:
    • Text: ~11 minutes (Multilingual E5)
    • Images: ~1.5 hours for 80k logos
  • Carbon footprint: [estimation TBD based on cloud usage logs]

📖 Citation

@article{lynardi2024keloraai,
  title={KeloraAI: Sistem Open-source Prediksi Keberhasilan Aplikasi Merek Dagang menggunakan Embedding Based Semantic Retrieval dan Visual Search Classification},
  author={Lynardi, Gibran Tegar Ramadhan Putra and Syafitra, Daffa and Firdaus, Harish Azka},
  journal={Fakultas Ilmu Komputer, Universitas Indonesia},
  year={2024}
}

πŸ—‚οΈ Glossary

  • CatBoost: Gradient boosting algorithm optimized for categorical data
  • SHAP: SHapley Additive exPlanations for feature importance
  • DINOv2: Self-supervised ViT for image embeddings
  • Multilingual E5: Dense sentence embedding model for semantic similarity
  • OpenSearch Sparse: Sparse text encoder for efficient retrieval and explainability
  • NICE Class: International classification of goods/services for trademarks

📫 Contact

Gibran Tegar Ramadhan Putra Lynardi
📧 gibran.tegar@ui.ac.id
🔗 GitHub Repository
