ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)

This model is a compressed Vision Transformer (ViT-Tiny) trained with knowledge distillation from DiT-Large on the RVL-CDIP dataset for document image classification. It was developed during a research internship at the Laboratory of Complex Systems, Ecole Centrale Casablanca.

Model Details

  • Student Model: ViT-Tiny (Vision Transformer)
  • Teacher Model: microsoft/dit-large-finetuned-rvlcdip
  • Training Method: Knowledge Distillation
  • Parameters: ~5.5M (~55x smaller than the teacher; see the verification sketch after this list)
  • Dataset: RVL-CDIP (400k document images, 16 classes; trained on the 320k training split)
  • Task: Document Image Classification
  • Accuracy: 0.9210
  • Compression Ratio: ~55x parameter reduction relative to the teacher model
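
The parameter count quoted above can be checked directly from the checkpoint. A minimal sketch using the repo id from this card (requires transformers and torch):

from transformers import AutoModelForImageClassification

# Count the student checkpoint's parameters
model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # expected: roughly 5.5M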

Document Classes

The model classifies documents into 16 categories:

  1. letter - Personal or business correspondence
  2. form - Structured forms and applications
  3. email - Email communications
  4. handwritten - Handwritten documents
  5. advertisement - Marketing materials and ads
  6. scientific_report - Research reports and studies
  7. scientific_publication - Academic papers and journals
  8. specification - Technical specifications
  9. file_folder - File folders and organizational documents
  10. news_article - News articles and press releases
  11. budget - Financial budgets and planning documents
  12. invoice - Bills and invoices
  13. presentation - Presentation slides
  14. questionnaire - Surveys and questionnaires
  15. resume - CVs and resumes
  16. memo - Internal memos and notices

Usage

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load the processor and model
processor = AutoImageProcessor.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
model.eval()

# Load a document image; RVL-CDIP scans are grayscale, so convert to RGB
# to match the 3-channel input the processor expects
image = Image.open("path_to_your_document_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class_id = outputs.logits.argmax(-1).item()

# RVL-CDIP class names, in label-index order
class_names = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific_report", "scientific_publication", "specification",
    "file_folder", "news_article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

predicted_class = class_names[predicted_class_id]
print("Predicted class:", predicted_class)

Performance

Metric       Value
Accuracy     0.9210
Parameters   ~5.5M
Model Size   ~22 MB
Input Size   224x224 pixels
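
Latency is not reported on this card and depends heavily on hardware; the rough timing sketch below only illustrates how one might measure the single-image forward pass of the ~5.5M-parameter student:

import time
import torch
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
model.eval()
dummy = {"pixel_values": torch.randn(1, 3, 224, 224)}  # one 224x224 RGB tensor
with torch.no_grad():
    for _ in range(3):   # warm-up runs
        model(**dummy)
    start = time.perf_counter()
    for _ in range(20):
        model(**dummy)
    ms = (time.perf_counter() - start) / 20 * 1000
print(f"~{ms:.1f} ms per image (hardware-dependent)")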

Training Details

  • Student Architecture: Vision Transformer (ViT-Tiny)
  • Teacher Model: microsoft/dit-large-finetuned-rvlcdip
  • Distillation Method: Knowledge Distillation (see the loss sketch after this list)
  • Input Resolution: 224x224
  • Preprocessing: Standard ImageNet normalization
  • Framework: Transformers/PyTorch
  • Distillation Benefits: Maintains high accuracy with 55x fewer parameters
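
The exact training script is not published here; the sketch below shows the standard soft-target distillation objective (Hinton et al., 2015) that the method name conventionally refers to. The temperature T and mixing weight alpha are illustrative assumptions, not values confirmed for this model:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between temperature-softened teacher and student
    # distributions, scaled by T^2 as in Hinton et al.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the ground-truth RVL-CDIP labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard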

Dataset

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset contains:

  • 400,000 grayscale document images
  • 16 document categories
  • Images sourced from the Truth Tobacco Industry Documents archive
  • Standard train/validation/test splits (320k/40k/40k); a loading sketch follows
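
The dataset can be pulled from the Hugging Face Hub for evaluation; a minimal sketch, assuming the aharley/rvl_cdip repo id (substitute whichever mirror you use):

from datasets import load_dataset

# Load the RVL-CDIP test split (~40k images); labels are integers 0-15
ds = load_dataset("aharley/rvl_cdip", split="test")
print(ds)
print(ds.features["label"].names)  # the 16 category names in index order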

Citation

@misc{hammale2025vit_tiny_rvlcdip_distilled,
  title={ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)},
  author={Hammale, Mourad},
  year={2025},
  howpublished={\url{https://huggingface.co/HAMMALE/vit-tiny-classifier-rvlcdip}},
  note={Knowledge distilled from microsoft/dit-large-finetuned-rvlcdip}
}

Acknowledgments

This model was created by HAMMALE (Mourad) through knowledge distillation from the larger DiT-Large model (microsoft/dit-large-finetuned-rvlcdip), achieving significant compression while maintaining competitive performance for document classification tasks.

License

This model is released under the Apache 2.0 license.
