---
language: "code"
license: "mit"
tags:
- dockerfile
- hadolint
- multilabel-classification
- codebert
model-index:
- name: Multilabel Dockerfile Classifier
  results: []
---

# 🧱 Dockerfile Quality Classifier – Multilabel Model

This model predicts **which rules are violated** in a given Dockerfile. It is a multilabel classifier trained to detect violations of the 30 most frequent Hadolint rules.

---

## 🧠 Model Overview

- **Architecture:** Fine-tuned `microsoft/codebert-base`
- **Task:** Multi-label classification (30 labels)
- **Input:** Full Dockerfile content as plain text
- **Output:** For each rule → probability of violation
- **Max input length:** 512 tokens (longer files are truncated)
- **Threshold:** 0.5 (configurable)

---

## 📚 Training Details

- **Total training files:** ~15,000 Dockerfiles with at least one rule violation
- **Per-rule cap:** At most 2,000 files per rule, to limit class imbalance
- **Perfect (clean) files:** ~1,500 examples with no Hadolint violations
- **Label source:** Hadolint output (top 30 rules only)
- **Label encoding:** Multi-hot vectors such as `[1, 0, 0, 1, ...]`, one position per rule

---

## 🧪 Evaluation Snapshot

Evaluation on 6,873 labeled examples:

| Metric          | Value |
|-----------------|-------|
| Micro avg F1    | 0.97  |
| Macro avg F1    | 0.95  |
| Weighted avg F1 | 0.97  |
| Samples avg F1  | 0.97  |

More metrics are available in `classification_report.csv`.

---

## 🚀 Quick Start

### 🧪 Step 1: Create test script

Save as `test_multilabel_predict.py`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
import numpy as np
import json
import sys

MODEL_DIR = "LeeSek/multilabel-dockerfile-model"
TOP_RULES_PATH = "top_rules.json"
THRESHOLD = 0.5


def main():
    if len(sys.argv) < 2:
        print("Usage: python test_multilabel_predict.py Dockerfile [--debug]")
        return

    debug = "--debug" in sys.argv
    file_path = Path(sys.argv[1])
    if not file_path.exists():
        print(f"File {file_path} not found.")
        return

    # Rule names, in the same order as the model's output logits.
    with open(TOP_RULES_PATH, encoding="utf-8") as f:
        labels = json.load(f)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    text = file_path.read_text(encoding="utf-8")
    # Files longer than 512 tokens are truncated; shorter ones are padded.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits
    # Independent sigmoid per label (multi-label), not a softmax.
    probs = torch.sigmoid(logits).squeeze().cpu().numpy()

    triggered = [(labels[i], probs[i]) for i in range(len(labels))
                 if probs[i] > THRESHOLD]
    top5 = np.argsort(probs)[-5:][::-1]

    print(f"\n🧪 Prediction for file: {file_path.name}")
    print(f"📄 Lines in file: {len(text.splitlines())}")

    if triggered:
        print(f"\n🚨 Detected violations (p > {THRESHOLD}):")
        for rule, p in triggered:
            print(f" - {rule}: {p:.3f}")
    else:
        print("✅ No violations detected.")

    if debug:
        print("\n🛠 DEBUG INFO:")
        print(f"📝 Text snippet:\n{text[:300]}")
        # Count real tokens only; the padded length is always 512.
        print(f"🔢 Token count: {int(inputs['attention_mask'][0].sum())}")
        print(f"📈 Logits: {logits.squeeze().tolist()}")
        print("\n🔥 Top 5 predictions:")
        for idx in top5:
            print(f" - {labels[idx]}: {probs[idx]:.3f}")


if __name__ == "__main__":
    main()
```

Make sure `top_rules.json` is available next to the script.

---

### 📄 Step 2: Create a good and a bad Dockerfile

Good:

```docker
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
```

Bad:

```docker
FROM ubuntu:latest
RUN apt-get install python3
ADD . /app
WORKDIR /app
RUN pip install flask
CMD python3 app.py
```

### ▶️ Step 3: Run the script

Save the example you want to test as `Dockerfile`, then run:

```bash
python test_multilabel_predict.py Dockerfile --debug
```

---

## 🗂 Extras

The full training and evaluation pipeline (data preparation, training, validation, prediction, and threshold calibration) is available in the **`scripts/`** folder.

> 💬 **Note:** The scripts use **Polish comments and variable names**, a leftover from local development. The logic itself is fully portable.

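
The threshold calibration step mentioned above is not reproduced in this card. As a rough illustration of the idea only, the sketch below picks one threshold per rule by maximising F1 on held-out data, instead of using a single global 0.5. The function names (`f1_binary`, `calibrate_thresholds`, `apply_thresholds`), the threshold grid, and the toy data are hypothetical and are not taken from `scripts/`; the actual pipeline may calibrate differently.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 score for one label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def calibrate_thresholds(probs, y_true, grid=None):
    """Pick, for each label, the grid threshold with the highest validation F1.

    probs:  (n_samples, n_labels) sigmoid outputs
    y_true: (n_samples, n_labels) multi-hot ground truth
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed search grid
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            score = f1_binary(y_true[:, j], (probs[:, j] > t).astype(int))
            if score > best_f1:  # ties keep the lowest threshold
                best_f1, thresholds[j] = score, t
    return thresholds


def apply_thresholds(probs, thresholds):
    """Turn probabilities into 0/1 predictions using per-label thresholds."""
    return (probs > thresholds).astype(int)


# Toy validation data with two labels (made up for illustration).
val_probs = np.array([[0.91, 0.31], [0.83, 0.42], [0.22, 0.07], [0.12, 0.36]])
val_true = np.array([[1, 1], [1, 1], [0, 0], [0, 1]])

per_rule = calibrate_thresholds(val_probs, val_true)
preds = apply_thresholds(val_probs, per_rule)
```

In the toy data the second label never exceeds 0.5, so a global threshold would miss every violation of that rule, while a per-rule threshold recovers them. Thresholds tuned this way should be chosen on validation data and then checked on a separate test set.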

---

## 📘 License

MIT

---

## 🙌 Credits

- Based on [Hadolint](https://github.com/hadolint/hadolint)
- Powered by [Hugging Face Transformers](https://huggingface.co)