---
license: cc-by-4.0
datasets:
- UCSC-VLAA/Recap-DataComp-1B
- mlfoundations/datacomp_1b
library_name: open_clip
---
[[Paper]](https://arxiv.org/abs/2501.09446)

A DeltaCLIP-H/14-336 model, adversarially pre-trained on web-scale image-text data so that it matches the helpfulness of non-robust VLMs on clean data while remaining robust under adversarial attack.

## Model Usage
### With OpenCLIP
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model, its preprocessing transform, and the matching tokenizer from the Hub
model, preprocess = create_model_from_pretrained('hf-hub:zw123/delta_clip_l14_224')
tokenizer = get_tokenizer('hf-hub:zw123/delta_clip_l14_224')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine similarities scaled by 100, converted to probabilities over the text prompts
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
```
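
### Probing adversarial robustness (illustrative)
The sketch below is a minimal way to poke at the robustness claim above: it runs a few steps of an L∞ PGD attack against the zero-shot classification from the previous snippet and reports the predictions on the perturbed image. The attack budget, step size, and step count are arbitrary illustrative choices, and the perturbation is applied in the normalized input space rather than raw pixel space; this is not the evaluation protocol from the paper.
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:zw123/delta_clip_l14_224')
tokenizer = get_tokenizer('hf-hub:zw123/delta_clip_l14_224')
model.eval()

image = preprocess(Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad():
    text_features = F.normalize(model.encode_text(text), dim=-1)

label = torch.tensor([3])                  # index of "a beignet"
eps, step, n_steps = 4 / 255, 1 / 255, 10  # arbitrary budget, in normalized input space

# PGD: repeatedly step the perturbation in the direction that increases the
# classification loss, then project it back into the L-infinity eps-ball.
delta = torch.zeros_like(image, requires_grad=True)
for _ in range(n_steps):
    image_features = F.normalize(model.encode_image(image + delta), dim=-1)
    logits = 100.0 * image_features @ text_features.T
    loss = F.cross_entropy(logits, label)
    loss.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()
        delta.clamp_(-eps, eps)
    delta.grad.zero_()

with torch.no_grad():
    adv_features = F.normalize(model.encode_image(image + delta), dim=-1)
    adv_probs = (100.0 * adv_features @ text_features.T).softmax(dim=-1)

print("Adversarial label probs:", adv_probs)
```
A robust model should keep most of the probability mass on "a beignet" under a small budget like this, whereas a non-robust CLIP is typically flipped easily; for meaningful numbers, use the attack settings and benchmarks described in the paper.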

## Release
These models are released under the Creative Commons Attribution 4.0 license.
LLNL-DATA-2003001

## Citation
If you find this model useful, please consider citing our paper:
```bibtex
@article{wang2025double,
  title={Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness},
  author={Wang, Zeyu and Xie, Cihang and Bartoldson, Brian and Kailkhura, Bhavya},
  journal={arXiv preprint arXiv:2501.09446},
  year={2025}
}
```