---
license: mit
language:
- multilingual
tags:
- nlp
base_model: OpenGVLab/InternVL2_5-4B
pipeline_tag: text-generation
inference: true
---

# NuExtract-2-4B [experimental version] by NuMind 🔥

NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual.

NB: This is an experimental version that will be superseded by NuExtract 2.0.

We provide several versions of different sizes, all based on the InternVL2.5 family.

| Model Size | Model Name | Base Model | Huggingface Link |
|------------|------------|------------|------------------|
| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |

## Overview

To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object specifying field names and their expected types. Supported types include:

* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
* `string` - a generic string field that can incorporate paraphrasing/abstraction.
* `integer` - a whole number.
* `number` - a whole or decimal number.
* `date-time` - an ISO-formatted date.
* Array of any of the above types (e.g. `["string"]`)
* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
* `multi-label` - an enum that can have multiple possible answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).

The following is an example template:

```json
{
  "first_name": "verbatim-string",
  "last_name": "verbatim-string",
  "description": "string",
  "age": "integer",
  "gpa": "number",
  "birth_date": "date-time",
  "nationality": ["France", "England", "Japan", "USA", "China"],
  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```

An example output:

```json
{
  "first_name": "Susan",
  "last_name": "Smith",
  "description": "A student studying computer science.",
  "age": 20,
  "gpa": 3.7,
  "birth_date": "2005-03-01",
  "nationality": "England",
  "languages_spoken": ["English", "French"]
}
```

⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.
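Since the model's response is a plain JSON string conforming to the template, it can be parsed with the standard library. A minimal sketch, where the `output` value is illustrative and stands in for a decoded model response:

```python
import json

output = '{"names": ["John", "Mary", "James"]}'  # illustrative model response

try:
    extraction = json.loads(output)  # dict with the template's fields
except json.JSONDecodeError:
    extraction = None  # handle rare malformed generations explicitly

print(extraction)
# {'names': ['John', 'Mary', 'James']}
```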
## Inference

Use the following code to handle loading and preprocessing of input data:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
    """
    Prepares multi-modal input components (supports multiple images per prompt).

    Args:
        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
        image_paths: List where each element is either None (for text-only) or a list of image paths
        tokenizer: The tokenizer to use for applying chat templates
        device: Device to place tensors on ('cuda', 'cpu', etc.)
        dtype: Data type for image tensors (default: torch.bfloat16)

    Returns:
        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
    """
    # Make sure image_paths list is at least as long as messages
    if len(image_paths) < len(messages):
        # Pad with None for text-only messages
        image_paths = image_paths + [None] * (len(messages) - len(image_paths))

    # Process images and collect patch information
    loaded_images = []
    num_patches_list = []
    for paths in image_paths:
        if paths and isinstance(paths, list) and len(paths) > 0:
            # Load each image in this prompt
            prompt_images = []
            prompt_patches = []

            for path in paths:
                # Load the image
                img = load_image(path).to(dtype=dtype, device=device)

                # Ensure img has correct shape [patches, C, H, W]
                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
                    img = img.unsqueeze(0)

                prompt_images.append(img)

                # Record the number of patches for this image
                prompt_patches.append(img.shape[0])

            loaded_images.append(prompt_images)
            num_patches_list.append(prompt_patches)
        else:
            # Text-only prompt
            loaded_images.append(None)
            num_patches_list.append([])

    # Create the concatenated pixel_values_list
    pixel_values_list = []
    for prompt_images in loaded_images:
        if prompt_images:
            # Concatenate all images for this prompt
            pixel_values_list.append(torch.cat(prompt_images, dim=0))
        else:
            # Text-only prompt
            pixel_values_list.append(None)

    # Format messages for the model
    if all(isinstance(m, str) for m in messages):
        # Simple string messages: convert to chat format
        batch_messages = [
            [{"role": "user", "content": message}]
            for message in messages
        ]
    else:
        # Assume messages are already in the right format
        batch_messages = messages

    # Apply chat template
    prompts = tokenizer.apply_chat_template(
        batch_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return {
        'prompts': prompts,
        'pixel_values_list': pixel_values_list,
        'num_patches_list': num_patches_list
    }

def construct_message(text, template, examples=None):
    """
    Construct the individual NuExtract message texts, prior to chat template formatting.
    """
    # add few-shot examples if needed
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""

    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
```
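For reference, `construct_message` assembles the final prompt text in the layout shown below (the template and context values here are illustrative):

```python
print(construct_message("John went to the restaurant.", '{"names": ["verbatim-string"]}'))
# # Template:
# {"names": ["verbatim-string"]}
# # Context:
# John went to the restaurant.
```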
To handle inference:

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'

def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
    """
    Generate responses for a batch of NuExtract inputs.
    Support for multiple and varying numbers of images per prompt.

    Args:
        model: The vision-language model
        tokenizer: The tokenizer for the model
        prompts: List of text prompts
        generation_config: Configuration for text generation
        pixel_values_list: List of tensor batches, one per prompt
                           Each batch has shape [num_images, channels, height, width] or None for text-only prompts
        num_patches_list: List of lists, each containing patch counts for images in a prompt

    Returns:
        List of generated responses
    """
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    model.img_context_token_id = img_context_token_id

    # Replace all image placeholders with appropriate tokens
    modified_prompts = []
    total_image_files = 0
    total_patches = 0
    image_containing_prompts = []

    for idx, prompt in enumerate(prompts):
        # check if this prompt has images
        has_images = (pixel_values_list and idx < len(pixel_values_list) and
                      pixel_values_list[idx] is not None and
                      isinstance(pixel_values_list[idx], torch.Tensor) and
                      pixel_values_list[idx].shape[0] > 0)

        if has_images:
            # prompt with image placeholders
            image_containing_prompts.append(idx)
            modified_prompt = prompt

            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
            num_images = len(patches)
            total_image_files += num_images
            total_patches += sum(patches)

            # replace each placeholder with image tokens
            for i, num_patches in enumerate(patches):
                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
        else:
            # text-only prompt
            modified_prompt = prompt

        modified_prompts.append(modified_prompt)

    # process all prompts in a single batch
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
    input_ids = model_inputs['input_ids'].to(model.device)
    attention_mask = model_inputs['attention_mask'].to(model.device)

    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    generation_config['eos_token_id'] = eos_token_id

    # prepare pixel values
    flattened_pixel_values = None
    if image_containing_prompts:
        # collect and concatenate all image tensors
        all_pixel_values = []
        for idx in image_containing_prompts:
            all_pixel_values.append(pixel_values_list[idx])
        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
    else:
        print(f"Processing text-only batch with {len(prompts)} prompts")

    # generate outputs
    outputs = model.generate(
        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
        input_ids=input_ids,
        attention_mask=attention_mask,
        **generation_config
    )

    # Decode responses
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return responses
```

To load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-2-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2"  # we recommend using flash attention
                                             ).to("cuda")
```
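Note that `flash_attention_2` requires the `flash-attn` package and a compatible GPU. If it is unavailable, a minimal fallback (assuming the default attention implementation is acceptable for your workload) is to omit the argument:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Transformers will pick its default attention implementation
).to("cuda")
```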
Simple 0-shot text-only example:

```python
template = """{"names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."

input_messages = [construct_message(text, template)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["John", "Mary", "James"]}
```

Text-only input with an in-context example:

```python
template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
    }
]

input_messages = [construct_message(text, template, examples)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```

Example with image input and an in-context example. Image inputs should use an `<image>` placeholder in place of the text, and image paths should be provided in a list in order of their appearance in the prompt (in this example `0.jpg` is for the in-context example and `1.jpg` for the true input).

```python
template = """{"store": "verbatim-string"}"""
text = "<image>"
examples = [
    {
        "input": "<image>",
        "output": """{"store": "Walmart"}"""
    }
]

input_messages = [construct_message(text, template, examples)]

images = [
    ["0.jpg", "1.jpg"]
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store": "Trader Joe's"}
```
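`load_image` tiles large images into up to `max_num` 448×448 patches (plus a thumbnail), and more patches mean more image tokens and memory. A minimal sketch for checking the patch count of an image before batching it, reusing the helper defined earlier:

```python
pixel_values = load_image("1.jpg", max_num=6)  # cap tiling at 6 patches
print(pixel_values.shape[0])  # number of patches, including the thumbnail
```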
Multi-modal batched input:

```python
inputs = [
    # image input with no ICL examples
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": None,
    },
    # image input with 1 ICL example
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": "<image>",
                "output": """{"store_name": "Walmart"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"]}""",
        "examples": None,
    },
    # text input with ICL example
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
            }
        ],
    },
]

input_messages = [
    construct_message(
        x["text"],
        x["template"],
        x["examples"]
    ) for x in inputs
]

images = [
    ["0.jpg"],
    ["0.jpg", "1.jpg"],
    None,
    None
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store_name": "WAL*MART"}
# {"store_name": "Trader Joe's"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```

## Template Generation

If you want to convert existing schema files you have in other formats (e.g. XML, YAML, etc.) or start from an example, NuExtract 2.0 models can automatically generate the template for you.

E.g. convert XML into a NuExtract template:

```python
def generate_template(description):
    input_messages = [description]

    input_content = prepare_inputs(
        messages=input_messages,
        image_paths=[],
        tokenizer=tokenizer,
    )

    generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256}

    with torch.no_grad():
        result = nuextract_generate(
            model=model,
            tokenizer=tokenizer,
            prompts=input_content['prompts'],
            pixel_values_list=input_content['pixel_values_list'],
            num_patches_list=input_content['num_patches_list'],
            generation_config=generation_config
        )
    return result[0]

# example XML schema to convert (field values left empty)
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

result = generate_template(xml_template)
print(result)
# {
#     "SportResult": {
#         "Date": "date-time",
#         "Sport": "verbatim-string",
#         "Venue": "verbatim-string",
#         "HomeTeam": "verbatim-string",
#         "AwayTeam": "verbatim-string",
#         "HomeScore": "integer",
#         "AwayScore": "integer",
#         "TopScorer": "verbatim-string"
#     }
# }
```

E.g. generate a template from a natural language description:

```python
text = """Give me relevant info about startup companies mentioned."""

result = generate_template(text)
print(result)
# {
#     "Startup_Companies": [
#         {
#             "Name": "verbatim-string",
#             "Products": [
#                 "string"
#             ],
#             "Location": "verbatim-string",
#             "Company_Type": [
#                 "Technology",
#                 "Finance",
#                 "Health",
#                 "Education",
#                 "Other"
#             ]
#         }
#     ]
# }
```
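A generated template can then be fed straight back into the extraction pipeline. A minimal sketch reusing the helpers defined earlier (the input text here is illustrative):

```python
template = generate_template("Give me relevant info about startup companies mentioned.")

text = "Acme Robotics, a Boston-based robotics startup, launched its warehouse automation platform."  # illustrative

input_content = prepare_inputs(
    messages=[construct_message(text, template)],
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
print(result[0])
```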