jesse-tong committed
Commit 30ae36c · 1 Parent(s): 946b455

Update datasets

LICENSE.md CHANGED
@@ -1,26 +1,25 @@
- AGPL-3.0
-
  This repository as a whole is licensed under the [GNU Affero General Public License v3.0 or any later version (AGPL v3.0 or later)](https://www.gnu.org/licenses/agpl-3.0.en.html).

  ## Third-Party Components

  This repository uses the following third-party components, each under their respective licenses:

- | Component | License | Description |
- |-----------|---------|-------------|
- | PhoBERT v2 | [AGPLv3.0](https://www.gnu.org/licenses/agpl-3.0.en.html) | Pre-trained language model for Vietnamese |
- | ViTHSD | [MIT License](https://opensource.org/licenses/MIT) | Vietnamese Targeted Hate Speech Detection |
- | underthesea | [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html) | Vietnamese NLP Toolkit |
- | transformers | [Apache License 2.0](https://github.com/huggingface/transformers/blob/main/LICENSE) | State-of-the-art NLP library by Hugging Face |
- | torch (PyTorch) | [BSD License](https://github.com/pytorch/pytorch/blob/master/LICENSE) | Open-source machine learning library |
+ | Component | License | Description | Link to repository (if possible) |
+ |-----------|---------|-------------|----------------------------------|
+ | PhoBERT v2 | [AGPLv3.0](https://www.gnu.org/licenses/agpl-3.0.en.html) | Pre-trained language model for Vietnamese | [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2) |
+ | ViTHSD | [MIT License](https://opensource.org/licenses/MIT) | Vietnamese Targeted Hate Speech Detection Dataset | [bakansm/ViTHSD](https://github.com/bakansm/ViTHSD) |
+ | ViHSD | [MIT License](https://opensource.org/licenses/MIT) | Vietnamese Hate Speech Detection Dataset | [sonlam1102/vihsd](https://huggingface.co/datasets/sonlam1102/vihsd) |
+ | underthesea | [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html) | Vietnamese NLP Toolkit | [undertheseanlp/underthesea](https://github.com/undertheseanlp/underthesea) |
+ | transformers | [Apache License 2.0](https://github.com/huggingface/transformers/blob/main/LICENSE) | State-of-the-art NLP library by Hugging Face | [huggingface/transformers](https://github.com/huggingface/transformers) |
+ | torch (PyTorch) | [BSD License](https://github.com/pytorch/pytorch/blob/master/LICENSE) | Open-source machine learning library | [pytorch/pytorch](https://github.com/pytorch/pytorch) |
  | datasets | [Apache License 2.0](https://github.com/huggingface/datasets/blob/main/LICENSE) | Dataset library by Hugging Face |
- | pandas | [BSD 3-Clause License](https://github.com/pandas-dev/pandas/blob/main/LICENSE) | Data analysis and manipulation library |
- | scikit-learn | [BSD 3-Clause License](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING) | Machine learning library for Python |
- | numpy | [BSD 3-Clause License](https://github.com/numpy/numpy/blob/main/LICENSE.txt) | Scientific computing library |
- | tokenizers | [Apache License 2.0](https://github.com/huggingface/tokenizers/blob/main/LICENSE) | Fast tokenizers library by Hugging Face |
- | torchtext | [BSD License](https://github.com/pytorch/text/blob/main/LICENSE) | Text processing utilities for PyTorch |
- | maturin | [MIT License or Apache License 2.0](https://github.com/PyO3/maturin/blob/main/license-mit) | Build and publish Rust extensions for Python |
- | accelerate | [Apache License 2.0](https://github.com/huggingface/accelerate/blob/main/LICENSE) | Library for easy PyTorch distributed training |
+ | pandas | [BSD 3-Clause License](https://github.com/pandas-dev/pandas/blob/main/LICENSE) | Data analysis and manipulation library | |
+ | scikit-learn | [BSD 3-Clause License](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING) | Machine learning library for Python | |
+ | numpy | [BSD 3-Clause License](https://github.com/numpy/numpy/blob/main/LICENSE.txt) | Scientific computing library | |
+ | tokenizers | [Apache License 2.0](https://github.com/huggingface/tokenizers/blob/main/LICENSE) | Fast tokenizers library by Hugging Face | |
+ | torchtext | [BSD License](https://github.com/pytorch/text/blob/main/LICENSE) | Text processing utilities for PyTorch | |
+ | maturin | [MIT License or Apache License 2.0](https://github.com/PyO3/maturin/blob/main/license-mit) | Build and publish Rust extensions for Python | |
+ | accelerate | [Apache License 2.0](https://github.com/huggingface/accelerate/blob/main/LICENSE) | Library for easy PyTorch distributed training | |

  ## AGPLv3.0 License Requirements

scripts/merge_datasets.py ADDED
@@ -0,0 +1,116 @@
+ import os
+ import glob
+ import pandas as pd
+ import argparse
+ from tqdm import tqdm
+
+ def merge_datasets(input_dirs, output_dir, preserve_splits=False):
+     """
+     Merge CSV datasets from multiple directories into one directory.
+
+     Args:
+         input_dirs (list): List of input directory paths
+         output_dir (str): Output directory path
+         preserve_splits (bool): If True, keep separate train/dev/test splits;
+             otherwise write a single combined.csv
+     """
+     # Create output directory if it doesn't exist
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Define the expected columns for the format in test.csv
+     expected_columns = ['content', 'individual', 'groups', 'religion/creed', 'race/ethnicity', 'politics']
+
+     # Dictionary to hold dataframes for each split if preserving splits
+     combined_data = {}
+     if preserve_splits:
+         combined_data = {'train': [], 'dev': [], 'test': []}
+     else:
+         combined_data['all'] = []
+
+     # Process each input directory
+     for input_dir in input_dirs:
+         print(f"Processing directory: {input_dir}")
+
+         # Find all CSV files in the directory
+         csv_files = glob.glob(os.path.join(input_dir, "*.csv"))
+
+         for file_path in tqdm(csv_files, desc=f"Processing files in {os.path.basename(input_dir)}"):
+             file_name = os.path.basename(file_path)
+
+             # Read the CSV file
+             try:
+                 df = pd.read_csv(file_path)
+                 print(f" Reading {file_name}: {len(df)} rows")
+             except Exception as e:
+                 print(f" Error reading {file_name}: {e}")
+                 continue
+
+             # Rename 'free_text' column to 'content' if it exists
+             if 'free_text' in df.columns:
+                 df.rename(columns={'free_text': 'content'}, inplace=True)
+
+             # Check if 'content' column exists
+             if 'content' not in df.columns:
+                 print(f" Warning: 'content' column not found in {file_name}. Skipping.")
+                 continue
+
+             # Ensure all required columns exist
+             for col in expected_columns:
+                 if col != 'content' and col not in df.columns:
+                     df[col] = 0  # Set default value for missing columns
+
+             # Convert category columns to integer type
+             for col in expected_columns:
+                 if col != 'content' and col in df.columns:
+                     df[col] = df[col].fillna(0).astype(int)
+
+             # Keep only the expected columns, dropping everything else
+             df = df[expected_columns]
+
+             # Determine which split this file belongs to
+             if preserve_splits:
+                 if 'train' in file_name.lower():
+                     combined_data['train'].append(df)
+                 elif 'dev' in file_name.lower():
+                     combined_data['dev'].append(df)
+                 elif 'test' in file_name.lower():
+                     combined_data['test'].append(df)
+                 else:
+                     # If not explicitly marked, add to all splits
+                     for split in ['train', 'dev', 'test']:
+                         combined_data[split].append(df)
+             else:
+                 combined_data['all'].append(df)
+
+     # Combine and save the data
+     for split, dfs in combined_data.items():
+         if not dfs:
+             print(f"No data for {split} split")
+             continue
+
+         combined_df = pd.concat(dfs, ignore_index=True)
+
+         # Remove duplicates
+         combined_df = combined_df.drop_duplicates(subset=['content'])
+
+         # Save to output directory
+         output_file = os.path.join(output_dir, f"{split}.csv" if preserve_splits else "combined.csv")
+         combined_df.to_csv(output_file, index=False)
+         print(f"Saved {len(combined_df)} rows to {output_file}")
+
+ def main():
+     parser = argparse.ArgumentParser(description="Merge CSV datasets from multiple directories")
+     parser.add_argument("--input_dirs", required=True, nargs='+',
+                         help="List of input directory paths containing CSV files")
+     parser.add_argument("--output_dir", required=True,
+                         help="Output directory path for merged datasets")
+
+     args = parser.parse_args()
+
+     merge_datasets(
+         args.input_dirs,
+         args.output_dir,
+         preserve_splits=True
+     )
+
+ if __name__ == "__main__":
+     main()
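For reference, the new script can also be driven from Python rather than the command line. This is a minimal, hypothetical usage sketch (not part of the commit): the `data/...` paths are placeholders, and it assumes you run it from the repository root so `scripts/merge_datasets.py` is importable.

```python
# Hypothetical usage sketch; the data/ paths are placeholders.
import sys

sys.path.append("scripts")  # assumes the working directory is the repository root
from merge_datasets import merge_datasets

merge_datasets(
    input_dirs=["data/vihsd_processed", "data/vithsd"],  # placeholder input folders
    output_dir="data/merged",
    preserve_splits=True,  # writes train.csv, dev.csv and test.csv
)
```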
utils/convert_vihsd_gemini.py ADDED
@@ -0,0 +1,166 @@
+ import os
+ import glob
+ import pandas as pd
+ import argparse
+ from google import genai
+ from tqdm import tqdm
+ import time
+ from word_segmentation_vi import word_segmentation_vi
+
+ def setup_genai(api_key):
+     """Configure the Google Generative AI client with your API key"""
+     return genai.Client(api_key=api_key)
+
+ def classify_text(model, text):
+     """Classify Vietnamese text into hate speech categories using Google's Generative AI"""
+     prompt = f"""
+     Analyze the following Vietnamese text for hate speech (each sentence is separated by a newline):
+     "{text}"
+
+     Rate it on these categories (0=NORMAL, 1=CLEAN, 2=OFFENSIVE, 3=HATE):
+     - individual (targeting specific individuals)
+     - groups (targeting groups or organizations)
+     - religion/creed (targeting religious groups or beliefs)
+     - race/ethnicity (racial/ethnic hate speech)
+     - politics (political hate speech)
+     If the text doesn't specify a person or group in a category, return 0 for that category.
+     Else, return 1 for CLEAN, 2 for OFFENSIVE, or 3 for HATE.
+
+     For each sentence in the text, return only 5 numbers separated by commas (corresponding to the labels for individual, groups, religion/creed, race/ethnicity, politics), with the numbers for each sentence separated by newlines, like (with no other text):
+     0,1,0,0,0
+     1,0,0,0,2
+     """
+
+     try:
+         response = model.models.generate_content(model="gemini-2.0-flash", contents=prompt)
+         values = response.text.strip().split('\n')
+         values = [line.split(',') for line in values]
+         return values
+     except Exception as e:
+         print(f"Error classifying text: {e}")
+         return None
+
+ def process_file(input_file, output_file, model, rate_limit_pause=4):
+     """Process a single CSV file to match the test.csv format"""
+     print(f"Processing {input_file}...")
+
+     # Read the input file
+     try:
+         df = pd.read_csv(input_file)
+     except Exception as e:
+         print(f"Error reading {input_file}: {e}")
+         return
+
+     # Rename column free_text to content
+     if 'free_text' in df.columns:
+         df.rename(columns={'free_text': 'content'}, inplace=True)
+     elif 'content' not in df.columns:
+         print(f"Error: 'content' column not found in {input_file}")
+         return
+
+     # Ensure all required category columns exist
+     category_columns = ['individual', 'groups', 'religion/creed', 'race/ethnicity', 'politics']
+     for col in category_columns:
+         if col not in df.columns:
+             # Add missing category columns with a default value of 0
+             df[col] = 0
+
+     # Process the file in batches (100 rows at a time)
+     batch_size = 100
+     for start in tqdm(range(0, len(df), batch_size), desc="Processing batches"):
+         end = min(start + batch_size, len(df))
+         batch_df = df.iloc[start:end]
+
+         # Skip the batch if every category already has only non-zero values
+         if all(batch_df[cat].all() != 0 for cat in category_columns):
+             continue
+
+         # Join the batch's rows by newlines and classify them in one request
+         text_to_classify = "\n".join([str(sentence) for sentence in batch_df['content'].tolist()])
+         classifications = classify_text(model, text_to_classify)
+
+         # Retry up to 2 more times, otherwise skip the batch
+         if classifications is None:
+             for _ in range(2):
+                 classifications = classify_text(model, text_to_classify)
+                 if classifications is not None:
+                     break
+                 time.sleep(rate_limit_pause)
+             else:
+                 print(f"Error classifying batch starting at index {start}. Skipping...")
+                 continue
+
+         try:
+             # Update the DataFrame with the classifications
+             for i, row in enumerate(classifications):
+                 for j, col in enumerate(category_columns):
+                     df.at[start + i, col] = int(row[j])
+         except Exception:
+             # Malformed response: retry up to 2 more times, otherwise skip the batch
+             for _ in range(2):
+                 classifications = classify_text(model, text_to_classify)
+                 if classifications is not None:
+                     break
+                 time.sleep(rate_limit_pause)
+             else:
+                 print(f"Error classifying batch starting at index {start}. Skipping...")
+                 continue
+
+             try:
+                 for i, row in enumerate(classifications):
+                     for j, col in enumerate(category_columns):
+                         df.at[start + i, col] = int(row[j])
+             except Exception as e:
+                 print(f"Error updating DataFrame: {e}")
+                 continue
+
+         time.sleep(rate_limit_pause)
+
+     # Apply word segmentation to the content column
+     df['content'] = df['content'].apply(lambda x: word_segmentation_vi(str(x)))
+
+     # Export the category columns as integers
+     for col in category_columns:
+         df[col] = df[col].astype(int)
+     # Drop the label_id column if it exists
+     if 'label_id' in df.columns:
+         df.drop(columns=['label_id'], inplace=True)
+     df.to_csv(output_file, index=False)
+     print(f"Saved processed file to {output_file}")
+
+ def main():
+     parser = argparse.ArgumentParser(description="Process ViHSD CSV files with Google Generative AI")
+     parser.add_argument("--input_dir", required=True, help="Directory containing input CSV files")
+     parser.add_argument("--output_dir", required=True, help="Directory to save processed files")
+     parser.add_argument("--api_key", required=True, help="Google Generative AI API key")
+     parser.add_argument("--pause", type=float, default=4.0, help="Pause between API calls (seconds)")
+
+     args = parser.parse_args()
+
+     # Ensure output directory exists
+     os.makedirs(args.output_dir, exist_ok=True)
+
+     # Set up the Google Generative AI client
+     model = setup_genai(args.api_key)
+
+     # Get all CSV files in the input directory
+     csv_files = glob.glob(os.path.join(args.input_dir, "*.csv"))
+     if not csv_files:
+         print(f"No CSV files found in {args.input_dir}")
+         return
+
+     print(f"Found {len(csv_files)} CSV files to process")
+
+     # Process each file, skipping ones that already have output
+     for input_file in csv_files:
+         output_file = os.path.join(args.output_dir, os.path.basename(input_file))
+         if os.path.exists(output_file):
+             print(f"Output file {output_file} already exists. Skipping...")
+             continue
+         process_file(input_file, output_file, model, args.pause)
+
+ if __name__ == "__main__":
+     # This script processes ViHSD CSV files with Google Generative AI.
+     # First, git clone the dataset from https://huggingface.co/datasets/sonlam1102/vihsd
+     main()
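The batched Gemini call assumes the model returns exactly one line of five comma-separated integers per input row; when it does not, the script above falls back to retries. A small, hypothetical helper (not part of this commit) that makes the shape check explicit before writing labels back into the DataFrame could look like this:

```python
def parse_classifications(raw_text, expected_rows, num_labels=5):
    """Parse model output of the form '0,1,0,0,0\n1,0,0,0,2' into lists of ints.

    Returns None if the response does not match the expected shape, so the
    caller can reuse the existing retry path instead of misaligning rows.
    """
    rows = [line.strip() for line in raw_text.strip().splitlines() if line.strip()]
    if len(rows) != expected_rows:
        return None
    parsed = []
    for line in rows:
        parts = line.split(",")
        if len(parts) != num_labels:
            return None
        try:
            parsed.append([int(p) for p in parts])
        except ValueError:
            return None
    return parsed
```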
utils/word_segmentation_vi.py CHANGED
@@ -17,6 +17,10 @@ if __name__ == "__main__":
          df = pandas.read_csv(file_path)
          if 'content' in df.columns:
              df['content'] = df['content'].apply(lambda text: word_segmentation_vi(str(text)))
+
+             if 'Unnamed: 0' in df.columns:
+                 df.drop(columns=['Unnamed: 0'], inplace=True)
+
              df.to_csv(file_path, index=False)
              print(f"Processed {file}")
          else:
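The `word_segmentation_vi` helper itself is not shown in this diff. As a point of reference only, a minimal sketch of such a function, assuming it wraps underthesea's `word_tokenize` with underscore-joined output (the segmentation format PhoBERT expects), might look like the following; the actual utils/word_segmentation_vi.py may differ.

```python
# Assumed sketch; the real implementation in this repository may differ.
from underthesea import word_tokenize

def word_segmentation_vi(text: str) -> str:
    # format="text" joins multi-syllable words with underscores,
    # e.g. "học sinh" -> "học_sinh", matching PhoBERT's expected input.
    return word_tokenize(text, format="text")
```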