---
language: en
license: mit
tags:
- keras
- lstm
- spam-classification
- text-classification
- binary-classification
- email
- deep-learning
library_name: keras
pipeline_tag: text-classification
model_name: Spam Email Classifier (BiLSTM)
datasets:
- SetFit/enron_spam
---

# 📧 Spam Email Classifier using BiLSTM

This model uses a **Bidirectional LSTM (BiLSTM)** architecture built with **Keras** to classify email messages as **Spam** or **Ham**. It was trained on the [Enron Spam Dataset](https://huggingface.co/datasets/SetFit/enron_spam) using GloVe word embeddings.

---

## 🧠 Model Architecture

- **Tokenizer**: Keras `Tokenizer` fitted on the Enron dataset
- **Embedding**: Pretrained [GloVe.6B.100d](https://nlp.stanford.edu/projects/glove/)
- **Model**: `Embedding → BiLSTM → Dropout → Dense(sigmoid)`
- **Input**: English email/message text
- **Output**: `0 = Ham`, `1 = Spam`

Hedged sketches for rebuilding the embedding matrix and this model stack appear at the end of this card.

---

## 🧪 Example Usage

```python
import pickle

from huggingface_hub import hf_hub_download
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Download the model and tokenizer from the Hugging Face Hub
model_path = hf_hub_download("lokas/spam-emails-classifier", "model.h5")
tokenizer_path = hf_hub_download("lokas/spam-emails-classifier", "tokenizer.pkl")

# Load the trained model and the fitted tokenizer
model = load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

# Prediction function
def predict_spam(text):
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=50)  # must match the maxlen used in training
    pred = model.predict(padded)[0][0]
    return "🚫 Spam" if pred > 0.5 else "✅ Not Spam"

# Example
print(predict_spam("Win a free iPhone now!"))
```
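
---

## 📦 Rebuilding the GloVe Embedding Matrix (sketch)

The card states that the embedding layer uses pretrained GloVe.6B.100d vectors. Below is a minimal sketch of how such an embedding matrix is typically built from a fitted Keras `Tokenizer`. The file path, the helper name `build_embedding_matrix`, and zero-initialization for out-of-vocabulary words are assumptions, not the author's exact training code.

```python
import numpy as np

def build_embedding_matrix(tokenizer, glove_path="glove.6B.100d.txt", dim=100):
    """Map each tokenizer word index to its GloVe vector (zeros if unseen)."""
    # Each GloVe line has the form: word v1 v2 ... v100
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    vocab_size = len(tokenizer.word_index) + 1  # index 0 is reserved for padding
    matrix = np.zeros((vocab_size, dim), dtype="float32")
    for word, idx in tokenizer.word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec  # out-of-vocabulary words stay zero-initialized
    return matrix
```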
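
---

## 🛠️ Reproducing the Model Architecture (sketch)

For readers who want to rebuild the `Embedding → BiLSTM → Dropout → Dense(sigmoid)` stack, here is a minimal Keras sketch. The embedding dimension (100) follows GloVe.6B.100d and `maxlen=50` follows the usage example above; the LSTM width, dropout rate, frozen embeddings, and optimizer are assumptions, since the card does not state them.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

MAXLEN = 50      # matches the maxlen in the usage example
EMBED_DIM = 100  # matches GloVe.6B.100d

# embedding_matrix comes from build_embedding_matrix(tokenizer) above
def build_model(embedding_matrix):
    model = Sequential([
        Embedding(
            input_dim=embedding_matrix.shape[0],
            output_dim=EMBED_DIM,
            weights=[embedding_matrix],  # initialize from GloVe
            input_length=MAXLEN,
            trainable=False,             # assumption: embeddings kept frozen
        ),
        Bidirectional(LSTM(64)),         # assumption: 64 units per direction
        Dropout(0.5),                    # assumption: rate not stated on the card
        Dense(1, activation="sigmoid"),  # probability that the message is spam
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```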