This project implements an embedding-based sentiment classifier that labels tweets as Positive, Negative, or Neutral using Gemini text embeddings and a machine learning model.
Social media platforms generate millions of posts daily, making manual sentiment analysis impractical. Understanding public sentiment helps brands, governments, and organizations make informed decisions.
The goal of this project is to build a sentiment classifier using:
- Text preprocessing and cleaning
- Semantic embeddings generated using Gemini
- A machine learning classification model
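The cleaning step could look roughly like the sketch below. It uses only the standard library so that it runs anywhere; the project itself lists NLTK for preprocessing, so the tiny `STOPWORDS` set and the `clean_tweet` name are illustrative assumptions, not the notebook's actual code.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK's English stopwords
# (nltk.corpus.stopwords) would be a fuller replacement.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "on", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # digits and punctuation become spaces
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_tweet("@bob I LOVE the new #phone!!! http://t.co/xyz"))  # i love new phone
```

How aggressively to clean is a judgment call: embedding models are trained on natural text, so it may be worth comparing results with and without stopword removal.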
- Dataset: Twitter Tweets Sentiment Dataset
- Size: ~27,000 tweets
- Columns: `textID`, `text`, `selected_text`, `sentiment`
- Sentiment labels: Positive, Negative, Neutral
- Dataset link: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
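A minimal sketch of loading and sanity-checking the data with Pandas, assuming the four columns listed above. The `prepare_tweets` helper, the in-memory sample rows, and the `Tweets.csv` file name are assumptions made for illustration; adjust them to match the downloaded file.

```python
import pandas as pd

EXPECTED_COLUMNS = ["textID", "text", "selected_text", "sentiment"]


def prepare_tweets(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the expected columns and drop rows without usable text."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    df = df.dropna(subset=["text", "sentiment"]).copy()
    df["text"] = df["text"].astype(str).str.strip()
    # Normalize label casing so "Positive" and "positive" count as one class.
    df["sentiment"] = df["sentiment"].astype(str).str.strip().str.lower()
    return df[df["text"] != ""].reset_index(drop=True)


# Tiny in-memory stand-in for the real file, for illustration only.
sample = pd.DataFrame(
    {
        "textID": ["a1", "b2", "c3", "d4"],
        "text": ["love it", None, "so bad", "it rained"],
        "selected_text": ["love", None, "so bad", "it rained"],
        "sentiment": ["positive", "neutral", "negative", "neutral"],
    }
)
tweets = prepare_tweets(sample)
print(tweets["sentiment"].value_counts())

# With the downloaded CSV (the file name is an assumption):
# tweets = prepare_tweets(pd.read_csv("Tweets.csv"))
```

Checking the label distribution early shows whether the three classes are balanced enough for plain accuracy to be a meaningful metric.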
- Python
- Pandas, NumPy
- NLTK (text preprocessing)
- Google Gemini Embeddings (`text-embedding-004`)
- Scikit-learn (Logistic Regression)
- Matplotlib, Seaborn
- WordCloud
- VS Code (Jupyter Notebook)
- Exploratory Data Analysis (EDA)
- Text preprocessing and cleaning
- Word cloud visualization
- Embedding generation using Gemini
- Model training using Logistic Regression
- Model evaluation using classification metrics
- Custom tweet sentiment prediction
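The training and evaluation steps can be sketched as follows. Because real Gemini embeddings require an API key, this example substitutes synthetic vectors (one Gaussian cluster per label) as a stand-in; `make_fake_embeddings`, the 16-dimensional size, and the 80/20 split are assumptions for illustration and say nothing about the project's actual scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

LABELS = ["negative", "neutral", "positive"]


def make_fake_embeddings(n_per_class=100, dim=16, seed=0):
    """Stand-in for Gemini embeddings: one Gaussian cluster per label."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for idx, label in enumerate(LABELS):
        center = np.zeros(dim)
        center[idx] = 3.0  # shift each class along its own axis
        X_parts.append(rng.normal(loc=center, scale=1.0, size=(n_per_class, dim)))
        y_parts.extend([label] * n_per_class)
    return np.vstack(X_parts), np.array(y_parts)


X, y = make_fake_embeddings()

# Stratify so each sentiment keeps the same share in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```

With real data, `X` would be the matrix of tweet embeddings and `y` the `sentiment` column. Predicting a custom tweet then amounts to embedding it the same way and calling `clf.predict` on the resulting vector.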
- The model successfully classifies tweets into positive, negative, and neutral categories.
- Semantic embeddings capture contextual meaning effectively.
- Custom user-defined tweets were accurately classified.
  - "I absolutely love this new phone!" → Positive
  - "This service is horrible and frustrating" → Negative
  - "The event happened yesterday" → Neutral
- Clone the repository.
- Install required dependencies: `pip install -r requirements.txt`
- Add your Gemini API key in the code: `api_key = "WRITE_YOUR_API_KEY_HERE"`
- Open and run the notebook in VS Code or Jupyter Notebook.
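As an alternative to pasting the key into the code, a common pattern is to read it from an environment variable and request embeddings in batches. The sketch below assumes the `google-generativeai` SDK and its `embed_content` call; the `GEMINI_API_KEY` variable name, the helper names, and the batch size of 50 are hypothetical choices, and the notebook may do this differently. The Gemini backend is imported lazily, so the offline demonstration at the end runs without the SDK or a key.

```python
import os


def get_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from the environment instead of the source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_texts(texts, embed_batch, batch_size=50):
    """Embed texts batch by batch using the supplied `embed_batch` callable."""
    vectors = []
    for batch in batched(list(texts), batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def gemini_embed_batch(batch):
    """Assumed backend: the google-generativeai SDK's embed_content call."""
    import google.generativeai as genai  # imported lazily; optional dependency

    genai.configure(api_key=get_api_key())
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=batch,
        task_type="classification",
    )
    return result["embedding"]  # one vector per input text


# Offline demonstration with a fake backend (vector = [length of text]).
fake_vectors = embed_texts(
    ["good", "bad!", "ok"], lambda b: [[len(t)] for t in b], batch_size=2
)
print(fake_vectors)  # [[4], [4], [2]]

# With a real key exported in the shell:
# vectors = embed_texts(list_of_tweets, gemini_embed_batch)
```

Passing the backend in as a callable keeps the batching logic testable without network access and makes it easy to add retries or rate limiting around the API call later.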