Detecting Fake Reviews on Amazon

Python PySpark Flask NLP BERT HuggingFace 

This project was created as part of the graduate course CSE 6242 - Data and Visual Analytics (Fall 2021) at Georgia Tech. Authors: Sittun Swayam Prakash, Atrima Ghosh, Parth Iramani, Zoe Masood, Jenna Gottschalk, Mugundhan Murugesan.

The aim of the project is to detect fake reviews on Amazon using only the review text. Our approach combines semi-supervised learning with transformer models. The end product is a web application that lets the user predict whether a review is fake or real.

The datasets used in our project are:
Labeled dataset from Kaggle - https://www.kaggle.com/lievgarcia/amazon-reviews
Unlabeled dataset from Amazon - https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz

We created and evaluated the model that predicts fake reviews from review text in the following steps:
(1) We split the original labeled dataset into four parts: 70% training set, 10% first validation set to compare initial supervised classification models, 10% second validation set to compare the updated classification models, and 10% test set to evaluate the final classifier’s performance.
(2) We trained the initial model on the training set of labeled data.
(3) We generated pseudo-labels by using the initial model to classify the unlabeled dataset.
(4) The pseudo-labels predicted with confidence above a specific threshold became training data for the next step.
(5) We updated the initial model using the pseudo-labels as training data.
(6) Finally, we tested the final classifier on the test set of the original labeled data.
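
As a rough illustration of steps (1), (3), and (4) above, the Python sketch below shows a 70/10/10/10 split and a confidence filter on pseudo-labels. It is not the project's code: the function names `split_labeled` and `select_confident`, the 0.9 threshold, the label encoding (1 = fake, 0 = real), and the `predict_proba` callable standing in for the trained initial model are all assumptions made for the example.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple


def split_labeled(items: Sequence, seed: int = 0) -> Tuple[list, list, list, list]:
    """Shuffle and split into train / validation 1 / validation 2 / test (70/10/10/10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    a, b, c = n * 7 // 10, n * 8 // 10, n * 9 // 10
    return shuffled[:a], shuffled[a:b], shuffled[b:c], shuffled[c:]


def select_confident(
    texts: Iterable[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,  # assumed value; the write-up only says "a specific threshold"
) -> List[Tuple[str, int]]:
    """Keep (text, pseudo_label) pairs whose predicted-class confidence meets the threshold.

    predict_proba is a hypothetical stand-in for the initial model: it returns
    the estimated probability that a review is fake.
    """
    selected = []
    for text in texts:
        p_fake = predict_proba(text)
        confidence = max(p_fake, 1.0 - p_fake)  # confidence in the predicted class
        if confidence >= threshold:
            # Label encoding is an assumption: 1 = fake, 0 = real.
            selected.append((text, 1 if p_fake >= 0.5 else 0))
    return selected


if __name__ == "__main__":
    train, val1, val2, test = split_labeled(range(1000))
    print(len(train), len(val1), len(val2), len(test))

    # Toy scores in place of a real model, purely for demonstration.
    toy_scores = {"great!!! best ever": 0.97, "works fine": 0.55, "broke after a week": 0.04}
    print(select_confident(toy_scores, toy_scores.get))
```

In a pipeline like the one described, the selected pairs would be added to the training data before the retraining in step (5), while the test split stays untouched until step (6).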

The poster below explains the motivation and results of this project:

My contributions:
- Worked on data processing, cleaning, and feature engineering using Spark on the Databricks platform.
- Implemented the final model retraining step of the semi-supervised approach.
- Procured a temporary server and assisted in deploying the project on a Microsoft Azure server.
- Contributed to the poster, the report, and the video.



We also created a webpage where users can check whether a review on Amazon is predicted to be fake. A screenshot is attached below:



Below is the video of the project presentation:



Feel free to check out the code in the GitHub repository:

GitHub Repository

