Problem Statement
As a data scientist for the marketing division at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddits, so we can use them to determine which advertisements should populate each page. Because this is a classification problem, I'll use Logistic Regression and Bayes models. Misclassifications in this case will be fairly benign, so I will use the accuracy score and a baseline of 63.3% to measure success. Using TfidfVectorizer, I'll get the feature importances to find out which words have the highest predictive power for the target variable. If successful, this model could be used to target other pages that have a similar frequency of the same words and phrases.
See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.
After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
Data Cleaning and EDA
- dropped rows with a null selftext column because those rows are useless to me.
- combined the title and selftext columns into one new all_text column
- examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
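The cleaning steps above could look roughly like the following pandas sketch. The column names `title`, `selftext`, and `subreddit`, the helper `clean_posts`, and the tiny example frame are assumptions for illustration; the actual notebooks may differ.

```python
import pandas as pd


def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Drop posts without body text and build the combined text column."""
    # Rows with a null selftext have no body to model on, so drop them.
    cleaned = df.dropna(subset=["selftext"]).copy()
    # Merge title and body into a single all_text column.
    cleaned["all_text"] = cleaned["title"] + " " + cleaned["selftext"]
    # Word counts per post, used to compare distributions across subreddits.
    cleaned["title_word_count"] = cleaned["title"].str.split().str.len()
    cleaned["selftext_word_count"] = cleaned["selftext"].str.split().str.len()
    return cleaned


# Made-up rows, only to show the shape of the result.
posts = pd.DataFrame({
    "title": ["First date tips", "Need advice", "Is this normal"],
    "selftext": ["Where should we go", None, "We argue a lot lately"],
    "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
})
cleaned = clean_posts(posts)

# Summary statistics of body length per subreddit.
word_count_summary = cleaned.groupby("subreddit")["selftext_word_count"].describe()
```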
Preprocessing and Modeling
Found the baseline accuracy score of 0.633, which means that if I always pick the value that occurs most often, I will be right 63.3% of the time.
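As a quick illustration of how a majority-class baseline is computed, here is a minimal sketch. The label counts below are invented so that the ratio comes out near 0.633; they are not the real dataset sizes, and which subreddit is the majority class here is arbitrary.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Illustrative labels only: 19 of 30 posts fall in the majority class.
labels = ["relationship_advice"] * 19 + ["dating_advice"] * 11
print(round(baseline_accuracy(labels), 3))  # prints 0.633
```

Any model worth keeping should beat this number on held-out data.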
First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99% | test: 75% | cross val: 74%
Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraped data; pretty bad score with high variance. Train: 99% | test: 72%
- tried decreasing max features and the score got a lot worse
- tried lemmatizer preprocessing instead and the test score went up to 74%
Just increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores later disappeared.
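The CountVectorizer setup described above can be sketched with scikit-learn as follows. The synthetic corpus, the 0/1 label encoding, the helper `build_cvec_model`, and settings such as `random_state=42`, `max_iter=1000`, and `cv=5` are assumptions for illustration, so the scores this produces say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline


def build_cvec_model():
    """CountVectorizer with min_df=3 and unigrams plus bigrams, then logistic regression."""
    return make_pipeline(
        CountVectorizer(min_df=3, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


# Synthetic stand-in corpus (not the scraped posts).
tails = ["and I feel nervous", "what should I say", "is this a red flag",
         "how do I move forward", "need honest advice"]
texts = ([f"first date this weekend {t}" for t in tails] * 4
         + [f"my boyfriend and I argued {t}" for t in tails] * 4)
labels = [0] * 20 + [1] * 20  # 0 = dating advice, 1 = relationship advice

# stratify keeps the class balance identical in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, random_state=42
)

model = build_cvec_model()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(build_cvec_model(), X_train, y_train, cv=5).mean()
```

Comparing `train_score` against `test_score` and `cv_score` is what exposes the kind of variance gap reported above.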
I think Tfidf worked best to reduce my overfitting (a variance problem) because I customized the stop words to remove the ones that were too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive terms ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word had to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped eliminate oversaturated words too. Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus only on the most commonly used words of what was left.
Conclusion and Recommendations
Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially profitable pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the TfidfVectorizer to see if different stop words make a difference.
Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post came from.