
Strategic Listening: A Guide to Python Social Media Analysis


With a global penetration rate of 58.4%, social media provides a wealth of opinions, ideas, and discussions shared daily. This data offers rich insights into the most important and popular conversation topics among users.

In marketing, social media analysis can help companies understand and leverage consumer behavior. Two common, practical methods are:

  • Topic modeling, which answers the question, “What conversation topics do users talk about?”
  • Sentiment analysis, which answers the question, “How positively or negatively are users speaking about a topic?”

In this article, we use Python for social media data analysis and demonstrate how to gather vital market information, extract actionable feedback, and identify the product features that matter most to clients.

To demonstrate the utility of social media analysis, let's perform a product analysis of various smartwatches using Reddit data and Python. Python is a strong choice for data science projects, and it offers many libraries that facilitate the implementation of the machine learning (ML) and natural language processing (NLP) models that we will use.

This analysis uses Reddit data (as opposed to data from Twitter, Facebook, or Instagram) because Reddit is the second most trusted social media platform for news and information, according to the American Press Institute. In addition, Reddit's subforum organization produces "subreddits" where users recommend and criticize specific products; its structure is ideal for product-centered data analysis.

First, we use sentiment analysis to compare user opinions of popular smartwatch brands and discover which products are viewed most positively. Then, we use topic modeling to narrow in on specific smartwatch attributes that users frequently discuss. Though our example is specific, you can apply the same analysis to any other product or service.

Preparing Sample Reddit Data

The data set for this example contains the title of the post, the text of the post, and the text of all comments for the 100 most recent posts made in the r/smartwatch subreddit. In other words, it captures the 100 most recent full discussions of the product, including users' experiences, recommendations about products, and their pros and cons.

To collect this information from Reddit, we will use PRAW, the Python Reddit API Wrapper. First, create a client ID and secret token on Reddit by following the OAuth2 guide. Next, follow the official PRAW tutorials on downloading post comments and getting post URLs.
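
A minimal collection sketch, following those tutorials, might look like this. The credentials are placeholders, and the output path is an assumption chosen to match the file loaded later in this article:

import praw
import pandas as pd

# Placeholder credentials; substitute the client ID and secret token
# you created by following Reddit's OAuth2 guide.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="smartwatch-analysis by u/YOUR_USERNAME",
)

# Collect the title, body text, and flattened comment text of the
# 100 most recent posts in r/smartwatch.
rows = []
for submission in reddit.subreddit("smartwatch").new(limit=100):
    submission.comments.replace_more(limit=0)  # expand "load more comments" stubs
    rows.append({
        "Title": submission.title,
        "Text": submission.selftext,
        "Comment_text": " ".join(comment.body for comment in submission.comments.list()),
    })

# Save as JSON Lines so pandas can load it with read_json(..., lines=True).
pd.DataFrame(rows).to_json("./sample_data/data.json", orient="records", lines=True)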

Sentiment Analysis: Identifying Leading Products

To identify leading products, we can examine the positive and negative comments users make about certain brands by applying sentiment analysis to our text corpus. Sentiment analysis models are NLP tools that categorize texts as positive or negative based on their words and phrases. There is a wide variety of possible models, ranging from simple counters of positive and negative words to deep neural networks.

We will use VADER for our example because it is designed to optimize results for short texts from social networks by using lexicons and rule-based algorithms. In other words, VADER performs well on data sets like the one we are analyzing.

Use the Python ML notebook of your choice (for example, Jupyter) to analyze this data set. We install VADER using pip:

pip install vaderSentiment

First, we add three new columns to our data set: the compound sentiment values for the post title, post text, and comment text. To do this, iterate over each text and apply VADER's polarity_scores method, which takes a string as input and returns a dictionary with four scores: positivity, negativity, neutrality, and compound.

For our purposes, we will use only the compound score: the overall sentiment based on the first three scores, rated on a normalized scale from -1 to 1 inclusive, where -1 is the most negative and 1 is the most positive. This allows us to characterize the sentiment of a text with a single numerical value:

# Import VADER and pandas
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

analyzer = SentimentIntensityAnalyzer()

# Load the data set of posts and comments
data = pd.read_json("./sample_data/data.json", lines=True)

# Initialize lists to store the sentiment values
title_compound = []
text_compound = []
comment_text_compound = []

# Score each title, post text, and comment text with VADER
for title, text, comment_text in zip(data.Title, data.Text, data.Comment_text):
    title_compound.append(analyzer.polarity_scores(title)["compound"])
    text_compound.append(analyzer.polarity_scores(text)["compound"])
    comment_text_compound.append(analyzer.polarity_scores(comment_text)["compound"])

# Add the new columns with the sentiment
data["title_compound"] = title_compound
data["text_compound"] = text_compound
data["comment_text_compound"] = comment_text_compound

Next, we want to catalog the texts by product and brand; this allows us to determine the sentiment scores associated with specific smartwatches. To do this, we designate a list of product lines we want to analyze, and then we verify which products are mentioned in each text:

list_of_products = ["samsung", "apple", "xiaomi", "huawei", "amazfit", "oneplus"]

# For each text field, flag whether each product is mentioned
for column in ["Title", "Text", "Comment_text"]:
    for product in list_of_products:
        l = []
        for text in data[column]:
            l.append(product in text.lower())
        data["{}_{}".format(column, product)] = l

Certain texts may mention multiple products (for example, a single comment might compare two smartwatches). We can proceed in one of two ways:

  • We can discard these texts.
  • We can split these texts using NLP techniques. (In this case, we would assign a part of the text to each product.)

For the sake of code clarity and simplicity, our analysis discards these texts.
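
One simple way to discard them (a sketch, building on the boolean flag columns created above) is to clear the product flags on any row that mentions more than one product, so those texts drop out of the per-brand calculations that follow:

# For each text field, find rows that mention more than one product
# and clear their product flags, effectively discarding those texts.
for column in ["Title", "Text", "Comment_text"]:
    flag_columns = ["{}_{}".format(column, product) for product in list_of_products]
    multi_product = data[flag_columns].sum(axis=1) > 1
    data.loc[multi_product, flag_columns] = False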

Sentiment Analysis Results

Now we are able to examine our data and determine the average sentiment associated with various smartwatch brands, as expressed by users:

for product in list_of_products:
    mean = pd.concat([data[data["Title_{}".format(product)]].title_compound,
                      data[data["Text_{}".format(product)]].text_compound,
                      data[data["Comment_text_{}".format(product)]].comment_text_compound]).mean()
    print("{}: {}".format(product, mean))

We observe the following results:

Smartwatch | Sentiment Compound Score (Avg.)
---------- | -------------------------------
Samsung    | 0.4939
Apple      | 0.5349
Xiaomi     | 0.6462
Huawei     | 0.4304
Amazfit    | 0.3978
OnePlus    | 0.8413

Our analysis reveals useful market information. For example, users in our data set expressed more positive sentiment about the OnePlus smartwatch than about the other smartwatches.

Beyond considering average sentiment, businesses should also consider the factors affecting these scores: What do users love or hate about each brand? We can use topic modeling to dive deeper into our existing analysis and produce actionable recommendations on products and services.

Topic Modeling: Finding Important Product Attributes

Topic modeling is the branch of NLP that uses ML models to mathematically describe what a text is about. We will limit the scope of our discussion to classical NLP topic modeling approaches, though there are recent advances taking place using transformers, such as BERTopic.

There are many topic modeling algorithms, including non-negative matrix factorization (NMF), sparse principal component analysis (sparse PCA), and latent Dirichlet allocation (LDA). These ML models take a matrix as input and then reduce the dimensionality of the data. The input matrix is structured such that:

  • Each column represents a word.
  • Each row represents a text.
  • Each cell represents the frequency of each word in each text.

These are all unsupervised models that can be used for topic decomposition. The NMF model is commonly used for social media analysis, and it is the one we will use for our example because it allows us to obtain easily interpretable results. It produces an output matrix such that:

  • Each column represents a topic.
  • Each row represents a text.
  • Each cell represents the degree to which a text discusses a specific topic.

Our workflow follows this process:

[Flowchart: “Start topic modeling analysis” → “Identify and import dependencies” → “Create corpus of texts” → “Apply NMF model” → “Analyze results” (branching into “General analysis” and “Detailed (sentiment-based) analysis”) → “Integrate results into marketing.”]
The Topic Modeling Process

First, we will apply our NMF model to analyze general topics of interest, and then we will narrow in on positive and negative topics.

Analyzing General Topics of Interest

We will look at topics for the OnePlus smartwatch, since it had the highest compound sentiment score. Let's import the required packages, which provide NMF functionality and common stop words to filter from our text:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import NMF

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

Now, let's create a list with the corpus of texts we will use. We use the scikit-learn ML library's CountVectorizer and TfidfTransformer functions to generate our input matrix:

product = "oneplus"
corpus = pd.concat([data[data["Title_{}".format(product)]].Title,
                    data[data["Text_{}".format(product)]].Text,
                    data[data["Comment_text_{}".format(product)]].Comment_text]).tolist()

count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
x_counts = count_vect.fit_transform(corpus)

feature_names = count_vect.get_feature_names_out()
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)

(Note that details about handling n-grams, i.e., alternative spellings and usages such as “one plus,” can be found in my previous article on topic modeling; one simple option is sketched below.)
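
One simple option, sketched here rather than taken from that article, is to have CountVectorizer emit bigrams alongside single words so that a spelling like "one plus" becomes a feature of its own:

# Assumed variant of the vectorizer above: ngram_range=(1, 2) keeps
# both single words and two-word phrases such as "one plus".
count_vect = CountVectorizer(
    stop_words=stopwords.words('english'),
    lowercase=True,
    ngram_range=(1, 2),
)
x_counts = count_vect.fit_transform(corpus)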

We are ready to apply the NMF model and find the latent topics in our data. Like other dimensionality reduction methods, NMF needs the total number of topics to be set as a parameter (dimension). Here, we choose a 10-topic dimensionality reduction for simplicity, but you could test different values to see what number of topics yields the best unsupervised learning result; for example, try choosing the dimension that maximizes a metric such as the silhouette coefficient, or use the elbow method, as in the sketch below.
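
A minimal sketch of such a search, using scikit-learn's built-in reconstruction error (an assumed approach, not part of the original pipeline):

# Fit NMF for several candidate topic counts and print the reconstruction
# error; look for the "elbow" where the error stops dropping sharply.
for k in range(5, 25, 5):
    candidate = NMF(n_components=k, random_state=42)
    candidate.fit(x_tfidf)
    print(k, candidate.reconstruction_err_)

Having settled on a dimension of 10, we apply the model, setting a random state for reproducibility: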

import numpy as np

dimension = 10
nmf = NMF(n_components=dimension, random_state=42)
nmf_array = nmf.fit_transform(x_tfidf)

components = [nmf.components_[i] for i in range(len(nmf.components_))]
features = count_vect.get_feature_names_out()
important_words = [sorted(features, key=lambda x: components[j][np.where(features == x)], reverse=True)
                   for j in range(len(components))]

important_words contains lists of words, where each list represents one topic and the words within a topic are ordered by importance. It includes a mixture of meaningful and “garbage” topics; this is a common result in topic modeling, because it is difficult for the algorithm to successfully cluster all texts into just a few topics.

Examining the important_words output, we find meaningful topics around words like “budget” or “charge,” which points to features that matter to users when discussing OnePlus smartwatches:

['charge', 'battery', 'watch', 'best', 'range', 'days', 'life', 'android', 'bet', 'connectivity']
['budget', 'price', 'euros', 'buying', 'purchase', 'quality', 'tag', 'worth', 'smartwatch', '100']

Since our sentiment analysis produced a high compound score for OnePlus, we might assume that this means it has a lower price or better battery life compared to other brands. However, at this point, we don't know whether users view these factors positively or negatively, so let's conduct an in-depth analysis to get tangible answers.

Analyzing Positive and Negative Topics

Our more detailed analysis uses the same concepts as our general analysis, applied separately to positive and negative texts. We will uncover which factors users point to when speaking positively (or negatively) about a product.

Let's do this for the Samsung smartwatch. We will use the same pipeline, but with a different corpus:

  • We create a list of positive texts that have a compound score greater than 0.8.
  • We create a list of negative texts that have a compound score less than 0.

These numbers were chosen to select the top 20% of positive text scores (>0.8) and the top 20% of negative text scores (<0), and they produce the strongest results for our smartwatch sentiment analysis:

# First the negative texts.
product = "samsung"
corpus_negative = pd.concat([data[(data["Title_{}".format(product)]) & (data.title_compound < 0)].Title,
                             data[(data["Text_{}".format(product)]) & (data.text_compound < 0)].Text,
                             data[(data["Comment_text_{}".format(product)]) & (data.comment_text_compound < 0)].Comment_text]).tolist()

# Now the positive texts.
corpus_positive = pd.concat([data[(data["Title_{}".format(product)]) & (data.title_compound > 0.8)].Title,
                             data[(data["Text_{}".format(product)]) & (data.text_compound > 0.8)].Text,
                             data[(data["Comment_text_{}".format(product)]) & (data.comment_text_compound > 0.8)].Comment_text]).tolist()

print(corpus_negative)
print(corpus_positive)

We can repeat the same topic modeling method that we used for general topics of interest to reveal the positive and negative topics; a reusable sketch follows below. Our results now provide much more specific marketing information: For example, our model's negative corpus output includes a topic about the accuracy of calories burned, while the positive output is about navigation/GPS and health indicators like pulse rate and blood oxygen levels. Finally, we have actionable feedback on aspects of the smartwatch that users love, as well as areas where the product has room for improvement.
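
To avoid duplicating the earlier steps, you could wrap them in a small helper function (a sketch that simply reuses the pipeline from the general analysis) and call it once per corpus:

def extract_topics(corpus, dimension=10):
    """Run the CountVectorizer -> TfidfTransformer -> NMF pipeline from
    the general analysis and return each topic's words sorted by importance."""
    count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
    x_counts = count_vect.fit_transform(corpus)
    x_tfidf = TfidfTransformer().fit_transform(x_counts)
    nmf = NMF(n_components=dimension, random_state=42)
    nmf.fit(x_tfidf)
    features = count_vect.get_feature_names_out()
    # Sort each topic's words by their NMF component weight, descending.
    return [[features[i] for i in component.argsort()[::-1]]
            for component in nmf.components_]

negative_topics = extract_topics(corpus_negative)
positive_topics = extract_topics(corpus_positive)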

[Word cloud with words sized from largest to smallest: health, pulse, screen, sensor, fitness, exercise, miles, feature, heart, active.]
Word Cloud of a Samsung Positive Topic, Created With the wordcloud Library

To amplify your data findings, I would recommend creating a word cloud or another similar visualization of the important topics identified in this tutorial.
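
A minimal sketch with the wordcloud library (assuming the positive_topics list from the helper sketch above; the library is installed separately with pip install wordcloud):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Weight the top 30 words of one positive topic by rank so that more
# important words render larger in the cloud.
words = positive_topics[0][:30]
weights = {word: len(words) - rank for rank, word in enumerate(words)}

cloud = WordCloud(background_color="white").generate_from_frequencies(weights)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()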

Through our analysis, we understand what users think of a target product and those of its competitors, what users love about top brands, and what may be improved for better product design. Analyzing public social media data allows you to make informed decisions regarding business priorities and enhance overall user satisfaction. Incorporate social media analysis into your next product cycle for improved marketing campaigns and product design, because listening is everything.


Further Reading on the Toptal Engineering Blog:

  1. Data Mining for Predictive Social Network Analysis
  2. Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results
  3. Getting Started With TensorFlow: A Machine Learning Tutorial
  4. Adversarial Machine Learning: How to Attack and Defend ML Models

The editorial team of the Toptal Engineering Blog extends its gratitude to Daniel Rubio for reviewing the code samples and other technical content presented in this article.


