Introducing the data
This data set was put together in 2017 to determine which
Halloween candy people prefer most. According to
the project's article, FiveThirtyEight conducted an experimental survey
in which people were shown two randomly chosen Halloween candies
(out of 85 different kinds) and asked to choose the one they
preferred.
FiveThirtyEight collected votes on about 269,000 candy matchups
from 8,371 different IP addresses.
The data set
shows the win rate (popularity) for each candy, as well as
its attributes.
Rows and columns
In the data set, each row corresponds to a single candy.
For each candy, the data includes the following
variables. For binary variables, 1 means "Yes", and 0
means "No".
- competitorname: name of the candy
- chocolate: whether the candy contains chocolate
- fruity: whether the candy is fruit flavored
- peanutalmondy: whether the candy contains peanuts, peanut butter, or almonds
- nougat: whether the candy contains nougat
- crispedricewafer: whether the candy contains crisped rice, wafers, or a cookie component
- hard: whether the candy is a hard candy
- bar: whether the candy is a candy bar
- pluribus: whether the candy is one of many candies in a bag or box
- sugarpercent: the percentile of sugar the candy falls under within the data set
- pricepercent: the unit price percentile compared to the rest of the set
- winpercent: the overall win percentage according to 269,000 matchups
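To make the row layout concrete, here is a minimal sketch of reading rows like these into typed records. It is written in Python using only the standard library. The two sample rows, including the candy names and every value, are placeholders invented for illustration rather than real survey results, and only a subset of the columns is shown. The scale of each percent column in the sample is likewise an assumption.

```python
import csv
import io

# Columns holding 1/0 flags and columns holding numeric scores.
# Only a subset of the variables listed above is used here.
BINARY_COLUMNS = ["chocolate", "fruity", "hard", "bar", "pluribus"]
PERCENT_COLUMNS = ["sugarpercent", "pricepercent", "winpercent"]

# Placeholder rows: names and numbers are made up for illustration.
SAMPLE_CSV = """competitorname,chocolate,fruity,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
Placeholder Bar,1,0,0,1,0,0.60,0.80,70.0
Placeholder Drops,0,1,1,0,1,0.40,0.20,35.0
"""


def parse_candies(text):
    """Return one dict per candy, with flags as bool and scores as float."""
    candies = []
    for row in csv.DictReader(io.StringIO(text)):
        candy = {"competitorname": row["competitorname"]}
        for name in BINARY_COLUMNS:
            candy[name] = row[name] == "1"  # 1 means "Yes", 0 means "No"
        for name in PERCENT_COLUMNS:
            candy[name] = float(row[name])
        candies.append(candy)
    return candies


candies = parse_candies(SAMPLE_CSV)
for candy in candies:
    print(candy["competitorname"], candy["bar"], candy["winpercent"])
```

Converting the flags to booleans up front keeps later filters, such as selecting only candy bars, short and readable.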
Visualizing the data
In this section, I'm going to visualize the data to answer
the following questions:
- What is each candy's win rate?
- Is there any relationship between a candy's sugar content, or unit price, and its popularity?
- How many candies are hard candies and how many are candy bars? Which type is more popular overall?
2. Relationships between sugar, price, and win percentiles
Here, I'm making a scatter plot to see if there's a
correlation between a candy's sugar content, or unit
price, and its popularity.
On the y-axis, "Sugar percentile" is the sugar percentile
the candy falls in within the data set. The color key
"Price percentile" shows the candy's unit price percentile
compared to the other candies in the data set.
From this visualization, it doesn't look like there's a
notable correlation between how much sugar each candy
contains and its popularity, or the candy's price and its
popularity.
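A scatter plot gives a visual impression, and a correlation coefficient is one way to put a number on it. The sketch below computes a Pearson correlation by hand in Python. It is not the code behind the plot, and the sugar and win values are placeholders rather than figures from the data set, so the printed number carries no meaning about real candies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, invented for illustration only.
sugar = [0.1, 0.4, 0.6, 0.9]
win = [50.0, 42.0, 61.0, 48.0]

r_sugar = pearson(sugar, win)
print(f"sugar vs. win: r = {r_sugar:.2f}")
```

A value near 0 would agree with the impression of no notable linear relationship, while values near 1 or -1 would indicate a strong one. The same function can be applied to the price percentile.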
3. Candy bar or hard candy?
Finally, all the candies can be grouped as 1)
hard candy, 2) candy bar, or 3) neither.
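This three-way grouping can be sketched as a small Python function over the hard and bar flags. The label strings and the decision to reject a candy with both flags set are assumptions made for this sketch, and the flag pairs at the bottom are placeholders rather than rows from the data set.

```python
from collections import Counter


def candy_type(hard, bar):
    """Map the hard and bar flags (1/0 or True/False) to a type label."""
    if hard and bar:
        # Assumption: the three-way grouping expects no candy to be both.
        raise ValueError("candy is flagged as both hard candy and candy bar")
    if hard:
        return "Hard candy"
    if bar:
        return "Candy bar"
    return "Neither"


# Placeholder (hard, bar) flag pairs, invented for illustration.
flags = [(1, 0), (0, 1), (0, 0), (0, 0)]
type_counts = Counter(candy_type(hard, bar) for hard, bar in flags)
print(type_counts)
```

Counting the labels this way answers the "how many of each type" question directly.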
In this section, I'm going to create an alluvial diagram
and visualize the relationship between a candy's type
and win percentage. Here, I will use the R Graph Gallery and
CRAN - R Project as references.
First, I'm going to reshape the data into an alluvial
form. To do this, I divide the win rate into four groups
(quartiles).
The top 25% in popularity is labeled "Rank 1", followed by
"Rank 2" (25% to 50%), "Rank 3" (50% to 75%), and
"Rank 4" (bottom 25%).
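As a rough illustration of the ranking step, the Python sketch below sorts candies by win percentage and splits them into four equal-sized groups by position. This is an assumption about how the cut could be made, not the original code, which may have used quantile cut points instead. The win percentages are placeholders.

```python
def quartile_ranks(win_percents):
    """Label each value "Rank 1" (top 25%) through "Rank 4" (bottom 25%)."""
    n = len(win_percents)
    # Indices of the candies, ordered from most to least popular.
    order = sorted(range(n), key=lambda i: win_percents[i], reverse=True)
    ranks = [None] * n
    for position, index in enumerate(order):
        # Positions 0..n-1 are split into four equal-width bands.
        ranks[index] = f"Rank {4 * position // n + 1}"
    return ranks


# Placeholder win percentages, invented for illustration.
wins = [80.0, 20.0, 55.0, 35.0, 65.0, 45.0, 70.0, 30.0]
labels = quartile_ranks(wins)
for win, label in zip(wins, labels):
    print(win, label)
```

Splitting by position keeps the four groups as close to equal in size as possible. Tied win percentages keep their original order, so two tied candies at a boundary can land in different ranks.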
This visualization shows that candy bars tend to be more
preferred; about half of the candy bars rank in the
top 25%. Conversely, hard candies are less popular; most
of them rank in the bottom 50%.
However, as shown in the visualization, more than half
the candies have "Neither" as their candy type.
Those candies include Reese's Peanut Butter Cup, Candy Corn,
Peanut Butter M&M's, Skittles Original, and so forth.
Since a lot of popular candies are
labeled as "Neither", a more detailed analysis is needed
to better see what types of candies are popular among
American people.
Now, I'm craving candy...
Happy Halloween!