Introducing the data

This data set was put together in 2017 to determine which Halloween candy people prefer most. According to the project's article, FiveThirtyEight ran an experimental survey in which people were shown two randomly chosen Halloween candies (out of 85 different kinds) and asked to pick the one they preferred.

FiveThirtyEight collected votes from 8,371 different IP addresses across about 269,000 candy matchups.

The data set shows the win rate (popularity) for each candy, as well as its attributes.

Rows and columns

In the data set, each row corresponds to a single candy. For each candy, the data includes the following variables. For binary variables, 1 means "Yes", and 0 means "No".

  • competitorname: name of the candy
  • chocolate: whether the candy contains chocolate
  • fruity: whether the candy is fruit flavored
  • caramel: whether the candy contains caramel
  • peanutyalmondy: whether the candy contains peanuts, peanut butter, or almonds
  • nougat: whether the candy contains nougat
  • crispedricewafer: whether the candy contains crisped rice, wafers, or a cookie component
  • hard: whether the candy is a hard candy
  • bar: whether the candy is a candy bar
  • pluribus: whether the candy is one of many candies in a bag or box
  • sugarpercent: the percentile of sugar the candy falls under within the data set
  • pricepercent: the unit price percentile compared to the rest of the set
  • winpercent: the overall win percentage according to 269,000 matchups
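As a minimal sketch of how rows in this layout could be read, the snippet below parses CSV text into typed records using only Python's standard library. The two sample rows, their names, and their values are placeholders invented for illustration (not real rows from the data set), only a subset of the columns is included, and `load_candies` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical two-row sample in the layout described above; the names and
# values are placeholders, not real rows from the data set.
SAMPLE_CSV = """competitorname,chocolate,fruity,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
Placeholder Bar,1,0,0,1,0,0.60,0.80,70.0
Placeholder Drops,0,1,1,0,1,0.30,0.10,35.0
"""

BINARY_COLUMNS = ("chocolate", "fruity", "hard", "bar", "pluribus")
NUMERIC_COLUMNS = ("sugarpercent", "pricepercent", "winpercent")


def load_candies(text):
    """Parse CSV text into a list of dicts, one per candy, with typed values."""
    candies = []
    for row in csv.DictReader(io.StringIO(text)):
        candy = {"competitorname": row["competitorname"]}
        for col in BINARY_COLUMNS:
            candy[col] = row[col] == "1"  # 1 means "Yes", 0 means "No"
        for col in NUMERIC_COLUMNS:
            candy[col] = float(row[col])
        candies.append(candy)
    return candies


candies = load_candies(SAMPLE_CSV)
for candy in candies:
    print(candy["competitorname"], candy["chocolate"], candy["winpercent"])
```

Converting the binary columns to booleans up front keeps later filters (for example, "all chocolate candies") easy to read.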

Visualizing the data

In this section, I'm going to visualize the data to explore the following three topics:

1. The win rate ranking

First, I'm going to visualize the candies' win rates. Since candies have many attributes, I'm also going to show how many candies have each attribute and where those candies sit in the ranking.

To do that, I'm going to create an ordered bar chart.

Each bar represents a candy's win rate. The higher a bar sits in the chart, the more popular the candy (the higher its win rate).
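The ordering step behind such a chart can be sketched in a few lines of Python. This is only an illustration of "sort by win rate, then draw one bar per candy" as a text chart, not the code behind the interactive chart; the candy names and win rates are placeholders, and `ordered_bars` is a hypothetical helper.

```python
# Placeholder (name, win rate in %) pairs; not real values from the data set.
win_rates = [("Candy A", 48.0), ("Candy B", 72.5), ("Candy C", 30.0)]


def ordered_bars(rates, width=40):
    """Return one text bar per candy, highest win rate first."""
    ranked = sorted(rates, key=lambda pair: pair[1], reverse=True)
    lines = []
    for name, pct in ranked:
        bar = "#" * round(pct / 100 * width)  # bar length proportional to win rate
        lines.append(f"{name:<10} {bar} {pct:.2f}%")
    return lines


chart = ordered_bars(win_rates)
print("\n".join(chart))
```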

The candy with the highest win rate is Reese's Peanut Butter Cup, at 84.18%.

The least preferred candy is Nik L Nip, a wax-bottled mini candy drink, with a win rate of 22.45%.

These are the candies that contain chocolate. The visualization suggests that higher-ranked candies tend to contain chocolate. (You can always hover over a bar to check which candy it represents!)

These candies have fruity flavors. Fruity candies are spread fairly evenly across the ranking, except in the highest-ranked group.

There are other candy attributes: caramel, crispy, peanuty, and nougat. You can click the buttons above the chart to show the bars with a given attribute. If you hover over a bar, the tooltip also shows the candy's attributes as circles in the same colors as the buttons.

Explore more yourself!

2. Relationships between sugar, price, and win percentage

Here, I'm making a scatter plot to see whether a candy's popularity correlates with its sugar content or its unit price.

Scatter plot of sugar, price, win percentages

On the y-axis, "Sugar percentile" is the candy's sugar percentile within the data set. The color key, "Price percentile", shows the candy's unit price percentile relative to the other candies in the data set.

From this visualization, there doesn't appear to be a notable correlation between a candy's sugar content and its popularity, or between its price and its popularity.
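A visual impression like this could be checked numerically with a Pearson correlation coefficient, which the article itself does not compute. The sketch below implements the standard formula by hand; the sugar and win values are placeholders invented for illustration, so the printed number says nothing about the real data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, not taken from the data set.
sugar_percentile = [0.10, 0.35, 0.50, 0.70, 0.90]
win_percent = [42.0, 55.0, 38.0, 61.0, 47.0]

r = pearson(sugar_percentile, win_percent)
print(f"r = {r:.3f}")
```

A value of r near 0 would agree with the scatter plot's impression of no notable linear relationship, while values near 1 or -1 would indicate a strong one.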

3. Candy bar or hard candy?

Finally, all the candies can be grouped as either 1) hard candy, 2) candy bar, or 3) neither.
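This grouping can be derived from the `hard` and `bar` columns. The sketch below is one way to do it; `candy_type` is a hypothetical helper, and it assumes no candy is flagged as both a hard candy and a candy bar.

```python
def candy_type(hard, bar):
    """Return the candy's type label from its `hard` and `bar` flags.

    Assumption: the two flags are never both set. If they were, this
    sketch would arbitrarily report "Hard candy".
    """
    if hard:
        return "Hard candy"
    if bar:
        return "Candy bar"
    return "Neither"


print(candy_type(hard=1, bar=0))
print(candy_type(hard=0, bar=1))
print(candy_type(hard=0, bar=0))
```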

In this section, I'm going to create an alluvial diagram to visualize the relationship between a candy's type and its win percentage. Here, I will use the R Graph Gallery and CRAN - R Project as references.

First, I'm going to reshape the data into an alluvial form. To do this, I divide the win rates into four groups (quartiles).

The top 25% in popularity is labeled "Rank 1", followed by "Rank 2" (25% to 50%), "Rank 3" (50% to 75%), and "Rank 4" (bottom 25%).
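The labeling and reshaping steps can be sketched as follows. This is not the code used for the diagram: it splits candies into four equal-sized groups by their position in the sorted ranking, which is one simple reading of the quartile split, and then counts each (type, rank) pair, since those counts are the flows an alluvial diagram draws. The types and win rates are placeholders, and `rank_labels` is a hypothetical helper.

```python
from collections import Counter


def rank_labels(win_rates):
    """Label each win rate "Rank 1" (top 25%) through "Rank 4" (bottom 25%)."""
    n = len(win_rates)
    # Indices of the candies, most popular first.
    order = sorted(range(n), key=lambda i: win_rates[i], reverse=True)
    labels = [None] * n
    for position, i in enumerate(order):
        quartile = position * 4 // n  # 0 for the top quarter, 3 for the bottom
        labels[i] = f"Rank {quartile + 1}"
    return labels


# Placeholder candy types and win rates, not taken from the data set.
types = ["Candy bar", "Candy bar", "Neither", "Neither",
         "Hard candy", "Neither", "Hard candy", "Neither"]
wins = [80.0, 70.0, 65.0, 55.0, 45.0, 40.0, 30.0, 25.0]

labels = rank_labels(wins)

# Alluvial form: one count per (candy type, rank) pair.
flows = Counter(zip(types, labels))
for (kind, rank), count in sorted(flows.items()):
    print(kind, rank, count)
```

Splitting by sorted position guarantees groups of nearly equal size; cutting at quartile values of the win rate instead could place tied candies differently.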

Alluvial diagram of candy types and win rate

This visualization shows that candy bars tend to be preferred: about half of them rank in the top 25%. Conversely, hard candies are less popular; most of them rank in the bottom 50%.

However, as the visualization shows, more than half of the candies are labeled "Neither". They include Reese's Peanut Butter Cup, Candy Corn, Peanut Butter M&M's, Skittles Original, and so forth. Since many popular candies fall under "Neither", a more detailed analysis is needed to see which types of candy are popular among Americans.

Now, I'm craving candy...
Happy Halloween!