Introduction and purpose of the project

This spring, cherry blossoms saw the record earliest first-blooming in various locations in Japan. There are many news reports, including one by The Washington Post and BBC. These articles tell concerns about the trend of first-blooming getting earlier and earlier, which they say is a sign of climate change.

The purpose of this project is to understand the statistical trend of first-blooming dates of cherry blossoms in Japan, as well as to explore some theories regarding first-blooming dates.

Exploratory data analysis questions

The following are three exploratory data analysis questions I’m going to explore in this project:

  1. Are cherry blossoms’ first-blooming date really getting earlier over time? Cherry blossoms may have bloomed the earliest ever, but it could have happened by random chance. To determine what is really happening, I’m going to investigate the general trend regarding first-blooming date over time.

For this question, I’m first going to compare the average first blooming dates over the decades, followed by geographical analysis to show in which locations cherry blossoms were blooming on April 1st over time. I’m then comparing each year’s actual first-blooming dates to 30-year-average in various locations in Japan.

  1. What is the geographical distribution about whether 400- and 600-degree theories hold true? There are several theories to determine cherry blossom’s first-blooming date. One is the 400-degree theory, (the first-blooming happens when the cumulative daily average temperature since February 1st reaches 400 degrees). Another is the 600-degree theory (first-blooming occurs when the cumulative daily high temperature since February 1st reaches 600 degrees).

In the previous project, I examined these theories in six major locations in Japan and concluded these theories would hold true in some locations but not others. In this section, I’m going to conduct the same analysis for all observation locations and see if there is a geographical pattern of where these theories would be true.

Moreover, for places where these theories are found not to expect the actual first-blooming date, I’m going to calculate how many days before or after the expected date (based on the theories) first-blooming is likely to happen.

  1. Is there a significant correlation between difference in days and latitude? Based on the analysis on Question 2, I’d like to determine if there is a significant correlation between latitude and difference in days (between actual and expected first-blooming date based on 400- and 600-theories).


The main data set I’m using is about cherry blossom’s first-bloom dates from 1953 to 2020 in more than 100 locations in Japan, published by Japan Meteorological Agency(JMA). I downloaded a version posted on

This data set includes cherry blossom’s first-blooming dates in more than 100 locations in Japan where the first-blooming were ever observed by JMA. For this project, I’m going to focus on 58 locations where observations are currently conducted.

I’m also using geographical data about JMA’s observation locations, available on the agency’s list of observatories.

Finally, to analyze the second and third questions, I’m using data about daily high and average temperature in more than 50 locations in Japan. Those data sets were directly downloaded from JMA, by using the agency’s downloading system.

For all tests in this project, I’m going to apply the alpha value of 0.05.

Load data and conduct an initial analysis

First, I’m going to load the data set I’m going to mainly analyze on this project.

Load first-blooming date data

# load data set about cherry blossom's first-blooming date
sakura_raw <- read_csv("data/sakura_first_bloom_dates.csv")

# tidy data
sakura_gathered <- sakura_raw %>% 
  select(-`30 Year Average 1981-2010`, -Notes) %>% 
  gather(key = "year", value = "date", "1953":"2020") %>% 
  rename(location = `Site Name`,
         now_observed = `Currently Being Observed`)

# filter out locations currently not observed, calculate date of the year
sakura <- sakura_gathered %>% 
  filter(now_observed == TRUE) %>% 
    date_x = ymd(date),
    # Not using lubridate::floor_date() function because some dates should be calculated as negative values
    floor_date = as.Date(paste(year, "-01-01", sep = ""), format = "%Y-%m-%d"),
    doy_x = interval(floor_date, date_x) %>% time_length(unit = "day")) %>% 
  select(-floor_date, -now_observed, -date)

## # A tibble: 6 x 4
##   location  year  date_x     doy_x
##   <chr>     <chr> <date>     <dbl>
## 1 Wakkanai  1953  1953-05-21   140
## 2 Asahikawa 1953  1953-05-11   130
## 3 Abashiri  1953  1953-05-24   143
## 4 Sapporo   1953  1953-05-07   126
## 5 Obihiro   1953  1953-05-15   134
## 6 Kushiro   1953  NA            NA

The cleaned data set has 3944 rows and 4 columns. Each row shows an observation, and columns are as follows:

  • location: the location the cherry blossom’s first-blooming was observed
  • year: the year of observation
  • date_x: date of first-blooming (Year-Month-Day)
  • doy_x: date of first-blooming, converted as the number of days after January 1st

I’m computing the dates as number of days after January 1st. Negative numbers mean the first-blooming was before January 1st (late December). I count those cases as first-blooming in the coming spring (thus negative number of days after January 1st).

Load observation location data and conduct an initial analysis

First, I’m going to plot the data to see overall distribution.

# First exploratory plot
sakura %>% 
  group_by(year) %>% 
  ggplot(aes(x = as.numeric(year), y = doy_x)) +
  geom_point(color = "#ff0080", alpha = 0.1) +
  geom_smooth(color = "black") +
  scale_x_continuous(breaks = c(1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 
                                1995, 2000, 2005, 2010, 2015, 2020)) +
  labs(title = "Distribution of cherry blossom's first-bloom days",
       x = "Year", y = "Number of days after January 1st") +