This document describes all the steps taken to clean and transform
Cyclistic’s raw datasets to prepare the data for the analysis stage. For
the purpose of this case study, the data has been collected for the
months of August 2021 to July 2022. The dataset can be downloaded here.
Please note that Cyclistic is a fictional company. Raw data has been
collected by Motivate International Inc. the company which operates the
City of Chicago’s Divvy bicycle sharing service. The license to use this
public dataset can be found here.
These were some of the packages used for the data cleaning process.
library(tidyverse)
library(lubridate)
library(stringr)
library(scales)
2.1 Load Raw Data
We will create separate dataframes for loading data from August 2021 to July 2022.
trip_202108 <- read_csv("202108-divvy-tripdata/202108.csv")
trip_202109 <- read_csv("202109-divvy-tripdata/202109.csv")
trip_202110 <- read_csv("202110-divvy-tripdata/202110.csv")
trip_202111 <- read_csv("202111-divvy-tripdata/202111.csv")
trip_202112 <- read_csv("202112-divvy-tripdata/202112.csv")
trip_202201 <- read_csv("202201-divvy-tripdata/202201.csv")
trip_202202 <- read_csv("202202-divvy-tripdata/202202.csv")
trip_202203 <- read_csv("202203-divvy-tripdata/202203.csv")
trip_202204 <- read_csv("202204-divvy-tripdata/202204.csv")
trip_202205 <- read_csv("202205-divvy-tripdata/202205.csv")
trip_202206 <- read_csv("202206-divvy-tripdata/202206.csv")
trip_202207 <- read_csv("202207-divvy-tripdata/202207.csv")
2.2 Check for data structure
The structure summary outputs will help us identify if any of the individual datasets have different string types, column names etc.
str(trip_202108)
str(trip_202109)
str(trip_202110)
str(trip_202111)
str(trip_202112)
str(trip_202201)
str(trip_202202)
str(trip_202203)
str(trip_202204)
str(trip_202205)
str(trip_202206)
str(trip_202207)
We found that all the column names were fine and the string types were
adhering to the data.
2.3 Merge Datasets
We will merge all 12 datasets into 1 dataframe.
# Combine into a single DataFrame
all_trips <- bind_rows(trip_202108, trip_202109, trip_202110, trip_202111, trip_202112,
trip_202201, trip_202202, trip_202203, trip_202204, trip_202205,
trip_202206, trip_202207)
3.1 Calculating Ride Length
Ride length will help us in further analysis and would also let us know if there are any rides less than 0.
# Calculating trip time in secs
all_trips <- mutate(all_trips, trip_time = ended_at - started_at)
# Converting trip time to numeric data type
all_trips$trip_time <- as.numeric(as.character(all_trips$trip_time))
3.2 Summarise by date variables
We will add separate columns for date variables (date, month, day of week, week, time of day etc.) which will help us in future analysis.
# Date
all_trips <- all_trips %>%
mutate(Date = as.Date(started_at))
# Day of week
all_trips <- all_trips %>%
mutate(day_of_week = wday(all_trips$started_at, label=TRUE))
# Month
all_trips <- all_trips %>%
mutate(month_name = format(started_at, "%m"))
# Year
all_trips <- all_trips %>%
mutate(year = year(ymd(Date)))
# Day
all_trips <- all_trips %>%
mutate(day= day(ymd(Date)))
# Time of Day
all_trips <- all_trips %>%
mutate(Time_of_Day = format(as.POSIXct(started_at), format = "%H:%M:%S"))
4.1 Remove rows with ride length < 0
As identified in section 3.1, there were some rows with ride length less than 0. These rides were those where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.
# Creating a separate dataframe for trip time greater than 0
all_trips_v2 <- all_trips_v2 %>%
filter(trip_time>0)
# Sorting dataset by starting time
all_trips_v2 <- arrange(all_trips_v2, started_at)
4.2 Remove rows with incomplete data
There were some instances where the station names were not recorded. We can check and remove these stations from our dataset.
# Remove blank or NA rows from start station name and end station name
## We will use this dataframe for further analysis
trips_cleaned <- all_trips %>%
filter(!is.na(start_station_name) | start_station_name=="") %>%
filter(!is.na(end_station_name) | end_station_name=="")
4.3 Checking for Inconsistencies
We will check for inconsistencies in the start and end station names and if all the stations are in title case.
#Check for inconsistencies in start station names
trips_cleaned %>%
filter(
str_detect(start_station_name, "[:upper:]")
& !str_detect(start_station_name,"[:lower:]")
) %>%
group_by(
start_station_name
) %>%
count(
start_station_name
)
#Check for inconsistencies in end station names
trips_cleaned %>%
filter(
str_detect(end_station_name, "[:upper:]")
& !str_detect(end_station_name,"[:lower:]")
) %>%
group_by(
end_station_name
) %>%
count(
end_station_name
)
4.4 Remove Inconsistencies
We found that there are some station names where the partial name is either upper or lower. We will change all the station names to title case.
#Convert upper and lower station names to title case
trips_cleaned$start_station_name <- str_to_title(trips_cleaned$start_station_name)
trips_cleaned$end_station_name <- str_to_title(trips_cleaned$end_station_name)
4.5 Checking for duplicates
The ride id is column is unique to each ride. We could use it to check for any duplicate rows.
#Checking for duplicate rows
duplicate_check <- trips_cleaned %>%
count(ride_id) %>%
filter(n > 1)
There were no duplicate rows observed in the dataset.
5.1 Check rideable type
#Checking for unique ride types available
unique(trips_cleaned[c("rideable_type")])
The above code returned three distinct ride types which are used in the cleaned dataset. It was done to identify whether a ride type was added to the dataset at a later date.
#Creating a separate dataframe to check if any ride type was added after the other
ride_type_added <- trips_cleaned %>%
group_by(rideable_type, year, month) %>%
count(rideable_type)
We found that all the ride types were added from the starting i.e Aug 2021 from the dataset.
5.2 Check station names
There might be some instances where some stations might be added or
removed from Cyclistic’s Portfolio. This can be reviewed using below
code:
Firstly a dataframe which lists unique station names should be
created.
#Identify unique station names
unique_station_names <- trips_cleaned %>%
group_by(start_station_name) %>%
count(start_station_name)
Further the dataframes which lists the unique station names used each month should be created
#Aug 2021 dataframe which lists the unique station names used each month
Aug_21 <- trips_cleaned %>%
filter(year==2021 & month==8) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Sep 2021 dataframe which lists the unique station names used each month
Sep_21 <- trips_cleaned %>%
filter(year==2021 & month==9) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Oct 2021 dataframe which lists the unique station names used each month
Oct_21 <- trips_cleaned %>%
filter(year==2021 & month==10) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Nov 2021 dataframe which lists the unique station names used each month
Nov_21 <- trips_cleaned %>%
filter(year==2021 & month==11) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Dec 2021 dataframe which lists the unique station names used each month
Dec_21 <- trips_cleaned %>%
filter(year==2021 & month==12) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Jan 2022 dataframe which lists the unique station names used each month
Jan_22 <- trips_cleaned %>%
filter(year==2022 & month==1) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Feb 2022 dataframe which lists the unique station names used each month
Feb_22 <- trips_cleaned %>%
filter(year==2022 & month==2) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Mar 2022 dataframe which lists the unique station names used each month
Mar_22 <- trips_cleaned %>%
filter(year==2022 & month==3) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Apr 2022 dataframe which lists the unique station names used each month
Apr_22 <- trips_cleaned %>%
filter(year==2022 & month==4) %>%
group_by(start_station_name) %>%
count(start_station_name)
#May 2022 dataframe which lists the unique station names used each month
May_22 <- trips_cleaned %>%
filter(year==2022 & month==5) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Jun 2022 dataframe which lists the unique station names used each month
Jun_22 <- trips_cleaned %>%
filter(year==2022 & month==6) %>%
group_by(start_station_name) %>%
count(start_station_name)
#Jul 2022 dataframe which lists the unique station names used each month
Jul_22 <- trips_cleaned %>%
filter(year==2022 & month==7) %>%
group_by(start_station_name) %>%
count(start_station_name)
Each unique station name can be tested against the monthly filter data frames to assess which unique station name was used in a particular month.
#Checking whether station is being used in particular month
unique_station_names$Aug_21 <- as.integer(unique_station_names$start_station_name
%in% Aug_21$start_station_name)
unique_station_names$Sep_21 <- as.integer(unique_station_names$start_station_name
%in% Sep_21$start_station_name)
unique_station_names$Oct_21 <- as.integer(unique_station_names$start_station_name
%in% Oct_21$start_station_name)
unique_station_names$Nov_21 <- as.integer(unique_station_names$start_station_name
%in% Nov_21$start_station_name)
unique_station_names$Dec_21 <- as.integer(unique_station_names$start_station_name
%in% Dec_21$start_station_name)
unique_station_names$Jan_22 <- as.integer(unique_station_names$start_station_name
%in% Jan_22$start_station_name)
unique_station_names$Feb_22 <- as.integer(unique_station_names$start_station_name
%in% Feb_22$start_station_name)
unique_station_names$Mar_22 <- as.integer(unique_station_names$start_station_name
%in% Mar_22$start_station_name)
unique_station_names$Apr_22 <- as.integer(unique_station_names$start_station_name
%in% Apr_22$start_station_name)
unique_station_names$May_22 <- as.integer(unique_station_names$start_station_name
%in% May_22$start_station_name)
unique_station_names$Jun_22 <- as.integer(unique_station_names$start_station_name
%in% Jun_22$start_station_name)
unique_station_names$Jul_22 <- as.integer(unique_station_names$start_station_name
%in% Jul_22$start_station_name)
#Adding column for count of occurences for all the months
unique_station_names$count <- rowSums(unique_station_names[,3:14])
We will check which stations were added or removed in the Cyclistic
POrtfolio in this particular time. For this purpose two months were used
in each filter in order to avoid any anomalies whereby a station was
simply not used for the month instead of the station being completely
removed or added to Cyclistic’s portfolio.
FOr analyzing stations added, we will find the count of occurence of
each station for the August 21 and September 21 less than 1. Similarly,
we will find it for June 22 and July 22 less than 1 for finding the
stations removed.
#Checking whether the station is recently added in the rider's portfolio
station_check_A <- unique_station_names %>%
filter(Aug_21<1 & Sep_21<1)
#Checking whether the station is recently deleted from the portfolio
station_check_B <- unique_station_names %>%
filter(Jun_22<1 & Jul_22<1)
This showed us that out of 1306 unique station names 546 stations were added and 39 stations were removed from the cyclistic Portfolio during the analysis period.
5.3 Checking for distinct member types
Let’s check the distinct type of members in the dataset. From the data given we know that there are two categories of riders : member & casual
unique(trips_cleaned[c("member_casual")])
Finally, we will save the cleaned dataset for further analysis.
#Save the cleaned dataset
write.csv(trips_cleaned, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/Cyclistic_Combined_Data.csv")
#Save additional useful datasets
write.csv(all_trips, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/all_trips.csv")
write.csv(ride_type_added, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/ride_type_check.csv")
write.csv(unique_station_names, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/station_name_check.csv")