Data Cleaning Process

1. Document Description

This document describes all the steps taken to clean and transform Cyclistic’s raw datasets to prepare the data for the analysis stage. For the purpose of this case study, the data has been collected for the months of August 2021 to July 2022. The dataset can be downloaded here.

Please note that Cyclistic is a fictional company. Raw data has been collected by Motivate International Inc. the company which operates the City of Chicago’s Divvy bicycle sharing service. The license to use this public dataset can be found here.

These were some of the packages used for the data cleaning process.

library(tidyverse)
library(lubridate)
library(stringr)
library(scales)

2. Combine Datasets

2.1 Load Raw Data

We will create separate dataframes for loading data from August 2021 to July 2022.

trip_202108 <- read_csv("202108-divvy-tripdata/202108.csv")
trip_202109 <- read_csv("202109-divvy-tripdata/202109.csv")
trip_202110 <- read_csv("202110-divvy-tripdata/202110.csv")
trip_202111 <- read_csv("202111-divvy-tripdata/202111.csv")
trip_202112 <- read_csv("202112-divvy-tripdata/202112.csv")
trip_202201 <- read_csv("202201-divvy-tripdata/202201.csv")
trip_202202 <- read_csv("202202-divvy-tripdata/202202.csv")
trip_202203 <- read_csv("202203-divvy-tripdata/202203.csv")
trip_202204 <- read_csv("202204-divvy-tripdata/202204.csv")
trip_202205 <- read_csv("202205-divvy-tripdata/202205.csv")
trip_202206 <- read_csv("202206-divvy-tripdata/202206.csv")
trip_202207 <- read_csv("202207-divvy-tripdata/202207.csv")

2.2 Check for data structure

The structure summary outputs will help us identify if any of the individual datasets have different string types, column names etc.

str(trip_202108)
str(trip_202109)
str(trip_202110)
str(trip_202111)
str(trip_202112)
str(trip_202201)
str(trip_202202)
str(trip_202203)
str(trip_202204)
str(trip_202205)
str(trip_202206)
str(trip_202207)

We found that all the column names were fine and the string types were adhering to the data.

2.3 Merge Datasets

We will merge all 12 datasets into 1 dataframe.

# Combine into a single DataFrame

all_trips <- bind_rows(trip_202108, trip_202109, trip_202110, trip_202111, trip_202112,
                       trip_202201, trip_202202, trip_202203, trip_202204, trip_202205,
                       trip_202206, trip_202207)

3. Prepare Dataset

3.1 Calculating Ride Length

Ride length will help us in further analysis and would also let us know if there are any rides less than 0.

# Calculating trip time in secs

all_trips <- mutate(all_trips, trip_time = ended_at - started_at)

# Converting trip time to numeric data type

all_trips$trip_time <- as.numeric(as.character(all_trips$trip_time))

3.2 Summarise by date variables

We will add separate columns for date variables (date, month, day of week, week, time of day etc.) which will help us in future analysis.

# Date

all_trips <- all_trips %>% 
  mutate(Date = as.Date(started_at))

# Day of week

all_trips <- all_trips %>% 
  mutate(day_of_week = wday(all_trips$started_at, label=TRUE))

# Month

all_trips <- all_trips %>% 
  mutate(month_name = format(started_at, "%m"))
         
# Year

all_trips <- all_trips %>% 
  mutate(year = year(ymd(Date)))
         
# Day

all_trips <- all_trips %>% 
  mutate(day= day(ymd(Date)))

# Time of Day

all_trips <- all_trips %>% 
  mutate(Time_of_Day = format(as.POSIXct(started_at), format = "%H:%M:%S"))

4. Clean the dataset

4.1 Remove rows with ride length < 0

As identified in section 3.1, there were some rows with ride length less than 0. These rides were those where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

# Creating a separate dataframe for trip time greater than 0

all_trips_v2 <- all_trips_v2 %>% 
  filter(trip_time>0) 

# Sorting dataset by starting time

all_trips_v2 <- arrange(all_trips_v2, started_at)

4.2 Remove rows with incomplete data

There were some instances where the station names were not recorded. We can check and remove these stations from our dataset.

# Remove blank or NA rows from start station name and end station name

## We will use this dataframe for further analysis

trips_cleaned <- all_trips %>% 
  filter(!is.na(start_station_name) | start_station_name=="") %>% 
  filter(!is.na(end_station_name) | end_station_name=="")

4.3 Checking for Inconsistencies

We will check for inconsistencies in the start and end station names and if all the stations are in title case.

#Check for inconsistencies in start station names

trips_cleaned %>%
  filter(
    str_detect(start_station_name, "[:upper:]")
    & !str_detect(start_station_name,"[:lower:]")
  ) %>%
  
  group_by(
    start_station_name
  ) %>%
  
  count(
    start_station_name
  )

#Check for inconsistencies in end station names

trips_cleaned %>%
  filter(
    str_detect(end_station_name, "[:upper:]")
    & !str_detect(end_station_name,"[:lower:]")
  ) %>%
  
  group_by(
    end_station_name
  ) %>%
  
  count(
    end_station_name
  )

4.4 Remove Inconsistencies

We found that there are some station names where the partial name is either upper or lower. We will change all the station names to title case.

#Convert upper and lower station names to title case

trips_cleaned$start_station_name <- str_to_title(trips_cleaned$start_station_name)
trips_cleaned$end_station_name <- str_to_title(trips_cleaned$end_station_name)

4.5 Checking for duplicates

The ride id is column is unique to each ride. We could use it to check for any duplicate rows.

#Checking for duplicate rows

duplicate_check <- trips_cleaned %>%
  count(ride_id) %>%
  filter(n > 1)

There were no duplicate rows observed in the dataset.

5. Understand Dataset

5.1 Check rideable type

#Checking for unique ride types available

unique(trips_cleaned[c("rideable_type")])

The above code returned three distinct ride types which are used in the cleaned dataset. It was done to identify whether a ride type was added to the dataset at a later date.

#Creating a separate dataframe to check if any ride type was added after the other

ride_type_added <- trips_cleaned %>% 
  group_by(rideable_type, year, month) %>% 
  count(rideable_type)

We found that all the ride types were added from the starting i.e Aug 2021 from the dataset.

5.2 Check station names

There might be some instances where some stations might be added or removed from Cyclistic’s Portfolio. This can be reviewed using below code:
Firstly a dataframe which lists unique station names should be created.

#Identify unique station names

unique_station_names <- trips_cleaned %>% 
  group_by(start_station_name) %>% 
  count(start_station_name)

Further the dataframes which lists the unique station names used each month should be created

#Aug 2021 dataframe which lists the unique station names used each month

Aug_21 <- trips_cleaned %>% 
  filter(year==2021 & month==8) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Sep 2021 dataframe which lists the unique station names used each month

Sep_21 <- trips_cleaned %>% 
  filter(year==2021 & month==9) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Oct 2021 dataframe which lists the unique station names used each month

Oct_21 <- trips_cleaned %>% 
  filter(year==2021 & month==10) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Nov 2021 dataframe which lists the unique station names used each month

Nov_21 <- trips_cleaned %>% 
  filter(year==2021 & month==11) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Dec 2021 dataframe which lists the unique station names used each month

Dec_21 <- trips_cleaned %>% 
  filter(year==2021 & month==12) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Jan 2022 dataframe which lists the unique station names used each month

Jan_22 <- trips_cleaned %>% 
  filter(year==2022 & month==1) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Feb 2022 dataframe which lists the unique station names used each month

Feb_22 <- trips_cleaned %>% 
  filter(year==2022 & month==2) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Mar 2022 dataframe which lists the unique station names used each month

Mar_22 <- trips_cleaned %>% 
  filter(year==2022 & month==3) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Apr 2022 dataframe which lists the unique station names used each month

Apr_22 <- trips_cleaned %>% 
  filter(year==2022 & month==4) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#May 2022 dataframe which lists the unique station names used each month

May_22 <- trips_cleaned %>% 
  filter(year==2022 & month==5) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 

#Jun 2022 dataframe which lists the unique station names used each month  

Jun_22 <- trips_cleaned %>% 
  filter(year==2022 & month==6) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name) 
  
#Jul 2022 dataframe which lists the unique station names used each month

Jul_22 <- trips_cleaned %>% 
  filter(year==2022 & month==7) %>% 
  group_by(start_station_name) %>% 
  count(start_station_name)

Each unique station name can be tested against the monthly filter data frames to assess which unique station name was used in a particular month.

#Checking whether station is being used in particular month

unique_station_names$Aug_21 <- as.integer(unique_station_names$start_station_name
                                          %in% Aug_21$start_station_name)

unique_station_names$Sep_21 <- as.integer(unique_station_names$start_station_name
                                          %in% Sep_21$start_station_name)

unique_station_names$Oct_21 <- as.integer(unique_station_names$start_station_name
                                          %in% Oct_21$start_station_name)

unique_station_names$Nov_21 <- as.integer(unique_station_names$start_station_name
                                          %in% Nov_21$start_station_name)

unique_station_names$Dec_21 <- as.integer(unique_station_names$start_station_name
                                          %in% Dec_21$start_station_name)

unique_station_names$Jan_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Jan_22$start_station_name)

unique_station_names$Feb_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Feb_22$start_station_name)

unique_station_names$Mar_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Mar_22$start_station_name)

unique_station_names$Apr_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Apr_22$start_station_name)

unique_station_names$May_22 <- as.integer(unique_station_names$start_station_name
                                          %in% May_22$start_station_name)

unique_station_names$Jun_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Jun_22$start_station_name)

unique_station_names$Jul_22 <- as.integer(unique_station_names$start_station_name
                                          %in% Jul_22$start_station_name)

#Adding column for count of occurences for all the months

unique_station_names$count <- rowSums(unique_station_names[,3:14])

We will check which stations were added or removed in the Cyclistic POrtfolio in this particular time. For this purpose two months were used in each filter in order to avoid any anomalies whereby a station was simply not used for the month instead of the station being completely removed or added to Cyclistic’s portfolio.
FOr analyzing stations added, we will find the count of occurence of each station for the August 21 and September 21 less than 1. Similarly, we will find it for June 22 and July 22 less than 1 for finding the stations removed.

#Checking whether the station is recently added in the rider's portfolio 

station_check_A <- unique_station_names %>% 
  filter(Aug_21<1 & Sep_21<1) 

#Checking whether the station is recently deleted from the portfolio

station_check_B <- unique_station_names %>% 
  filter(Jun_22<1 & Jul_22<1)

This showed us that out of 1306 unique station names 546 stations were added and 39 stations were removed from the cyclistic Portfolio during the analysis period.

5.3 Checking for distinct member types

Let’s check the distinct type of members in the dataset. From the data given we know that there are two categories of riders : member & casual

unique(trips_cleaned[c("member_casual")])

6. Save the Dataset

Finally, we will save the cleaned dataset for further analysis.

#Save the cleaned dataset

write.csv(trips_cleaned, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/Cyclistic_Combined_Data.csv")

#Save additional useful datasets

write.csv(all_trips, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/all_trips.csv")

write.csv(ride_type_added, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/ride_type_check.csv")

write.csv(unique_station_names, file = "C:/Users/jains/Downloads/Cyclistic Project/Trip_Data_082021-072022/station_name_check.csv")