library(tidyr)
library(dplyr)
library(ggplot2)

Code that was used to merge the data sets:

getwd()
## [1] "C:/Users/david/Desktop"
setwd('I:\\Data Analytics\\archive')
getwd()
## [1] "I:/Data Analytics/archive"
stock_data <- read.csv('stock prices.csv')
industry_data <- read.csv('Industry.csv')

stock_data_complete <- stock_data %>%
  left_join(industry_data, by = c("symbol" = "Symbol")) %>%
  drop_na()   # removes any rows where the join didn't find a match

write.csv(stock_data_complete, "stock_data_complete.csv", row.names = FALSE)

Isolate Data on the Information Technology Sector

It was decided to isolate the records concerning the Information Technology sector to limit the number of Companies and sub-sectors included in the different analyses. The goal is to make the graphs easier to read. The date column was also adjusted so that ggplot2 would recognize its information as dates rather than text.

tech_data <- stock_data_complete %>%
  filter(GICS.Sector == "Information Technology") %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

tech_data <- select(tech_data, -"Headquarters.Location", -"Date.added", -"CIK", -"Founded")

write.csv(tech_data, "tech_data.csv", row.names = FALSE)

Overall Question

Going into this analysis, the question I want to research is whether there are any variables present in the data set that could be used to predict the future directional change of a stock’s value. This question will be refined and adjusted as I explore the dataset.

First Look

To start off I will first take a look at the top 5 Information Technology companies to get a visual comparison of how their closing prices compare over time. To do this I will first need to figure out what the top five companies are and create a small dataframe that contains just their information. I will be using a company’s average daily trading volumn as the metric for comparing the companies against each other when determine the top five companies. Finally I will use ggplot to generate a line graph so we can visualy the closing prices for the stocks over the span of the data set.

  1. Select the top 5 Technology Stocks by average volume to create a sub data set.
# Obtain a list of the symbols for the top 5 information technology stocks
top_symbols <- tech_data %>%
  group_by(symbol) %>%
  summarise(avg_volume = mean(volume, na.rm = TRUE)) %>%
  arrange(desc(avg_volume)) %>%
  slice(1:5) %>%
  pull(symbol)

# Create smaller data set by filtering larger data set by the created list of top 5 information technology stock symbols
tech_top <- tech_data %>%
  filter(symbol %in% top_symbols)
  1. Create a multi-line graph to display the closing values of the top 5 tech stocks over the span of the data set.
ggplot(tech_top, aes(x = date, y = close, color = Security, group = Security)) +
     geom_line() +
     theme_minimal()+
     labs(title = "Top 5 Tech Stocks by Avg Volume",
          x = "Date", y = "Closing Price")

This graph shows the time series data for the daily closing prices of the top five performing stocks in the Information Technology sector. The results were limited to the top 5 stocks in this sector due to the Information Technology sector comprising 40 different stocks, which was done for readability purposes. The top technology stocks were determined by comparing their average daily trading volume with that of the stocks in the sector. According to the graph, Apple Inc. was the best-performing stock, with an overall upward trend that significantly outperformed the other four top stocks. Microsoft is next in line.

Average Daily Trading Volume for All Information Technoloy Stocks

Previously, we extracted the top 5 Information Technology stocks to compare against each other. This time, we will use a bar chart to plot the values of every stock in the Information Technology sector to see if any information can be derived. To do this I will be finding the mean of the daily trading volumes for each stock.

  1. Calculate the average volume of the technology stocks.
avg_volume_data <- tech_data %>%
  group_by(symbol, Security) %>%
  summarise(avg_volume = mean(volume, na.rm = TRUE), .groups = "drop")
  1. Create a bar chart that shows the average trading volume of each company in the Information Technology sector.
ggplot(avg_volume_data, aes(x = Security, y = avg_volume)) +
  geom_bar(stat = "identity", fill = "#97B3C6", color="white") +
  theme_minimal()+
  labs(title = "Average Trading Volume by Tech Company",
       x = "Company",
       y = "Average Volume") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph displays the names of companies in the Information Technology sector along the X-axis and their average trading volume on the Y-axis. Examining this graph alongside our first graph, we can see similarities between the top 5 performing Information Technology stocks. The top 5 performing Information Technology stocks are the same stocks shown on this graph, which have the top 5 average daily trading volumes; however, their ranking order is slightly different. Returning to my original question, these first two graphs lead me to wonder if daily trading volume has a significant impact on a stock’s performance.

Testing to See if a Corrolation Exists Between the Opening and Closing of Information Technology Stocks

We will test to see if there is a correlation between the daily number of trades and the daily price shift of a stock in the Information Technology sector.

  1. Add a column to our data set for daily price shifts.
tech_data <- tech_data %>%
  mutate(price_shift = close - open)
  1. Do a correlation test testing volume and price_shift against each other.
cor_test <- cor.test(tech_data$volume, tech_data$price_shift)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  tech_data$volume and tech_data$price_shift
## t = -4.0936, df = 39374, p-value = 4.256e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03049660 -0.01075057
## sample estimates:
##        cor 
## -0.0206256
  1. Check the number of rows that we are analyzing.
num_row <- nrow(tech_data)
num_row
## [1] 39376

The goal was to check if there was a correlation between daily trading volume and daily changes in stock prices. The number of observations in the data set was checked because a small correlation could have a larger impact when the data set is larger. Here is what we learned from the results:

  1. Our confidence interval of 95% would give us an alpha of 0.05, and since our resulting p-value of 4.256e-05 is smaller than 0.05, we will reject the null hypothesis that there is no correlation between the daily trading volumes and daily stock price changes.

  2. Even though a correlation exists, our corr value of -0.0206256 tells us that it’s basically an insignificant correlation that, if it were ever to show up, would manifest as increased trading volume resulting in slight downward daily stock price shifts and vice versa.

Distrubution of Daily Closing Prices

Now we will take a look at the individual closing prices of all Information Technology stocks to see if they form a recognisable distrubution. To do this I will use a histogram where the X-axis contains bins for different closing values.

ggplot(data = tech_data, aes(close))+
  geom_histogram(binwidth = 10, fill="#97B3C6", color="white")+
  theme_minimal()+
  scale_x_continuous(breaks = seq(0,max(tech_data$close, na.rm=TRUE), by=20))+
  labs(x = "Closing Prices",
    y = "Density",
    title = "Distrubution of Closing Prices"
  )

The resulting distribution of closing prices forms a left-skewed curve. Across the 40 stocks in the Information Technology sector, the majority of the 39,379 observed closing prices fall between $20 and $100, with decreasing frequency extending out to around $260.

Several factors could explain this distribution. Lower closing prices may reflect subpar stock performance at the time of observation, or they could result from stock splits intended to maintain accessible prices. An interesting follow-up question is whether the higher-priced observations represent a consistent group of stocks or if occasional outliers drive them. If the higher-priced stocks are consistent, comparing them with the top five stocks by trading volume identified in Question 1 could provide additional insight into sector dynamics.

Checking to See if a Relationship Exists Between the Top Sub Industries

  1. First we will check to see how many Sub Industries exist in the Information Technology sector to see if its a viable point to sub divide the data.
num_unique <- length(unique(tech_data$GICS.Sub.Industry))
num_unique
## [1] 11

We have 11 Information Technology Sub-Sectors to work with, so we will perform our analysis on the top 2 Sub-Sectors using the same ranking metric of daily trading volume and a comparative metric of closing price.

  1. Following the same methodology as the first section, we will figure out what the top 2 performing Sub Industries in the Information Technology sector are by average trading volume.
top_two_groups <- tech_data %>%
  group_by(GICS.Sub.Industry) %>%
  summarise(avg_volume = mean(volume, na.rm = TRUE)) %>%
  arrange(desc(avg_volume)) %>%
  slice(1:2) %>%
  pull(GICS.Sub.Industry)
  1. Create a smaller data set containing just the information on the top 2 performing Information Technology Sub Industries by filtering the larger Information Technology data set.
two_groups_data <- tech_data %>%
  filter(GICS.Sub.Industry %in% top_two_groups)
  1. Since the data was found to be left-skewed when plotting the histogram section 4, I cannot use any statistical tests that assume a normal distribution of data. Therefore, I will use a Mann-Whitney U test and stick with closing prices as our metric.
wilcox.test(close~GICS.Sub.Industry, data=two_groups_data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  close by GICS.Sub.Industry
## W = 2929772, p-value = 2.757e-11
## alternative hypothesis: true location shift is not equal to 0

The resulting p-value is really small, much smaller than the default 95% confidence coefficient (alpha: 0.05) allows. Since the Mann-Whitney U tests for differences in median values, we must conclude that a difference was found. Based on the small p-value, we must reject the null hypothesis that states there is no difference in the median closing prices of the stocks in these two sub-sectors of the Information Technology sector. Knowing this tells us that there is a difference between the median closing prices, but not what the two industries were or what their median closing prices were.

  1. Let’s determine what the top two sub-industries of the Information Technology sector were and their median closing prices.
two_groups_data %>%
  group_by(GICS.Sub.Industry) %>%
  summarise(median_close = median(close))
## # A tibble: 2 × 2
##   GICS.Sub.Industry                          median_close
##   <chr>                                             <dbl>
## 1 Systems Software                                   51.3
## 2 Technology Hardware, Storage & Peripherals         43.5