Chapter 2 Data sources

The primary focus of our project is to analyze the sentiment of the US stock market. This sentiment largely drives the prices of the stocks we analyze throughout the project. Our main data sources are Yahoo Finance and Bloomberg Finance, and we use their data in different forms. In some places we downloaded the processed comma-separated (CSV) files that Yahoo Finance provides for historical data, while in others we used R packages to scrape data directly from these web pages. The GitHub repository page for this chapter contains the code used.

Primary Data Sources:

  1. https://www.bloomberg.com/markets/stocks
  2. https://finance.yahoo.com/

The CSV files we are using across this analysis are stored in the data folder of our repository.

2.1 R Packages for Scraping Finance Data

2.1.1 The getSymbols() Function

getSymbols() is a function from the quantmod package (available on CRAN) that acts as a wrapper for loading data from various sources, local or remote. The src methods currently available for getSymbols() are: yahoo, google, MySQL, FRED, csv, RData, oanda, and av. Here we use the Yahoo source to capture Apple's stock prices.
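A minimal sketch of such a call is given here. The helper name fetch_yahoo_prices and the date range are assumptions chosen for illustration (the range is picked to cover the dates in Table 2.1); only getSymbols() and its src, from, to, and auto.assign arguments come from quantmod.

```r
# Hypothetical helper (not part of quantmod): download daily prices for a
# single ticker from Yahoo Finance and return them as an xts object.
fetch_yahoo_prices <- function(symbol, from, to) {
  quantmod::getSymbols(
    symbol,
    src = "yahoo",
    from = from,
    to = to,
    auto.assign = FALSE  # return the data instead of creating a variable named after the ticker
  )
}

# The download needs network access, so it is only run in an interactive session.
if (interactive()) {
  # Assumed date range, chosen to cover the rows shown in Table 2.1
  aapl <- fetch_yahoo_prices("AAPL", from = "2021-03-01", to = "2021-03-09")
  print(head(aapl))
}
```

Setting auto.assign = FALSE is a design choice: it keeps the result in a variable we name ourselves, which is easier to pass on to later steps than a variable created as a side effect.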

Below is a sample of the data we get directly using the getSymbols() function for Apple.

Table 2.1: Data from getSymbols
date AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2021-03-01 123.75 127.93 122.79 127.79 116307900 127.1968
2021-03-02 128.41 128.72 125.01 125.12 102260900 124.5392
2021-03-03 124.81 125.71 121.84 122.06 112966300 121.4934
2021-03-04 121.75 123.60 118.62 120.13 178155000 119.5724
2021-03-05 120.98 121.94 117.57 121.42 153766600 120.8564
2021-03-08 120.93 121.00 116.21 116.36 154376600 115.8199

2.1.2 The BatchGetSymbols() Function

Another way to capture data for multiple stocks at once is the BatchGetSymbols() function, available in the CRAN package of the same name (BatchGetSymbols). It stores a local cache of the data downloaded for the ticker symbols we pass to the function. This data can then be used for analysis locally in the same session. Once the session is reset, we need to run the code again to rebuild the cache.
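The call can be sketched as follows. The wrapper name fetch_batch and the 60-day window are assumptions made for illustration; the tickers are the ones listed in Table 2.2, and some of them may no longer resolve on Yahoo Finance.

```r
# Hypothetical wrapper (not part of the package): download several tickers
# in one call. The result is a list with two data frames:
#   df.control - one row per ticker describing the download outcome
#   df.tickers - the price data for all tickers in long format
fetch_batch <- function(tickers, first_date, last_date) {
  BatchGetSymbols::BatchGetSymbols(
    tickers = tickers,
    first.date = first_date,
    last.date = last_date,
    do.cache = TRUE  # reuse data that was already downloaded and cached
  )
}

# The download needs network access, so it is only run in an interactive session.
if (interactive()) {
  # Assumed window: the last 60 days up to today
  out <- fetch_batch(c("FB", "MMM", "PETR4.SA"),
                     first_date = Sys.Date() - 60,
                     last_date = Sys.Date())
  print(out$df.control)
  print(head(out$df.tickers))
}
```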

After downloading the data, we can check the success of the process for each ticker. Note that the last ticker we requested does not exist on Yahoo Finance and therefore results in an error. All information regarding the download process is provided in the data frame df.control:

Table 2.2: Combined Data from the BatchGetSymbols() Function
ticker src download.status total.obs perc.benchmark.dates threshold.decision
FB yahoo OK 41 1.0000000 KEEP
MMM yahoo OK 41 1.0000000 KEEP
PETR4.SA yahoo OK 40 0.9512195 KEEP

Moreover, this data can now easily be plotted and manipulated.
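As one possible illustration, the sketch below plots closing prices per ticker with ggplot2. The function name plot_close_prices is hypothetical, and the column names ref.date, price.close, and ticker are assumed to match the df.tickers data frame returned by BatchGetSymbols().

```r
# Hypothetical helper: line chart of closing prices, one panel per ticker.
# Expects a data frame with columns ref.date, price.close, and ticker
# (assumed layout of the df.tickers element returned by BatchGetSymbols()).
plot_close_prices <- function(df_tickers) {
  ggplot2::ggplot(df_tickers,
                  ggplot2::aes(x = ref.date, y = price.close)) +
    ggplot2::geom_line() +
    # Free y-axes so tickers with very different price levels stay readable
    ggplot2::facet_wrap(~ticker, scales = "free_y") +
    ggplot2::labs(x = "Date", y = "Closing price")
}
```

Given the list returned by a BatchGetSymbols() call, stored for example in a variable out, the chart would be produced with plot_close_prices(out$df.tickers).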

Although this is a neat way to capture data for more than one ticker in a single function call, the cache makes it difficult to use across multiple sessions. For this reason, we will not use this method often. It is still handy when one-time stock comparisons are needed (as shown above).

2.2 Structure of Input Data

We now look at the structure of the data we are going to analyze. Here, we use a CSV file downloaded from Yahoo Finance that contains the same fields as the data fetched by the getSymbols() function. The downloaded CSV file can be found in the docs section.

Table 2.3: Data from CSV download
Date Open High Low Close Adj Close Volume
2021-09-07 15375.98 15403.44 15343.28 15374.33 15374.33 3967040000
2021-09-08 15360.35 15360.35 15206.61 15286.64 15286.64 4113530000
2021-09-09 15296.06 15352.38 15245.17 15248.25 15248.25 3997250000
2021-09-10 15332.92 15349.47 15111.31 15115.49 15115.49 4567980000
2021-09-13 15211.43 15215.44 15030.85 15105.58 15105.58 4701190000
2021-09-14 15168.45 15181.19 15008.30 15037.76 15037.76 4571950000

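We observe that both data sets share the same structure: a date plus open, high, low, close, adjusted close, and volume fields, with only the column names and column order differing. We will therefore use them interchangeably in different scenarios. No further structural changes to these data sources are needed at this point.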
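To make the two sources interchangeable in code, one option is to bring the getSymbols() output into the CSV column layout. The sketch below is a minimal illustration under stated assumptions: the function names read_yahoo_csv and as_csv_layout are hypothetical, and the column names are taken from Tables 2.1 and 2.3.

```r
# Hypothetical helper: read a Yahoo Finance historical-data CSV with the
# columns shown in Table 2.3 and parse the Date column.
read_yahoo_csv <- function(path) {
  # check.names = FALSE keeps the "Adj Close" header exactly as downloaded
  df <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
  df$Date <- as.Date(df$Date)
  df
}

# Hypothetical helper: convert an xts object returned by getSymbols()
# (columns such as AAPL.Open, ..., AAPL.Adjusted) to the CSV column layout.
as_csv_layout <- function(x) {
  df <- data.frame(Date = as.Date(zoo::index(x)),
                   zoo::coredata(x),
                   check.names = FALSE,
                   row.names = NULL)
  # Drop the ticker prefix: "AAPL.Open" becomes "Open"
  names(df) <- c("Date", sub("^.*\\.", "", colnames(x)))
  # getSymbols() calls the adjusted close "Adjusted"; the CSV uses "Adj Close"
  names(df)[names(df) == "Adjusted"] <- "Adj Close"
  # Reorder to match the CSV header
  df[, c("Date", "Open", "High", "Low", "Close", "Adj Close", "Volume")]
}
```

After this step, a data frame built from getSymbols() and one read from the CSV have identical column names, so the same downstream code can accept either.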