Scraping Yahoo Finance for Value Investing Metrics

The Yahoo Finance website is one of the best sources of financial information on the internet, and it is a popular place for programmers to gather stock market statistics. In particular, the statistics page contains important financial ratios like the current ratio and debt to equity. These ratios are used especially by value investors to find companies that are trading for less than their intrinsic or book value.

In this article I will show you how to scrape the information from the web using Python, and then compile and clean the data in Excel. We are going to gather information from every stock in the S&P 500 for the purposes of this article, but this could easily be adapted for any list of stocks. Code for the webscraper is embedded at the end of the article, and I intend to share all code and formulas to make it as easy as possible to reproduce.

Part 1: Build a Webscraper to Gather Stock Market Data

The first part of this project is to gather the data using Python. There are a number of Python packages to download before we get started. They can be installed using pip (for me it was necessary to use pip3) from the command line, as shown after the list below:

· pandas

· numpy

· beautifulsoup4 (imported as bs4)

· requests

· lxml

· time (part of the Python standard library, so it does not need to be installed with pip)
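For example, the install command looks like this:

pip3 install pandas numpy beautifulsoup4 requests lxml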

To begin, we need to create a file called stats.py. At the top of the file import the required packages.

Some of the packages are imported under shortened names such as pd and np. This simply saves time when typing them again later in our code.
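The full import block is in the code at the end of the article; a minimal version looks like this:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
# lxml does not need its own import; BeautifulSoup uses it as the html parser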

We have now added a few lists. table_list holds the rows pulled from the html page, and company_name holds the stock ticker that identifies each row of data. The num_list values represent the number of items in each of the tables we want to gather info from in the html of the Yahoo statistics page. The ticker_list holds our list of stocks; it can be much longer (I have shortened it to display nicely in this tutorial), but I have found that the program may crash if the list is over roughly 200 stocks, so it is best to run it in sets of around 100-150. One important thing to keep in mind: in the url, the stock ticker needs to use a dash anywhere the ticker contains a period, so it is BRK-B instead of BRK.B, for instance. A rough sketch of these lists is below.
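The exact definitions are in the embedded code at the end; the embedded version uses several num_list variables, which I collapse here into a single placeholder list, and the counts and tickers shown are only examples:

table_list = []      # holds the rows pulled from the html tables
company_name = []    # holds the matching ticker for every scraped row
num_list = [9, 7, 10, 8, 8, 2, 12, 10, 6]   # placeholder counts, one per table
ticker_list = ['MMM', 'AOS', 'ABT', 'BRK-B', 'GOOG']   # note BRK-B, not BRK.B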

We then create the dataframe and the csv file called yahoo.csv, which we will add the scraped data to. Next is our first for loop, used to build the url for each stock and find each of the tables. The page object gets the html from the webpage, soup allows this to be read by BeautifulSoup, and source_code contains all of our tables.
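In outline it looks like this (the url pattern shown is the usual Yahoo key-statistics address; the exact code is in the embedded version at the end):

df = pd.DataFrame()
df.to_csv('yahoo.csv')           # create the output file we will append to

for ticker in ticker_list:
    url = 'https://finance.yahoo.com/quote/' + ticker + '/key-statistics?p=' + ticker
    page = requests.get(url)                  # raw html of the statistics page
    soup = BeautifulSoup(page.text, 'lxml')   # make the html readable by BeautifulSoup
    source_code = soup.find_all('table')      # all of the statistics tables on the page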

Next are our for loops, which take the data from the tables on the Yahoo Finance statistics webpage and add it to our csv file. These loops also record the stock ticker for each item of data, and we will combine the two resulting csv files later. We have nine for loops because there are nine tables we want to collect data from. Each for loop uses the num_list items we created earlier to know how many items in that table to add to our yahoo.csv file, and how many copies of the ticker to add to the separate list called company_name, so we know which company each row of data is about. One of these loops is sketched below.
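The embedded code repeats the same pattern nine times, once per table; a sketch of one such loop (the table index and row tags here are assumptions):

    # first statistics table: collect each row and remember which ticker it came from
    for row in source_code[0].find_all('tr')[:num_list[0]]:
        table_list.append(row.get_text(separator=' '))
        company_name.append(ticker)

    # append this ticker's rows to yahoo.csv
    pd.DataFrame(table_list).to_csv('yahoo.csv', mode='a', header=False)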

The last item in our for loop is used to print the ticker name, so we can see the progress of the program and know where to look if an error occurs. The errors I have run into include running the program too quickly, which results in Yahoo blocking the requests for a few seconds; this is the reason for the time.sleep of 1 second at the end. Another issue occurs if I run too many tickers at once, and the program crashes. The program will also crash if it is run for a stock ticker that is no longer in use. At this point we also want to empty table_list by setting table_list = [].
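The end of each pass through the outer loop therefore looks like this:

    print(ticker)      # show progress and where to look if something fails
    table_list = []    # empty the list before moving on to the next ticker
    time.sleep(1)      # short pause so Yahoo does not block the requests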

Lastly, we want to create our second csv file, called yahootickers.csv, containing company_name, our list of stock tickers, so we can tell which company each row of data is about.
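This is a single line once the loop has finished, roughly:

pd.DataFrame(company_name).to_csv('yahootickers.csv')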

This leaves us with two columns in yahoo.csv (make sure you have both columns) and one column of tickers in yahootickers.csv.

Part 2: Creating a Usable Data Table in Excel from our Scraped Output

We next move on to sorting through our data to create a more usable file in Excel. Combining the columns of data that resulted from our Python program yields the result below.

Now that we have our data in Excel, concatenate the two columns above (columns B and C in this image) in an inserted column at column B. Copy and paste values to convert it from a formula to a value, and delete the columns you concatenated, keeping the column you created at B.
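If, after inserting the new column, the ticker sits in column C and the scraped text in column D (adjust the references to your own layout), the helper formula in B2 is simply something along these lines:

=C2&D2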

Then use Text to Columns: select Delimited, check “Other” and enter a quote mark as the delimiter, then Next, then Finish.

You then will be left with something looking like this:

In column B above we still have the Float and the EBITDA, and we need to move that data into the correct columns. To do this we need to filter for those items by unchecking 0 through 10, as in the image below. I have also named the columns “Number”, “Item”, and “Value”.

Once we have filtered the data, we need to add formulas to the Item and Value columns to move this data into columns C and D. For me the filtered rows start at row 21, so that is what my formulas reference.

For the “item” in column C above: =MID(B21,3,FIND(" ",B21,3)-3)

For the “value” in column D above: =RIGHT(B21,LEN(B21)-FIND(" ",B21,3))

Then we need to remove the filter, copy and paste values for the entire sheet, and delete the “number” column.

In the last column to the right there will be some dates. Filter to exclude “blanks”, and then, for those rows, set the value column equal to column D so the dates become the values. Then unfilter, copy and paste values for the entire sheet, and delete column D, which contains the dates.

Now, on a new sheet, at B1 copy and paste-transpose each of the items in the “item” column. Then at A2 add the ticker for each company. This is easily done by copying over all of the values in the “company ticker” column and using the Remove Duplicates function on the Data tab of Excel.

This results in the first column and first row of our new sheet being filled in with the company tickers and the data categories, respectively.

We then add this formula:

=INDEX(yahoo!$C$2:$C$29737,MATCH(1,(yahoo!$A$2:$A$29737=Sheet1!$A2)*(yahoo!$B$2:$B$29737=Sheet1!B$1),0))

The sheet names above are “yahoo” and “Sheet1”, but it may be necessary to adjust these based on the names of your own sheets. Depending on your version of Excel, this may also need to be entered as an array formula with Ctrl+Shift+Enter. We then need to copy this formula all the way across and all the way down, and then copy and paste values. You are then left with all of the data from the Yahoo Finance statistics website.

You can then use VLOOKUP to add the company name to the Excel sheet by matching the ticker, for example as shown below.
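Assuming the Wikipedia list is pasted on a sheet named SP500 with tickers in column A and company names in column B (both assumptions; adjust to your own layout), the lookup would look something like:

=VLOOKUP($A2,SP500!$A:$B,2,FALSE)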

The list of companies I used can be found here: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies

This concludes the tutorial, as our data is now in a much more usable form. It could be used to create stock screens, gather information about specific industries, and serve as a useful tool for value investors.

The Python code for Part 1 is below: