Gentleman or lady in the streets, freak in the spreadsheets
TLDR;
Spreadsheets are versatile tools good for basic data analysis and reporting, SQL is absolutely necessary, visualization tools are clunky and kind of awful, and Python is awesome
Series notes:
Data science and data analysis involve using multiple tools for data analysis, transformation, manipulation, and presentation.
In this series I will cover the various categories of tools and analyze several examples from each category.
I firmly believe that the tools we use shape the way we think about data, and that having a broad understanding of the various tools of the trade can enhance the way we approach data problems.
Introduction:
Spreadsheets are the Ur tool of data analysis. They are inevitably the first tool a data jockey will use, and you will use them continuously throughout your career. If you’re in accounting or finance, they will be your primary tool. Spreadsheets were revolutionary to the business and scientific world. For a deeper dive into their impact, check out the podcast 50 Things That Made the Modern Economy’s episode about them. They form the bedrock of many systems, and I have encountered numerous multi-million-dollar business unit’s run out of spreadsheets.
Advantages
Spreadsheets are incredibly easy to use and very flexible. Anyone who is computer literate can fire up a spreadsheet and do basic mathematical operations with almost no training or guidance.
They have many advanced functions and are the most versatile tool used in analytics. For lightweight applications every step of the data process can be done in a single spreadsheet.
Part of what makes them so easy to use is their flexibility. The basic structure of a spreadsheet is a cell. Any cell in the spreadsheet can easily reference another cell which makes for intuitive simple operation.
Spreadsheets are also capable of nested complex calculations and power users of spreadsheets are capable of dynamic complex analysis.
Disadvantages:
Flexibility is also one of the principal disadvantages of spreadsheets. Because they have no structure, people do incredibly strange things with them. This can lead to a lot of inconsistent entry and formatting making them difficult to use.
The image above shows the same data formatted completely differently in the same spreadsheet. The top table uses correct table formatting, the bottom one doesn’t. There are no constraints in excel, which allows you to make some ludicrously strange data formats.
Spending too much time working with spreadsheets can inhibit you as an analyst because they are capable of such flexible manipulations. This environment does not prepare you well for working with more rigid data structures.
Spreadsheets are also slow. The row capacity in Microsoft Excel is about 2 million rows, and processing anything above one-hundred-thousand rows will go at a glacial pace. By way of comparison, I once accidentally added 20 million rows to a Pandas Dataframe and it only took 3 minutes to process. A spreadsheet program would have crashed out under that kind of load.
As you become a spreadsheet power user, it can create heritage problems as you pass those systems to less advanced users to either be perpetuated or reviewed.
Microsoft Excel:
Excel is powerful and robust. It’s the category leader for a reason, and almost everyone has access to it. It can handle numerous file types, interfaces well with other products in the Microsoft office suite and can handle some ETL work with its feature Power Query (also called Extract and Transform).
You can rapidly deploy small scale solutions in excel. I once built a search engine in excel letting customer service reps look up parts in a parts catalog.
For all its limitations, I think Power Query is a great interface to teach you how to think about data manipulations in a way that is more intuitive to understand than in scripting languages that don’t have a GUI.
While I now almost exclusively use Pandas for this sort of work, I still use techniques I learned to do in Power Query. I’m not sure I’d have as strong a handle on the kinds of transformations that are possible without using PQ as a bridge.
Excel’s big weakness is it’s harder to share spreadsheets and work collaboratively with Excel. While they’re getting better, this capacity still lags google sheets. Google Sheets also has a much better API making it easier to push data from a scripting language to a google sheet for sharing and collaboration.
Must have skills:
Know and understand pivot tables and V-Lookups. A good tip in an interview is if someone asks you how to do a V-Lookup ask them if they know about an Index Match, and demonstrate that instead.
Google Sheets
Over the last year, I’ve noticed that Google Sheets has become a lot more ubiquitous with many people opting to use Google Sheets instead of Excel.
Google Sheets has much more robust collaborative capabilities than Microsoft does. The introduction of Google Data Studio has also transformed Google Sheets making it much more powerful.
Google sheets also has a robust app community. The picture below is of a sheet I set up to scrape twitter for a particular hashtag:
While I’ve heard good things about Google’s scripting language, I have never used it, and my general advice is it is better to spend time learning open source languages than proprietary ones.
Lastly the Google API is one of the best available. They have outstanding documentation so it’s actually quite easy to use more powerful tools such as Python and Pandas and then use the API to publish to a Google Sheet.
The primary disadvantage of Google Sheets is it isn’t as powerful on its own as Microsoft Excel is.
Must have skills:
Learn Google Data Studio and how to source data from the internet directly into Google Sheets.
Roundup:
Six months ago I was very anti-Google Sheets; and then l discovered the Google API.
The API gives it good interoperability with more powerful tools, and I am now prototyping a reporting system which will use Google Sheets as a data visualization tool to push reports out in. I’m hoping to use Python and SQL as the back-end and Google Sheets as the front end.
Conclusion:
Spreadsheets are the base tool of data analysis and you will use them throughout your entire career.
In my own workflow, if I get a relatively small data set and I just want to get an idea what it looks like I will often just fire it up in Excel, just to see what I’m dealing with.
Spreadsheets will also often end up being your presentation medium which you will share with your customer.
Final advice:
Use spreadsheets intelligently. If you’re finding that you’re regularly chaining excel functions together it’s time to learn a scripting language.
This article was originally published at CharlesMendelson.com
About the Author
Charles Mendelson is a marketing data analyst at PitchBook. His current tech stack includes SQL, Python, Microsoft Excel, Google Sheets, and Mode Analytics. He’s also getting a masters degree in psychology from the Harvard Extension School. If you’re looking for a speaker for a conference or seminar or if you’d like to ask any questions you can connect with him on LinkedIn.