While doing some BI work for a local client, I whipped up a Pareto chart of her YTD billings by client to show her how top-heavy she is, relying on too few clients for too high a percentage of her sales. Pretty standard thing to do with any data set to get a sense of what you’re looking at. I’ve also been on a baseball stats kick lately for personal exploration, especially after completing the fantastic edX class Sabermetrics 101: Introduction to Baseball Analytics. It has also been a great way to brush up on SQL and R. Baseball is perfect because there’s so much free public data, and still so much unexplored territory.
For some reason I had never considered looking at baseball counting stats and whether the leaderboards were top-heavy enough to conform to the Pareto Principle.
I put together HR first of course, because everyone digs the long ball. Nope, not even close. Duplicated the sheet, started editing the four pills and filters and sorting rules to chart triples, and then realized that’s tedious and inefficient. I needed a selection parameter! Took a few minutes to figure out how to get sums and sorts and filters to all be controlled by one parameter collecting a String, when the values we’re working with are Numbers, but I got it working.
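The idea behind the selection parameter can be sketched outside Tableau. This is a rough Python illustration, not the Tableau calculation itself: a single string choice decides which numeric column drives the filter and the sort. The names `STAT_COLUMNS` and `ranked_by`, the sample rows, and the choice to drop zero values are all assumptions made for the example.

```python
# Hypothetical sketch: one string "parameter" selects the numeric measure.
STAT_COLUMNS = {"Home Runs": "HR", "Triples": "3B", "Walks": "BB"}


def ranked_by(rows, stat_label):
    """Return (player, value) pairs for the chosen stat, largest first.

    Players with a zero in the chosen stat are dropped (an assumption).
    """
    column = STAT_COLUMNS[stat_label]  # string label -> numeric column
    pairs = [(row["player"], row[column]) for row in rows if row[column] > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up rows, for illustration only.
rows = [
    {"player": "A", "HR": 10, "3B": 0, "BB": 40},
    {"player": "B", "HR": 25, "3B": 3, "BB": 55},
    {"player": "C", "HR": 0, "3B": 7, "BB": 20},
]

print(ranked_by(rows, "Triples"))  # [('C', 7), ('B', 3)]
```

Switching the label to "Home Runs" or "Walks" reuses the same filter and sort, which is the whole point of driving everything from one parameter.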
The short answer: the 80/20 distribution doesn’t hold for 2015 so far. The top quintile doesn’t get over 54% in any stat.
Hmmmm. Small sample size? Now I had to look at all-time numbers. Fired up MAMP and Sequel Pro to get into the Lahman Database, and four lines of code later I had the batting stats for every player in MLB history through the 2014 season. I chose to grab all of them since I was going this far. Back to Tableau, and wouldn’t you know: some nearly perfect Pareto distributions.
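The original four lines of SQL aren’t reproduced here, so what follows is only a guess at the shape of that kind of query: roll the per-season rows of Lahman’s `Batting` table up into career totals per player. It runs against an in-memory SQLite table with made-up rows rather than the real MySQL database, and singles are derived as hits minus extra-base hits.

```python
import sqlite3

# Toy stand-in for the Lahman Batting table (a few columns, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Batting (playerID TEXT, yearID INTEGER, H INTEGER, '
    '"2B" INTEGER, "3B" INTEGER, HR INTEGER, BB INTEGER)'
)
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("toy01", 1901, 100, 20, 5, 10, 30),
        ("toy01", 1902, 120, 25, 3, 12, 40),
        ("toy02", 1901, 2, 0, 0, 1, 0),
    ],
)

# Career totals per player; singles = H - 2B - 3B - HR.
career = conn.execute(
    """
    SELECT playerID,
           SUM(H - "2B" - "3B" - HR) AS singles,
           SUM(HR) AS HR,
           SUM(BB) AS BB
    FROM Batting
    GROUP BY playerID
    ORDER BY playerID
    """
).fetchall()

print(career)  # [('toy01', 145, 22, 70), ('toy02', 1, 1, 0)]
```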
Look at the singles: the top 20.00% of players account for 80.42% of all singles ever hit. Walks are extremely close as well, and home runs are at a respectable 78.2%.
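For anyone who wants to check a number like this without Tableau, here is a minimal sketch of the top-quintile calculation. The function name `top_share` and the sample totals are hypothetical, and rounding the cutoff to a whole number of players is a choice made for the example.

```python
def top_share(totals, fraction=0.2):
    """Share of the grand total held by the top `fraction` of players."""
    ranked = sorted(totals, reverse=True)
    # Number of players in the top group, at least one.
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Made-up career totals for ten players, for illustration only.
sample = [60, 20, 5, 5, 4, 3, 1, 1, 1, 0]

print(top_share(sample))  # 0.8, i.e. the top 2 of 10 players hold 80 of 100
```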
What does this all mean? Nothing too surprising. Longevity is everything, and longevity comes from quality play (and of course staying healthy). But there are 4,970 players with 1-10 career home runs. Filtering out the pitchers, you’re still left with plenty of weak-hitting position players who washed out of the league. The guys who stick around will dominate the totals, and the elite will account for the highest percentage. The top 1,613 of all 7,529 players (21.42%) to ever have hit an MLB home run account for 80% of total home runs.
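The "how many players does it take to reach 80%" question is the same calculation turned around: walk down the ranked list until the running total crosses the target. A small sketch, with the hypothetical name `players_needed` and made-up totals; the comparison uses integers to avoid floating-point edge cases.

```python
def players_needed(totals, target_pct=80):
    """Smallest number of top players whose combined total reaches target_pct percent."""
    ranked = sorted(totals, reverse=True)
    grand_total = sum(ranked)
    running = 0
    for count, value in enumerate(ranked, start=1):
        running += value
        # Integer comparison: running / grand_total >= target_pct / 100
        if running * 100 >= target_pct * grand_total:
            return count
    return len(ranked)


# Made-up career home run totals, for illustration only.
sample_hr = [60, 20, 5, 5, 4, 3, 1, 1, 1]

needed = players_needed(sample_hr)
print(needed, len(sample_hr))  # 2 9
```

Dividing the first number by the second gives the player percentage, the same way 1,613 out of 7,529 works out to 21.42%.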
Then there’s Frank O’Connor. Played in three games for the 1893 Philadelphia Phillies. Two AB: a single and a home run, with three RBI. Also pitched four innings and gave up five runs. That’s a story I’d like to hear.
A fun little project that took care of some curiosity, and I created a new-to-me parameter process. Which I should now document before I forget…