## Research Update: Yea

Another unproductive week. Shut up.

## Dexter

I forgot that I had a blog. Back to watching Dexter and not doing any real work.

## Visualizing Goons

Okay, so it’s finally here. I spent an unnecessarily large portion of Thanksgiving break finishing up the HockeyFights.com scraper. It’s in my GitHub (hockeyfights.R). The scraper took so long because the HTML tables on HockeyFights.com are incredibly annoying to work with. In addition, the website blocks you if you scrape too often.

Anyway, the script spits out “final.table.” It has all the fights that occurred in the 2013/2014 season. The goon names are in the first two columns. The third column shows who won the fight, in a Gephi-friendly format. It was a bit tricky to write because first you have to determine who “won” the fight based on the awful HockeyFights.com tables, and then you have to put the result in a format Gephi accepts. What I mean by that is, if goon1 beats goon2, in order to show that as a directed edge in Gephi, you have to use a concatenated string like this: “goon1;goon2.” Make sense? So here are some entries:
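To make the edge format concrete, here is a rough Python sketch of that last formatting step. It is not the actual hockeyfights.R code: the function name `gephi_edge`, the fighter names, and the idea of passing the winner in as a separate field are all made up for illustration.

```python
def gephi_edge(goon1, goon2, winner):
    """Return a 'source;target' string pointing from the winner to the loser.

    Returns None when there is no clear winner (for example, a draw).
    """
    if winner == goon1:
        return f"{goon1};{goon2}"
    if winner == goon2:
        return f"{goon2};{goon1}"
    return None


# Made-up fights: (first fighter, second fighter, winner or None for a draw)
fights = [
    ("Goon A", "Goon B", "Goon A"),
    ("Goon C", "Goon A", "Goon A"),
    ("Goon B", "Goon C", None),
]

edges = [gephi_edge(*fight) for fight in fights]
for edge in edges:
    print(edge)
```

Putting the winner first means every edge points from the winner to the loser, which is what lets Gephi draw the fight as a directed edge.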

Once you have that, the rest is just importing the data and playing around with it with Gephi. Here is something I created:

All the pussy goons who got into just one or two fights are out on the “perimeter.” The real enforcers are in the middle, beautifully intertwined with one another. They are in a league of their own. Tom Sestito looks like the front runner in this year’s goon race. Anyway, I will create more visual graphs when I have the time. It’s amazing how much you can accomplish once you have some nice clean data.

One last note before I finish. I’ve noticed that you can’t view the images in full size with Tumblr. So I decided to upload all the images I’ve used for the blog to imgur. My imgur URL is: http://goonstats.imgur.com . Here, you can find all the old images from the blog and view them in full size. Anyway, cheers!

## Busy being Lazy

Nope, no post today. I didn’t even get to finish my HockeyFights scraper. And also, I’m pretty behind with my research. I’ve been busy being lazy. Sigh….

Happy Thanksgiving, and I’ll be back next week.

## Research Update: Poisson Distribution

I’ll continue from where I left off. I’ll be brief because I don’t have that big of a desire to write today.

The problem with the model I proposed last week is that it assumes normality when calculating defense, and simple arithmetic sums often don’t do the data justice. For example, say the expected number of tip-in shots in a game with adjusted defense is 1 (it’s usually around 1.5), but there wasn’t a single tip-in in the observed game. Our model would then spit out impossible numbers: \[\operatorname{E}[X_{\text{tip-in}}] = n\cdot p = 0\cdot p = 0\]

Hence a linear normal model would be ill-advised. That’s where the Poisson model comes in, since it is built for counts. If you want to know more about the Poisson process, just read about it on Wikipedia.
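A quick way to see why a count model fits here: under a Poisson distribution with an expected count of 1, a game with zero tip-ins is a completely ordinary outcome. The sketch below is plain Python (not from my R scripts), and the rate of 1.0 is just the example number from above.

```python
import math


def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 1.0  # expected tip-ins per game, the example value used in the text
p_zero = poisson_pmf(0, lam)  # equals exp(-1), roughly 0.37
p_table = {k: poisson_pmf(k, lam) for k in range(5)}

for k, p in p_table.items():
    print(k, round(p, 3))
```

So roughly a third of games would have no tip-ins at all, and the model never has to produce a negative or otherwise impossible count.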

But anyway, once you have the Poisson process with adjusted defense and offense, you take the \(\log\) of the whole model to put it in additive form. Then the rest is just plugging the equation into a beautiful R function for fitting Generalized Linear Models (glm): \( glm( \lambda \mathtt{\sim} equation, family = poisson) \).
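For anyone curious what a Poisson regression fit actually does, here is a small Python sketch that fits a two-parameter log-linear model with plain Newton's method. It is a stand-in for illustration, not the R call above and not our real model: the function name `fit_poisson`, the single 0/1 predictor, and the six data points are all made up.

```python
import math


def fit_poisson(x, y, iters=25):
    """Fit log(lambda_i) = b0 + b1 * x_i by Newton's method on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the overall mean rate
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (2x2, symmetric)
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system by hand
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up data: x = 1 marks games against a hypothetical weak defense
x = [0, 0, 0, 1, 1, 1]
y = [1, 2, 3, 3, 4, 5]  # observed counts
b0, b1 = fit_poisson(x, y)
print(math.exp(b0), math.exp(b0 + b1))  # fitted rates for x = 0 and x = 1
```

With a single 0/1 predictor, the fitted rates should land on the two group averages, which makes this an easy sanity check before trusting anything more complicated.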

So there, we have the whole process of predicting goal scoring in Even Strength cases in hockey. As for other “special” cases? Shorthanded events and Power Play events? It gets way more complicated, and I’m working on figuring that out this week.

Sigh. I know it’s not crystal clear. But deal with it. I’m tired today.

Finally, before I finish, I guess I should mention that I’m working on another scraper. This time, it’s http://hockeyfights.com. I’m trying to create a visual graph of all the fights that occurred in 2013 so far, using Gephi. So the script should be up soon in my GitHub. Check back later.

## Research Update: Defense Defense Defense

So this week we came up with a better model for calculating defense. Previously, I wrote that our very watered-down model is essentially this: \[\log \lambda_{homeshooting} = \log\lambda + \beta^{off}_{home} + \beta^{def}_{away}\] The offensive \(\beta\) is computed like this: \[ \beta^{off}_{home} = \sum_{\alpha \in \text{Shot Types}}\operatorname{E}[X_\alpha] \]

So we have the expected number of shots the home team would take in a game, broken down by type of shot. This would be a simple summation of expected values. (NOTE: for the sake of simplifying our model, we’re only considering Even Strength cases.)

The second part of the equation, \(\beta^{def}_{away}\), is not as easily computed. Defining the defensive ability of a team is not a trivial task because there are so many factors that go into it. For example, defense in hockey is unfortunately highly dependent on the system, which isn’t measurable. In addition, measurable statistics such as blocked shots and defensive faceoff wins are not good indicators of the defensive system (there seems to be a big variation in the correlation of goals allowed in a game vs. blocked shots). I guess you could make the same argument for offense, but the main goal of offense is to score by shooting. The goal of defense is to keep pucks out of the net. This is a pretty vague concept since there are “multiple” ways of keeping pucks out of the net.

So the simplest approach to calculating \(\beta^{def}_{away}\) would be to measure the average number of shots allowed by the opposing/away team, and figure out each team’s adjustment factor:

\[ \operatorname{E}[\text{Home Shots}] + \alpha \operatorname{E}[\text{Away Shots Allowed}] = \text{Observed Shots} \]

where \(\alpha\) is the adjustment factor for shots taken. Once we have the adjustment factor \(\alpha\), we can use the exact same equation to project shots for future games.
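Here is the arithmetic behind that equation as a tiny Python sketch, for a single game. The function names and all the shot numbers are made up, and this is only my reading of the equation, not the implementation I’m working on.

```python
def adjustment_factor(expected_home_shots, expected_away_allowed, observed_shots):
    """Solve E[home] + alpha * E[away allowed] = observed for alpha."""
    return (observed_shots - expected_home_shots) / expected_away_allowed


def projected_shots(expected_home_shots, expected_away_allowed, alpha):
    """Reuse the same equation to project shots once alpha is known."""
    return expected_home_shots + alpha * expected_away_allowed


# Made-up example: the home team averages 30 shots, the away team allows 28,
# and the home team actually managed only 23 shots against them.
alpha = adjustment_factor(30.0, 28.0, 23.0)
projection = projected_shots(30.0, 28.0, alpha)
print(alpha, projection)
```

A negative factor here would mean the away defense holds teams below their usual shot totals, and a positive one would mean it gives up more than usual.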

So this would be the most direct way of computing expected shots in a game factoring in the opponent’s defense. I actually came up with this idea, and am working on implementing it.

I’m gonna end it here for today, but I’d like to mention that my professor proposed another way of calculating the adjusted shots total. It’s using the Linear Mixed Effects Model and theoretically it produces similar results. But I’ll save this talk for next time.

## 74, The Magic Number

So, let’s answer this fucking question: “Does body size matter in goaltending?” I spent a good portion of this week writing a script for scraping goalie data from NHL.com. It can be found in my GitHub, along with all the other R code I’ve written so far for this blog. To answer this question, I’m using the complete goalie data from the 2012~13 season.

Now that I’m armed with the NHL goalie data that tells you Height, Weight, Birthplace, Save Percentage, Wins, Losses, and all the useless statistics you can think of, the very first thing I did was to plot every goalie according to their Height and Weight (fortunately, there were only 82 goalies that played in the 2012~13 season). And here’s what we get:

Alrighty… It looks like there are a lot of goalies that are 74 inches tall, but the weights are much more spread out. So let’s verify this. The average goalie height is 73.7 inches, with a median of 74 inches and a standard deviation of only 1.8 inches. When it comes to weight, the average is 197.7 lbs, with a median of 197 lbs and a standard deviation of a whopping 14 lbs. So now that we’re settled on the idea that height clusters pretty tightly around 74 inches (in a very elementary statistical way), let’s one-up this.
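For reference, these summary numbers are the kind of thing you can get in a few lines. The sketch below uses Python’s standard `statistics` module on a made-up list of heights (not the real 82 goalies), and it assumes the sample standard deviation, which may or may not match what I used for the numbers above.

```python
import statistics

# Made-up goalie heights in inches, NOT the scraped NHL.com data
heights = [70, 72, 73, 74, 74, 74, 75, 76, 78]

mean_height = statistics.mean(heights)
median_height = statistics.median(heights)
sd_height = statistics.stdev(heights)  # sample standard deviation (divides by n - 1)

print(mean_height, median_height, round(sd_height, 2))
```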

In the scatter plot below, I’ve plotted the same Weight vs. Height data, but added an extra element to it. Each point is scaled by the number of wins the goaltender had in the 2012~13 season (so the bigger the point, the more wins the goalie had):

What’s amazing about this is that some of the best goalies in the league (starting goalies with over 20 wins in a shortened 48-game season) are all 74 inches tall (or very close to it)! Look at how many starting goalies are centered around that 74-inch ballpark: Braden Holtby, Tuukka Rask, Sergei Bobrovsky (who won the Vezina in 2013), Marc-Andre Fleury, Antti Niemi, Henrik Lundqvist, etc. The list goes on. For me, this is an incredible discovery. Unfortunately, statistics often do not tell us the “Why?” or the “How?” All we know is that these elite goalies are all about the same height. We can only assume that this magic number could be the “preferred” size for NHL goalies.

Below is another plot I created during this experiment, though I wasn’t sure whether to include it in the blog post. It’s a distribution of Height/Weight Proportion vs. Number of Goalie Wins. This is a weak argument for trying to show that weight doesn’t have a significant effect in goaltending. The correlation value for Height/Weight Proportion vs. Number of Goalie Wins turns out to be only 0.215. It didn’t occur to me at the time that showing the correlation value would’ve been a stronger argument than plotting the distribution:
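In case the “correlation value” is unclear, here is a Python sketch of the usual Pearson correlation, which is what I’m assuming that number is. The ratios and win totals below are made up, so the result will not match the 0.215 from the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up height/weight ratios and win totals for five hypothetical goalies
ratio = [0.36, 0.37, 0.38, 0.39, 0.40]
wins = [12, 24, 9, 19, 15]

r = pearson(ratio, wins)
print(round(r, 3))
```

A value near 0 means the two columns barely move together, while values near 1 or -1 mean a strong linear relationship.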

Anyway, I had a lot of fun scraping Goalie data, and doing some analysis on it. Let me know if there is any specific topic I should tackle. Cheers.

## What’s Gephi Really Good For?

Shortly after I posted about Bryzgalov last week, an asshole friend of mine asked me a really interesting question: “Does body size matter in goaltending?” What a great question, asshole! I thought about this a lot and really wanted to write about it for this week. But quite honestly, I couldn’t commit myself to writing a scraping script. I don’t think it would be hard because on NHL.com, player height and weight information are readily available with (slightly misleading) career statistics. But I just couldn’t commit myself (things have been coming up in my personal life lately). So the answer to the question from my asshole friend would have to wait until next Sunday.

On a very non-hockey related note, I got an e-mail from Tumblr saying that today was one of my old blogs’ 1 year anniversary (http://isomorphicgraph.tumblr.com). I had a total of 4 blog posts for that one. I mostly wrote about Data Visualization using Gephi. At the time I was writing on Isomorphic Graph, I really thought Gephi was the next, up and coming software for visualizing relatively large data (that you wouldn’t call “Big Data”). I thought its ability to represent data into nodes and edges (like you do in graph theory) was really cool. I used it to visualize my Facebook network. I used it to visualize Twitter hashtags. I used it to visualize random datasets that were already provided on Gephi database.

But looking back, I think Gephi’s usefulness ends there. Today, I thought about somehow applying Gephi to show some kind of stunning random relationships in hockey. But what relationship? After mulling over this for a good hour, I gave up on the idea. I couldn’t think of anything because the real-world data you work with is never a large N x 2 edge list that shows relationships between pairs of elements (or nodes in our case). Gephi is great for showing social networks such as Facebook and Twitter. Its overall concept is to create pretty visual aids to show relationships in data. But unfortunately, Big Data is not a series of binary relationships.

Let me know if you think I’m completely wrong, or I’m just stupid. Below are some of the useless visual aids I came up with using Gephi from those delusional days:

## Another Way of Looking at Mr. Universe

Once upon a time, Ilya Bryzgalov was the most “exciting” goalie in the NHL. He popularized the phrase “Why You Heff to be Mad?” during his early years in Anaheim. In 2010, he single-handedly led the underdogs of the West, the Phoenix Coyotes, to a Stanley Cup playoff berth, which resulted in his nomination for the Vezina Trophy. He signed a whopping 9-year, $51 million contract with the Philadelphia Flyers shortly after his stellar 2010~11 season. He brought his profound knowledge of the Universe to the National Hockey League, and preached the “Don’t Worry, Be Happy” attitude to the young Philadelphia locker room. And finally, he was bought out in the offseason of 2013. As of today, he is an “Emergency Backup Goaltender” for the ECHL’s Las Vegas Wranglers.

So what went wrong with Mr. Universe? Shit, I don’t freakin’ know (if I did, I would be an NHL coach). But what we can do is visually observe his regression over the last few years.

My initial thought process was that I should just plot all the goals Bryz gave up on a season-by-season basis, and try to explain the change over time. Well, so I did:

This is a plot of all the shots that Bryzgalov faced in the 2009~2010 season (the year he was nominated for the Vezina Trophy), on axes scaled to a hockey rink. As you might’ve guessed, the red dots are the “goals.”

So, how useful is this scatter plot? Not very. In fact, if I created scatter plots for the last few years, you wouldn’t be able to tell the difference between any of them (when we should clearly notice a significant difference in his final year in the NHL). In other words, they are not visually helpful.

So, another way of looking at the same data is by looking at the Kernel Density of the goals scored. Kernel Density Estimation is a smoothing technique for a finite sample of data. You can make a connection with Polynomial Regression in the sense that it tries to find a “smooth” curve that runs across points. But Kernel Density counts the number of occurrences, estimates the density of those occurrences, and normalizes the curve based on the “smoothing” parameter, the *bandwidth* value \(h\). This is the formula for the Kernel Density Estimator:

\[\hat{f}(x) = \frac{1}{nh} \sum^n_{i=1} K\Big(\frac{x-x_i}{h}\Big)\]
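Here is that formula written out as a minimal one-dimensional Python sketch. The Gaussian kernel is my choice for \(K\) here, since the formula works with any kernel, and the sample coordinates and bandwidths are made up. The real plots are two-dimensional, but the idea is the same.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Made-up 1-D shot coordinates: a cluster near 12 and one stray point at 30
samples = [10.0, 12.0, 13.0, 30.0]

narrow = kde(20.0, samples, h=1.0)   # tiny: x = 20 is far from every sample
wide = kde(20.0, samples, h=10.0)    # larger: each sample's bump is spread out

print(narrow, wide)
```

A small \(h\) keeps each sample’s bump tight around the point, while a large \(h\) smears the bumps together into one smooth surface.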

So, the larger the bandwidth \(h\), the smoother the \(KDE\) becomes; a smaller \(h\) picks up more local detail but also more noise. Now, we apply the kde2d function to all the seasons from his Vezina nomination year until his downfall (2009~2013), and this is what we get:

These are kind of hard to see because I used the default setting of 25. (If that was kde2d from R’s MASS package, 25 is the default for n, the number of grid points per axis, so it controls how fine the picture is rather than the bandwidth \(h\).) But if you look at the 2009 season and compare it to the 2011 and 2012 seasons, the colors in the “eyes” become brighter and stronger. This shows that Ilya started to give up more and more goals in these areas of the ice. Also, the 2011 and 2012 seasons have more random yellow spots outside of the “eyes.” This means that Ilya also started to give up goals in uncommon areas.

For your viewing pleasure, I created the Kernel Density plots of the 2009 season and the 2012 season with that setting bumped way up to 200:

Now, it’s easier to see the huge difference in performance, and kind of guess what happened to Bryz. The Kernel Density plots show that in 2009, Mr. Universe was actually okay. In 2012, he shat all over the place.

## Introduction

Okay, so I don’t really know what the format of this blog is gonna be. I know that I want to use this blog to keep track of my current research (I’m doing research with a Statistics professor at Carnegie Mellon University and an undergrad). But I also want to use this to write about cool random experiments I’m doing with NHL data. So we’ll see. And yea. If you haven’t figured it out by now, this is a hockey blog.

I guess I’ll write about my research since this is the first post. We just finished writing our abstract, and it’s titled, “__Forecasting Goal Scoring in NHL using Additive Shot Models__.” The basis of our research is that we can come up with an expected number of goals in a game (calculated in higher resolution by dividing into different types of shots and types of players on the ice) as well as the expected save percentage of a particular goalie given these various types of shots. Once we have these expected values, we can use the Poisson Paradigm to model the rate of scoring for each team. So, the general idea of the research can be summed up in one Poisson Equation:

\[\lambda_{homescoring} = \lambda e^{\beta^{off}_{home}} e^{\beta^{def}_{away}}\] So when you take the \(\log\) of the whole equation:

\[\log \lambda_{homescoring} = \log\lambda + \beta^{off}_{home} + \beta^{def}_{away}\] Hence, the phrase “Additive Shot Model.” If you want to read the full abstract, just message me on twitter or something.
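If the jump from the multiplicative form to the additive form isn’t obvious, here is a quick numeric check in Python. The baseline rate and the two \(\beta\) values are made-up numbers, not estimates from our data.

```python
import math

# Made-up values for illustration only
base_rate = 2.5        # lambda, a league-wide baseline scoring rate
beta_off_home = 0.10   # home offense effect
beta_def_away = -0.05  # away defense effect

# Multiplicative form: lambda * e^(beta_off) * e^(beta_def)
rate_multiplicative = base_rate * math.exp(beta_off_home) * math.exp(beta_def_away)

# Additive form on the log scale: log(lambda) + beta_off + beta_def
log_rate_additive = math.log(base_rate) + beta_off_home + beta_def_away

print(math.log(rate_multiplicative), log_rate_additive)
```

Both lines describe the same rate, which is why each team effect can simply be added on the log scale.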

I’m not saying that my Poisson model is gonna give a better forecasting result than other models out there. In fact, I doubt it will, and it’s not even finished yet. But hopefully, this contributes something to sports academia. I don’t know. Forecasting is a really interesting topic, especially in sports. I mean… there are people out there who make a living betting on sports games!

One final note before I close this one. My professor, AC Thomas, developed an incredible R package, and the data that I’m using is extracted from this package. It scrapes every single shot that was ever taken since 2005 in crazy detail (i.e. who was on the ice when a particular shot was taken, the X-Y coordinate of where the shot was taken, etc.). You can’t find this shit on NHL.com. If anyone is interested, the link to his GitHub is right here.

Cheers everyone. For the next post, I think I’m gonna write about one of my favorite goalies, Ilya Bryzgalov a.k.a. Mr. Universe, and hopefully explain why he sucks.