The NBA commonly uses five positions to group players: point guard, shooting guard, small forward, power forward, and center. However, today’s NBA is becoming increasingly position-less. Instead, teams care more about the specific skills and tendencies of players when making lineups. This has created a need for new ways of grouping players. One way to group NBA players into positions, roles, or archetypes is to use a clustering algorithm. I applied a K-means clustering algorithm to create offensive and defensive archetypes that represent what a player brings to the floor better than traditional positions do.
K-means Clustering (Skip if you don’t like math)
K-means clustering is a relatively simple unsupervised machine learning algorithm. It takes in a dataset and groups together the observations that have similar traits. The number of groups (which is the “k” in “k-means”) is chosen by the person running the algorithm. The k-means algorithm starts by randomly choosing a centroid for each cluster and assigning each observation to its closest centroid. Each centroid is then updated so that it represents the mean of all the observations in its cluster, and the assignment and update steps repeat until the clusters stop changing. The distance between a data point and a centroid is usually measured by Euclidean distance, which is the square root of the sum of the squared differences across the variables (the original data gets standardized first so all variables are on the same scale). By repeating this whole process several times, using new starting centroids each time, and keeping the grouping with the lowest total within-cluster sum of squares, we find out which clusters fit the data the best.
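The steps above can be sketched in plain Python. This is a minimal illustration, not the code used for the results in this post: the function names (`standardize`, `kmeans_once`, `kmeans`), the number of restarts, and the toy data are all assumptions made for the example.

```python
import math
import random

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance; its square root is the Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(rows, k, rng, max_iter=100):
    """One run: random starting centroids, then assign/update until stable."""
    centroids = rng.sample(rows, k)  # k random observations as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(r, centroids[l]) for r, l in zip(rows, labels))
    return labels, centroids, wcss

def kmeans(rows, k, n_starts=10, seed=0):
    """Repeat from several random starts and keep the lowest-WCSS grouping."""
    rng = random.Random(seed)
    runs = [kmeans_once(rows, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Toy data (made up): two stats for six hypothetical players.
players = [[1.0, 9.0], [1.2, 8.5], [0.8, 9.4], [7.9, 1.1], [8.3, 0.7], [8.0, 1.4]]
labels, centroids, wcss = kmeans(standardize(players), k=2)
print(labels)
```

Comparing squared distances gives the same nearest centroid as comparing Euclidean distances, so the square root is skipped inside the loop.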
Instead of making five clusters that represent each player on both sides of the floor, I chose to separate offensive and defensive clusters. This is because a player can have the same role on one end of the floor, but a completely different one on the other. For example, two players can both be catch and shoot players on offense, but one could rely on getting steals defensively while the other protects the rim.
The first step in making the clusters for offense was to choose the input data. Choosing the data is crucial because the clustering algorithm will produce results based only on the inputted data. So if the inputted data doesn’t represent any important basketball traits, the algorithm will spit out clusters that are useless to us. The sample of players for the clusters included all players from the 2021-22 NBA regular season that played at least 500 minutes (this came out to be a total of 375 players).
Here is a list of stats that I included because they will help to create informative clusters (all non-percentage stats below are per 100 possessions):
- Catch & Shoot Field Goal Attempts and eFG% (to see if a player shoots off ball a lot and if they are efficient doing so)
- Pull-Up Field Goal Attempts and eFG% (to see how often a player can create his own shot off the bounce + efficiency)
- Shots Under 10 Feet Attempts and eFG% (to determine if a player is shooting more often in the paint or on the perimeter)
- Free Throw Attempts and FT% (to estimate shooting ability and how often a player gets to the rim and draws a foul)
- Drives (to find if a player tries to score by using speed from the perimeter)
- Post Ups (to see if a player uses strength or footwork to score 1 on 1 from inside the paint)
- Potential Assists (to determine the playmaking and passing ability of players)
- Turnovers (to show how well players keep control of the ball)
- Offensive Rebounds (to estimate activity on the offensive boards)
- Box Outs (to show effort in getting rebounds)
- Screen Assists (to determine frequency of screening for the ball handler)
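Since the counting stats above are expressed per 100 possessions, here is a small sketch of that conversion. The helper name `per_100_possessions` and the season totals are hypothetical and only show the arithmetic.

```python
def per_100_possessions(total, possessions):
    """Convert a season total into a rate per 100 possessions played."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * total / possessions

# Hypothetical season totals for one player (not real data).
season = {"drives": 600, "post_ups": 45, "potential_assists": 480}
possessions_played = 4000

rates = {stat: per_100_possessions(total, possessions_played)
         for stat, total in season.items()}
print(rates)
```

Putting every counting stat on the same per-possession footing keeps high-minute players from looking different from low-minute players just because they were on the floor more.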
After choosing the variables to be inputted into the k-means clustering algorithm, the next step is to choose “k,” or the number of clusters. To determine how many clusters there should be, I used the elbow method. This method looks at a graph of the total sum of squares within the clusters (lower values = better clusters) against the number of clusters. To find the optimal number of clusters, one should choose the k for which the graph appears to have an “elbow.” In the graph below, we can see that the elbow occurs at around 3-6 clusters. Ultimately, I chose to create 5 offensive clusters.
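To make the elbow method concrete, the sketch below computes the within-cluster sum of squares for several values of k on made-up data with three obvious groups. The compact k-means helper, the function names, and the data are assumptions for illustration; the real analysis used standardized player stats rather than these toy points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wcss(rows, k, rng, max_iter=100):
    """One k-means run from random starting centroids; returns its WCSS."""
    centroids = rng.sample(rows, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                  for r in rows]
        updated = []
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            # An emptied cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return sum(sq_dist(r, centroids[l]) for r, l in zip(rows, labels))

def elbow_curve(rows, k_values, n_starts=20, seed=0):
    """Best (lowest) WCSS over several random starts, for each k."""
    rng = random.Random(seed)
    return {k: min(kmeans_wcss(rows, k, rng) for _ in range(n_starts))
            for k in k_values}

# Toy data (made up): three tight groups of three points each.
toy = [[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1],
       [10.0, 0.1], [10.2, -0.2], [9.8, 0.3],
       [5.0, 9.0], [5.3, 8.8], [4.8, 9.1]]
curve = elbow_curve(toy, range(1, 6))
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

On data like this the curve drops steeply up to k = 3 and then flattens, which is the “elbow.” Real player data is rarely this clean, which is why a range such as 3-6 can all look reasonable.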
After choosing the input data and determining the number of clusters for the algorithm to group, the only things left were to run the algorithm and analyze the results. I named the five groups that were generated as the following:
- Inside Creators – versatile players who post up a lot, take lots of shots in the paint, create shots for both themselves and others, and can screen for ball handlers to generate screen assists
- Outside Creators – players who take lots of pull-up shots, generate lots of potential assists by passing to shooters, and drive a lot
- Spot Up Shooters – players that shoot lots of catch and shoot field goals (and shoot them efficiently) and have few turnovers, but don’t get many offensive rebounds or free throw attempts and don’t shoot close to the rim often
- Screen & Roll Bigs – players that create value by setting screens for ball handlers, crashing the offensive boards, and shooting well under 10 feet while avoiding drives, catch and shoot attempts, and pull-up shots
- Balanced – a mix of all the categories above; they don’t excel or fall behind in any one skill but rather they are about average in almost every skill
Here are some examples of players that were grouped into each cluster:
- Inside Creators – Joel Embiid, Giannis Antetokounmpo, LeBron James, Nikola Jokic, Karl-Anthony Towns, Anthony Davis, Jimmy Butler, Julius Randle
- Outside Creators – Luka Doncic, Trae Young, Kevin Durant, Ja Morant, Donovan Mitchell, DeMar DeRozan, Devin Booker, Jayson Tatum, Tyrese Maxey, Cameron Payne, Cole Anthony
- Spot Up Shooters – Klay Thompson, Desmond Bane, Kevin Love, Bojan Bogdanovic, Grayson Allen, Luke Kennard, Justin Holiday, Kevin Huerter
- Screen & Roll Bigs – Deandre Ayton, Montrezl Harrell, Rudy Gobert, Jarrett Allen, Hassan Whiteside, Steven Adams, Andre Drummond, JaVale McGee, Clint Capela
- Balanced – Jaren Jackson Jr., Miles Bridges, Obi Toppin, Andrew Wiggins, John Collins, Tobias Harris, Aaron Gordon, Kyle Anderson
The plot below is a Principal Component Analysis graph. This graph is a 2-dimensional way of demonstrating how close the clusters are to each other. The numerical values of the x-axis and y-axis don’t have any interpretation that is useful. Rather, the locations of the clusters on the plot are the most important feature.
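For readers curious how many variables get squeezed into two plot coordinates, here is a rough sketch of PCA in plain Python. It finds the two leading directions of the covariance matrix with power iteration, which is a stand-in chosen to keep the example dependency-free and is not necessarily how the plot in this post was produced. All names and the toy data are assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each column's mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(a, b) / (n - 1) for b in cols] for a in cols]

def top_eigenvector(matrix, iterations=1000, seed=0):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project rows onto the first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    pc1 = top_eigenvector(cov)
    eigenvalue1 = dot(pc1, [dot(row, pc1) for row in cov])
    d = len(cov)
    # Deflation: remove the first component so the next-largest one dominates.
    deflated = [[cov[i][j] - eigenvalue1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    coords = [(dot(row, pc1), dot(row, pc2)) for row in centered]
    return coords, pc1, pc2

# Toy data (made up): three stats for six hypothetical players.
stats = [[9, 1, 5], [8, 2, 4], [2, 8, 5], [1, 9, 6], [5, 5, 1], [5, 5, 9]]
coords, pc1, pc2 = pca_2d(stats)
for xy in coords:
    print(round(xy[0], 2), round(xy[1], 2))
```

The sign of each axis is arbitrary, so a plot can come out mirrored from run to run. Only the relative positions of the points carry meaning.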
From this PCA graph, we can see that the 5 clusters generated from the k-means algorithm fit the data very well. There is little overlap between any of the clusters. Further, the clusters reinforce some things that we could have expected. The horizontal axis seems to provide a measure of whether a player plays from the perimeter more (towards the right) or from in the paint and post more often (towards the left). This is because we see the “Outside Creators” and “Spot Up Shooters” clusters on the right while the “Inside Creators” and “Screen & Roll Bigs” clusters are to the left. Meanwhile, the vertical axis seems to measure how well a player creates for others, since the “Outside Creators” and “Inside Creators” clusters are located towards the top while “Spot Up Shooters” and “Screen & Roll Bigs” are towards the bottom. From the graph, we can also see that the “Balanced” cluster truly seems to be a mix of all the other clusters as it is located in the center of the other 4 clusters. Next, we can look at how each cluster compares in the inputted variables.
From the graph above, we can compare each offensive cluster in all of the variables included in the algorithm. We can see that the biggest difference between clusters occurs in post ups, where Inside Creators have a post up frequency that is far higher than any other cluster (8 post ups per 100; the next highest is 1.7). Some other big differences belong to Screen & Roll Bigs, who get offensive rebounds and screen assists far more than other clusters (except maybe Inside Creators).
The most valuable cluster seems to be Inside Creators, followed by Outside Creators. Both of these groups create looks for themselves and teammates while having a high usage, but Inside Creators create better quality shots because they are creating shots closer to the rim and drawing more fouls. The previous 4 MVP awards have gone to Inside Creators (2 for Jokic, 2 for Giannis). This is reinforced by the graph below, which shows the average points per 100 possessions among the players within each cluster, along with the number of players in each cluster. Inside Creators are the rarest NBA players, accounting for less than 5% of players in the sample, but they are probably the best as they average more than 30 points per 100.
Frequencies of Clusters within Teams
Now that we have divided players from last season into five clusters, we can see the relative frequencies of each cluster within each team. Keep in mind that these numbers only include players with at least 500 minutes in the 2021-22 NBA season. The graph below shows which clusters were the most frequent within each team. The frequencies were determined by the total minutes played by each cluster.
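A minutes-weighted frequency like the one described can be computed along these lines. This is only a sketch: the function name, the `(team, cluster, minutes)` row layout, and the team codes and minute totals are hypothetical.

```python
from collections import defaultdict

def cluster_minute_shares(players):
    """Return {team: {cluster: share of the team's sampled minutes}}."""
    minutes = defaultdict(lambda: defaultdict(float))
    for team, cluster, mins in players:
        minutes[team][cluster] += mins
    shares = {}
    for team, by_cluster in minutes.items():
        total = sum(by_cluster.values())
        shares[team] = {c: m / total for c, m in by_cluster.items()}
    return shares

# Hypothetical (team, cluster, minutes) rows; not real rosters.
roster = [
    ("AAA", "Outside Creators", 2400),
    ("AAA", "Spot Up Shooters", 3600),
    ("AAA", "Screen & Roll Bigs", 2000),
    ("BBB", "Balanced", 5000),
    ("BBB", "Outside Creators", 3000),
]
shares = cluster_minute_shares(roster)
for team, by_cluster in shares.items():
    print(team, by_cluster)
```

Weighting by minutes rather than counting players means a starter counts for more than a bench player who barely cleared the 500-minute cutoff.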
Using the graph above, we can see how each team’s offense operated. The Celtics had several Outside Creators that could play both on and off the ball, helping to keep their offense steady in case one of their creators played badly. Meanwhile, the Hawks were composed of Trae Young as the main creator while the majority of the rest of the minutes were given to Spot Up Shooters to capitalize on Young’s creation. The Warriors and Raptors were two teams that heavily relied on Balanced players, using their all-around skills to create an offense that could attack in many different ways. The team with the best offensive rating last year, the Utah Jazz, had players with very clear offensive roles. Donovan Mitchell, Mike Conley, and Jordan Clarkson were the Outside Creators that took lots of pull-up shots and drove to create opportunities, while Spot Up Shooters like Bojan Bogdanovic and Royce O’Neale stayed on the perimeter as catch and shoot threats and Screen & Roll Bigs like Rudy Gobert and Hassan Whiteside served as screen setters and lob threats. The Jazz offense had very few Balanced players, but instead operated with clear roles.
After creating the offensive archetypes, I repeated the process for defense. This used the same sample of 375 players that had at least 500 minutes played. Here is a list of the inputs that I used for defensive analysis, what they mean, and why I used them (all non-rate stats are per 100 possessions):
- Rim Defensive Field Goal Attempts (shots within 6 feet of the rim where the player was the closest defender; gives a sense of how often the player defends the rim)
- Rim Defensive +- (opponent FG% at the rim when the player is the closest defender minus opponent FG% at the rim in all situations; lower is better; displays rim protection effectiveness)
- Outside Shot Defensive FG% (opponent FG% when the player was the closest defender of shots more than 15 feet from the rim; helps to measure perimeter defense)
- Contested 2-point shots (plays when the defender closes out and raises his hand to defend a shot before it is released; shows defensive effort without regard to results from shooting luck)
- Contested 3-point shots (same as above but for 3-pointers)
- Deflections (plays when the defender gets his hand on the ball on a pass or dribble; represents the ability to force turnovers)
- Box Outs (plays where the defender physically makes contact and gives effort to box out his opponent; estimates effort on defensive rebounds)
- Defensive Rebounds (measures actual ability in getting defensive rebounds)
- Blocks (displays shot blocking and shot altering ability)
After choosing the variables, we use the same process as before to determine the number of clusters. Using the elbow method again, the best number of clusters would be between 3 and 5. To make it similar to the offensive clusters, I chose to create 5 defensive clusters.
The k-means algorithm was ready to run after choosing the number of clusters. Here are the resulting defensive clusters:
- Rim Protectors – players that defend many shots at the rim, allow a low field goal percentage close to the rim, contest and block lots of shots, and get lots of defensive rebounds
- Active Defenders – players that guard an average number of shots at the rim, are about average rim defenders, and get a fair amount of defensive rebounds, but also contest lots of shots both inside the arc and on the perimeter
- Disruptors – players whose main skill is forcing turnovers by getting lots of deflections; they are usually subpar rim defenders and defensive rebounders
- Wing Defenders – players that do a great job of allowing a low field goal percentage from long distance but are average in most other categories
- Hidden – bad at almost everything defensively and contest few shots
And now here are some examples for each archetype:
- Rim Protectors – Jarrett Allen, Robert Williams III, Mitchell Robinson, Anthony Davis, Rudy Gobert, Jaren Jackson Jr., JaVale McGee, Jonas Valanciunas, Hassan Whiteside
- Active Defenders – Draymond Green, Bam Adebayo, Jaden McDaniels, Kyle Anderson, Kevin Durant, Miles Bridges, Derrick White, John Collins
- Disruptors – Matisse Thybulle, Gary Payton II, Bruce Brown, Nicolas Batum, Patrick Beverley, Herb Jones, Tyrese Haliburton
- Wing Defenders – Paul George, Jimmy Butler, Klay Thompson, Jayson Tatum, LeBron James, Josh Richardson, Mikal Bridges
- Hidden – Anfernee Simons, Doug McDermott, Duncan Robinson, Trae Young, Zach LaVine
Now, let’s look at a PCA graph to see how distinct the clusters truly are. The PCA graph below shows how close the clusters are to each other in 2 dimensions. The most obvious part of the graph is that Wing Defenders and Disruptors have a significant overlap. This implies that the 2 clusters are very similar to each other. This starts to make sense once we look at how clusters compare in each variable. Wing Defenders and Disruptors have obvious differences in only two stats: deflections and outside defensive field goal percentage. Therefore, the players in these two categories are basically divided based on whether they have a lot of deflections or a low Outside DFG%. Meanwhile, the Rim Protectors and Hidden clusters are more distinct than the others, so they don’t share a lot of similarities with other clusters.
The next graph shows the performance of each cluster in each inputted variable. This is a way to see what each cluster is really grouping by.
The Rim Protectors are the most obvious outlier cluster. The average standardized scores for Rim Protectors in blocks, box outs, contested 2’s, defensive rebounds, and rim defensive field goal attempts are all more than 1 standard deviation above the mean. This cluster also defends the rim at a higher level than anyone else by almost 1 standard deviation. Also, observe how the purple points, which represent the Hidden cluster, are the lowest or second-lowest for all stats except for Outside DFG% and Rim DFG% +- (for which they are the highest). This reinforces that the Hidden cluster is the worst at defense in pretty much everything. Teams definitely don’t want a lot of defensive players that reside in this cluster.
Finally, let’s look at the distribution of defensive clusters within teams.
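The cluster averages in standard deviation units can be reproduced roughly as follows. This is a sketch under assumptions: the helper names are invented, only one stat is shown, and the blocks-per-100 values are made up rather than taken from the sample.

```python
import math
from collections import defaultdict

def z_scores(values):
    """Standardize a list of values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def cluster_profile(values, labels):
    """Average z-score of one stat within each cluster."""
    groups = defaultdict(list)
    for z, label in zip(z_scores(values), labels):
        groups[label].append(z)
    return {label: sum(zs) / len(zs) for label, zs in groups.items()}

# Hypothetical blocks per 100 possessions for six players (not real data).
blocks = [3.0, 2.6, 0.4, 0.6, 0.5, 0.3]
clusters = ["Rim Protectors", "Rim Protectors", "Hidden",
            "Wing Defenders", "Wing Defenders", "Hidden"]
profile = cluster_profile(blocks, clusters)
for name, avg_z in profile.items():
    print(name, round(avg_z, 2))
```

A cluster average above 1 means the typical player in that cluster sits more than one standard deviation above the league-wide mean for that stat.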
One team with an interesting approach to roster building is the Raptors. Their team is full of players with good height (between 6’5″ and 6’10”) that can defend multiple positions, providing versatility. They had a very high proportion of both Active Defenders and Disruptors. Players like Pascal Siakam, Scottie Barnes, and Chris Boucher flew around the court, contesting tons of shots, while others like Fred VanVleet, Gary Trent Jr., and OG Anunoby forced lots of turnovers. This was one of the reasons that Toronto had a top 10 defense and also forced the most turnovers of any NBA team in the regular season.
Another unique team, although this time not in a good way, was the Hawks. Atlanta was unique because they had a lot of Hidden defenders. In fact, over 60% of their minutes in the sample were given to players in the Hidden defensive cluster. These players included Trae Young, Bogdan Bogdanovic, Danilo Gallinari, Kevin Huerter, and De’Andre Hunter. The Hawks had the 5th worst defense last season, and their abundance of below average defenders was a main reason. The good thing is that Atlanta has taken steps to fix this issue with their recent moves. They brought in Dejounte Murray (Disruptor) and Aaron Holiday (Disruptor) while shipping out Danilo Gallinari and Kevin Huerter.
How This Helps
While assigning archetypes to players may be fun, it isn’t immediately obvious why it can actually help NBA teams.
It is important to group similar players together because it can help a team assess its needs. By grouping players into five different groups on offense, it is a lot easier to see which sections are lacking and which ones are in abundance. Take the Magic, for example. They had the worst offensive rating last season. By taking a look at the frequencies of archetypes by minutes played for just the Magic, we can see that their offense consisted of mostly Outside Creators, Spot Up Shooters, and Balanced players. However, they had few minutes dedicated to Inside Creators and Screen & Roll Bigs. Therefore, it would be beneficial for the Magic to look to target players that can become great playmakers from the paint or from the post as opposed to perimeter players. They did this in the 2022 Draft, taking Paolo Banchero, who plays Power Forward and was a great shot creator at Duke. Additionally, it would be a good idea to look for some bigs that can rebound and shoot well while setting screens for the shot creators that the Magic already have.
Another reason that creating clusters of players is important is that it can help us to compare player stats with other similar players. For example, we can look at one of the most basic measures of offensive efficiency: True Shooting Percentage. True Shooting Percentage is simply the number of points scored divided by two times the number of scoring opportunities. Scoring opportunities include field goal attempts and free throw attempts, but free throw attempts are multiplied by 0.44 because a trip to the line usually involves two or three free throws, so a single attempt counts as less than one full scoring opportunity. The formula for TS% is PTS/(2*(FGA + 0.44*FTA)).
Interestingly, offensive archetypes differ in their average TS% by a large amount. Screen & Roll Bigs have an average TS% of 65% while Outside Creators have an average True Shooting close to 55%. This makes sense because Outside Creators have to take more difficult shots including pull-ups and floaters while Screen & Roll Bigs usually take more high percentage shots like putbacks and lobs. Using this information, we can create a cluster relative True Shooting Percentage, which is the player’s actual TS% minus his cluster’s average TS%. This allows us to evaluate efficiency on a scale that is relative to similar players.
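Here is a short sketch of TS% and the cluster relative version. The TS% formula is the one given above. Treating the cluster average as an unweighted mean of player TS% values is an assumption, and the player names and season totals are made up.

```python
from collections import defaultdict

def true_shooting(points, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)), returned as a fraction."""
    return points / (2 * (fga + 0.44 * fta))

def cluster_relative_ts(players):
    """Return {name: player TS% minus the mean TS% of his cluster}.

    Assumption: the cluster average is an unweighted mean over players.
    """
    ts = {p["name"]: true_shooting(p["pts"], p["fga"], p["fta"]) for p in players}
    by_cluster = defaultdict(list)
    for p in players:
        by_cluster[p["cluster"]].append(ts[p["name"]])
    cluster_mean = {c: sum(v) / len(v) for c, v in by_cluster.items()}
    return {p["name"]: ts[p["name"]] - cluster_mean[p["cluster"]] for p in players}

# Hypothetical season totals (not real players or real data).
sample = [
    {"name": "Big A", "cluster": "Screen & Roll Bigs", "pts": 900, "fga": 550, "fta": 300},
    {"name": "Big B", "cluster": "Screen & Roll Bigs", "pts": 700, "fga": 480, "fta": 200},
    {"name": "Guard A", "cluster": "Outside Creators", "pts": 1800, "fga": 1400, "fta": 450},
    {"name": "Guard B", "cluster": "Outside Creators", "pts": 1500, "fga": 1250, "fta": 300},
]
relative = cluster_relative_ts(sample)
for name, diff in relative.items():
    print(name, round(100 * diff, 1))
```

In this toy sample, "Guard A" has a lower raw TS% than either big but comes out above his own cluster’s average, which is the kind of comparison the relative version is meant to surface.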
Finally, creating clusters of players allows us to look at general trends to figure out what leads to more winning. For example, an obvious relationship would be that having more Hidden defenders would lead to having a worse defense. Using the frequency of Hidden players by team, we can see the effect it has on defensive rating. From the graph below, we can determine that for every 10% increase in Hidden defenders, a team’s defensive rating is expected to increase (which is bad) by about 1 point.
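A trend line like this comes from an ordinary least squares fit. The sketch below shows the calculation on invented numbers that roughly follow the relationship described. They are not the actual team values, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: share of minutes given to Hidden defenders (in percent)
# and team defensive rating. Not real team values.
hidden_share = [10, 20, 30, 40, 50, 60]
def_rating = [109.2, 110.0, 111.3, 112.1, 112.9, 114.3]

slope, intercept = fit_line(hidden_share, def_rating)
print("change in defensive rating per 10 points of Hidden share:",
      round(10 * slope, 2))
```

The slope is the expected change in defensive rating per one percentage point of Hidden minutes, so multiplying it by 10 gives the “per 10%” figure quoted above.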
Additionally, using this graph we can also start to think about the implications of some of the most recent moves in the offseason. The Hawks acquired Dejounte Murray while parting ways with both Danilo Gallinari and Kevin Huerter, meaning their frequency of minutes given to Hidden defenders should decrease. This means we can expect Atlanta’s defensive rating to get better by a few points. Another example is how the Rudy Gobert trade will affect the Jazz defense. The Jazz had a lot of Hidden defenders, but Gobert made up for a lot of that by being one of the best rim protectors in the league. But with Gobert gone, expect the Jazz defense to go from good to one of the worst.
Relationships between archetypes and team performance are even more surprising on the offensive end. Take a look at the following scatterplot. It compares the frequency of minutes given to Screen & Roll Bigs and Spot Up Shooters against team offensive rating. Screen & Roll Bigs and Spot Up Shooters are similar because neither archetype is expected to create plays. Rather, their jobs are to shoot efficiently and finish on shots that were created by others.
One would expect that teams with more shot creators would be better offensively because they have more skilled players. However, it actually turns out that teams with more play finishers (Screen & Roll Bigs and Spot Up Shooters) tend to have higher offensive ratings. Keep in mind that this sample only includes one season, so it is not definitive. Nonetheless, the relationship is interesting. I think that the reason for this trend is that teams need just a few primary creators while the rest of the players on the floor should be efficient shooters that don’t turn the ball over. This way, they can make the most of the shots created by their stars. The Hawks, Grizzlies, and Suns (all of which had top 5 offenses) each had over 50% of minutes dedicated to Screen & Roll Bigs + Spot Up Shooters.
Using a basic unsupervised machine learning algorithm, we can create clusters of similar NBA players on both offense and defense. This allows us to compare stats of players to those that are similar, and it lets us look at what makes a great NBA team. General managers should always be on the hunt for great shot creators, but it’s important to not forget about filling the roster with good play finishers to complement the stars.