100k Simulations of All Texas Private Six-Man Brackets

It was just time to get down to it. I had been delaying the inevitable: running 100,000 simulations of each and every private school six-man state bracket. For details on how I did this, please read my earlier posts about the public school brackets and my other Monte Carlo simulations. The process here was very similar.

First, build the starting bracket using this week’s ratings from my website (www.sixmanfootball.com). Then calculate the probability of each first-round game and simulate the result. After each round I update the ratings (not exactly the way my formula does, but a close enough estimate) and continue. Do this 100,000 times and see what happens.
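The loop described above can be sketched in a few lines of Python. This is only an illustration, not the code behind these tables: the logistic `win_prob` link, its `scale`, the size of the rating nudge `k`, and the team names and ratings are all assumptions made for the example. A `None` entry in the bracket stands for a bye.

```python
import random
from collections import Counter

def win_prob(r_a, r_b, scale=15.0):
    """Assumed logistic link between a rating gap and a win probability."""
    return 1.0 / (1.0 + 10 ** (-(r_a - r_b) / scale))

def play_bracket(bracket, ratings, rng, k=2.0):
    """Play one single-elimination bracket and return the champion.

    bracket: list of team names in draw order (None marks a bye);
             its length must be a power of two.
    ratings: dict of team -> rating, copied so every trial starts fresh.
    """
    ratings = dict(ratings)
    alive = list(bracket)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            if a is None or b is None:       # bye: the real team advances
                next_round.append(a if b is None else b)
                continue
            p = win_prob(ratings[a], ratings[b])
            a_won = rng.random() < p
            # Rough stand-in for a ratings update: shift both teams by the surprise.
            shift = k * ((1.0 if a_won else 0.0) - p)
            ratings[a] += shift
            ratings[b] -= shift
            next_round.append(a if a_won else b)
        alive = next_round
    return alive[0]

def simulate(bracket, ratings, trials=100_000, seed=0):
    """Count how many times each team wins the bracket."""
    rng = random.Random(seed)
    return Counter(play_bracket(bracket, ratings, rng) for _ in range(trials))

# Hypothetical three-team draw: "A" has a first-round bye.
titles = simulate(["A", None, "B", "C"], {"A": 50, "B": 40, "C": 30}, trials=10_000)
print(titles.most_common())
```

Tracking the round in which each team loses, rather than just the champion, only requires incrementing a counter for the loser inside the inner loop.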

Well, here’s what happened.

TAPPS D1          
TEAM FIRST QUARTERS SEMIS FINAL CHAMPION
Boerne Geneva 0 16679 33496 12466 37359
Midland Trinity 0 13728 45784 12329 28159
Baytown Christian 0 38588 25904 24432 11076
Watauga Harvest 25725 17419 26487 20469 9900
Rockwall Heritage 35535 37235 13252 10377 3601
Sugar Land Logos Prep 24633 61253 8646 2274 3194
Houston Emery-Weiner 74275 9537 10169 4741 1278
Pasadena First Baptist 64465 24177 6168 3917 1273
Abilene Christian 45374 38952 10399 4169 1106
Round Rock Christian 49480 43205 5492 864 959
Katy Faith West 50520 43067 4867 711 835
Austin Hill Country 54626 34092 7621 2879 782
Waco Vanguard 75367 22068 1715 372 478
TAPPS D2          
TEAM FIRST QUARTERS SEMIS FINAL CHAMPION
Waco Live Oak 0 9711 16354 17471 56464
Austin Veritas 0 20885 34691 29735 14689
Orange Community Christian 0 46910 27849 18231 7010
Dallas Tyler Street 36085 17714 34388 5019 6794
SA Castle Hills 22194 38296 19682 14086 5742
Cedar Park Summit 25660 66166 3416 2406 2352
Denton Calvary 0 69007 26428 2318 2247
Kerrville Our Lady of the Hills 63915 13279 18663 1980 2163
Bulverde Bracken Christian 36088 49169 9136 4452 1155
Dallas Lakehill 77806 14794 4581 2204 615
Conroe Covenant Christian 63912 29946 4061 1649 432
Lubbock Christ The King 74340 24123 751 449 337
TAPPS D3          
TEAM FIRST QUARTERS SEMIS FINAL CHAMPION
Longview Trinity 0 25590 27022 16279 31109
Fredericksburg Heritage 0 29268 21774 25772 23186
WF Notre Dame 0 36128 35968 11996 15908
Fort Worth Covenant Classical 11814 54054 22060 5912 6160
Granbury North Central Texas Academy 43834 15878 24039 10298 5951
Seguin Lifegate 15010 58039 11590 10003 5358
Richardson Canyon Creek Christian 41133 42480 8406 3951 4030
San Marcos Hill Country Christian 56166 14491 19126 6837 3380
Alvin Living Stones 0 69631 22245 5739 2385
WF Wichita Christian 58867 31930 5010 2152 2041
Brenham Christian 84990 12693 1226 805 286
Selma River City Believers 88186 9818 1534 256 206
TCAF D1          
TEAM SEMIS FINAL CHAMPION
Fort Worth Nazarene 25645 32238 42117
Wylie Preparatory 26385 35665 37950
Dallas Inspired Vision 74355 15287 10358
Waco Methodist Childrens Home 73615 16810 9575
TCAF D2          
TEAM SEMIS FINAL CHAMPION
Azle Christian 20861 16452 62687
Granbury Cornerstone 32698 49507 17795
Weatherford Christian 79139 7600 13261
Arlington St. Paul Prep 67302 26441 6257
TCAL D1          
TEAM QUARTERS SEMIS FINAL CHAMPION  
Bryan Allen Academy 1340 2173 2839 93648
SA The Atonement 47104 23154 28492 1250
Tyler King’s Academy 48830 49877 289 1004
Greenville Phoenix 44302 28918 25840 940
EP Faith 52896 22695 23532 877
Bryan Christian Homeschool (BVCHEA) 51170 47717 255 858
Houston Mount Carmel 98660 233 266 841
Clear Lake Christian 55698 25233 18487 582
TCAL D2          
TEAM QUARTERS SEMIS FINAL CHAMPION
Stephenville Faith 2150 4906 24891 68053
Sugar Land HCYA Fort Bend 4188 21572 49549 24691
Corpus Christi Annapolis 12883 64578 18001 4538
Killeen Memorial 37562 58609 2464 1365
SA Sunnybrook 62438 35860 1165 537
Corpus Christi Abundant Life 97850 625 1092 433
Corpus Christi WINGS 87117 11393 1290 200
Lockhart Lighthouse Christian 95812 2457 1548 183
TAIAO D1          
TEAM QUARTERS SEMIS FINAL CHAMPION
Tyler HEAT 9493 28388 28029 34090
SA FEAST Homeschool 23317 29535 20494 26654
Capital City Christian Home School 34148 35468 15424 14960
Temple Centex Homeschool 39095 38257 12965 9683
Fort Worth THESA 65852 22050 7102 4996
Crosby Victory and Praise 60905 27484 7228 4383
Bryan Aggieland Home School (BCAL) 76683 12947 6135 4235
Plano CHANT 90507 5871 2623 999
TAIAO D2          
TEAM QUARTERS SEMIS FINAL CHAMPION
Austin NYOS 0 18228 29436 52336
Bastrop Tribe Consolidated 0 29389 40156 30455
Waco Parkview 19490 54627 17568 8315
San Marcos Homeschool 21741 62559 8584 7116
Weatherford Home School 78259 19213 1626 902
Victoria Home School 80510 15984 2630 876

Obviously for TCAF, I am moving straight into this week since the first round was played last weekend.

Another thing to notice is that teams like Austin NYOS do not lose in the first round. Why? They got a bye.

The biggest shocker at first glance – the fact that Bryan Allen Academy is such a huge favorite. I expected it to be high, but 93.6% to win it all is a little obscene.

So I hope everyone enjoys this… and remember, no wagering.

East and Throckmorton likely to rule UIL D2 Six-Man Playoffs

After 100,000 simulations, the Throckmorton Greyhounds appear to have a 29.8% chance to win the UIL D2 Six-Man State Championship. The biggest challenge, it appears, will be the strength of the East bracket, which produced the champion 80.1% of the time in the simulation.

Yesterday I wrote about how the Crowell Wildcats are a somewhat dominant 33.1% to repeat as the D1 UIL State Six-Man Champions. If you would like to read more details on the methods, I have several posted below.

Basic note: The table represents how many times each team LOST in that round or became the champion (final column).

TEAM BI-DISTRICT AREA QUARTERS SEMIS FINALS CHAMPION
Throckmorton 2277 17879 26568 18537 4942 29797
Guthrie 7473 13276 45288 15376 3977 14610
Calvert 7171 26565 28201 20608 3668 13787
Richland Springs 16212 9781 33384 23044 3859 13720
Groom 14392 22605 26298 11473 18726 6506
Follett 20907 9637 30748 13218 19457 6033
Jonesboro 21822 51329 15095 7807 1096 2851
Motley County 36812 49576 7304 3538 821 1949
Buena Vista 24382 31007 18346 15296 9149 1820
Balmorhea 35900 17918 21475 14798 8314 1595
Blanket 26142 37455 17133 12373 5848 1049
Southland 16387 56498 15533 5156 5424 1002
Chillicothe 14573 70298 12063 1898 356 812
Oglesby 83788 4289 8126 2777 309 711
Lueders-Avoca 63188 31122 3323 1401 289 677
Mt. Calm 26113 62182 8582 2343 252 528
Sands 64100 13218 12862 6674 2706 440
McLean 79093 4955 10141 2791 2601 419
Blackwell 30030 45928 14678 6701 2320 343
Mullin 78178 17599 2828 1014 112 269
Sierra Blanca 75618 14193 5708 3167 1158 156
Whitharral 41417 49108 6784 1462 1082 147
Jayton 92527 2983 3809 455 90 136
Lefors 85608 7191 4864 1239 973 125
Trinidad 92829 4507 1933 566 61 104
Loraine 73858 17345 5047 2663 984 103
High Island 73887 23748 1851 406 26 82
Rising Star 69970 22936 4751 1763 507 73
Kress 58583 36300 3781 770 511 55
Lazbuddie 83613 13706 1851 456 333 41
Harrold 97723 1423 667 125 28 34
Forestburg 85427 13443 978 105 21 26

It is interesting to note that while Richland Springs and Calvert have higher ratings at the current time, Guthrie actually has the second-highest chance to win the tournament (14610 to 13720 and 13787, for RS and Calvert, respectively). This is due to the fact that Guthrie has it easier in the first two rounds.

Out West, Groom and Follett (6506 and 6033 wins) have a combined probability that’s less than any of the top four from the East. On the bright side, each of them reaches the finals about 25% of the time, more often than Guthrie, Calvert or Richland Springs (only Throckmorton gets there more often), mostly due to the fact that Throckmorton is not in their half of the draw.

It certainly looks like the West is more competitive in the sense that the teams are more even and quite a few more have solid opportunities to reach the semis and finals.

Coming Next: All of the private school draws.

 

Crowell Favorite to Win Six-Man Title with 33.1% Win Probability

I have created several Monte Carlo simulations over the past year to try to determine probabilities for various sporting events. This week I decided to tackle the Texas Six-Man state tournament. (I will publish more bracket evaluations as the week goes on.)

For the past 21 seasons, I have been producing rankings for six-man football. For those of you who do not know the history, I would fax my rankings to newspapers across the state and several would actually publish them. I eventually put together a newsletter, The Huntress Report, where I would add scores, game stories, stats and schedules to the rankings and mail (or fax) to subscribers. Eventually I moved to a website, where I would update the information a week behind, so that my subscribers would be getting the freshest information first. That all was scrapped in 1999 when I decided to go 100% to the website (www.sixmanfootball.com).

METHODOLOGY

You can read some of my earlier posts (see below here at sixmanguru.com) where I discuss Monte Carlo simulations if you are interested. In this case I played the UIL Division I tournament 100,000 times using probabilities calculated from the ratings on my website. To account for upsets and follow a more Bayesian methodology, I adjusted the teams’ ratings after each round to roughly mimic how my rating system would update them. I also recorded the round in which each team lost; the results are below.

Crowell, the defending DI state champions, wins the title again a whopping 33.1% of the time and reaches the finals over 41% of the time.

TEAM BI-DISTRICT AREA QUARTERS SEMIS FINALS CHAMP
Crowell 8686 12684 25264 11843 8426 33097
Ira 16846 9048 41178 9920 6573 16435
May 1911 8450 19585 28915 26882 14257
Blum 9396 2817 28089 27327 22565 9806
Borden County 16195 24064 21597 25036 4836 8272
Happy 9925 21253 34686 24447 4075 5614
Abbott 26692 4293 40184 16474 9330 3027
Water Valley 22717 63399 8220 2512 1390 1762
Valley 21460 50424 14276 10825 1381 1634
Gordon 30614 18265 35123 9414 5227 1357
Knox City 83154 4192 9627 1484 664 879
Grady 42027 41247 11329 4300 508 589
Highland 91314 3270 3476 935 478 527
Aquilla 73308 2813 17360 4297 1795 427
Sterling City 44824 47109 6529 782 373 383
Zephyr 17303 55678 21972 3477 1296 274
Anton 83805 8484 4666 2544 254 247
Newcastle 69386 11555 15094 2678 1046 241
Garden City 55176 39651 4328 411 224 210
Ropes 57973 32379 6938 2288 218 204
Marfa 77283 20647 1378 374 151 167
Nazareth 78540 17028 2724 1409 143 156
Milford 90604 972 5262 2335 697 130
Santa Anna 48258 46737 2667 1705 539 94
Rochelle 51742 44020 2323 1416 429 70
Spur 90075 5121 3784 890 64 66
Leverett’s Chapel 35599 59216 4515 531 121 18
Eden 82697 14502 2461 249 77 14
Chester 44796 52912 1717 462 99 14
Tioga 98089 793 775 280 50 13
Campbell 55204 43299 1154 281 51 11
Savoy 64401 33678 1719 159 38 5

The good news is every team has a chance to win it all, even Savoy. The bad news: it appears they have only about a 5 in 100,000 chance. I did run this a few times and they got as high as 12 in one of the iterations. Tioga, a team that loses 98.1% of the time in the first round, actually has a better chance than Savoy with 13 wins.

Another thing that stands out is that Ira, despite winning the title a theoretical 16.4% of the time, also seems to lose in the first round (16.8%) much more often than teams like Crowell (8.7%) or May (an amazing 1.9%). This goes to show that despite the 45-point expected spread on the Ira-Knox City game, it is still a much more difficult match-up for the Bulldogs than Highland or Tioga will be for Crowell and May, respectively.

Also interesting to note is that the East wins a dominant 70.2% of the time.

The most common final is a rematch of last year’s, May v. Crowell, with Blum v Crowell coming in next. The good news for May is they reach the final 41.1% of the time, which is a very good season. Blum is expected to reach the final about 32.4% of the time.
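Because the table records the round in which each team lost, the chance of reaching a given round is the sum of that column and every column to its right. A small Python sketch using three rows copied from the table above (the function name `reach_pct` is just for this example):

```python
TRIALS = 100_000
ROUNDS = ["BI-DISTRICT", "AREA", "QUARTERS", "SEMIS", "FINALS", "CHAMP"]

# Times eliminated in each round, then titles won (rows from the table above).
rows = {
    "Crowell": [8686, 12684, 25264, 11843, 8426, 33097],
    "May":     [1911, 8450, 19585, 28915, 26882, 14257],
    "Blum":    [9396, 2817, 28089, 27327, 22565, 9806],
}

def reach_pct(counts, round_name):
    """Percent of trials in which a team got at least as far as `round_name`.

    "CHAMP" means the team won the title.
    """
    i = ROUNDS.index(round_name)
    return 100.0 * sum(counts[i:]) / TRIALS

for team, counts in rows.items():
    print(team, round(reach_pct(counts, "FINALS"), 1))
# Crowell 41.5
# May 41.1
# Blum 32.4
```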

Wednesday I will release my UIL DII simulation results (they are already done, but it is my anniversary and we are going out for dinner). I will release the private school results either late Wednesday or early Thursday.

Quick Post on MLB Probabilities (100k Monte Carlo Simulations)

I just did a quick run of 100,000 playoff simulations and wanted to share the quick results. I will try to get some finer detail or maybe look into a few changes, but here are the raw World Series champion results.

Detroit — 4950
Baltimore — 18592
LA Angels — 31876
Kansas City — 9058
Washington — 19768
San Francisco — 4246
St. Louis — 1662
LA Dodgers — 9848

So the Angels win it all 31.9% of the time, with Washington and Baltimore in a tight race for second most.

Oakland, Pittsburgh slight favorites in Wild Card probabilities

With the MLB Playoffs beginning this evening, I figured it was time to test my rankings and pull out the old probability calculator. I created the MLB Ratings based on a simple least squares NLP Optimization that I have discussed before.

Oakland at Kansas City

The Royals are in the playoffs for the first time in ages and they get to host a game. Unfortunately, they didn’t seem to have a home field advantage during the regular season, so I am not sure how much this helps (although in reality we can assume it does, at least a little). The numbers say the A’s are the better team by almost 0.7 of a run (per game, for the season). I show them as a 63.5% favorite.

San Francisco at Pittsburgh

These teams appear to be very evenly matched. On a neutral field, the Giants look to be a 0.15 run favorite. However, this game is not on a neutral field and Pittsburgh has one of the few home field advantages in the playoffs (if we assume the regular season is any indication). This swing makes the Pirates about a 0.215 run favorite tomorrow night, giving them about a 54.3% chance of winning.

Detroit v. Baltimore

Neither team appears to have a home field advantage, so looking at it straight-up, we find that Baltimore looks to be about a 0.4 run favorite (or 57.9%) per game. In a five-game series, the results look like this:

([0.0747, 0.1297, 0.1501], 0.3545, [0.194, 0.2451, 0.2064], 0.6455)

Overall, Baltimore is 64.6% to win the series. The most likely outcome is a Baltimore 3-1 win (24.5%).

Los Angeles v. St. Louis

With neither team holding a home field advantage, the Dodgers look to be about 0.445 runs (or 58.8%) better than the Cards. The five-game series probabilities are:

([0.2033, 0.2512, 0.207], 0.6615, [0.07, 0.1234, 0.1451], 0.3385)

Los Angeles looks about 66.2% to win the series overall. Again, the highest likelihood for an outcome is a 3-1 Dodger win (25.1%).

I will update the probabilities and try to run a Monte Carlo simulation with the data later in the week after we see who wins the Wild Card games.

Generic Sports Series Probability Calculator

With the baseball playoffs upon us, I have decided to start building a simulator to determine series outcomes once they start. I decided to make this as generic as possible. This simulator is not specific to baseball or even to a particular series length.

I addressed the first pieces to think about in my previous post: home field advantage, ratings and the probability a team would win a single game versus a specific opponent.

I will come back to this later in the month, as we get closer to the playoffs and I tie this all together.

Let’s assume for today that we know the probability that Team A will defeat Team B in a single game. Let’s also assume, for simplicity, that this single-game probability remains the same throughout a series, regardless of any possible home field advantage.

Since we are dealing with a single probability and no perceived home field advantage, all we need for inputs are: p(Team A wins a single game), the current series record of the two teams and the number of games needed to win the series (e.g., 1 for a one-game series, 3 for a five-game series and 4 for a seven-game series).

All of my code is listed here on GitHub, https://gist.github.com/sixmanguru

INPUTS
Like I said, let’s keep this simple. Probabilities, current series record, length of series.

seriesProb(.54,0,0,4)

The function call asks for the series probabilities, given that Team A holds a 54% chance to win a single game, the series is just beginning (0-0) and it takes four games to win the series (a seven-game series).

That’s all.

OUTPUT
Here’s the abbreviated output (rounded to four digits).

([0.085, 0.1565, 0.1799, 0.1655], 0.5869, [0.0448, 0.0967, 0.1306, 0.141], 0.4131)

The first list contains the probabilities that Team A wins the series EXACTLY 4-0, 4-1, 4-2 or 4-3. The number trailing is the total probability Team A wins the series.

The second list contains the probabilities Team A loses the series EXACTLY 0-4, 1-4, 2-4, 3-4, with the total probability they lose the series following.
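For readers who want to check these numbers without the gist, here is a minimal closed-form sketch in Python. It is not the code linked above (which builds on a Markov-chain approach); it uses the negative binomial formula instead, and the name `series_prob` is an assumption made for this example. It returns the same four-part shape: Team A's exact-score probabilities, A's total, Team B's exact-score probabilities, B's total.

```python
from math import comb

def series_prob(p, a_wins, b_wins, wins_needed):
    """Closed-form sketch: (A exact, A total, B exact, B total).

    Index k of each list is the loser's final win count (k=1 means 4-1
    in a seven-game series). Assumes a constant single-game probability
    p for Team A and no home field advantage.
    """
    def exact(p_win, need, loser_now):
        probs = []
        for loser_final in range(wins_needed):
            extra = loser_final - loser_now    # games the loser still wins
            if extra < 0:
                probs.append(0.0)              # scoreline no longer possible
            elif need == 0:
                probs.append(1.0 if extra == 0 else 0.0)  # already decided
            else:
                # The winner takes the last game; the loser's `extra` wins
                # fall somewhere among the need - 1 + extra games before it.
                probs.append(comb(need - 1 + extra, extra)
                             * p_win ** need * (1 - p_win) ** extra)
        return probs

    a_exact = exact(p, wins_needed - a_wins, b_wins)
    b_exact = exact(1 - p, wins_needed - b_wins, a_wins)
    return a_exact, sum(a_exact), b_exact, sum(b_exact)

a_exact, a_total, b_exact, b_total = series_prob(0.54, 0, 0, 4)
print([round(x, 4) for x in a_exact], round(a_total, 4))
print([round(x, 4) for x in b_exact], round(b_total, 4))
```

For a series that is already over, this sketch puts all of the probability on the score that actually happened rather than returning a shortened result.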

ALTERNATE EXAMPLES
Let’s assume the only thing you change is the fact that Team A now leads the series 3-0.

seriesProb(.54,3,0,4)

([0.54, 0.2484, 0.1143, 0.0526], 0.9553, [0, 0, 0, 0.0448], 0.0448)

As you can see above, there is no longer any chance for Team B to win the series 4-0, 4-1 or 4-2, and they have only a 4.5% chance to win the series at all. This can be verified by 0.46^4, which is approximately 0.0448.

Now let’s assume that it is a one game series.

seriesProb(.54,0,0,1)

([0.54], 0.54, [0.46], 0.46)

As you can see, it is one game, so the original probabilities are returned.

Finally, as a test, we say Team A trails the series 3-4 in a seven-game series.

seriesProb(.54,3,4,4)

It quickly returns (0,1). It is impossible for Team A to win and certain that Team B will win.

LIMITATIONS
The two biggest limitations still to resolve (assuming you accept the theory that you can assign a single-game probability at all) are the possibility of a home field advantage and how it would play out based on the series’ format (e.g., 2-3-2 vs. 2-2-1-1-1 and such).

Lastly, I would like to thank Jeff Sackmann, the author of Tennis Abstract and several other endeavors. His original Python code for simulating a tennis match was the foundation for this project. His Python code for tennis Markov Chains can be found here, http://summerofjeff.wordpress.com/2011/01/13/python-code-for-tennis-markov/

MLB Home Field Advantage this season

Honestly, it is hard to get fired up about the MLB Playoffs these days as a Houston Astros fan. But I figure it may be a way to test a few models and work on my programming.

After scrubbing the internet for scores, I decided to do a simple non-linear programming model to create some rankings. If you want to read more about NLP Optimization, please read my earlier posts I ran during last year’s NFL season.

I tried to apply home field advantage as a single term, but found there wasn’t a generic home field advantage as in football. I then decided to try to determine if each team’s individual HFA would have any effect on the ratings. With so many more games, this number had a better likelihood of showing some importance.

In general, the average score of a MLB game this year has been 4.11-4.09 in favor of the home team.

When you look at individual HFA, results are pretty amazing. As expected, the Colorado Rockies get almost a run and a half (1.47) bump at home. The Rockies are a solid 19 games better at home.

Next on the list are the Florida Marlins. First off, does anyone really call them the Miami Marlins? The Marlins have a little over a run per game advantage at home (1.14). Like the Rockies, they appear to be out of the hunt for the playoffs.

The team most likely to be able to take advantage of the home field advantage in the playoffs appears to be the Oakland A’s, who are more than 3/4 of a run (0.76) better at home. The A’s have nine more games at home in the regular season. Also, they get to finish the season at Texas, who rate only slightly ahead of Colorado, Arizona and Miami as the worst teams in baseball. The Rangers have no effective HFA either.

Washington (0.429), Pittsburgh (0.333) and Atlanta (0.154) are the only other teams in the playoff picture with significant home field advantages.

Here is a list of the current home field advantages. Those not listed have no significant HFA (0).

Team HFA
COL 1.473564
MIA 1.14011
OAK 0.760433
SDP 0.704609
WSN 0.429156
PIT 0.33299
CHC 0.25553
CIN 0.239817
TBR 0.210181
ATL 0.153853
PHI 0.052209
TOR 0.016576

Here are the current team ratings, as we head into the final few games of the season.

Team Rating
LAA 5.137348
SEA 4.943415
OAK 4.925794
BAL 4.771614
WSN 4.513368
DET 4.462223
LAD 4.363542
SFG 4.326927
KCR 4.305852
TOR 4.229679
CLE 4.184073
TBR 4.164056
NYM 4.038186
STL 4.031671
NYY 3.997041
ATL 3.988839
PIT 3.944171
MIL 3.930484
MIN 3.825875
CIN 3.783494
HOU 3.759729
BOS 3.720102
PHI 3.676738
CHW 3.599442
CHC 3.442758
SDP 3.341322
TEX 3.337711
MIA 3.30692
ARI 3.278469
COL 2.669157

The next step in the coming weeks will be to use the rating and home field advantage numbers to create a simulation of the playoffs.

2014 US Open Men’s Draw Simulation

The U.S. Open main draw begins this morning and for the fourth year in a row, I will not be able to attend. Gone are the good ol’ days of working for the USTA and getting to take the trip up to New York to take it all in.

Since I cannot go, I decided to utilize Markov Chain models and Monte Carlo simulations to predict who will win.

Markov models for tennis essentially take some initial inputs and use them to play out an entire match, giving you the probability that player A beats player B. A Monte Carlo simulation runs an entire tournament this way, over and over. Even if you can do the math, one of the most difficult parts is creating the initial inputs for the Markov model.
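To give a feel for the Markov chain idea, here is a small Python sketch of its lowest level: the probability that the server wins a single game, given the probability of winning a point on serve. It is an illustration only, not the code used for this post, and the name `game_win_prob` is an assumption for the example. A full model repeats the same trick for tiebreaks, sets and the match.

```python
from functools import lru_cache

def game_win_prob(p):
    """Probability the server wins a game, given point-win probability p."""
    q = 1.0 - p
    # From deuce, the server must win two points in a row before the
    # returner does; the loop sums to this closed form.
    deuce = p * p / (p * p + q * q)

    @lru_cache(maxsize=None)
    def state(s, r):
        """Win probability from a score of s server points, r returner points."""
        if s == 4 and r <= 2:
            return 1.0          # game won before deuce
        if r == 4 and s <= 2:
            return 0.0          # game lost before deuce
        if s == 3 and r == 3:
            return deuce
        return p * state(s + 1, r) + q * state(s, r + 1)

    return state(0, 0)

print(round(game_win_prob(0.62), 4))
```

Small edges on each point compound quickly: a server who wins 60% of points wins roughly 74% of games.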

MY METHODOLOGY
I decided to experiment with an idea that begins with something I read in the PhD dissertation Dr. Kamran Aslam wrote at USC. Dr. Aslam and his advisor, Dr. Paul K. Newton, published portions of this work several times, including in the Journal of Quantitative Analysis in Sports back in 2009.

Dr. Aslam’s idea is to start by finding the overall mean probability of winning a point while returning. This is defined as the returning average of ‘the field’. Let’s say this is 0.330. Then, if Roger Federer is playing Novak Djokovic and Roger’s average ability to win a point returning is 0.40, then he is 0.07 better than ‘the field’. If Novak’s average is 0.41, then he is 0.08 better than ‘the field’.

Then, if the percentage of points Roger wins on serve is 0.70, you subtract Novak’s ability ‘above the field’ (0.08), making Roger’s effective serving percentage 0.62.

Likewise, if Novak’s serving percentage is 0.68, then his effective serving percentage is 0.61. Therefore, the input to the program would be 0.62 and 0.39 for Roger (one minus Novak’s effective serving percentage). If you ran this for Novak, the inputs would be 0.61 and 0.38.
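The adjustment above is simple enough to write down directly. A short Python sketch using the numbers from this example (the function name `markov_inputs` and its argument order are assumptions for illustration):

```python
def markov_inputs(serve_a, return_a, serve_b, return_b, field_return):
    """Return (A's effective serve pct, A's effective return pct) against B."""
    # Each server is penalized by how far the opponent returns above the field.
    eff_serve_a = serve_a - (return_b - field_return)
    eff_serve_b = serve_b - (return_a - field_return)
    # A's return percentage is whatever B fails to win on serve.
    return eff_serve_a, 1.0 - eff_serve_b

roger = markov_inputs(0.70, 0.40, 0.68, 0.41, 0.330)
novak = markov_inputs(0.68, 0.41, 0.70, 0.40, 0.330)
print([round(x, 2) for x in roger])  # [0.62, 0.39]
print([round(x, 2) for x in novak])  # [0.61, 0.38]
```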

Modifications
To get the data, I scraped all serving and receiving stats from the ATP website for each player in the draw. I also decided to scale the data.

Scaling
Using only hard court results for the 2014 season, I scaled the data based on the level of competition. This allowed me to include all Challenger data as well as ATP-level data, both of which are available on the ATP site. If an opponent was inside the top 64, no scaling was done. If the opponent was ranked between 65 and 128, I scaled it down by 1.5%. If the opponent was ranked between 129 and 192, I scaled it another 1.5%. I scaled it another 1.5% for opponents ranked 193-256 and another 1.5% for those ranked below 256.

For some matches, the opponent’s ranking is listed as N/A. In those cases, the scaling was done based on the player’s own ranking, which seemed to be close enough to the actual ranking, except in a few instances.
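The tiers above can be sketched as a small Python function. Two things here are assumptions rather than anything stated in the post: each "another 1.5%" is applied as a compounding multiplier (it could equally be read as adding 1.5 percentage points each time), and the name `scale_factor` is made up for the example.

```python
def scale_factor(opp_rank, own_rank, step=0.015):
    """Multiplier for one match's serve and return percentages.

    opp_rank: opponent's ranking, or None when it is listed as N/A.
    own_rank: the player's own ranking, used as the fallback.
    """
    rank = own_rank if opp_rank is None else opp_rank
    # Ranks 1-64 -> 0 tiers, 65-128 -> 1, 129-192 -> 2, 193-256 -> 3, 257+ -> 4.
    tiers = min(4, max(0, (rank - 1) // 64))
    return (1.0 - step) ** tiers

print(scale_factor(30, 68))    # top-64 opponent: no scaling
print(scale_factor(None, 68))  # N/A opponent: falls back to the No. 68 ranking
```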

Scaling this way may not be the best solution, but this is a solid starting point.
I then found ‘the field’ by averaging the scaled percentages of all players in the tournament. Five players have not played on the hard courts yet this season, so I removed them when calculating the field. Also, rather than placing zeroes in the data for them, I substituted numbers slightly below the averages for both serving and receiving.

In future versions, I may substitute scaled, full-season statistics, irrespective of surface, for these players.

Noah Rubin
Then there was the case of Noah Rubin, who only had one hard court match, where he had some pretty good numbers, despite losing last week in Winston-Salem. In this case, I decided to manually modify his percentages down closer to the five I had to manually enter who had not played a single hard court match.

Shortcomings
Most of the problems come from too little data on some players. Some of this can be handled by using more stringent scaling for Challenger-level matches. Two of the most noticeable are Gilles Muller and Jared Donaldson. Muller won the Guadalajara Challenger and none of his opponents’ rankings are listed in the data, so they were only scaled per his No. 68 ranking. Donaldson also had a lot of Challenger results that were not scaled sufficiently.

Coding
Jeff Sackmann at tennisabstract.com published some Python code to run the Markov Models a few years ago (here’s a link to his 2014 predictions, which you may like more than mine). He uses similar inputs and generates the probability that player A wins the match. I modified Jeff’s code for my purposes, then wrapped it within a Monte Carlo simulation and ran it 50,000 times.

I am not posting my entire code just yet on github, but hope to soon. I need to refine my entire process, soup-to-nuts, before I feel comfortable with that.

THE RESULTS
The table below shows how far each player advances. For instance, Roger Federer lost 1802 out of 50,000 trials in the first round, but won the tournament 16895 times.

Federer seems to be the biggest winner here with Rafael Nadal out. I know this isn’t perfect, but it is a good start and something to work with moving forward. There are some basic assumptions I make and some data that needs refining, but overall I am satisfied with the outcome.

PLAYER R1 R2 R3 R16 Q S F W PCT
Roger-Federer 1802 1399 4752 5083 5099 8057 6913 16895 33.8%
Tomas-Berdych 4984 1008 5294 6528 7710 11934 5119 7423 14.8%
Novak-Djokovic 244 18702 4269 3604 5764 4425 5658 7334 14.7%
Andy-Murray 795 7560 7183 7213 13748 5073 4877 3551 7.1%
Gilles-Muller 5497 25948 3678 3013 3864 2726 2705 2569 5.1%
Milos-Raonic 3273 16891 7391 7462 5420 5007 2704 1852 3.7%
Stan-Wawrinka 10010 5230 11653 5295 7827 5558 2753 1674 3.3%
Kei-Nishikori 9637 4005 2004 16234 7388 6449 2777 1506 3.0%
David-Ferrer 6078 14349 3023 8772 9880 5189 1576 1133 2.3%
Blaz-Kavcic 6213 9611 16077 5177 6755 3917 1524 726 1.5%
Marin-Cilic 17736 6164 6662 8257 6488 3217 905 571 1.1%
Peter-Gojowczyk 6533 24512 5850 5298 3446 2674 1152 535 1.1%
Jared-Donaldson 21742 3033 10349 5732 6508 1532 677 427 0.9%
David-Goffin 2882 3878 15412 13738 10691 2144 836 419 0.8%
Roberto-Bautista-Agut 9661 6577 12280 16611 2253 1578 685 355 0.7%
Adrian-Mannarino 9823 8095 13559 14600 1878 1270 528 247 0.5%
Paolo-Lorenzi 4830 17542 13070 6225 6295 1285 515 238 0.5%
Simone-Bolelli 12534 12440 6899 10541 4619 2052 681 234 0.5%
Facundo-Bagnis 22299 5898 6488 11127 2308 1093 590 197 0.4%
Bernard-Tomic 16933 18795 2488 5200 4245 1712 432 195 0.4%
Igor-Sijsling 10360 6161 21214 6118 3278 2052 626 191 0.4%
Ernests-Gulbis 10555 16235 7808 10532 2450 1743 487 190 0.4%
Gael-Monfils 28258 2975 8844 4478 4152 860 292 141 0.3%
Dominic-Thiem 13221 16989 7085 8985 1991 1272 317 140 0.3%
Ivo-Karlovic 12126 5162 26714 3018 1564 938 342 136 0.3%
Jo-Wilfried-Tsonga 20942 8099 8426 7785 3365 856 395 132 0.3%
Richard-Gasquet 13310 18723 9403 4126 3460 664 222 92 0.2%
Yen-Hsun-Lu 15007 14542 16524 1747 1306 530 254 90 0.2%
Benoit-Paire 20522 10624 8481 6731 2643 642 277 80 0.2%
Grigor-Dimitrov 15605 14089 10471 5488 3458 618 197 74 0.1%
Philipp-Kohlschreiber 27701 5789 5829 8094 1545 654 317 71 0.1%
Marcos-Baghdatis 32264 5563 4588 4262 2312 804 148 59 0.1%
John-Isner 14229 12162 12604 8587 1547 599 217 55 0.1%
Kevin-Anderson 3544 18025 16794 7219 3297 922 151 48 0.1%
Juan-Monaco 29058 7290 6559 4912 1679 336 124 42 0.1%
Sam-Querrey 2888 23684 19638 1877 1236 444 192 41 0.1%
Alexander-Kudryavtsev 24959 5369 15596 2145 1205 570 117 39 0.1%
Bradley-Klahn 21459 11049 12642 2322 1899 430 162 37 0.1%
Evgeny-Donskoy 25041 5464 15495 2100 1199 560 116 25 0.1%
Steve-Johnson 22309 10788 9442 5674 1130 539 93 25 0.1%
Dudi-Sela 751 25740 14179 6185 2661 380 84 20 0.0%
Tommy-Robredo 19699 17062 5206 5582 1758 570 106 17 0.0%
Radek-Stepanek 11794 30753 3478 2102 1464 282 111 16 0.0%
Andreas-Beck 2542 27796 11149 6347 1739 323 89 15 0.0%
Sergiy-Stakhovsky 15749 11889 12577 7060 2052 550 109 14 0.0%
Julien-Benneteau 29478 9134 6133 3817 1163 200 61 14 0.0%
Wayne-Odesnik 40363 3301 1383 3746 849 295 54 9 0.0%
James-McGee 10881 25214 7925 4574 1153 192 52 9 0.0%
Andrey-Kuznetsov 28541 9826 9132 1393 891 159 49 9 0.0%
Jiri-Vesely 39990 3818 3995 1132 811 202 44 8 0.0%
Tatsuma-Ito 27691 9767 7701 3842 650 298 43 8 0.0%
Lleyton-Hewitt 45016 1036 2024 1202 499 183 34 6 0.0%
Ivan-Dodig 21405 16081 8031 3614 590 241 32 6 0.0%
Mikhail-Youzhny 14304 19251 10826 4501 894 191 27 6 0.0%
Marco-Chiudinelli 20849 21573 4124 2442 820 164 22 6 0.0%
Dustin-Brown 33067 12283 1366 1966 1030 242 41 5 0.0%
Jan-Lennard-Struff 21002 16351 8368 3712 434 111 17 5 0.0%
Blaz-Rola 23408 15137 9150 1331 812 118 40 4 0.0%
Gilles-Simon 10414 5720 27011 4809 1745 265 32 4 0.0%
Thomaz-Bellucci 17702 25481 4806 1182 643 167 15 4 0.0%
Feliciano-Lopez 28595 13364 5638 2021 284 85 9 4 0.0%
Jeremy-Chardy 22487 19476 5697 1275 794 223 45 3 0.0%
Fernando-Verdasco 26592 13988 7748 1047 529 77 17 2 0.0%
Jerzy-Janowicz 25303 14188 7428 2276 656 132 15 2 0.0%
Ryan-Harrison 34395 9438 4306 1399 417 34 9 2 0.0%
Dusan-Lajovic 24697 14643 7500 2296 715 128 20 1 0.0%
Alejandro-Falla 27513 16945 4151 827 449 97 17 1 0.0%
Edouard-Roger-Vasselin 30301 13073 3280 2566 648 120 11 1 0.0%
Jack-Sock 17867 26513 1780 3228 486 114 11 1 0.0%
Fabio-Fognini 15873 23836 7021 2983 230 47 9 1 0.0%
Sam-Groth 16829 31352 1087 534 148 41 8 1 0.0%
Guillermo-Garcia-Lopez 34993 9023 5491 333 124 27 8 1 0.0%
Illya-Marchenko 29151 16700 2530 1245 321 47 5 1 0.0%
Victor-Estrella-Burgos 39640 4215 5357 598 145 39 5 1 0.0%
Kenny-De-Schepper 39445 7622 1860 930 112 26 4 1 0.0%
Mikhail-Kukushkin 28998 13406 5509 1915 134 34 3 1 0.0%
Andreas-Seppi 34251 8382 5378 1699 261 27 1 1 0.0%
Matthias-Bachinger 38206 11021 552 173 41 5 1 1 0.0%
Daniel-Gimeno-Traver 9294 29670 6064 4411 431 97 33 0 0.0%
Vasek-Pospisil 37466 7425 2746 1885 410 58 10 0 0.0%
Lukas-Rosol 3411 36289 9252 834 174 33 7 0 0.0%
Lukas-Lacko 36779 9154 2435 1390 181 57 4 0 0.0%
Paul-Henri-Mathieu 44503 5116 256 70 41 10 4 0 0.0%
Tobias-Kamke 9550 7288 27984 4629 464 82 3 0 0.0%
Marinko-Matosevic 48198 762 571 318 113 35 3 0 0.0%
Tim-Smyczek 20643 22118 5015 2062 125 34 3 0 0.0%
Marcos-Giron 35771 8081 4583 1414 124 24 3 0 0.0%
Pere-Riba 40177 4868 3526 1305 104 17 3 0 0.0%
Teymuraz-Gabashvili 22253 21265 5983 390 91 15 3 0 0.0%
Andreas-Haider-Maurer 40339 4504 3543 1481 108 23 2 0 0.0%
Filip-Krajinovic 29357 16801 2890 900 40 10 2 0 0.0%
Donald-Young 43787 3968 1813 305 106 20 1 0 0.0%
Denis-Istomin 36690 9754 2577 701 260 17 1 0 0.0%
Jarkko-Nieminen 37874 4004 7699 332 73 17 1 0 0.0%
Nicolas-Mahut 32298 15471 1808 306 102 14 1 0 0.0%
Alejandro-Gonzalez 24004 22641 2772 478 101 3 1 0 0.0%
Andrey-Golubev 34127 13201 2166 478 25 2 1 0 0.0%
Yoshihito-Nishioka 45170 3981 714 113 21 0 1 0 0.0%
Damir-Dzumhur 43922 4573 607 655 215 28 0 0 0.0%
Benjamin-Becker 43467 5726 551 194 54 8 0 0 0.0%
Pablo-Andujar 32133 16181 760 850 70 6 0 0 0.0%
Marcel-Granollers 12044 29804 7917 210 19 6 0 0 0.0%
Martin-Klizan 12104 35990 1429 390 82 5 0 0 0.0%
Nick-Kyrgios 35696 10478 3088 667 66 5 0 0 0.0%
Santiago-Giraldo 27747 17902 4055 243 48 5 0 0 0.0%
Frank-Dancevic 17016 28639 3479 755 108 3 0 0 0.0%
Taro-Daniel 46727 2871 309 78 13 2 0 0 0.0%
Dmitry-Tursunov 25996 21351 2271 317 64 1 0 0 0.0%
Aleksandr-Nedovyesov 39119 9397 1234 239 10 1 0 0 0.0%
Niels-Desein 47118 1643 1091 139 8 1 0 0 0.0%
Radu-Albot 39586 4073 6027 279 35 0 0 0 0.0%
Leonardo-Mayer 9476 29457 10544 512 11 0 0 0 0.0%
Albert-Ramos-Vinolas 33171 16487 255 76 11 0 0 0 0.0%
Federico-Delbonis 23729 20814 5281 167 9 0 0 0 0.0%
Noah-Rubin 26271 19393 4197 131 8 0 0 0 0.0%
Matthew-Ebden 40450 4523 4807 213 7 0 0 0 0.0%
Joao-Sousa 32984 15840 1045 125 6 0 0 0 0.0%
Michael-Llodra 40706 8643 555 93 3 0 0 0 0.0%
Robin-Haase 49205 666 115 11 3 0 0 0 0.0%
Pablo-Cuevas 46456 3144 374 24 2 0 0 0 0.0%
Steve-Darcis 37896 11966 124 14 0 0 0 0 0.0%
Jurgen-Melzer 37956 11030 1005 9 0 0 0 0 0.0%
Albert-Montanes 40524 8732 738 6 0 0 0 0 0.0%
Pablo-Carreno-Busta 47458 2446 93 3 0 0 0 0 0.0%
Diego-Schwartzman 49756 234 8 2 0 0 0 0 0.0%
Maximo-Gonzalez 47112 2751 136 1 0 0 0 0 0.0%
Carlos-Berlocq 49249 733 17 1 0 0 0 0 0.0%
Borna-Coric 46589 3335 76 0 0 0 0 0 0.0%

All the data you need to predict World Cup games is at the World Bank

Forget massive mixed models, evaluating world-wide player and team data. Forget checking historical World Cup data. The only data you need to predict World Cup winners comes from a single source — The World Bank.

Yep, that’s right. Let’s Keep It Simple, Stupid and take GDP (Gross Domestic Product) growth since the last World Cup in 2010. Honestly, they do not even have all of that, so we will take the growth in 2010, 2011 and 2012.

So far, using these three growth figures, the method is a perfect 5-0 through the early Saturday game.

The Data:

Change in GDP for first five World Cup countries

Change in GDP for first five World Cup games

Scores:
Brazil 3 Croatia 1 (12.35 v -3.95)
Mexico 1 Cameroon 0 (13.57 v 12.48)
Netherlands 5 Spain 1 (1.99 v -2)
Chile 3 Australia 1 (19.1 v 7.16)
Colombia 3 Greece 0 (15.73 v -16.95)
So there you go. Five-for-Five.

So what’s the outlook for the US versus Ghana?

Ghana’s GDP % increase for the last three recorded periods: 8, 15 and 7.9, which compounds to an incredible total increase of 34%.

The US’s GDP % increase: 2.5, 1.8 and 2.8, which compounds to a total of 7.27%.

uh-oh.
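
For anyone who wants to check the arithmetic: the totals quoted for Ghana and the US look like compounded growth rather than a straight sum (8 + 15 + 7.9 would only be 30.9). Here is a minimal Python sketch, assuming that is how the totals were built. The function name `compound_growth` is made up for illustration, and the "pick the bigger number" rule is just the tongue-in-cheek method from this post.

```python
def compound_growth(yearly_pct):
    """Compound a list of yearly percentage changes into one total percentage."""
    total = 1.0
    for pct in yearly_pct:
        total *= 1.0 + pct / 100.0
    return (total - 1.0) * 100.0


# Yearly GDP growth figures quoted in the post (2010, 2011, 2012).
ghana = compound_growth([8, 15, 7.9])
usa = compound_growth([2.5, 1.8, 2.8])

print(f"Ghana: {ghana:.2f}%  US: {usa:.2f}%")
# The "model": whoever grew more wins.
print("GDP pick:", "Ghana" if ghana > usa else "US")
```

The compounded figures round to the 34% and 7.27% quoted for the two countries.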

Predicting Federer-Tursunov and other Friday French Open Matches Using Markov Chain

Today I was enamored with the FiveThirtyEight.com article, Inside the Shadowy World of High-Speed Tennis Betting. The article mentions the courtsiders who would sit courtside at a tennis match and try to relay information to betting partners faster than the tournament computers could. Great read. Not sure these courtsiders were really doing anything illegal.

Buried deep in the article was a mention of the system this one organization created to predict the outcome of tennis matches for betting purposes. It links to a website, Summer of Jeff, and a post, Python Code for Tennis Markov. If you follow the links to the GitHub site, there is some pretty elaborate Python code for generating probabilities based on Markov chain theory. The code is pretty easy to use if you understand Python and statistics, although it needs some cleaning up if you plan on using it to predict an entire match (hint: the matchProbs function needs some fixes to run).

The biggest issue is determining the initial probabilities. You need to estimate each server’s probability of winning a point.

To do this, I hit the trusty ATPworldtour.com website and pulled that information up.

FEDERER-TURSUNOV
For the year, Roger Federer has won 90% of his service games but only 70% of his service points. On clay this season, he is at 89% and 67%. On the other hand, Dmitry Tursunov has won 22% of return games and 36% of return points; on clay he is at 24% and 37%. Assuming most of these results came against ‘inferior’ players, we might expect the two sets of numbers to regress toward each other: Tursunov’s 37% on return implies a typical server wins 63% of points against him, while Federer’s own clay number is 67%. I am going to say that Federer is likely to win 65% of his service points. One down.

Now when Tursunov serves, he has won 75% of service games and 61% of service points, 70% and 60% on clay. Federer has won 29% of return games and 41% of return points, 27% and 40% on clay. That works out quite nicely to 60-40, so Federer’s return probability will be 40%.
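
A quick sanity check on why a point percentage is all the Markov chain needs: if every point on serve is an independent trial with the same win probability, the chance of holding serve has a closed form. The Python sketch below is my own illustration of that standard formula, not the Summer of Jeff code, and `hold_prob` is a made-up name.

```python
def hold_prob(p):
    """Chance the server wins a game, given chance p of winning each point."""
    q = 1.0 - p
    # Win to love, 15 or 30: server takes 4 points, returner takes 0, 1 or 2.
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (three points each), then win two in a row before the returner does.
    reach_deuce = 20 * p**3 * q**3
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce


# Season-long service point percentages quoted above.
for name, p in [("Federer", 0.70), ("Tursunov", 0.61)]:
    print(f"{name}: {p:.0%} of service points -> {hold_prob(p):.1%} of service games")
```

Feeding in the season numbers gives roughly 90% of service games for Federer and roughly 76% for Tursunov, close to the 90% and 75% the ATP site actually shows, so the independent-points assumption is not crazy.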

Plugging this into the handy code mentioned above, we get that Federer is a 78.5% favorite to win tomorrow.
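
If you do not want to wrestle with the GitHub code, the same idea can be sketched in a few lines. This is a minimal stand-in, not the Summer of Jeff implementation, so it will not necessarily reproduce the exact figure above. Assumptions, all mine: every point is independent, the first player serves first in every set, every set (including a fifth) ends in a tiebreak at 6-6, and the match is best of five. All function names are hypothetical.

```python
from functools import lru_cache


def hold_prob(p):
    """Chance the server wins a game, given chance p of winning each point."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**3 * q**3 * p**2 / (p**2 + q**2)


def tiebreak_prob(pa, pb):
    """Chance A wins a first-to-7 tiebreak when A serves the first point."""
    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            # From here each pair of points has one serve apiece; win both to finish.
            both = pa * (1 - pb)
            neither = (1 - pa) * pb
            return both / (both + neither)
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        # Serve pattern: A, B, B, A, A, B, B, ...
        a_serves = ((a + b + 1) // 2) % 2 == 0
        p = pa if a_serves else 1 - pb
        return p * win(a + 1, b) + (1 - p) * win(a, b + 1)
    return win(0, 0)


def set_prob(pa, pb):
    """Chance A wins a tiebreak set when A serves the first game."""
    hold_a, hold_b = hold_prob(pa), hold_prob(pb)
    tiebreak = tiebreak_prob(pa, pb)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            return tiebreak
        if a >= 6 and a - b >= 2:
            return 1.0
        if b >= 6 and b - a >= 2:
            return 0.0
        # A serves the even-numbered games (counting from zero).
        p = hold_a if (a + b) % 2 == 0 else 1 - hold_b
        return p * win(a + 1, b) + (1 - p) * win(a, b + 1)
    return win(0, 0)


def match_prob(pa, pb, sets_to_win=3):
    """Chance A wins the match; pa and pb are each player's service point win rate."""
    s = set_prob(pa, pb)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == sets_to_win:
            return 1.0
        if b == sets_to_win:
            return 0.0
        return s * win(a + 1, b) + (1 - s) * win(a, b + 1)
    return win(0, 0)


# Federer serving at 65%, Tursunov serving at 60% (Federer returning at 40%).
print(f"Federer to win: {match_prob(0.65, 0.60):.1%}")
```

Each level (game, tiebreak, set, match) is its own little Markov chain, and the recursion just walks the scoreboard states. One property worth knowing: when both players have the same service point percentage, this model gives exactly a coin flip.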

TSONGA-JANOWICZ
Jo-Wilfried Tsonga has won 68% of service points, 65% on clay, while Jerzy Janowicz has won 34% of return points all season and an improved 36% on clay. What is crazy about this is you might suggest that Janowicz is a better clay court than hard court player. Well, amazingly, he had not won a single clay court match this spring before winning his first two rounds at Roland Garros. Oh well. I am still going to give him the benefit of the doubt and put Tsonga at 65% to win a point on serve.

Returning, Tsonga has been at 34% for the year and 35% on clay, while Janowicz has won 62% on serve and 68% on clay. Again, Janowicz’s stats are much better on the terre battue. I am going to just split this straight and leave Tsonga’s return percentage at 34%.

We all know the French crowd will be pulling for their man, so that may be the edge. However, the stats say that Janowicz looks to be a slight favorite at 56.1%. Moving Tsonga’s serve percentage up just a point makes this a dead heat.

THE ODDS
Looking at the odds at SportsBook.com, Federer is -2500, so that’s a ridiculous bet, but Janowicz is actually +325 v. Tsonga, so that may be worth a play. I hope to look into this more as the tournament progresses.
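
To put numbers on "ridiculous" and "worth a play", a moneyline can be turned into the break-even win probability and compared with the model. A small Python sketch using the standard American odds conversion; the function names are made up, the bookmaker's margin is ignored, and the model probabilities are simply the 78.5% and 56.1% from above taken at face value.

```python
def implied_prob(moneyline):
    """Break-even win probability for an American moneyline."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def expected_profit(model_prob, moneyline):
    """Expected profit per 1 unit staked, if the model probability is right."""
    payout = 100 / -moneyline if moneyline < 0 else moneyline / 100
    return model_prob * payout - (1 - model_prob)


for name, line, prob in [("Federer", -2500, 0.785), ("Janowicz", 325, 0.561)]:
    print(f"{name}: implied {implied_prob(line):.1%}, model {prob:.1%}, "
          f"EV {expected_profit(prob, line):+.2f} per unit")
```

At -2500 the book needs Federer to win about 96% of the time to break even, far above the model's number, while +325 only needs Janowicz to win about 24% of the time.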