Software update with new split EVs

ericfarmer

New Member
With a lot of useful input from Colin, I am working on revising my CA to (1) provide more options, and (2) be more accurate with splits. As an initial step in that direction, I have an updated release in the usual locations:

1. A zip archive with pre-compiled executables for Windows on my web site:
https://sites.google.com/site/erfarmer/downloads

2. The latest 5.3 tag of the code repository at:
https://bitbucket.org/possiblywrong/blackjack/

The most visible change to most of you is the option in the strategy calculator to compute split EVs allowing strategy to differ "post-split". (See the first few posts in this recent thread: http://www.blackjackinfo.com/bb/showthread.php?t=22385) Years ago I had provided code snippets that you could plug in to do this, but it is now a more easily configurable run-time option.

You can now more easily generate split EVs that agree with other reference sources (Cacarulo, etc.)-- at least for no resplitting; that is my next project :). (I seem to be in the minority with the opinion that "pre-split" is still an interesting and more practically realizable specification of strategy... that is actually slightly more of a pain to compute. But I left both in as options, so you can choose what you like.)

There are other changes that are probably less immediately interesting, such as configuring the payoff for blackjack (previously hard-wired at 3:2), and more cleanly specifying "mixed" strategies that are a combination of things like "always stand on 10,2" and "do whatever maximizes EV." I also did a more thorough code review after Colin had identified several potential divide-by-zero bugs for "pathological" shoes.
 

k_c

Well-Known Member
ericfarmer said:
With a lot of useful input from Colin, I am working on revising my CA to (1) provide more options, and (2) be more accurate with splits. As an initial step in that direction, I have an updated release in the usual locations:

1. A zip archive with pre-compiled executables for Windows on my web site:
https://sites.google.com/site/erfarmer/downloads

2. The latest 5.3 tag of the code repository at:
https://bitbucket.org/possiblywrong/blackjack/

The most visible change to most of you is the option in the strategy calculator to compute split EVs allowing strategy to differ "post-split". (See the first few posts in this recent thread: http://www.blackjackinfo.com/bb/showthread.php?t=22385) Years ago I had provided code snippets that you could plug in to do this, but it is now a more easily configurable run-time option.

You can now more easily generate split EVs that agree with other reference sources (Cacarulo, etc.)-- at least for no resplitting; that is my next project :). (I seem to be in the minority with the opinion that "pre-split" is still an interesting and more practically realizable specification of strategy... that is actually slightly more of a pain to compute. But I left both in as options, so you can choose what you like.)

There are other changes that are probably less immediately interesting, such as configuring the payoff for blackjack (previously hard-wired at 3:2), and more cleanly specifying "mixed" strategies that are a combination of things like "always stand on 10,2" and "do whatever maximizes EV." I also did a more thorough code review after Colin had identified several potential divide-by-zero bugs for "pathological" shoes.
I made some changes to Eric's version 5.3 to compute resplits more accurately using the program's existing "pre-split" fixed strategy. The method I used is recursive and is pretty accurate. It gives marginally different results than the method I presently use in my programs but it is close enough that there appears to be virtually no difference in the overall EVs generated by using the fixed "pre-split strategy. The program now computes 1, 2, and 3 splits. It still uses the program's existing interface to output the split EV, though. If valueSplit[pairCard][upCard] were changed to valueSplit[numSplits][pairCard][upCard] then the values could be saved and referenced.

I've opened a Bitbucket account and uploaded 2 sets of files for version 5.3, VS2010 and Vc++ 6. However, I can't seem to get the site to work like it's supposed to. When I cd to files on my computer and try hg clone... I get the following - 'hg' is not recognized as an internal or external command,
operable program or batch file.

I don't know how to make the files downloadable to others.
 

assume_R

Well-Known Member
k_c said:
When I cd to files on my computer and try hg clone... I get the following - 'hg' is not recognized as an internal or external command,
operable program or batch file.

I don't know how to make the files downloadable to others.
You have to install mercurial. If you use linux or freebsd, install the "mercurial" package. In Windows, you can use tortoisehg which interfaces with Windows Explorer. Not sure about Mac.
 

MGP

Well-Known Member
If you need values to compare to you can check them against my calculator. The split values are exact given a fixed strategy post-split and you can assign any strategy you want and only apply it post-split if you want or both pre and post.

The CDZ ones automatically use the same CD pre-split strategy post-split. If you want the strategy to vary post-split just based on the number of paircards removed you can compare it to the CDP straetgy.
 

ericfarmer

New Member
k_c said:
I made some changes to Eric's version 5.3 to compute resplits more accurately using the program's existing "pre-split" fixed strategy. The method I used is recursive and is pretty accurate. It gives marginally different results than the method I presently use in my programs but it is close enough that there appears to be virtually no difference in the overall EVs generated by using the fixed "pre-split strategy.
First, thanks very much to k_c for sharing your changes with me. Although I ended up taking a rather different approach, you (1) motivated me to revisit my resplitting algorithm, and (2) made the key observation that I had missed all this time about how to accurately compute resplits within the "framework" of my CA. (More on this shortly.)

The result is an update that-- I think-- now accurately computes CDZ- or CDP for splitting up to a maximum of 2, 3, or 4 hands. The new version 5.5 is available pre-built for Windows or via Bitbucket, at the two links at the top of this thread.

Also, thanks very much to MGP as well, for your recently available CA that proved very useful in testing. I encountered some interesting discrepancies between my results and those reported elsewhere, including k_c's changes to my code, as well as Cacarulo's tables from the old bjmath.com (http://www.bjmath.com/bjmath/ev/ev.htm (Archive copy)). Your CA was very handy for verifying some test cases, particularly a few very small shoes that I analyzed via brute-force.

k_c, I have not tracked down yet what is the root cause of the differences in your results, but I think I can identify where the issues occur. The problem seems to arise whenever there are more pair cards in the shoe than the maximum number of split hands. For example, splitting 10s for any number of decks, or splitting any pair with two or more decks, appears to generate different results.

(For anyone interested, many years ago I wrote up a rather terse description of how my algorithm now looks, at the link below. The idea was to aggregate the EV in terms of a relatively small number of parameterized EVs, conditioned on removing some number of pair cards and non-pair cards from the shoe. I never bothered to implement this, because it seemed difficult-- within my framework-- to compute the values conditioned on removing unspecified *non-pair* cards. But k_c's recent approach had the key new ingredient: that you can recursively evaluate those conditional EVs, eventually in terms of EVs with only *pair cards* removed, which is more easily do-able in my CA. The math involved is at:

http://sites.google.com/site/erfarmer/downloads/blackjack_split.pdf)
 

MGP

Well-Known Member
Glad it's helping. I really wish the split values from the old BJMath would disappear. Cacarulo and I got the exact values (he beat me by a hair) after a discussion started there and Cacarulo had them published in BJA3. The old values only cause confusion.
 

iCountNTrack

Well-Known Member
A feeble effort compared to MGP's

I know it looks pathetic when compared to MGP's CA, but it does have a couple of extra things mainly calculating probabilities of the net outcome on the round, variance on the round, and ev calculations using post-split optimal strategies.
for 3 splits ( split to 4 hands) it runs really slow for a full deck, it will take 2-3 days of computational time if your computer is robust enough to handle it, 2 splits (split to 3 hands) are calculated within 1-2 hours with a moderately powerful computer. You can also deplete the deck to half or less for faster results.


I am working on a faster multi-threaded version to get some reasonable execution time.
the code was compiled with VS 2010 so you might get an error about some missing libraries. Best is to install Microsoft Visual C++ 2010 Redistributable Package (x86)
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=5555

PM me for password to open rar file.
(Dead link: http://www.mediafire.com/?xt38dyik0ll50n2)
Cheers,

ICNT
 

assume_R

Well-Known Member
iCountNTrack said:
I know it looks pathetic when compared to MGP's CA, but it does have a couple of extra things mainly calculating probabilities of the net outcome on the round, variance on the round, and ev calculations using post-split optimal strategies.
for 3 splits ( split to 4 hands) it runs really slow for a full deck, it will take 2-3 days of computational time if your computer is robust enough to handle it, 2 splits (split to 3 hands) are calculated within 1-2 hours with a moderately powerful computer. You can also deplete the deck to half or less for faster results.


I am working on a faster multi-threaded version to get some reasonable execution time.
the code was compiled with VS 2010 so you might get an error about some missing libraries. Best is to install Microsoft Visual C++ 2010 Redistributable Package (x86)
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=5555

PM me for password to open rar file.
(Dead link: http://www.mediafire.com/?9gi0nqb0jjq9b3i)
Cheers,

ICNT
Some tips:

You can make multithreading really easy by using omp. For example, here is a "for" loop which is now parallelized:
Code:
# pragma omp parallel for // create a parallel for loop over all features
for (int f = 0; f < F; ++f)
{
     // do some stuff in parallel

    # pragma omp critical
    {
         // write to a shared variable which can't be done in parallel
    }
}
Then, you compile it using the openmp option, which is /openmp by the command line, or in Configuration Properties --> C/C++ --> Language

You can test it using the following code:

Code:
#include <omp.h>
int main(int argc, char *argv[])
{
	#pragma omp parallel
	{
		printf("Hello from thread %d (%d total)\n", omp_get_thread_num(), omp_get_num_threads());
        }
}
Finally, my other tip, is if you don't want people to have to download the redistributable, just compile your program with the visual studio static libraries. This can be done using the /Mt command line switch, or by changin Configuration Properties --> C/C++ --> Runtime Library from "Multi-threaded DLL" to "Multi-threaded"
 

iCountNTrack

Well-Known Member
assume_R said:
Some tips:

You can make multithreading really easy by using omp. For example, here is a "for" loop which is now parallelized:
Code:
# pragma omp parallel for // create a parallel for loop over all features
for (int f = 0; f < F; ++f)
{
     // do some stuff in parallel

    # pragma omp critical
    {
         // write to a shared variable which can't be done in parallel
    }
}
Then, you compile it using the openmp option, which is /openmp by the command line, or in Configuration Properties --> C/C++ --> Language

You can test it using the following code:

Code:
#include <omp.h>
int main(int argc, char *argv[])
{
	#pragma omp parallel
	{
		printf("Hello from thread %d (%d total)\n", omp_get_thread_num(), omp_get_num_threads());
        }
}
Finally, my other tip, is if you don't want people to have to download the redistributable, just compile your program with the visual studio static libraries. This can be done using the /Mt command line switch, or by changin Configuration Properties --> C/C++ --> Runtime Library from "Multi-threaded DLL" to "Multi-threaded"
Thanks for the tips mostly the first one, it makes multithreading looks so simple, the example with the for loop is exactly what i am working on, that is the only way i can see an easy application of multithreading, so instead of running the outermost loops(that deal the first card) sequentially, I will do it in parallel. Thanks very much.
 

assume_R

Well-Known Member
iCountNTrack said:
Thanks for the tips mostly the first one, it makes multithreading looks so simple, the example with the for loop is exactly what i am working on, that is the only way i can see an easy application of multithreading, so instead of running the outermost loops(that deal the first card) sequentially, I will do it in parallel. Thanks very much.
You're welcome. Just an FYI multithreading works best the more "outer" your threaded loop is.

If you have this:

Code:
for (int i = 0; i < 20; ++i)
{
  for (int j = 0; j < 1000; ++j)
  {
     // do something
  }
}
It is much better to parallelize the outer loop. This is a silly, obvious, example, but in general just try very hard to parallelize as outside your logic you can. This way the inside of your logic can just fly through the CPU without the overhead of starting new threads too often. :)

As for the second tip, with the way compilers (especially Visual Studio) can really trim down file size, and the way that data storage these days is so cheap, I find compiling as many libraries into my program statically as I can (including the runtimes, as well as external libraries). In the old days, it was much more important to use shared libraries to cut down on file size, but these days I find it so much more of a hassle dealing with dependencies (and versions, conflicts, etc. etc.) than just turning my 2 MB executable into a 10 MB executable with no worries about dependencies. Just my personal preference.
 

iCountNTrack

Well-Known Member
assume_R said:
You're welcome. Just an FYI multithreading works best the more "outer" your threaded loop is.

If you have this:

Code:
for (int i = 0; i < 20; ++i)
{
  for (int j = 0; j < 1000; ++j)
  {
     // do something
  }
}
It is much better to parallelize the outer loop. This is a silly, obvious, example, but in general just try very hard to parallelize as outside your logic you can. This way the inside of your logic can just fly through the CPU without the overhead of starting new threads too often. :)

As for the second tip, with the way compilers (especially Visual Studio) can really trim down file size, and the way that data storage these days is so cheap, I find compiling as many libraries into my program statically as I can (including the runtimes, as well as external libraries). In the old days, it was much more important to use shared libraries to cut down on file size, but these days I find it so much more of a hassle dealing with dependencies (and versions, conflicts, etc. etc.) than just turning my 2 MB executable into a 10 MB executable with no worries about dependencies. Just my personal preference.
yep exactly my thoughts about multithreading the outermost loop. I am sure you will get a pm from me or two :).

BTW i remember you mentioning once that the pre-increment operator is faster than the post-increment operator is this true, shouldnt the compiler automatically optimize that?
 

ericfarmer

New Member
MGP said:
Glad it's helping. I really wish the split values from the old BJMath would disappear. Cacarulo and I got the exact values (he beat me by a hair) after a discussion started there and Cacarulo had them published in BJA3. The old values only cause confusion.
Thanks, this is good to know. I had always assumed that those tables were accurate, at least for whatever post-split strategy was assumed (I didn't see this mentioned explicitly anywhere). So I spent quite a bit of time poking around in my algorithm this weekend, trying to figure out what was wrong with it... when in fact nothing was. :)
 

assume_R

Well-Known Member
iCountNTrack said:
BTW i remember you mentioning once that the pre-increment operator is faster than the post-increment operator is this true, shouldnt the compiler automatically optimize that?
For primitive types, it makes little to no difference (Eric tested this), but with non-primitive types, such as iterators or operator-overloaded classes, it will make a big difference.

The difference is that with the post-increment, a temporary may be created.

See the following sites, or have a go at searching google for some other sites:

http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=171881
http://www.gotw.ca/gotw/002.htm
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Preincrement_and_Predecrement

The last one is the google style guide (internal google guide for using c++) which I follow in all my programs for consistency.

Edit: Regarding your question on compiler optimization, the compiler won't optimize anything that can affect the logic of your code.

Consider this example:

Code:
int I = 5;
for (int i = 0, j = 0; i < I; j = ++i)
    printf("%d\n", j);
printf("\n");
for (int i = 0, j = 0; i < I; j = i++)
    printf("%d\n", j);
As you can see, the increment can change the logic of your code, and hence the compiler will avoid doing anything to affect the logic.
 

London Colin

Well-Known Member
assume_R said:
The difference is that with the post-increment, a temporary may be created.
And I guess the subtle point is that a compiler cannot optimize this away for a non-primitive type, because there is no guarantee that the overloaded pre- and post-increment operators both have exactly the same semantics (aside from the pre and post-ness of them), though it would be poor programming practice for them not to have.
 

assume_R

Well-Known Member
London Colin said:
And I guess the subtle point is that a compiler cannot optimize this away for a non-primitive type, because there is no guarantee that the overloaded pre- and post-increment operators both have exactly the same semantics (aside from the pre and post-ness of them), though it would be poor programming practice for them not to have.
Yeah, that sounds like another reason the compiler can't optimize it.

But even with primitive types, it doesn't optimize the post-increment, but with int's, the creation of a temporary int takes almost no time.

However, we can see how it can change the efficiency of a program on a large scale:

Code:
// Timer was shown to me by Eric Farmer
#include <ctime>
struct Timer
{
    Timer(double& elapsed) : elapsed(elapsed), tic(std::clock()) {}
    ~Timer()
    {
        elapsed = static_cast<double>(std::clock() - tic) / CLOCKS_PER_SEC;
    }
    double& elapsed;
    std::clock_t tic;
};

int main(int argc, char* argv)
{

	double elapsed = 0, temp;
	const int I = 2000, J = 2000, K = 3000;
	printf("Post-increment...");
	{
		Timer t(elapsed);
		for (int i = 0; i != I; i++)
			for (int j = 0; j != J; j++)
				for (int k = 0; k != K; k++)
					temp = i + j + k;
	}
	printf("done in %f seconds.\n%d\n", elapsed, temp);

	elapsed = 0;
	printf("Pre-increment...");
	{
		Timer t(elapsed);
		for (int i = 0; i != I; ++i)
			for (int j = 0; j != J; ++j)
				for (int k = 0; k != K; ++k)
					temp = i + j + k;
	}
	printf("done in %f seconds.\n%d\n", elapsed, temp);
}
Result:



So with 2000 * 2000 * 3000 = 12 billion operations, pre-increment of "int" saved 1.7 seconds.
 

London Colin

Well-Known Member
assume_R said:
Yeah, that sounds like another reason the compiler can't optimize it.
I'm pretty sure it's the only reason.

Any form of compiler optimization takes place after an analysis of the code to determine what can and can't be done without changing the logic. I imagine testing for whether a post can be safely swapped for a pre would be one of the more trivial cases. And in the case of the boilerplate code -

for (i=0; i<n; i++)

- it is completely trivial.


assume_R said:
But even with primitive types, it doesn't optimize the post-increment, but with int's, the creation of a temporary int takes almost no time.

However, we can see how it can change the efficiency of a program on a large scale:

....

Result:



So with 2000 * 2000 * 3000 = 12 billion operations, pre-increment of "int" saved 1.7 seconds.
I suppose it depends on the compiler (and possibly what flags you specify).

See the link in my previous post, where someone determined that exactly the same assembly code was being generated in both cases -
[post]242901[/post]
... based on this quick experiment, there is no difference between pre-increment and post-increment for int types when using the Visual Studio 2008 C++ compiler with optimizations enabled or disabled.
 

London Colin

Well-Known Member
assume_R said:
Some tips:

You can make multithreading really easy by using omp. For example, here is a "for" loop which is now parallelized:
Code:
# pragma omp parallel for // create a parallel for loop over all features
for (int f = 0; f < F; ++f)
{
     // do some stuff in parallel

    # pragma omp critical
    {
         // write to a shared variable which can't be done in parallel
    }
}
Thanks for posting about this. I hadn't come across OpenMP before.

I have a question, though. How is the assignment of threads to processor cores managed by default?

For this sort of for-loop example I guess we ideally want just one thread executing per core, with tasks (i.e. the loop iterations) being assigned to each thread as they become free.

What I mean is if we parallelize a 1-to-10 loop on a dual-core machine we wouldn't really want five threads kicked off on each, since the execution times of each loop iteration might vary considerably, meaning one core might run out of things to do while the other is still busy. (Moreover, with a triple-core, there would have to be an unequal number of threads assigned to each.)

I had a brief look at the specs at http://openmp.org/wp/openmp-specifications/ and at Wikipedia http://en.wikipedia.org/wiki/OpenMP. It looks like you might be able to explicitly control this kind of thing, but I couldn't really get an overview; there's a lot of detail to wade through. :)

On the same theme, it looks like maybe on a single-core machine, unless you build against a 'stub' version of OpenMP which does nothing, you may end up with a multithreaded loop that might actually execute a bit more slowly than the sequential version, thanks to the overheads that omp must add.
 

assume_R

Well-Known Member
London Colin said:
Thanks for posting about this. I hadn't come across OpenMP before.

I have a question, though. How is the assignment of threads to processor cores managed by default?

For this sort of for-loop example I guess we ideally want just one thread executing per core, with tasks (i.e. the loop iterations) being assigned to each thread as they become free.

What I mean is if we parallelize a 1-to-10 loop on a dual-core machine we wouldn't really want five threads kicked off on each, since the execution times of each loop iteration might vary considerably, meaning one core might run out of things to do while the other is still busy. (Moreover, with a triple-core, there would have to be an unequal number of threads assigned to each.)

I had a brief look at the specs at http://openmp.org/wp/openmp-specifications/ and at Wikipedia http://en.wikipedia.org/wiki/OpenMP. It looks like you might be able to explicitly control this kind of thing, but I couldn't really get an overview; there's a lot of detail to wade through. :)

On the same theme, it looks like maybe on a single-core machine, unless you build against a 'stub' version of OpenMP which does nothing, you may end up with a multithreaded loop that might actually execute a bit more slowly than the sequential version, thanks to the overheads that omp must add.
If you look at my "Hello world" example, you can try it out.

But essentially it defaults to the number of cores (including hyperthreaded virtual cores!) available on the machine.

I'm not actually sure if that's computed runtime or compile-time (but I would assume runtime, but you should try out the hello world program on different machines to confirm). You can specify it yourself though if you want by using the omp_set_num_threads(int) function in the omp.h file.
 

assume_R

Well-Known Member
London Colin said:
On the same theme, it looks like maybe on a single-core machine, unless you build against a 'stub' version of OpenMP which does nothing, you may end up with a multithreaded loop that might actually execute a bit more slowly than the sequential version, thanks to the overheads that omp must add.
Eh, I'm not actually even sure that's true. Let's say you load up 3 threads on a single core machine. And your machine has 1000 other threads running. Now, you are using 3/1003 of the processing power instead of 1/1001 :). But I'm not sure that would make any difference, especially with the overhead.
 

London Colin

Well-Known Member
assume_R said:
If you look at my "Hello world" example, you can try it out.

But essentially it defaults to the number of cores (including hyperthreaded virtual cores!) available on the machine.

I'm not actually sure if that's computed runtime or compile-time (but I would assume runtime, but you should try out the hello world program on different machines to confirm). You can specify it yourself though if you want by using the omp_set_num_threads(int) function in the omp.h file.
The issue is not just the number of threads, but also how work is allocated to those threads. It does all seem to be documented, but in the kind of way that makes me read through the same stuff multiple times and still come away thinking "huh?". :)

Edit: 'Scheduling Clauses' seems to be what I was looking for : http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses

assume_R said:
Eh, I'm not actually even sure that's true. Let's say you load up 3 threads on a single core machine. And your machine has 1000 other threads running. Now, you are using 3/1003 of the processing power instead of 1/1001 :). But I'm not sure that would make any difference, especially with the overhead.
I would think most, if not all multitasking schedulers allocate time per process, not per thread, otherwise a rogue process could bring the system to its knees just by starting lots of threads.

Actual running time can potentially be improved by setting a high priority for a process, but that's not really a programming issue. We just assume our program has an entire (virtual) machine to itself. Questions of speed relate to how many clock cycles it takes within that virtual-machine environment, i.e. during the time the OS is actually scheduling it to run.
 
Top