Benchmarking, PLUS contributions wanted for a test suite

Hello all! I am pleased to announce the release of my extension Benchmarking. It will soon be on the Inform 7 website; for now you can download it from curiousdannii.github.com/if/exte … arking.i7x

Benchmarking is a performance test framework for I7. It is designed with two types of users in mind:

  1. Story and extension authors can use Benchmarking to compare alternatives for some slow programming task.
  2. Interpreter authors can use Benchmarking to compare their interpreter with others, as well as to compare interpreter updates to see whether performance has improved or regressed.

It is easy to set up, as the included example of testing text matching shows. If the tests do not run in the built-in I7 interpreter, then either use output.ulx from the Build folder with another interpreter, or (if you’re on Windows) download the updated 6G60 build.

Any comments on this extension will be appreciated!

-------------------

I am also looking for contributions for a performance test suite. Has your I7 story ever run too slowly? Can you identify what the problem is? If so, then I’d love that example for the test suite. Please post here so that we can coordinate and not end up with 10 different scope tests.

There is a ‘-)’ missing in your example. Looks pretty neat otherwise.

It’s an Inform 7 bug: inform7.com/mantis/view.php?id=377

Very nice! Thank you for this!

FYI, searching a 100-member list for a given value (assuming that it’s toward the end of the list) takes roughly 28 times as long as simply retrieving the value from a given slot of the same list:

Not earth-shattering, but it is interesting to me how much longer a search takes versus a simple access.

I do have a couple of UI suggestions:

  1. The game closes down a few moments after finishing the last test. In Gargoyle, this means that the game window closes too. Please offer the user the chance to either rerun the tests or quit at the end, and don’t quit until an explicit key is pressed. It takes a long time to run any of the simple tests I’m trying, and it’s really annoying to have the window just close in your face when they finish!

  2. Currently it is a bit hard to customize the conditions for a test. For example, if I need to set up a list full of random data for my test(s) to work on, I need to do something like this:

[code]First when play begins:
	repeat with x running from 1 to 99:
		add a random number between 1 and 99 to search list;
	add 100 at entry 90 in search list;
	say "The search list is [number of entries in search list] items long: ";
	say search list in brace notation;
	wait for any key.[/code]

What’s difficult? First, the user needs to remember to add “first” to the when play begins rule to ensure that his rule prints before the program is taken over by the benchmarker; changing the benchmarker control rule to “last when play begins” would eliminate this problem. Second, if the user wants to print text in this code block, he needs to stick something like “wait for any key” in there, since it’s the glk_select call that actually gets the buffered text onto the screen. Most users won’t realize that this is necessary and will bang their heads against the wall trying to figure out why their introductory text is not printing. Maybe a “press any key to begin” in the benchmarker’s when play begins rule would fix this?

Hm, that shouldn’t happen. It’s the interpreter’s responsibility to keep any last-printed Glk output visible to the user after glk_exit().

Looking into it, I was wrong: the window does stay open, but only until the user presses any key. So twice today I accidentally pressed a random key while trying to do something else and the window shut down. I still think it’s a good idea to offer the user an explicit option to end the execution. EDIT: Aha! Using the scroll wheel apparently counts as a keypress, at least in the Mac OS X Gargoyle, so it was my attempt to use the mouse to view the scrollback buffer that caused the window to close.

Another UI idea: Allow the user to request a transcript so that the data can be dumped to a file.

Hi Erik, I’ve added your three suggestions. Will post again when I upload the update.

In the future I plan to make it write out to a data file, and give you the option to compare with other interpreters. I’ll remove the transcript option then.

Great, thanks. I’ve got what appear to be some intermittent issues with memory on some simple examples I’ve been playing with. I’ll post again if I come up with something coherent to report.

Yes, something is going wrong with the sampling: 93902 samples seems far too high to me, though it could be possible if your timer resolution is very good, i.e. 50µs. I’m only getting 1ms in Windows Gargoyle :frowning:. Can you check what “the minimum timer resolution” is, with an After running the benchmark framework rule? The extension stores the samples in a list, so with that many samples it’s sure to cause problems.

The framework takes at least 5 samples, and runs each test for at least 5 seconds as well (mainly because that’s what benchmark.js does). But if too many samples are a problem I can limit that too; I’m sure 100 would be more than fine.
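In rough pseudocode the sampling loop amounts to something like the following. This is a Python sketch for illustration only, not the extension’s actual source; run_test and the cap value are stand-ins.

[code]import time

MIN_SAMPLES = 5       # take at least 5 samples
MIN_TOTAL_TIME = 5.0  # and keep sampling for at least 5 seconds (as benchmark.js does)
MAX_SAMPLES = 100     # a cap like this would keep the sample list from growing huge

def collect_samples(run_test):
    # Time run_test repeatedly until both minimums are met or the cap is hit.
    samples = []
    total = 0.0
    while (len(samples) < MIN_SAMPLES or total < MIN_TOTAL_TIME) and len(samples) < MAX_SAMPLES:
        start = time.perf_counter()
        run_test()
        elapsed = time.perf_counter() - start
        samples.append(elapsed)
        total += elapsed
    return samples[/code]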

Would you also be interested in seeing the raw number of times a function was run?

If you compile Gargoyle with -DGARGLK_PROFILE_TIME, it uses a high resolution timer for glk_current_time calls. Resolution varies by platform but 50µs is in the ballpark.

Erik has access to a build like that, so possibly he’s using it here.

I’ve uploaded an update. The biggest change is how the iteration multiplier is calculated: this is the number of times a test case needs to be run for a single sample to reach the timer resolution. It seems to work reliably with my tests. If a test case ever does get a time of 0, a message is printed and that time is discarded.
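For anyone curious, the calibration works roughly along these lines. This is an illustrative Python sketch, not the extension’s code, and doubling is just one way of growing the count; the names are made up.

[code]import time

def calibrate_multiplier(run_test, timer_resolution):
    # Grow the number of iterations per sample until one sample's elapsed
    # time registers on a timer with the given resolution (in seconds).
    multiplier = 1
    while True:
        start = time.perf_counter()
        for _ in range(multiplier):
            run_test()
        elapsed = time.perf_counter() - start
        if elapsed >= timer_resolution:
            return multiplier
        # A reading below the resolution (possibly 0) means the timer
        # couldn't see the work at all; try again with more iterations.
        multiplier *= 2[/code]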

Dannii, was the “minimum timer resolution” actually the value you wanted me to test? It always returns “1”, the minimum value, so I assume that the value is not measured in microseconds…?

I’m using the latest public release of Gargoyle, 2011.1.

When the 95-96,000-sample run reaches its end, it sits for a few minutes as if the interpreter has hung, then finally spits out the results. I assume that during that time it is processing the massive list you mentioned! (Even the smaller value, 25-28,000 samples, has a noticeable delay, but it’s not as bad.) I’m no math whiz, so apologies if this is silly, but would it be possible to process the results iteratively, so that you don’t actually need to store all the values? I’m sure that the mean could be computed that way, and probably the variance too…?

I’m not noticing a whole lot of difference between your new update and the original. Here are before and after results:

For the record, the code I’m testing in both cases is:

[spoiler][code]Include Benchmarking by Dannii Willis.

Search list is a list of numbers variable.

First when play begins:
	try switching the story transcript on;
	repeat with x running from 1 to 99:
		add a random number between 1 and 99 to search list;
	add 100 at entry 90 in search list;
	say "The search list is [number of entries in search list] items long: [search list in brace notation].";
	wait for any key.

Searching a simple list is a test case.
The author is "Erik Temple".
The description is "Looks for the number 100 in a list of random numbers 100 entries long."
To run simple-list search (this is running simple search):
	if 100 is listed in the search list:
		do nothing.
The run phrase is running simple search.

Selecting from a simple list is a test case.
The author is "Erik Temple".
The description is "Grabs entry 90 from a list of random numbers 100 entries long."
To run simple-list select (this is running simple selection):
	if entry 90 in search list is 100:
		do nothing.
The run phrase is running simple selection.

After running the benchmark framework:
	say "The minimum timer resolution was: [the minimum timer resolution]."[/code][/spoiler]

I seem to recall Ron mentioning that different systems might return the system clock in different units (was it nanoseconds vs microseconds?). That couldn’t be a factor here, could it? I wouldn’t think so, given that the results look pretty reasonable, but…

I do think it would potentially be useful to see the raw number of times a function was run, and it might also be useful to be able to cap the number (running 90,000 iterations takes a long time!).

FYI, here is a typical result on my system for your example from the extension docs:

It looks like you still have the old version. The new one caps it at 100 samples. It might be possible to iteratively calculate the variance, but I don’t know how. The algorithm I have needs the mean to be calculated first. But with only 100 samples that won’t be a problem any more.

If your timer resolution really is just 1µs then that’s great! And although different systems might be able to get the time at different resolutions, Glk normalises it to microseconds. So if you have a ns timer it will divide by 1000, and if you only have a ms timer it will multiply by 1000. On Windows Garglk gives you the time in µs, but the value is cached and only updated every ms. The extension measures what resolution you have by running a tight loop until the time value changes (and samples that 30 times). If you’re running on a Mac then potentially you do have access to a real µs timer (Garglk actually accesses a ns timer).
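In pseudocode the measurement looks roughly like this (a Python sketch for illustration only; glk_microseconds is a stand-in for the value the extension reads from the Glk clock):

[code]def measure_timer_resolution(glk_microseconds, samples=30):
    # Spin until the reported time changes, record the size of the jump,
    # repeat 30 times, and keep the smallest jump seen: "the minimum
    # timer resolution", in microseconds.
    best = None
    for _ in range(samples):
        start = glk_microseconds()
        now = start
        while now == start:
            now = glk_microseconds()
        jump = now - start
        if best is None or jump < best:
            best = jump
    return best[/code]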

Yep, when I downloaded the file after your update post, GitHub was apparently still serving the old version. I just tried again and got the 120118 version. This works very nicely, and the values from 100 samples are indeed equivalent to those from 95,000 samples:

It might be nice eventually to have an easy way to change the maximum number of samples run (in case I wanted to raise or lower it), but of course that’s unlikely to be necessary.

With the number of samples run capped at 100, I don’t think it’s necessary any longer, but for future use in some other project perhaps:

This algorithm from en.wikipedia.org/wiki/Algorithms … g_variance seems to be able to calculate the mean and variance as the data are collected, rather than iterating over the full dataset post-collection.

[code]def online_variance(data):
    # Welford's method: a single pass over the data, updating the count,
    # the running mean, and the sum of squared differences (M2) as each
    # value arrives.
    n = 0
    mean = 0
    M2 = 0

    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n
        if n > 1:
            M2 = M2 + delta*(x - mean)

    # Population variance (divide by n) and sample variance (divide by n - 1).
    variance_n = M2/n
    variance = M2/(n - 1)
    return (variance, variance_n)[/code]
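And to actually avoid keeping the sample list around, the same update can be driven one sample at a time as the data arrive. Here is a small reworking into an accumulator, illustrative only and not taken from the extension:

[code]class RunningVariance:
    # Welford-style accumulator: feed it samples one at a time; it keeps
    # only the count, the running mean, and M2, never the full sample list.
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    def sample_variance(self):
        # Divide by n - 1; needs at least two samples to be meaningful.
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0[/code]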

Anyway, Benchmarking is sweet! (It also happens to be a good illustration of the use of floating point in Glulx Inform, which will come in handy if I ever update Glimmr to make use of it.)

Thanks again for all the work!

Just a couple more thoughts:

First, I’ve edited my local copy of Benchmarking to show the description of each test case:

[code]Before benchmarking a test case (called test case) (this is the showing a test case's info rule):
	now the current test case is the test case;
	now the current phase is "";
	now the current sample number is 0;
	say "[line break][The test case][line break][italic type][description of the test case][roman type]"; [### Changed ET - 1/18/2012]
	pause briefly;[/code]

I’m not sure whether that is desirable for all users, but there it is.

Second, for the bank of test cases you’re hoping to put together: Ron Newcomb a while back posted some code for searching through tables that used optimized Glulx opcodes (@linearsearch, and maybe one of the others?). That might make for a nice comparison with the standard I7 code.

I’ve just pushed a major update to the extension. It now performs significance tests, so that you’ll be able to see whether a new algorithm is actually faster than the old one. It’s not perfect, however, and other programs running at the same time may make the results less accurate, so it’s best to run the tests without anything else going on.
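To give a feel for what a significance test over two sets of timing samples involves, here is a toy Python example with made-up numbers. It uses Welch’s t-test from SciPy, which may or may not be the exact test the extension performs:

[code]from scipy import stats

# Made-up per-sample timings (in seconds) for two versions of an algorithm.
old_times = [0.0102, 0.0099, 0.0101, 0.0103, 0.0100]
new_times = [0.0091, 0.0094, 0.0090, 0.0092, 0.0093]

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(new_times, old_times, equal_var=False)

if p_value < 0.05:
    print("The speed difference looks statistically significant (p = %.4f)" % p_value)
else:
    print("The difference could just be measurement noise (p = %.4f)" % p_value)[/code]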

It doesn’t show the test author/description yet, but that is still planned.

This is super cool! Thanks for making this. :slight_smile:

::replying so I can find the thread again later::

Also, tres cool, Dannii.

I’ve just pushed an update which gives a menu option to show test descriptions. Which means this extension is now out of beta! I’ll be sending it to the I7 website soon.

Thanks for your comments everyone.

I’m still keen for contributions for a test suite. If you have any slow code please send it my way!

I have a little game called Sugar that is very slow, because it does a bunch of regex stuff on the entire output of the game every turn. Is this useful? Should I PM you the code?