Glulxe fatal error: Memory access out of range (2519010C)

I’m always pushing limits here!

I’ve been running a big suite of tests in Kerkerkruip and I just got this error:

Are there any settings I can increase to avoid it?

That’s not a running-out-of-memory error, that’s a bug that accesses a garbage memory address.

Have you looked at inform7.com/mantis/view.php?id=1390 ?

That’s kind of a relief. It was crashing at the same point every time, so maybe it’s something I can actually fix.

Okay, I think I found the point where the crash happens, but it’s weird. I’m not sure the code is always failing at the same point, but it often fails on the line “blank out the whole of…”

A last Standard AI rule for a person (called P) (this is the select an action and do it rule):
	log "select an action and do it for [P]"; [this log entry USUALLY shows up]
	blank out the whole of the Table of AI Action Options;
	log "blanked out table of AI Action Options"; [if there is a crash within this rule, this log entry NEVER shows up]
	...

The Table of AI Action Options contains stored actions. Could that be causing a problem somehow?

Table of AI Action Options
Option	Action Weight
a stored action	a number
with 20 blank rows

What version of I7? The bugs I linked to were in 6L02; should be fixed in 6L38; but perhaps you’ve found another case that was missed.

This is the updated Mac version of 6L38. From the About box:

Do you have any suggestions for how to replicate the bug on a smaller scale? I don’t even know where to begin at this point.

While looking for a workaround, I did learn that repeating through the table is ok, but trying to blank out an individual row also causes the error. And I also learned that “choose row 1” does not choose the first nonblank row. Isn’t there a phrase to do that?

[code]To cautiously blank out (contents - a table name):
	while the number of filled rows in contents > 0:
		choose a random row in contents;
		log "blanking out [option entry]: [action weight entry][line break]";
		blank out the whole row;

A last Standard AI rule for a person (called P) (this is the select an action and do it rule):
	log "select an action and do it for [P] - [number of filled rows in table of ai action options] rows";
	cautiously blank out Table of AI Action Options;
[/code]

The method used in some of the example code is “repeat through [table]: [do stuff]; break.” If it gets past the break statement, you’re out of non-blank rows.
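A minimal sketch of that pattern, for illustration only. The phrase name "blank out the first filled row of" is made up here and does not come from Kerkerkruip or the Standard Rules. It relies on "repeat through" visiting only non-blank rows, so the first pass of the loop lands on the first filled row:

```inform7
To blank out the first filled row of (T - a table name):
	repeat through T:
		[the loop has already chosen the first non-blank row]
		blank out the whole row;
		break.
```

If the table has no filled rows, the loop body never runs and nothing is blanked.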

Wow, that seems like a really ugly hack.

I’m unable to reproduce the error with a short test case.

Yeah, me too. I guess I should commit my code so other people can mess around with it. But even the test that crashes doesn’t always do it - it seems to depend on certain starting conditions that I can’t identify. It’ll have to wait until tomorrow, though.

When you do, commit it to a branch please :) (Actually it would probably be good to keep all the test stuff in a branch until you’re completely finished.)

For those of you not following on github, here’s a link to the relevant file on the bugfix branch:

github.com/i7/kerkerkruip/blob/ … 20Core.i7x

To replicate the bug, you’ll need to clone the entire branch, run Kerkerkruip, start a game, and enter this command:

queue test dreadful-presence-test

If that doesn’t replicate the problem, you can try tweaking the random seed… ask me if you need more information about that. But for me, right now, the VM crashes with every seed, although it happens after an unpredictable number of table accesses.

I’m reading the discussion on the Mantis bug reports and checking that against Tables.i6t. I noticed this comment for ForceTableEntryBlank:

I wonder how this interacts with the stored action variable “the main actor’s action,” which is set from an entry in the Table of AI Action Options. If I’m reading this code correctly, that setting is done by copying and not by reference, so there should be no problem. And there is not a problem most of the time. It’s only in this one test that I’ve ever seen the crash. The test does involve some hijacking of the normal Kerkerkruip AI behavior (dreadful presence stops people from acting sometimes), but not in a way that I could imagine being the cause. I’m not sure what else is special about the test, but there could be something I’m missing.

Edit: Maybe it could be the cause. By diverting the normal sequence of AI rules, it might bypass the line that updates the main actor’s action, leaving an action from an earlier turn still in there. I still don’t know why that would affect the table, but at least it’s something to check.

But anyway, perhaps a close reading of ForceTableEntryBlank might shed some light on this. I’ll look at it, but I don’t know if I have enough understanding of the code to really see it.

Here’s something really weird:

I copied the code for “Force Entry Blank” into an Include block in my project, planning to insert some debug messages. Then I ran the code to check that it still worked. Lo and behold, no VM crash! But I did get this when the game restarted itself after the test:

After some more investigation, it looks like a garbage value is being written into the TB_Blanks address in the locale description priority column of the Table of Locale Priorities. I don’t know how. Maybe I copied something over wrong from Tables.i6t, but I don’t know how that could have happened either. In fact, I tried copying it again to be sure, and I got the same result. I even checked all the whitespace in case there were some anomalies in the copy/paste operation. None that I could find.

Okay, I’ve found the point where the Table of Locale Priorities gets corrupted. It’s this rule:

Last when play begins (this is the create shimmering items rule):
	repeat with guy running through people:
		unless guy is the player:
			repeat with item running through things held by guy:
				if item is a weapon and item is not a natural weapon:
					if item is readied:
						let new-weapon be a new object cloned from item;
						now new-weapon is shimmering;
						now shimmer-owner of new-weapon is guy;
				if item is clothing:
					if guy wears item:
						let new-cloth be a new object cloned from item;
						now new-cloth is shimmering;
						now shimmer-owner of new-cloth is guy.

I assume from this that there’s a problem with dynamic objects.

Oh wait… now I know why copying the relevant sections of Tables.i6t changed the behavior: Dynamic Tables is included by Dynamic Objects, and it also replaces these sections of Tables.i6t. I was actually undoing some of the work done by Dynamic Tables! Now back to the drawing board…

Fresh start now.

I believe I’ve copied all the code correctly this time before inserting print statements. Now it looks like the original crash, and the crash is happening in FlexFree. I don’t think I understand how this is supposed to work, so let me just show you what I did:

[code]include (-
[ FlexFree block fromtxb ptxb memsize;
	@getmemsize memsize;
	print "FlexFree ", block, " memsize=", memsize, "^";
	if (block == 0) return;
	if ((block->BLK_HEADER_FLAGS) & BLK_FLAG_RESIDENT) return;
	if ((block->BLK_HEADER_N) & $80) return; ! not a flexible block at all
	if ((block->BLK_HEADER_FLAGS) & BLK_FLAG_MULTIPLE) {
		print "Block is multiple^";
		if (block-->BLK_PREV ~= NULL) (block-->BLK_PREV)-->BLK_NEXT = NULL;
		fromtxb = block;
		for (:(block-->BLK_NEXT)~=NULL:block = block-->BLK_NEXT) {
			print "current block is ", block, ", next=", block-->BLK_NEXT, ", previous=", block-->BLK_PREV, "(NULL=", NULL, ")^";
		}
		while (block ~= fromtxb) {
			print "Freeing component block ", block, "^";
			ptxb = block-->BLK_PREV; FlexFreeSingleBlockInternal(block); block = ptxb;
		}
	}
	print "Freeing original block ", block, "^";
	FlexFreeSingleBlockInternal(block);
];

! The rest of this section is unmodified…

[ FlexFreeSingleBlockInternal block free nx;
	block-->BLK_HEADER_KOV = 0;
	block-->BLK_HEADER_RCOUNT = 0;
	block->BLK_HEADER_FLAGS = BLK_FLAG_MULTIPLE;
	for (free = Flex_Heap:free ~= NULL:free = free-->BLK_NEXT) {
		nx = free-->BLK_NEXT;
		if (nx == NULL) {
			free-->BLK_NEXT = block;
			block-->BLK_PREV = free;
			block-->BLK_NEXT = NULL;
			FlexMergeInternal(block);
			return;
		}
		if (UnsignedCompare(nx, block) == 1) {
			free-->BLK_NEXT = block;
			block-->BLK_PREV = free;
			block-->BLK_NEXT = nx;
			nx-->BLK_PREV = block;
			FlexMergeInternal(block);
			return;
		}
	}
];
-) instead of "Deallocation" in "Flex.i6t".
[/code]

And the output:

I’m going to assume there was a stream mixup and that programming error actually happens after the last print statement. So it looks like the trouble begins when the object’s first child is the 0 object (yourself?)… or maybe I’m way off. This is just weird-looking.

Don’t assume that. Probably not true.

Can you compile the version of Glulxe in the branches github.com/erkyrath/cheapglk/tree/debugger ? This would let you set a breakpoint on the RT__Err() function and look at the stack trace.

Maybe next week… that would have a bit of a learning curve for me.

Oh, you’re right. It must be this line that produces the programming error:

		if (block-->BLK_PREV ~= NULL) (block-->BLK_PREV)-->BLK_NEXT = NULL;

According to the output, block-->BLK_PREV = 0. I guess you can’t write to 0-->BLK_NEXT…?
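One detail that may matter here: in Inform 6 the constant NULL is -1, not 0, so a back-link of 0 passes the "~= NULL" test and the assignment then targets an address computed from 0. Below is a rough sketch of a check that could flag such a block before the write happens. The routine name FlexCheckPrevLink is hypothetical and not part of Flex.i6t, and the sketch assumes the BLK_PREV constant from the template is visible wherever this inclusion ends up:

```inform7
Include (-
! Hypothetical debugging helper; not part of Flex.i6t.
! Returns true if the block's back-link is 0 rather than NULL.
[ FlexCheckPrevLink block;
	if (block-->BLK_PREV == 0) {
		print "Block ", block, " has BLK_PREV = 0 (NULL is ", NULL, ")^";
		rtrue;
	}
	rfalse;
];
-).
```

Calling FlexCheckPrevLink(block) near the top of the replaced FlexFree routine would show which blocks arrive with a zeroed back-link. It only reports the symptom; it does not explain how the link came to be 0.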