Beyond Brown

When brown just isn't enough

RAM handling on the Atari ST

I mean, doesn’t look that hard to handle, does it?

Probably one of the more boring and contrived subjects to discuss, especially on old platforms like this, is how to handle memory in applications, from demos to games or even a normal GEM application. However, since people are still confused about this in anything but the trivial cases, it’s probably a good idea to review some concepts and offer some food for thought.


How to allocate RAM from the system in the first place

Let’s hop right into it without further ado. What are some common ways to allocate RAM for our program?

Let the assembler/compiler handle it

“Just use ds.w bro!”. And indeed that method gives a lot of mileage. Just tell the assembler or compiler when creating our TOS PRG to reserve some RAM on the BSS and give us its start address by means of a symbol:

    bss
my_awesome_buffer:  ds.w 16384

or in C:

unsigned short my_awesome_buffer[16384];

Boom, done. For most cases this is what people do and get away with.

…until the computer runs out of RAM and there is need for more.

Hardcoded addresses

A variant of this is to go one step further and not even ask the assembler. Just hardcode all the buffer addresses as equates and move on. This is very valuable when one needs absolute control over RAM, doesn’t care about being system friendly, and does not intend to return to the system at all (barring a total system reset).

Perfectly legal and doable, as long as the program and other data stay out of the way of those addresses. Which means that a relocator is probably needed after loading, or a loader that takes over the machine at boot time and disables the OS. This increases complexity. And, in this age, it reduces the number of people who will bother to use the software, as most people these days own some sort of hard disk solution and don’t really want to mess with floppies. (Unless one also writes a low level hard disk driver for their purposes, manages not to trash other data on the disk, etc.)

Allocation via spreadsheet

Not to diss the above ideas of course. A good amount of shipped software has had its RAM allocation managed manually by people using spreadsheets. There is an anecdote by one of the people who worked on the original God Of War games on PlayStation 2 (Christer Ericson) describing exactly this idea. Unfortunately the story is now probably hidden behind Twitter’s “log on to see post” wall, so it’s not easy to find (and the author might have taken the post down anyway).

System allocator

If static buffers don’t cut it, then it’s time to let the system assist.

The relevant functions we need are Malloc and Mfree in GEMDOS. One asks the system for a block of contiguous RAM like so:

    move.l   amount,-(sp)
    move.w   #$48,-(sp)
    trap     #1
    addq.l   #6,sp

and the other frees it:

    pea     startadr
    move.w  #$49,-(sp)
    trap    #1
    addq.l  #6,sp

There are of course some drawbacks with this as well:

  • It is common knowledge (or at least it used to be) that the early TOS versions (especially 1.0) have many bugs in Malloc, especially when many different blocks are requested. The situation was so bad that the early C SDKs shipped their own Malloc replacement as part of their runtime. So the best practice is to reserve as much RAM as possible in one block, and then split the allocation manually
  • Freeing a block of RAM does not automatically mean that it’s available again for any request. For example, if 3 blocks are allocated with sizes a, b and c, then b is freed, and we request a block of size d>b, then the system cannot carve the new block out of b’s hole. This is known as fragmentation, and although TOS tries its best to work around the problem (for example by joining adjacent freed blocks into one larger block), the system can still run out of contiguous RAM under some circumstances
  • On later TOS versions there is another call in GEMDOS called MxAlloc which can also give blocks of “fast” (or “TT” or “alt”) RAM

Consolidating all the above points, we come up with something like the following C code to allocate almost all RAM available:

#include <osbind.h>

long ram_available;
void *free_ram_start;
void *allocation_start;

void grab_ram()
{
    // Check if GEMDOS is new enough to use Mxalloc
    // (so we can allocate TT/FastRAM)
    unsigned short gemdos_version = Sversion();
    unsigned char use_mxalloc = gemdos_version >= 0x1900;

    if (use_mxalloc)
    {        
        ram_available = Mxalloc(-1, MX_PREFTTRAM);
        if (!ram_available)
        {
            Cconws("\x1b" "ECannot allocate RAM!\r\nPress any key to exit.");
            Pterm0();
        }
        free_ram_start = (void *)Mxalloc(ram_available, MX_PREFTTRAM);
        if ((unsigned long)free_ram_start < 0xe00000)
        {
            // No FastRAM allocated, so let's mark that we didn't
            // (needed for allocating screen RAM for pictures below)
            use_mxalloc = 0;
        }
    }
    else
    {
        ram_available = Malloc(-1);
        if (ram_available < 65536)
        {
            // Probably enough for caching all the music files
            Cconws("\x1b" "ENot enough RAM!\r\nPress any key to exit.");
            Pterm0();
        }
        ram_available -= 1024;  // leave 1k to the system for luck
        free_ram_start = (void *)Malloc(ram_available);
    }
    allocation_start = free_ram_start;
}

void free_ram()
{
    Mfree(free_ram_start);
}

Scared of C? See the assembly code for the above here.

Using RAM

Using one of the methods above (or even something completely different, it doesn’t matter), we are now the proud owners of a block of RAM. How can we use (and possibly re-use) it effectively?

Assembler handles it (hardcoded)

If we simply asked the assembler to reserve some memory statically (either from TEXT, DATA or BSS segments), and we don’t care much, we can just go ahead and fill it in with the data we want to generate, and that’s all.

In case the buffer needs to be used more than once, we simply write to it again, taking care that the two (or more) use cases don’t overlap in time. This can lead to waste if the amount of RAM needed by each use case is significantly different.

Structs

XiA has written a piece on this which is a recommended read. Here we will present a quick review of the ideas discussed.

Assume we want to store a few parameters in that buffer, like game stats. Traditionally one can do something like the following:

lives:          ds.w 1
energy:         ds.w 1
powerup_charge: ds.b 1
hidden_level_trigger:
                ds.b 1
game_state:     ds.w 1
sprite_base_pointer:
                ds.l 1
sprite_animation_frame_index:
                ds.b 1

And so on and so forth. Then we can simply access the data by referring to the labels:

                subq.w #1,lives
                beq game_over
                move.w #$3f,energy
                rts

That’s all good, but what if, in the example above, we want to have 2 players? The easy solution is to just duplicate the variables and also the controlling code that accesses them. What if we need 20 though? In this case we can use something more sophisticated (note: the code has been written with the assembler rmac in mind; for devpac/vasm follow the instructions in the comments):

                .abs 0          ; devpac: rsreset
player_lives:   ds.w 1          ; devpac: rs.w
player_energy:  ds.w 1          ; devpac: rs.w
powerup_charge: ds.b 1          ; devpac: rs.b
hidden_level_trigger:
                ds.b 1          ; devpac: rs.b
game_state:     ds.w 1          ; devpac: rs.w
sprite_base_pointer:
                ds.l 1          ; devpac: rs.l
sprite_animation_frame_index:
                ds.b 1          ; devpac: rs.b
player_data_size equ ^^abscount ; devpac: player_data_size equ rscount
                .text           ; devpac: not needed

Here we asked the assembler to create a series of symbols whose values start at 0 and increment by the size requested in each ds (rs) directive. So in practice this is player_lives equ 0, player_energy equ 2, and so on.

Now we can instantiate two or more of those structures wherever we want in RAM; in our case we could do something like the following (assuming a0 holds our buffer’s start address):

                move.l a0,player_1_data
                lea player_data_size(a0),a0
                move.l a0,player_2_data
                ...
                bss
player_1_data:  ds.l 1
player_2_data:  ds.l 1

And then we can simply use them like this:

                move.l player_1_data,a0
                bsr handle_lives
                move.l player_2_data,a0
                bsr handle_lives

                ...

handle_lives:   subq.w #1,player_lives(a0)
                beq game_over
                move.w #$3f,player_energy(a0)
                rts
                ...

This way, more reusable and flexible code can be written at the expense of a speed penalty. But in practice this shouldn’t be too bad (depending on the case of course).

Bump allocator (ask for RAM from allocated block)

Once we have obtained a block of RAM, no matter how, we can start carving it up as we want. There has been a substantial amount of research and development in this field, but for most purposes the simplest scheme will do. And that’s what is commonly known as a bump allocator.

There are many resources to consult about the bump allocator (for example Wikipedia), but in the end it boils down to this: keep a pointer to an address in RAM, starting at the top of the allocated block. Every time a routine asks for some RAM, return that address, and then increment it by the amount requested. That’s all there is to it: it’s primitive, it doesn’t support freeing RAM (except perhaps the last block allocated, and that’s mostly up to the caller), but it’s super fast and that’s what most people need most of the time. It also comes with the benefit of instantly deallocating all buffers by resetting the allocation pointer.

“Fixed” and “Dynamic” allocation

As said above, bump allocators have a hidden cost of not being able to free allocated memory easily (which might also lead to fragmentation). Suppose we have a use case where some allocated buffers have to be reused many times (let’s call them “fixed”), while others have to be freed before the fixed ones (let’s call those “dynamic”). This will lead to slightly more complicated handling code, and we don’t like complications.

One solution would be to perform the allocation of the fixed buffers before the dynamic ones, so at least we know where to reset our allocator when it’s time to let the dynamic buffers go. At best this is mildly uncomfortable, but it could even be impossible, as a fixed buffer might need to be allocated after the dynamic ones due to execution order (for example, we might not know how much memory to allocate until it’s too late).

Another idea is to have two different bump allocators, each with their own allocated block of RAM. But this immediately brings up the problem of how much RAM to allocate to each allocator. In cases where RAM availability is low, one might have to change these values quite a bit until everything fits.

A more elegant solution is inspired by The Black Lotus, and simply involves having either the fixed or the dynamic blocks grow from the start of the allocated block towards the end, with the other growing from the end towards the start (check out the original article under “Memory allocation”). So basically: 2 bump allocators, one starting from the start and the other from the end.

The following implementation was used almost verbatim in TONY - Montezuma’s Gold:

DYNAMIC_RAM_SIZE .equ $30000

; d0=how many bytes does the idiot want.
;    if negative, reserve the absolute value from the static reserve
;    (the idea here is that we reserve from the same block of ram. "dynamic" allocations
;    run from the start of the buffer towards the end, and "static" allocations start
;    from the end of the buffer towards the start. Obviously, if the static and dynamic
;    cursors cross we're out of RAM)
; On exit: d0=address or 0 on failure
ask_for_some_ram:
                lea current_dynamic_ram_pointer,a1
                tst.l d0
                bgt.s ask_for_some_ram_dynamic
                lea current_static_ram_pointer,a1
                move.l (a1),a0
                add.l a0,d0
                and.w #$fffc,d0             ; round it down to 4 bytes
                move.l d0,(a1)
                bra.s ask_for_some_ram_check_bounds
ask_for_some_ram_dynamic:
                move.l (a1),a0
                add.l a0,d0
                addq.l #3,d0                ; round it up to 4 bytes
                and.w #$fffc,d0
                move.l d0,(a1)
                move.l a0,d0                ; return value
ask_for_some_ram_check_bounds:
                move.l current_dynamic_ram_pointer,a1
                cmp.l current_static_ram_pointer,a1
                bgt.s ask_for_some_ram_fail ; static and dynamic buffers crossed, we're out of RAM
                rts
ask_for_some_ram_fail:
                moveq #0,d0                 ; insert appropriate swear word here
                rts

                .data
current_dynamic_ram_pointer:    .dc.l dynamic_ram_buffer
current_static_ram_pointer:     .dc.l dynamic_ram_buffer+DYNAMIC_RAM_SIZE

                .bss
dynamic_ram_buffer:     .ds.l DYNAMIC_RAM_SIZE>>2

                .text
free_dynamic_buffers:
                move.l #dynamic_ram_buffer,current_dynamic_ram_pointer
                rts

The code should be self explanatory, but two things are worth mentioning: a) there is bounds checking, so if the two pointers cross, the allocator is out of RAM and signals the caller by returning 0; b) RAM blocks are aligned to 4 byte boundaries to save the calling code some logic.

Initialising RAM

One common trick which is very handy for debugging (and for runtime checks too) is to fill the allocated buffer with a known pattern, for example $12345678, before handing it off to the caller. That way it becomes much easier to find cases where more RAM is written than was allocated: under a debugger, any spillover past the end of a buffer stands out, because the known pattern there will have been overwritten.

An extra trick here is for the allocator to reserve a bit more RAM than was asked for (for example 16 extra bytes), and fill that extra bit with yet another known pattern, like $87654321. Then, if that guard pattern between two consecutive buffers is found corrupted, it is an indication that the first buffer spilled over into the second. This might even prevent the program from crashing, but that is not the intended use.

(Note that these checks can also be performed at runtime, added to the code at any suspicious place; locating the source of memory spillovers then becomes easier. Alternatively, if a debugger supports it, write breakpoints can be employed.)

Linked list - Arena allocator

With more complex problems we might not know how much memory we need ahead of time. We might want to store homogeneous data (i.e. data of the same structure) or heterogeneous data (i.e. different structures). And we might not know how many of each structure we need.

One of the simpler ways to solve this problem is what is often called an arena allocator. The general idea is that each time we run out of RAM, we allocate a reasonable fixed amount and keep going.

Some code might make this concept clearer:

ARENA_SIZE=4096

; Struct 1
    .abs
struct_1_next:      ds.l 1  ; points to the next structure of this type
struct_1_field_1:   ds.l 1
struct_1_field_2:   ds.w 1
struct_1_field_3:   ds.w 1
struct_1_field_4:   ds.b 1
struct_1_field_5:   ds.b 1
struct_1_size     = ^^ABSCOUNT
    .text

; Struct 2
    .abs
struct_2_next:      ds.l 1  ; points to the next structure of this type
struct_2_field_1:   ds.w 1
struct_2_field_2:   ds.w 1
struct_2_size     = ^^ABSCOUNT
    .text

; Arena data
    .abs
arena_next:         ds.l 1  ; points to the next arena (0=no more)
arena_data_size   = ^^ABSCOUNT
    .text

;
; Our setup code
;
    move.w #struct_1_size,d7
    bsr add_struct_to_arena     ; allocate the first one
    move.l a0,struct_1_first    ; save the address of the first one
    move.l a0,a1
    moveq #120-1-1,d0           ; need to allocate 120 "struct_1"s (the first is allocated above)
allocate_1:
    bsr add_struct_to_arena     ; allocate another one
    move.l a0,struct_1_next(a1) ; link the previous one to this
    move.l a0,a1                ; the new struct becomes the list tail
    dbra d0,allocate_1
    clr.l struct_1_next(a1)     ; signal end of list

    move.w #struct_2_size,d7
    bsr add_struct_to_arena     ; allocate the first one
    move.l a0,struct_2_first    ; save the address of the first one
    move.l a0,a1
    moveq #600-1-1,d0           ; need to allocate 600 "struct_2"s (the first is allocated above)
allocate_2:
    bsr add_struct_to_arena     ; allocate another one
    move.l a0,struct_2_next(a1) ; link the previous one to this
    move.l a0,a1                ; the new struct becomes the list tail
    dbra d0,allocate_2
    clr.l struct_2_next(a1)     ; signal end of list

; We now need to allocate and append a few more "struct_1"s to the chain.
; We could have kept the last address when the first batch was added, but
; if we want to avoid that, we can now traverse the list of "struct_1"s
; until we find one which has a 0 in its "next" element
    move.l struct_1_first,a1
scan_1:
    move.l struct_1_next(a1),d0
    beq.s scan_1_done
    move.l d0,a1
    bra.s scan_1
scan_1_done:
    ; When we reach this, a1 holds the last item in the chain
    move.w #struct_1_size,d7
    moveq #93-1,d0              ; need to allocate 93 "struct_1"s
allocate_1_part_2:
    bsr add_struct_to_arena     ; allocate another one
    move.l a0,struct_1_next(a1) ; link the previous one to this
    move.l a0,a1                ; the new struct becomes the list tail
    dbra d0,allocate_1_part_2
    clr.l struct_1_next(a1)     ; signal end of list

    illegal                     ; end of demonstration, folks!

; Add struct to arena if there's free space for it, or allocate new
; Input:  d7.w=struct size
; Output: a0.l=pointer to struct's start
add_struct_to_arena:
    move.l d1,-(sp)
    move.w current_arena_free_bytes,d1
    cmp.w d7,d1
    bge.s add_struct_to_arena_go

    ; Out of size for current arena, allocate a new one
    move.l #ARENA_SIZE,d0
    bsr ask_for_some_ram                                        ; use the above allocator, or any else
    ; TODO: panic if this fails (i.e. d0=0)
    move.w #ARENA_SIZE-arena_data_size,current_arena_free_bytes ; set the number of free bytes for this arena
    move.w #arena_data_size,current_arena_offset                ; set the offset of the first free byte for this arena
    move.l current_arena_address,d1
    move.l d1,a0
    bne.s add_struct_to_arena_link                              ; skip if we already have an arena

    ; This is the first arena, so we need to do some extra steps
    move.l d0,first_arena_address   
    move.l d0,a0
    bra.s add_struct_to_arena_finalise

    ; Point current arena to new one
add_struct_to_arena_link:
    move.l d0,arena_next(a0)                                    ; point current arena to next
    move.l d0,a0

add_struct_to_arena_finalise:
    clr.l arena_next(a0)                                        ; indicate that this is the last arena in the chain
    move.l a0,current_arena_address

add_struct_to_arena_go:
    move.l current_arena_address,a0
    adda.w current_arena_offset,a0                              ; a0 now contains the start address we return
    add.w d7,current_arena_offset                               ; update offset for next addition
    sub.w d7,current_arena_free_bytes                           ; update number of free bytes
    move.l (sp)+,d1
    rts

; Static arena bookkeeping data
current_arena_address:      dc.l 0
current_arena_offset:       dc.w 0
current_arena_free_bytes:   dc.w 0
first_arena_address:        dc.l 0

struct_1_first:             ds.l 1
struct_2_first:             ds.l 1

Our scenario is that we have two structs, struct_1 and struct_2, and we require a number of copies of each. Unfortunately we don’t have any guarantee about the ordering, so we need to keep the address of each copy. This is stored in the struct_1_next or struct_2_next field as an address. So if we want to process all struct_1 copies, we can simply start from the address of the first copy, struct_1_first, do some work, then acquire the next copy by reading the current copy’s struct_1_next field. We’re at the end of our linked list when struct_1_next reads 0. See scan_1 above for an example.

Because the number of copies of each struct might be arbitrary, we allocate RAM for them using the subroutine add_struct_to_arena. This routine allocates chunks of ARENA_SIZE bytes (set to 4k in our example) and adds each struct using a bump allocator. When the current arena is exhausted, we simply allocate a new one and link the two together. The linking is not strictly necessary, but if for some reason we want to deallocate the RAM at some point and are using the system allocator, it gives us a way to walk and free the arenas (as Mfree requires the start address of each allocation).

There are more things one can do with linked lists/arenas like deleting entries, reclaiming space, sorting, etc. but these are omitted here. Some links at the end should cover these topics.

Please note: the code above has not been tested, but is more there to demonstrate the concept. If anyone discovers bugs in it, don’t hesitate to get in touch and they will be corrected.

List pools

A different way to solve the problem of the previous section is presented here. We have a pool of list items (structs) that are initially unlinked. The list is doubly linked, i.e. each item contains a “next” as well as a “previous” member. This means that the list can be traversed in both directions. In order to know where the list starts and ends, two “sentinel” items are placed at the start and the end.

As said, at first (during “reset”) the list contains only the sentinel items. Then, during “initialise”, an item is pulled from the pool of available items (the last one) and added to the chain. Then another item is pulled from the pool and added to the chain. Finally, one item is removed from the chain and placed back into the pool.

This kind of structure offers plenty of flexibility. The pool is statically allocated in this example, but it could easily be grown by allocating a new block of RAM and adding more structs to the pool (see for example the arena allocator above).

(Again, this code has not been tested and exists for demonstration purposes. Any bugs reported are appreciated)

; Reset list pools
                lea     list_pool(PC),A6
                move.w  #max_list_items*4,list_count
                moveq   #max_list_items-1,D7
                lea     list+(max_list_items-1)*my_list_len(PC),A2
pathfind_resetpool:move.l A2,(A6)+
                lea     -my_list_len(A2),A2
                dbra    D7,pathfind_resetpool

; Initialise lists
                movea.l -(A6),A3        ;grab the first item from the pool for the initial entry
                subq.w  #4,list_count
                move.l  #list_first,my_list_prev(A3) ;terminators for the first item
                move.l  #list_last,my_list_next(A3)

                move.l  A3,list_first+my_list_next ;initialise sentinels
                move.l  A3,list_last+my_list_prev


; Let's add an item from the pool
                lea     list_pool,A1 ;get a new object from the pool
                move.w  list_count,D0
                ; TODO panic if D0=0
                subq.w  #4,D0
                movea.l 0(A1,D0.w),A1
                move.w  D0,list_count
                                                    ;update pointers
                move.l  #list_last,my_list_next(A1) ;last in the list for now
                move.l  A1,list_last+my_list_prev
                move.l  A3,my_list_prev(A1)
                move.l  A3,my_list_parent(A1)
                move.l  A1,my_list_next(A3)

;
; Remove the item in a4 from the chain and return it to the pool
;
                move.l A1,A4
                lea     list_pool,A3
                move.w  list_count,D5
                move.l  A4,0(A3,D5.w)   ;release item into the pool
                addq.w  #4,D5           ;update pool counter
                move.w  D5,list_count
                movea.l my_list_prev(A4),A2 ;fixup list chain
                movea.l my_list_next(A4),A3
                move.l  A3,my_list_next(A2)
                move.l  A2,my_list_prev(A3)


                illegal     ; And that is all

                    .abs
my_list_x:          DS.W 1
my_list_y:          DS.W 1
my_list_dir:        DS.W 1
my_list_scor_mov:   DS.W 1
my_list_scor_est:   DS.W 1
my_list_parent:     DS.L 1
my_list_next:       DS.L 1
my_list_prev:       DS.L 1
my_list_len         EQU ^^ABSCOUNT
                    .text

; Space for lists
           even
list:      DS.B my_list_len*max_list_items
; Space for sentinel objects for lists
; Should be initialised before use!
list_first:DS.B my_list_len
list_last: DS.B my_list_len

list_pool: DS.L max_list_items
list_count:DS.W 1

Closing words

This article grew substantially in size from its initial conception. It is a good example of how a seemingly simple and boring subject can turn out to be very nuanced and complicated.

It is not expected that people will try to implement or use in practice all the ideas presented. At the very least this should have provided some food for thought, and possibly inspiration for people to mutate the ideas presented to suit their own needs.

Further reading

Niklas Gray has written 3 articles about data structures on the now-defunct “Our Machinery” blog: Part 1, Part 2, Part 3. Recommended.

Dmitry Soshnikov has written a more extensive article on pool allocators, well worth a read.

GGN

Breathing, Atari. The rest is optional.