Turbo C / VGA x86 assembly: Copy from RAM to VRAM

I'm just having fun with Turbo C, drawing "sprites" on an 8086/286 (emulated with PCem) with an MCGA/VGA card.

The code is compiled with Turbo C 3.0, so it should work on a real 8086 with MCGA. I'm not using VGA mode X because it is a bit complex and I don't need the extra VRAM for what I want to do. Some flickering on the screen is OK :).

In C, I have a bunch of memcpys moving data from the loaded sprite struct to VGA memory in mode 13h:

byte *VGA=(byte *)0xA0000000L;    
typedef struct tagSPRITE             
{
    word width;
    word height;
    byte *data;
} SPRITE;

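void draw_sprite(SPRITE *sprite){
    word screen_offset = 0; /* offset into video memory; 0 = top-left corner */
    int i = 0; int j = 0;
    for(j=0;j<16;j++){
        memcpy(&VGA[screen_offset],&sprite->data[i],16); /* copy one 16-pixel row */
        screen_offset+=320;  /* next scanline on the screen */
        i+=16;               /* next row of the sprite */
    }
}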

The goal is to convert that code to a dedicated assembly function to speed things up a bit.

(editor's note: this was the original asm attempt and text that an answer was based on. See the revision history to see what happened to this question. It was all removed in the last edit, making only the asker's own answer make sense, so this edit tries to make both answers make sense.)

I tried to write it in assembly with something like this, which I'm sure has huge mistakes:

void draw_sprite(SPRITE *sprite){
    asm{
        mov ax,0A000h
        mov es,ax           /* ES points to the video memory */

        mov di,0            /* ES + DI = destination video memory */
        mov si,[sprite.data]/* source memory ram ???*/
        mov cx,16           /* bytes to copy */

        rep movsb           /* move 16 bytes from ds:si to es:di (I think this is the same as memcpy)*/

        add di,320          /* next scanline in vram */         
        add si,16           /* next scanline of the sprite*/
        mov cx,16   

        rep movsb           /* memcpy */

        /*etc*/
    }
}

I know the RAM address can't be stored in a 16-bit register because it can point beyond 64k, so mov si,[sprite.data] is not going to work.

So how do I pass the RAM address to the si register (if that's possible)?

I know I have to use the ds and si registers: ds holds something like a "bank", and si can then address a 64k chunk of RAM within it (so that movsb can move from ds:si to es:di). But I just don't know how it works.
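For reference, the segment:offset arithmetic behind ds:si can be sketched in portable C. This is only a model of what the CPU does in real mode, not Turbo C code: the helper names linear_address, far_segment, far_offset and check_aliasing are made up for illustration, and the far pointer is represented as a plain 32-bit number. In Turbo C itself, the FP_SEG and FP_OFF macros from dos.h do the splitting on a real far pointer.

```c
#include <assert.h>

typedef unsigned short word;   /* assumed to be 16 bits wide */

/* Real-mode address translation: the segment is shifted left by
   4 bits (multiplied by 16) and the offset is added to it.
   The 1 MB wrap-around of a real 8086 is not modelled here. */
unsigned long linear_address(word segment, word offset)
{
    return ((unsigned long)segment << 4) + offset;
}

/* A far pointer such as 0xA0000000L keeps the segment in its
   high word and the offset in its low word. */
word far_segment(unsigned long far_ptr)
{
    return (word)(far_ptr >> 16);
}

word far_offset(unsigned long far_ptr)
{
    return (word)(far_ptr & 0xFFFFUL);
}

/* Different segment:offset pairs can name the same byte:
   A000:0010 and A001:0000 both translate to 0xA0010. */
void check_aliasing(void)
{
    assert(linear_address(0xA000, 0x0010) == linear_address(0xA001, 0x0000));
}
```

So the "bank" idea is close: the segment register selects where a 64k window starts (in steps of 16 bytes), and the 16-bit offset register moves inside that window. Loading a far pointer into ds:si means putting its high word into ds and its low word into si.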

I also wonder whether that asm code would be faster than the C code (on an 8 MHz 8086, or a 286), because you don't have to repeat the first part on every loop iteration.

I'm not copying from VRAM to VRAM for the moment, because I'd have to use mode X and that's another story.

Solution

Thanks to Michael Petch, Peter Cordes, and everybody. I got the answer.

The assembly code to copy data to the VGA video memory looks like this:

DGROUP          GROUP    _DATA, _BSS
_DATA           SEGMENT WORD PUBLIC 'DATA'
_DATA           ENDS
_BSS            SEGMENT   WORD PUBLIC 'BSS'             
_BSS            ENDS
_TEXT           SEGMENT BYTE PUBLIC 'CODE'
                ASSUME CS:_TEXT,DS:DGROUP,SS:DGROUP

            PUBLIC _draw_sprite       
_draw_sprite    proc    far 
    push bp
    mov bp,sp
    push ds
    push si
    push di
    ;-----------------------------------
    lds     bx,[bp+6]
    lds     si,ds:[bx+4]        ; sprite->data to ds:si
    mov     ax,0A000h
    mov     es,ax                       
    mov     di,0                ; VGA[0] to es:di

    mov     ax,16               ; 16 scan lines
copy_line:  
    mov     cx,8
    rep     movsw               ; copy 16 bytes from ds:si to es:di
    add     di,320-16           ; go to next line of the screen
    dec     ax
    jnz     copy_line
    ;-----------------------------------
    pop di
    pop si
    pop ds
    mov sp,bp
    pop bp
    ret                         ; assembled as a far return inside a far proc
_draw_sprite    endp
_TEXT           ENDS
                END

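Declare the function in C as follows (the proc far and the two lds instructions assume far code and far data pointers, e.g. the large memory model):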

    void draw_sprite(SPRITE *spr);

The data stored at spr->data is an array of bytes (values from 0 to 255, each one the color of a pixel).

That code finally draws the 16x16 bitmap at position x = 0, y = 0.
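To draw somewhere other than the top-left corner, the only thing that changes is the starting offset: in mode 13h, pixel (x, y) lives at offset y*320 + x, so in the assembly version di would start at that value instead of 0. Here is a minimal sketch of the same row-by-row copy in portable C. It writes into an ordinary buffer passed in as screen rather than into real video memory, and draw_sprite_at, SCREEN_WIDTH and SCREEN_HEIGHT are names I made up for the example; there is no clipping, only an assert that the sprite fits.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_WIDTH  320
#define SCREEN_HEIGHT 200

typedef unsigned char  byte;
typedef unsigned short word;

typedef struct tagSPRITE
{
    word width;
    word height;
    byte *data;
} SPRITE;

/* Copy a sprite row by row into a 320x200 buffer with one byte
   per pixel. 'screen' stands in for the VGA pointer. */
void draw_sprite_at(byte *screen, const SPRITE *sprite, word x, word y)
{
    word row;
    /* largest value is 199*320 + 319 = 63999, which fits in 16 bits */
    unsigned int dst = (unsigned int)y * SCREEN_WIDTH + x;
    unsigned int src = 0;

    /* no clipping in this sketch: the sprite must fit on the screen */
    assert(x + sprite->width <= SCREEN_WIDTH);
    assert(y + sprite->height <= SCREEN_HEIGHT);

    for (row = 0; row < sprite->height; row++) {
        memcpy(&screen[dst], &sprite->data[src], sprite->width);
        dst += SCREEN_WIDTH;      /* next scanline of the screen */
        src += sprite->width;     /* next row of the sprite */
    }
}
```

Note the difference from the assembly loop: movsw advances di as it copies, which is why the asm adds 320-16 after each row, while this version keeps the row start in dst and adds the full 320.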

Thanks a lot!