assembly syntax x86

Difference between implementation and syntax


After a lot of research, I think I am starting to understand how assemblers work.

The assembler works like a compiler that compiles the code for the intended architecture, while assembly language is a general idea that is implemented differently. But I don't understand how the syntax fits in. Isn't it just the implementation?

I've searched but I can't find anything that explains how the syntax works and how it's different from the implementation.

I've seen the word "syntax" used at http://sun.hasenbraten.de/vasm/, in "How many assembly languages are there", and in many more articles.

But I still don't understand it. Mostly I find stuff like AT&T syntax vs Intel syntax (can someone explain?).

Also, a bonus question: is it possible for an assembler to support multiple architectures? If yes, how?


Solution

  • I think you are trying to overcomplicate this.

    Many of us can deal with this and, as needed, write programs in this manner:

    0xe0821003
    0xe0021003
    0xe0421003
    

    That is just writing down bits. It is tedious and increases the odds of mistakes, and it is not easily readable, so not very maintainable.

    So for those bits for that instruction set (ISA), the IP or processor vendor creates a way to communicate what was intended in a way that is human readable/writeable/maintainable.

    And that would be

    add r1,r2,r3
    and r1,r2,r3
    sub r1,r2,r3
    

    But that is for that specific target, using the inventor's recommended language. All that really matters is the machine code. Any one of us could instead create an assembly language that takes this

    bob b,c,d
    ted b,c,d
    joe b,c,d
    

    and results in the same machine code, per that assembler's language. I have seen, created and used tools that support this:

    r1 = r2 + r3
    r1 = r2 & r3
    r1 = r2 - r3
    

    as an assembly language (the machine code for the target I am thinking of actually makes this easier to write and use). We could easily make an assembler that takes that syntax and creates the same machine code as above. Nothing whatsoever is stopping us from doing that, or even from adding syntax like that to an existing assembler that supports add r1,r2,r3, so that both are supported by the same tool.
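To make that concrete, here is a minimal sketch of a toy assembler that accepts both spellings and emits the same word. The function names and the regular expressions are mine, made up for illustration; the bit layout is the ARM register form that produces the hex values shown at the top, and only add, and, and sub are handled.

```python
import re

# Opcode field (bits 24..21) for three ARM data-processing instructions.
OPCODES = {"add": 0x4, "and": 0x0, "sub": 0x2}
# Hypothetical second spelling: infix operators name the same operations.
OPERATORS = {"+": "add", "&": "and", "-": "sub"}

MNEMONIC_FORM = re.compile(r"(\w+)\s+r(\d+)\s*,\s*r(\d+)\s*,\s*r(\d+)")
INFIX_FORM = re.compile(r"r(\d+)\s*=\s*r(\d+)\s*([+&-])\s*r(\d+)")


def encode(op, rd, rn, rm):
    """Build the 32-bit word: condition 'always', three registers, flags untouched."""
    if op not in OPCODES:
        raise ValueError(f"unknown operation: {op}")
    if not all(0 <= r <= 15 for r in (rd, rn, rm)):
        raise ValueError("registers must be r0-r15")
    return 0xE0000000 | (OPCODES[op] << 21) | (rn << 16) | (rd << 12) | rm


def assemble(line):
    """Accept either spelling and return the same machine word."""
    text = line.strip()
    m = MNEMONIC_FORM.fullmatch(text)
    if m:
        return encode(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
    m = INFIX_FORM.fullmatch(text)
    if m:
        return encode(OPERATORS[m.group(3)], int(m.group(1)), int(m.group(2)), int(m.group(4)))
    raise ValueError(f"cannot parse: {line!r}")


for source in ("add r1,r2,r3", "r1 = r2 + r3"):
    print(source, "->", hex(assemble(source)))  # both print 0xe0821003
```

The parser is the only part that differs between the two syntaxes; encode() is shared, which is the whole point.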

    All that matters is the machine code: we cannot simply make up whatever bits we want and expect a target processor that is already implemented with a set of rules to change those rules (unless the processor is designed to do that, like an FPGA, but that is not what I am talking about).

    There is a gross misunderstanding about this: folks think that x86 is the only target that has different syntaxes and that every other target has exactly one syntax. The story there is Intel vs AT&T, where Intel defined and created tools that supported this:

    mov ah,05h
    

    and to a lot of us destination on the left is very natural as every math class we have ever taken uses that convention:

    add r1,r2,r3
    r1 = r2 + r3
    

    But other folks created a different assembler for the non-DOS platforms that the x86 quickly moved into (some other operating systems, but embedded in general). Perhaps because they liked having the destination last, they would rather see

    mov 05h,ah
    

    And there is nothing wrong with that other than being goofy looking:

    add r2,r3,r1
    r2 + r3 = r1
    

    It is perfectly legal to make whatever syntax you want, so long as ... you know this ... you build the right machine code.
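As a rough sketch of how shallow that particular difference is, the function below (a hypothetical helper, not taken from any real tool) rewrites a destination-first line as destination-last purely as text. It assumes a simple "mnemonic operand,operand,..." shape and ignores everything else real AT&T syntax adds, such as prefixes on registers and constants.

```python
def to_destination_last(line):
    """Rewrite 'op dst,src,...' as 'op src,...,dst' (purely textual reordering)."""
    mnemonic, _, operands = line.strip().partition(" ")
    parts = [p.strip() for p in operands.split(",")]
    # Move the first operand (the destination) to the end.
    reordered = parts[1:] + parts[:1]
    return f"{mnemonic} {','.join(reordered)}"


print(to_destination_last("add r1,r2,r3"))  # add r2,r3,r1
```

Either ordering feeds the same encoder afterwards, so the machine code does not change.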

    There is no governing body for this, like there is for some high level programming languages. At best you have a toolchain issue: you have a compiler, assembler and linker, where the output of the compiler is often assembly language that the assembler turns into objects that the linker turns into binaries, thus the term toolchain.

    The output of the compiler and the input of the assembler have to be agreed upon; usually one side dictates and the other side conforms. So if for some reason you want to slide in a different backend, you need one that conforms to the output of the compiler. You have the exact same situation between the assembler and the linker. The file format is completely arbitrary, whatever the authors choose to invent so long as it does the job, but for one tool to hand off to the other there has to be an agreed format and/or another tool to convert from one format to the other.

    So a separately developed compiler like gnu gcc wants to conform to a separately developed assembler like gnu as. That would be the closest to a governing body that dictates rules about the language. And being open source an individual can at will add a feature to one and implement the use of that feature in the other.

    Back to AT&T vs Intel; that is incorrectly perceived as the only case of assembly language differences.

    Try to assemble this perfectly legal arm code

    add r1,r2,r3 ; and r1,r2,r3
    add r1,r2,r3 @ and r1,r2,r3
    

    For each of those lines at least one tool is happy and another isn't. Or we could take the one line

    add r1,r2,r3 ; and r1,r2,r3
    

    and at least one tool gives

    0xe0821003
    0xe0021003
    

    and at least another gives this as the output

    0xe0821003
    

    (wrapped in some sort of an object file format, with these bits represented in that format)
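Here is a small sketch of why the same line comes out as two instructions in one tool and one in another. It assumes one dialect where @ starts a comment and ; separates statements (as GNU as does for ARM), and another where ; starts a comment (as ARM's own assembler does). The function and parameter names are made up for illustration.

```python
def split_statements(line, comment_char, separator=None):
    """Return the statements one source line contains under a given dialect."""
    # Everything from the comment character to the end of the line is dropped.
    code = line.split(comment_char, 1)[0]
    # Only some dialects allow several statements on one line.
    pieces = code.split(separator) if separator else [code]
    return [p.strip() for p in pieces if p.strip()]


line = "add r1,r2,r3 ; and r1,r2,r3"
print(split_statements(line, comment_char="@", separator=";"))  # two statements
print(split_statements(line, comment_char=";"))                 # one statement
```

Nothing about the instructions themselves differs here; the incompatibility is entirely in what the tool decides ; means.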

    The point being that every nuance of the language is relevant. For some the label has to start in the first column and have a colon, for others it doesn't. Some have directives that must start with a dot (.GLOBAL) and others don't (GLOBAL), and right there that code is completely incompatible without even getting into actual instructions. Then you have instruction differences. There is a VERY bad new fad of not using the register names; I can't stand it, so I might not get them right:

    add a0,v1,v2
    

    which of course makes for complete incompatibility, along with this insanity:

    mov %eax,0
    

    Decades of successful parsers and you get that lazy?

    Now we don't know what you mean by implementation. Ideally a well designed assembly language is one where each assembly language "instruction" maps to one specific machine instruction. But unfortunately we have some assembly languages that are vague and/or instruction sets that are vague.

    For example in a made-up instruction set and assembly language you might have support for

    add r0,r1,#0
    mov r0,r1
    

    and for some reason actually implement different instructions for those. Often you will see that the latter is just a pseudo-instruction for the former, but in both the assembly language and the instruction set for x86 there are a number of places where you can "implement" the programmer's intention in more than one way.

    Is that what you mean by implementation?
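The mov/add case above can be sketched as a rewrite step that runs before encoding. This is for the same made-up instruction set, so the choice that mov has no encoding of its own is an assumption for illustration, as is the function name.

```python
import re

# In this made-up instruction set, mov has no encoding of its own.
MOV_FORM = re.compile(r"mov\s+(r\d+)\s*,\s*(r\d+)")


def expand_pseudo(line):
    """Rewrite 'mov rd,rn' as the real instruction 'add rd,rn,#0'."""
    text = line.strip()
    m = MOV_FORM.fullmatch(text)
    if m:
        return f"add {m.group(1)},{m.group(2)},#0"
    return text  # already a real instruction, pass it through


print(expand_pseudo("mov r0,r1"))     # add r0,r1,#0
print(expand_pseudo("add r0,r1,#0"))  # add r0,r1,#0
```

Both source lines end up as the same machine instruction, which is one reading of "the same intention implemented more than one way".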

    Cleaner, leaner instruction sets will conserve instruction set space and not have those. Some may not have a nop, for example, and instead a tool might just use

    and r0,r0
    

    Although if they do, that means they could also have used

    and r1,r1
    

    instead, creating one assembly language instruction that can be implemented in different ways. You will also see pseudo-instructions:

    push {r1}
    

    which becomes

    stmdb r13!,{r1}
    

    because the instruction set doesn't actually have a push instruction.

    Assembly has evolved too. Hex numbers used to be written like $12 in some languages, and Intel liked 12h, but then C became popular and dominant and the tools started to support 0x12. So you can find an otherwise compatible tool family where one version didn't support 0x12 and the next version did.

    ARM did something interesting right out of the gate after being Acorn. They created a 16 bit instruction set (thumb) that mapped onto the 32 bit one: in their documentation they showed you, for each shorter instruction, the 32 bit instruction that was its exact equivalent (this can obviously only go one way).

    One way they did this was that most instructions only supported half the registers, r0-r7 instead of all of r0-r15, meaning you only needed three bits per register in the instruction rather than four. ARM also had something that was not uncommon but also not common: three register instructions like add r1,r2,r3. In a lot of older instruction sets you would write add r1,r2, with whatever the syntax was, and that implied that one operand was also the destination, r1=r1+r2. They did that for some instructions in thumb. The reason that is relevant to this discussion is that for early thumb assemblers
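A sketch of what "started to support 0x12" amounts to inside a tool: one more branch in the number parser. The function name is mine, and treating a bare number as decimal is an assumption; real assemblers each have their own rules there.

```python
def parse_number(token):
    """Accept $12, 12h and 0x12 as the same hexadecimal value."""
    text = token.strip().lower()
    if text.startswith("$"):
        return int(text[1:], 16)
    if text.startswith("0x"):
        return int(text[2:], 16)
    if text.endswith("h"):
        return int(text[:-1], 16)
    # Assumption for this sketch: anything else is plain decimal.
    return int(text, 10)


print([parse_number(t) for t in ("$12", "12h", "0x12", "18")])  # [18, 18, 18, 18]
```

Remove the 0x branch and you have the "previous version" of the tool that rejects the same source.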

    add r1,r1,r2 
    

    was illegal and you would get an error, even though in arm that was legal. Later the tools started to just support it, as the intention was understood and because arm was aiming for this unified syntax, which is just stupid; it makes things worse, not better, but whatever. So there was a day/version when a particular assembler stopped complaining about that syntax when used as thumb.

    More of an exception than a rule: arm has two, now three (well, many) instruction sets. Let's go with a specific thumb and a specific arm. So long as you stay within a subset of each, the same syntax can be used against different instruction sets (machine code), as described above

    add r1,r1,r3
    and r1,r1,r3
    sub r1,r1,r3
    
    .thumb
    add r1,r1,r3
    and r1,r1,r3
    sub r1,r1,r3
    

    gives

     0: e0811003  add  r1, r1, r3
     4: e0011003  and  r1, r1, r3
     8: e0411003  sub  r1, r1, r3
     c: 18c9      adds r1, r1, r3
     e: 4019      ands r1, r3
    10: 1ac9      subs r1, r1, r3
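The listing above can be reproduced with two small encoders sharing one front end. This is a sketch: the function names are made up, only these three operations are covered, and the thumb side uses the 16 bit forms that set flags (which is why the disassembly shows adds/ands/subs).

```python
ARM_OPCODES = {"add": 0x4, "and": 0x0, "sub": 0x2}


def encode_arm(op, rd, rn, rm):
    """32-bit form: any of r0-r15, three independent register fields."""
    return 0xE0000000 | (ARM_OPCODES[op] << 21) | (rn << 16) | (rd << 12) | rm


def encode_thumb(op, rd, rn, rm):
    """16-bit form: three-bit register fields, so only r0-r7 are reachable."""
    if max(rd, rn, rm) > 7:
        raise ValueError("these 16-bit forms only reach r0-r7")
    if op == "add":
        return 0x1800 | (rm << 6) | (rn << 3) | rd
    if op == "sub":
        return 0x1A00 | (rm << 6) | (rn << 3) | rd
    if op == "and":
        # Two-operand form: the destination is also the first source.
        if rd != rn:
            raise ValueError("16-bit and needs destination == first operand")
        return 0x4000 | (rm << 3) | rd
    raise ValueError(f"unknown operation: {op}")


for op in ("add", "and", "sub"):
    print(op, f"{encode_arm(op, 1, 1, 3):08x}", f"{encode_thumb(op, 1, 1, 3):04x}")
```

Same source text, same parsed operands, two different back ends; the .thumb directive is just what tells the tool which encoder to call from that point on.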

    Now there are arm gnu nuances in play here that lead further down this syntax rabbit hole: differences between assembly languages for a specific target (which is not x86) that vary between assemblers for that target.

    In general it makes no sense whatsoever to try to make an assembly language that has different targets, with the exception of something like the above, where you have one instruction set that was derived from another and at one time or for a while was implemented in the same core. Trying to make one syntax that makes machine code for x86 or arm, where you just change the target but use the same source, makes no sense. Why bother? The point is to make machine code, specific instructions you want to have complete control over generating. So you need target specific information in order to do that.

    If you pull back and remove target specific details then it isn't assembly language any more, it is a high level language like C or Python or Java or other. That is exactly why we have those high level languages. C came out of the early 1970s, when Unix was being developed, and it ran into this exact problem: incompatible processors, and software that had to be moved from one to the next. As the world was in rapid processor development you either had to keep re-writing the same programs in assembly, or you created a high level language, retargeted the compiler, and then ideally re-used some percentage of the "application" on the new target.

    Now there are some assembly languages, popular in some circles, that are a combination of stock assembly (let me make the machine code I wanted) and some higher level features to save on typing.

    How syntax works is you create a language that can convey an idea or thought. If I draw a box with a triangle on top, another rectangle with a wiggly line on top of that, a quarter circle with rays coming out of the corner of the paper, and two vertical lines with some circular squiggly lines above them, we all agree (no matter what our native language) that that's a house with the sun in the corner and a tree next to it.

    A SUCCESSFUL syntax is one that makes sense and is useful, and is not harder than the machine code itself.

    Implementation is simply parsing that syntax and making the machine code or data or using labels to compute portions of the instructions on a second pass or later during linking. Here again a successful syntax is one that allows us to correctly describe the machine code we wanted the tool to generate, functionally certainly and ideally specifically.
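As a sketch of the "labels on a second pass" part, here is a toy resolver. Everything about it is an assumption for illustration: the function name, a made-up target where every instruction is four bytes, and a branch spelled "b label" that ends up carrying a byte offset relative to its own address.

```python
def resolve_labels(lines, word_size=4):
    """Two passes: collect label addresses, then patch branch operands."""
    # Pass 1: walk the source, assigning an address to every instruction
    # and remembering where each label points.
    labels, statements, address = {}, [], 0
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.endswith(":"):
            labels[text[:-1]] = address
            continue
        statements.append((address, text))
        address += word_size

    # Pass 2: every label is known now, including ones defined later in
    # the file, so branch targets can be turned into relative offsets.
    output = []
    for address, text in statements:
        mnemonic, _, operand = text.partition(" ")
        if mnemonic == "b":
            offset = labels[operand.strip()] - address
            output.append((address, f"b {offset:+d}"))
        else:
            output.append((address, text))
    return output


program = ["start:", "add r1,r2,r3", "b end", "sub r1,r2,r3", "end:", "b start"]
for address, text in resolve_labels(program):
    print(f"{address:2x}: {text}")
```

The forward reference in "b end" is the reason for the second pass: when that line is first seen, the address of end is not known yet.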