|
|||
|
I'm writing a piece of code for real-mode 16-bit x86 that needs to
process a sequence of bytes and do something with their values based on whether or not they are 00, FF, or neither. Jumping based on the zero flag and/or sign flag was my first idea, but MOV and LODS don't alter flags. Loading to CX and using JCXZ works for zero values, but not signed (FF) values. So my next thought was to just ensure that a target register is empty, so that a math op will alter the flags for me: XOR reg,reg ADD reg,mem ....but I'm optimizing for speed (this is in a loop) and it irks me to have to zero the register. Is the above the fastest way to do what I want, or am I missing something obvious? (For those who enjoy optimizing the core problem itself instead of assembly, the goal is to load bytes in sequence and add their values to a target register. The rules: 1. If a loaded value is 0, STOP reading. 2. If a loaded value is FF, add to target register and keep reading. 3. If a loaded value is not FF, add to target register and STOP reading. There is probably a way to combine rules #1 and #3 such that they use the same branch or something, but so far I'm not seeing it.) |
|
|
||||
|
||||
|
|
|
|||
|
Jim Leonard wrote:
> I'm writing a piece of code for real-mode 16-bit x86 that needs to > process a sequence of bytes and do something with their values based > on whether or not they are 00, FF, or neither. Jumping based on the > zero flag and/or sign flag was my first idea, but MOV and LODS don't > alter flags. Loading to CX and using JCXZ works for zero values, but > not signed (FF) values. So my next thought was to just ensure that a > target register is empty, so that a math op will alter the flags for > me: > > XOR reg,reg > ADD reg,mem > > ...but I'm optimizing for speed (this is in a loop) and it irks me to > have to zero the register. Is the above the fastest way to do what I > want, or am I missing something obvious? > > (For those who enjoy optimizing the core problem itself instead of > assembly, the goal is to load bytes in sequence and add their values > to a target register. The rules: > 1. If a loaded value is 0, STOP reading. > 2. If a loaded value is FF, add to target register and keep reading. > 3. If a loaded value is not FF, add to target register and STOP > reading. > There is probably a way to combine rules #1 and #3 such that they use > the same branch or something, but so far I'm not seeing it.) test reg, reg je 1f inc reg jne 0f add tgt, 0xFF jmp read 0:add tgt, -1(reg) 1:... Greetings from Augsburg Bernhard Schornak |
|
|||
|
On Jul 11, 2:54*pm, Bernhard Schornak <schor...@nospicedham.web.de>
wrote: > * *test reg, reg > * *je * 1f > * *inc *reg > * *jne *0f > * *add *tgt, 0xFF > * *jmp *read So the main answer to my question is that this: mov reg,mem test reg,reg ....is the most direct way to set flags. On my target platform it is 1 cycle faster than: xor reg,reg add reg,[mem] ....so I'll use that. Yes, I am not very bright. |
|
|||
|
On Jul 11, 8:21*pm, Jim Leonard <mobyga...@nospicedham.gmail.com>
wrote: > I'm writing a piece of code for real-mode 16-bit x86 that needs to > process a sequence of bytes and do something with their values based > on whether or not they are 00, FF, or neither. *Jumping based on the > zero flag and/or sign flag was my first idea, but MOV and LODS don't > alter flags. *Loading to CX and using JCXZ works for zero values, but > not signed (FF) values. *So my next thought was to just ensure that a > target register is empty, so that a math op will alter the flags for > me: > > XOR reg,reg > ADD reg,mem > > ...but I'm optimizing for speed (this is in a loop) and it irks me to > have to zero the register. *Is the above the fastest way to do what I > want, or am I missing something obvious? > > (For those who enjoy optimizing the core problem itself instead of > assembly, the goal is to load bytes in sequence and add their values > to a target register. *The rules: > 1. If a loaded value is 0, STOP reading. > 2. If a loaded value is FF, add to target register and keep reading. > 3. If a loaded value is not FF, add to target register and STOP > reading. > There is probably a way to combine rules #1 and #3 such that they use > the same branch or something, but so far I'm not seeing it.) Since adding 0 does nothing; 1 read, add to target 2 if FF goto 1 lea si, mem xor bx, bx @@1: lods byte add bx, ah cmp ah, ffh je @@1 |
|
|||
|
Alex McDonald wrote:
> On Jul 11, 8:21 pm, Jim Leonard <mobyga...@nospicedham.gmail.com> >> 1. If a loaded value is 0, STOP reading. >> 2. If a loaded value is FF, add to target register and keep reading. >> 3. If a loaded value is not FF, add to target register and STOP >> reading. >> There is probably a way to combine rules #1 and #3 such that they use >> the same branch or something, but so far I'm not seeing it.) > > Since adding 0 does nothing; > > 1 read, add to target > 2 if FF goto 1 > > lea si, mem > xor bx, bx > @@1: lods byte > add bx, ah That's an illegal register combination, won't assemble... > cmp ah, ffh > je @@1 > First, there is no need to actually count the number of FF bytes, that will drop out from the offset between the start and the end address, right? Second, any byte value except FF will cause a STOP, so that's the only time-critical test: mov dx,si ; Save starting point next2: lodsw inc ax jz next2 ; Loop while we get FFFF ; At this point we know that we didn't get two FF bytes, ; so adjust the SI pointer/counter dec si dec si ; Was the first byte 0? cmp al,1 je STOP ; Count the non-zero byte inc si ; Was it not FF (which would have turned into 0)? jnc STOP ; The first byte was FF, must check the second: ; It cannot have been FF as well, so 0 or not cmp ah,1 je STOP inc si STOP: sub si,dx You can also consider using REPE SCASW instead of the initial load loop! Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |
|
|||
|
Jim Leonard wrote:
> On Jul 11, 2:54 pm, Bernhard Schornak <schor...@nospicedham.web.de> > wrote: >> test reg, reg >> je 1f >> inc reg >> jne 0f >> add tgt, 0xFF >> jmp read > > So the main answer to my question is that this: > > mov reg,mem > test reg,reg > > ...is the most direct way to set flags. On my target platform it is 1 > cycle faster than: > > xor reg,reg > add reg,[mem] > > ...so I'll use that. Yes, I am not very bright. It depends on the processor. If the "read" label looks like read:mov reg, [mem] test reg, reg then this read:mov reg, [mem] 0:test reg, reg je 2f inc reg jne 1f mov reg, [mem] add tgt, 0xFF jmp 0b 1:add tgt, -1(reg) 2:... separates the read from the compare and might save a cycle, as well (the latency for the read is partially "hidden"). Greetings from Augsburg Bernhard Schornak |
|
|||
|
Jim Leonard wrote:
> Bernhard Schornak wrote: > > * *test reg, reg > > * *je * 1f > > * *inc *reg > > * *jne *0f > > * *add *tgt, 0xFF > > * *jmp *read > > So the main answer to my question is that this: > > mov reg,mem > test reg,reg > > ...is the most direct way to set flags. On my target platform it is 1 > cycle faster than: > > xor reg,reg > add reg,[mem] > > ...so I'll use that. Yes, mov/test would be the most direct way to test for 0. But as Alex pointed out, you don't need to test for 0, you only need to test for ff. The easiest way to test for ff is inc. So: 1: lodsb add inc al jz 1b It wasn't clear whether you wanted an 8- or 16-bit add, and if 16-bit, whether the byte should be sign extended or not. If you need a 16-bit add, you probably want to use movzx or movsx after (or instead of) lodsb. However, having said all that, I like Terje's suggestion better--just loop over all the ff bytes, take the difference between the beginning and ending address as the number of ff's to add, and then add the terminating byte (whether it is 0 or not, doesn't matter). |
|
|||
|
Philip Lantz wrote:
> However, having said all that, I like Terje's suggestion better--just > loop over all the ff bytes, take the difference between the beginning > and ending address as the number of ff's to add, and then add the > terminating byte (whether it is 0 or not, doesn't matter). Indeed. I suggested using REPE SCAS at the end of my post, I'm now fairly confident that this will be the fastest on a 16-bit x86 platform simply due to smaller code: ; Initial environment setup code cld ; Might not be needed if the state is known! push ds pop es ; ES might already be equal to DS? ; Get ready to scan past up to 64K 0FFh bytes mov ax,0ffh ; Setting AH = 0 at the same time... or cx,-1 ; Shorter than MOV CX,0FFFFh repe scasb je all_ff ; The count of FF bytes is 0FFFFh - CX ; Check the last byte, should it be counted? cmp ah,[di-1] ; Will generate a carry IFF the byte isn't 0 sbb cx,0 ; Decrement CX if the STOP byte wasn't 0 all_ff: or ax,-1 ; MOV AX,0FFFFh sub ax,cx Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |
|
|||
|
On Jul 11, 10:59*pm, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote: > Alex McDonald wrote: > > On Jul 11, 8:21 pm, Jim Leonard <mobyga...@nospicedham.gmail.com> > >> 1. If a loaded value is 0, STOP reading. > >> 2. If a loaded value is FF, add to target register and keep reading. > >> 3. If a loaded value is not FF, add to target register and STOP > >> reading. > >> There is probably a way to combine rules #1 and #3 such that they use > >> the same branch or something, but so far I'm not seeing it.) > > > Since adding 0 does nothing; > > > 1 read, add to target > > 2 if FF goto 1 > > > * * * lea *si, mem > > * * * xor *bx, bx > > @@1: lods byte > > * * * add *bx, ah > > That's an illegal register combination, won't assemble... I gave up on 16 bit code a long time ago... :-) Replace with [xor ax, ax] before the loop and [add bx, ax] > > > * * * cmp *ah, ffh > > * * * je * @@1 > > First, there is no need to actually count the number of FF bytes, that > will drop out from the offset between the start and the end address, right? Then I misinterpreted Jim's description; I thought he required adding values, as in FF FF FF FE -> 1023 dec, as opposed to FF FF FF FE -> 4. <snip> |
|
|||
|
Alex McDonald wrote:
> On Jul 11, 10:59 pm, Terje Mathisen <"terje.mathisen at >> First, there is no need to actually count the number of FF bytes, that >> will drop out from the offset between the start and the end address, right? > > Then I misinterpreted Jim's description; I thought he required adding > values, as in FF FF FF FE -> 1023 dec, as opposed to FF FF FF FE -> 4. That was possible, from the initial description, but it doesn't matter: The sum of N bytes all equal to FF is fairly easy to calculate. :-) sum = (N << 8) - N It also makes the final addition even simpler, since adding 0 is a NOP sum += last_byte Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |
|
|||
|
On Jul 11, 8:21*pm, Jim Leonard <mobyga...@nospicedham.gmail.com>
wrote: .... > (For those who enjoy optimizing the core problem itself instead of > assembly, the goal is to load bytes in sequence and add their values > to a target register. *The rules: > 1. If a loaded value is 0, STOP reading. > 2. If a loaded value is FF, add to target register and keep reading. > 3. If a loaded value is not FF, add to target register and STOP > reading. > There is probably a way to combine rules #1 and #3 such that they use > the same branch or something, but so far I'm not seeing it.) Yes, it's often better if you can reduce the number of branches. In this case you should be able to do all the tests and the looping with just one branch. Since you mention 16-bit mode how about this: xor ax, ax ;To hold each value 0 to 255 as 16-bit xor dx, dx ;Sum loop: mov al, [si] inc si add dx, ax cmp al, 0xff jne loop If you wanted an 8-bit sum instead: xor ah, ah ;AH = sum loop: mov al, [si] inc si add ah, al cmp al, 0xff jne loop You could replace the mov al, [si]; inc si with a lodsb instruction but it would be slower on certain generations of CPU. If you have any significant section of memory to scan or are running on a very early CPU the routine might be dominated by the memory read time so you won't get it to be any faster. If you have only a small amount of memory to read be careful you don't try to optimise too much. Instead, it may be better to keep it simple and easy to maintain. James |
|
|||
|
On Jul 12, 6:15*am, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote: > > However, having said all that, I like Terje's suggestion better--just > > loop over all the ff bytes, take the difference between the beginning > > and ending address as the number of ff's to add, and then add the > > terminating byte (whether it is 0 or not, doesn't matter). > I suggested using REPE SCAS at the end of my post, I'm now fairly > confident that this will be the fastest on a 16-bit x86 platform simply > due to smaller code: Thanks to everyone here for the great suggestions. I appreciate pointing out the INC suggestion to pick up FF, and I had not considered the REPE SCAS suggestion AT ALL which is fun to think about. For the curious, this is part of decompression code I'm writing (I cannot change the format). The code will populate CX which will be used to copy literal or match data. |
|
|||
|
On Jul 12, 6:24*am, Alex McDonald <b...@nospicedham.rivadpm.com>
wrote: > Then I misinterpreted Jim's description; I thought he required adding > values, as in FF FF FF FE -> 1023 dec, as opposed to FF FF FF FE -> 4. No, you were right: FF FF FF FE -> 1023 dec. |
|
|||
|
On Jul 12, 6:15*am, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote: > * *or cx,-1 * * ; Shorter than MOV CX,0FFFFh Wait, what? -a 0100 13A8:0100 mov cx,ffff 13A8:0103 or cx,-1 13A8:0107 -u 0100 13A8:0100 B9FFFF MOV CX,FFFF 13A8:0103 81C9FFFF OR CX,FFFF |
|
|||
|
On Jul 12, 2:42*pm, Jim Leonard <mobyga...@nospicedham.gmail.com>
wrote: > On Jul 12, 6:24*am, Alex McDonald <b...@nospicedham.rivadpm.com> > wrote: > > > Then I misinterpreted Jim's description; I thought he required adding > > values, as in FF FF FF FE -> 1023 dec, as opposed to FF FF FF FE -> 4. > > No, you were right: *FF FF FF FE -> 1023 dec. Wait, no, I'm wrong. A sequence of FF FF FF FE should result in 255+255+255+254=1019. And for extra trouble, this value needs to land in CX because it is going to be used to immediately copy stuff. Which is why I'm re- thinking Terje's REPE SCAS idea, because once the last FF byte is found, there's data now clogging up CX that I need to deal with. Actually... I guess it's not that big a deal, as both BP and DX are still free in this loop I'm designing and I can just use them as scratch. |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|