Aptarimas:Matematika/Sinuso Integralas
popular example of Fourier integral
Given the function
- f(x)=1, when |x|<1,
- f(x)=1/2, when |x|=1,
- f(x)=0, when |x|>1,
- we need to write this function as a Fourier integral.
- The solution for this example:
- In the particular case x=0 (|x|<1), then
- and we put 0 in the place of x and we get; and so we have:
- As far as I understand the Fourier integral, this integral means:
- But the problem is that I checked it with the Free Pascal program "Version 1.0.12 2011/04/23; Compiler Version 2.4.4; Debugger GDB 7.2" using this code:
var a:longint; c:real; begin c:=0; a:=0; for a:=1 to 100000 do c:=c+sin(a)/a; writeln(c); readln(); end.
- so I get the result 1.07080565212341. Not even close to pi/2. — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 22:52, 22 December 2011 (UTC) BTW, you can use it as a benchmark by changing the number in the line "for a:=1 to 100000" to something bigger than 100000. With the number 1000000000 I got 1.07079632630307 and it took the CPU 52 seconds to compute the result. You can use "a:integer" instead of "a:longint", but then you will only be able to choose a smaller number. With the number 100000000 it took only 5 seconds and the result is 1.07079633477997.
- It appears to be just a simple mistake: the sum of sin(a)/a over integer a is not the integral of sin(x)/x. In fact the series sum of sin(a)/a for a = 1, 2, 3, ... converges to (pi-1)/2 = 1.0707963..., which is exactly the value the program returns, so the result does have something to do with Fourier series coefficients. To approximate the integral, the step must be small, so the real code is:
var b:real; a:longint; begin b:=0; a:=0; for a:=1 to 1000000000 do b:=b+0.00001*sin(a/100000)/(a/100000); writeln(b); readln(); end.
- And the result is 1.57088654523321 after 63 seconds.
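As a cross-check outside Free Pascal, here is a short Python sketch of the same Riemann sum (my addition; the step and upper limit are chosen smaller than in the Pascal runs so it finishes quickly):

```python
import math

# Riemann sum approximating the integral of sin(x)/x from 0 to n*h,
# the same computation as the Pascal loop above, just in Python.
def si_riemann(h, n):
    return sum(h * math.sin(k * h) / (k * h) for k in range(1, n + 1))

# Integrate up to x = 1000 with step 0.001; the exact limit is pi/2.
print(si_riemann(0.001, 1_000_000), math.pi / 2)
```

The truncated tail oscillates with an envelope of about 1/x, so stopping at x = 1000 already agrees with pi/2 to about three decimal places.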
- [Small fix, 2024] The above code looks strange. It gives the result 1.57088654523321 after 63 seconds (on a 2.6 GHz CPU). On a 4.16 GHz CPU it gives the result [1.5708865452332146E+000] the first time after about 2 minutes 25 seconds (about 145 seconds) with the CPU heavily loaded by three internet browsers (the first Free Pascal run always takes longer than the second, third and later ones). Launching the same code a second time on the 4.16 GHz CPU gives [1.5708865452332146E+000] after 51 seconds with two fairly heavily loaded browsers (I closed one browser, but nothing was playing in them anyway, just open YouTube and other pages).
- This Free Pascal code:
var b:real; a:longint; begin b:=0; for a:=1 to 1000000000 do b:=b+0.000000001*sin(a*0.000000001)/(a*0.000000001); writeln(b); readln(); end.
- gives the result [9.4608307028940319E-001, which means 9.4608307028940319/10] after 2 minutes 10-15 seconds (130-135 seconds) the first time on the 4.16 GHz CPU (with two heavily loaded internet browsers). So I now understand that the earlier code [which gives ~pi/2 after 63 seconds on the 2.6 GHz CPU] is correct and this code is wrong: with step 0.000000001 and 10^9 iterations the sum only reaches x = 10^9 * 10^-9 = 1, so it computes Si(1) = 0.946083..., not the full integral pi/2. Launching this code a second time gives [9.4608307028940319E-001] after 40 seconds on the 4.16 GHz CPU.
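That value is Si(1), the sine integral at 1. A small Python check (my addition, not part of the Pascal experiments) computes Si(1) from the term-by-term integrated Taylor series:

```python
import math

# Si(1) = integral of sin(t)/t dt from 0 to 1
#       = sum_{n>=0} (-1)^n / ((2n+1) * (2n+1)!)   (integrate the series termwise)
si1 = sum((-1) ** n / ((2 * n + 1) * math.factorial(2 * n + 1)) for n in range(10))
print(si1)  # about 0.9460830703671830
```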
- This Free Pascal code:
var b:real; a:longint; begin b:=0; for a:=1 to 1000000000 do b:=b+0.0000001*sin(a*0.01)/(a*0.01); writeln(b); readln(); end.
- gives the result [1.5657964177342434E-005, which means 1.5657964177342434/100000] after 2 minutes 19 seconds (139 seconds) the first time on the 4.16 GHz CPU (with one heavily loaded internet browser, but nothing playing). The second time this code gives the result [1.5657964177342434E-005] after 46 seconds on the 4.16 GHz CPU. Note that the weight 0.0000001 does not match the grid spacing 0.01, so the whole sum is scaled down by 10^-5: the result is roughly 10^-5 * pi/2 minus a discretization error of about 0.01/2 = 0.005, i.e. about 1.5658 * 10^-5.
- This code:
var b:real; a:longint; begin b:=0; for a:=1 to 1000000000 do b:=b+0.000001*sin(a*0.000001)/(a*0.000001); writeln(b); readln(); end.
- gives the result [1.5702326223791097E+000] the first time after 2 minutes 15 seconds (135 seconds) on the 4.16 GHz CPU (with one heavily loaded internet browser, but nothing playing). Launching this code a second time gives [1.5702326223791097E+000] after 41 seconds under the same conditions.
- There are 3 multiplications, 1 division, 1 addition and one sine evaluation in each iteration, and there are a billion iterations (10^9). The run spent 41*4.16*10^9 = 170,560,000,000 = 1.7056 * 10^11 cycles. If a multiplication or an addition takes about 4 cycles each, then each iteration spends 4*3 + 4*1 = 16 cycles on them. Here I get that one division needs about 25 cycles on the 4.16 GHz CPU (though it may actually divide faster, since some cycles go to the addition and to the loop itself; without the addition it may be closer to 25 - 4 = 21 cycles per division). The total budget is (1.7056 * 10^11)/(10^9 [iterations]) = 170.56 cycles per iteration. Subtracting 16 cycles for the multiplications and the addition, and 21 cycles for the division, we get:
- 170.56 - 16 - 21 = 133.56 =~ 133 cycles for one sin(x) evaluation. This was done on an AMD FX(tm)-8350 Eight-Core Processor at 4.00 GHz (officially it boosts to 4.2 GHz, but Windows 10 Task Manager shows 4.16 GHz when it is loaded with calculations). BTW, AMD was sued over calling this CPU 8-core, because it cannot run tasks designed for 8 independent cores, only tasks designed for about 4. In any case, I think Free Pascal always uses only one core. This CPU is equipped with 8 GB DDR3-1600 (800 MHz) dual-channel RAM.
- This Free Pascal code:
var b:real; a:longint; begin b:=0; for a:=1 to 1000000000 do b:=b+0.0001*sin(a*0.0001)/(a*0.0001); writeln(b); readln(); end.
- gives the result [1.5707563204155381E+000] after 2 minutes 16 seconds (136 seconds) on the first launch on the 4.16 GHz CPU (with one heavily loaded internet browser, but nothing playing). The second and third launches give [1.5707563204155381E+000] after 43-44 seconds under the same conditions. This is the best code of all the experiments, giving the most accurate result. For example, code with the line:
b:=b+0.001*sin(a*0.001)/(a*0.001);
- gives the result 1.5702953898705860E+000, which is less accurate, because pi/2 =~ 1.5707963267948966.
- Only the very first code (run on the 2.6 GHz CPU) with the line
b:=b+0.00001*sin(a/100000)/(a/100000);
- which is equal to the line:
b:=b+0.00001*sin(a*0.00001)/(a*0.00001);
- gives the result 1.5708865452332150E+000, almost as good as the code just mentioned.
Sine Taylor series benchmarking
- The sine function can be written as a Taylor series. Here we have 14 digits after the point (15 in total, and one extra for rounding the last digit, which is not shown and exists only in memory). If one decimal digit is 4 bits, that can be 16*4 = 64 bits of precision.
- Using the Windows calculator I checked particular values of this series.
Some stupid infinite Fourier integral problem (the Fourier integral must be pseudoscience)
var b:real; a:longint; begin b:=0; a:=0; for a:=1 to 1000000000 do b:=b+0.00001*cos(1.1*a/100000)*sin(a/100000)/(a/100000); writeln(b); readln(); end.
- It just always gives a different result, no matter how close to infinity you choose the upper limit of a to be.
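This behaviour is actually expected. By Dirichlet's discontinuous factor, the integral of cos(w*t)*sin(t)/t over t from 0 to infinity equals pi/2 for |w| < 1, pi/4 for |w| = 1 and 0 for |w| > 1; so for w = 1.1 the limit is 0, and truncated sums merely oscillate around it with slowly shrinking swings. A Python sketch (my addition) of the truncated sums:

```python
import math

# Truncations of the integral of cos(1.1*t)*sin(t)/t at different upper
# limits: they hover around the true value 0 with slowly decaying swings,
# which is why every choice of "near infinity" gives a different number.
def truncated(h, n):
    return sum(h * math.cos(1.1 * k * h) * math.sin(k * h) / (k * h)
               for k in range(1, n + 1))

for n in (200_000, 400_000, 800_000):
    print(n, truncated(0.001, n))
```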
Simplest benchmark
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+a; writeln(c); readln(); end.
- The result comes after 6 seconds on a ~3 GHz processor. Notice that in this case it is only about twice as fast as on a 1.6 GHz Intel Atom processor, because it depends neither on the number of cores nor on instruction sets or cache sizes. A 3 GHz Pentium III would calculate it in the same time.
- These are integers of 16 decimal places (the 16th digit is for rounding at the end and is not shown). One decimal digit is 4 bits, so 64 bits of precision in total. There are exactly one billion additions in 6 seconds, i.e. 166 million additions per second. Counting bits, that is 166*64 = 10624 million, or about 10 billion, digit-additions per second. That is 10624/8 = 1328 megabytes per second, or 1.3 GB/s. For now that seems like nothing that (800 MHz * 64 bits)/8 = 6400 MB/s RAM memory could not handle.
- It is an interesting coincidence that if at 3 (GHz) it is done in 5 (seconds), then at 1 (GHz) it would be done in 15 (seconds), and we have 15 decimal places. So I suggest that in one cycle (takt) the CPU makes one decimal-digit addition (like 4+7 or 8+6 or 4+5). I suggest that in one cycle (a 3 GHz CPU has 3 billion cycles/s) it can do either one addition, one subtraction, one multiplication or one division (maybe subtraction takes 2 cycles and multiplication 3, but maybe not). From this one could conclude that Bill Gates is not using MMX (64-bit), SSE (128-bit) or AVX (256-bit) instructions, and that they sit in some kind of BIOS or ROM memory but have little influence in practice, since all the old programs like Visual C++ and Windows run on instructions [software codes] from before MMX, SSE and AVX were introduced. My drift is that Intel's various SSE instructions do not add physical calculation units, but are just smarter vector-calculation codes than the kind you could write yourself, and can thus be faster.
Simplest benchmark 2
- Strangely, even this harder benchmark gives its result in 3-4 seconds:
var a:longint; c,b,d,e,f,g:real;
begin
for a:=1 to 1000000000 do
b:=a; { note: the for loop has no begin..end, so only this assignment repeats; the lines below run once }
d:=a*b; //a^2
e:=d*d; //a^4
f:=e*e; //a^8
g:=f*f; //a^16
// c:=c+sin(a)/a;
c:=c+(a*g);
writeln(c); readln();
end.
Something wrong
var a:longint; c,b,d,e,f,g:real;
begin
for a:=1 to 1000000000 do
b:=a; { note: the for loop has no begin..end, so only this assignment repeats; the sum below is computed once }
d:=a*b; //a^2
e:=d*d; //a^4
f:=e*e; //a^8
g:=f*f; //a^16
// c:=c+sin(a)/a;
c:=c+(1-a*d/6+a*e/120-a*d*e/5040+a*f/362880-a*d*f/39916800+a*e*f/6227020800-(g/a)/1307674368000+a*g/355687428096000-a*d*g/121645100408832000)/a;
writeln(c); readln();
end.
- The result should be , but for some reason is (this result is obtained after 3-4 seconds).
Something wrong 2
- This also gives a wrong result:
var a:longint; c,b,d,e,f,g,h:real;
begin
for a:=1 to 1000000000 do
b:=a/100000; { note: the for loop has no begin..end, so only this assignment repeats }
d:=a*b/100000; //a^2
e:=d*d; //a^4
f:=e*e; //a^8
g:=f*f; //a^16
h:=1-b*d/6+b*e/120-b*d*e/5040+b*f/362880-b*d*f/39916800+b*e*f/6227020800-(g/b)/1307674368000+b*g/355687428096000-b*d*g/121645100408832000;
c:=c+0.00001*h/(a/100000);
writeln(c); readln();
end.
- The result comes after 10 seconds, but it should be
- Update: a silly error (it must be b instead of 1),
h:=b-b*d/6+b*e/120-b*d*e/5040+b*f/362880-b*d*f/39916800+b*e*f/6227020800-(g/b)/1307674368000+b*g/355687428096000-b*d*g/121645100408832000;
- but a much longer series is still needed (with terms up to a much higher order) to get even the first two decimal places right. It seems there must be some trick for big numbers that makes the series much, much shorter.
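The reason the series must be so long is that the terms x^(2n+1)/(2n+1)! only start shrinking once 2n+1 exceeds x, and for large x the huge intermediate terms also eat the floating-point precision. A short Python illustration (my addition):

```python
import math

# Summing the raw sine Taylor series: fine for small x, hopeless for big x
# unless very many terms are used (and even then some precision is lost).
def sin_taylor(x, terms):
    s, term = 0.0, x
    for n in range(terms):
        s += term
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return s

print(sin_taylor(1.0, 10), math.sin(1.0))    # matches to ~15 digits
print(sin_taylor(30.0, 20), math.sin(30.0))  # truncated too early: huge garbage
print(sin_taylor(30.0, 60), math.sin(30.0))  # converges again, some digits lost
```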
Only sin(1) is quite precise with short series
I checked that
- and
- while the precise result is
- Update. With any small number (from 0 to 2*pi) you need to calculate only a few terms to get 10 decimal places of precision. So the smart choice is to divide the big number by 2*pi (or by pi/2), remove the fractional part, multiply the integer part by 2*pi (or by pi/2), and then subtract the obtained result from the initial big number. Then you have a number from 0 to 2*pi (or to pi/2), which makes the Taylor series calculation short and simple.
- Update 2. It should be enough to divide by 2*pi; then there should not be any errors. Remove the fractional part, multiply the integer part by 2*pi and subtract the obtained result from the big number. This gives about 5 correct decimal places if you calculate only up to a modest term, but it is still much shorter than for very big numbers. For example,
- which is quite close to
simplest benchmark 2.1
var a:longint; c,b,d,e,f,g,h:real; begin for a:=900000000 to 1000000000 do c:=c+a; writeln(c); readln(); end.
- The result comes in ~1 second. That is 10^8 addition operations.
simplest benchmark 3
var a:longint; c,b,d,e,f,g,h:real; begin for a:=500000000 to 1000000000 do c:=c+a; writeln(c); readln(); end.
- The result comes in 2 or 3 seconds on a ~3 GHz CPU. That is 5*10^8 addition operations. Notice that on average it is done at a precision of 16 decimal digits (15, plus one for rounding, I think), because, I think, the CPU does not use the full 15-16 digit precision while the numbers are still small. Putting this into binary form, I have in mind that 52 binary digits correspond to about 16 decimal digits, and here we also have 8 decimal places of integer. So if the CPU adds bit by bit, a 5.2 GHz CPU would be needed to accomplish this task. My suggestion is that the CPU adds 0s faster than 1s. Or the cycle (all CPUs do ~3*10^9 cycles/s) was defined by the first CPU designers as the addition of one decimal digit to another, and if the precision is 16 decimal digits then there are 16 such additions. Now you know what a cycle means, because a cycle is one addition of one digit (from 0 to 9) to another digit (from 0 to 9). So, for example, 64-bit (double precision) 2*10^12 floating-point operations per second (2 TFLOPS) means that many operations (additions, for example) per second on 16-decimal-digit (64-bit) numbers. For example, AMD claims that the Radeon HD 6970 has 683 GFLOPS of double-precision compute power (and 2.7 TFLOPS single precision). So 683/16 = 42.6875 billion additions/s of 16-decimal-digit numbers. Also 2700/8 = 337.5 billion additions/s of 7-8-decimal-digit numbers (single precision, 32 bits, has more like 7 decimal digits). With compute power doubling every two years, in 2016 the number should be 4 times bigger. But the recently arrived Radeon HD 7970 has 3.79 TFLOPS single-precision and 947 GFLOPS double-precision compute power. This AMD "Radeon HD 7970" is, at the time of writing, the most powerful single-chip graphics card. So in 2016 the most powerful card should be at 3.79*4 = 15.16 TFLOPS single precision and 0.947*4 = 3.788 TFLOPS double precision (the double-precision FLOPS count is 4 times smaller than single precision on AMD Radeon HD cards).
And in 2030 the most powerful GPU card should be at ~10^6 TFLOPS ~ 1000 PFLOPS ~ 1 exaFLOPS (10^18 FLOPS) single precision.
- Another theory is that the CPU does all the work and only single-core CPU power increases, because of higher frequency, which in 2016 should be 3.5 GHz at the price of today's 3 GHz. And in 2030 it should be 10 GHz or 30 GHz (or the same 3 GHz in the worst case). But games programming will be so professional that you will not be able to tell whether it is done on a 3 GHz single core or on a 10^18 FLOPS GPU. Anyway, a 10 GHz CPU will trick you with no problem using ray tracing (a reflection inside a reflection in a small area and you are tricked) and approximated radiosity.
Natural logarithm benchmarking
var a:longint; c:real; begin for a:=500000000 to 1000000000 do c:=c+ln(a); writeln(c); readln(); end.
- This benchmark gives the result 1.02082065291082*10^10 after 25 seconds on a ~3 GHz CPU.
Natural logarithm benchmarking 2
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+ln(a); writeln(c); readln(); end.
- This benchmark gives the result after 50 seconds on a ~3 GHz CPU. Notice that a sine calculation ("c:=c+sin(a);") done in exactly the same manner also finished in ~50 seconds (47 seconds). That makes me think that the sine or natural logarithm is taken from some kind of big lookup table rather than calculated. But I checked a few times and there really is a 3-4 second difference between the sine function and the natural logarithm (the natural logarithm calculates 3 seconds longer).
Natural logarithm benchmarking 3
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+123456789012345*ln(a); writeln(c); readln(); end.
- This benchmark gives the result after 52 seconds on a ~3 GHz CPU.
- If multiplication were done digit by digit with one multiplier unit, then either the calculations run at a smaller precision or there is more than one multiplier unit. Because multiplying each digit of one number with each digit of another takes 15^2 = 225 or 16^2 = 256 multiplications plus 225 or 256 additions for two 15 (or 16) decimal-digit numbers. So in total, to multiply, for example, 123456789012345 by 123456789012345 you would need 225 + 225 = 450 operations. And if one operation (like an addition) is done in one cycle, then to finish this benchmark in 52 seconds (not counting the natural logarithm) you would need not a ~3 GHz CPU but a far faster one. So I rather believe that a 15-16 decimal-digit number is multiplied by one decimal digit in 1, or at most 3, cycles (but really not more than 4-10 cycles).
- Update: "Free Pascal" can transform this code into the following (and the result will be the same, only a little bit different):
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+ln(a); writeln(123456789012345*c); readln(); end.
- So multiplication must be hard. Note, however, that Free Pascal's sqr() function (power of 2) can in fact be used as many times as you like, including nested (as in the sine codes below), and a*a works directly, so the detour through combinations of exp() and ln() to raise to the power of 2 (like exp(2*ln(a)) = a*a) is not actually necessary. If all programming languages really had problems with multiplication, there would be hope that the GPU is not a fake. BTW, I read that some early RISC processors had no hardware multiply instruction, something like that.
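For what it is worth, the identity behind that workaround can be checked numerically; a tiny Python sketch (my addition):

```python
import math

# a^2 = exp(2*ln(a)) for a > 0: the exp/ln detour used in the benchmarks below.
a = 7.0
print(a * a, math.exp(2 * math.log(a)))  # both are 49 up to rounding
```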
Natural logarithm benchmarking 4
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+exp(2*ln(a)); {a^2=exp(2*ln(a))} writeln(c); readln(); end.
- This benchmark gives the result after 96 seconds on a ~3 GHz CPU.
Natural logarithm benchmarking 5
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+exp(ln(a)); writeln(c); readln(); end.
- This benchmark gives the result after 92 seconds on a ~3 GHz CPU.
Power function benchmarking
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+a*exp(ln(a)); {a^2=a*exp(ln(a))} writeln(c); readln(); end.
- This benchmark gives the result after 93 seconds on a ~3 GHz CPU.
Simple benchmark
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+1; writeln(c); readln(); end.
- This benchmark gives the result after 4 seconds on a ~3 GHz CPU.
Simple benchmark 2
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+2.7; writeln(c); readln(); end.
- This benchmark gives the result after 5 seconds on a ~3 GHz CPU.
Simple benchmark 3
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+2.789123456789012; writeln(c); readln(); end.
- This benchmark gives the result after 5 seconds on a ~3 GHz CPU. Such code can be transformed by "Free Pascal" into "c:=c+1;" with the result multiplied by 2.789123456789012 at the end, so it does not prove anything. Like this:
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+1; writeln(c*2.789123456789012); readln(); end.
- This time the result comes after 4-5 seconds on a ~3 GHz CPU.
- Or if you use this benchmark, then the multiplication will be rounded:
var a:longint; b,c:real; begin for a:=1 to 1000000000 do c:=c+1; b:=c*2.789123456789012; writeln(b); readln(); end.
- And the result comes after 4-5 seconds on a ~3 GHz CPU. If you get the result faster, it means that ~3 GHz = 2.6 GHz.
Square root benchmarking
var a:longint; b,c:real; begin for a:=1 to 1000000000 do c:=c+sqrt(a); {a^(1/2)=sqrt(a)} writeln(c); readln(); end.
- This benchmark gives the result after 12 seconds on a ~3 GHz CPU (if the line "c:=c+sqrt(a);" is replaced with "c:=c+a*sqrt(a);", it still gives the result after 12 seconds on the same CPU). The calculation runs at 16 decimal digits of precision. So two cycles are spent per decimal digit of this [square root] calculation, because (12*2.6*10^9)/(16*10^9) = (31.2*10^9)/(16*10^9) =~ 2. Or 2*16 = 32 cycles are used for the square root of one double-precision (64 bits = 16 decimal digits) number.
- Here is an example of how a square root is calculated.
- To get the square root of x:
- Step 1: Guess G = 1;
- Step 2: New guess G = (G + x/G)/2;
- Repeat Step 2 an arbitrary number of times to get an arbitrarily precise result.
- For example, for x = 3:
- G = 1;
- G = (1+3/1)/2 = 4/2 = 2;
- G = (2 + 3/2)/2 = 7/4 = 1.75;
- G = (7/4 + 12/7)/2 = ((49+48)/28)/2 = (97/28)/2 = 97/56 = 1.732142857;
- G = (97/56 + 168/97)/2 = ((97^2 + 168*56)/(97*56))/2 = ((9409 + 9408)/5432)/2 = 18817/10864 = 1.732050810.
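The steps above are Newton's method for square roots (also known as Heron's method); each iteration roughly doubles the number of correct digits. A minimal Python sketch (my addition):

```python
import math

# Newton/Heron iteration for sqrt(x): G <- (G + x/G)/2, starting from G = 1.
def newton_sqrt(x, iters=6):
    g = 1.0
    for _ in range(iters):
        g = (g + x / g) / 2.0
    return g

print(newton_sqrt(3.0), math.sqrt(3.0))
```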
Too fast multiplication (multiplication benchmark)
var a:longint; b,c:real; begin b:=0; c:=123456789012345; for a:=1 to 1000000000 do b:=b+c*a; writeln(b); readln(); end.
- This benchmark gives the result after 5 seconds on a ~3 GHz CPU.
- But this benchmark
var a:longint; b,c:real; begin b:=0; c:=123456789012345; for a:=1 to 1000000000 do b:=b+a; writeln(c*b); readln(); end.
- also gives the result after 5 seconds on a ~3 GHz CPU.
- And this benchmark
var a:longint; b,d,c:real; begin b:=0; c:=123456789012345; for a:=1 to 1000000000 do b:=b+a; d:=c*b; writeln(d); readln(); end.
- also gives the result after 5 seconds on a ~3 GHz CPU.
- Even this benchmark
var a:longint; b,c:real; begin b:=0; for a:=1 to 1000000000 do b:=b+a; writeln(b); readln(); end.
- gives the result after 5 seconds on a ~3 GHz CPU.
Free Pascal basis for building a sine calculation
Uses math; begin Writeln(Ceil(-3.7)); { should be -3 } Writeln(floor(-3.7)); { should be -4 } Writeln(frac(3.7)); { should be 0.7 } Writeln(floor(3.7)); { should be 3 } Writeln(ceil(3.7)); { should be 4 } Writeln(Ceil(-4.0)); { should be -4 } Readln; End.
- this is the complete code for Free Pascal (Compiler version 2.6.0) with which it already works (press "Run" or Ctrl+F9 on the keyboard) and a column of numbers is shown on the monitor.
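For comparison, the same calls written in Python (my addition; math.modf plays the role of Pascal's frac):

```python
import math

# Python equivalents of the Free Pascal Ceil/floor/frac calls above.
print(math.ceil(-3.7))    # -3
print(math.floor(-3.7))   # -4
print(math.modf(3.7)[0])  # fractional part, ~0.7, like Pascal's frac(3.7)
print(math.floor(3.7))    # 3
print(math.ceil(3.7))     # 4
print(math.ceil(-4.0))    # -4
```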
- Any number x is converted into radians from 0 to 2*pi by applying the formula 2*pi*frac(x/(2*pi)).
- For example, x=10; then:
- sin(10)=-0.54402111088936981340474766185138;
- and 10/(2*pi) = 1.5915494309189533576888376337251,
- frac(1.5915494309189533576888376337251)=0.5915494309189533576888376337251,
- 2*pi*0.5915494309189533576888376337251 = 3.716814692820413523074713233441,
- sin(3.716814692820413523074713233441)=-0.54402111088936981340474766185138.
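The same reduction in Python (my addition), confirming that the sine is unchanged by subtracting whole multiples of 2*pi:

```python
import math

# Range reduction: x -> 2*pi * frac(x/(2*pi)), then take the sine there.
def reduce_2pi(x):
    q = x / (2 * math.pi)
    return 2 * math.pi * (q - math.floor(q))

r = reduce_2pi(10.0)
print(r)                            # about 3.716814692820414
print(math.sin(r), math.sin(10.0))  # both about -0.5440211108893698
```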
Free Pascal code for the sine Taylor series
var a:longint; c:real; begin for a:=0 to 3 do c:=c+(a-0.16666666666666667*a*a*a+0.0083333333333333333*a*sqr(sqr(a*1.0))- 0.00019841269841269841*a*sqr(a*1.0)*sqr(sqr(a*1.0))+ 0.0000027557319223985891*a*sqr(sqr(sqr(a*1.0)))- 0.000000025052108385441718775*a*sqr(a*1.0)*sqr(sqr(sqr(a*1.0)))+ 0.000000000160590438368216146*a*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))- 0.00000000000076471637318198164759*a*sqr(a)*sqr(sqr(a))*sqr(sqr(sqr(a*1.0)))+ 0.000000000000002811457254345520763*a*sqr(sqr(sqr(sqr(a*1.0))))); writeln(c); writeln(sin(1)+sin(2)+sin(3)); Readln; End.
- gives the results:
- 1.89188842905109;
- 1.8918884196934454.
Free Pascal sine code for arbitrary numbers
Uses math; var a:longint; c:real; begin for a:=0 to 100000000 do c:=c+6.283185307179586477*frac(a*0.159154943091895336)- 0.16666666666666667*sqr(6.283185307179586477*frac(a*0.159154943091895336))*6.283185307179586477*frac(a*0.159154943091895336)+ 0.0083333333333333333*6.283185307179586477*frac(a*0.159154943091895336)*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))- 0.00019841269841269841*6.283185307179586477*frac(a*0.159154943091895336)*sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))+ 0.0000027557319223985891*6.283185307179586477*frac(a*0.159154943091895336)*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))- 0.000000025052108385441718775*6.283185307179586477*frac(a*0.159154943091895336)*sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))+ 0.000000000160590438368216146*6.283185307179586477*frac(a*0.159154943091895336)*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))- 0.00000000000076471637318198164759*6.283185307179586477*frac(a*0.159154943091895336)* sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))* sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))+ 0.000000000000002811457254345520763*6.283185307179586477*frac(a*0.159154943091895336)*sqr(sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))); { 1/(2*pi)=0.159 } writeln(c); Readln; End.
- which gives the answer 55365.3072928836 after 71 seconds on a 2.6 GHz processor. This code gives the same sin(x) as the original Free Pascal sine function only when 0<x<1.09; for larger 1.09<x<6.283185307179586477 the answer becomes ever less accurate. Most likely Free Pascal computes the sine economically up to 45 degrees rather than using a very long Taylor series.
- A slightly optimized variant of this code:
Uses math; var a:longint; c:real; begin for a:=1 to 100000000 do c:=c+frac(a*0.159154943091895336)- 0.16666666666666667*sqr(6.283185307179586477*frac(a*0.159154943091895336))*frac(a*0.159154943091895336)+ 0.0083333333333333333*frac(a*0.159154943091895336)*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))- 0.00019841269841269841*frac(a*0.159154943091895336)*sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))+ 0.0000027557319223985891*frac(a*0.159154943091895336)*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))- 0.000000025052108385441718775*frac(a*0.159154943091895336)*sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))+ 0.000000000160590438368216146*frac(a*0.159154943091895336)*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))*sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))- 0.00000000000076471637318198164759*frac(a*0.159154943091895336)* sqr(6.283185307179586477*frac(a*0.159154943091895336))*sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336)))* sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))+ 0.000000000000002811457254345520763*frac(a*0.159154943091895336)*sqr(sqr(sqr(sqr(6.283185307179586477*frac(a*0.159154943091895336))))); { 1/(2*pi)=0.159 } writeln(6.283185307179586477*c); Readln; End.
- gives the answer 55365.307292856515, still after 71 seconds on a 2.6 GHz processor.
Sine benchmark
- This sine code:
Uses math; var a:longint; c:real; begin for a:=0 to 100000000 do c:=c+sin(a); writeln(c); Readln; End.
- is 71/5 = 14 times faster than the home-made one, since it gives the answer 1.71364934657128 after 5 seconds on a 2.6 GHz processor.
Sine benchmark 2
- This sine code:
Uses math; var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+sin(a); writeln(c); Readln; End.
- gives the answer 0.421294486750096 after 47 seconds on a 2.6 GHz processor. So Free Pascal really computes the frac() function only once and most likely reduces the argument so that the Taylor series is as short as possible, juggling the minus signs and so on (maybe it also reuses the squared values instead of recomputing them), since this is 71/4.7 = 15.1 times faster.
Free Pascal frac() function benchmark
Uses math; var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+frac(a*0.15915494309189533576888); { 1/(2*pi)=0.159 } writeln(c); Readln; End.
- gives the result 499999986.434272 after 25 seconds on a 2.6 GHz processor.
Theoretical sine benchmark
- This Free Pascal code calculates all digits correctly only for arguments from 0 to 1.09; for numbers larger than 1.09 the accuracy decreases, and for very large numbers it is lost altogether. The code is (testing the theoretical sine speed for small numbers):
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+(a-0.16666666666666667*a*a*a+0.0083333333333333333*a*sqr(sqr(a*1.0))- 0.00019841269841269841*a*sqr(a*1.0)*sqr(sqr(a*1.0))+ 0.0000027557319223985891*a*sqr(sqr(sqr(a*1.0)))- 0.000000025052108385441718775*a*sqr(a*1.0)*sqr(sqr(sqr(a*1.0)))+ 0.000000000160590438368216146*a*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))- 0.00000000000076471637318198164759*a*sqr(a)*sqr(sqr(a))*sqr(sqr(sqr(a*1.0)))+ 0.000000000000002811457254345520763*a*sqr(sqr(sqr(sqr(a*1.0))))); writeln(c); Readln; End.
- which gives the result after 43 seconds on a 2.6 GHz processor.
- This Free Pascal code:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+(a-0.16666666666666667*a*a*a+0.0083333333333333333*a*sqr(sqr(a))- 0.00019841269841269841*a*sqr(a)*sqr(sqr(a))+ 0.0000027557319223985891*a*sqr(sqr(sqr(a)))- 0.000000025052108385441718775*a*sqr(a)*sqr(sqr(sqr(a)))+ 0.000000000160590438368216146*a*sqr(sqr(a))*sqr(sqr(sqr(a)))- 0.00000000000076471637318198164759*a*sqr(a)*sqr(sqr(a))*sqr(sqr(sqr(a)))+ 0.000000000000002811457254345520763*a*sqr(sqr(sqr(sqr(a*1.0))))); writeln(c); Readln; End.
- gives the result after 54 seconds on a 2.6 GHz processor.
- This Free Pascal code:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1-0.16666666666666667*sqr(a*1.0)+0.0083333333333333333*sqr(sqr(a*1.0))- 0.00019841269841269841*sqr(a*1.0)*sqr(sqr(a*1.0))+ 0.0000027557319223985891*sqr(sqr(sqr(a*1.0)))- 0.000000025052108385441718775*sqr(a*1.0)*sqr(sqr(sqr(a*1.0)))+ 0.000000000160590438368216146*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))- 0.00000000000076471637318198164759*sqr(a*1.0)*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))+ 0.000000000000002811457254345520763*sqr(sqr(sqr(sqr(a*1.0))))); writeln(c); Readln; End.
- gives the result after 41 seconds on a 2.6 GHz processor.
- This Free Pascal code:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1-sqr(a*1.0)*(0.16666666666666667+0.0083333333333333333*sqr(a*1.0)- 0.00019841269841269841*sqr(sqr(a*1.0)))+ 0.0000027557319223985891*sqr(sqr(sqr(a*1.0)))- 0.000000025052108385441718775*sqr(a*1.0)*sqr(sqr(sqr(a*1.0)))+ 0.000000000160590438368216146*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))- 0.00000000000076471637318198164759*sqr(a*1.0)*sqr(sqr(a*1.0))*sqr(sqr(sqr(a*1.0)))+ 0.000000000000002811457254345520763*sqr(sqr(sqr(sqr(a*1.0))))); writeln(c); Readln; End.
- gives the result after 39 seconds on a 2.6 GHz processor.
- This Free Pascal code:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1- sqr(a*1.0)*(0.16666666666666667+ sqr(a*1.0)*(0.0083333333333333333- sqr(a*1.0)*(0.00019841269841269841+ sqr(a*1.0)*(0.0000027557319223985891- sqr(a*1.0)*(0.000000025052108385441718775+ sqr(a*1.0)*(0.000000000160590438368216146- sqr(a*1.0)*(0.00000000000076471637318198164759+ sqr(a*1.0)*0.000000000000002811457254345520763)))))))); writeln(c); Readln; End.
- gives the result after 27 seconds on a 2.6 GHz processor.
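The nesting in the fastest variant above is Horner's scheme in powers of x^2: each extra term then costs only one multiplication and one addition. A Python sketch of the same evaluation (my addition):

```python
import math

# Horner evaluation of the sine Taylor polynomial in x^2, up to the x^17 term.
# coeffs[n] = (-1)^n / (2n+1)!
coeffs = [1.0]
for n in range(1, 9):
    coeffs.append(-coeffs[-1] / ((2 * n) * (2 * n + 1)))

def sin_horner(x):
    x2 = x * x
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x2 + c
    return x * acc

print(sin_horner(1.0), math.sin(1.0))
```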
- This correct Free Pascal code:
var a:longint; c:real; begin {for a:=0 to 1000000000 do} a:=2; c:=c+a*(1+ sqr(a*1.0)*(-0.16666666666666667+ sqr(a*1.0)*(0.0083333333333333333+ sqr(a*1.0)*(-0.00019841269841269841+ sqr(a*1.0)*(0.0000027557319223985891+ sqr(a*1.0)*(-0.000000025052108385441718775+ sqr(a*1.0)*(0.000000000160590438368216146+ sqr(a*1.0)*(-0.00000000000076471637318198164759+ sqr(a*1.0)*(0.000000000000002811457254345520763- sqr(a*1.0)*0.000000000000000008220635246624329717))))))))); writeln(c); writeln(sin(2)); Readln; End.
- gives the result (only the last two digits are wrong):
- 0.909297426825641 and
- 0.90929742682568170.
- This Free Pascal code:
var a:longint; c:real; begin {for a:=0 to 1000000000 do} a:=6; c:=c+a*(1+ sqr(a*1.0)*(-0.16666666666666667+ sqr(a*1.0)*(0.0083333333333333333+ sqr(a*1.0)*(-0.00019841269841269841+ sqr(a*1.0)*(0.0000027557319223985891+ sqr(a*1.0)*(-0.000000025052108385441718775+ sqr(a*1.0)*(0.000000000160590438368216146+ sqr(a*1.0)*(-0.00000000000076471637318198164759+ sqr(a*1.0)*(0.000000000000002811457254345520763+ sqr(a*1.0)*(-0.000000000000000008220635246624329717+ sqr(a*1.0)*(0.00000000000000000001957294106339126123- sqr(a*1.0)*0.000000000000000000000038681701706306840377))))))))))); writeln(c); writeln(sin(6)); Readln; End.
- gives the result:
- -0.279417241102534 and
- -0.27941549819892587.
- Toks Free Pascal kodas (iki dalint iš 29 faktoriale):
var a:longint; c:real; begin //for a:=0 to 1000000000 do a:=6; c:=c+a*(1+ sqr(a*1.0)*(-0.16666666666666667+ sqr(a*1.0)*(0.0083333333333333333+ sqr(a*1.0)*(-0.00019841269841269841+ sqr(a*1.0)*(0.0000027557319223985891+ sqr(a*1.0)*(-0.000000025052108385441718775+ sqr(a*1.0)*(0.000000000160590438368216146+ sqr(a*1.0)*(-0.00000000000076471637318198164759+ sqr(a*1.0)*(0.000000000000002811457254345520763+ sqr(a*1.0)*(-0.000000000000000008220635246624329717+ sqr(a*1.0)*(0.00000000000000000001957294106339126123+ sqr(a*1.0)*(-0.000000000000000000000038681701706306840377+ sqr(a*1.0)*(0.00000000000000000000000006446950284384473396+ sqr(a*1.0)*(-0.000000000000000000000000000091836898637955461484+ sqr(a*1.0)*0.0000000000000000000000000000001130996288644771693)))))))))))))); writeln(c); writeln(sin(6));
- gives the result:
- -0.279415498042951 ir
- -0.27941549819892587.
- This Free Pascal code:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1+ sqr(a*1.0)*(-0.16666666666666667+ sqr(a*1.0)*(0.0083333333333333333+ sqr(a*1.0)*(-0.00019841269841269841+ sqr(a*1.0)*(0.0000027557319223985891+ sqr(a*1.0)*(-0.000000025052108385441718775+ sqr(a*1.0)*(0.000000000160590438368216146+ sqr(a*1.0)*(-0.00000000000076471637318198164759+ sqr(a*1.0)*(0.000000000000002811457254345520763+ sqr(a*1.0)*(-0.000000000000000008220635246624329717+ sqr(a*1.0)*(0.00000000000000000001957294106339126123+ sqr(a*1.0)*(-0.000000000000000000000038681701706306840377+ sqr(a*1.0)*(0.00000000000000000000000006446950284384473396+ sqr(a*1.0)*(-0.000000000000000000000000000091836898637955461484+ sqr(a*1.0)*0.0000000000000000000000000000001130996288644771693)))))))))))))); writeln(c); Readln; End.
- gives the result after 47 seconds on a 2.6 GHz CPU (47 s × 2.6 GHz over a billion iterations is 122.2 cycles per iteration). If we count 14 multiplications per iteration, that makes 47*2.6/14 = 8.73 cycles per multiplication. If we count 28 multiplications, that makes 47*2.6/28 ≈ 4.4 cycles per multiplication. If the 14 additions are counted too, then together with the multiplications there are 47*2.6/42 = 2.9 cycles per operation. And if we count 14*3 = 42 multiplications and 14 additions, that is 56 operations in total, and one operation takes 47*2.6/56 = 2.18 CPU cycles.
- When computing the sine, squaring a only once would be enough, so this Free Pascal code is tested:
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1+ sqr(a*1.0)*(-0.16666666666666667+ a*(0.0083333333333333333+ a*(-0.00019841269841269841+ a*(0.0000027557319223985891+ a*(-0.000000025052108385441718775+ a*(0.000000000160590438368216146+ a*(-0.00000000000076471637318198164759+ a*(0.000000000000002811457254345520763+ a*(-0.000000000000000008220635246624329717+ a*(0.00000000000000000001957294106339126123+ a*(-0.000000000000000000000038681701706306840377+ a*(0.00000000000000000000000006446950284384473396+ a*(-0.000000000000000000000000000091836898637955461484+ a*0.0000000000000000000000000000001130996288644771693)))))))))))))); writeln(c); Readln; End.
- which gives the result after 41 seconds on a 2.6 GHz CPU.
- This Free Pascal code (which has nothing to do with the sine):
var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+a*(1+ a*(0.16666666666666667+ a*(0.0083333333333333333+ a*(0.00019841269841269841+ a*(0.0000027557319223985891+ a*(0.000000025052108385441718775+ a*(0.000000000160590438368216146+ a*(0.00000000000076471637318198164759+ a*(0.000000000000002811457254345520763+ a*(0.000000000000000008220635246624329717+ a*(0.00000000000000000001957294106339126123+ a*(0.000000000000000000000038681701706306840377+ a*(0.00000000000000000000000006446950284384473396+ a*(0.000000000000000000000000000091836898637955461484+ a*0.0000000000000000000000000000001130996288644771693)))))))))))))); writeln(c); Readln; End.
- gives the result after exactly 40 seconds on a 2.6 GHz CPU. In total, 15 multiplications and 15 additions are done per iteration, i.e. 30 operations per iteration. One operation takes 40*2.6/30 = 3.4(6) cycles, so roughly 3.5 cycles per operation.
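- The cycles-per-operation arithmetic used in these benchmarks is always the same: elapsed seconds × clock rate, divided by iterations × operations per iteration. A small illustrative Python helper (the original benchmarks are Free Pascal; this just redoes the arithmetic):

```python
def cycles_per_op(seconds, ghz, iterations=1_000_000_000, ops_per_iter=1):
    # total elapsed clock cycles divided by total operations executed
    return seconds * ghz * 1e9 / (iterations * ops_per_iter)

print(cycles_per_op(47, 2.6, ops_per_iter=14))  # ~8.73 cycles per multiplication
print(cycles_per_op(40, 2.6, ops_per_iter=30))  # ~3.47 cycles per operation
print(cycles_per_op(47, 2.6))                   # ~122 cycles per sin(x) call
```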
OBSERVATION
- The sections Something wrong and Too fast multiplication (multiplication benchmark) contain partly erroneous code.
- For example, the section Too fast multiplication (multiplication benchmark) contains this text:
- "And this benchmark
var a:longint; b,d,c:real; begin b:=0; c:=123456789012345; for a:=1 to 1000000000 do b:=b+a; d:=c*b; writeln(d); readln(); end.
- gives result also after 5 seconds on ~3GHz CPU."
- This code is partly wrong, because Free Pascal does not loop the second statement after the word do (only the first statement is looped; the second statement is executed just once, after the billion iterations of the first one). So the code just quoted is equivalent to this code:
var a:longint; b,d,c:real; begin b:=0; c:=123456789012345; for a:=1 to 1000000000 do b:=b+a; writeln(c*b); readln(); end.
- (in which there is no line "d:=c*b;").
- That is why the computer runs these codes so fast: the multiplication is not inside the loop.
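- Since only b:=b+a is inside the loop, the benchmark times a billion additions plus a single multiplication at the end, and the sum itself has the closed form n(n+1)/2. A hypothetical Python translation of what the code actually computes:

```python
n = 1_000_000_000
c = 123456789012345.0

# what the loop accumulates: 1 + 2 + ... + n (closed form, no loop needed)
b = n * (n + 1) // 2
assert b == 500000000500000000

# the multiplication happens exactly once, after the loop finishes
d = c * b
print(d)
```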
How many CPU cycles does one sin(x) operation need?
- In popular example of Fourier integral I showed that about 133 cycles are needed for one sin(x) operation on a 4.16 GHz CPU (with one heavily loaded internet browser, but nothing playing in the browser or anywhere else).
- In Sinuso benchmark'as 2 it is written:
- "This sine code:
Uses math; var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+sin(a); writeln(c); Readln; End.
- gives the answer 0.421294486750096 after 47 seconds on a 2.6 GHz CPU. ..."
- So on a 2.6 GHz CPU with nothing else loaded, just the Free Pascal code, after 1 billion iterations the result "0.421294486750096" was obtained (not on the first run, but on the second and subsequent runs) after 47 seconds. So one sin(x) operation needs 47*2.6 = 122.2 cycles on this 2.6 GHz dual-core AMD CPU. About 122 cycles.
- This Free Pascal code:
Uses math; var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+sin(a); writeln(c); Readln; End.
- gives the result [4.2129448675010567E-001, which means 4.2129448675010567/10] after about 1 minute and 56 seconds (~116 seconds) when run for the first time on a 4.16 GHz CPU (with one heavily loaded internet browser, but nothing playing in it). On the second and third runs the result [4.2129448675010567E-001] was obtained after 34-35 seconds on the 4.16 GHz CPU (same browser load). So one sin(x) operation needs about 34*4.16 = 141.44 cycles on this 4.16 GHz CPU. So 141 cycles for sin(x) is more than the 122 cycles on the 2.6 GHz CPU. Of course you could blame the heavily loaded internet browser, but by the official theory the FPU is almost unaffected by Windows and other software activity, so with no browser loaded it should still be about the same 141 cycles per sin(x) on the 4.16 GHz CPU. And as far as I remember, whether or not something was loaded in Windows, Free Pascal calculations ran at almost the same speed. On the other hand, here I found that calculating the length of a parabola needs the same number of cycles on the old 2.6 GHz CPU and on the newer 4.16 GHz CPU, while everywhere else the 2.6 GHz CPU needed fewer cycles in Free Pascal calculations than the 4.16 GHz CPU. As is known, RAM latency splits into two parts: row selection (RAS, row address strobe), whose latency is almost the same for all types of memory, and column selection (CAS, column address strobe), whose speed for some reason keeps improving with newer DDR generations (DDR2, DDR3, DDR4, DDR5, ...), while RAS latencies barely decrease at all [with newer DDR generations].
So if we need to jump to a distant address in RAM, another row (RAS) must be selected, and that is slow enough; but, I think, if we do not need to jump to some distant RAM address (we stay in the same selected row), then newer-generation DDR RAM is faster (because CAS, column selection, is very fast). So depending on the code, RAM speed can behave quite differently. Who knows, maybe some memory allocation for FPU calculations takes time, and since RAS speed does not improve with newer DDR generations, maybe that becomes the bottleneck... Officially, the FPU is given its own memory area in RAM and does not communicate with the CPU directly; both fetch/pass code through some memory location... But the cache should cover most of that, so it is hard to say why the FPU would need RAM [during calculations], unless the cache is shared between the CPU and the FPU...
- I was slightly wrong about CAS latency. I know for sure that RAS (row selection) is slow and takes much longer than CAS (column selection). But actually CAS latency (in nanoseconds) also barely improves with newer generations of DDR RAM.
- From here:
- https://en.wikipedia.org/wiki/CAS_latency
"The CAS latency is the delay between the time at which the column address and the column address strobe signal are presented to the memory module and the time at which the corresponding data is made available by the memory module. The desired row must already be active; if it is not, additional time is required."
- The CAS latency of 100 MHz SDRAM is 2 cycles, or 20 ns (1/(20*10^{-9}) = 1/0.00000002 = 50,000,000 = 50 MHz). Here https://en.wikipedia.org/wiki/CAS_latency#Memory_timing_examples
- it is written that 100 MHz SDRAM needs 20 ns (nanoseconds) for the first word (a word is a 16-bit integer, a piece of data), that 50 ns have passed after transferring the fourth word, and 90 ns after transferring the eighth word. So for 100 MHz SDRAM the first word (i.e. once the column is chosen) needs a 20 ns wait, and every subsequent word needs 10 ns (20+10+10+10 = 50 ns, then 50+10+10+10+10 = 90 ns).
- The CAS latency of DDR2-800 (400 MHz, as on the 2.6 GHz CPU's system) is 6 cycles, and the first word transfers in 15 ns (1/(15*10^{-9}) = 1/0.000000015 = 66,666,666 ≈ 66 MHz; a slightly improved first-word transfer speed over 100 MHz SDRAM). The table says that for DDR2-800 the transfer time of every word except the first is 1.25 ns (for 100 MHz SDRAM it was 10 ns). So after transferring 4 words, 15+3*1.25 = 18.75 ns have been spent, and after 8 words, 15+3*1.25+4*1.25 = 23.75 ns. The 1.25 ns comes from 1/(1.25*10^{-9}) = 1/0.00000000125 = 800,000,000 = 800 MHz. So if the CAS is already selected and each next RAM address is larger by one, data transfers run at 800 MHz; but if we need to jump to another address, differing by more than ±1, then the first data transfer needs a 15 ns wait, as if the transfer ran at 66 MHz.
- The CAS latency of DDR3-1600 (800 MHz, as on the 4.16 GHz CPU's system) is 11 cycles, and the first word transfers in 13.75 ns (1/(13.75*10^{-9}) = 1/0.00000001375 = 72,727,272 ≈ 72 MHz; by the way, on my computer with the 4.16 GHz CPU and DDR3-1600 RAM, the program CPUID CPU-Z also shows that the memory CAS# Latency (CL) is 11 clocks [also RAS# to CAS# Delay (tRCD) = 11 clocks; RAS# Precharge (tRP) = 11 clocks; Cycle Time (tRAS) = 30 clocks; Bank Cycle Time (tRC) = 39 clocks; so for my DDR3-1600 RAM the timings are 11-11-11-30-39]). Every word except the first is transferred in 0.625 ns, i.e. at 1600 MHz speed (1/(1600*10^6) = 0.000000000625 s = 0.625 ns). So after transferring 4 words the elapsed time is 13.75+3*0.625 = 15.625 ns, and after 8 words it is 13.75+3*0.625+4*0.625 = 18.125 ns.
- The first-word transfer time is obtained like this:
- 1/(800*10^6) * 11 = 0.00000001375 = 1.375*10^(-8) = 13.75*10^(-9) = 13.75 ns.
- The CAS latency of DDR4-4800 (2400 MHz) is 19 cycles, and the first word transfers in 7.92 ns (1/(7.92*10^{-9}) = 1/0.00000000792 = 126,262,626 ≈ 126 MHz speed). The 7.92 ns comes from 1/(4800*10^6) * 19 = 2.08333*10^{-10} * 19 = 3.958333*10^{-9} s; but because of the Double Data Rate, 3.958333*2 = 7.91666 ns. And 1/(4800*10^6) = 2.08333*10^{-10} s = 0.2083 ns. After transferring 4 words the elapsed time is 7.91666+3*0.2083 = 8.54156 ns (in the Wikipedia table at the given link this is 8.54 ns), and after 8 words it is 7.91666+3*0.2083+4*0.2083 = 9.37476 ns (in the table, 9.38 ns).
- The CAS latency of DDR5-6600 (3300 MHz) is 34 cycles (the table gives another example: DDR5-6400 (3200 MHz) with a CAS latency of 32 cycles), and the first word transfers in 10.30 ns (1/(10.3*10^{-9}) = 1/0.0000000103 = 97,087,378 ≈ 97 MHz speed). The 10.30 ns comes from 1/(3300*10^6) * 34 = 3.030303*10^{-10} * 34 = 1.0303*10^{-8} s = 10.303 ns. Every word except the first is transferred in 1/(6600*10^6) = 1.51515*10^{-10} s = 0.1515 ns. So 4 words are transferred in 10.303 + 3*0.1515 = 10.7575 ns (in Wikipedia, 10.76 ns), and 8 words in 10.303 + 3*0.1515 + 4*0.1515 = 11.3635 ns (in the Wikipedia table, 11.36 ns).
- So we can see that DDR4-4800 with CAS 19 cycles transfers data faster than DDR5-6600 with CAS 34 cycles (7.92 vs 10.30; 8.54 vs 10.76; 9.38 vs 11.36 ns).
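- All the first-word and burst timings quoted above follow one formula: the first word takes CAS cycles divided by the command clock (half the transfer rate for DDR, equal to it for SDR SDRAM), and each later word arrives one transfer period later. An illustrative Python sketch that reproduces the numbers from the Wikipedia table:

```python
def burst_times_ns(transfers_per_sec, cas_cycles, ddr=True):
    # clock driving the CAS counter: half the transfer rate for DDR, equal for SDR
    clock = transfers_per_sec / 2 if ddr else transfers_per_sec
    first = cas_cycles * 1e9 / clock      # ns until the first word arrives
    period = 1e9 / transfers_per_sec      # ns between subsequent words
    return first, first + 3 * period, first + 7 * period  # 1st, 4th, 8th word

print(burst_times_ns(100e6, 2, ddr=False))  # SDRAM-100: (20.0, 50.0, 90.0)
print(burst_times_ns(1600e6, 11))           # DDR3-1600: (13.75, 15.625, 18.125)
print(burst_times_ns(4800e6, 19))           # DDR4-4800: roughly (7.92, 8.54, 9.37)
```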
- It is possible that in the Wikipedia article about CAS latency the word "word" means a one-bit transfer. And since there are 8 banks in each memory (RAM) chip, and 8 memory chips on a RAM module (the one you insert into the motherboard), 8*8 = 64 bits of data can be transferred from one RAM module, or 128 bits from two modules in a dual-channel configuration.
- In that case it is also possible that, once the CAS is selected, you can very quickly take 64 bits of data from one and the same row but from different columns. More realistically, though, you can take 64 bits of data only from the subsequent column each time, or even only from the SAME column (in which case memory is always very slow for all DDR generations and gives fast transfers only in a small number of situations [when transferring from/to one and the same memory address]).
- Two other explanations are also possible.
- The first explanation is that a word means 16 bits of data, but if needed you can also transfer 64 bits of data from one RAM module at the same speed as 16 bits (as one word).
- The second explanation is that the dual-channel RAM configuration is now popular. Dual channel has two 64-bit data buses, 128 bits in total, and for some CPU instructions/calculations 16 bits of data are often enough... If there are 8 banks in each memory chip and there are 2 modules with 8 chips each, then there are 2*8*8 = 128 bits to select/address in those two memory modules. So possibly the RAM engineers/designers decided that when a row (RAS) is selected, a signal is sent to all the banks in all chips (of those two RAM modules) to activate the column (CAS), at slow speed of course. Each bank holds a different bit value (the two RAM modules have 128 banks in total; say the first bank at this address holds 1, the second 0, the third 0, the fourth 1, the fifth 0; each bank holds some independent bit value). So if the CPU needs only 16 bits of data for some instruction, it gets those 16 bits from 16 banks of the first DDR RAM module. The RAS and CAS signals are already activated, so only those 16 banks need activating to take those precious 16 bits. Then, once those 16 bits have been used in some instruction/operation, with the RAS and CAS signals still active for the given address, only another 16 banks need activating, and this time the 16 bits arrive much faster than the first time (because the first time the RAS and CAS signals had to be activated for all 128 banks). The third time the RAM address is likewise still selected, so again only another 16 bits (16 banks) out of the possible 128 need selecting. So 128/16 = 8 words. After 8 words have been transferred and used in instructions or operations, another RAM address must be selected again. So there is no big speed gain for the CPU; and for FPU calculations there is no speed gain at all [in a single-channel configuration], because the FPU loads 64 bits of data each time when calculating at high precision.
Dual channel can give the FPU some speed gain, but less than 2x...
- Some powerful server processors have more than 2 RAM channels, often 4, 8 or 12, and can also have about 100 cores. Maybe on such servers you could get a whopping speed-up in reading/writing from/to RAM, if CAS latency works as described in the second explanation. But the speed-up would still not be too significant, and not that big in all calculations/operations...
- Another, most logical and realistic, explanation is that besides rows and columns, RAM has something for fast access similar to the banks described in the second explanation. The CPU selects a row in RAM with a big latency, then selects a column with a smaller but still considerable latency, and then has, say, 1 kilobyte of banks at its disposal. Within that 1 KB of space the CPU can read/write anything to/from RAM very fast, with no limit on the number of reads and writes and not necessarily in sequential order; as if the CPU had 1 kilobyte of very fast memory for as long as it needs no data from other memory locations.
- [Update after a few days. Today, when I tried to wake the computer from sleep by pressing the "Enter" key a few times, this 4.16 GHz CPU machine restarted (and it had been lagging for a day or a few before that). This is not the first time that after long use (with a heavily enough loaded Opera browser) this computer stops working (usually it hangs, stops responding) and has to be restarted with the reset button. So this 4.16 GHz machine restarted itself and I waited until it fully loaded Windows 10... When Windows 10 had loaded, I started the Free Pascal sine benchmark:
Uses math; var a:longint; c:real; begin for a:=0 to 1000000000 do c:=c+sin(a); writeln(c); Readln; End.
- and it gave the result [4.2129448675010567E-001] after 2 minutes and 6 seconds (126 seconds) on the 4.16 GHz CPU (with nothing loaded in Windows 10, just the Free Pascal program). Then I launched this code a second time and it gave the result after 35 seconds.
- Then I close Free Pascal program and launch again. Then I again launched this sine benchmark code and it gave result [4.2129448675010567E-001] after 1 minute and 57 seconds (117 seconds) on 4.16 GHz CPU (with nothing loaded in Windows 10). Then launch second time this code and got result [4.2129448675010567E-001] after 35 seconds (when launched third time result was gotten also after 35 s).
- Then I launched the Opera internet browser. Usually, after Windows 10 stops responding following long use of Opera (which can be after a few weeks or months), Opera on restart starts reloading the pages that were left open, and would keep loading them forever. The trick is to close Opera and launch it again, and then it loads all the pages from before the Windows 10 crash quickly enough. After seeing that Opera worked, I closed it. Then I launched Free Pascal again, and this time the first run of the sine benchmark gave the result [4.2129448675010567E-001] after 1 minute and 54 seconds (114 s) on the 4.16 GHz CPU (with nothing loaded in Windows 10). The second run of this sine code gave [4.2129448675010567E-001] after 34 seconds.
- And you know what? The results are identical with a heavily loaded internet browser and with no browser loaded (116 s vs 114 s; 34-35 s vs 34-35 s).
- After all these sine benchmarks I launched Free Pascal and opened the division benchmark:
var a:longint; c:real; begin for a:=1 to 1000000000 do c:=c+1/a; writeln(c); Readln; End.
- which gave the result [2.1300481502506980E+001] on the first run after 1 minute and 40 seconds (100 seconds) on the 4.16 GHz CPU (with nothing loaded in Windows 10). Then I launched this division benchmark a second time and got the result [2.1300481502506980E+001] after 6 seconds.
- Then I closed Free Pascal, launched this division benchmark for the first time again, and got the result [2.1300481502506980E+001] after 1 minute and 41 seconds (101 seconds) on the 4.16 GHz CPU (with nothing loaded in Windows 10). The second run gave the result after 6 s.
- Then, while writing this, I decided to double-check that "1 minute and 40 s", launched Free Pascal, and on the first run of this code (by pressing "Run") I was surprised to get the result [2.1300481502506980E+001] after 1 minute and 5 seconds (65 seconds) on the 4.16 GHz CPU (with the internet browser Opera heavily enough loaded). On the second run of this division code I got the result [2.1300481502506980E+001] after 6 seconds on the 4.16 GHz CPU (same browser load).
- Here is the earlier result of the division benchmark with loaded internet browser(s) on the 4.16 GHz CPU: the first run finished after about 103 seconds and the second after 6 seconds. So the second-run result is identical to the results obtained now (6 seconds and "2.1300481502506980E+001"), and the first-run time is better now (100-101 s vs 103 s, or 65 s vs 103 s)...]
RAM Burst length
- There is such a thing as RAM burst length. A burst length of 8 means the processor can take 8 times more data per access than with a burst length of 1.
- For example, here https://www.edaboard.com/threads/what-does-burst-length-in-ddr-sdram.84436/ it is said that
- DDR1 has a Burst length of 2
- DDR2 has a Burst length of 4
- DDR3 has a Burst length of 8.
- This probably means that DDR4 and DDR5 also have a burst length of 8. That is why in the Wikipedia article
- https://en.wikipedia.org/wiki/CAS_latency
- the calculation goes up to 8 words.
- Here https://community.intel.com/t5/Programmable-Devices/DDR3-burst-lenght/td-p/143100 is another link with burst-length speculations.
- Here https://www.quora.com/What-is-DRAM-burst-size many things are written about RAM burst length, but the guy explaining it is very pessimistic about RAM speed in general.
- Here burst length is explained for Wikipedians: https://en.wikipedia.org/wiki/Burst_mode_(computing) .
- As explained and calculated in the topic about how many CPU cycles one sin(x) operation needs, all DDR RAM is, roughly speaking, working at a 100 MHz effective speed. If there were RAM without the burst-length technology, then with a 64-bit data bus such RAM could provide at most 100*64 = 6400 Mbit/s = 800 MB/s. DDR RAM with a 64-bit data bus and a burst length of 8 can provide at most 8*800 MB/s = 6.4 GB/s.
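- The peak-bandwidth estimate above is just effective command rate × bus width × burst length; an illustrative Python check:

```python
def peak_bandwidth_mb_s(effective_mhz, bus_bits, burst_len=1):
    # command rate (Hz) times bits per access, converted to megabytes per second
    return effective_mhz * 1e6 * bus_bits * burst_len / 8 / 1e6

print(peak_bandwidth_mb_s(100, 64))               # 800.0 MB/s without bursting
print(peak_bandwidth_mb_s(100, 64, burst_len=8))  # 6400.0 MB/s = 6.4 GB/s
```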
- The working principle of burst length is that there are something like 8 banks of bits instead of 1 bank per bit. When the CPU needs data from RAM, it asks the RAM to deliver 64 bits of data over the 64-bit data bus from each of 8 addresses. For example, if data is needed from address 1234ABCDh (the letter h means hexadecimal digits, 0 to F; one hexadecimal digit is 4 bits, so this is a 32-bit address), then, after waiting out the CAS latency at the maximum ~100 MHz rate, the DDR RAM takes 64 bits from those 8 banks and saves them in registers inside the DDR RAM itself (from addresses 1234ABCDh, 1234ABCEh, 1234ABCFh, 1234ABD0h, 1234ABD1h, 1234ABD2h, 1234ABD3h, 1234ABD4h). So at that maximum ~100 MHz CAS-latency rate, 64*8 = 512 bits are taken almost at once. Then the CPU quickly takes those 512 bits, 64 bits at a time, over the arbitrarily fast DDR data bus (DDR5 now runs at about 8000 MHz effective, or 4000 MHz actual). So no matter how fast the DDR data bus is, no more than the mentioned 6.4 GB/s will ever be transferred. The CPU saves these eight 64-bit pieces in some cache-like registers. Then, when executing instructions, the processor can execute those instructions whose code and data reside in these 8 pieces of [64-bit] data. An instruction normally needs one or two such pieces of data, so that is about 4-6 instructions (it can be fewer). If one instruction takes about 5 CPU clock cycles, then 6 instructions take 5*6 = 30 cycles; on a 3 GHz CPU that would be like 100 MHz... So the CPU will probably execute the instructions even faster than it can fetch the next 8 portions of 64-bit data from, say, DDR5 RAM over the 64-bit data bus.
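- The burst-of-8 access described above can be sketched as plain address arithmetic (1234ABCDh is the example address from the text):

```python
start = 0x1234ABCD
burst = [start + i for i in range(8)]     # 8 consecutive 64-bit word addresses
print([f"{a:08X}h" for a in burst])       # 1234ABCDh ... 1234ABD4h
print(len(burst) * 64, "bits per burst")  # 512 bits
```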
- Here are explanations from a real datasheet of a 1997 Micron RAM chip: https://docs.rs-online.com/7195/0900766b80028afb.pdf
- Here is a newer Micron SDR SDRAM datasheet: https://www.mouser.com/datasheet/2/671/64mb_x4x8x16_sdram-1282423.pdf
- I examined only the 1997 Micron datasheet. The part is called EDO DRAM, some variant of SDR DRAM (dynamic random access memory; maybe patent-free...). It is named 4 MEG x 4 EDO DRAM, which means one chip has a 4-bit data bus. Sixteen such chips give 4 MEG * 4 (bits) * 16 (chips) = 256 Mbit = 32 MB of RAM, and a 4*16 = 64-bit data bus. If only 8 such 4 MEG x 4 EDO DRAM chips are used, then the RAM-to-CPU data bus is only 4*8 = 32 bits.
- Here from Wikipedia:
- https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory
- From this Wikipedia article about the older SDRAM (SIMM) RAM:
- https://en.wikipedia.org/wiki/SIMM#72-pin_SIMMs
- it is possible to learn more about how RAM works and what addressing and data buses it uses. The SIMM article says a 72-pin RAM module has a 32-bit data bus. The 386 and 486 processors have a 32-bit data bus, so one SIMM RAM module is enough to operate at full speed. It also says that the Pentium (which has a 64-bit data bus) uses two such 72-pin SIMM RAM modules to obtain a 64-bit data bus.
- As written in Wikipedia, one 72-pin SIMM RAM module has a 12-bit address bus, and in normal operation can address 2^24 = 16,777,216 addresses (about 16 million). If each address holds 64 bits, then 16777216*64/8 = 134,217,728 bytes, or 128 MB, can be addressed. Officially it is probably claimed that those 128 MB can be accessed in 8-bit pieces if needed, but my assumption is that taking 8 bits at a time from RAM (or writing them) would be too complex and too time-consuming for the CPU. More realistic is that the CPU either has a 32-bit data bus and takes 32 bits of data from RAM each time, even if it needs only 8 or 16 bits, ignoring the upper (most significant) bits of the 32-bit chunk; or it has a 64-bit data bus, which would be useful for FPU calculations but not for the CPU... With a 64-bit [data bus] the wasted RAM space would be even bigger, because, for example, the Intel 8086 and 286 CPUs, having 16-bit data and ~16-bit address buses, have 8-bit instructions (most of them, if not all...). 32-bit processors address 2^32 = 4,294,967,296 addresses, or 4,294,967,296*64/8 = 34,359,738,368 bytes = 32 GB in 64-bit pieces. Half or more of that RAM [cell space] could be wasted if 64-bit FPU calculations are not used on almost all of it.
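- The SIMM capacity arithmetic above, written out (assuming, as the text does, a 12-bit address bus multiplexed over two cycles and a 64-bit word at each address):

```python
address_bits = 12 * 2                 # row cycle + column cycle on a 12-bit bus
addresses = 2 ** address_bits         # 16,777,216 addressable locations
capacity_bytes = addresses * 64 // 8  # 64 bits stored per address
print(addresses, capacity_bytes // 2**20, "MB")  # 16777216 128 MB
```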
- Going back to the 72-pin SIMM RAM: as I said, it has 12 address lines, a 12-bit address bus. So obviously the address is passed to the SIMM RAM in two cycles: 12 bits one time and 12 bits another time, forming a 24-bit address. This trick is not new; for example the Intel 8080 and 8085 and the Zilog Z80, having 8-bit address and 8-bit data buses, also pass a 16-bit address to RAM in two cycles: the 8 lower bits, then the 8 higher bits. SIMM RAM could have had a 16-bit address bus instead of 12 and could then address 32 bits in two cycles, but RAM of such capacity was not made at that time, so the engineers stuck with 12-bit SIMM RAM. This would mean that the uppermost, so-called most significant, 8 bits (32-24 = 8) are ignored as if they did not exist, because the CPU's program counter would never go to such a big/distant address, there being no programs at that time for RAM of such capacity. But you can also believe (if you like conspiracy theories) that the 286 CPU, with its 24-bit address bus, goes along nicely with two 12-bit addressing cycles, and that the 386 and 486 never went any further than that.
- The thing is that the Intel 286 CPU has (by my thinking) a 16-bit address bus TO MEMORY, plus an additional 8 bits of address coming from a segment register. There are 4 segment registers (CS, SS, DS, ES) in the 8086 and 286, and 6 segment registers (CS, SS, DS, ES, FS, GS) in the 386, 486, Pentium and all newer processors up to today's CPUs (a manufacturer like Intel can make segment registers anywhere from 4 or 8 bits up to the 16 bits they are in today's CPUs). The problem is that in the 8086 (which has a 20-bit address bus) the uppermost 4 bits come from a segment register (CS or SS or DS or ES): 4+16 = 20 address bits. If the RAM had only 2^16 = 65536 addresses, the whole computer would not work with an 8086 or 286, because there are 4 segment registers and they are heavily used for addressing the memory space. So there must be at least 18 bits of RAM addressing on an 8086 or 286 (the 2 [additional] bits give 4 combinations for the 4 segment registers). And with these segment registers, possibly not 2 but 3 RAM addressing cycles are needed. So the segment registers are possibly early Intel and AMD marketing, to advertise support for more memory (they could have waited for 32-bit CPUs). The Intel 8080, 8085 and Z80 do not have segment registers, and their functionality can be simulated by clever programming. One more interesting thing: the Intel 8080, 8085, 8086 and 286 (maybe some later ones too; I have not read their manuals on this) use only 8-bit instructions with 8- or 16-bit operands. The Zilog Z80 has some instructions of two 8-bit pieces, which is the same as 16-bit instructions, and can have (if I remember correctly) up to 3 or 4 8-bit operands (that is, one or two 16-bit operands). There is a big chance the Z80's new instructions are sunk in lies. Normally a CPU first fetches the instruction (opcode) from the memory address held in the PC (program counter), and in another cycle or two fetches the operands from the memory after that address, or from register(s). Operands can be 8/16/32 bits, and up to 64 bits for the FPU.
And operands can be 64 bits on 64-bit processors, if such processors are not fake (but then they cannot address more than the mentioned 32 GB of RAM).
- This https://docs.rs-online.com/7195/0900766b80028afb.pdf
- 1997 Micron datasheet of the 4 Meg x 4 chip shows that the chip has an 11-bit address bus. In two cycles, 2^22 = 4,194,304 addresses of 4 bits each can be addressed. With 16 chips the same 2^22 = 4,194,304 addresses are addressed, but with 64 bits each. That gives a 4194304*64/8 = 33,554,432-byte = 32 MB RAM module. If only 8 of these Micron chips were used, there would be a 16 MB RAM module (plate) with a 32-bit data bus. It seems EDO DRAM did not go far beyond the 12-bit (address) 72-pin SIMM modules... using even fewer address bits... Maybe higher-capacity EDO DRAM changed the 11-bit address bus to, say, a 14-bit one (32-bit CPUs like the 386, with a 32-bit address bus, would treat 2*14 = 28 address bits as the 28 least significant bits and ignore the 4 most significant address bits, because Microsoft's programs forbid going to such a high address, or, for example, the program starts writing to the HDD in that case).
- Correction: this 24- or 26-pin Micron RAM chip has 12 address pins (I did not spot pin A11 the first time), from A0 to A11. In that case 2^24 = 16,777,216 addresses can be addressed (in two cycles of 12 address bits each), or 16M*8 = 128 MB (holding 64 bits at each address).
- In the first cycle all 12 bits are put on the address bus (A0 - A11), and during the second cycle 10 bits (A0 - A9). In total 12+10 = 22 bits, so 2^22 = 4,194,304 addresses of 4 bits each can be addressed. With 16 such chips, 4194304*4*16 = 268,435,456 bits = 256 Mbit = 33,554,432 bytes = a 32 MB plate/module with a 16*4 = 64-bit data bus. In that PDF it looks like the first 12 bits on the address bus select the ROW and the other 10 bits (from the second cycle) select the column. There are 4 banks of a sort in each chip, because there is a 4-bit data bus (the address to each bank is the same).
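- The module capacity worked out above can be checked the same way: 12 row bits + 10 column bits per chip, a 4-bit data bus per chip, 16 chips per module:

```python
addresses = 2 ** (12 + 10)            # row bits + column bits per chip
bits_per_module = addresses * 4 * 16  # 4-bit chip data bus, 16 chips
print(bits_per_module)                # 268435456 bits = 256 Mbit
print(bits_per_module // 8 // 2**20, "MB module")  # 32 MB module
```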
- So instead of an 8-word burst, RAM could simply work at, say, some 2 GHz, which would be more than the 100*8 = 800 MHz maximum effective RAM speed achieved with a burst length of 8.
- And it seems many new processor technologies could be fake, and what changed since the Intel 8080/8086 or Z80 is that the registers went from 8/16 bits to 16/32 bits, and the address and data buses from 16 bits (the address formed from two 8-bit pieces) to 32 bits (possibly also passed in two cycles as two 16-bit pieces). Many new flags in the FLAGS register of new CPUs could be fake, or only a few functional. Likewise many instructions, for example in the 286, are possibly fake: they could be simulated by combining a few simpler instructions. Some 286 instructions look ridiculous... like fakes...
- Here is an article about the 8086: https://www.righto.com/2020/06/a-look-at-die-of-8086-processor.html
- This article says the 8086 instructions are put into microcode on the 8086 chip die. With such microcoded instructions the CPU could actually work faster than with a combination of simpler instructions in RAM, and this approach should also take less RAM space. Think of the CPU taking an instruction from RAM, then taking the operand from the address [of this instruction]+1 (an operand can come from a register, or from RAM and a register at the same time [two operands]). If many simpler instructions are used to achieve a more complex instruction, then each instruction takes a byte or 16 bits [possibly, on new CPUs] from RAM, which needs more RAM space (the operands have to be taken from RAM either way [if they do not come from registers]) than one instruction in microcode. Also, microcode is like ROM, and it executes its small internal instructions faster than the CPU fetching many instructions addressed by the program counter from RAM. The problem is that the microcode must then have some small micro program counter of a few bits... The normal program counter is 16 bits in the 8080, 8085, 8086/286 and Z80 CPUs, and 32 bits in the Intel 386 and newer CPUs.
- So with microcode, the working principle should be that the CPU takes a complex instruction FROM RAM (which should be two bytes long) and this instruction is handled by the CPU's microcode with its own few-bit Program Counter, which does the same thing the CPU would do IN RAM; the only difference is that these simple/small/short instructions are executed with their operands from RAM (if operands from RAM are needed) one by one, until the complex instruction is completed this way. This saves up to 50% of RAM and should be faster, because ROM is faster than RAM... After the many small instructions finish, the micro program counter ends its job, the normal Program Counter is incremented by one or two, and execution of a new instruction begins.
- It is also possible that microcode simply holds instructions in a better order or something, that only the "small" instructions really exist, and that the big/complex instructions are simply fake, because a separate microcode Program Counter with its own clock looks like a complex thing...
- Some instructions actually need sub-clock cycles and there is no way they can be executed in one cycle... The Intel 8080/8085 and Z80 do such things in a few cycles for one simple basic instruction, like pushing to or popping from the Stack, and RAM operations. And with a wider data bus there is no other way but to use some subcycles/suboperations in some [necessary] instructions; but not as many subcycles as would be needed with the microcode's own [short] Program Counter for complex instructions (which are sequences of simple instructions).
More about CPUs
- Here is the manual of the Intel 286: https://bitsavers.org/components/intel/80286/210498-001_iAPX_286_Programmers_Reference_1983.pdf
- All 286 instructions are listed and strictly described at the very end of the PDF file (beginning on page 353).
- Also, the mentioned microcode of the 286 CPU's instructions can be something like a CPU boot ROM: to load basic code and do some things like interrupts, or to load something into RAM for the beginning of the processor's work.
- Here https://en.wikipedia.org/wiki/Intel_8231/8232 is an interesting article about the Intel 8231/8232, which were licensed versions of AMD's Am9511 and Am9512 FPUs from 1977 and 1979. No wonder Intel gave AMD the rights to manufacture x86 CPUs.
- The AMD Am9511 can operate only on 32-bit floating point values, and on 16-bit and 32-bit integers. It has an 8-bit data bus and a standard 24-pin package. It looks like the Am9511 FPU has only two 32-bit registers, TOS (Top of Stack) and NOS (Next on Stack), although the paper indirectly claims it has eight 16-bit registers or four 32-bit registers. 32-bit numbers are loaded into the registers as byte sequences: the first 4 bytes (32 bits) go to the NOS register and then 4 bytes are loaded into the TOS register. Operands are written to TOS and NOS by activating the CS (chip select) pin on the Am9511 chip and the WR (write, input) pin. Then the CPU, an AMD AM9080A (if such was made) or an Intel 8080/8080A, transfers 8 bytes in total to the NOS and TOS registers. For the square root operation, the calculation is made on the TOS register and the result is also stored in the TOS register; but I doubt such an instruction existed for real. Only addition, multiplication, maybe subtraction and maybe division instructions are available. For example, the multiply instruction FMUL multiplies NOS by TOS and stores the result to NOS. After TOS and NOS are loaded with 32 bits of data, an 8-bit opcode is sent on the data bus to the Am9511 FPU, and depending on this opcode the FPU performs either addition, subtraction or multiplication with operands TOS and NOS and stores the result to NOS. When NOS and TOS need to be loaded with 32-bit operands, either the CPU itself sends the 8 bytes one by one in 8 write operations (by activating the WR pin) from CPU registers, or the CPU, with its selected RAM address, moves data 8 times from RAM addresses to the FPU through the CPU's data bus (which is also connected to the FPU and becomes active depending on chip select (the CS pin)). The addresses in hexadecimal could be, say, 00B0, 00B1, 00B2, 00B3, 00B4, 00B5, 00B6, 00B7, from which the CPU tells the RAM to put these eight 8-bit pieces on the data bus, like performing a RAM read operation, but transferring the data to the other device (the Am9511 FPU) instead of to its (the CPU's) own registers.
When the 8 write operations to the FPU are performed, the Am9511 FPU is ready for the 8-bit opcode (which comes on the 8-bit data bus) from the CPU or RAM.
- Bit 7 of the opcode looks to me like a non-functional bit... Bit 5, for fixed point (16-bit or 32-bit integer) operations, may also not work, like many other instructions... The guaranteed functional Am9511 FPU pins are CLK (clock, input), CS (Chip Select, input), RD (Read, input), WR (Write, input) and DB0-DB7 (bidirectional data bus, I/O).
- To read the calculated result from the Am9511 FPU, the CPU activates the CS pin and the RD pin, and probably 4 bytes (32 bits) are transferred from the FPU over the 8-bit data bus to the RAM location pointed to by the CPU's Program Counter (Instruction Pointer). Maybe like a NOP (no operation) operation, but then who activated the read-from-RAM pin? So it is more realistic that the CPU reads the result from the FPU byte by byte, 4 times, as if performing an input operation FROM some port/device. Writing to RAM or a register in each 8-bit transfer is possible (I don't remember now whether input from a port is written to RAM or to a CPU register, or can be written to both). It just seems that the Am9511 FPU has a small program counter (pointer) of 8 bits, so the bytes also go out in the correct sequence...
- Speaking about CPUs, it seems the most important parts of the CPU working principle are the stack pointer (SP register) and interrupts. Reading info about CPUs I found out that interrupts are served based on priority. From the beginning they are built with a transistor-logic scheme working by the principle that priority is given to those interrupts which are wired with bigger priority. If a higher priority device asks the CPU for an interrupt, then the lower priority device's interrupt is stopped and the interrupt of the device with higher priority is served. There may be 2 or 3 such priority levels, maybe up to 5, because in most cases it's not very important... 8 bits of information can serve 256 ports of input devices, and another 8 bits, 256 output devices. For example, if there are two interrupt priority levels, then say 128 ports are for lower priority interrupts and 128 ports for higher priority interrupts. They are wired at the transistor level, so the interrupt priority level works automatically. If interrupts are of the same priority level, then the interrupt which occurred first is served first.
So the Stack Pointer's (SP) function is to store the Program Counter (16 bits for 16-bit CPUs, or 32 bits for CPUs capable of addressing 2^32 addresses) TO the RAM address pointed to by the Stack Pointer. During this operation the Stack Pointer is decremented by the size of the Program Counter in bytes (if there is an interrupt) or incremented back by the same amount on return from the interrupt. Say in this example the Program Counter is 32 bits, i.e. 4 bytes. If the Stack Pointer had the 32-bit hexadecimal value 4567EB86h, then at the beginning of the interrupt the Stack Pointer is decremented by 4, to 4567EB82h, and the 32 bits of the Program Counter register are sent to RAM address 4567EB82h. Then the program counter is loaded with a new 32-bit value from the interrupting device. I'm not sure: maybe the new value is loaded during the interrupt from some register or from some stack of SRAM memory registers, or there are simply not so many pending interrupts, say at most 2 or 3 interrupts, each on top of another, with 2 or 3 different privilege (priority) levels.
- If the input device loads a new 32-bit value into the Program Counter, then the Program Counter starts executing instructions from RAM at this new address; or there can be some special addresses which are not RAM but, say, ROM with some information and instructions.
- After the interrupt has done what it wanted to do, like a mouse click command or keyboard click command, the RET instruction is executed and the old Program Counter value is fetched from RAM at the Stack Pointer's address, 4567EB82h. Then the Stack Pointer is incremented by 4 (and again has the value 4567EB86h) and the CPU waits until a new interrupt occurs after some time, or serves a new interrupt with lower priority. Theoretically, more tasks and interrupts at the same time possibly slow down all CPU work, compared to doing all tasks and interrupts one by one in some non-interrupting order...
- Here https://deramp.com/downloads/intel/8080%20Data%20Sheet.pdf on page 11 it is written how the Intel 8080 CPU handles interrupts:
- In this way, the pre-interrupt status of the program counter is preserved, so that data in the counter may be restored by the interrupted program after the interrupt request has been processed.
- The interrupt cycle is otherwise indistinguishable from an ordinary FETCH machine cycle. The processor itself takes no further special action. It is the responsibility of the peripheral logic to see that an eight-bit interrupt instruction is "jammed" onto the processor's data bus during state T3.
- In a typical system, this means that the data-in bus from memory must be temporarily disconnected from the processor's main data bus, so that the interrupting device can command the main bus without interference.
- The 8080's instruction set provides a special one-byte call which facilitates the processing of interrupts (the ordinary program Call takes three bytes). This is the RESTART instruction (RST). A variable three-bit field embedded in the eight-bit field of the RST enables the interrupting device to direct a Call to one of eight fixed memory locations. The decimal addresses of these dedicated locations are: 0, 8, 16, 24, 32, 40, 48, and 56. Any of these addresses may be used to store the first instruction(s) of a routine designed to service the requirements of an interrupting device. Since the (RST) is a call, completion of the instruction also stores the old program counter contents on the STACK.
- So it seems that during an interrupt, the interrupting device activates the CPU's interrupt pin, and this lets the interrupting device do the "jamming" for all CPU processes. This way the interrupting device can load the address of the interrupt routine into the CPU's Program Counter. Then, with the new Program Counter after the interrupt pin is deactivated, the CPU goes to the new address and executes the instructions there needed for serving the interrupt. After serving is done, the RET instruction at the end is executed and the old Program Counter value is loaded from the STACK. And the CPU goes back to the old code and work...
- Another explanation of interrupt serving can be that during the interrupt, through 3 bits of the address or data bus (the manual doesn't say which), one of 8 addresses (with a step of 8 RAM locations: 0, 8, 16, 24, 32, 40, 48 and 56) is launched (due to, say, activation of some CPU pin), each holding an interrupt-serving miniprogram, which can jump to big programs initially loaded in some secret or safe RAM place. These big programs can serve an arbitrarily complex interrupt routine or program. After the interrupt is finished, execution of the RET instruction at the end of the big program brings the program counter's old value back from the STACK, and the flow of the usual code/instructions continues.
- It is also possible that there is only one jump address during the RESTART instruction (RST), which would be 0. Then at this address the CPU would check all ports (by executing some sequential code/instructions) for data in/out until it finds the correct port, but this would be very slow.
- So possibly newer processors have more of these RST addresses (more than 8, and with bigger spaces between them so that more instructions fit there).