
Computer Scientists’ Trivia


Hardware: Programmer’s Perspective

It is crucial for programmers to understand how long certain operations take, both inside and outside a computer: for example, fetching a word from cache, from memory, from disk, or from another computer.
Inspired by Teach Yourself Programming in Ten Years, I would like to discuss this in a little more detail. (Most of the information is taken from here.)

The analogy is:

  • L1 Cache - There is a sandwich in front of you.
  • L2 Cache - Walk to the kitchen and make a sandwich.
  • RAM - Drive to the store, purchase sandwich fixings, drive home, and make a sandwich.
  • Hard Drive - Drive to the store, purchase seeds, grow the seeds, harvest lettuce, wheat, etc., and make a sandwich.

To be more specific:

Latency Comparisons                        Nanosec      Microsec   Millisec
L1 cache reference                             0.5
Branch mispredict                                5
L2 cache reference                               7
Mutex lock/unlock                               25
Main memory reference                          100
Compress 1K bytes with Zippy                 3,000             3
Send 1K bytes over 1 Gbps network           10,000            10
Read 4K randomly from SSD                  150,000           150
Read 1 MB sequentially from memory         250,000           250
Round trip within same datacenter          500,000           500
Read 1 MB sequentially from SSD          1,000,000         1,000          1
Disk seek                               10,000,000        10,000         10
Read 1 MB sequentially from disk        20,000,000        20,000         20
Send packet CA->Netherlands->CA        150,000,000       150,000        150

Notes

  • 1 ns = 10^-9 seconds
  • 1 us = 10^-6 seconds = 1,000 ns
  • 1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

It is important to keep these numbers in mind because you have to understand what you should be optimizing for. Optimizing CPU caching like hell will improve a web API's latency only a little, because most of that latency comes from the travel time of the data over the network.

If an L1 access took a second, then reading 1 MB sequentially from disk would take about 463 days, and sending a packet from California to the Netherlands and back to California would take about 3,472 days.
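
As a sanity check, here is a small C sketch (my own, not from the original post) that rescales the table above so that an L1 reference takes one second:

#include <stdio.h>

int main(void) {
    const double l1_ns       = 0.5;     /* L1 cache reference */
    const double disk_1mb_ns = 20e6;    /* read 1 MB sequentially from disk */
    const double ca_nl_ca_ns = 150e6;   /* packet CA -> Netherlands -> CA */

    /* If 0.5 ns is stretched to 1 s, every real nanosecond
       becomes 1/0.5 = 2 "seconds" of scaled time. */
    const double scale = 1.0 / l1_ns;

    printf("disk read:  %.0f days\n", disk_1mb_ns * scale / 86400.0);  /* ~463  */
    printf("round trip: %.0f days\n", ca_nl_ca_ns * scale / 86400.0);  /* ~3472 */
    return 0;
}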

If visualized:

[latency visualization omitted]

Numbers and Bits

These numbers come naturally to programmers. The first powers of 2:

1 2 4 8 16 32 64 128 256 512 1024

  • 2^10 = Kilo ~ 10^3
  • 2^20 = Mega ~ 10^6
  • 2^30 = Giga ~ 10^9
  • 2^40 = Tera ~ 10^12

It is good to know the answers to the questions below.

What’s the number 1111 1111?: 255
What’s 1111 1111 in Hex?: 0xff
What’s the number 1000 0000?: 2^7=128
What’s the largest 8-bit number?: 255
How to represent negative numbers?: 2’s Complement

Notes - Two’s Complement

Simply flip all the bits and add 1. For example, 3 becomes −3:
00000011 -> 11111100 -> 11111101
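
As a tiny demonstration (my own sketch, not from the original post), reinterpreting the bit pattern 11111101 as a signed 8-bit integer gives −3 on the (now ubiquitous) two's-complement machines:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t bits = 0xFD;           /* 1111 1101 */
    int8_t value = (int8_t)bits;   /* reinterpret as a signed two's-complement byte */
    printf("%d\n", value);         /* prints -3 */
    return 0;
}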

Byte Ordering

Conceptually, memory is a big array, addressable at each byte.

Big Endian: most significant byte at the smallest address (the value 0x01020304 is stored as bytes 01 02 03 04).
Little Endian: least significant byte at the smallest address (stored as bytes 04 03 02 01).
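
One common way to check the byte order of your machine (a sketch of mine, assuming a 4-byte unsigned int) is to look at the first byte of a known value:

#include <stdio.h>

int main(void) {
    unsigned int x = 0x01020304;
    unsigned char *p = (unsigned char *)&x;   /* view the raw bytes */

    /* On a little-endian machine the lowest address holds 0x04;
       on a big-endian machine it holds 0x01. */
    printf("%s-endian\n", (p[0] == 0x04) ? "little" : "big");
    return 0;
}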

Bit Twiddling

Bitwise operations and tricks can be found here.
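
For a taste, here is one classic trick (my example; most such collections include it): x & (x - 1) clears the lowest set bit of x.

#include <stdio.h>

int main(void) {
    unsigned int x = 44;            /* binary 101100 */
    unsigned int y = x & (x - 1);   /* clears the lowest set bit -> 101000 */
    printf("%u\n", y);              /* prints 40 */
    return 0;
}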

Floating Points

IEEE Floating Point (It’s a standard)

(-1)^sign × M × 2^E, where M is the mantissa (a.k.a. significand) and E is the exponent.

Type     Sign Bit       Encoded E       Encoded M        Precision   Magnitude
Single   sign (1 bit)   exp (8 bits)    frac (23 bits)   2^-149      2^128
Double   sign (1 bit)   exp (11 bits)   frac (52 bits)   2^-1074     2^1024

Floating point is tricky because precision diminishes as magnitude grows, and it is prone to overflow and rounding error.

Many real-world disasters have been caused by floating point trickiness. The Patriot missile failed to intercept due to a 0.1 rounding error (1991). The Ariane 5 exploded due to an overflow when converting a double to an int (1996).
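
The root of the Patriot bug is that 0.1 has no exact binary representation, so repeated addition drifts. A tiny sketch of mine:

#include <stdio.h>

int main(void) {
    /* 0.1 cannot be represented exactly in binary, so the
       rounding error accumulates over many additions. */
    double t = 0.0;
    for (int i = 0; i < 1000000; i++)
        t += 0.1;
    printf("%.10f\n", t);   /* prints roughly 100000.0000013329, not 100000 */
    return 0;
}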

Extract Sign, Mantissa, and E from a Float

To display the sign bit, the 8-bit exp field, and the 23-bit frac field of a float x:

unsigned int *f = (unsigned int *)&x;   /* view x's raw bits */
unsigned int sign = (*f) >> 31;         /* top bit */
unsigned int exp  = (*f) << 1 >> 24;    /* drop the sign, keep the next 8 bits */
unsigned int frac = (*f) << 9 >> 9;     /* low 23 bits */
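
Note that the pointer cast above technically violates C's strict-aliasing rule; a well-defined alternative (my own sketch) is to copy the bits out with memcpy:

#include <stdio.h>
#include <string.h>

int main(void) {
    float x = -1.5f;
    unsigned int bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to reinterpret the bits */

    unsigned int sign = bits >> 31;
    unsigned int exp  = (bits >> 23) & 0xFF;
    unsigned int frac = bits & 0x7FFFFF;
    printf("sign=%u exp=%u frac=0x%x\n", sign, exp, frac);   /* sign=1 exp=127 frac=0x400000 */
    return 0;
}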

Special Values

Infinity: exp = 111…1, frac = 000…0
(e.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞)
Not-a-Number (NaN): exp = 111…1, frac ≠ 000…0
(e.g., sqrt(−1), ∞ − ∞, ∞ × 0)
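
A small demonstration of mine, assuming IEEE 754 semantics (the default on mainstream compilers without -ffast-math):

#include <stdio.h>

int main(void) {
    double zero = 0.0;
    double inf  = 1.0 / zero;   /* exp all ones, frac zero     */
    double nanv = inf - inf;    /* exp all ones, frac non-zero */

    printf("%f %f\n", inf, nanv);   /* prints inf and nan (or -nan) */
    printf("%d\n", nanv == nanv);   /* 0: NaN compares unequal to everything, even itself */
    return 0;
}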

C data types

I personally do not think that every programmer should know C in order to succeed in other languages. However, I think it is very important to know how data is stored in C (especially pointers). This knowledge broadens your perspective on how memory and the CPU work, and you will appreciate all the more how much modern languages have made easy.

Primitive Types in C

It varies from OS to OS, but most commonly:

Type      Size (32-bit)   Size (64-bit)
char      1 (8 bit)       1 (8 bit)
short     2 (16 bit)      2 (16 bit)
int       4 (32 bit)      4 (32 bit)
long      4 (32 bit)      8 (64 bit)
float     4 (32 bit)      4 (32 bit)
double    8 (64 bit)      8 (64 bit)
pointer   4 (32 bit)      8 (64 bit)
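
You can check the sizes on your own machine with sizeof (a quick sketch of mine):

#include <stdio.h>

int main(void) {
    /* sizeof yields a size_t, printed with %zu */
    printf("char:    %zu\n", sizeof(char));
    printf("short:   %zu\n", sizeof(short));
    printf("int:     %zu\n", sizeof(int));
    printf("long:    %zu\n", sizeof(long));
    printf("float:   %zu\n", sizeof(float));
    printf("double:  %zu\n", sizeof(double));
    printf("pointer: %zu\n", sizeof(void *));
    return 0;
}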

2D Array vs. Array of Pointers

int m[2][3] = { {1, 2, 3}, {4, 5, 6} };
int *n[2];

Diagram to show the difference:
m: |_|_|_|_|_|_|   (six ints, stored contiguously row by row)
n: |_|_|           (two pointers)

n[0] = &m[0][0]; // equivalent to n[0] = m[0]
n[1] = &m[1][0]; // equivalent to n[1] = m[1]

What is n[0][1]? // 2
What is m[0][1]? // same as n[0][1]

What is n[1][1]? // 5
What is m[1][1]? // same as n[1][1]

What about n[0][5]? // 6, because m's rows are contiguous

n[1] = m[0];
What is n[1][1] now? // 2

n[1][1]++;
What are all the elements of m? // 1 3 3 4 5 6

n[1][3]++;
What are all the elements of m? // 1 3 3 5 5 6
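
Here is a runnable consolidation of the quiz (my own sketch) so you can verify the answers:

#include <stdio.h>

static void print_m(int m[2][3]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("%d ", m[i][j]);
    printf("\n");
}

int main(void) {
    int m[2][3] = { {1, 2, 3}, {4, 5, 6} };
    int *n[2];

    n[0] = m[0];
    n[1] = m[1];
    /* n[0][5] walks past row 0 into row 1; it works here only
       because the rows of m are contiguous in memory. */
    printf("%d %d %d\n", n[0][1], n[1][1], n[0][5]);   /* 2 5 6 */

    n[1] = m[0];
    printf("%d\n", n[1][1]);                           /* 2 */

    n[1][1]++;
    print_m(m);                                        /* 1 3 3 4 5 6 */

    n[1][3]++;
    print_m(m);                                        /* 1 3 3 5 5 6 */
    return 0;
}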

Performance Realities

Programmers must optimize at multiple levels:

  • Big-O: algorithm, data representations
  • Systems: optimize memory access, I/O, parallelize execution

Steps of manual optimization

  1. Identify bottlenecks - Is the bottleneck the CPU? Disk/SSD? The network? Something else?
  2. Measure program performance - If the CPU is the bottleneck, profile the program's execution to figure out which code path takes the most time (see the timing sketch below).
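
A minimal wall-clock timing sketch of mine for measuring one code path on POSIX systems (a real profiler such as gprof or perf gives far more detail):

#include <stdio.h>
#include <time.h>

/* A hypothetical code path we want to measure. */
static long work(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i % 7;
    return sum;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long result = work(100000000L);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("result=%ld, took %.3f s\n", result, secs);
    return 0;
}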

In short, the biggest challenge for every programmer is this: how to improve performance without destroying code modularity and readability. While contributing to mlpack, I saw maintainers give up on using the GPU for performance improvements because it made the code too messy and unmaintainable.

You cannot rely on compiler optimization alone, because the compiler must generate safe machine code that behaves the same under all circumstances, which limits the scope of optimization. When in doubt, the compiler must be conservative.

Getting Higher Performance

Use compiler optimization flags, and watch out for:

  • hidden algorithmic inefficiencies
  • optimization blockers: procedure calls and memory aliasing

Always profile the program’s performance.

Code Motion

  • Reduce the frequency with which a computation is performed, if it will always produce the same result

void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

In assembly, the above function looks like this:

set_row:
    testq %rcx, %rcx              # Test n
    jle .L1                       # If 0, goto done
    imulq %rcx, %rdx              # ni = n*i, redundant!
    leaq (%rdi,%rdx,8), %rdx      # rowp = A + ni*8
    movl $0, %eax                 # j = 0
.L3:                              # loop:
    movsd (%rsi,%rax,8), %xmm0    # t = b[j]
    movsd %xmm0, (%rdx,%rax,8)    # M[A + ni*8 + j*8] = t
    addq $1, %rax                 # j++
    cmpq %rcx, %rax               # j:n
    jne .L3                       # if !=, goto loop
.L1:                              # done:
    rep ; ret

To improve performance, hoist the redundant multiplication out of the loop:

long j;
long ni = n*i;   /* computed once instead of on every iteration */
for (j = 0; j < n; j++)
    a[ni + j] = b[j];

Replace Costly Operation with Simpler One

  • Shift and add instead of multiplying or dividing (e.g., 16*x --> x << 4)
  • Recognize sequences of products

for (i = 0; i < n; i++) {
    long ni = n*i;    /* one multiplication per outer iteration */
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

In this case, replacing the multiplication with an addition improves performance:

long ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;    /* maintain ni == n*i without multiplying */
}

Share Common Subexpressions

Reuse portions of expressions (GCC will do this with -O1).

The code below performs 3 multiplications: i*n, (i-1)*n, (i+1)*n.

up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j - 1];
right = val[i*n + j + 1];
sum   = up + down + left + right;

In assembly, it looks like this:

leaq 1(%rsi), %rax     # i+1
leaq -1(%rsi), %r8     # i-1
imulq %rcx, %rsi       # i*n
imulq %rcx, %rax       # (i+1)*n
imulq %rcx, %r8        # (i-1)*n
addq %rdx, %rsi        # i*n+j
addq %rdx, %rax        # (i+1)*n+j
addq %rdx, %r8         # (i-1)*n+j

The code below does the same work, but with only 1 multiplication: i*n.

long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum   = up + down + left + right;

In assembly, it looks like this:

imulq %rcx, %rsi          # i*n
addq %rdx, %rsi           # i*n+j
movq %rsi, %rax           # i*n+j
subq %rcx, %rax           # i*n+j-n
leaq (%rsi,%rcx), %rcx    # i*n+j+n

Optimization Blocker 1. Procedure Calls

void lower(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Question: what is the big-O runtime of lower? O(n)?

[graph: lower's runtime grows quadratically with string length]

Answer: quadratic performance! strlen takes O(n) to finish, and it is called n times, which makes the whole function O(n^2).

void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);   /* computed once, outside the loop */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

To remedy this, we move the call to strlen outside of the loop, since its result does not change from one iteration to the next. The improved performance is shown below.

[graph: lower2's runtime grows linearly with string length]

Optimization Blocker 2. Memory Aliasing

Aliasing occurs when two different memory references specify a single location. It happens easily in C, since the language allows address arithmetic and direct access to storage structures. To keep it under control, get in the habit of introducing local variables for values that accumulate within loops.

Aggressive reuse of memory is one of the ways programmers make code fast, so it is important for both the correctness and the speed of your program to understand how it might alias buffers.

void sum_rows1(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

In assembly, the inner loop looks like this:

# sum_rows1 inner loop
.L4:
    movsd (%rsi,%rax,8), %xmm0    # FP load
    addsd (%rdi), %xmm0           # FP add
    movsd %xmm0, (%rsi,%rax,8)    # FP store
    addq $8, %rdi
    cmpq %rcx, %rdi
    jne .L4

The code updates b[i] on every iteration. Because b might point into a, the compiler must consider the possibility that each store affects later loads, so it cannot keep b[i] in a register. Let's remove the aliasing as below.

void sum_rows2(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;    /* accumulate in a local variable */
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;        /* store once per row */
    }
}

In assembly, the inner loop looks like this:

# sum_rows2 inner loop
.L10:
    addsd (%rdi), %xmm0    # FP load + add
    addq $8, %rdi
    cmpq %rax, %rdi
    jne .L10

There is no need to store intermediate results anymore.

Memory Layout

  • Stack
    • Runtime stack (typically 8 MB limit)
    • e.g., local variables
  • Heap
    • Dynamically allocated as needed
    • e.g., when calling malloc(), calloc(), or (in C++) new
  • Data
    • Statically allocated data
    • e.g., global variables, static variables, string constants
  • Text / Shared Libraries
    • Executable machine instructions
    • Read-only

Each memory segment can be readable, executable, or writable (or none of these). A segmentation fault occurs when a program tries to access memory illegally, for example reading from a segment with no permissions or writing to a read-only segment.
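
A classic illustration (my own sketch; it will crash on most modern systems, which is the point) is writing to a string constant, which lives in the read-only data segment:

#include <stdlib.h>

int global = 1;                          /* data segment */

int main(void) {
    int local = 2;                       /* stack */
    int *heap = malloc(sizeof *heap);    /* heap */
    char *text = "hello";                /* string constant: read-only data */

    text[0] = 'H';                       /* write to read-only memory -> segfault */

    free(heap);
    return local + global;
}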

