
The European Parliament has approved budget for VLC bug bounty program



Debugging an evil Go runtime bug


Preface

I’m a big fan of Prometheus and Grafana. As a former SRE at Google I’ve learned to appreciate good monitoring, and this combination has been a winner for me over the past year. I’m using them for monitoring my personal servers (both black-box and white-box monitoring), for the Euskal Encounter external and internal event infra, for work I do professionally for clients, and more. Prometheus makes it very easy to write custom exporters to monitor your own data, and there’s a good chance you’ll find an exporter that already works for you out of the box. For example, we use sql_exporter to make a pretty dashboard of attendee metrics for the Encounter events.

Event dashboard for Euskal Encounter (fake staging data)

Since it’s so easy to throw node_exporter onto any random machine and have a Prometheus instance scrape it for basic system-level metrics (CPU, memory, network, disk, filesystem usage, etc), I figured, why not also monitor my laptop? I have a Clevo “gaming” laptop that serves as my primary workstation, mostly pretending to be a desktop at home but also traveling with me to big events like the Chaos Communication Congress. Since I already have a VPN between it and one of my servers where I run Prometheus, I can just emerge prometheus-node_exporter, bring up the service, and point my Prometheus instance at it. This automatically configures alerts for it, which means my phone will make a loud noise whenever I open way too many Chrome tabs and run out of my 32GB of RAM. Perfect.
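
For reference, scraping the new target is just one more entry in the Prometheus config, along these lines (a minimal sketch; the job name and VPN address are placeholders, and 9100 is node_exporter's default port):

scrape_configs:
  - job_name: 'laptop'
    static_configs:
      - targets: ['10.8.0.2:9100']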

Trouble on the horizon

Barely an hour after setting this up, though, my phone did get a page: my newly added target was inaccessible. I could still SSH into the laptop just fine, so the machine itself was definitely up, but node_exporter had crashed.

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc41ffc7fff pc=0x41439e]

goroutine 2395 [running]:
runtime.throw(0xae6fb8, 0x2a)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0xc4203e8be8 sp=0xc4203e8bc8 pc=0x42c815
runtime.sigpanic()
        /usr/lib64/go/src/runtime/signal_unix.go:351 +0x2b8 fp=0xc4203e8c38 sp=0xc4203e8be8 pc=0x443318
runtime.heapBitsSetType(0xc4204b6fc0, 0x30, 0x30, 0xc420304058)
        /usr/lib64/go/src/runtime/mbitmap.go:1224 +0x26e fp=0xc4203e8c90 sp=0xc4203e8c38 pc=0x41439e
runtime.mallocgc(0x30, 0xc420304058, 0x1, 0x1)
        /usr/lib64/go/src/runtime/malloc.go:741 +0x546 fp=0xc4203e8d38 sp=0xc4203e8c90 pc=0x411876
runtime.newobject(0xa717e0, 0xc42032f430)
        /usr/lib64/go/src/runtime/malloc.go:840 +0x38 fp=0xc4203e8d68 sp=0xc4203e8d38 pc=0x411d68
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.NewConstMetric(0xc42018e460, 0x2, 0x3ff0000000000000, 0xc42032f430, 0x1, 0x1, 0x10, 0x9f9dc0, 0x8a0601, 0xc42032f430)
        /var/tmp/portage/net-analyzer/prometheus-node_exporter-0.15.0/work/prometheus-node_exporter-0.15.0/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:165 +0xd0 fp=0xc4203e8dd0 sp=0xc4203e8d68 pc=0x77a980

node_exporter, like many Prometheus components, is written in Go. Go is a relatively safe language: while it allows you to shoot yourself in the foot if you so wish, and it doesn’t have nearly as strong safety guarantees as, say, Rust does, it is still not too easy to accidentally cause a segfault in Go. Moreover, node_exporter is a relatively simple Go app with mostly pure-Go dependencies. Therefore, this was an interesting crash to get, especially since the crash was inside mallocgc, which should never crash under normal circumstances.

Things got more interesting after I restarted it a few times:

2017/11/07 06:32:49 http: panic serving 172.20.0.1:38504: runtime error: growslice: cap out of range
goroutine 41 [running]:
net/http.(*conn).serve.func1(0xc4201cdd60)
        /usr/lib64/go/src/net/http/server.go:1697 +0xd0
panic(0xa24f20, 0xb41190)
        /usr/lib64/go/src/runtime/panic.go:491 +0x283
fmt.(*buffer).WriteString(...)
        /usr/lib64/go/src/fmt/print.go:82
fmt.(*fmt).padString(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:110 +0x110
fmt.(*fmt).fmt_s(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:328 +0x61
fmt.(*pp).fmtString(0xc42053a000, 0xc4204e6800, 0xc4204e6850, 0xc400000073)
        /usr/lib64/go/src/fmt/print.go:433 +0x197
fmt.(*pp).printArg(0xc42053a000, 0x9f4700, 0xc42041c290, 0x73)
        /usr/lib64/go/src/fmt/print.go:664 +0x7b5
fmt.(*pp).doPrintf(0xc42053a000, 0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2)
        /usr/lib64/go/src/fmt/print.go:996 +0x15a
fmt.Sprintf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0x10, 0x9f4700)
        /usr/lib64/go/src/fmt/print.go:196 +0x66
fmt.Errorf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0xc420410301, 0xc420410300)
        /usr/lib64/go/src/fmt/print.go:205 +0x5a

Well, that’s interesting. A crash in Sprintf this time. What?

runtime: pointer 0xc4203e2fb0 to unallocated span idx=0x1f1 span.base()=0xc4203dc000 span.limit=0xc4203e6000 span.state=3
runtime: found in object at *(0xc420382a80+0x80)
object=0xc420382a80 k=0x62101c1 s.base()=0xc420382000 s.limit=0xc420383f80 s.spanclass=42 s.elemsize=384 s.state=_MSpanInUse<snip>
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0xaee4fe, 0x3e)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0x7f0f19ffab90 sp=0x7f0f19ffab70 pc=0x42c815
runtime.heapBitsForObject(0xc4203e2fb0, 0xc420382a80, 0x80, 0xc41ffd8a33, 0xc400000000, 0x7f0f400ac560, 0xc420031260, 0x11)
        /usr/lib64/go/src/runtime/mbitmap.go:425 +0x489 fp=0x7f0f19ffabe8 sp=0x7f0f19ffab90 pc=0x4137c9
runtime.scanobject(0xc420382a80, 0xc420031260)
        /usr/lib64/go/src/runtime/mgcmark.go:1187 +0x25d fp=0x7f0f19ffac90 sp=0x7f0f19ffabe8 pc=0x41ebed
runtime.gcDrain(0xc420031260, 0x5)
        /usr/lib64/go/src/runtime/mgcmark.go:943 +0x1ea fp=0x7f0f19fface0 sp=0x7f0f19ffac90 pc=0x41e42a
runtime.gcBgMarkWorker.func2()
        /usr/lib64/go/src/runtime/mgc.go:1773 +0x80 fp=0x7f0f19ffad20 sp=0x7f0f19fface0 pc=0x4580b0
runtime.systemstack(0xc420436ab8)
        /usr/lib64/go/src/runtime/asm_amd64.s:344 +0x79 fp=0x7f0f19ffad28 sp=0x7f0f19ffad20 pc=0x45a469
runtime.mstart()
        /usr/lib64/go/src/runtime/proc.go:1125 fp=0x7f0f19ffad30 sp=0x7f0f19ffad28 pc=0x430fe0

And now the garbage collector had stumbled upon a problem: yet another kind of crash.

At this point, there are two natural conclusions: either I have a severe hardware issue, or there is a wild memory corruption bug in the binary. I initially considered the former unlikely, as this machine has a very heavily mixed workload and no signs of instability that can be traced back to hardware (I have my fair share of crashing software, but it’s never random). Since Go binaries like node_exporter are statically linked and do not depend on any other libraries, I can download the official release binary and try that, which would eliminate most of the rest of my system as a variable. Yet, when I did so, I still got a crash.

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x76b998]

goroutine 13 [running]:
runtime.throw(0xabfb11, 0x5)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420060c40 sp=0xc420060c20 pc=0x42c725
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc420060c90 sp=0xc420060c40 pc=0x443197
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go.(*LabelPair).GetName(...)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go/metrics.pb.go:85
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*Desc).String(0xc4203ae010, 0xaea9d0, 0xc42045c000)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/desc.go:179 +0xc8 fp=0xc420060dc8 sp=0xc420060c90 pc=0x76b998

Yet another completely different crash. At this point there was a decent chance that there was truly an upstream problem with node_exporter or one of its dependencies, so I filed an issue on GitHub. Perhaps the developers had seen this before? It’s worth bringing this kind of issue to their attention and seeing if they have any ideas.

Unsurprisingly, upstream’s first guess was that it was a hardware issue. This isn’t unreasonable: after all, I’m only hitting the problem on one specific machine. All my other machines are happily running node_exporter. While I had no other evidence of hardware-linked instability on this host, I also had no other explanation as to what was so particular about this machine that would make node_exporter crash. A Memtest86+ run never hurt anyone, so I gave it a go.

And then this happened:

This is what I get for using consumer hardware

Whoops! Bad RAM. Well, to be more specific, one bit of bad RAM. After letting the test run for a full pass, all I got was that single bad bit, plus a few false positives in test 7 (which moves blocks around and so can amplify a single error).

Further testing showed that Memtest86+ test #5 in SMP mode would quickly detect the error, but usually not on the first pass. The error was always the same bit at the same address. This suggests that the problem is a weak or leaky RAM cell. In particular, one which gets worse with temperature. This is quite logical: a higher temperature increases leakage in the RAM cells and thus makes it more likely that a somewhat marginal cell will actually cause a bit flip.

To put this into perspective, this is one bad bit out of 274,877,906,944. That’s actually a very good error rate! Hard disks and Flash memory have much higher error rates - it’s just that those devices have bad blocks marked at the factory that are transparently swapped out without the user knowing, and can transparently mark newly discovered weak blocks as bad and relocate them to a spare area. RAM has no such luxury, so a bad bit sticks forever.
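
For reference, that denominator is simply the laptop's 32 GiB of RAM counted in bits: 32 × 2^30 bytes × 8 bits/byte = 274,877,906,944 bits.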

Alas, this is vanishingly unlikely to be the cause of my node_exporter woes. That app uses very little RAM, and so the chances of it hitting the bad bit (repeatedly, at that) are extremely low. This kind of problem would be largely unnoticeable, perhaps causing a pixel error in some graphics, a single letter to flip in some text, an instruction to be corrupted that probably won’t ever be run, and perhaps the rare segfault when something actually important does land on the bad bit. Nonetheless, it does cause long-term reliability issues, and this is why servers and other devices intended to be reliable must use ECC RAM, which can correct this kind of error.

I don’t have the luxury of ECC RAM on this laptop. What I do have, though, is the ability to mark the bad block of RAM as bad and tell the OS not to use it. There is a little-known feature of GRUB 2 which allows you to do just that, by changing the memory map that is passed to the booted kernel. It’s not worth buying new RAM just for a single bad bit (especially since DDR3 is already obsolete, and there’s a good chance new RAM would have weak cells anyway), so this is a good option.

However, there’s one more thing I can do. Since the problem gets worse with temperature, what happens if I heat up the RAM?

🔥🔥🔥memtest86+🔥🔥🔥

A cozy 100°C

Using a heat gun set at a fairly low temperature (130°C) I warmed up two modules at a time (the other two modules are under the rear cover, as my laptop has four SODIMM slots total). Playing around with module order, I found three additional weak bits only detectable at elevated temperature, and they were spread around three of my RAM sticks.

I also found that the location of the errors stayed roughly consistent even as I swapped modules around: the top bits of the address remained the same. This is because the RAM is interleaved: data is spread over all four sticks, instead of each stick being assigned a contiguous quarter of the available address space. This is convenient, because I can just mask a region of RAM large enough to cover all possible addresses for each error bit, and not have to worry that I might swap sticks in the future and mess up the masking. I found that masking a contiguous 128KiB area should cover all possible permutations of addresses for each given bad bit, but, for good measure, I rounded up to 1MiB. This gave me three 1MiB aligned blocks to mask out (one of them covers two of the bad bits, for a total of four bad bits I wanted masked):

  • 0x36a700000 - 0x36a7fffff
  • 0x460e00000 - 0x460efffff
  • 0x4ea000000 - 0x4ea0fffff

This can be specified using the address/mask syntax required by GRUB as follows, in /etc/default/grub:

GRUB_BADRAM="0x36a700000,0xfffffffffff00000,0x460e00000,0xfffffffffff00000,0x4ea000000,0xfffffffffff00000"
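
Regenerating the GRUB configuration then applies the new memory map on the next boot; the usual invocation looks like this (the output path varies by distro and boot setup):

grub-mkconfig -o /boot/grub/grub.cfg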

One quick grub-mkconfig later, I am down 3MiB of RAM and four dodgy bits with it. It’s not ECC RAM, but this should increase the effective reliability of my consumer-grade RAM, since now I know the rest of the memory is fine up to at least 100°C.

Needless to say, node_exporter still crashed. But then, we already knew this wasn’t the real problem, didn’t we?

Digging deeper

The annoying thing about this kind of bug is that it clearly is caused by some kind of memory corruption that breaks code that runs later. This makes it very hard to debug, because we can’t predict what will be corrupted (it varies), and we can’t catch the bad code in the act of doing so.

First I tried some basic bisecting of available node_exporter releases and enabling/disabling different collectors, but that went nowhere. I also tried running an instance under strace. This seemed to stop the crashes, which strongly points to a race-condition kind of problem. strace will usually wind up serializing execution of apps to some extent, by intercepting all system calls run by all threads. I would later find that the strace instance crashed too, but it took much longer to do so. Since this seemed to be related to concurrency, I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.
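
For the record, both workarounds amount to nothing more exotic than this (illustrative invocations, not the exact commands from my shell history):

strace -f ./node_exporter        # tracing every thread partly serializes execution; crashes became much rarer
GOMAXPROCS=1 ./node_exporter     # only one OS thread runs Go code at a time; no crashes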

By now I had gathered quite a considerable number of crash logs, and I was starting to notice some patterns. While there was a lot of variation in the parts that were crashing and how, ultimately the error messages could be categorized into different types and the same kind of error showed up more than once. So I started Googling these errors, and this is how I stumbled upon Go issue #20427. This was an issue in a seemingly unrelated part of Go, but one that had caused similar segfaults and random issues. The issue was closed with no diagnosis after it couldn’t be reproduced with Go 1.9. Nobody knew what the root cause was, just that it had stopped happening.

So I grabbed this sample code from the issue, which claimed to reproduce the problem, and ran it on my machine. Lo and behold, it crashed within seconds. Bingo. This is a lot better than waiting hours for node_exporter to crash.

That doesn’t get me any closer to debugging the issue from the Go side, but it gives me a much faster way to test for it. So let’s try another angle.

Bisecting machines

I know the problem happens on my laptop, but doesn’t happen on any of my other machines. I tried the reproducer on every other machine I have easy access to, and couldn’t get it to crash on any of them. This tells me there’s something special about my laptop. Since Go statically links binaries, the rest of userspace doesn’t matter. This leaves two relevant parts: the hardware, and the kernel.

I don’t have any easy way to test with various hardware other than the machines I own, but I can play with kernels. So let’s try that. First order of business: will it crash in a VM?

To test for this, I built a minimal initramfs that will allow me to very quickly launch the reproducer in a QEMU VM without having to actually install a distro or boot a full Linux system. My initramfs was built with Linux’s scripts/gen_initramfs_list.sh and contained the following files:

dir /dev 755 0 0
nod /dev/console 0600 0 0 c 5 1
nod /dev/null 0666 0 0 c 1 3
dir /bin 755 0 0
file /bin/busybox busybox 755 0 0
slink /bin/sh busybox 755 0 0
slink /bin/true busybox 755 0 0
file /init init.sh 755 0 0
file /reproducer reproducer 755 0 0

/init is the entry point of a Linux initramfs, and in my case was a simple shell script to start the test and measure time:

#!/bin/sh
export PATH=/bin

start=$(busybox date +%s)

echo "Starting test now..."
/reproducer
ret=$?
end=$(busybox date +%s)
echo "Test exited with status $ret after $((end-start)) seconds"

/bin/busybox is a statically linked version of BusyBox, often used in minimal systems like this to provide all basic Linux shell utilities (including a shell itself).

The initramfs can be built like this (from a Linux kernel source tree), where list.txt is the file list above:

scripts/gen_initramfs_list.sh -o initramfs.gz list.txt

And QEMU can boot the kernel and initramfs directly:

qemu-system-x86_64 -kernel /boot/vmlinuz-4.13.9-gentoo -initrd initramfs.gz -append 'console=ttyS0' -smp 8 -nographic -serial mon:stdio -cpu host -enable-kvm

This resulted in no output at all to the console… and then I realized I hadn’t even compiled 8250 serial port support into my laptop’s kernel. D’oh. I mean, it doesn’t have a physical serial port, right? Anyway, after a quick detour to rebuild the kernel with serial support (while crossing my fingers that that didn’t change anything important), I tried again, and it successfully booted and ran the reproducer.

Did it crash? Yup. Good, this means the problem is reproducible on a VM on the same machine. I tried the same QEMU command on my home server, with its own kernel, and… nothing. Then I copied the kernel from my laptop and booted that and… it crashed. The kernel is what matters. It’s not a hardware issue.

Juggling kernels

At this point, I knew I was going to be compiling lots of kernels to try to narrow this down. So I decided to move to the most powerful machine I had lying around: a somewhat old 12-core, 24-thread Xeon (now defunct, sadly). I copied the known-bad kernel source to that machine, built it, and tested it.

It didn’t crash.

What?

Some head-scratching later, I made sure the original bad kernel binary crashed (it did). Are we back to hardware? Does it matter which machine I build the kernel on? So I tried building the kernel on my home server, and that one promptly triggered the crash. Building the same kernel on two machines yields crashing kernels, while building it on a third doesn’t. What’s the difference?

Well, these are all Gentoo boxes, and all Gentoo Hardened at that. But my laptop and my home server are both ~amd64 (unstable), while my Xeon server is amd64 (stable). That means GCC is different. My laptop and home server were both on gcc (Gentoo Hardened 6.4.0 p1.0) 6.4.0, while my Xeon was on gcc (Gentoo Hardened 5.4.0-r3 p1.3, pie-0.6.5) 5.4.0.

But my home server’s kernel, which was nearly the same version as my laptop (though not exactly), built with the same GCC, did not reproduce the crashes. So now we have to conclude that both the compiler used to build the kernel and the kernel itself (or its config?) matter.

To narrow things down further, I compiled the exact kernel tree from my laptop on my home server (linux-4.13.9-gentoo), and confirmed that it indeed crashed. Then I copied over the .config from my home server and compiled that, and found that it didn’t. This means we’re looking at a kernel config difference and a compiler difference:

  • linux-4.13.9-gentoo + gcc 5.4.0-r3 p1.3 + laptop .config - no crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + laptop .config - crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + server .config - no crash

Two .configs, one good, and one bad. Time to diff them. Of course, the two configs were vastly different (since I tend to tailor my kernel config to only include the drivers I need on any particular machine), so I had to repeatedly rebuild the kernel while narrowing down the differences.

I decided to start with the “known bad” .config and start removing things. Since the reproducer takes a variable amount of time to crash, it’s easier to test for “still crashes” (just wait for it to crash) than for “doesn’t crash” (how long do I have to wait to convince myself that it doesn’t?). Over the course of 22 kernel builds, I managed to simplify the config so much that the kernel had no networking support, no filesystems, no block device core, and didn’t even support PCI (still works fine on a VM though!). My kernel builds now took less than 60 seconds and the kernel was about 1/4th the size of my regular one.

Then I moved on to the “known good” .config and removed all the unnecessary junk while making sure it still didn’t crash the reproducer (which was trickier and slower than the previous test). I hit a few false branches, where a change I made actually caused the reproducer to start crashing (though I didn’t know which one) yet I misidentified the result as “no crash”; when a later build crashed, I had to walk back through the previous kernels I’d built and pin down exactly where the crashing behavior had been introduced. I ended up doing 7 kernel builds.

Eventually, I narrowed it down to a small handful of different .config options. A few of them stood out, in particular CONFIG_OPTIMIZE_INLINING. After carefully testing them I concluded that, indeed, that option was the culprit. Turning it off produced kernels that crash the reproducer testcase, while turning it on produced kernels that didn’t. This option, when turned on, allows GCC to better determine which inline functions really must be inlined, instead of forcing it to inline them unconditionally. This also explains the GCC connection: inlining behavior is likely to change between GCC versions.

/*
 * Force always-inline if the user requests it so via the .config,
 * or if gcc is too old.
 * GCC does not warn about unused static inline functions for
 * -Wunused-function.  This turns out to avoid the need for complex #ifdef
 * directives.  Suppress the warning in clang as well by using "unused"
 * function attribute, which is redundant but not harmful for gcc.
 */
#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) ||                \
    !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
#define inline inline           __attribute__((always_inline,unused)) notrace
#define __inline__ __inline__   __attribute__((always_inline,unused)) notrace
#define __inline __inline       __attribute__((always_inline,unused)) notrace
#else
/* A lot of inline functions can cause havoc with function tracing */
#define inline inline           __attribute__((unused)) notrace
#define __inline__ __inline__   __attribute__((unused)) notrace
#define __inline __inline       __attribute__((unused)) notrace
#endif

So what next? We know that CONFIG_OPTIMIZE_INLINING makes the difference, but that potentially changes the behavior of every single inline function across the whole kernel. How to pinpoint the problem?

I had an idea.

Hash-based differential compilation

The basic premise is to compile part of the kernel with the option turned on, and part of the kernel with the option turned off. By testing the resulting kernel and checking whether the problem appears or not, we can deduce which subset of the kernel compilation units contains the problem code.

Instead of trying to enumerate all object files and doing some kind of binary search, I decided to go with a hash-based approach. I wrote this wrapper script for GCC:

#!/bin/bash
args=("$@")

doit=
while [ $# -gt 0 ]; do
        case "$1" in
                -c)
                        doit=1
                        ;;
                -o)
                        shift
                        objfile="$1"
                        ;;
        esac
        shift
done

extra=
if [ ! -z "$doit" ]; then
        sha="$(echo -n "$objfile" | sha1sum - | cut -d" " -f1)"
        echo "${sha:0:8} $objfile" >> objs.txt
        if [ $((0x${sha:0:8} & (0x80000000 >> $BIT))) = 0 ]; then
                echo "[n]" "$objfile" 1>&2
        else
                extra=-DCONFIG_OPTIMIZE_INLINING
                echo "[y]" "$objfile" 1>&2
        fi
fi

exec gcc $extra "${args[@]}"

This hashes the object file name with SHA-1, then checks a given bit of the hash out of the first 32 bits (identified by the $BIT environment variable). If the bit is 0, it builds without CONFIG_OPTIMIZE_INLINING. If it is 1, it builds with CONFIG_OPTIMIZE_INLINING. I found that the kernel had around 685 object files at this point (my minimization effort had paid off), which requires about 10 bits for a unique identification. This hash-based approach also has one neat property: I can choose to only worry about crashing outcomes (where the bit is 0), since it is much harder to prove that a given kernel build does not crash (as the crashes are probabilistic and can take quite a while sometimes).
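
Hooked into the kernel build, the wrapper simply replaces the compiler; one plausible invocation (assuming the script above is saved as gccwrap.sh and made executable) is:

BIT=0 make -j24 CC="$PWD/gccwrap.sh"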

I built 32 kernels, one for each bit of the SHA-1 prefix, which only took 29 minutes. Then I started testing them, and every time I got a crash, I narrowed down a regular expression of possible SHA-1 hashes to only those with zero bits at those specific positions. At 8 crashes (and thus zero bits), I was down to 4 object files, and a couple were looking promising. Once I hit the 10th crash, there was a single match.

$ grep '^[0246][012389ab][0189][014589cd][028a][012389ab][014589cd]' objs_0.txt
6b9cab4f arch/x86/entry/vdso/vclock_gettime.o

vDSO code. Of course.

vDSO shenanigans

The kernel’s vDSO is not actually kernel code. vDSO is a small shared library that the kernel places in the address space of every process, and which allows apps to perform certain special system calls without ever leaving user mode. This increases performance significantly, while still allowing the kernel to change the implementation details of those system calls as needed.
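
You can see it mapped into any running process (a quick illustration, not from the original writeup):

grep vdso /proc/self/maps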

In other words, vDSO is GCC-compiled code, built with the kernel, that ends up being linked with every userspace app. It’s userspace code. This explains why the kernel and its compiler mattered: it wasn’t about the kernel itself, but about a shared library provided by the kernel! And Go uses the vDSO for performance. Go also happens to have a (rather insane, in my opinion) policy of reinventing its own standard library, so it does not use any of the standard Linux glibc code to call vDSO, but rather rolls its own calls (and syscalls too).

So what does flipping CONFIG_OPTIMIZE_INLINING do to the vDSO? Let’s look at the assembly.

With CONFIG_OPTIMIZE_INLINING=n:

arch/x86/entry/vdso/vclock_gettime.o.no_inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <vread_tsc>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	0f 31                	rdtsc  
   9:	48 c1 e2 20          	shl    $0x20,%rdx
   d:	48 09 d0             	or     %rdx,%rax
  10:	48 8b 15 00 00 00 00 	mov    0x0(%rip),%rdx        # 17 <vread_tsc+0x17>
  17:	48 39 c2             	cmp    %rax,%rdx
  1a:	77 02                	ja     1e <vread_tsc+0x1e>
  1c:	5d                   	pop    %rbp
  1d:	c3                   	retq   
  1e:	48 89 d0             	mov    %rdx,%rax
  21:	5d                   	pop    %rbp
  22:	c3                   	retq   
  23:	0f 1f 00             	nopl   (%rax)
  26:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  2d:	00 00 00 

0000000000000030 <__vdso_clock_gettime>:
  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp
  47:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 4e <__vdso_clock_gettime+0x1e>
  4e:	83 ff 01             	cmp    $0x1,%edi
  51:	74 66                	je     b9 <__vdso_clock_gettime+0x89>
  53:	0f 8e dc 00 00 00    	jle    135 <__vdso_clock_gettime+0x105>
  59:	83 ff 05             	cmp    $0x5,%edi
  5c:	74 34                	je     92 <__vdso_clock_gettime+0x62>
  5e:	83 ff 06             	cmp    $0x6,%edi
  61:	0f 85 c2 00 00 00    	jne    129 <__vdso_clock_gettime+0xf9>
[...]

With CONFIG_OPTIMIZE_INLINING=y:

arch/x86/entry/vdso/vclock_gettime.o.inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__vdso_clock_gettime>:
   0:	55                   	push   %rbp
   1:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 8 <__vdso_clock_gettime+0x8>
   8:	83 ff 01             	cmp    $0x1,%edi
   b:	48 89 e5             	mov    %rsp,%rbp
   e:	74 66                	je     76 <__vdso_clock_gettime+0x76>
  10:	0f 8e dc 00 00 00    	jle    f2 <__vdso_clock_gettime+0xf2>
  16:	83 ff 05             	cmp    $0x5,%edi
  19:	74 34                	je     4f <__vdso_clock_gettime+0x4f>
  1b:	83 ff 06             	cmp    $0x6,%edi
  1e:	0f 85 c2 00 00 00    	jne    e6 <__vdso_clock_gettime+0xe6>
[...]

Interestingly, CONFIG_OPTIMIZE_INLINING=y, which is supposed to allow GCC to inline less, actually resulted in it inlining more: vread_tsc is inlined in that version, but not in the CONFIG_OPTIMIZE_INLINING=n version. But vread_tsc isn’t marked inline at all, so GCC is perfectly within its rights to behave like this, as counterintuitive as it may be.

But who cares if a function is inlined? Where’s the actual problem? Well, looking closer at the non-inline version…

  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp

Why is GCC allocating over 4KiB of stack? That’s not a stack allocation, that’s a stack probe, or more specifically, the result of the -fstack-check GCC feature.

Gentoo Linux enables -fstack-check by default on its hardened profile. This is a mitigation for theStack Clash vulnerability. While -fstack-check is an old GCC feature and not intended for this, it turns out it effectively mitigates the issue (I’m told proper Stack Clash protection will be in GCC 8). As a side-effect, it causes some fairly silly behavior, where every non-leaf function (that is, a function that makes function calls) ends up probing the stack 4 KiB ahead of the stack pointer. In other words, code compiled with -fstack-check potentially needs at least 4 KiB of stack space, unless it is a leaf function (or a function where every call was inlined).

Go loves small stacks.

TEXT runtime·walltime(SB),NOSPLIT,$16
	// Be careful. We're calling a function with gcc calling convention here.
	// We're guaranteed 128 bytes on entry, and we've taken 16, and the
	// call uses another 8.
	// That leaves 104 for the gettime code to use. Hope that's enough!

Turns out 104 bytes aren’t enough for everybody. Certainly not for my kernel.

It’s worth pointing out that the vDSO specification makes no mention of maximum stack usage guarantees, so this is squarely Go’s fault for making invalid assumptions.

Conclusion

This perfectly explains the symptoms. The stack probe is an orq, a logical OR with 0. Semantically that is a no-op, but it effectively probes the target address (if the address is unmapped, it will segfault). But we weren’t seeing segfaults in vDSO code, so how was this breaking Go? Well, OR with 0 isn’t really a no-op: since orq is not an atomic instruction, the CPU actually reads the memory address and then writes the value back. This creates a race condition. If other threads are running in parallel on other CPUs, the probe can effectively wind up undoing a memory write that occurs simultaneously. Since the write was outside Go’s stack bounds, it was likely landing on other threads’ stacks or other data, and, when the stars lined up, silently reverting a store another thread had just made there. This is also why GOMAXPROCS=1 works around the issue: it prevents two threads from running Go code at the same time.
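
Spelled out as a timeline (my own sketch of the failure mode described above, not from the original post):

# CPU 0: stack probe                CPU 1: another Go thread
# -------------------------------   -----------------------------------
# read  [addr]  -> old value
#                                   write [addr] <- new value
# write [addr] <- old value         (CPU 1's store is silently undone)
#
# [addr] sits 4KiB below the vDSO caller's stack pointer, i.e. memory
# that belongs to someone else on Go's tiny stacks.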

What’s the fix? I left that up to the Go devs. Their solution ultimately was to pivot to a larger stack before calling vDSO functions. This introduces a small speed penalty (nanoseconds), but it’s acceptable. After building node_exporter with the fixed Go toolchain, the crashes went away.

2017-12-05 01:20

SEC Emergency Action Halts PlexCoin ICO


Washington D.C., Dec. 4, 2017 —

The Securities and Exchange Commission today announced it obtained an emergency asset freeze to halt a fast-moving Initial Coin Offering (ICO) fraud that raised up to $15 million from thousands of investors since August by falsely promising a 13-fold profit in less than a month.

The SEC filed charges against a recidivist Quebec securities law violator, Dominic Lacroix, and his company, PlexCorps. The Commission's complaint, filed in federal court in Brooklyn, New York, alleges that Lacroix and PlexCorps marketed and sold securities called PlexCoin on the internet to investors in the U.S. and elsewhere, claiming that investments in PlexCoin would yield a 1,354 percent profit in less than 29 days. The SEC also charged Lacroix's partner, Sabrina Paradis-Royer, in connection with the scheme.

Today's charges are the first filed by the SEC's new Cyber Unit. The unit was created in September to focus the Enforcement Division's cyber-related expertise on misconduct involving distributed ledger technology and initial coin offerings, the spread of false information through electronic and social media, hacking and threats to trading platforms.

"This first Cyber Unit case hits all of the characteristics of a full-fledged cyber scam and is exactly the kind of misconduct the unit will be pursuing," said Robert Cohen, Chief of the Cyber Unit. "We acted quickly to protect retail investors from this initial coin offering's false promises."

Based on its filing, the SEC obtained an emergency court order to freeze the assets of PlexCorps, Lacroix, and Paradis-Royer.

The SEC’s complaint charges Lacroix, Paradis-Royer and PlexCorps with violating the anti-fraud provisions, and Lacroix and PlexCorps with violating the registration provision, of the U.S. federal securities laws.  The complaint seeks permanent injunctions, disgorgement plus interest and penalties.  For Lacroix, the SEC also seeks an officer-and-director bar and a bar from offering digital securities against Lacroix and Paradis-Royer.

The Commission's investigation was conducted by Daphna A. Waxman, David H. Tutor, and Jorge G. Tenreiro of the New York Regional Office and the Cyber Unit, with assistance from the agency's Office of International Affairs. The case is being supervised by Valerie A. Szczepanik and Mr. Cohen. The Commission appreciates the assistance of Quebec's Autorité Des Marchés Financiers.

The SEC's Office of Investor Education and Advocacy issued an Investor Alert in August 2017 warning investors about scams of companies claiming to be engaging in initial coin offerings: https://www.investor.gov/additional-resources/news-alerts/alerts-bulletins/investor-alert-public-companies-making-ico-related.

Pride, Prejudice


Generated output

What it does

The problem isn't generating over 50,000 words. The problem is existing books are too long. Pride and Prejudice is 130,000 words, Moby Dick is 215,136 words (or 215,136 meows). And we all know 50,000 is the gold standard for a novel! So how can we reduce the word count?

These tactics (deboilerplatify, remove_quote_things, deveryify, decontractify and dehonorify, as seen in the example run below) reduce Pride and Prejudice by about 15% to 111,000 words.

Next we work out the ratio of our word count to 50k, count how many sentences we have, work out how many sentences to keep to approach 50k, and use a text summariser to chop out the dead wood.
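
A rough sketch of that arithmetic, inferred from the sample run shown below (not necessarily the exact code in reducifier.py):

import math

words = 111_633      # word count after the reduction passes
sentences = 4_588    # sentences in the reduced text
target = 50_000      # the NaNoGenMo gold standard

ratio = math.ceil(words / target)   # -> 3
keep = sentences // ratio           # -> 1529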

How to do it

Run:

pip install -r requirements.txt

python reducifier.py

Example:

python reducifier.py
open
word count: 130,000
word count: 126,936	diff: 97.643%	deboilerplatify
word count: 125,438	diff: 96.491%	remove_quote_things
word count: 121,549	diff: 93.499%	deveryify
word count: 121,018	diff: 93.091%	decontractify
word count: 111,633	diff: 85.872%	dehonorify
Ratio (words/50k):	 3
Number of sentences:	 4588
Number to keep:		 1529
word count: 54,273	diff: 41.748%	summarise

This produces output.txt before the summariser, and output2.txt after the summariser.

Works at least on macOS High Sierra with Python 3.6.3.

Example

Here's a diff of Pride and Prejudice and the first pass output.txt:

'tis a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

Source code

https://github.com/hugovk/NaNoGenMo-2017/tree/master/03-reducifier

Apple Agrees to Deal with Ireland Over $15B Unpaid Tax Issue


BRUSSELS—Ireland will begin collecting €13 billion ($15.46 billion) in back taxes from Apple Inc. as soon as early next year after both sides agreed to the terms of an escrow fund for the money, Ireland’s finance chief said Monday.

The European Union in 2016 ordered Dublin to retrieve the billions of euros from Apple in uncollected taxes, which the EU said Apple avoided paying with the help of sweetheart tax deals from Ireland.

A...

How Jet Built a GPU-Powered Fulfillment Engine with F# and CUDA


Jet.com Fulfillment

Have you ever looked at your shopping list and tried to optimize your trip based on things like distance to store, price, and number of items you can buy at each store? The quest for a smarter shopping cart is never-ending, and the complexity of finding even a sub-optimal solution to this problem can quickly get out of hand. This is especially true of online shopping, which expands the set of fulfillment possibilities from local to national scale. Ideally, you could shop online for all items from your list and the website would do all the work to find you the most savings.

That is exactly what Jet.com does for you! Jet.com is an e-commerce company (acquired by Walmart in 2016) known for its innovative pricing engine that finds an optimal cart and the most savings for the customer in real time.

In this post I discuss how Jet tackles the fulfillment optimization problem using GPUs with F#, Azure and microservices. We implemented our solutions in F# via AleaGPU, a natural choice for coding CUDA solutions in .NET. I will also cover relevant aspects of our microservice architecture.

The Merchant Selection Problem

Jet.com provides value by finding efficiencies in the system and passing them through to customers in the form of lower prices. At the center of this is Jet’s smart merchant selection algorithm. When a customer orders several items at once, they can usually be fulfilled by multiple merchants with different warehouses. The goal is to find the combination of merchants and warehouses that minimizes the total order cost, including shipment costs and commissions. The bigger the shopping cart, the higher the potential savings, but also the more time-consuming the search for the optimal combination.

Take, for example, a cart of four items (retail SKUs, or “Stock Keeping Units”) and a local market of four merchants.

Figure 1 shows each merchant’s price for each of the four SKUs and the shipping cost by merchant for some possible fulfillment combinations those merchants can provide. The table of shipping costs shows the cost for either individual SKUs or multiple SKUs packaged together (delimited by a comma or a plus sign, respectively). For example, the cost to ship any individual SKU from Merchant 3 is $3.50; however, SKU 3 and SKU 4 can be shipped together for only $4.50.

Figure 1: Initial cart of four items with a total of 3*3*2*4=72 combinations.

The naive approach of choosing the merchant with the cheapest offer for each SKU results in a total of $95 for the cart (Figure 2). This approach neglects shipping costs and fails to discover any savings from shipping multiple items from the same merchant.

Figure 2: Pricing the cart by taking the cheapest net price allocation.

For example, Merchant 2 can fulfill three of the four SKUs for only $8.50 in shipping costs, so fulfilling the order via Merchant 2 and Merchant 4 brings the total down to $94 (Figure 3). Is this the optimal combination?

Figure 3: We can do better by packing items together.

The only way to find the optimal combination (the most savings for the customer, as Figure 4 shows) is by pricing every one of the 72 combinations for this cart.

Figure 4: The optimal allocation.

The Complexity of Full Search

Pricing every possible fulfillment combination to find the optimal solution is an exhaustive brute-force search of the entire solution space. The full search approach to the merchant selection problem is embarrassingly parallel: each enumerated fulfillment combination can be priced independently. This natural parallelism led us to initially use the full search approach on the GPU, but it turns out that full search is prohibitive (even with GPU acceleration) due to the exponential complexity of the merchant selection problem. Since a genetic algorithm (discussed in the next section) does not guarantee a fully optimal solution, it’s important to use full search when possible. We developed a metric called the “full search year” which we use to measure the computation time required for merchant selection.

A full search year is the number of combinations that can be priced within a year. We use GPU full search year or CPU full search year to indicate the time required when running either implementation. Full search seconds, minutes, hours, and days are derived from full search year.

Cart complexity, or merchant selection complexity, is the total number of fulfillment combinations that must be priced in order to find the optimal fulfillment and lowest price.

Cart complexity = number of combinations = (# offers for item 1) × (# offers for item 2) × … × (# offers for item k).
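
To make the arithmetic concrete, here is a minimal F# sketch (my own illustration, not Jet's production code) that computes cart complexity from the per-SKU offer counts:

let cartComplexity (offersPerSku: int list) =
    offersPerSku |> List.fold (fun acc n -> acc * int64 n) 1L

// The cart in Figure 1: four SKUs with 3, 3, 2 and 4 offers each.
let figure1 = cartComplexity [3; 3; 2; 4]   // 72L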

To illustrate both the complexity of the merchant selection problem and the application of the full search year metric, we dug through our logs to find a cart which timed out when a customer tried to place an order. These timeouts tend to occur most often when a large number of merchants can fulfill each item, as is common with electronics. Figure 5 shows an example in which the customer was ordering components to build a computer.

Figure 5: Example of a cart that required 70,442,237,952,000 (70 trillion!) combinations, leading to a timeout.

The chart in Figure 5 shows the number of offers for each of the 11 retail SKUs in the customer’s cart. The cart complexity is 32 × 17 × 19 × 16 × 29 × 9 × 25 × 10 × 16 × 17 × 24 = 70,442,237,952,000 ≈ 10^13.85 combinations. The time required to find the optimal fulfillment for this cart is 8.6 CPU full search years or 11 GPU full search days! This is well outside our target response time of a few seconds. Given this constraint, the problem quickly grows larger than what GPU full search can handle within the target time. We needed a better, more scalable solution.

A Genetic Algorithm

To address the scalability issues, we decided to apply genetic algorithms (GA) to solve the problem. Two important points to consider with the GA approach:

  1. GA can only find approximately optimal solutions.
  2. A standard GA will not work because our search space is astronomically large and we need a reliable approximate solution in near real-time.

Jet.com’s constraints on response time limit how long we can spend producing consecutive generations of the population. Moreover, since each generation depends on the previous one, iterating generations is an inherently serial process and we can’t reduce computation time by parallelizing across iterations.

We used four methods to address these issues:

  1. Dramatically increasing the population size to reduce the number of iterations required for convergence.
  2. Improving the initial population by including merchant combinations which are likely to be good. For example, rather than starting with an initial population of fully randomized combinations, we include combinations in which single merchants fulfill multiple retail SKUs since this tends to reduce shipping costs in most cases.
  3. Identifying parts of the population which can be used to guide the behavior of mutation and crossover operators.
  4. Leveraging AI/ML to choose the appropriate configuration for the GA.

The fourth method was non-trivial and introduced some of its own unique challenges which I hope to cover in more detail in a future blog post.

Implementation

I’m going to take you through some implementation details of the GPU full search algorithm (GPUFS), but first I want to briefly discuss algorithm selection. There are three merchant selection algorithms we can use: the CPU full search (CPUFS), the GPU full search, and the GPU genetic algorithm (GPUGA). Each approach has its own strengths and weaknesses depending on the situation.

The implementation uses a decision function to choose between the three algorithms based on cart complexity and load. For example, consider the case of a single-item checkout. Pricing a single-item cart on the GPU would take longer than pricing it on the CPU due to the cost of data transfer to the GPU. For more complex carts, if we can perform merchant selection within an acceptable timeframe using GPUFS, we should choose it over GPUGA, since GAs do not guarantee optimal results.

Choosing the best algorithm for the task is unfortunately not as simple as “if cart = small then do CPUFS elif cart = big then do GPUFS elif cart = huge then do GPUGA.” We needed a smart decision function to address these issues so we used machine learning to train a model based on cart features and used it to improve the function’s results. In order to train this model appropriately and employ multiple algorithms we had to develop a way to validate, appraise, and compare multiple algorithms on production data. I hope to share details of this work in future blog posts as they are out of scope of our current discussion. For now, let’s focus on the details of the GPUFS algorithm and the microservice that invokes it.

Figure 6: An example local market supply for a cart of three items.

First I need to explain two core concepts: local market supply and allocation. The local market supply is an array of mappings of fulfillment nodes to offers for each retail SKU in a cart. Figure 6 shows an example LocalMarketSupply data structure; you can see (for example) that Node 0 can fulfill SKU 0 from two possible offers.

The goal of the search algorithms is to find the cheapest fulfillment combination for the items in the customer’s cart. Consider a scenario in which a customer initiates a checkout with three items: SKU 0, SKU 1, and SKU 2. Jet.com’s microservice queries a database to find all offers for these three retail SKUs and then uses this information to build the LocalMarketSupply structure for the cart.

The actual search space of the merchant selection problem is the set of all possible fulfillment combinations, which we refer to as allocations. A single allocation is an array of integers representing a combination of fulfillment options for the set of retail SKUs being priced. The indices of this array represent the IDs of the retail SKUs, and the value at each index is the offer ID chosen for that retail SKU.

A = [o_{r_0} \ldots o_{r_i}]

Therefore the set of allocations for the local market supply defined in Figure 6 would resemble the table in Figure 7.

Figure 7: The full search space for the hypothetical local market supply shown in Figure 6.

Full Search

Jet.com’s full search implementation is straightforward. One kernel function processes all possible allocations, pricing each one and performing a min reduction to find the cheapest.

To aid explanation I’ll divide the full search implementation into two conceptual parts: search and pricing. The search part essentially refers to the kernel and the code responsible for finding the cheapest allocation. The pricing part then refers to code that actually prices an individual allocation.  Let’s look at the search part first.

The kernel has a main while loop that follows a familiar strided access pattern.

// Because we need to repeatedly refer to our local market supply to
// determine which retail skus a fulfillment node can fulfill we store the
// compressed supply in shared memory.
let shared = __shared__.ExternArray(8) |> __address_of_array
let supply = CompressedSupply.LoadToShared(supply, shared.Reinterpret(), numRetailSkus)

let mutable minAllocation = -1L
let mutable minPrice = RealMax
let start = blockIdx.x * blockDim.x + threadIdx.x
let stride = gridDim.x * blockDim.x
let mutable localLinear = int64 start

while localLinear < numElements do
    let allocation = localLinear + linearStart
    let price = price allocation

    if price < minPrice then
        minPrice <- price
        minAllocation <- allocation

    localLinear <- localLinear + (int64 stride)

We perform a warp reduce to find the best price/allocation for each warp.

let mutable pricedAllocation = PricedAllocation(minPrice, minAllocation)

for i = 0 to Util.WarpSizeLog - 1 do
    let offset = 1 <<< i
    let peer = DeviceFunction.ShuffleDown(pricedAllocation, offset, Util.WarpSize)
    pricedAllocation <- PricedAllocation.Min pricedAllocation peer

// Synchronize threads because we reuse the shared memory
__syncthreads()

Then we prepare for a block reduce by adding the warp reduce result to shared memory.

let mutable shared = shared.Reinterpret()

if threadIdx.x &&& Util.WarpSizeMask = 0 then
    let warpId = threadIdx.x >>> Util.WarpSizeLog
    shared.[warpId] <- pricedAllocation

__syncthreads()

Next we perform a block reduce to find the best price and allocation among the warps within each block.

if threadIdx.x = 0 then
    for warpId = 1 to (blockDim.x / Util.WarpSize) - 1 do
        pricedAllocation <- PricedAllocation.Min pricedAllocation shared.[warpId]
    output.Prices.[blockIdx.x] <- pricedAllocation.Price
    output.Allocations.[blockIdx.x] <- pricedAllocation.Allocation

Now we need to retrieve the results from the GPU, find the price/allocation pair with the minimum price, and then decode the respective allocation from int64 back to an array of int.

let decodeAllocation =
    // A jagged array of offers by retail sku id, the first dimension
    // runs over all **sorted** retail sku ids and the second dimension
    // are the offers for the given retail sku id
    let offerIdsByRetailSkuId =
        LocalMarket.localMarketSupplyToOfferIdsByRetailSkuId pr.LocalMarketSupply
    let dims = offerIdsByRetailSkuId |> Array.map (fun fids -> fids.Length)
    let indexer = RowMajor(dims)

    fun allocation ->
        let indices = allocation |> indexer.ToIndex
        indices |> Array.mapi (fun rsid idx -> offerIdsByRetailSkuId.[rsid].[idx])

Finally the getResults function copies the array of grid minima back to the host and performs a min by price to find the global minimum along with its accompanying allocation.

let getResults() =
    let prices = Gpu.CopyToHost(prices)
    let allocations = Gpu.CopyToHost(allocations)
    let price, allocation = Array.zip prices allocations |> Array.minBy fst
    let allocation = decodeAllocation allocation
    price, allocation

Now that we’ve seen the general flow of the search kernel, let’s look into the pricing code. The allocation price function is defined within the kernel function.

let price (allocation:int64) =
    generalPrice 
        numRetailSkus numFulfillmentNodes
        (fun i -> fulfillmentNodes.[i])
        (fun i -> shippingRules.[i])
        (fun i -> commissionRules.[i])
        (loopOffers allocation)

The generalPrice function prices an allocation by calling priceFulfillmentNode on an incrementing fulfillmentNodeId within a while loop. The priceFulfillmentNode function returns the number of SKUs computed in the allocation so the loop can exit early if possible. Note that generalPrice and loopOffers are both higher order functions. price uses partial function application and passes the partially applied loopOffers function along to generalPrice.
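
If the partial-application pattern is unfamiliar, the idea in miniature looks like this (illustrative F# only, not code from the engine):

// Supplying only the first argument of a two-argument function yields a
// new function that waits for the rest; this is how price pins the
// allocation argument onto loopOffers before handing it to generalPrice.
let scale (factor: float) (x: float) = factor * x
let twice = scale 2.0       // factor is fixed, x is still pending
let answer = twice 21.0     // 42.0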

The loopOffers function loops over the offers of the allocation which are being fulfilled by fulfillment node fnid. Every time loopOffers is called, it loops over each retail SKU but only invokes the function f if the fulfillment node id of the offer for that retail SKU is equal to the current node being priced. Implementing loopOffers as a higher order function provides an abstraction over the various updateXYZ functions used within priceFulfillmentNode.

let loopOffers (allocation:int64) (fnid:int) (f:Offer -> Sku -> unit) =
    let offset = ref allocation
    let f (iter:AllocationIterator) (rsid:int) =
        let idx = AllocationIterator.Decode(iter, offset)
        let oid = supply.OfferIds.[iter.SupplyOffset + idx]
        let offer = offers.[oid]
        let sku = retailSkus.[rsid]
        if offer.Id = oid && offer.fulfillmentNodeId = fnid && offer.retailSkuId = rsid then
            f offer sku
    // The AllocationIterator type is used only by the GPU full 
    // search implementation where we encode allocations of int[]
    // into int64 to improve performance and use less memory.
    AllocationIterator.Iterate(supply, numRetailSkus, f)

The reduction in memory use provided by doing this int[] to int64 encoding is important for full search since it must enumerate all possible allocations. Using encoded allocations significantly increases the capabilities of the full search algorithm with regard to the level of cart complexity the algorithm can handle.
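
The encoding itself is plain mixed-radix (row-major) arithmetic; here is a minimal sketch of the idea (my own illustration, not the actual AllocationIterator implementation):

// Pack one offer index per retail SKU into a single int64, given the
// number of offers available for each SKU, and unpack it again.
let encode (dims: int[]) (allocation: int[]) =
    Array.fold2 (fun acc d i -> acc * int64 d + int64 i) 0L dims allocation

let decode (dims: int[]) (code: int64) =
    let indices = Array.zeroCreate dims.Length
    let mutable rest = code
    for k = dims.Length - 1 downto 0 do
        indices.[k] <- int (rest % int64 dims.[k])
        rest <- rest / int64 dims.[k]
    indices

// For example, encode [|3; 3; 2; 4|] [|2; 1; 0; 3|] = 59L, and decode inverts it.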

priceFulfillmentNode prices the set of offers being fulfilled by fulfillment node fnid, and updates various aspects of the order such as the total order price, total weight, and shipping cost.

let priceFulfillmentNode fnid
                         (fulfillmentNode:FulfillmentNode)
                         (shippingRules:int -> ShippingRule)
                         (commissionRules:int -> CommissionRule)
                         (loopOffers:int -> (Offer -> Sku -> unit) -> unit) =
    let mutable orderPrice = 0.0
    let mutable numSkusPriced = 0

    let order = ref (Order())
    let orderSums = ref (OrderSums())
    let updateOrderSums = updateOrderSums orderSums
    let updateShippingCost = updateShippingCost shippingRules orderSums order
    let updateOfferPrice = updateOfferPrice commissionRules orderSums order
    loopOffers fnid updateOrderSums
    if orderSums.contents.linesCount > 0 then
        numSkusPriced <- orderSums.contents.linesCount
        updateShippingCost fulfillmentNode
        loopOffers fnid (updateOfferPrice fulfillmentNode)
        let order = !order
        orderPrice <- order.totalNetPrice + order.totalShipping - order.commission
    orderPrice, numSkusPriced

Microservice Layer

At a high level, Jet.com’s microservices are just simple executables that listen to a route for an HTTP request and translate the body JSON to an F# record containing all the data necessary to perform the merchant selection operation. The microservices reference the pricing engine library and use it accordingly.

In addition to testing the core performance of the GPU algorithm, we also needed to address how the GPU microservice would handle high load. That is, what do we do when a request is received before the GPU is done with the previous calculation? Our solution uses a BlockingCollection of size 1 from System.Collections.Concurrent. The TryAdd member attempts to add an element to the collection within the specified wait time.

// 4 second wait time
let [<Literal>] private MAX_WAIT_TIME = 4000
// With a collection size of 1, once an element has been added to it 
// further adds are blocked until the element is removed.
let requestQueue = new BlockingCollection<_>(1)
let cheapestAllocation () =
    match pricingAlgorithm with
    | GpuFullSearch | PricingAlgorithm.GpuGenetic ->
        // We try for 4 seconds to add a new request
        if requestQueue.TryAdd(merchantSelectionRequest, MAX_WAIT_TIME) then
            try
                let result = pricingAlgorithm.Invoke pr
                // Once we have the result, remove the request from the
                // collection so we can process the next one
                requestQueue.Take() |> ignore
                Success(result)
            with exn ->
                requestQueue.Take() |> ignore
                Failure exn
        else
            // If we are unable to add a new request to the collection within 
            // 4 seconds, fail with GPU busy message
            Failure(new Exception("GPU busy"))
    | _ ->
        // CPU algorithms aren’t blocked
        pricingAlgorithm.Invoke pr
        |> Success

This simple solution works well. Due to the improved performance on medium to large-size carts, one Azure N-Series GPU machine is able to handle the request load of multiple Azure D3 (CPU-only) instances.

Conclusion: GPU-accelerated Fulfillment with F# in the Cloud

This post introduced Jet.com's exponentially complex merchant selection problem and our approach to solving it within an environment of microservices written in F# and running in the Azure cloud. I covered implementation details of a brute-force GPU search approach to merchant selection and how this algorithm is used from a RESTful microservice. In future posts I hope to expand on our genetic algorithm approach, how we validated and compared multiple algorithms on production data, and different ways we have used AI and machine learning to improve the performance of our pricing engine.

If you’d like to learn more about some of the GPU-related work going on at Jet.com be sure to check out these two GTC presentations given by Daniel Egloff:

  • “Prices Drop as You Shop: How Walmart is Using Jet’s GPU-based Smart Merchant Selection to Gain a Competitive Advantage” (MP4, PDF)
  • “Welcome to the Jet Age – How AI and Deep Learning Make Online Shopping Smarter at Walmart” (MP4, PDF)

Check out the Jet Technology Blog to learn more about Jet.com and the problems we are working on, and feel free to reach out to me directly or in the comments below.

Acknowledgements

I would like to thank Daniel Egloff, Neerav Kothari, Xiang Zhang, and Andrew Shaeffer. This work would not have been possible without them.

Evolution of <img>: Gif without the GIF

  • GIFs are awesome but terrible for quality and performance
  • Replacing GIFs with <video> is better but has perf. drawbacks: not preloaded, uses range requests
  • Now you can use <img src="*.mp4"> in Safari Technology Preview
  • Early results show mp4s in <img> tags display 20x faster and decode 7x faster than the GIF equivalent – in addition to being 1/14th the file size!
  • Background CSS video & Responsive Video can now be a “thing”.
  • Finally cinemagraphs without the downsides of GIFs!
  • Now we wait for the other browsers to catch-up: This post is 46MB on Chrome but 2MB in Safari TP

Special thanks to: Eric Portis, Jer Noble, Jon Davis, Doron Sherman, and Yoav Weiss.

I both love and hate animated GIFs. [Image: Ode to Geocities. Thanks Tim.]

Safari Tech Preview has changed all of this. Now I love and love animated “GIFs”.

Everybody loves animated Gifs!

Animated GIFs are a hack. To quote from the original GIF89a specification:

The Graphics Interchange Format is not intended as a platform for animation, even though it can be done in a limited way.

But they have become an awesome tool for cinemagraphs, memes, and creative expression. All of this awesomeness, however, comes at a cost. Animated GIFs are terrible for web performance. They are HUGE in size, impact cellular data bills, require more CPU and memory, cause repaints, and are battery killers. Typically GIFs are 12x larger files than H.264 videos, and take 2x the energy to load and display in a browser. And we’re spending all of those resources on something that doesn’t even look very good – the GIF 256 color limitation often makes GIF files look terrible (although there are some cool workarounds).

My daughter loves them – but she doesn’t understand why her battery is always dead.

GIFs have many advantages: they are requested immediately by the browser preloader, they play and loop automatically, and they are silent! Implicitly they are also shorter. Market research has shown that users have higher engagement with, and generally prefer, both micro-form video (< 1 minute) and cinemagraphs (stills with subtle movement) over longer-form videos and still images. Animated GIFs are great for user experience.

videos that are <30s have highest conversion

So how did I go from love/hating GIFs to love/loving “Gifs”? (capitalization change intentional)

In the latest Safari Tech Preview, thanks to some hard work by Jer Noble, we can now use MP4 files in <img> tags. The intended use case is not long-form video, but micro-form, muted, looping video – just like GIFs. Take a look for yourself:

<img src="rocky.mp4">
Rocky!

Cool! This is going to be awesome on so many fronts – for business, for usability, and particularly for web performance!

As many have already pointed out, using the <video> tag is much better for performance than using animated GIFs. That's why in 2014 Twitter famously added animated GIF support by not adding GIF support. Twitter instead transcodes GIFs to MP4s on-the-fly, and delivers them inside <video> tags. Since all browsers now support H.264, this was a very easy transition.

<video autoplay loop muted playsinline>
<source src="eye-of-the-tiger-video.webm" type="video/webm">
<source src="eye-of-the-tiger-video.mp4" type="video/mp4">
<img src="eye-of-the-tiger-fallback.gif"/>
</video>

Transcoding animated GIFs to MP4 is fairly straightforward. You just need to run ffmpeg -i source.gif output.mp4
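
In practice a couple of extra flags help. Here is a sketch (these options go beyond the one-liner above and are suggestions, not anything Twitter has documented): H.264 requires even pixel dimensions, yuv420p gives the widest playback compatibility, and +faststart moves the metadata to the front of the file so playback can begin before the download completes.

ffmpeg -i source.gif \
       -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" \
       -pix_fmt yuv420p -movflags +faststart \
       output.mp4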

However, not everyone can overhaul their CMS and convert <img> to <video>. Even if you can, there are three problems with this method of delivering GIF-like (Gif), micro-form video:

1. Browser performance is slow with <video>

As Doug Sillars recently pointed out in an HTTP Archive post, there is a huge performance penalty in the visual presentation when using the <video> tag.

Sites without video load about 28 percent faster than sites with video

Unlike <img> tags, browsers do not preload <video> content. Generally preloaders only preload JavaScript, CSS, and image resources because they are critical for the page layout. Since <video> content can be any length – from micro-form to long-form – <video> tags are skipped until the main thread is ready to parse its content. This delays the loading of <video> content by many hundreds of milliseconds.


For example, the hero video at the top of the Velocity conference page is only requested 5 full seconds into the page load. It’s the 27th requested resource and it isn’t even requested until after Start Render, after webfonts are loaded.

Worse yet, many browsers assume that <video> tags contain long-form content. Instead of downloading the whole video file at once, which would waste your cell data plan in cases where you do not end up watching the whole video, the browser will first perform a 1-byte request to test if the server supports HTTP Range Requests. Then it will follow with multiple range requests in various chunk sizes to ensure that the video is adequately (but not over-) buffered. The consequence is multiple TCP round trips before the browser can even start to decode the content and significant delays before the user sees anything. On high-latency cellular connections, these round trips can set video loads back by hundreds or thousands of milliseconds.
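
The exchange looks roughly like this (illustrative only; the exact ranges, chunk sizes, and file length are made up, and the pattern varies by browser and server):

GET /hero.mp4 HTTP/1.1
Range: bytes=0-1

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-1/4287530
Content-Length: 2

GET /hero.mp4 HTTP/1.1
Range: bytes=0-524287

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-524287/4287530
Content-Length: 524288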


And what performs even worse than the native <video> element? The typical JavaScript video player. Often, the easiest way to embed a video on a site is to use a hosted service like YouTube or Vimeo and avoid the complexities of video encoding, hosting, and UX. This is normally a great idea, but for micro-form video, or critical content like hero videos, it just adds to the delay because of the JavaScript players and supporting resources these hosting services inject (CSS/JS/JPG/WOFF). In addition to the <video> markup, you are forcing the browser to download, evaluate, and execute the JavaScript player, and only then can the video start to load.


As many people know, I love my Loki jacket because of its built in mitts, balaclava, and a hood that is sized for helmets. But take a look at the Loki USA homepage – which uses a great hero-video, hosted on Vimeo:

lokiusa.com filmstrip
lokiusa.com video

If you look closely, you can see that the JavaScript for the player is actually requested soon after DOM Complete. But it isn’t fully loaded and ready to start the video stream until much later.

lokiusa.com waterfall

WPT Results

2. You can’t right click and save video

Most long-form video content – vlogs, TV, movies – is delivered via JavaScript-based players. Usually these players provide users with a convenient “share now” link or bookmark tool, so they can come back to YouTube (or wherever) and find the video again. In contrast, micro-form content – like memes and cinemagraphs – usually doesn’t come via a player, and users expect to be able to download GIFs and send them to friends, like they can with any image on the web. That meme of the dancing cat was sooo funny – I have to share it with all my friends!

If you use <video> tags to deliver micro-form video, users can’t right-click, click-and-drag, or force touch, and save. And their dancing-cat joy becomes a frustrating UX surprise.

3. Autoplay abuse

Finally, using <video> tags and MP4s instead of <img> tags and GIFs brings you into the middle of an ongoing cat-and-mouse game between browsers and unconscionable ad vendors, who abuse the <video autoplay> attribute in order to get users' attention. Historically, mobile browsers have ignored the autoplay attribute and/or refused to play videos inline, requiring them to go full screen. Over the last couple of years, Apple and Google have both relaxed their restrictions on inline, autoplaying videos, allowing for Gif-like experiences with the <video> tag. But again, ad networks have abused this, causing further restrictions: if you want to autoplay <video> tags, you need to mark the content with muted or remove the audio track altogether.

The GIF format isn’t the only animation-capable, still-image format. WebP and PNG have animation support, too. But, like GIF, they were not designed for animation and result in much larger files, compared to dedicated video codecs like H.264, H.265, VP9, and AV1.

Animated PNG is now widely supported across all browsers, and while it addresses the color palette limitation of GIF, it is still an inefficient file format for compressing video.

Animated WebP is better, but compared to true video formats it's still problematic. Aside from not having a formal standard, animated WebP lacks chroma subsampling and wide-gamut support. Further, the ecosystem of support is fragmented. Not even all versions of Android, Chrome, and Opera support animated WebP, even though those browsers advertise support via the Accept: image/webp request header. You need Chrome 42+, Opera 15+ or Android 5+.

So while animated WebPs compress much better than animated GIFs or aPNGs, we can do better. (See file size comparisons below)

By enabling true video formats (like MP4) to be included in <img> tags, Safari Technology Preview has fixed these performance and UX problems. Now our micro-form videos can be small and efficient (like MP4s delivered via the <video> tag), and they can be easily preloaded, autoplayed, and shared (like our old friend, the GIF).

<img src="ottawa-river.mp4">

So how much faster is this going to be? Pull up the developer tools and see the difference in Safari Technology Preview and other browsers:

Take a look at this!

Unfortunately Safari doesn’t play nice with WebPageTest, and creating reliable benchmark tests is complicated. Likewise, Tech Preview’s usage is fairly low, so comparing performance with RUM tools is not yet practical.

We can, however, do two things. First, compare raw byte sizes, and second, use the Image.decode() promise to measure the device impact of different resources.

Byte Savings

First, the byte size savings. To compare, I took the top 100 trending animated GIFs from giphy.com and transcoded them into VP8, VP9, animated WebP, H.264, and H.265.

NB: These results should be taken as directional only! Each codec could be tuned much further; as you can see, VP9 fares worse here than the default VP8 output. A more comprehensive study should be done that considers SSIM.
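
For reference, a conversion script along these lines reproduces the comparison (a sketch with default encoder settings, not the exact commands used):

# Batch-convert GIFs to each format with ffmpeg's default encoder settings.
# Note: H.264/H.265 need even pixel dimensions (see the scale filter shown earlier).
for f in *.gif; do
  ffmpeg -i "$f" -c:v libvpx                   "${f%.gif}-vp8.webm"    # WebM/VP8
  ffmpeg -i "$f" -c:v libvpx-vp9               "${f%.gif}-vp9.webm"    # WebM/VP9
  ffmpeg -i "$f" -c:v libx264 -pix_fmt yuv420p "${f%.gif}-h264.mp4"    # MP4/H.264
  ffmpeg -i "$f" -c:v libx265                  "${f%.gif}-h265.mp4"    # MP4/H.265
  ffmpeg -i "$f" -c:v libwebp -loop 0          "${f%.gif}.webp"        # animated WebP
done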

Below are the median (p50) results of the conversion:

Format        Bytes (p50)    % change (p50)
GIF           1,713 KB
WebP          310 KB         -81%
WebM/VP8      57 KB          -97%
WebM/VP9      66 KB          -96%
WebM/AV1      TBD
MP4/H.264     102 KB         -93%
MP4/H.265     43 KB          -97%

Yes, animated WebP is smaller than GIF, but any true video format is much smaller still. This shouldn't surprise anyone, since these modern video codecs are highly optimized for online video streaming. H.265 fares very well, as I expect AV1 will too.

The benefits here will not only be faster transit but also substantial $$ savings for end users.

Net-Net, using video in <img> tags is going to be much faster on a cellular connection.

Decode and Visual Performance Improvements

Next, let's consider the impact of decode and display on the browsing experience. H.264 (and H.265) has the notable advantage of being hardware decoded rather than tying up the primary CPU cores for decode.

How can we measure this? Since browsers haven't yet implemented the proposed hero image API, we can use Steve Souders' User Timing and Custom Metric strategy as a good approximation of when the image starts to display to the user. It doesn't measure frame rate, but it tells us roughly when the first frame is displayed. Better yet, we can also use the newly adopted Image.decode() promise to measure decode performance. In the test page below, I inject a unique GIF and MP4 in an <img> tag 100 times and compare the decode and paint performance.

// Wrapped in a Promise so both first paint (onload) and decode() can be timed
function loadAndDecode(src) {
  return new Promise((resolve) => {
    let image = new Image();
    t_startReq = new Date().getTime();                 // start-of-request timestamp
    document.getElementById("testimg").appendChild(image);
    image.onload = timeOnLoad;                         // records time to first frame
    image.src = src;
    image.decode().then(() => { resolve(image); });    // resolves once decoding completes
  });
}

The results are quite impressive! Even on my powerful 2017 MacBook Pro, running the test locally, with no network throttling, we can see GIFs taking 20x longer than MP4s to draw the first frame (signaled by the onload event), and 7x longer to decode!

Localhost test on 2017 i7 MacBook Pro

Curious? Clone the repo and test for yourself. I will note that adding network conditions to the transit of the GIF vs. MP4 will disproportionately skew the test results. Specifically, since decoding can start before the last byte arrives, the delta between transfer, display, and decode becomes much smaller. What this really tells us is that the byte savings alone will substantially improve the user experience. However, factoring out the network as I've done in this localhost run, you can see that using video also has substantial benefits for energy consumption.

So now that Safari Technology Preview supports this design pattern, how can you actually take advantage of it, without serving broken images to non-supporting browsers? Good news! It’s relatively easy.

Option 1: Use Responsive Images

Ideally, the simplest way is to use the type attribute of the <source> element inside the HTML5 <picture> tag.

<picture>
<source type="video/mp4" srcset="cats.mp4">
<source type="image/webp" srcset="cats.webp">
<img src="cats.gif">
</picture>

I'd like to say we can stop there. However, there is a nasty WebKit bug in Safari that causes the preloader to download the first <source> regardless of the mime-type declaration. The main DOM loader realizes the error and selects the correct one, but by then the damage is done: the preloader squanders its opportunity to download the image early and, on top of that, downloads the wrong version, wasting bytes. The good news is that I've patched this bug and it should land in Safari TP 45.

In short, using <picture> and <source type> for mime-type selection is not advisable until a fixed version of Safari reaches 90%+ of the Safari user base.

Option 2: Use MP4, animated WebP and Fallback to GIF

If you don't want to change your HTML markup, you can use HTTP content negotiation to send MP4s to Safari. To do so, you must generate multiple copies of your cinemagraphs (just like before) and Vary responses based on both the Accept and User-Agent headers.

This will get a bit cleaner once WebKit BUG 179178 is resolved and you can add a test for the Accept: video/* header (much like the way you can test for Accept: image/webp). But the end result is that each browser gets the best format for <img>-based micro-form videos that it supports:

Browser          Accept Header          Response
Safari TP 41+                           H.264 MP4
                 Accept: video/mp4      H.264 MP4
Chrome 42+       Accept: image/webp     aWebP
Opera 15         Accept: image/webp     aWebP
                 Accept: image/apng     aPNG
Default                                 aGIF

In nginx this would look something like:


map $http_user_agent $mp4_suffix {
    default   "";
    "~*Safari/605"  ".mp4";
}

location ~* \.(gif)$ {
      add_header Vary "Accept, User-Agent";
      try_files $uri$mp4_suffix $uri =404;
}

Of course, don’t forget the Vary: Accept, User-Agent to tell coffee-shop proxies and your CDN to cache each response differently. In fact, you should probably mark the Cache-Control as private and use TLS to ensure that the less sophisticated ISP Performance-Enhancing-Proxies don’t cache the content.

GET /example.gif HTTP/1.1
Accept: image/png, video/*, */*
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/605.1.13 (KHTML, like Gecko) Version/11.1 Safari/605.1.13

…

HTTP/1.1 200 OK
Content-Type: video/mp4
Content-Length: 22378567
Vary: Accept, User-Agent

Option 3: Use RESS and Fall Back to the <video> tag

If you can manipulate your HTML, you can adopt the Responsive-Server-Side (RESS) technique. This option moves the browser detection logic into your HTML output.

For example, you could do it like this with PHP:

<?php if (strpos($_SERVER['HTTP_USER_AGENT'], "Safari/605") !== false) { // Safari TP supports MP4 in <img> ?>
<img src="example.mp4">
<?php } else { ?>
<img src="example.gif">
<?php } ?>

As above, be sure to emit a Vary: User-Agent response to inform your CDN that there are different versions of your HTML to cache. Some CDNs automatically honour the Vary headers while others can support this with a simple update to the CDN configuration.

Bonus: Don’t forget to remove the audio track

Now, since your original is an MP4 and you are converting MP4s to GIFs (not the other way around), we should also remember to strip the audio track for extra byte savings. (Please tell me you aren't using GIFs as your originals. Right?!) Audio tracks add extra bytes to the file size that we can quickly strip off, since we know the video will be played on mute anyway. The simplest way with ffmpeg is:

ffmpeg -i cats.mp4 -vcodec copy -an cats-muted.mp4

As I’m writing this, Safari will blindly download whatever video you specify in the <img> tag, no matter how long it is. On the one hand, this is expected because it helps improve the performance of the browser. Yet, this can be deadly if you push down a 120-minute video to the user. I’ve tested multiple sizes and all were downloaded as long as the user hung around. So, be courteous to your users. If you want to push longer form video content, use the <video> tag for better performance.

Now that we can deliver MP4s via <img> tags, doors are opening to many new use cases. Two that come to mind: responsive video, and background videos. Now that we can put MP4s in srcsets, vary our responses for them using Client Hints and Content-DPR, art direct them with <picture media>, well – think of the possibilities!

<img src="cat.mp4" alt="cat"
  srcset="cat-160.mp4 160w, cat-320.mp4 320w, cat-640.mp4 640w, cat-1280.mp4 1280w"
  sizes="(max-width: 480px) 100vw, (max-width: 900px) 33vw, 254px">

Video in CSS background-image: url(.mp4) works, too!

<div style="width:800px; height:200px; background-image:url(colin.mp4)"></div>

By enabling video content in <img> tags, Safari Technology Preview is paving the way for awesome Gif-like experiences, without the terrible performance and quality costs associated with GIF files. This functionality will be fantastic for users, developers, designers, and the web. Besides the enormous performance wins that this change enables, it opens up many new use cases that media and ecommerce businesses have been yearning to implement for years. Here’s hoping the other browsers will soon follow. Google? Microsoft? Mozilla? Samsung? Your move!

Books I read this year


Turtles and jazz chickens

December 4, 2017

Reading is my favorite way to indulge my curiosity. Although I’m lucky that I get to meet with a lot of interesting people and visit fascinating places through my work, I still think books are the best way to explore new topics that interest you.

This year I picked up books on a bunch of diverse subjects. I really enjoyed Black Flags: The Rise of ISIS by Joby Warrick. I recommend it to anyone who wants a compelling history lesson on how ISIS managed to seize power in Iraq.

On the other end of the spectrum, I loved John Green’s new novel, Turtles All the Way Down, which tells the story of a young woman who tracks down a missing billionaire. It deals with serious themes like mental illness, but John’s stories are always entertaining and full of great literary references.

Another good book I read recently is The Color of Law by Richard Rothstein. I’ve been trying to learn more about the forces preventing economic mobility in the U.S., and it helped me understand the role federal policies have played in creating racial segregation in American cities.

I’ve written longer reviews about some of the best books I read this year. They include a memoir by one of my favorite comedians, a heartbreaking tale of poverty in America, a deep dive into the history of energy, and not one but two stories about the Vietnam War. If you’re looking to curl up by the fireplace with a great read this holiday season, you can’t go wrong with one of these.

The Best We Could Do, by Thi Bui. This gorgeous graphic novel is a deeply personal memoir that explores what it means to be a parent and a refugee. The author’s family fled Vietnam in 1978. After giving birth to her own child, she decides to learn more about her parents’ experiences growing up in a country torn apart by foreign occupiers.

Evicted: Poverty and Profit in the American City, by Matthew Desmond. If you want a good understanding of how the issues that cause poverty are intertwined, you should read this book about the eviction crisis in Milwaukee. Desmond has written a brilliant portrait of Americans living in poverty. He gave me a better sense of what it is like to be poor in this country than anything else I have read.

Believe Me: A Memoir of Love, Death, and Jazz Chickens, by Eddie Izzard. Izzard’s personal story is fascinating: he survived a difficult childhood and worked relentlessly to overcome his lack of natural talent and become an international star. If you’re a huge fan of him like I am, you’ll love this book. His written voice is very similar to his stage voice, and I found myself laughing out loud several times while reading it.

The Sympathizer, by Viet Thanh Nguyen. Most of the books I’ve read and movies I’ve seen about the Vietnam War focused on the American perspective. Nguyen’s award-winning novel offers much-needed insight into what it was like to be Vietnamese and caught between both sides. Despite how dark it is, The Sympathizer is a gripping story about a double agent and the trouble he gets himself into.

Energy and Civilization: A History, by Vaclav Smil. Smil is one of my favorite authors, and this is his masterpiece. He lays out how our need for energy has shaped human history—from the era of donkey-powered mills to today’s quest for renewable energy. It’s not the easiest book to read, but at the end you’ll feel smarter and better informed about how energy innovation alters the course of civilizations.


Apple is sharing your facial wireframe with apps


Poop that mimics your facial expressions was just the beginning.

It’s going to hit the fan when the face-mapping tech that powers the iPhone X’s cutesy “Animoji” starts being used for creepier purposes. And Apple just started sharing your face with lots of apps.

Beyond a photo, the iPhone X’s front sensors scan 30,000 points to make a 3D model of your face. That’s how the iPhone X unlocks and makes animations that might have once required a Hollywood studio.

Now that a phone can scan your mug, what else might apps want to do with it? They could track your expressions to judge if you’re depressed. They could guess your gender, race and even sexuality. They might combine your face with other data to observe you in stores—or walking down the street.

Apps aren’t doing most of these things, yet. But is Apple doing enough to stop it? After I pressed executives this week, Apple made at least one change—retroactively requiring an app tapping into face data to publish a privacy policy.

“We take privacy and security very seriously,” Apple spokesman Tom Neumayr said. “This commitment is reflected in the strong protections we have built around Face ID data—protecting it with the Secure Enclave in iPhone X—as well as many other technical safeguards we have built into iOS.”

Indeed, Apple—which makes most of its money from selling us hardware, not selling our data—may be our best defense against a coming explosion in facial recognition. But I also think Apple rushed into sharing face maps with app makers that may not share its commitment, and it isn’t being paranoid enough about the minefield it just entered.

“I think we should be quite worried,” said Jay Stanley, a senior policy analyst at the American Civil Liberties Union. “The chances we are going to see mischief around facial data is pretty high—if not today, then soon—if not on Apple then on Android.”

Your face is open for business

Apple’s face tech sets some good precedents—and some bad ones. It won praise for storing the face data it uses to unlock the iPhone X securely on the phone, instead of sending it to its servers over the Internet.

Less noticed was how the iPhone lets other apps now tap into two eerie views from the so-called TrueDepth camera. There’s a wireframe representation of your face and a live read-out of 52 unique micro-movements in your eyelids, mouth and other features. Apps can store that data on their own computers.

To see for yourself, use an iPhone X to download an app called MeasureKit. It exposes the face data Apple makes available. The app’s maker, Rinat Khanov, tells me he’s already planning to add a feature that lets you export a model of your face so you can 3D print a mini-me.

The Post's Geoffrey A. Fowler shows MotionKit, an app that shows users what facial data is being sent to other apps. (The Washington Post)

“Holy cow, why is this data available to any developer that just agrees to a bunch of contracts?” said Fatemeh Khatibloo, an analyst at Forrester Research.

Being careful is in Apple’s DNA—it has been slow in opening home and health data with outsiders. But it also views the face camera as a differentiator, helping position Apple as a leader in artificial intelligence and augmented reality.

Apple put some important limits on apps. It requires “that developers ask a user’s permission before accessing the camera, and that apps must explain how and where this data will be used,” Apple's Neumayr said.

And Apple’s rules say developers can’t sell face data, use it to identify anonymous people or use it for advertising. They’re also required to have privacy policies.

“These are all very positive steps,” said Clare Garvey, an associate at Georgetown University’s Center on Privacy & Technology.

Privacy holes

Still, it wasn’t hard for me to find holes in Apple’s protections.

The MeasureKit app’s maker told me he wasn’t sensing much extra scrutiny from Apple for accessing face data.

“There were no additional terms or contracts. The app review process is quite regular as well—or at least it appears to be, on our end,” Khanov said. When I noticed his app didn’t have a privacy policy, Khanov said Apple didn’t require it because he wasn’t taking face data off the phone.

After I asked Apple about this, it called Khanov and told him to post a privacy policy.

“They said they noticed a mistake and this should be fixed immediately,” Khanov said. “I wish Apple were more specific in their App Review Guidelines."

The bigger concern: “How realistic is it to expect Apple to adequately police this data?” Georgetown’s Garvey told me. Apple might spot violations from big apps like Facebook, but what about gazillions of smaller ones?

Apple hasn’t said how many apps it has kicked out of its store for privacy issues.

Then there’s a permission problem. Apps are supposed to make clear why they’re accessing your face and seek “conspicuous consent,” according to Apple’s policies. But when it comes time for you to tap OK, you get a pop-up that asks to “access the camera.” It doesn’t say, “HEY, I’M NOW GOING TO MAP YOUR EVERY TWITCH.”

The iPhone’s settings don’t differentiate between the back camera and all those front face-mapping sensors. Once you give it permission, an active app keeps on having access to your face until you delete it or dig into advanced settings. There’s no option that says, “Just for the next five minutes.”

Overwhelming people with notifications and choices is a concern, but the face seems like a sufficiently new and sensitive data source that it warrants special permission. Unlike a laptop webcam, it’s hard to put a privacy sticker over the front of the iPhone X—without a fingerprint reader, it’s the main mechanism to unlock the thing.

Android phones have had face-unlock features for years, but most haven’t offered 3D face mapping like the iPhone. Like iOS, Android doesn’t make a distinction between front and back cameras. Google’s Play Store doesn’t prohibit apps from using the face camera for marketing or building databases, so long as they ask permission.

The value of your face

Facial detection can, of course, be used for good and for bad. Warby Parker, the online glasses purveyor, uses it to fit frames to faces, and a Snapchat demo uses it to virtually paint on your face. Companies have touted face tech as a solution to distracted driving, or a way to detect pain in children who have trouble expressing how they’re feeling.

It’s not clear how Apple’s TrueDepth data might change the kinds of conclusions software can draw about people. But from years of covering tech, I’ve learned this much: Given the opportunity to be creepy, someone will take it.

Using artificial intelligence, face data “may tell an app developer an awful lot more than the human eye can see,” said Forrester’s Khatibloo. For example, she notes researchers recently used AI to more accurately determine people’s sexuality just from regular photographs. That study had limitations, but still “the tech is going to leapfrog way faster than consumers and regulators are going to realize,” said Khatibloo.

Our faces are already valuable. Half of all American adults have their images stored in at least one database that police can search, typically with few restrictions.

Facebook and Google use AI to identify faces in pictures we upload to their photo services. (They’re being sued in Illinois, one of the few states with laws that protect biometric data.) Facebook has a patent for delivering content based on emotion, and in 2016, Apple bought a startup called Emotient that specializes in detecting emotions.

Using regular cameras, companies such as Kairos make software to identify gender, ethnicity and age as well as the sentiment of people. In the last 12 months, Kairos said it has read 250 million faces for clients looking to improve commercials and products.

Apple’s iPhone X launch was “the primal scream of this new industry, because it democratized the idea that facial recognition exists and works,” said Kairos CEO Brian Brackeen. His company gets consent from volunteers whose faces it reads, and sometimes even pays them—but he said the field is wide open. “What rights do people have? Are they being somehow compensated for the valuable data they are sharing?” he said.

What keeps privacy advocates up at night is that the iPhone X will make face scanning seem normal. Will makers of other phones, security cameras or drones be as careful as Apple? We don’t want to build a future where we become numb to a form of surveillance that goes far beyond anything we’ve known before.

You’ve only got one face, so we’d better not screw this up.

Read more about the iPhone X: 

The iPhone X-factor: Don’t buy a phone you don’t need

What happens if a cop forces you to unlock your iPhone X with your face?

If you want an iPhone X for the holidays, start planning now

Using Rust in Mercurial


This page describes the plan and status for leveraging the Rust programming language in Mercurial.

Why use Rust?

Today, Mercurial is a Python application. It uses Python C extensions in various places to achieve better performance.

There are many advantages to being a Python application. But, there are significant disadvantages.

Performance is a significant pain point with Python. There are multiple facets to the performance problem:

  • Startup overhead
  • General performance overhead compared to *native* code
  • GIL interfering with parallel execution

It takes several dozen milliseconds to start a Python interpreter and load the Mercurial Python modules. If you have many extensions loaded, it could take well over 100ms just to effectively get to a Mercurial command's main function. Reports of over 250ms are known. While the command itself may complete in mere milliseconds, Python overhead has already made hg seem non-instantaneous to end-users.

A few years ago, we measured that CPython interpreter startup overhead amounted to 10-18% of the run time of Mercurial's test harness. 100ms may not sound like a lot. But it is enough to give the perception that Mercurial is slower than tools like Git (which can run commands in under 10ms).

There are also situations like querying hg for shell prompts that require near-instantaneous execution.

Mercurial is also heavily scripted by tools like IDEs. We want these tools to provide results near instantaneously. If people are waiting over 100ms for results from hg, it makes these other tools feel sluggish.

There are workarounds for startup overhead problems: the CommandServer (start a persistent process and issue multiple commands to it) and chg (a C binary that speaks with a Mercurial command server and enables hg commands to execute without Python startup overhead). chg's very existence is because we need hg to be a native binary in order to avoid Python startup overhead. If hg weren't a Python script, we wouldn't need chg to be a separate program.

Python is also substantially slower than native code. PyPy can deliver substantially better performance than CPython. And some workloads with PyPy might even be faster than native code due to JIT. But overall, Python is slower than native code.

But even with PyPy's magical performance, we still have the GIL. Python doesn't allow you to execute CPU-bound Python code on multiple threads. If you are CPU bound, you need to offload that work to an extension (which releases the GIL when it executes hot code) or you spawn multiple processes. Since Mercurial needs to run on Windows (where new process overhead is ~10x worse than POSIX and is a platform optimized for spawning threads - not processes), many of the potential speedups we can realize via concurrency are offset on Windows by new process overhead and Python startup overhead. We need thread-level concurrency on Windows to help with shorter-lived CPU-bound workloads. This includes things like revlog reading (which happens on nearly every Mercurial operation).

In addition to performance concerns, Python is also hindering us because it is a dynamic programming language. Mercurial is a large project by Python standards. Large projects are harder to maintain. Using a statically typed programming language that finds bugs at compile time will enable us to make wide-sweeping changes more fearlessly. This will improve Mercurial's development velocity.

Today, when performance is an issue, Mercurial developers currently turn to C. But we treat C as a measure of last resort because it is just too brittle. It is too easy to introduce security vulnerabilities, memory leaks, etc. On top of vanilla C, the Python C API is somewhat complicated. It takes significantly longer to develop C components because the barrier to writing bug-free C is much higher.

Furthermore, Mercurial needs to run on multiple platforms, including Windows. The nice things we want to do in native code are complicated to implement in C because cross-platform C is hard. The standard library is inadequate compared to modern languages. While modern versions of C++ are nice, we still support Python 2.7 and thus need to build with MSVC 2008 on Windows. It doesn't have any of the nice features that modern versions of C++ have. Things like introducing a thread pool in our current C code would be hard. But with Rust, that support is in the standard library and "just works." Having Rust's standard library is a pretty compelling advantage over C/C++ for any project, not just Mercurial.

For Mercurial, Rust is all around a better C. It is much safer, about the same speed, and has a usable standard library and modules system for easily pulling in 3rd party code.

Desired End State

  • hg is a Rust binary that embeds and uses a Python interpreter when appropriate (hg is a Python script today)

  • Python code seamlessly calls out to functionality implemented in Rust
  • Fully self-contained Mercurial distributions are available (Python is an implementation detail / Mercurial sufficiently independent from other Python presence on system)
  • The customizability and extensibility of Mercurial through extensions is not significantly weakened.
  • chg functionality is rolled into hg

Current Status

(last updated December 4 2017)

Priorities for Oxidation

All existing C code is a priority for oxidation because we don't like maintaining C code for safety and compatibility reasons. Existing C code includes:

In addition, the following would be good candidates for oxidation:

  • All revlog I/O (reading is more important than writing)
  • Working directory I/O (extracting content from revlogs/store and writing to filesystem)
  • bundle2 reading and writing
  • changelog reading
  • revsets
  • All filesystem I/O (allows us to use Windows APIs and properly handle filenames on Windows)

Problems

CRT Mismatch on Windows

Mercurial still uses Python 2.7. Python 2.7 is officially compiled with MSVC 2008 and links against msvcr90.dll. Rust and its standard library don't support MSVC 2008. They are likely linked with something newer, like MSVC 2015 or 2017.

If we want compatibility with other binary Python extensions, we need to use a Python built with MSVC 2008 and linked against msvcr90.dll.

So, our options are:

  1. Build a custom Python 2.7 distribution with modern MSVC and drop support for 3rd party binary Python 2.7 extensions.
  2. Switch Mercurial to Python 3 and build Rust code with same toolchain as Python we target.
  3. Mix the CRTs.

#1 significantly undermines Mercurial's extensibility. Plus, Python 2.7 built with anything other than MSVC 2008 isn't officially supported.

#2 is in progress. However, the timeline for officially supporting Python 3 to the point where we can transition the official distribution for it is likely too far out (2H 2018) and would hinder Rust adoption efforts.

That leaves mixing the CRTs. This would work by having the Rust components statically link a modern CRT while having Python dynamically load msvcr90.dll.
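
Concretely, statically linking the CRT into the Rust pieces is just a build flag. A minimal sketch, assuming the 64-bit MSVC target and a Cargo-based build:

# .cargo/config -- statically link the MSVC C runtime into the Rust components
[target.x86_64-pc-windows-msvc]
rustflags = ["-C", "target-feature=+crt-static"]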

Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you malloc() in CRT A and free() in CRT B. Or attempt to operate on FILE instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries. See also https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/crt-alphabetical-function-reference for a full list of CRT functions.

Fortunately, our exposure to the multiple CRT problem is significantly reduced because:

  • Rust and its standard library don't make heavy use of CRT primitives.
  • Memory managed by Rust and Python is already being kept separate by the Python API. In Rust speak, we won't be transferring ownership of raw pointers between Rust and Python. Python's refcounting mechanism ensures all PyObjects are destroyed by Python. The only time ownership of memory crosses the bridge is when we create something in Rust and pass it to Python. But that object will be a PyObject, and its backing memory will have been managed with the Python APIs.

  • We shouldn't be using FILE anywhere. And I/O on an open file descriptor would likely be limited to its created context. e.g. if we open a file from Rust, we're likely not reading it from Python.

We would have to keep a close eye out for CRT objects spanning multiple CRTs. We can mitigate exposure for bad patterns by establishing static analysis rules on source code. We can also examine the produced Rust binaries for symbol references and raise warnings when unwanted CRT functions are used by Rust code.
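
A crude version of that binary check (a sketch, not an agreed-upon tool) is to dump the produced binary's dependencies and imports with MSVC's dumpbin and flag CRT DLLs or object-passing functions we want to keep off the boundary:

rem Which DLLs (and therefore which CRTs) does the binary link against?
dumpbin /DEPENDENTS hg.exe

rem Flag imports of CRT functions whose objects must not cross the boundary
dumpbin /IMPORTS hg.exe | findstr /i "fopen fclose setlocale strtok"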

Rust Support

Mercurial relies on other entities (like Linux distros) to package and distribute Mercurial. This means we have to consider their support for packaging programs that use Rust or else we risk losing packagers. This means we need to consider:

  • The minimum version of Rust to require
  • Whether we can use beta or nightly Rust features

For official Mercurial distributions, these considerations don't exist, as we'll be giving a binary to end-users. So this topic is all about our relationship with downstream packagers.

Packaging Overhaul Needed

If hg becomes a Rust binary and we want Mercurial to be a self-contained application, we'll need to overhaul our packaging mechanisms on all operating systems.

Distributing Python

Mercurial would need to distribute a copy of Python.

Python insists that embedded Python load a pythonXX shared library. e.g. python27.dll or libpython27.so.

We would also need to distribute a copy of the Python standard library (.py, .pyc, etc files). These could be distributed in flat form (hundreds of .py files) or in a zip file. (Python supports importing modules from zip files.) If we wanted to get creative, we could invent our own archive format / module loading mechanism (but this feels like unnecessary work).

We can't prune the Python standard library of unused modules because Mercurial extensions may make use of any feature in the standard library. So we'll be distributing the entire Python standard library.

But the distribution of Python is not required: various packagers (like operating systems) would want Mercurial to use a Python provided to it. So our Rust hg needs to support loading a bundled Python and a Python provided to it. This can likely be controlled with build-time flags.

Windows

Mercurial could conceptually be distributed as a .zip file. That archive would contain pre-built hg.exe, pythonXX.dll, any other shared library dependencies, a copy of the Python standard library, Mercurial Python files, and any support files.

Because zip files aren't user friendly, we'd likely provide a standalone .exe or .msi installer (like we do today).

Linux

We could provide a self-contained archive file containing the hg binary, libpython27.so, and any other dependencies. We could also provide rpm, deb, etc. packages for popular distributions. These would be self-contained and not dependent on many (any?) other packages. Our biggest concern here is libc compatibility. That can be solved by static linking, compiling against a sufficiently old (and compatible) libc, or providing distro-specific packages.

Of course, many distros will want to provide their own Mercurial package. And they will likely want Mercurial to make use of the system Python. We can and must support this.

An issue with a self-contained distribution is loading of shared libraries. Not all operating systems and loaders may support loading of binary-relative shared libraries. We may need to hack something together that uses dlopen() to explicitly specify which libpython27.so, etc to load.

MacOS

This is very similar to Linux. We may support the native application / installer mechanism to make things more user friendly. We don't have good support for this today. So it is likely most users will rely on Homebrew or MacPorts for installation.

BSDs / Solaris / Etc

Basically the same strategy as Linux.

PyPI / pip

We support installing Mercurial via pip today. We upload a source distribution to PyPI and anyone can pip install Mercurial to install Mercurial in their Python environment. On Windows (where users can't easily compile binary Python extensions), we provide Python wheels with pre-built Mercurial binaries.

The future of pip install Mercurial with an oxidized Mercurial is less clear.

pip is tailored towards Python applications. If Mercurial is a Rust application and Python is an implementation detail, does it make sense to use pip and PyPI as a distribution channel?

pip install Mercurial is very convenient (at least for the people that have pip installed and can run it). It is certainly easier than downloading and running an installer. So unless we bake an upgrade facility into Mercurial itself, pip install Mercurial is the next best thing for upgrading after the system package manager (apt, yum, brew, port, etc).

pip install Mercurial goes through a well-defined mechanism to take the artifact it downloaded from PyPI to install it. This mechanism could be abused to facilitate the use of PyPI/pip for distributing a self-contained Mercurial distribution. e.g. the user would end up with a Rust binary in PYTHONHOME/bin/hg that loads a custom version of Python and is fully self-contained and isolated from the Python it was pip installed into. This would be super hacky. It may not even be allowed by PyPI's hosting terms of service? But we could certainly abuse pip install if we needed to.

Support for PyPy / non-CPython Pythons

There exist Python distributions beyond the official CPython distribution. PyPy likely being the one of the most interest to us because of its performance advantages.

The cost to supporting non-CPython Pythons when hg is a Rust binary could be very high. That would likely significantly curtail the use of the CPython API. Instead, we'd have to do interop via ctypes or cffi or provide N ways to do interop.

It's worth noting that if Mercurial is a self-contained application, we could potentially swap out CPython for PyPy. We could go as far as to unsupport CPython completely.

Rust <=> Python Interop

Rust and Python code will need to call into each other. (Although it is anticipated that the bulk of the calling will be from Python into Rust code - at least initially.)

There are many options for us here.

python27-sys and python3-sys are low-level Rust bindings to the CPython API. Lots of unsafe {} code here.

rust-cpython and PyO3 are higher-level bindings to python27-sys and python3-sys. They are what you want to use for day-to-day Rust programming.

PyO3 is a fork of rust-cpython. It seems to be a bit nicer. But it requires Nightly Rust features.

Milksnake uses Rust's cbindgen crate to automatically generate Python cffi bindings to Rust libraries. Essentially, you write a Rust library that exports symbols and milksnake can generate a Python binding to it. There's a lot going on. But it is definitely an interesting approach. And some of the components are useful without the rest of milksnake. e.g. the idea of using cbindgen + cffi to generate low-level Python bindings. Because Milksnake uses cffi, the approach should work with both CPython and PyPy.

A major reason for adopting Rust (and C before that) is performance. We know from Mercurial's C extensions that the benefit of native code is often vastly undermined by a) crossing the Python<->native boundary and b) excessive use of the Python API from native code. For example, obsolescence marker parsing is ~100x faster in C. However, once you construct PyObjects for all the parsed markers, it is only 2-4x faster.

We know that using ctypes to call from Python into native code is significantly slower than binary Python extensions. Although if the number of function calls and data being transferred across the boundary is small, this difference isn't as pronounced. Rust will enable us to write more functionality in native code (we try to avoid writing C today for maintainability and security reasons). So the performance of the Python<->native bridge will be more important over time. Therefore, it seems prudent to rule out ctypes. That leaves us with extensions or CFFI.

Reconciling `hg` with Rust extensions

Initially, hg will be a minimal Rust binary that embeds a Python interpreter. It simply tells the interpreter to invoke Mercurial's main() function. In this world, other Rust functionality is likely loaded via shared libraries or Python extensions. In other words, we have multiple Rust contexts running from different binaries (an executable and a shared library). The executable handles very early process activity. The shared library handles business logic.
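For illustration, here is a minimal sketch of what such a launcher amounts to, written against the CPython embedding API in C rather than Rust (the mercurial.dispatch entry point name is an assumption):

#include <Python.h>

int main(int argc, char *argv[])
{
        int ret;

        Py_Initialize();
        PySys_SetArgv(argc, argv);   /* expose the command line as sys.argv */

        /* Hand control to Mercurial's Python entry point (assumed name). */
        ret = PyRun_SimpleString("from mercurial import dispatch\n"
                                 "dispatch.run()\n");

        Py_Finalize();
        return ret != 0;
}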

Over time, we'll likely want to expand the role of Rust for early process activity. For example, we'll need to implement some command line processing in Rust for chg functionality. We may also want to implement config file loading (we need to rewrite the config parser anyway to facilitate writing back config changes). And, if we could load a repo from disk and maybe even implement performance critical commands (like hg status) from pure Rust, this would likely be a massive performance win. (Although we have to consider how this will interact with extensibility.)

What this means is that we'll have multiple Rust binaries holding Mercurial state. This feels brittle. Ideally we'd have a single Rust binary. If Python needed to call into native/Rust code, it would get those symbols from the parent hg binary instead of from a shared library. It is unclear how this would work. It is obviously possible to resolve the address of a symbol in the current binary. But existing "call native code" mechanisms in Python seem to assume that symbols are coming from loaded libraries, not the current executable. This may require modifications to cffi or some custom code to generate the Python bindings to executable-local symbols.
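One plausible building block (a sketch, not how Mercurial works today): dlopen(NULL, ...) returns a handle for the main program, and dlsym() can then look up symbols the executable exports, which typically requires linking with -rdynamic. The hg_parse_config symbol below is made up:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
        /* A NULL path gives a handle for the main executable itself. */
        void *self = dlopen(NULL, RTLD_NOW);
        void (*hg_parse_config)(const char *);

        if (self == NULL) {
                fprintf(stderr, "%s\n", dlerror());
                return 1;
        }

        /* hg_parse_config is a hypothetical symbol exported by hg
           (requires the executable to export its symbols, e.g. -rdynamic). */
        *(void **)&hg_parse_config = dlsym(self, "hg_parse_config");
        if (hg_parse_config != NULL)
                hg_parse_config("~/.hgrc");
        return 0;
}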

Preserving Support for Extensions

Mercurial implemented in Python is good for extensibility because it means extensions can customize nearly every part of Mercurial - often via monkeypatching.

As we use more Rust, we no longer have the dynamic nature of Python and extensions will lose some of their power.

As more of hg is implemented in Rust before any Python is called, we could lose the ability for Python extensions to influence low-level and core operations. e.g. if we want to implement hg status such that it doesn't invoke Python and incur Python startup overhead, how do we enable extensions to still influence the behavior of hg status?

Presumably, hg will eventually implement config file loading and command line processing. So, Rust will be able to see which extensions are being loaded. Assuming hg can resolve the paths to loaded extensions, we could add a syntax to the main extensions file to declare their influence on various behavior. For example, if an extension influences behavior of hg status, its source code could contain something like: # hgext-influences: cmd-status. Rust would see this special syntax and know it needs to instantiate a Python interpreter in order to load the extension for the current hg status command. We could also imagine doing something similar for other functionality implemented in Rust, such as the core store interface.



Integrating “safe” languages into OpenBSD?

List:       openbsd-misc
Subject:    Re: Integrating "safe" languages into OpenBSD?
From:       "Theo de Raadt" <deraadt () openbsd ! org>
Date:       2017-12-03 20:37:07
Message-ID: 81638.1512333427 () cvs ! openbsd ! org

> As a response to this, Theo asked rhetorically "Where's ls, where's cat,
> where's grep, and where's sort?", implying that noone so far bothered to
> write implementations of even the basic unix utilities in such a
> language.

I wasn't implying.  I was stating a fact.  There has been no attempt
to move the smallest parts of the ecosystem, to provide replacements
for base POSIX utilities.

As a general trend the only things being written in these new
languages are new web-facing applications, quite often proprietary or
customized to narrow roles.  Not Unix parts.

Right now, there are zero usage cases in the source tree to require
those compiler tools.  We won't put a horse into the source tree when
society lacks cart builders.

> This brings me to the question, what if someone actually bothered?

So rather than bothering to begin, you wrote an email.

Awesome.

Yes, now I am implying something: you won't bother to rewrite the
utilities.

And I understand, why would anyone bother?  It took about 10 years for
gnu grep to be replaced sufficiently well in our tree.  This stuff
doesn't happen overnight.

However there is a rampant fiction that if you supply a new safer
method everyone will use it.  For gods sake, the simplest of concepts
like the stack protector took nearly 10 years for adoption, yet people
should switch languages?  DELUSION.

> Under what conditions would you consider replacing one of the
> current C implementations with an implementation written in another,
> "safer" language?

In OpenBSD there is a strict requirement that base builds base.

So we cannot replace any base utility, unless the toolchain to build
it is in the base.  Adding such a toolchain would take make build time
from 40 minutes to hours.  I don't see how that would happen.

> Note that with Cgrep and haskell-ls, there do in fact exist
> implementations/analogues of two of the mentioned utilities in a
> memory safe language (Haskell).

Are they POSIX compliant?  No.  They are completely different programs
that have borrowed the names.

By the way, this is how long it takes to compile our grep:

    0m00.62s real     0m00.63s user     0m00.53s system

Does Cgrep compile in less than 10 minutes?

Such ecosystems come with incredible costs.  For instance, rust cannot
even compile itself on i386 at present time because it exhausts the
address space.

Consider me a skeptic -- I think these compiler ecosystems face a grim
bloaty future.


A Mother’s Ninth-Century Manual on How to Be a Man


Albert Edelfelt, Queen Blanche of Norway and Sweden with Prince (later King) Hacon, 1877.

Being a red-blooded, blue-blooded male in the Carolingian Empire was a risky business. Those who grew up in Western Europe during the eighth and ninth centuries were frequently exposed to extreme violence. One adolescent royal from the period was struck so hard in a play fight that, in the words of a contemporary account, his playmate’s sword “penetrated almost as far as the brain, reaching from his left temple to his right cheekbone.”

The only thing the Carolingians valued as much as ruthlessness on the battlefield was proficiency with Biblical text. William of Septimania appears to have had a thorough education in both. He was barely in his twenties when he seized control of Barcelona in 848, but he had already spent four years warring against the crown. The city had been the old stomping ground of his father, Bernard. Bernard was an important figure in the reign of Louis the Pious, the Carolingian emperor who ruled a great swathe of territory from what is now northern Spain to the Czech Republic. But in recent times Bernard had endured a spectacular fall, toppled by intrigue and machination that ended in his death and devastated his family. When still in his teens, William became determined to win the battles his father couldn’t. He joined a rebellion against the ruling dynasty that had once been as close as kin. 

It was an audacious act, and in the long run it was destined to fail, but it was consistent with the moral education he’d received since childhood. His mother, Dhuoda, had drilled into him that there was only one true measure of nobility: “in every matter, be obedient to the interests of your father.”

She wrote those words of maternal wisdom in Liber Manualis, a handbook on how to be a nobleman that she composed for William when he was a teenager, and which she hoped would guide him through his adult life. Across eleven chapters, Dhuoda’s book outlines the subjects that should most concern a man of high birth, such as how to pray and read the Bible; how to distinguish vice from virtue; how best to honor his parents; how to serve God and the Crown; how to handle illness, affliction, and hardship. The work belongs to the tradition of “mirrors for princes,” an ancient literary genre that also proliferated during the Middle Ages. But, Dhuoda’s mirror, the only extant written work by a European woman from the ninth century, is one of a kind: not, as most were, a cleric’s tutorial but a mother’s gift of loving guidance through an uncertain future, with the thoughts, feelings, and personality of its author running through it. Like the Alfred Jewel, the Cross of Lothair, and so many of the most beautiful creations of early medieval Europe, the Liber Manualis beguiles with its intimacy and exquisite intricacy, a glittering portal to a culture that can seem entirely alien from our own.

*

Almost everything we know about Dhuoda comes from the Liber Manualis. According to her own account, she married Bernard on June 29, 824, at the Palace of Aachen, which was the center of power in the Carolingian Empire. She might’ve been as young as fifteen, she was certainly no older than twenty, and in all likelihood she saw in front of her a life of wealth and prominence as the wife of a noble servant of Europe’s most powerful ruling dynasty in centuries. Five years after the wedding, Bernard was rewarded for his loyal service with the positions of chamberlain at the imperial court and mentor to Charles, the emperor’s six year-old son by his new young wife, Judith. Judging by the lavish paeans written about her, Judith was adored by just about everyone save for her three adult stepsons who saw her and her little boy as a grave threat to their inheritances. When they heard that Charles was to be awarded lands that had formerly been given to them, the brothers struck out against their father and their stepmother, and plunged the empire into a decade of destructive factionalism and civil war.

Dhuoda’s young family was caught in the cross fire. Bernard was accused of sorcery and sexual impropriety with Judith, and fled to Barcelona. Throughout the 830s his relatives were preyed upon. One of his brothers was blinded; another was beheaded. His sister, a nun, was captured and drowned, ostensibly for being a witch. In 840, the emperor died, but it only exacerbated the conflict between his sons. Now aged sixteen, Charles asked Bernard to join forces with him. Bernard demurred, perhaps concerned that backing such a young pretender was too risky a gamble. He would soon regret his lack of conviction; the boy proved astoundingly resolute.

In 841, Charles recorded a surprise victory over his eldest sibling, which elevated him to a position of great strength. Scrambling to get back in Charles’s good books, Bernard made an offering: the fealty of his only child, William, who was sent to live and serve at Charles’s court. Such arrangements weren’t uncommon among the Frankish nobility, where adolescents were often sent to reside in a patron’s household as a sort of apprenticeship in noble living. But, this was not a usual situation, and Dhuoda was clearly concerned to see her beloved first born drawn into the “worsening turmoil of this wretched world,” as she described the events of her time. Thinking of his well-being in this life and the next, she set out to offer comfort and guidance in the only way she could.

Dhuoda began work on the handbook at her home in Uzès, on November 30, 841, the day after William’s fifteenth birthday. At the time, the world must’ve seemed a precarious and unmalleable place. While Bernard battled his enemies, the empire continued to be riven by a civil war, made all the more alarming by a succession of Viking raids from the north and Moorish incursions from the south. Just a few months before William was sent to Charles, Dhuoda’s newborn second child had been taken to live with his father a couple of hundred miles away in Aquitaine. At the moment of separation, she didn’t even know the baby’s name; it was Bernard’s privilege to decide that. Family life had slipped from her grasp. Composing this book of instruction for William was a means of exerting some control, setting down on the page her most valued truths and projecting a sense of order on a world that seemed bereft of it. “I am somewhat ill at ease,” she writes in her introduction, “and eager to be useful to you … Even though I am absent in body, this little book will be present.”

Responding to crises in this literary way was a thoroughly Carolingian trait. At the end of the previous century, Emperor Charlemagne fostered the so-called Carolingian Renaissance, a period of intense intellectual and cultural activity, and a rediscovery of classical learning. All manner of disciplines, from architecture to jurisprudence to metalworking, were patronized by the imperial court with the aim of sparking spiritual and moral reform across the continent and revivifying the vanished civilization of Rome. Central to these efforts was the preservation of knowledge through the written word. The Carolingians transcribed thousands of ancient papyrus texts to more robust parchment, safeguarding vital works that would otherwise have been lost. According to the scholars Costambeys, Innes, and MacLean, only eighteen hundred manuscripts survive from pre-800 continental Western Europe, yet, as a result of the Carolingians’ commitment, we have nine thousand manuscripts from the ninth century. The irrepressible power of books is a dominant theme of the Liber Manualis, and it is crammed with literary references, especially those from works of theology and philosophy, and, of course, Scripture. Like many of her contemporaries, and unlike many of ours, Dhuoda didn’t regard reading as escape from the “real” world but as a purposeful, pious deed, and the first step on the road to righteous action. In the opening section, she beseeches William to “willingly grasp [the book] in your own hand, and enfolding it, turning it over, and reading it, and studying it, you’ll strive to fulfill its teachings,” as though the mere sensation of its cover on his skin would help him act bravely and wisely.

The Liber Manualis does everything mirrors for princes are meant to do. It counsels and consoles its reader, and abases its author, stressing her unworthiness before God and her march toward death. But it is also something more than a dry, serious work of moral instruction: Dhuoda refuses to remove herself from the text. Barely a page goes by without a glimpse of the author somewhere—in her love of puns, acrostics, and numerology; her predilection for tossing in biographical details; the curious extended metaphors; her aside about how difficult she finds writing. When those moments hit, Liber Manualis doesn’t read like a book about how to be a nobleman but a book about what it’s like to be Dhuoda. Working within the conventions of the mirrors, she finds a way to create a self-portrait—not to indulge her ego, but to provide William with something that will keep her alive in his mind, a keepsake as intimate and evocative as a lock of hair. She all but admits as much herself:

“Dhuoda is always here to exhort you, my son, but in anticipation of the day when I shall no longer be with you, you have here as a memento of me this little book … You will have learned doctors to teach you many more examples, more eminent and of greater usefulness, but they are not of equal status with me, nor do they have a heart more ardent than I, your mother, have for you, my firstborn son!”

Some scholars go further and suggest that Dhuoda made herself so visible in the text as a rebuke to Bernard, whose political mistakes and alleged infidelities put the family at risk. According to M. A. Claussen, the message of the Liber Manualis is that “Bernard is a loser;” if William wants to survive the chaos, he’d better steer clear of his biological father’s example and follow that of the person who’s been doing Bernard’s job for him for the last fifteen years: Dhuoda.

*

On February 2, 843, the Liber Manualis was complete. Dhuoda may have feared, when she sent it to William, that this would be her last meaningful communication with him. “Despite the many cares that consume me,” she writes, “this anxiety is foremost in God’s established design—that I see you one day with my own eyes.” She urges him to share the book with his little brother when he is old enough to read. This may have been empty rhetoric, or else a sign of how bleakly she saw events in the outside world. There’s no record of when William first cast his eyes over the book. It’s possible he had it by the time Charles and his brothers finally came to a truce in August 843. They sealed the deal with the Treaty of Verdun, often referred to as “Europe’s birth certificate,” which formally divided the empire in three, establishing the boundaries of modern France and Germany.

The end of the dynastic strife would’ve done nothing to ease Dhuoda’s anxieties. In the spring of the following year, Bernard was captured by Charles and executed for treason. William might have witnessed the grisly event. With Dhuoda’s command to honor his father in his ears, that summer William joined a revolt against Charles’s rule. He now moved to claim what he considered his birthright, the title his father had once held: Count of Barcelona. A chronicler recorded that when William took control of the city in 848, he did so “by guile rather than by force.” But it would require more than cunning to keep his station. Within two years, Charles had him on the run. By 850, William was dead, killed by Charles’s supporters.

Exactly eleven hundred years later the historian André Vernet discovered a copy of the Liber Manualis in Barcelona. It wasn't the original, but it may well have derived from it, and therefore be evidence that William kept the handbook close to him. It's unknown when Dhuoda passed away, but it's unlikely that she saw William again before his death. There's a small chance that she lived long enough to see her younger son, a man best known to historians as Bernard "Hairy Paws," who also fought against Charles, become the Count of Auvergne and the Margrave of Aquitaine.

The turbulent tale of Dhuoda’s family caught the medieval imagination. The rumored affair between her husband and Empress Judith became the basis of The Erl of Toulouse, a famed fourteenth-century English chivalric romance. To the modern mind, however, always in search of a chink of insight into the interior life of the individual, it’s Dhuoda’s book that captivates. She was an avid reader, and a born writer, thinking and scribbling for her audience of one.

Edward White is the author of The Tastemaker: Carl Van Vechten and the Birth of Modern America.

Bat cave solves mystery of SARS virus


Researchers analysed strains of SARS virus circulating in horseshoe bats, such as this one (Rhinolophus sinicus), in a cave in Yunnan province, China. Credit: Libiao Zhang/Guangdong Institute of Applied Biological Resource

After a detective hunt across China, researchers chasing the origin of the deadly SARS virus have finally found their smoking gun. In a remote cave in Yunnan province, virologists have identified a single population of horseshoe bats that harbours virus strains with all the genetic building blocks of the one that jumped to humans in 2002, killing almost 800 people around the world. 

The killer strain could easily have arisen from such a bat population, the researchers report in PLoS Pathogens1 on 30 November. They warn that the ingredients are in place for a similar disease to emerge again. 

In late 2002, cases of a mystery pneumonia-like illness began occurring in Guangdong province, southeastern China. The disease, dubbed severe acute respiratory syndrome (SARS), triggered a global emergency as it spread around the world in 2003, infecting thousands of people. 

Scientists identified the culprit as a strain of coronavirus and found genetically similar viruses in masked palm civets (Paguma larvata) sold in Guangdong’s animal markets. Later surveys revealed large numbers of SARS-related coronaviruses circulating in China’s horseshoe bats (Rhinolophus)2— suggesting that the deadly strain probably originated in the bats, and later passed through civets before reaching humans. But crucial genes — for a protein that allows the virus to latch onto and infect cells — were different in the human and known bat versions of the virus, leaving room for doubt about this hypothesis. 

Bat hunt

To clinch the case, a team led by Shi Zheng-Li and Cui Jie of the Wuhan Institute of Virology in China sampled thousands of horseshoe bats in locations across the country3. “The most challenging work is to locate the caves, which usually are in remote areas,” says Cui. After finding a particular cave in Yunnan, southwestern China, in which the strains of coronavirus looked similar to human versions4,5, the researchers spent five years monitoring the bats that lived there, collecting fresh guano and taking anal swabs1.  

They sequenced the genomes of 15 viral strains from the bats and found that, taken together, the strains contain all the genetic pieces that make up the human version. Although no single bat had the exact strain of SARS coronavirus that is found in humans, the analysis showed that the strains mix often. The human strain could have emerged from such mixing, says Kwok-Yung Yuen, a virologist at the University of Hong Kong who co-discovered the SARS virus: “The authors should be congratulated for confirming what has been suspected.” 

But Changchun Tu, a virologist who directs the OIE Reference Laboratory for Rabies in Changchun, China, says the results are only “99%” persuasive. He would like to see scientists demonstrate in the lab that the human SARS strain can jump from bats to another animal, such as a civet. "If this could have been done, the evidence would be perfect,” he says. 

Travel trouble

Another outstanding question is how a virus from bats in Yunnan could travel to animals and humans around 1,000 kilometres away in Guangdong, without causing any suspected cases in Yunnan itself. That “has puzzled me a long time”, says Tu.

Cui and Shi are searching for other bat populations that could have produced strains capable of infecting humans. The researchers have now isolated some 300 bat coronavirus sequences, most not yet published, with which they will continue to monitor the virus’s evolution. 

And they warn that a deadly outbreak could emerge again: the cave where the elements of SARS were found is just 1 kilometre from the nearest village, and genetic mixing among the viral strains is fast. “The risk of spillover into people and emergence of a disease similar to SARS is possible,” the authors write in their paper.

Although many markets selling animals in China have already been closed or restricted following outbreaks of SARS and other infectious diseases, Yuen agrees that the latest results suggest the risk is still present. “It reinforces the notion that we should not disturb wildlife habitats and never put wild animals into markets,” says Yuen. Respecting nature, he argues, “is the way to stay away from the harm of emerging infections”.   


How We’re Designing Channels


This post is part of a series on how we’re making Channels, the thinking behind the product, and insight into the process. We’ve got smart people working on smart solutions, and continue looking to the community and alpha/beta testers as we iterate toward launch. Read “Why Channels”, the first post in the series for more background info.

Working on Channels has been like any remodeling project: You start out excited, then you pull down the wood paneling and suddenly realize you’ve got rewiring to do — solving one issue leads to even more. 

Discovering the problems

The team started by interviewing and observing developers to more deeply understand their needs, the benefits and shortcomings of the tools they're using for knowledge management, and their hopes for what Channels on Stack Overflow could be.

This research helped us center our approach around three main principles:

  • Channels provides a private & secure space for your team to store and share institutional knowledge,
  • Channels is a feature that exists right on Stack Overflow, a familiar place where developers know the systems and already go to ask and answer programming questions,
  • And deep integrations & notifications are essential to make sure that the right question gets in front of the right person, especially on a smaller community or team.

Next, we went broad by analyzing other products and prototyping many design approaches before converging on any solution. We pulled in designers from different product teams to quickly explore a wide range of solutions.

This is where we began to discover that the new tile might not match the countertops.

Here’s what we’ve learned so far

Needs in small, private spaces are very different than large, public ones

There are nuances to how people will use their private, work channel that differ from the public setting. Connections and relationships already exist in the physical world prior to the addition of a private Channel. Work problems are specific to coworkers who are domain experts. In public, anyone across the globe with the right knowledge can answer general programming questions. 

Not all features and rules created for public Stack Overflow are necessary in a smaller, private environment, and new features are needed that don’t yet exist. For instance on public Stack Overflow, you can’t mention anyone unless they’ve contributed in some way to the question (otherwise Jon Skeet would get pinged any time anyone had a C# question.) But in a private channel the need is different — “I want to make sure Jenny sees this question because I think she’s probably the only person in the company who can answer it well.”

[Image: comment]

Also, users need to be able to quickly identify what is private. There are updates to the UI that are necessary to ensure there's no possibility of posting private info into the public space.

[Image: ask-question-prompt]

Adding channels requires big changes to SO’s Information Architecture

A house with more rooms needs more doors. In order to help teams easily find, share, and store institutional knowledge, we need to make our search interface and navigation (finding things by clicking around) be more intuitive.

Putting navigation in the right place

Adding channels into the current navigation presents challenges that didn’t previously exist, and can only be solved by rethinking the underlying site structure. 

[Image: nav-sketches]

The team created several prototype navigations, narrowed to two, and tested with a group of users. This surfaced a few key issues:

  • Scalability: Horizontal navigation lacks the space for additional elements — especially multiple channels per user.
  • Gestalt: It’s confusing to place Channel navigation at the page top, where information is already dense. The hierarchy of information must be balanced and easy to comprehend.
  • Persistence / Consistency: Because navigation creates a mental map, any design iteration that didn’t expose the existence of the channel at a top nav level created confusion.

We need a system that solves these problems well regardless of whether you’re trying to find info to help onboard a new developer to your team, or just trying to figure out how to vertically align text in CSS.

While we haven’t solved all the edge cases, the design discovery gave us a clear direction to test during the alpha phase.

One search for all Channels

A key path to finding content on Stack Overflow (searching for a question on Google, then clicking on a Stack Overflow Question) doesn’t work for private Q&A because Google can’t index it. That means Channels users will often need to use SO’s search when they’re looking for content that might be private… and our search has been showing its age for some time now. If our goal is to quickly help users find the most relevant results, then we need a better search experience that considers the searcher’s intent and improves the quality of the results (but I’ll let an engineer speak to those changes in a future post.)

New integrations and notifications

Because Stack Overflow gets 10 million visits per day, getting questions answered hasn’t been a problem we’ve faced for a long time. The mechanics in smaller communities are different. With Channels, we had to think through how smaller teams interact and the workflows they’re already using in order to make sure the right people see the right questions.

Our primary learnings showed that we need to integrate with the tools teams already use (such as Slack) and enable users to create custom email notifications. We’re planning and building several integrations.

Balancing getting users in and iterating toward a vision

Much of design is iteratively gaining a deeper understanding of the problems we need to be solving. Achieving a higher fidelity of understanding means repeating the process of creating and testing over and over again. Throughout the design process there’s a healthy tension between what we can build today and ship fast, so that we can actually put something in front of users and make sure we’re meeting their greatest needs, and the longer-term vision that guides our trajectory as the product matures.

We split the work into phases, and even though we’re embarrassed by some of the things we are putting in front of people, there’s reassurance that we’re building toward a future state that we all believe in. This also means that we aren’t launching into long build phases on things that we think are important, but may not be essential to our users.

Next Steps

We’re launching Channels alpha in December and hope to be moving into public Beta in Early Spring. Look for future posts sharing our design, research, and product development process in the coming weeks.

In the meantime, we invite you to provide feedback or sign up to be part of our beta group as we test and grow our Channels product. If you have any questions for the team or me, feel free to drop them in the comments or ask a question about Channels on Meta. Let us know what you think. 

The Old-School Fire Effect and Bare-Metal Programming


Many years ago, probably around 1995 or so, my family was having dinner at some friends, and their son (who I think may have been a high-school senior then) showed me some cool DOS programs on the computer. One of the programs was a demo that drew animated flames on the screen. I was amazed! Asking what language it was written in, I was told it was Pascal.

[Image: Anders Haraldsson, 'Programming i Pascal' (2:a uppl.), Studentlitteratur, Lund 1979]

Until then I had only programmed in QBasic, but if one could make fire with Pascal, I knew I just had to learn it.

My uncle supplied me with his university text-book on the language (pictured on the right), and I stepped to it. Unfortunately, the book turned out to be extremely thin on the subject of making fire. Also it was mainly concerned with programming the PDP-10 as opposed to the IBM PC that I was using. And so, I never learned the skill.

(There's a lesson here about the social aspects of programming: I could have asked more and much better questions.)

I've wanted to revisit this for years. Having acquired better programming, English, and web search skills, it's time to fill this gap in my education. This post contains a walk-through of the classic MS-DOS firedemo, a port of it to SDL, and an implementation of the fire effect that runs on bare-metal.

Firedemo

According to the internet, Javier "Jare" Arévalo's firedemo from 1993 was the first implementation of this effect. He wrote a blog post about it for the 20th anniversary, which includes a version in Javascript (source on GitHub).

When I asked about the firedemo, here's what he told me:

It began when we bought a 80387 math coprocessor and, to enjoy it, played a lot with a famous fractal generator called Fractint. Then I wanted to make a kind of plasma style fractal, but animated in a more complex way than the color rotation typical of the time. I just started writing some code without thinking much. A few bugs later, I had something that looked like small blue explosions that quickly faded to black. Tweaking the code rather than fixing the bugs got me the fire effect. We did realize how and why it looked like a fire, and JCAB's implementation in Inconexia was fully intentional and correct, but I never sat down to truly understand all the subtle bits in the original (where did the initial white explosion come from? Why was there no apparent "real" random number generator, yet it looked random?) until I recreated it in Javascript. As far as I can tell, the Javascript version is pixel perfect, it shows the exact same animation as the original did.

(FractInt still exists and has been ported to Linux. Jare wrote a plasma effect demo, iris, a few days after the fire demo. Inconexia (YouTube) uses the fire effect in the final scene.)

I'm not sure whether this is the program I saw that night many years ago, or if I saw one of the many other implementations that followed. Kirk A. Baum has collected some of them in firecode.zip, including a version that is indeed written in Pascal called Flames by Mark D. Mackey.

Let's dissect the source code of the firedemo:

Data

; ------------------------------ FIRE.ASM ------------------------------
; Bye Jare of VangeliSTeam. Want more comments? Write'em. O:-)


        .MODEL SMALL
        .STACK 400
        DOSSEG
        LOCALS

This syntax suggests the code is written for Borland's Turbo Assembler. (I suppose this write-up serves as an answer to the call for more comments.)

        .DATA

FirePal LABEL BYTE
;  Fire palette, colors 0-63 ------------

        DB        0,   0,   0,   0,   1,   1,   0,   4,   5,   0,   7,   9
	DB	  0,   8,  11,   0,   9,  12,  15,   6,   8,  25,   4,   4
	DB	 33,   3,   3,  40,   2,   2,  48,   2,   2,  55,   1,   1
	DB	 63,   0,   0,  63,   0,   0,  63,   3,   0,  63,   7,   0
	DB	 63,  10,   0,  63,  13,   0,  63,  16,   0,  63,  20,   0
	DB	 63,  23,   0,  63,  26,   0,  63,  29,   0,  63,  33,   0
	DB	 63,  36,   0,  63,  39,   0,  63,  39,   0,  63,  40,   0
	DB	 63,  40,   0,  63,  41,   0,  63,  42,   0,  63,  42,   0
	DB	 63,  43,   0,  63,  44,   0,  63,  44,   0,  63,  45,   0
	DB	 63,  45,   0,  63,  46,   0,  63,  47,   0,  63,  47,   0
	DB	 63,  48,   0,  63,  49,   0,  63,  49,   0,  63,  50,   0
	DB	 63,  51,   0,  63,  51,   0,  63,  52,   0,  63,  53,   0
	DB	 63,  53,   0,  63,  54,   0,  63,  55,   0,  63,  55,   0
	DB	 63,  56,   0,  63,  57,   0,  63,  57,   0,  63,  58,   0
	DB	 63,  58,   0,  63,  59,   0,  63,  60,   0,  63,  60,   0
	DB	 63,  61,   0,  63,  62,   0,  63,  62,   0,  63,  63,   0

FirePal contains the first 64 colours of the palette that will be used, stored as (Red,Green,Blue) byte triplets where each value is between 0 and 63. The remaining 192 colours of the palette are all white and will get set separately when programming the VGA palette.

ByeMsg  DB 'FIRE was coded bye Jare of VangeliSTeam, 9-10/5/93', 13, 10
        DB 'Sayonara', 13, 10, 10
        DB 'ELYSIUM music composed by Jester of Sanity (an Amiga demo group, I think)', 13, 10
        DB 'The music system you''ve just been listening is the VangeliSTracker 1.2b', 13, 10
        DB 'VangeliSTracker is Freeware (no money required), and distributed in source code', 13, 10
        DB 'If you haven''t got your copy of the VangeliSTracker, please go to your', 13, 10
        DB 'nearest BBS and get it NOW', 13, 10
        DB 'Also, don''t forget that YOU can join the VangeliSTeam. Contact the', 13, 10
        DB 'VangeliSTeam in the following addresses: ', 13, 10, 10
        DB '  Mail:     VangeliSTeam                          ³ This demo is dedicated to', 13, 10
        DB '            Juan Carlos Arévalo Baeza             ³        Mark J. Cox', 13, 10
        DB '            Apdo. de Correos 156.405              ³            and', 13, 10
        DB '            28080 - Madrid (Spain)                ³      Michael Abrash', 13, 10
        DB '  Internet: jarevalo@moises.ls.fi.upm.es          ³ At last, the PC showed good', 13, 10
        DB '  Fidonet:  2:341/27.16, 2:341/15.16, 2:341/9.21  ³ for something.', 13, 10, 10
        DB 'Greetings to all demo groups and MOD dudes around.', 13, 10
        DB '$'

ByeMsg contains the text that's printed before exiting the program. 13 and 10 are the ASCII character codes for carriage return and newline, respectively. The dollar sign signals the end of the string.

        UDATASEG

Imagen  DB 80*50 DUP (?)
Imagen2 DB 80*50 DUP (?)

These two 4000-byte uninitialized arrays will be used for storing the intensity of the fire in each pixel. One array is for the current frame, and the other for the previous one.

Setting up the VGA

        .CODE
        .STARTUP

        CLD
        MOV     AX,13h
        INT     10h
        CLI

CLD clears the direction flag (the DF bit in FLAGS), which means the index registers SI and DI get incremented (as opposed to decremented) after each string operation, such as LODS and STOS below.

INT 10h raises an interrupt that's handled by the VGA BIOS, like a system call. The contents of the AH register (the high 8 bits of AX) specify the function, in our case 0 which means "Set video mode", and AL specifies which mode to set, in our case Mode 13h. This mode has a resolution of 320x200 pixels and 256 colors, specified with one byte per pixel in a linear address space starting at A000:0000. The BIOS configures this mode by writing specific values to the many VGA registers that control exactly how the contents of the VGA memory, the frame-buffer, is to be accessed and drawn on the screen.

The CLI instruction disables interrupts. This is in order to protect the code that writes directly to VGA registers below. The code will perform OUT operations in a certain order, and could break if an interrupt handler performed other I/O operations in the middle of it all.

        MOV     DX,3c4h
        MOV     AX,604h                 ; "Unchain my heart". And my VGA...
        OUT     DX,AX
        MOV     AX,0F02h                ; All planes
        OUT     DX,AX

Now the code starts to tweak the VGA registers, moving away from the standard Mode 13h. 3c4h, loaded into DX, is the I/O port number of the VGA Sequence Controller Index register (See The Graphics Programming Black Book Chapter 23 and FreeVGA).

By doing OUT DX,AX, the code writes the 16-bit value in AX to the port, which is effectively the same as writing the 8-bit value in AL to 3c4h (Sequence Controller Index Register) and the 8-bit value in AH to 3c5h (Sequence Controller Data Register). The Index Register selects an internal Sequence Controller register, and the Data Register provides the value to write into it.
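In C terms, the pattern looks something like the sketch below, using Linux's <sys/io.h> port helpers (this assumes ioperm() has granted access to the ports; it illustrates the index/data convention and is not code from the demo):

#include <sys/io.h>

/* Write 'value' into the VGA register selected by 'index', via an
   index/data register pair such as 3c4h/3c5h. */
static void write_vga_reg(unsigned short index_port, unsigned char index,
                          unsigned char value)
{
        outb(index, index_port);       /* e.g. 3c4h: select the register */
        outb(value, index_port + 1);   /* e.g. 3c5h: write its new value */
}

/* write_vga_reg(0x3c4, 0x04, 0x06) reproduces the first word-sized OUT
   above (index 04h, value 06h). */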

In our case, the code is writing 06h to register index 04h, which is the Sequencer Memory Mode Register. This disables the Chain 4 bit which is otherwise set in mode 13h. This is what the "Unchain" comment refers to: turning off Chain-4 addressing mode and entering normal mode.

The VGA RAM is split into four different "planes", which were often implemented by four different memory chips on the circuit board. One reason was to solve the frame-buffer memory-access problem: to output 70 high-resolution frames per second, the VGA's CRT controller would need to read bytes at a higher rate than was feasible for a byte-addressed DRAM chip at the time. But with the frame-buffer split into four planes, stored in four chips, the CRT controller could read four bytes in parallel at a time, enough to keep up with the CRT refresh rate.

Chain 4 is a mode for addressing the four memory planes. When enabled, it uses the two least significant bits of the address to select which plane to read or write to (and leaves those two bits clear when addressing inside the plane, if I understand correctly), allowing linear addressing of the four planes "chained together". For example, writes to A000:0004, A000:0005, and A000:0006 in Chain 4 mode would end up at address 4 in plane 0, 1, and 2 respectively.
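If my reading is right, the Chain-4 address mapping can be summarized with a small helper; this is a sketch of my understanding rather than authoritative VGA documentation:

/* Decode a linear Chain-4 offset into a plane number and an offset
   within that plane, per the description above. */
struct plane_addr {
        unsigned plane;    /* 0..3: which memory plane */
        unsigned offset;   /* byte offset inside that plane */
};

static struct plane_addr chain4_decode(unsigned linear)
{
        struct plane_addr a;

        a.plane  = linear & 3;    /* low two bits select the plane */
        a.offset = linear & ~3u;  /* same offset, low bits cleared */
        return a;
}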

With Chain 4 disabled, the programmer has to explicitly select which plane(s) to access by setting the VGA Sequence Controller's Map Mask Register (index 02h). The write of 0Fh to that register enables writes to all four planes at once, hence the "All planes" comment. This means that each byte written to the framebuffer will get written to all four planes at that address, effectively appearing as four consecutive identical pixels.

        MOV     DX,3D4h
        MOV     AX,14h                  ; Disable dword mode
        OUT     DX,AX
        MOV     AX,0E317h               ; Enable byte mode.
        OUT     DX,AX

The VGA Sequence Controller controls how the frame-buffer is accessed from the CPU, but it's the CRT Controller that decides how to access the frame-buffer when scanning it to produce the video signal. 3D4h addresses the CRT Controller's Index Register (immediately followed by the Data Register). Writing 0014h to that port sets the Underline Location Register to zero, clearing the DW and DIV4 bits which enabled the double-word addressing mode that is normally used for scanning when Chain-4 is enabled. The write of E3h (the leading 0 in 0E317h is required for the assembler to recognize it as a number) to index 17h sets the Byte Mode bit in the CRTC Mode Control Register.

If I understand correctly, the reason for scanning to be done with double-word addressing in mode 13h is that Chain-4 clears the lower two address bits when writing into a plane. This means that after the scanner has read a value from each plane, it needs to increment the address by four (the size in bytes of a 32-bit "double word") to get to the next set of values.

        MOV     AL,9
        OUT     DX,AL
        INC     DX
        IN      AL,DX
        AND     AL,0E0h                 ; Duplicate each scan 8 times.
        ADD     AL,7
        OUT     DX,AL

The first two instructions above write 09h to the CRT Controller Index Register, which is the index of the Maximum Scan Line Register. Then DX is incremented to address the port of the CRT Controller Data Register, after which a byte is read, masked, added with 7, and written back, resulting in the Maximum Scan Line field of the register being set to 7, which means each scan line will be repeated eight (7+1) times.

Regular mode 13h produces 400 scan lines, with each scan line repeated twice for a vertical resolution of 200 pixels. With the operation above, the vertical resolution becomes 50 pixels instead. Mode 13h has a horizontal resolution of 320 pixels, but with our "unchaining" and writing to all four planes at once above, we now have a horizontal resolution of 80 pixels instead. In summary, these operations have changed from the 256-color 320-by-200 pixel mode 13h to a custom 256-color 80-by-50 mode.

Why is this lower resolution desirable? Aren't the pixels chunky enough in 320x200 mode? The reason was probably to make the program run faster. Computing the values for 80x50 pixels is much less work than for 320x200, so the lower resolution allows for producing more frames per second on a slow machine.

        MOV     DX,3c8h                 ; Setup palette.
        XOR     AL,AL
        OUT     DX,AL
        INC     DX
        MOV     CX,64*3
        MOV     SI,OFFSET FirePal       ; Prestored...
@@pl1:
         LODSB
         OUT    DX,AL
         LOOP   @@pl1

The DAC (Digital-to-Analog Converter) is the part of the video adapter responsible for converting the bits coming out of memory to an analog video signal that can be fed to a monitor. It contains 256 registers, mapping each possible byte value to an 18-bit color representation: 6 bits for red, green, and blue intensity, respectively. (The VGA also has something called the Palette RAM, which is different and used for EGA compatibility.)

To program the DAC, our program first writes a zero to 3c8h, the DAC Address Write Mode Register, signalling that it wishes to set the value of DAC register zero. It then writes repeatedly to port 3c9h, the DAC Data Register, three byte-sized writes for each of the 64 colours in FirePal (LODSB reads a byte from DS:SI and then increments SI, LOOP jumps to a label and decrements CX until it's zero).

        MOV     AL,63
        MOV     CX,192*3                ; And white heat.
@@pl2:
         OUT    DX,AL
         LOOP   @@pl2

The code above fills the remaining 192 DAC registers with "white heat": all-white (red, green and blue all 63) color values.
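Sketched in C with Linux port I/O (again assuming ioperm() access; firepal stands for the 64-colour FirePal table above):

#include <sys/io.h>

static void load_fire_palette(const unsigned char *firepal /* 64*3 bytes */)
{
        int i;

        outb(0, 0x3c8);                  /* start writing at DAC register 0 */
        for (i = 0; i < 64 * 3; i++)
                outb(firepal[i], 0x3c9); /* R, G, B for colours 0..63 */
        for (i = 0; i < 192 * 3; i++)
                outb(63, 0x3c9);         /* "white heat" for colours 64..255 */
}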

An STI instruction then turns interrupts back on, now that the code for setting up the VGA is done.

Main Loop

        MOV     AX,DS
        MOV     ES,AX
        MOV     DI,OFFSET Imagen        ; Cleanup both Images.
        MOV     CX,80*50
        XOR     AX,AX
        REP STOSW

Before we enter the main loop, the code above clears the Imagen and Imagen2 arrays using REP STOSW which performs a word-sized write (of AX, which is zero) to ES:DI, increments DI, and repeats 4000 (CX) times. Using word-sized writes means the code writes 8000 bytes in total, clearing both arrays.

MainLoop:
        MOV     DX,3DAh                 ; Retrace sync.
@@vs1:
        IN      AL,DX
        TEST    AL,8
        JZ      @@vs1
@@vs2:
        IN      AL,DX
        TEST    AL,8
        JNZ     @@vs2

The main loop starts by reading from the 3DAh I/O port, which is the VGA's Input Status #1 Register, and checking the VRetrace bit. It loops first while the bit is zero and then while it's one, effectively waiting for it to go from one to zero, thus synchronizing the loop with the VGA refresh cycle.
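The same busy-wait, sketched in C with inb() from <sys/io.h> (assuming raw port access as in the bare-metal setting):

#include <sys/io.h>

static void wait_for_vertical_retrace(void)
{
        while (!(inb(0x3da) & 8))
                ;                /* wait for the VRetrace bit to go high */
        while (inb(0x3da) & 8)
                ;                /* ...and then wait for it to drop again */
}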

        PUSH    DS
        POP     ES
        MOV     SI,81+OFFSET Imagen     ; Funny things start here. 8-P
        MOV     DI,81+OFFSET Imagen2
        MOV     CX,48*80-2
        XOR     BH,BH

Treating Imagen and Imagen2 as 80-by-50 two-dimensional arrays (matching the screen resolution), SI and DI are set up to point to the second element on the second row (counting from the top-left corner) of Imagen and Imagen2, respectively. CX will be used for the loop count, and BH is cleared to be used as a zero below.

@@lp:
        XOR     AX,AX
        ADD     AL,-1[SI]
        ADC     AH,BH
        ADD     AL,-80[SI]
        ADC     AH,BH
        ADD     AL,-79[SI]
        ADC     AH,BH
        ADD     AL,-81[SI]
        ADC     AH,BH
        ADD     AL,1[SI]
        ADC     AH,BH
        ADD     AL,80[SI]
        ADC     AH,BH
        ADD     AL,79[SI]
        ADC     AH,BH
        ADD     AL,81[SI]
        ADC     AH,BH

The code above sums together the values of all eight pixels neighbouring SI in Imagen into AX (-1[SI] is the pixel to the left of SI, -80[SI] is the pixel just above, etc.). First the low bits are added to AL, then any carry bit is added to AH using ADC.

It is because the code accesses neighbours of SI that it was set up to start at the second element of the second row in Imagen, and why the loop count in CX was chosen so the process will stop after the second-last element of the second-last row.

        ROR     AX,1
        ROR     AX,1
        ROR     AX,1

Rotating the bits in AX three steps to the right leaves AL containing the previous sum divided by eight, in other words it contains the average of the eight values surrounding SI. This is the core idea in the fire effect: computing the "heat" of each pixel as an average of its neighbours.

        TEST    AH,60h                  ; Wanna know why 60h? Me too.
        JNZ     @@nx                    ; This is pure experience.

After the ROR instructions, the three least significant bits of the sum of neighbours have ended up as the three highest bits of AH. This means that the TEST instruction effectively checks whether the two low bits of the sum were set. If they were not, we fall through to the code below. As the comment suggests, this was probably chosen somewhat randomly.

         CMP    DI,46*80+OFFSET Imagen2 ; And this was a bug.
         JNC    @@dec                   ; This one's by my cat.
          OR    AL,AL                   ; My dog coded here too.
          JZ    @@nx                    ; I helped my sister with this one.
@@dec:
           DEC  AL                      ; Yeah! Cool a bit, please.

The code above checks whether DI is past the first 46 rows of Imagen2, and if so jumps straight to @@dec. Otherwise, the code checks whether AL is greater than zero, and only proceeds to @@dec if so.

All this is effectively to decide whether to decrement AL, thereby "cooling" that pixel. If no cooling occurred, the screen would eventually fill with a single colour. Instead, the code cools pixels given the semi-random condition that the two low bits of the neighbour sum are zero (so roughly 25% of the time).

If AL is already zero however, decrementing doesn't "cool" it, but rather "re-ignites" it since the value wraps around to 255. The code only allows this for pixels in the lower four rows, which is how it "feeds the fire" from below. Note that when the program starts, all pixels are initially zero, so the low bits of the sum will be zero, and all pixels on the lower rows will ignite, causing the initial burst of flame.
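Restated as a C helper (a sketch; the 80-pixel row width and the 46*80 bottom-row threshold follow the assembly above):

#include <stdint.h>

/* Compute the new value for pixel i from the previous frame. The caller
   is expected to keep i at least one row/column away from the edges. */
static uint8_t update_pixel(const uint8_t *prev, int i)
{
        uint32_t sum = prev[i - 81] + prev[i - 80] + prev[i - 79] +
                       prev[i - 1]                + prev[i + 1] +
                       prev[i + 79] + prev[i + 80] + prev[i + 81];
        uint8_t val = (uint8_t)(sum / 8);

        /* Cool the pixel when the two low bits of the sum are clear;
           in the bottom four rows, a zero pixel wraps to 255 and ignites. */
        if ((sum & 3) == 0 && (val > 0 || i >= 46 * 80))
                val--;
        return val;
}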

@@nx:
        INC     SI
        STOSB
        LOOP    @@lp                    ; New image stored in Imagen2.

With the final value of AL computed, STOSB writes it to the address pointed to by DI which it also increments. SI is also incremented, and the loop repeats with the next pixel.

        MOV     SI,80+OFFSET Imagen2    ; Scrolling copy. :-)
        MOV     DI,OFFSET Imagen
        MOV     CX,40*48
        REP     MOVSW

With all the new pixel values in Imagen2, the program now copies them back to Imagen for next time. By starting the source pointer (SI) 80 bytes into the array, the copy effectively scrolls the contents up one line. The actual copying is done with REP MOVSW which performs 40*48 (CX) word-sized moves from DS:SI to ES:DI, incrementing SI and DI after each one. Only 40 moves are needed per line because they are word-sized, and only 48 lines are copied because the top line is discarded (by starting at offset 80) and the bottom line is all zeros.

        MOV     SI,80*43+OFFSET Imagen2 ; Get rid of some ashes.
        MOV     CX,6*80
        MOV     AH,22
@@rcl:
         MOV    AL,[SI]
         CMP    AL,15
         JNC    @@rcn
          SUB   AL,AH
          NEG   AL
          MOV   [SI],AL
@@rcn:
         INC    SI
         LOOP   @@rcl

By "ashes", the code means pixels with low heat values. Such pixels look a bit unsightly in the bottom lines, so to smooth things over, the code above loops over the pixels in the bottom six lines, looking for pixels with values lower than 15. For such pixels, the code subtracts 22 (AH), and negates the result (effectively computing 22 minus the pixel value), which brightens them up a bit.

        MOV     SI,80+OFFSET Imagen2    ; And show it.
        MOV     DI,0
        MOV     AX,0A000h
        MOV     ES,AX
        MOV     CX,40*48
        REP     MOVSW

With all the pixel values ready in Imagen2, the program copies them over to the 80x50 linearly addressed framebuffer at A000:0000 using the same "scrolling copy" technique as before. The frame will be displayed the next time the monitor refreshes.

        MOV     AH,1
        INT     16h
        JNZ     Bye
        JMP     MainLoop

After the frame has been copied to the graphics memory, the code invokes Int 16/AH=01h to check whether there's a keystroke in the keyboard buffer. If there's not, the MainLoop continues, otherwise it jumps to the code below.

Epilogue

Bye:
        XOR     AH,AH
        INT     16h
        MOV     AX,3
        INT     10h
        MOV     DX,OFFSET ByeMsg
        MOV     AH,9
        INT     21h

First, Int 16/AH=00h is invoked to retrieve the keystroke from the keyboard buffer (the result, in AX, is ignored). Then Int 10/AH=00h is used to reset the video mode back to 03h, which is the regular 80x25 16-color text mode. Finally, Int 21/AH=09h is used to write the goodbye message to the screen.

        MOV     AX,4C00h
        INT     21h

        END
; ------------------------------ End of FIRE.ASM ---------------------------

At the very end, Int 21/AH=4Ch terminates the program.

That's it: 200 lines of assembly and the rest is history.

Firedemo in SDL

After reading through the original firedemo code above, I wanted to re-implement it to run on modern operating systems. In the Othello project, we did some graphical programming by using the native libraries (Xlib, Win32 GDI, Cocoa, etc.), but in this case we're not trying to build a graphical user interface, we just want to paint pixels on the screen. One popular cross-platform library for doing that, often used in games programming, is SDL (Simple Directmedia Layer).

The code below (available in fire.c) is a pixel-perfect port of the firedemo to SDL2. (It mainly follows this guidance from the SDL2 migration guide.) Hopefully it's a little easier to read than the assembly version.

#include <SDL.h>
#include <stdbool.h>
#include <stdio.h>

#define WIDTH 80
#define HEIGHT 50
#define WIN_WIDTH 640
#define WIN_HEIGHT 400
#define FPS 30

static const uint32_t palette[256] = {
        /* Jare's original FirePal. */
#define C(r,g,b) ((((r) * 4) << 16) | ((g) * 4 << 8) | ((b) * 4))
        C( 0,  0,  0), C( 0,  1,  1), C( 0,  4,  5), C( 0,  7,  9),
        C( 0,  8, 11), C( 0,  9, 12), C(15,  6,  8), C(25,  4,  4),
        C(33,  3,  3), C(40,  2,  2), C(48,  2,  2), C(55,  1,  1),
        C(63,  0,  0), C(63,  0,  0), C(63,  3,  0), C(63,  7,  0),
        C(63, 10,  0), C(63, 13,  0), C(63, 16,  0), C(63, 20,  0),
        C(63, 23,  0), C(63, 26,  0), C(63, 29,  0), C(63, 33,  0),
        C(63, 36,  0), C(63, 39,  0), C(63, 39,  0), C(63, 40,  0),
        C(63, 40,  0), C(63, 41,  0), C(63, 42,  0), C(63, 42,  0),
        C(63, 43,  0), C(63, 44,  0), C(63, 44,  0), C(63, 45,  0),
        C(63, 45,  0), C(63, 46,  0), C(63, 47,  0), C(63, 47,  0),
        C(63, 48,  0), C(63, 49,  0), C(63, 49,  0), C(63, 50,  0),
        C(63, 51,  0), C(63, 51,  0), C(63, 52,  0), C(63, 53,  0),
        C(63, 53,  0), C(63, 54,  0), C(63, 55,  0), C(63, 55,  0),
        C(63, 56,  0), C(63, 57,  0), C(63, 57,  0), C(63, 58,  0),
        C(63, 58,  0), C(63, 59,  0), C(63, 60,  0), C(63, 60,  0),
        C(63, 61,  0), C(63, 62,  0), C(63, 62,  0), C(63, 63,  0),
        /* Followed by "white heat". */
#define W C(63,63,63)
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W
#undef W
#undef C
};

static uint8_t fire[WIDTH * HEIGHT];
static uint8_t prev_fire[WIDTH * HEIGHT];
static uint32_t framebuf[WIDTH * HEIGHT];

int main()
{
        SDL_Window *window;
        SDL_Renderer *renderer;
        SDL_Texture *texture;
        SDL_Event event;
        int i;
        uint32_t sum;
        uint8_t avg;
        bool full_screen = false;
        bool keep_running = true;

        if (SDL_Init(SDL_INIT_VIDEO) < 0) {
                fprintf(stderr, "Failed SDL_Init: %s\n", SDL_GetError());
                return 1;
        }

        window = SDL_CreateWindow("SDL2 firedemo (www.hanshq.net/fire.html)",
                                  SDL_WINDOWPOS_UNDEFINED,
                                  SDL_WINDOWPOS_UNDEFINED,
                                  WIN_WIDTH, WIN_HEIGHT,
                                  SDL_WINDOW_SHOWN | SDL_WINDOW_RESIZABLE);
        if (window == NULL) {
                fprintf(stderr, "Failed CreateWindow: %s\n", SDL_GetError());
                return 1;
        }

        renderer = SDL_CreateRenderer(window, -1, 0);
        if (renderer == NULL) {
                fprintf(stderr, "Failed CreateRenderer: %s\n", SDL_GetError());
                return 1;
        }

        texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ARGB8888,
                                    SDL_TEXTUREACCESS_STREAMING,
                                    WIDTH, HEIGHT);
        if (texture == NULL) {
                fprintf(stderr, "Failed CreateTexture: %s\n", SDL_GetError());
                return 1;
        }

        while (keep_running) {
                while (SDL_PollEvent(&event)) {
                        if (event.type == SDL_QUIT) {
                                keep_running = false;
                        } else if (event.type == SDL_KEYDOWN) {
                                if (event.key.keysym.sym == SDLK_f) {
                                        full_screen = !full_screen;
                                        SDL_SetWindowFullscreen(window,
                                                full_screen ?
                                                SDL_WINDOW_FULLSCREEN_DESKTOP : 0);
                                } else if (event.key.keysym.sym == SDLK_q) {
                                        keep_running = false;
                                }
                        }
                }

                for (i = WIDTH + 1; i < (HEIGHT - 1) * WIDTH - 1; i++) {
                        /* Average the eight neighbours. */
                        sum = prev_fire[i - WIDTH - 1] +
                              prev_fire[i - WIDTH] +
                              prev_fire[i - WIDTH + 1] +
                              prev_fire[i - 1] +
                              prev_fire[i + 1] +
                              prev_fire[i + WIDTH - 1] +
                              prev_fire[i + WIDTH] +
                              prev_fire[i + WIDTH + 1];
                        avg = (uint8_t)(sum / 8);

                        /* "Cool" the pixel if the two bottom bits of the
                           sum are clear (somewhat random). For the bottom
                           rows, cooling can overflow, causing "sparks". */
                        if (!(sum & 3) &&
                            (avg > 0 || i >= (HEIGHT - 4) * WIDTH)) {
                                avg--;
                        }
                        fire[i] = avg;
                }

                /* Copy back and scroll up one row.
                   The bottom row is all zeros, so it can be skipped. */
                for (i = 0; i < (HEIGHT - 2) * WIDTH; i++) {
                        prev_fire[i] = fire[i + WIDTH];
                }

                /* Remove dark pixels from the bottom rows (except again the
                   bottom row which is all zeros). */
                for (i = (HEIGHT - 7) * WIDTH; i < (HEIGHT - 1) * WIDTH; i++) {
                        if (fire[i] < 15) {
                                fire[i] = 22 - fire[i];
                        }
                }

                /* Copy to framebuffer and map to RGBA, scrolling up one row. */
                for (i = 0; i < (HEIGHT - 2) * WIDTH; i++) {
                        framebuf[i] = palette[fire[i + WIDTH]];
                }

                /* Update the texture and render it. */
                SDL_UpdateTexture(texture, NULL, framebuf,
                                  WIDTH * sizeof(framebuf[0]));
                SDL_RenderClear(renderer);
                SDL_RenderCopy(renderer, texture, NULL, NULL);
                SDL_RenderPresent(renderer);

                SDL_Delay(1000 / FPS);
        }

        SDL_DestroyTexture(texture);
        SDL_DestroyRenderer(renderer);
        SDL_DestroyWindow(window);
        SDL_Quit();

        return 0;
}

To build and run the program on Debian GNU/Linux (or Ubuntu):

$ sudo apt-get install libsdl2-dev
$ gcc -O3 -o fire `sdl2-config --cflags --libs` fire.c
$ ./fire

To install SDL2 from MacPorts and build on Mac:

$ sudo port install libsdl2
$ clang -O3 -o fire `sdl2-config --cflags --libs` fire.c
$ ./fire

To build on Windows, download the latest "Visual C++ 32/64-bit" development library from the SDL 2.0 download page (currently the latest version is SDL2-devel-2.0.7-VC.zip). Extract that somewhere (I used C:\), and build in a Visual Studio Developer Command Prompt:

cl /Ox /DSDL_MAIN_HANDLED /Ic:\SDL2-2.0.7\include c:\SDL2-2.0.7\lib\x86\SDL2.lib fire.c
copy c:\SDL2-2.0.7\lib\x86\SDL2.dll .
fire.exe

The /DSDL_MAIN_HANDLED flag is to prevent SDL from replacing the main function. The copy is to make sure the SDL2.dll can be found when running the program.

(The program may not work in VirtualBox if video acceleration is not set up correctly. In that case, pass SDL_RENDERER_SOFTWARE instead of the 0 argument in the call to SDL_CreateRenderer.)

A New Fire Demo for DOS

I don't think the firedemo above was actually the program I saw that evening in the nineties. The way I remember it, the flames were just along the bottom of the screen. What I remember resembles much more what's described in Lode Vandevenne's Fire Effect tutorial.

One important difference in how that tutorial creates the fire is that it only averages pixel values on rows below the current one. This means the computation can be performed on a single buffer, in other words, there is no need to have separate buffers for the current and previous frame.

That makes things easier, and since the fire is located mostly along the bottom of the screen, it should be no problem running this in 320x200 resolution, even on a slow machine.
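
To make the single-buffer idea concrete before diving into the assembly, here is a minimal C sketch of just the propagation step. The buffer size, names and the rand() seeding are mine, for illustration; the real logic is in fire.asm below:

/* A minimal C sketch of the single-buffer propagation step used by fire.asm
   below. Buffer sizes, names, main() and the rand() seeding are illustrative
   only; the real thing is the assembly listing that follows. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH  320
#define HEIGHT 200

/* One padding row below the screen so that reading "two rows below" from the
   bottom row stays in bounds (fire.asm just reads whatever happens to follow
   the framebuffer in its segment). */
static uint8_t fire[WIDTH * (HEIGHT + 1)];

static void fire_step(void)
{
        int x, y, i, sum;

        /* Random "heat" along the bottom visible row. */
        for (x = 0; x < WIDTH; x++) {
                fire[(HEIGHT - 1) * WIDTH + x] = (uint8_t)rand();
        }

        /* Work upwards through the 50 flame rows, as fire.asm does. Each pixel
           depends only on rows below it, which were already written this
           frame, so a single buffer is enough. */
        for (y = HEIGHT - 2; y >= HEIGHT - 51; y--) {
                for (x = 1; x < WIDTH - 1; x++) {
                        i = y * WIDTH + x;
                        sum = fire[i + WIDTH - 1]     /* below-left      */
                            + fire[i + WIDTH]         /* below           */
                            + fire[i + WIDTH + 1]     /* below-right     */
                            + fire[i + 2 * WIDTH];    /* two rows below  */
                        fire[i] = (uint8_t)(sum * 15 / 64); /* average and cool */
                }
        }
}

int main(void)
{
        int frame, x, total = 0;

        for (frame = 0; frame < 100; frame++) {
                fire_step();
        }
        /* Crude sanity check: average intensity of a row near the bottom. */
        for (x = 0; x < WIDTH; x++) {
                total += fire[(HEIGHT - 10) * WIDTH + x];
        }
        printf("average intensity near the bottom: %d\n", total / WIDTH);
        return 0;
}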

I've used this technique to make a little fire demo of my own (fire.asm):

        org 0x100       ; For .com file.

section .text
start:
        ; Enter mode 13h: 320x200, 1 byte (256 colors) per pixel.
        mov ax, 0x13
        int 0x10

        ; Make sure es and ds point to our segment (cs).
        push cs
        push cs
        pop ds
        pop es

        ; Write string.
        mov ax, 0x1300          ; ah=13h, al=write mode
        mov bx, 0xf             ; bh=page number (0), bl=attribute (white)
        mov cx, (msg_end - msg) ; cx=length
        mov dx, ((10 << 8) + (40 / 2 - (msg_end - msg) / 2)) ; dh=row, dl=column
        mov bp, msg             ; es:bp=string address
        int 0x10

        ; Set up the palette.
        ; Jare's original FirePal:
        cli             ; No interrupts while we do this, please.
        mov dx, 0x3c8   ; DAC Address Write Mode Register
        xor al, al
        out dx, al      ; Start setting DAC register 0
        inc dx          ; DAC Data Register
        mov cx, (firepal_end - firepal)
        mov si, firepal
setpal1:
        lodsb
        out dx, al      ; Set DAC register (3 byte writes per register)
        loop setpal1
        mov al, 63
        mov cx, (256 * 3 - (firepal_end - firepal))
setpal2:
        out dx, al      ; Set remaining registers to "white heat".
        loop setpal2
        sti             ; Re-enable interrupts.

        ; A buffer at offset 0x1000 from our segment will be used for preparing
        ; the frames. Copy the current framebuffer (the text) there.
        push 0xa000
        pop ds
        push cs
        pop ax
        add ax, 0x1000
        mov es, ax
        xor si, si
        xor di, di
        mov cx, (320 * 200 / 2)
        cld
        rep movsw       ; Copy two bytes at a time.

        push es
        pop ds
mainloop:
        ; On entry to the loop, es and ds should point to the scratch buffer.

        ; Since we'll be working "backwards" through the framebuffer, set the
        ; direction flag, meaning stosb etc. will decrement the index registers.
        std

        ; Let di point to the pixel to be written.
        mov di, (320 * 200 - 1)

        ; Write random values to the bottom row.
        ; For random numbers, use "x = 181 * x + 359" from
        ; Tom Dickens "Random Number Generator for Microcontrollers"
        ; http://home.earthlink.net/~tdickens/68hc11/random/68hc11random.html
        mov cx, 320
        xchg bp, ax     ; Fetch the seed from bp.
bottomrow:
        imul ax, 181
        add ax, 359
        xchg al, ah     ; It's the high 8 bits that are random.
        stosb
        xchg ah, al
        loop bottomrow
        xchg ax, bp     ; Store the seed in bp for next time.

        ; For the next 50 rows, propagate the fire upwards.
        mov cx, (320 * 50)
        mov si, di
        add si, 320     ; si points at the pixel below di.
propagate:
        ; Add the pixel below, below-left, below-right and two steps below.
        xor ax, ax
        mov al, [si]
        add al, [si - 1]
        adc ah, 0
        add al, [si + 1]
        adc ah, 0
        add al, [si + 320]
        adc ah, 0
        imul ax, 15
        shr ax, 6       ; Compute floor(sum * 15 / 64), averaging and cooling.
        stosb
        dec si
        loop propagate

        ; Mirror some of the fire onto the text.
        mov dx, 15              ; Loop count, decrementing.
        mov di, (90 * 320)      ; Destination pixel.
        mov si, (178 * 320)     ; Source pixel.
mirrorouter:
        mov cx, 320     ; Loop over each pixel in the row.
mirrorinner:
        mov al, [di]    ; Load destination pixel.
        test al, al     ; Check if it's zero.
        lodsb           ; Load the source pixel into al.
        jnz mirrorwrite ; For non-zero destination pixel, don't zero al.
        xor al, al
mirrorwrite:
        stosb           ; Write al to the destination pixel.
        loop mirrorinner
        add si, 640     ; Bump si to the row below the one just processed.
        dec dx
        jnz mirrorouter

        ; Sleep for one system clock tick (about 1/18.2 s).
        xor ax, ax
        int 0x1a        ; Returns nbr of clock ticks in cx:dx.
        mov bx, dx
sleeploop:
        xor ax, ax
        int 0x1a
        cmp dx, bx
        je sleeploop

        ; Copy from the scratch buffer to the framebuffer.
        cld
        push 0xa000
        pop es
        mov cx, (320 * (200 - 3) / 2)
        xor si, si
        mov di, (320 * 3)       ; Scroll down three rows to avoid noisy pixels.
        rep movsw

        ; Restore es to point to the scratch buffer.
        push ds
        pop es

        ; Check for key press.
        mov ah, 1
        int 0x16
        jz mainloop

done:
        ; Fetch key from buffer.
        xor ah, ah
        int 0x16

        ; Return to mode 3.
        mov ax, 0x3
        int 0x10

        ; Exit with code 0.
        mov ax, 0x4c00
        int 0x21

; Data.
msg: db 'www.hanshq.net/fire.html'
msg_end:

firepal:
        db     0,   0,   0,   0,   1,   1,   0,   4,   5,   0,   7,   9
        db     0,   8,  11,   0,   9,  12,  15,   6,   8,  25,   4,   4
        db    33,   3,   3,  40,   2,   2,  48,   2,   2,  55,   1,   1
        db    63,   0,   0,  63,   0,   0,  63,   3,   0,  63,   7,   0
        db    63,  10,   0,  63,  13,   0,  63,  16,   0,  63,  20,   0
        db    63,  23,   0,  63,  26,   0,  63,  29,   0,  63,  33,   0
        db    63,  36,   0,  63,  39,   0,  63,  39,   0,  63,  40,   0
        db    63,  40,   0,  63,  41,   0,  63,  42,   0,  63,  42,   0
        db    63,  43,   0,  63,  44,   0,  63,  44,   0,  63,  45,   0
        db    63,  45,   0,  63,  46,   0,  63,  47,   0,  63,  47,   0
        db    63,  48,   0,  63,  49,   0,  63,  49,   0,  63,  50,   0
        db    63,  51,   0,  63,  51,   0,  63,  52,   0,  63,  53,   0
        db    63,  53,   0,  63,  54,   0,  63,  55,   0,  63,  55,   0
        db    63,  56,   0,  63,  57,   0,  63,  57,   0,  63,  58,   0
        db    63,  58,   0,  63,  59,   0,  63,  60,   0,  63,  60,   0
        db    63,  61,   0,  63,  62,   0,  63,  62,   0,  63,  63,   0
firepal_end:

To assemble the program and run it with Dosbox on Linux:

$ sudo apt-get install nasm dosbox
$ nasm fire.asm -fbin -o fire.com
$ dosbox fire.com

(fire.com can also be downloaded here.)

On Mac:

$ sudo port install nasm dosbox
$ nasm fire.asm -fbin -o fire.com
$ dosbox fire.com

For Windows, the idea is the same, but you have to download the programs from the nasm and Dosbox web sites manually.

Running on Bare Metal

While the fire.com demo above runs under MS-DOS, the program doesn't actually use DOS for anything. In fact, it's not so much a DOS program as an IBM PC-compatible program: it's just 16-bit x86 code, some BIOS calls and fiddling with the VGA.

The exciting thing is that while PCs have gotten much faster and more capable in the last 20 years, the old stuff is still there. It should be possible to run my program on a modern PC, without the help of any operating system.

Running a program without an operating system is sometimes referred to as running on bare metal. This is most common in embedded systems, but it's possible on PCs as well.

When a PC starts, it first performs power-on self tests (POST), and then proceeds to load the operating system. Typically it loads it from the hard drive, but it can also boot from other devices such as a CD-ROM, USB stick or floppy disk.

The way a PC traditionally decides if it can boot from some medium is by reading the first sector (512 bytes) of it and checking whether that ends with the two magic bytes 0x55 0xAA, the Master Boot Record boot signature. If so, it loads that sector into memory at address 0000:7c00 and runs it.
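
If you want to convince yourself that an image really ends with that signature, a tiny checker along these lines will do (this snippet is mine, not part of the original article):

/* Tiny checker (not from the article): verifies that an image file is at
   least one sector long and that its first sector ends with 0x55 0xAA. */
#include <stdio.h>

int main(int argc, char **argv)
{
        unsigned char sector[512];
        FILE *f;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: %s <image>\n", argv[0]);
                return 1;
        }
        if (fread(sector, 1, sizeof(sector), f) != sizeof(sector)) {
                fprintf(stderr, "image is shorter than one sector\n");
                fclose(f);
                return 1;
        }
        fclose(f);
        printf("boot signature %s\n",
               (sector[510] == 0x55 && sector[511] == 0xAA) ? "OK" : "missing");
        return 0;
}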

Luckily, our program fits in well under 512 bytes, so to make it run as a Master Boot Record, we just have to make it expect to be loaded at 0000:7c00:
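
        org 0x7c00      ; For a boot sector loaded at 0000:7c00 (replacing the "org 0x100" used for the .com build).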

and insert padding and the magic bytes at the end:

        times (510 - ($ - $$)) db 0      ; Pad to 510 bytes
        db 0x55                          ; MBR boot signature.
        db 0xaa

We assemble it as before:

$ nasm fire.asm -fbin -o fire.img

and end up with fire.img which contains our program and functions as a Master Boot Record.

An easy way to test this is with VirtualBox. Configure a new virtual machine, load the .img file as a virtual floppy disk, start the machine and watch it boot into the fire demo.

To create a bootable USB stick with our demo from a Linux machine, insert a USB stick and check dmesg to see what device ID it gets assigned:

$ dmesg
...
[23722.398774] usb-storage 3-1.2:1.0: USB Mass Storage device detected
[23722.400366] scsi7 : usb-storage 3-1.2:1.0
[23723.402196] scsi 7:0:0:0: Direct-Access              USB DISK 2.0
[23723.402883] sd 7:0:0:0: Attached scsi generic sg4 type 0
[23726.611204] sd 7:0:0:0: [sdc] 15138816 512-byte logical blocks: (7.75 GB/7.21 GiB)
[23726.613778] sd 7:0:0:0: [sdc] Write Protect is off
[23726.613783] sd 7:0:0:0: [sdc] Mode Sense: 23 00 00 00
[23726.615824] sd 7:0:0:0: [sdc] No Caching mode page found
[23726.615829] sd 7:0:0:0: [sdc] Assuming drive cache: write through
[23726.629461]  sdc: sdc1
[23726.638104] sd 7:0:0:0: [sdc] Attached SCSI removable disk

Note: don't try this at home if you don't know what you're doing. Also don't try it at work.

To write the image to the USB stick (it will effectively delete all existing data on the USB device; make sure you got the right device ID and don't have anything important on it):

$ sudo dd if=fire.img of=/dev/sdc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000129039 s, 4.0 MB/s

Restart the computer, boot from the USB stick (you might have to enter a BIOS menu to select boot device) and watch it run on a modern computer just like it would twenty years ago!

Further Reading

  • Michael Abrash's Graphics Programming Black Book is full of information about the VGA, including techniques like "unchaining". The full text (web friendly version) is available online.
  • Fabien Sanglard's Game Engine Black Book: Wolfenstein 3D has excellent explanations of the PC hardware of the early nineties and provided significant inspiration for this post.

Fewer toys at once may help toddlers to focus better and play more creatively


Highlights

An abundance of toys present in the environment reduced the quality of toddlers’ play.

Fewer toys at once may help toddlers to focus better and play more creatively.

This can be done in many settings to support development and promote healthy play.

Abstract

We tested the hypothesis that an environment with fewer toys will lead to higher quality of play for toddlers. Each participant (n = 36) engaged in supervised, individual free play sessions under two conditions: Four Toy and Sixteen Toy. With fewer toys, participants had fewer incidences of toy play, longer durations of toy play, and played with toys in a greater variety of ways (Z = −4.448, p < 0.001, r = −0.524; Z = 2.828, p = 0.005, r = 0.333; and Z = 4.676, p < 0.001, r = 0.55, respectively). This suggests that when provided with fewer toys in the environment, toddlers engage in longer periods of play with a single toy, allowing better focus to explore and play more creatively. This can be offered as a recommendation in many natural environments to support children’s development and promote healthy play.

Andrew Wiles on the struggle and beauty of mathematics


Roger Highfield explores the beauty of mathematics at a recent event at the Science Museum

One of the world’s greatest mathematicians, Sir Andrew Wiles, made a rare public appearance in the Science Museum this week to discuss his latest research, his belief in the value of struggle, and how to inspire the next generation.

Sir Andrew made global headlines in 1994 when he reported that he had cracked Fermat’s Last Theorem, so named because it was first formulated by the French mathematician Pierre de Fermat in 1637.

His triumph while working in Princeton marked the end of a long gruelling struggle for Sir Andrew, who first became entranced by the theorem in the early sixties, when he was 10 years old.

Why did Fermat exert such a tight grip on him? The romance of this mathematical story, ‘captivated me’, he said. ‘Fermat wrote down this problem in a copy of a book of Greek mathematics. It was only found after his death by his son.’

Last year, in recognition of his towering achievement, Sir Andrew was awarded the Abel Prize, mathematics’ equivalent of the Nobel Prize, and today the Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute continues to explore new horizons in mathematics that have been opened up by his work on Fermat.

Sir Andrew Wiles, Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute in conversation with mathematician and broadcaster Dr Hannah Fry of University College London

After a brief lecture, Sir Andrew was joined by mathematician and broadcaster Dr Hannah Fry of University College London, a familiar face in the Science Museum, having this year presented Britain’s Greatest Invention from the museum’s storage facility near Swindon, and been one of the faces associated with the museum’s Tomorrow’s World partnership with the BBC, Royal Society, Open University and Wellcome.

Despite his years of struggle, and the scepticism of his peers, did he always believe it was possible to prove Fermat’s Last Theorem? ‘Oh yes,’ said Sir Andrew. ‘I am always quite encouraged when people say something like: "You can’t do it that way."’

Did Fermat himself really have a proof? Based on the mathematics of his day, ‘the probability is almost zero,’ though he added: ‘it is just conceivable.’

Sir Andrew realised early on in his attempt on the problem that conventional mathematical approaches had been exhausted but became intrigued once again in 1986 when he realised a new route to crack the problem had opened up in mainstream mathematics, through the study of what are called elliptic curves.

During years of intense study, he placed his faith in the ‘three Bs’: Bus, bath and bed. In other words, the power of the subconscious, when his mind could relax and was given rein to wander.

Sir Andrew is a specialist in number theory, a branch of mathematics dedicated to the study of integers. So, asked Dr Fry, are there other areas of pure mathematics that he wished he had more time to study? ‘I confess that I was addicted to number theory from the time I was ten years old,’ he said. ‘I have never found anything else in mathematics that appealed quite as much.’

Broadcaster Dr Hannah Fry of University College London

Dr Fry pointed out that the implication was, of course, that there were other fields of undergraduate mathematics where the Abel prize winner felt he was weaker. ‘Definitely true’, he said.

Even so, he found a way to sate his addiction as a student. ‘There was not much number theory in undergraduate mathematics’, so he would ‘sneak off to the library to try and read Fermat’. ‘But Fermat had this really irritating habit of writing in Latin,’ said Sir Andrew. Even today, his grasp of Latin remains ‘minimal’.

Terms such as ‘elegance’ and ‘beauty’ are bandied around by many mathematicians. What do they mean? They are hard to explain but Sir Andrew likened the mathematical equivalent of experiencing the rapture of beauty to walking down a path to explore a garden by the great landscape architect Capability Brown, when a breathtaking vista suddenly beckons. In other words, elegance in mathematics ‘is this surprise element of suddenly seeing everything clarified and beautiful.’

But you should ‘not stare at it non-stop’, he warned, else the majesty will fade, as is also the case with great paintings and music.

Today he is still walking through the great garden of mathematics, ‘the language of science,’ he said. Another way Sir Andrew described his lifelong passion to the rapt audience was as a ‘beautiful edifice…the most permanent thing there is.’

Industry and government realise that mathematicians are the lifeblood of a modern economy but are concerned by the lack of uptake of maths. Most young people ‘do have a real appetite for mathematics’, said Sir Andrew, but they are put off because, he believes, their teachers are not viscerally interested in the subject.

It is in primary schools that teachers need to kindle sparks of interest in the subject but many of them aren’t actually mathematicians because so many maths graduates end up in better paid careers.

Young people need to learn from someone who truly enjoys the subject, and shows their enjoyment, he said. When the teachers don’t truly care about mathematics, ‘that gets passed on.’ One solution to attracting better teachers, he added, is to ‘pay them more’. His comment was greeted with warm applause.

Is skill at mathematics more a matter of nature than nurture? Sir Andrew disagrees with the depiction of mathematicians in the movie Good Will Hunting, which suggests success means being born with an aptitude for mathematics so that ‘it is easy.’

Sir Andrew Wiles, Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute

He told the audience that there are some things you are born with that might make it easier but, he stressed, ‘it’s never easy.’

‘Mathematicians struggle with mathematics even more than the general public does,’ said Sir Andrew. ‘We really struggle. It’s hard.’

But, he added, ‘we learn how to adapt to that struggle.’ Intriguingly, he said that some young, bright PhD mathematicians might find it hard to adapt to a life with less instant gratification from solving problems, and ‘can’t cope with being stuck for more than 24 hours’.

To be a great research mathematician, it takes character more than just technical skill. ‘You need a particular kind of personality that will struggle with things, will focus, won’t give up.’

Paradoxically, he suggested that those who are not so good at mathematics are able to cope better with research and the frustration of being stuck.

What is the next great challenge in mathematics? Sir Andrew referred to the Millennium Prize Problems, seven problems in mathematics that were highlighted by the Clay Mathematics Institute in 2000, each with a $1 million prize.

One, the Poincaré conjecture, was solved in 2003. The most famous of the remaining six, and the one that he would bet on to be cracked next, is the Riemann hypothesis, a great unsolved problem posed by Bernhard Riemann in 1859 and singled out in 1900 by the highly influential German mathematician David Hilbert. ‘It says something about the way prime numbers are distributed,’ he said.

He encouraged young mathematicians to attempt these ‘impossible problems’ while they are teens or undergraduates, to give them a taste for research, but to set them to one side when starting a career ‘to be responsible.’

The special event was introduced by Martin Bridson, Whitehead Professor of Pure Mathematics, Head of the Mathematical Institute, and by Dame Mary Archer, Chair of the Science Museum Group, who listed various mathematics initiatives in the museum.

Dame Mary pointed out that the museum’s Wonderlab interactive gallery has launched a new mathematics show for young visitors, called Primetime, to celebrate the remarkable impact of mathematics on everyday life.

Thanks to the help of Prof Marcus du Sautoy of Oxford, also one of the museum’s advisors, the Bodleian commissioned a carbon dating project of the ‘Bakhshali manuscript’, part of which is on display in the museum, which revealed that the first written record of zero, a highly influential number, dates back four centuries further than most scholars had thought.

L-R: Dame Mary Archer, Roger Highfield, Sir Andrew Wiles, Dr Hannah Fry and Martin Bridson

Since it opened last December, Mathematics: the Winton Gallery, has welcomed 1.2 million visitors and won two awards. In the audience was David Harding, who with his wife Claudia donated £5m to fund the gallery, designed by the late Dame Zaha Hadid, to inspire future generations of mathematicians.

Also in the packed IMAX theatre was the TV presenter Dara O’Briain, William Shawcross, Chairman of the Charity Commission for England and Wales, Ilyas Khan of Cambridge Quantum Computing, school teachers and many young mathematicians – perhaps even a future Abel prize-winner or Fields medalist.

High-Speed Trading: Lines, Radios, and Cables


TABB FORUM: where capital markets speak

Matt Hurd
10 May 2017

Even savvy traders, such as Getco, make mistakes and invest millions of dollars inappropriately in the wrong communication technologies in pursuit of speed. Low latency may be worth millions of dollars to your trade, but capital and recurrent expenditures may give you pause as you toss around modern HFT technology and potential ROIs. Tech can be expensive. You’d better understand it well before choosing your preferred cost and profile.

Spread Networks blew a lazy few hundred million dollars on a white elephant straighter optical fibre between Chicago and New York. Not all traders were wise enough to dodge the Spread Networks bullet, with Getco the most famous customer to spend an unjustifiable, inordinate amount. Microwave had been on that route for more than 50 years, was faster, and was already being used for trading.

Don’t make the same mistake as Spread. Be careful with your link choices and your cable choices.

Look at these cables. Decide the order of speed of propagation of signal in them. Many traders, but not so many engineers, may be surprised:

Basic cabling test: Put these in order from slowest to fastest.

The correct order from slowest to fastest, by velocity of propagation, is d, a, c, b, then e. There is faster, though. If you’re geek enough, like me, to get a kick out of this kind of thing, you may find this interesting. Most people would prefer to meander elsewhere, I suspect. I’m not the guy you want to invite to your dinner party ;-)

Latency misconceptions

Even savvy traders, such as Getco, do make mistakes and invest millions of dollars inappropriately in the wrong communication technologies. Don’t do that.

Latency may be worth millions of dollars to your trade, but capital and recurrent expenditures may give you pause as you toss around modern HFT technology and potential ROIs. Tech can be expensive. You’d better understand it well before choosing your preferred cost and profile. Let’s have a look at some of the poorly understood and interesting, to me, misunderstandings and developments that may be important to both your latency-critical and latency-sensitive trading. Let’s meander through some of the points.

Is fibre transmission faster than transmission using electrical wires? 

The answer is: It depends.

Is radio frequency transmission always faster than fibre?

The answer is: It depends.

The new low earth orbit (LEO) satellite service in pre-sales from LeoSat Enterprise LLC has reportedly snared a high-speed trading customer. Could LeoSat really be faster than terrestrial communication?

The answer is: It depends (but unlikely).

Back in the day, when Getco released some S1s & S4s, there was a bit of trading community comment regarding notes in the accounts where it was disclosed that millions of dollars had been spent on Spread Networks fibre capacity between Chicago and New York:

“Colocation and data line expenses increased $18.9 million (52.0%) to $55.2 million in 2010 from $36.3 million in 2009 primarily due to the introduction of Spread Networks, which is a fiber optic line that transmits exchange and market data between Chicago and New York, and the build out of GETCO’s Asia-Pacific colocations and data lines.” [Knight Holdco, Inc., SEC S-4, 12 Feb 2013, page 227].

Investing in Spread Networks was wasted money. Microwave links are faster and were already being used on that route. In fact, the first microwave link was built in 1949 for that route.

September 1949 Long Lines publication regarding New York to Chicago microwave link.

Later, poor old Getco had its traders’ frustrations aired in public with the disclosure of an internal complaint regarding their internal microwave network being higher latency than a third-party network available for use. I expect that was either the McKay Bros./Quincy Data or Tradeworx people. They do good work:

McKay Bros round trip microwave latency. Optical fibre is ~12ms on same path.

I use Getco as an example here not because it is incompetent, but rather because it is very good at what it does. Even Getco, now KCG, now Virtu, as good as it is, had missteps in low-latency path development.

Wired 2012. Not so secret, hey, Michael Lewis?

If you haven’t read Michael Lewis’s “Flash Boys,” and you really shouldn’t, you may have missed the low-latency narrative centered around Spread Networks’ fibre roll-out that stitched together the book. The literary device used to end the book was the hook of a tower hosting a microwave network. This ending was left as evil hanging in the air like a brick doesn’t. To me, such narrative abuse represents some very poor journalism. Such RF links, and vendors offering them, had been widely discussed, such as in Wired (2012) (see charts at right) and the Chicago Tribune (2012). They had weighed in on the microwave discussion, publishing vendors’ names and even prices. This is a snippet from the Chicago Tribune in 2012, years earlier than “Flash Boys”:

“He [Benti] said the microwave network starts at 350 E. Cermak, ends at another telecom hotel at 165 Halsey St. in Newark, N.J., and went live in the fourth quarter of 2009.”

It was hardly a big secret, and I found the presentation in "Flash Boys" somewhat scandalous. Barksdale and Clark, whom Lewis had written a book about, "The New New Thing," are investors in Spread Networks and IEX. They remain friends of Lewis. That looks material to the objectivity, or lack thereof, of "Flash Boys."

Latency matters. Latency can be expensive. Latency technology has risks. Let’s expose some latency matters that matter.

Trading at the speed of light

Let’s meander through a little physics and then some of the history of some communication links.

The speed of light is 299,792.458 km/s, near enough to 300,000 km/s, which is how I usually round it off. This is the speed of light in a vacuum, and it is the constant that nowadays defines the metre in terms of the second. Light’s speed is commonly referred to as ‘c.’ When you force light, which is just another form of electromagnetic radiation like RF, into a medium (a fibre for light, a wire for RF), it goes a bit slower. You’ll have to remember that electrical transmission is different to photonic transmission, but related.

The speed of light in a standard optical fibre, either single mode fibre or multi-mode fibre, is around two-thirds the speed of light in a vacuum. This is normally written as 0.66c.

The atmosphere isn’t a vacuum, thankfully for life on earth, but it doesn’t slow RF, including light, much at all. It’s close enough to c that we don’t bother with a discount and just say it’s 1c.

This is why microwaves make a big difference. If you could use point-to-point transmission for the roughly 1,200 kilometres from Chicago to New York, you’d get roughly 6 milliseconds for light in standard fibre and 4 milliseconds for direct RF transmission. Two million nanoseconds of difference is quite a big difference to a trader.
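
As a back-of-the-envelope check of those figures, here is a tiny C sketch; the 1,200 km distance and the velocity factors are the same round numbers as above, not surveyed path lengths, and the snippet itself is mine, just to make the arithmetic explicit:

/* Back-of-the-envelope one-way Chicago-New York latency. The distance and
   velocity factors are the round figures from the text, not measured paths. */
#include <stdio.h>

int main(void)
{
        const double c_km_per_ms = 299792.458 / 1000.0;   /* ~299.8 km/ms */
        const double path_km = 1200.0;
        const struct { const char *medium; double vf; } media[] = {
                { "RF through air (~1c)",    1.00 },
                { "standard fibre (~0.66c)", 0.66 },
        };

        for (int i = 0; i < 2; i++) {
                double ms = path_km / (c_km_per_ms * media[i].vf);
                printf("%-26s %.2f ms one way\n", media[i].medium, ms);
        }
        /* Prints roughly 4.0 ms vs 6.1 ms: about 2 ms, i.e. two million
           nanoseconds, between RF and standard fibre on that route. */
        return 0;
}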

Spread Networks dug very straight lines and managed to beat other fibre networks to achieve the lowest fibre latency known on that link. Spread’s latency was indeed around 6ms one way, as expected. It’s a shame they didn’t fully appreciate the benefits of microwave comms on that link before they started digging. Let’s stick to just the tech for now. Here is the start of a table:

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • Twisted pair ~0.67c
  • Standard fibre ~0.66c

That’s pretty rough, and I’ve taken a few liberties which I’ll explain later. A very important and interesting thing about electrical transmission in wire is that the construction of the wire, and, perhaps even more important, the insulation, matters. Not just a bit, but a lot.

LMR-1700 coaxial cable specs. Note: The Velocity of Propagation is 0.89c.

If the “wire” was a coaxial cable, then the RF would enjoy travelling along the outside of those wires’ surfaces and burn rubber to achieve up to 0.89c [LMR-1700 low loss coax – Foam PE and 0.87c with Commscope 875 coax also with Foam PE].

Remarkably, older coaxial undersea cables used from around the 1930s might have been faster than some modern fibre cables. Not many people understand that. Then again, if you chose Neoprene as the dielectric in your wire cable, as earlier cables did, you’d only chug along at around 0.45c. The dielectric performance of the wire limits the speed. A high dielectric constant in your wire is bad news for latency. In the nineteenth century, most cables used Gutta-percha compounds which had dielectric constants in the range of 2.4 to 3.4, with more than 4 when wet, likely resulting in speeds significantly less than 0.66c. In the 1930s the coaxial submarine cables around the world started using polyethylene, which has a typical dielectric constant of 2.26, giving a speed of around 0.66c. However, the construction matters a lot. A foam polyethylene has a dielectric constant of around 1.55, resulting in typical speeds of around 0.8c.
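
(For the curious, the rule of thumb behind those numbers is that a cable's velocity factor is roughly 1/√εr, where εr is the dielectric constant of the insulation: 1/√2.26 ≈ 0.67 for solid polyethylene and 1/√1.55 ≈ 0.80 for foam polyethylene, matching the figures above, while Gutta-percha's 2.4 to 3.4 works out to only about 0.54c to 0.65c.)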

Old tech can be fun. You can get more than 0.95c out of a simple open wire ladder line – 0.95c to 0.99c is the typical range for open ladder lines. You might remember an open wire ladder line if you cast your mind back to that really old-school two parallel wire TV antenna cable with rectangular cut outs in the polyethylene webbing every inch or so. Who’d have thunk it? Ancient technology faster than fibre! Details matter.

Printed circuit board (PCB) design is both a science and an art. Standard PCB layers use a medium called FR4, basically fibreglass, which is probably the most common PCB filler. PCB transmission with FR4 is positively glacial with 0.5c typical. Various other layers, such as Rogers, are used for high-speed channel and RF design, which has different properties again and is typically faster for latency, too.

Let’s look at a revised table:

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • Open wire ladder ~0.95c
  • Coaxial cable ~0.8c
  • Twisted pair ~0.67c
  • Standard fibre ~0.66c
  • PCB FR4 ~0.5c

Now you can probably imagine building a twisted pair cable that is a bit rounder, more like coax, and not so flat, less of a PCB, and that cable may be a little faster. So the revised cable might be faster than light over fibre. Again, details matter.

There are standards for CAT twisted pair cables. Those standards also specify minimum propagation speeds and variations within the cable. For example, here is the standard specification for Cat-6 cable:

The velocities are minimums, so don’t panic yet about the 0.585c to 0.621c. If I look at the specifications for a couple of real-world cables, I see Draka SuperCat 5/5E at 0.64c, and Prysmian M@XLAN cat 5E/6/6A cables claim 0.67c. These are specification claims, not guarantees. Siemon reports in its cable guide:

“NVP varies according to the dielectric materials used in the cable and is expressed as a percentage of the speed of light. For example, most category 5 polyethylene (FRPE) constructions have NVP ranges from 0.65c to 0.70c. ... Teflon (FEP) cable constructions range from 0.69c to 0.73c, whereas cables made of PVC are in the 0.60c to 0.64c range.”

Those electrons must find Teflon just as slippery as we do.

The other notable thing in the cable specifications is the maximum delay skew. That refers to the fact that the different wires in the cables may propagate faster or slower. In an old-school coaxial submarine cable, the core wires could be 5% shorter than the outer wires. That is a big deal. In a twisted pair CAT 6, the 45ns skew per 100m could be around 8% variation within the wire. This can matter a good deal, as you may only be as fast as your slowest-arriving bit.
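
As a rough check of that 8% figure: 100 m at 0.6c takes about 100 / (0.6 × 299,792,458 m/s) ≈ 556 ns, so a 45 ns skew is indeed roughly 8% of the propagation time over that run.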

Can you quite believe you are still reading about cables and propagation? This is the kind of detail a good trader may have to worry about.

My friends at Metamako measured some common fibre and wire cables using their latency measuring MetaApp. They found the copper cables they tested were indeed faster than the fibre ones by just a bit. I’ve reproduced Metamako’s chart below with permission:

Metamako cable comparison: ‘Copper is faster than fibre!’

This chart comes from the following data:

Here, copper is faster than fibre. The direct-attach copper cables come in at 4.60ns per metre and single mode fibre at 4.95ns per metre, with multi-mode fibre at 4.98 ns per metre. You always have to be careful looking at this, as it is not just about transmission but also about the latency cost of amplifying, cleaning, and propagating the signal. Notably, the fibre has a little more endpoint overhead as you can see a larger constant in the fibre equations’ fits.
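
Converting those slopes back into velocity factors: 4.60 ns per metre is about 0.217 m/ns, or roughly 0.73c for the copper, while 4.95 to 4.98 ns per metre works out to about 0.67c for the fibres, consistent with the ~0.66c figure for standard fibre in the table above.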

As you now know, cables can vary a great deal, so caveat emptor.

Another note about cables is that often longer copper interfaced cables in the data center aren’t really copper but active optical cables (AOC). Such cables transmute the electrical signal into optical and back to electrical (EOE) as part of the cable to improve range. The constants, especially from media changes, can matter in these equations. For example, with 10G Ethernet over CAT-6, you might use a nice Teflon cable and expect some fast propagation. You will be disappointed to learn the 10G twisted pair codec is really twisted and likely to cost you microseconds – yes, thousands of nanoseconds – before you even get onto the cable. Whilst the rise time of a 10G laser in an SFP+ may be less than 30 picoseconds, organizing the rise time from the electrical signal takes some gymnastics, even if quite a bit less than a 10G twisted pair. A fast cable, or plane, will not always help if your boarding procedures are slow.

There are some more obscure and exciting cables, such as few-mode fibre, that we’ll save for later.

A little comms history

Let’s segue and meander through a bit of history.

Did you know Great Britain’s Pound is commonly called “cable” in trading and financial circles?

This is because when that first cross-Atlantic telegraph cable briefly sprang into life in 1858, information sped up. An obvious and important use was for trading and financial information, hence the US Dollar to Great British Pound cross rate became colloquially named after its primary Atlantic transmission medium. The nickname underscores the importance of cable.

When did these trade latency wars start? Perhaps thousands of years ago, but certainly hundreds of years ago. There are records relating to coffee merchants in Africa and the Middle East suggesting a trader knowing quickly about production in Africa could make significant profits in the Middle East. Kipp Rogers pointed me to a letter from a silk merchant from around 1066 worrying about time being wasted waiting for such tradeable information:

“The price in Ramle of the Cyprus silk, which I carry with me, is 2 dinars per little pound. Please inform me of its price and advise me whether I should sell it here or carry it with me to you in Misr (Fustat), in case it is fetching a good price there. By God, answer me quickly, I have no other business here in Ramle except awaiting answers to my letters. … I need not stress the urgency of a reply concerning the price of silk.”

Latency trades are no “New New Thing.”

You’ve probably heard the stories of Rothschild’s consol trade, where news of Waterloo was transmitted to London, most likely by fast boats rather than by the pigeons he is famous for, earning a considerable profit. Reuter’s empire was started by being at the crossroads of information flow, to participate in and speed up information flows. Alexandre Laumonier pointed out to me the old semaphore and light signaling used by the French as an early optical network. It’s also fun to know there were various frauds, delays, and covert embeddings in early semaphore and telegraph networks with profit motives, even making it into the tale of "The Count of Monte Cristo." Chappe’s optical telegraph in France covered 556 stations with 4,800 kilometres of 1c transmission media, the air between the stations, from 1792.

Speaking of Chappe’s optical telegraph, you may find it intriguing that even in the 1830s stock market speculators were abusing communications for profit:

“On another topic, and like Internet outstripping the lawmakers, optical telegraph asked for new laws and regulations: a fraud related to the introduction, into regular messages, of information about the stock market, was discovered in 1836. The telegraph operators involved were using steganography (a specific pattern of errors added to the message) to transmit the stock market data to Bordeaux. In 1837, they were tried but acquitted, having not violated any explicit law. The criminals were one step ahead.”

Many people argue that High-Frequency Trading (HFT) is a new phenomenon, perhaps as little as a decade in age. Some argue it goes back to the 1980s. The wise Kipp Rogers also passed on a nice book reference to me which noted HFT, in the modern sense, from 1964’s Bankers Monthly, Volume 81, page 49:

“This is an important aspect of bank stocks and leads us to a clearer view of the market. To begin, let’s define a broad line of bank stock house as one range. There are few professionals who will insist that high-frequency trading occurs in more than 20 bank stock names.”

The use in the text quite clearly talks of it in a style that suggests common usage, so perhaps the term is decades older? HFT is not a “New New Thing.”

Indeed, there is not much new under the sun, even in clichés. The power of compound interest was argued in cuneiform some 5,000 years ago. The code of Hammurabi dealt with trade and liability, amongst many other things, more than 3,800 years ago. Your friendly Ancient Greek philosopher, Thales, was challenged to show how philosophy could be practical in a financial sense, so he made money in times BC by using options on presses to leverage olive forecasts and cornered the market. Ancient Rome used corporate structures.

Just as HFT is probably older than you and I think, history shows the importance of latency is also not a “New New Thing.”

Retransmission

One of the problems with the old semaphores and telegraphs is that humans were used as repeaters. The early telegraph couldn’t cross the US continent with its electrics and thus people rekeyed the messages. Semaphore networks are optical and transmit at the speed of light, but the on-boarding, off-boarding, and retransmission of messages relies on people, flags, and the like. Such retransmissions were not measured in picoseconds.

This is also an issue for modern microwave networks. Lower cost microwave or millimetre wave devices often have tens or hundreds of microseconds in their on-boarding and retransmission latencies. For much of the world, wasting a few microseconds is not so much of a big deal. The telecom carriers are usually more concerned with bandwidth as their optimization point.

The very best Chicago-to-New York links have single-digit microsecond differences between them through aggressive path and device optimizations, so retransmission and on-boarding latency is a big deal. One way of cutting down latency, or of making devices simpler, is to talk to the device with a signal it understands, cutting out any unnecessary conversions. This led to radio over fibre (RoF), where the RF signal is represented directly in the fibre to feed microwaves. A significant development, more relevant to the Chicago – New York and London – Frankfurt links, was clever repeaters that analyze the signal and only minimally process it if it is of sufficient quality, rather than requiring a full digital cleansing, or clock data recovery (CDR). Such repeating takes nanoseconds instead of microseconds. Most microwave traders now use such repeaters.

The first trans-Atlantic telegraph cable in 1858 was a stupendously expensive and brave undertaking that only briefly worked. Cables had been used across water before, with the English Channel being crossed in 1851, but nothing quite so ambitious as a whole ocean. The Atlantic cable briefly worked thanks to sensitive receivers rather than by an understanding of amplification or repeating. That came later. Brave souls, newer cables, amplification, and repeating drove improvements and commerce to an ever-increasing frenzy. Messages were very expensive to send, but the financial world became a virtually instant world to onlooking humans in the 1850s. The path for the rise of the machines was laid.

Satellites

Geostationary communication satellites came to enable anywhere-to-anywhere communication covering the entire planet. Telephone systems and faxes were hooked up. If you are old like me and have talked on an old telephone link, you’ll remember the satellites’ biggest problem – the nasty delays inherent to the lines. Hearing your own voice delayed, or just an awkwardly pausing conversation, would drive you nuts. Latency was the issue.

Geostationary orbit is a high orbit. A really, really high orbit. The circumference of the world is about 40,000 kilometres. Using C = 2πr you can work out that the distance from the center of the earth to the surface is about 6,400 kilometres. A geostationary orbit is 42,164 kilometres from the center of the earth. More than six times the Earth’s radius. Pause and visualize that for a second. A tiny, little satellite dot far from Earth. Several Earths away. That’s a long way. So sending a signal to a satellite and receiving it somewhere else is nearly the same as going around the Earth’s equator twice! That geostationary 72,000 km journey – there and back again, in Hobbit speak – at the speed of light is around 240 milliseconds. Add some processing overhead and you get a very annoying delay. Geostationary satellites suck for latency. Don’t use satellites for trading. That is why we have a bunch of spaghetti surrounding the earth in the guise of under-sea cables. Latency begone.

Current submarine cabling.

Microwave

Microwave has been around longer than many people think. A microwave link was put over the English Channel in 1931. Today’s HFTs are fighting over space at Richborough to make straighter lines with taller towers and fewer repeater hops over similar ground. The first Chicago-to-New York link was created with 34 hops in September 1949, just beating Spread Networks’ straighter fibre by some 60 years and two milliseconds.

President Truman made a USA coast-to-coast microwave TV transmission in September 1951 after it was opened for telephone use in August.

Microwave is not a new thing, and don’t let Michael Lewis lead you astray into thinking otherwise.

In my homeland of Tasmania, some amateurs set a record using standard astronomy telescopes on a couple of mountains with more than 100 miles between them to modulate a voice call.

This example shows that whilst light transmission, including lasers, typically has less of a range than microwave, it doesn’t always have to be that way if you want to get creative. Microwave bandwidth and distances are continually improving in leaps and bounds.

RF, such as microwave, does not have to be hideously expensive. My Toronto-to-NY regular old telco cable link for the interlisted arb was, from memory, about $15k rent per month. With microwave, you could buy a couple of end points for your link and then you don’t have to pay the recurring costs for cables. However, the towers and real estate become an expensive proposition, which grows as the links get longer and the repeater count goes up. Lots of HFTs are fighting over similar paths and towers and there is a certain element of land and license grabbing that takes place. Alexandre Laumonier, via his blog, SniperInMahwah, has been documenting such links in Europe and the recent battle to get a couple of large towers approved in Richborough in the UK for the channel crossing. It’s an expensive game when you want to build large towers.

The actual microwave RF bit is expensive, but it’s not outrageous. The real-estate access and towers can be very expensive. All the HFTs have been knocking each other around with one-upmanship to gain an edge. Public spectrum and council records, such as those used by Sniper, make it difficult to use shell games to hide your capabilities. In that spirit, a recently announced consortium of sorts is joining forces to create a “Go West” project to share the burden, as they know they’ll just compete each other out to create very similar links. That makes a good deal of sense, as the technical cost is nearly out of control with such networks due to the land and towers. IMC has invested in McKay Bros. to facilitate improvements to their networks for the benefit of all traders. Tower Research has joined them. KCG and Jump Trading also work together as New Line Networks. These joint venture approaches are sensible cry outs to the gods of cost control. HFTs are realizing that being fastest is not so good if you can’t afford a trade’s transaction cost, including your depreciation.

Radio is light is radio

Marconi was obsessed with crossing the Atlantic with radio. He succeeded in a bit of a scary way. He basically built a huge amplifier that generated enough of a current, or spark, to bludgeon his way across the Atlantic with brute force.

Marconi using a kite to lift his antenna to 150m for the first Atlantic transmission in 1901.

Click. Kaboom.

What frequency was it?

All of them!

Well, pretty much. Perhaps around 850kHz. That spark gap transmitter was quite quickly replaced with more nuanced hardware so that different people could use different frequencies and thus the planet was not just restricted to one giant broadcaster. We then found that certain frequencies, 2-70MHz, the high frequency or HF band, would bounce around the world thanks to the Ionosphere acting as a bit of a trampoline, sometimes, to those frequencies. Shortwave radio is not so popular anymore but still active, even for number stations.

Microwave and millimetre radio is a bit of a misnomer. Microwave, in the normal literature, actually covers frequencies from 300MHz to 300GHz, or wavelengths from 1 metre down to 1 millimetre. Millimetre bands are part of the microwave spectrum and do indeed have wavelengths of millimetres. I find it a little interesting that microwaves don’t have micrometre wavelengths; micrometre wavelengths are part of the infrared. Typical microwave networks are in the 2GHz to 7GHz range. 60GHz is a popular, usually free, millimetre wave band. It is free as the atmosphere kicks it around and limits its usefulness. Many countries have lightweight regulations for 80GHz links so you can use them more easily. There is much for the trader to choose from.

Light is radio. Radio is light. The wavelength for your standard data center fibre light over MMF is 850nm, or ~350 THz. In the data center, 1310 nm is typically used over single-mode fibre. 1550 nm is often used for longer-distance links thanks to its kind transmission properties in long strands of fibre. Note that visible light is usually considered to range from violet at around 380nm/789THz to red at around 750nm/400THz. The common data center light borders those visible frequencies. When we put different colors of light, or wavelengths, onto a single fibre, we call it Wavelength Division Multiplexing (WDM), which is a complicated way of saying a pretty rainbow.

We have standards for the colors, sometimes called channels, so we can talk to each other thanks to the International Telecommunication Union. We mix those colors up and separate them out after to make better use of the holes we dig in the ground or sea. If the colors are close together we call it Dense WDM (DWDM), and when the colors have a bit more space and there are fewer of them, it is called Coarse WDM (CWDM). Fancy names for pretty simple stuff. International trading is powered by rainbows, literally.

Often the photo sensor receivers are wide ranging enough to allow just about any frequency to trigger them, which can make your network design a little easier. We can often interchange short run MMF and SMF cables without noticing too much, as they mainly make a difference in the large runs or over n-way splits. SMF cables used to be expensive cables but they aren’t too different in cost to MMF today in the volume a trader may buy them.

Erbium-doped fibre takes a little light injection and reinvigorates the existing light signal as it travels, which is pretty clever. There is no real latency cost here if you consider the erbium-doped length as part of the cable. You need a bit of distance in the doped fibre, so this is a slow way of doing amplification if you only have short cables, such as in a trading co-lo facility. For short distances, you’ll be better off doing OEO, which is not so different to the EOE we talked about with AOC cables.

Faster fibre

There has been some interesting work on making much faster fibre cables. The idea had its seed in thinking about point-to-point laser systems, often called free space optics, that have been used for links, including for HFT in New Jersey. Imagine you do your lasering underground. Carefully add some mirrors to bounce the lasers around. That is not too far from the concept of a Hollow Core Fibre (HCF) or Few-Mode Fibre (FMF). HCF speeds are around 0.997c. Pretty good, no? So why aren’t they everywhere?

Cost has been an issue. I looked recently and cables were about $500 a metre. Yikes! Perhaps HCF cables are cheaper now or in bulk. HCF repeating is an issue as the signal dissipates pretty quickly. That is, the mirror bouncy wouncy timey wimey thing, to paraphrase The Doctor, is not so super-efficient. Attenuation kills. We are used to having big distances between our fibre repeaters in modern times. Strangely enough, though, the HCF repeater requirements are not too different to the old coaxial-cable requirements in terms of spacing. Hmmm, perhaps expensive long distance HCF is really possible for a trader if we go back to the future? A demonstration a couple of years ago changed this thinking when a greater than 1000km repeating FMF cable was demonstrated in Europe. Perhaps we’ll see Spread Networks replace its Chicago-to-New York link’s SMF with HCF?

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • RF (including laser) ~1c
  • Hollow-core fibre ~0.997c
  • Open wire ladder ~0.95-0.99c
  • Coaxial cable ~0.45-0.89c
  • Twisted pair ~0.58-0.75c
  • Standard fibre ~0.66c
  • PCB FR4 ~0.5c

HFT radio tuning

A few years ago now, the small HFT I founded bought a couple of the original Ettus Research FPGA GNU Radio boxes to play with. We got a little RF signal to go a short distance in the room. Well, they were sitting on the same workbench. Digital FPGA pin in to digital FPGA pin out was 880ns on the oscilloscope. That’s pretty fast. The experiment was to see what kind of overhead the RF stack, including the IF, encoding, MAC, etc., was causing. This experiment showed that with such modern software-defined radio (SDR), this kind of RF comms hackery has become wide open to all types and sizes of trading firms.

Why doesn’t an HFT just use a HAM radio to send a signal across the Atlantic to compete? Well, maybe they are. If HFTs are, it definitely requires some custom thinking, as commodity appliances for this do not exist with the right characteristics. The MIL-spec stuff, say for non-satellite warship communication, may use HF radio but the packets can take seconds to get through. Ouch. That is slow. Why is it so slow?

Email on warships with HF is slow because the MIL-spec packets are heavily encoded with error correction and are spread out over time to handle disturbances. Now Marconi didn’t do this. His brute force grunt was sent at 1c over the horizon with little processing overhead. Click. Kaboom. It may be possible to spatially encode a signal instead of doing the redundancy over time to instantly deliver small messages from continent to continent that are leading-edge triggered. HF Multiple-In-Multiple-Out (MIMO) may also be a thing for those purposes. Just as you have little groups of MIMO antennae with centimetre, or so, spacing on the back of your wi-fi router, HF MIMO can have groups of antennae doing their thing. However, even though the research I’ve seen looks promising, their encoding was still slower than a terrestrial equivalent. One experiment was going from Europe to the Canary Islands, but even though the net result for the experiment is encouraging, it was still slower than cable speed due to those pesky encoding and hardware overheads. Such speed was not the point in that particular case, though. Just getting HF MIMO working is quite a feat. There is much potential to explore this area even if HF MIMO has somewhat huge spacing for the antennae. Awkwardly, spacing for HF MIMO antennas is not measured in centimetres but hundreds or thousands of metres. The antennas don’t fit in my workshop but they may fit in your trading farm down in Cornwall.

Another RF alternative is line of sight with balloons (Google’s Loon) or planes (Facebook and Google). This is not new. Balloons were used for their height advantage as far back as the American Civil War, and more recently a US company used cheapish balloons to carry small transponders that helped track trucks and other freight in the US, mainly in the South. To keep costs under control, if you found a fallen balloon, read the plaque, and sent it back to the company, you got a reward. That way they kept recycling RF stations. That same company was also awarded a DoD contract for enlarging the RF footprint in Iraq via balloons. Google’s RF balloon trials in New Zealand have been working well. RF balloon comms are no longer a “New, New Thing.”

Before I knew all of this, in the dark annals of history, I was interested in looking at the Toronto, Chicago, NJ triangle to see what height might be practical for direct line of sight. The Toronto to New York distance is about 800km. For that distance, platforms at each end would have to sit at around 12,500 metres to see each other. Not so different if you just put a balloon in the middle. YouTube tells me this is clearly possible and not so high if you consider all of the high school kids sending weather balloons 100,000 feet up to get pretty pictures and videos of the curvature of the Earth. If The Register can send their Lego man up in a paper airplane to such heights, surely an HFT can do something cute too?
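
As a sanity check on that 12,500 metre figure, here is a rough Go sketch using the simple spherical-Earth horizon formula d ≈ √(2Rh), ignoring atmospheric refraction; the 800km path length is the assumption from above.

package main

import (
    "fmt"
    "math"
)

func main() {
    const earthRadiusM = 6371000.0
    const pathM = 800000.0 // assumed Toronto to New York distance

    // Horizon distance from height h (no refraction): d ≈ sqrt(2*R*h).
    // Two equal platforms must each cover half the path, so h = (path/2)^2 / (2R).
    half := pathM / 2
    h := (half * half) / (2 * earthRadiusM)
    fmt.Printf("Required platform height at each end: ~%.0f m\n", h)

    // Cross-check: how far can a platform at 12,500 m see?
    d := math.Sqrt(2 * earthRadiusM * 12500)
    fmt.Printf("Horizon from 12,500 m: ~%.0f km\n", d/1000)
}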

The Register’s Paper Aircraft Released Into Space (PARIS).

It is also worth considering HF radio bouncing around the ionosphere. A relatively small transmitter can cover the entire planet. The ionosphere’s bounce height varies widely. For simplicity, let’s assume it is at 60km and plug it into a bounce-path equation for a long link: the total distance variation is surprisingly small. That is, an HF signal bouncing its way from NY to Tokyo doesn’t add that much to the total distance, due to the shallow angles involved. If you can find a way to encode the signal sufficiently well, your trading latency could be onto a winner.
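
Here is a crude flat-chord sketch in Go of how little those bounces add. The ~10,800km NY-to-Tokyo ground distance, the 60km reflection height, and the four-hop split are all assumptions for illustration.

package main

import (
    "fmt"
    "math"
)

func main() {
    const groundKm = 10800.0 // assumed NY-Tokyo great-circle distance
    const bounceKm = 60.0    // reflection height assumed above
    const hops = 4.0         // assumed number of ionospheric hops
    const c = 299792.458     // km/s

    hopGround := groundKm / hops
    // Treat each hop as two straight chords up to the reflection point.
    hopPath := 2 * math.Sqrt(math.Pow(hopGround/2, 2)+bounceKm*bounceKm)
    extraKm := hopPath*hops - groundKm
    extraMs := extraKm / c * 1000

    fmt.Printf("Extra path: ~%.1f km (~%.3f ms, ~%.2f%% longer)\n",
        extraKm, extraMs, extraKm/groundKm*100)
}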

For some years, there has been talk of using balloons across the Atlantic for trading. Hobbyists with model airplanes have flown the distance. Maybe you could use a continuous line of UAVs to act as relays? Now that would be a fun project. An HCF undersea cable seems more practical.

LEO Satellites

There was an article in the Wall Street Journal about LeoSat recruiting HFTs for low-latency links. The WSJ reported one HFT taker. We saw previously that height is an issue for geostationary satellites, as the latency is a killer. So how low is low for LEO? Is the height a latency killer? The company is planning on laser-based comms between the satellites, but you still have to get up there. Low cannot be too low for satellites, as otherwise there is a bit of pull and drag that sucks them into the atmosphere to a fiery death. LEO orbits start at 160km, might be 300km, but really need to be 600km, or higher, to last a while. That is, it is usually better to have a bit more height and live longer. O3b medium orbit satellites sit at 8,000km. Iridium LEO satellites are at 780km.

The WSJ article reported LeoSat could do Tokyo to NY in less than 130ms, which LeoSat claimed was twice as fast as existing 260ms links. This claim rings a little hollow, as publicly known Chicago-Tokyo links are already similar in speed to the figure quoted by LeoSat. Hibernia offers a link from JPX in Tokyo to the CME at Aurora, Chicago, at 121.190ms, and we know Chicago to NY is just under 4ms with current offerings such as McKay Bros. 130ms from LeoSat is already not competitive. The article quoted the company as saying satellite-to-ground latency was 20ms. It’s not clear if that is one way or a round trip, but either way it is well above the roughly 2ms light-speed equivalent of 600km. It’s not fast. Low orbit, not so low latency in this case, yet.
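
For context on those quoted figures, here is a back-of-envelope Go sketch of the physical floor for a Tokyo-to-New York round trip. The ~10,850km great-circle distance and the single up/down leg per direction are assumptions for illustration, not LeoSat’s actual design.

package main

import "fmt"

func main() {
    const c = 299792.458          // km/s
    const greatCircleKm = 10850.0 // assumed Tokyo to New York distance
    const leoAltitudeKm = 600.0

    vacuumRTTms := 2 * greatCircleKm / c * 1000
    // Assume one up-leg and one down-leg per direction, with inter-satellite
    // hops staying at roughly the same altitude.
    leoRTTms := 2 * (greatCircleKm + 2*leoAltitudeKm) / c * 1000

    fmt.Printf("Vacuum round trip along the surface: ~%.0f ms\n", vacuumRTTms)
    fmt.Printf("LEO floor with 600 km up/down legs:  ~%.0f ms\n", leoRTTms)
    // Compare with the ~121 ms Tokyo-Chicago plus ~4 ms Chicago-NY terrestrial
    // figures, and LeoSat's quoted <130 ms.
}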

Neutrinos and long waves

I hope this has provided some colour on the thoughts a trader may have about trading links.

I’ll leave you with one further thought: neutrinos. Hold up your hand. You have hundreds of billions of neutrinos travelling through it each second. Around 65 billion solar neutrinos pass through each square centimetre on Earth every second. Trillions are passing through your entire body. Near the South Pole there is a cubic kilometre of clearish ice with special sensors hot-drilled in. They lie in wait for neutrinos travelling through the Earth from the North Pole. Those neutrinos occasionally – very, very rarely – bump into something and provide a little blue flash.

IceCube: South Pole Neutrino Observatory

A trader might think: Why go around the Earth or its crust when you can go through it? Nice.

Remember the fuss that started in September 2011 regarding neutrinos travelling faster than light as part of the European OPERA experiment? It was thrown out to the community to solve the puzzle. Eventually it was traced to a measurement error. To me the interesting part was that someone was firing neutrinos from Geneva, Switzerland, to Gran Sasso in Italy, through the planet, and detecting them! Neutrino communication is already a thing. You need to send an awful lot of neutrinos to get a lucky hit, so message lengths would be short and delivery times probabilistically long; but you gotta start somewhere. Don’t let the detector’s required 300,000 bricks weighing 8.3kg per brick daunt you. What’s 300–400 GeV between friends? Who wants to build and improve on a few tonnes of neutrino detection for HFT?

Submarines can use long waves for water-penetrating RF comms. Slow packets with big waves. There are patent papers for turning a whole submarine into a neutrino detector for comms or navigation as an alternative to long waves. Would it work? It seems very unlikely but, tantalizingly, not completely crazy. A few hundred tonnes would not be a problem for a submariner. Such answers are beyond my pay grade, but an HFT has gotta ask.

What about ground-penetrating long waves? Long waves are slow, as they are very long in metres; you have to wait a long time for your bits. Then again, I do remember when sub-wavelength imaging was thought to be “proven” impossible. Super-resolution imaging came along despite rigorous math suggesting ye olde wavelength-limiting thingamabobs prevented us diving deeper. We can now see molecules and atoms inside cells by thinking a little outside that wavelength-limiting box. That is, in a short while, the impossible became possible. The neural-network community has weathered two large winters of more than a decade each to survive as a bright deep learning star doing the seemingly impossible, despite what Minsky and Papert had you believe in the 1960s. Scientific winters sometimes pass. So you never know. Perhaps long waves that hug the Earth’s curves and penetrate water and soil can do something sub-wavelength, signalling that the British Pound is a buy with one or two bits of secret signal, and a new “cable” for cable may be born? Maybe some kind of neutrino or neutrino-like particle can be practically enabled. There is a good movie in there somewhere for Matt Damon to follow up on.

Final word

I’m not holding my breath for the “Go Through” consortium, or cartel, to replace today’s “Go West” venture. That said, I’d be surprised but not shocked. Once Musk gets his Mars trading outpost functioning, I hope we don’t repeat the mistakes of the past and build too much duplicate infrastructure for trading Martian Renminbi against the Earth’s Rupiah.

Back in our real world: UAV-based comms, hollow core fibres, and HF-based HFT low-latency signaling may be happening whether you like it or not. Learn from Getco and don’t buy into the “New New Thing” that is just another Spread Networks. Be aware and beware.

This article originally was published on Matt Hurd’s blog, Meanderful.


Motel Living and Slowly Dying

BY TRADE AND BY SELF-IDENTITY, I am a novelist. But to keep the groceries coming, I am also an oil pipeline worker. They call me a “pig tracker,” which means I monitor the location of cleaning and diagnostic tools traveling through pipelines, and when I’m not in the field, I’m in a hotel somewhere along the line, sleeping my way toward my next shift.

The particular rhythms of what I do — track the pig in its journey beneath the prairies, hand off the job to my counterpart on the other shift, find a hotel near where I’ll rejoin the line, sleep, lather, rinse, repeat — have made me something of an unintentional expert on hotel living and on the America nobody dreams about seeing on vacation.

I travel by secondary and tertiary roads, skulking around the pipeline on 12-hour shifts, either midnight to noon or noon to midnight. I work alone, mostly. And when the shift is done, I catch my rest in places like Harrisonville, Missouri, and Iola, Kansas. Lapeer, Michigan, and Amherst, New York. Toledo, Ohio, and Thief River Falls, Minnesota. I’ve learned that Super 8s are not always super, and Comfort Inns sometimes afflict the comfortable. I rack up IHG points and Wyndham Rewards and Choice Privileges. I may never have to pay for a personal car rental again, so fulsome are my Enterprise points.

Sometimes I lose track of what day it is, or when, exactly, I’m going home again. But the places where I set my head stand ready to reorient me with a comfortable sameness. There’s the antiseptic smell of a well-cleaned lobby, the paper coffee cups in my room wrapped in plastic, the rattle and hum of the air-conditioning unit. When I arrive in the wee hours, the lonely night auditor is often all too happy to talk. When I leave at 11 p.m. for my next night shift, I often have to talk the clerk through my reasons for arriving and departing in the same 12-hour period. Twenty-four hours a day, there’s coffee of varying age and quality.

I’ve come to value the simple things: a clean room, reliable hot water, and a staff that respects a do-not-disturb sign. And learned the sublime wisdom of a song called “My Favourite Chords” by a Canadian band called the Weakerthans. 

I want to fall asleep / to the beat of you breathing / in a room near a truck stop / on a highway somewhere …

The words are a concise demonstration of language’s power to inspire cinema in our heads. They also form a picture of my life. Because the truth is, I’ve been living in motels since I was a child.

¤

My father was an exploratory well digger, and I traveled with him every summer through the American West, far off the interstates, in a nomadic way that was worlds different from my life at home with my mother in Fort Worth, Texas.

In the summer of 1981, when I was 11 years old, I lived with him and my then-stepmother in a bottom-floor room at the Park Plaza Motel in Sidney, Montana. Dad was working near Watford City, North Dakota, about 50 miles east, but he used Montana and its lack of a sales tax as a home base. I’d ride out to the fields with him during the day, peeling around on my motorcycle, then return with him and his crew to Sidney in the evening, reuniting with my stepmother. The three of us would have dinner, watch some TV — it was the summer of Fernando Valenzuela’s miracle stint with the Los Angeles Dodgers — and then start the cycle again.

I catalog my boyhood summers by where dad was working at the time — Elko, Nevada (1976); Baggs, Wyoming (1977); Montpelier, Idaho (1978); and so on. But that particular summer in Sidney stands out from the others in all sorts of ways. For one thing, the festering anger between dad and his wife, making their second failed attempt at marriage, was too near and threatening. I tried to escape by hanging out with dad’s helpers when I could. Once those guys helped me score a bag of Beechnut, which I chomped happily until I got desperately sick. They teased me, almost to the point of cruelty but not so badly that they’d provoke dad. They introduced me to Aerosmith. So, yeah, it had its upsides.

I returned to Montana at age 36 for the last leg of my career as a journalist. My first wife was a woman who grew up just north of Sidney, and thus the next several years were spent driving into and out of that town, passing the Park Plaza coming and going, watching it age right along with me. And now in this era of the Bakken Shale, my frequent trips to pipeline jobs in North Dakota — Williston, Minot, points beyond and between — also carry me through Sidney. Each one sends the summer of 1981 a little deeper into the memories and brings the place I know now a little more to the fore.

It’s still oil-soaked. Still a little rough-edged. Still pleasant in its own way. Probably still a place where a little guy can swallow too much tobacco juice if he’s not careful.

Some people have Paris. I have Sidney. I’m okay with that.

¤

My life in motels doesn’t bear much resemblance to what I’ve read or seen; it’s too ordinary, too predictable. I’m not Humbert Humbert, dragging his Lolita through the West, a step ahead of Clare Quilty. I’ve never met someone like Juan Chicoy, the impromptu innkeeper from John Steinbeck’s The Wayward Bus, or the precocious little girl Moonee and her crazy mom from the recent movie about permanent motel existence, The Florida Project. My stays are straight credit-card transactions, reimbursed by my employer, and generally last just a few hours before I move along again. My intimacy with these places runs no deeper than: “Welcome back, Mr. Lancaster.”

There is, of course, a darker, seemingly hopeless side to these homes away from home. In left-behind precincts of cities and towns — indeed, on the main drag that connects my comfortable suburban neighborhood in Billings, Montana, with downtown — you can find bedraggled motels where single-room occupancy often means a family of five sharing a sink, a shower, and maybe a kitchenette. These are the working destitute, or the pensioned-off. These folks are able to scrape together several hundred dollars for rent, but not the first and last months’ rent, the water deposit, and the credit score required for a less expensive apartment, let alone the three percent down on an FHA mortgage.

We tend to think of homelessness in terms of cardboard boxes on street grates and cars that double as living spaces, but that’s a small aperture of the overall problem. These past-their-glory motels house people who work hard — and who face crushing odds of ever getting ahead of their circumstances. And as we run the average rents in places like Seattle and San Francisco to the stratosphere, without an attendant increase in affordable housing, we’re falling deeper into crisis.

In Billings, where I live when I’m not in a motel, we have 110,000 people, and 621 of them are homeless kids in the public school system. That’s the total from the most recent full school year. Of those 600-plus, 104 live the peculiar form of it at motels with names like the Lazy K-T. By any measure, it’s a shameful number. Teachers and administrators at the schools write grant proposals for supplemental breakfast programs. My friends who oversee classrooms have mastered the subtle art of pulling a kid aside and, without shaming him, learning whether he has a winter coat. When the answer is no, they find a way to get him one.

Elizabeth Lloyd Fladung, who has photographed American families on the margins for the past two decades, told The Nation in 2015: “The sight of these iconic structures now serving as home to scores of destitute people who don’t seem to have any chance at the American Dream really shows just how little infrastructure there is to help poor people in need, and how much damage decades of wage stagnation has done.”

It’s the what-might-have-been scenario for my father, the formerly nouveau riche exploratory driller whose fortunes crashed in 1983 along with that wave of the oil economy. He’s now on Social Security and a small VA disability, mostly blind, although stubbornly semi-independent. A decade ago, I persuaded him to move closer to me, found him a one-bedroom condo he could afford, and have helped him when I’m able and when he’s needed it. I drive him to doctor’s appointments and weekly grocery store visits. When I’m at home, I spend an hour or so with him a day, watching TV, playing backgammon, listening in the rare instances when he wants to talk.

A good deal of my mental energy is expended on making sure I see him off this mortal coil with love, and on hoping nothing happens to me before he’s gone.

¤

I’m writing this from a Microtel in Williston, North Dakota, on a pleasant mid-October day when the air carries a hint of what’s coming to the northern corridor. I’m here for a pipeline run, of course, and I’m feeling the uncertainty of what’s ahead. I might be here three days and then drive back home, five hours away, before returning next week. But there’s talk of stringing together the three jobs ahead of us into one unchecked block of work. If that’s the case, I’ll be here 10 days, maybe 12. Fourteen hours ago, I kissed my wife back in Billings and said, “I’ll be back when I’m back.” The gaping unknown is something we’ve learned to manage.

However long I’m here, the Microtel is a humane enough facsimile of home. The room rate includes a hot breakfast, fresh cookies await on the front desk in the afternoon, the towels are plentiful and clean, and the bed is comfortable. I have a generous per diem. If I take care of my responsibilities out in the field and drive prudently three or 12 days from now, I’ll make it back safely to my wife, my cat and dogs, and my father. And that’s the whole point. The Microtel is where I live when I’m here. But I’m going to leave it behind, shed it like a skin, and I don’t ever want the impermanence to be permanent.

¤

Craig Lancaster is the author of seven novels, including the recently published Julep Street.


IPvlan overlay-free Kubernetes Networking in AWS

Lyft is pleased to announce the initial open source release of our IPvlan-based CNI networking stack for running Kubernetes at scale in AWS.

cni-ipvlan-vpc-k8s provides a set of CNI and IPAM plugins implementing a simple, fast, and low latency networking stack for running Kubernetes within Virtual Private Clouds (VPCs) on AWS.

Background

Today Lyft runs in AWS with Envoy as our service mesh but without using containers in production. We use a home-grown, somewhat bespoke stack to deploy our microservice architecture onto service-assigned EC2 instances with auto-scaling groups that dynamically scale instances based on load.

While this architecture has served us well for a number of years, there are significant benefits to moving toward a reliable and scalable open source container orchestration system. Given our previous work with Google and IBM to bring Envoy to Kubernetes, it should be no surprise that we’re rapidly moving Lyft’s base infrastructure substrate to Kubernetes.

We’re handling this change as a two phase migration — initially deploying Kubernetes clusters for native Kubernetes applications such as TensorFlow and Apache Flink, followed by a migration of Lyft-native microservices where Envoy is used to unify a mesh that spans both the legacy infrastructure as well as Lyft services running on Kubernetes. It’s critical that both Kubernetes-native services as well as Lyft-native services be able to communicate and share data as first class citizens. Networking these environments together must be low latency, high throughput, and easy to debug if issues arise.

Kubernetes networking in AWS: historically a study in tradeoffs

Deploying Kubernetes at scale on AWS is not a simple or straightforward task. While much work in the community has been done to easily and quickly spin up small clusters in AWS, until recently, there hasn’t been an immediate and obvious path to mapping Kubernetes networking requirements onto AWS VPC network primitives.

The simplest path meeting Kubernetes’ network requirement is to assign a /24 subnet to every node, providing an excess of the 110 Pod IPs needed to reach the default maximum of schedulable Pods per node. As nodes join and leave the cluster, a central VPC route table is updated. Unfortunately, AWS’s VPC product has a default maximum of 50 non-propagated routes per route table, which can be increased up to a hard limit of 100 routes at the cost of potentially reducing network performance. This means you’re effectively limited to 50 Kubernetes nodes per VPC using this method.
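
A tiny Go sketch of the ceiling this implies, using the 110 Pods-per-node default and the 50/100 route limits from the paragraph above; the multiplication is the only thing added here.

package main

import "fmt"

func main() {
    const podsPerNode = 110 // Kubernetes default maximum of schedulable Pods per node

    for _, routeLimit := range []int{50, 100} {
        // One /24 route per node means the route limit caps the node count.
        fmt.Printf("%d routes -> %d nodes -> ~%d schedulable Pods per VPC\n",
            routeLimit, routeLimit, routeLimit*podsPerNode)
    }
}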

While considering clusters larger than 50 nodes in AWS, you’ll quickly find recommendations to use more exotic networking techniques such as overlay networks (IP in IP) and BGP for dynamic routing. All of these approaches add massive complexity to your Kubernetes deployment, effectively requiring you to administer and debug a custom software defined network stack running on top of Amazon’s native VPC software defined network stack. Why would you run an SDN on top of an SDN?

Simpler solutions

After staring at the AWS VPC documentation, the CNI spec, Kubernetes networking requirement documents, kube-proxy iptables magic, along with all the various Linux network driver and namespace options, it’s possible to create simple and straightforward CNI plugins which drive native AWS network constructs to provide a compliant Kubernetes networking stack.

Lincoln Stoll’s k8s-vpcnet and, more recently, Amazon’s amazon-vpc-cni-k8s CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs to achieve overlay-free, AWS VPC-native solutions for Kubernetes networking. While both of these solutions achieve the same base goal of drastically simplifying the network complexity of deploying Kubernetes at scale on AWS, they do not focus on minimizing network latency and kernel overhead as part of implementing a compliant networking stack.

A simple and low-latency solution

We developed our solution using IPvlan, bypassing the cost of forwarding packets through the default namespace to connect host ENI adapters to their Pod virtual adapters. We directly tie host ENI adapters to Pods.

Network flow to/from VPC over IPvlan

In IPVLAN — The Beginning, Mahesh Bandewar and Eric Dumazet discuss needing an alternative to forwarding as a motivation for writing IPvlan:

Though this solution [forwarding packets from and to the default namespace] works on a functional basis, the performance / packet rate expected from this setup is much lesser since every packet that is going in or out is processed 2+ times on the network stack (2x Ingress + Egress or 2x Egress + Ingress). This is a huge cost to pay for.

We also wanted the system to be host-local with minimal moving components and state; our network stack contains no network services or daemons. As AWS instances boot, CNI plugins communicate with AWS networking APIs to provision network resources for Pods.

Lyft’s network architecture for Kubernetes, a low level overview

The primary EC2 boot ENI with its primary private IP is used as the IP address for the node. Our CNI plugins manage additional ENIs and private IPs on those ENIs to assign IP addresses to Pods.

ENI assignment

Each Pod contains two network interfaces, a primary IPvlan interface and an unnumbered point-to-point virtual ethernet interface. These interfaces are created via a chained CNI execution.

CNI chained execution
  • IPvlan interface: The IPvlan interface with the Pod’s IP is used for all VPC traffic and provides minimal overhead for network packet processing within the Linux kernel. The master device is the ENI of the associated Pod IP. IPvlan is used in L2 mode with isolation provided from all other ENIs, including the boot ENI handling traffic for the Kubernetes control plane.
  • Unnumbered point-to-point interface: A pair of virtual ethernet interfaces (veth) without IP addresses is used to interconnect the Pod’s network namespace to the default network namespace. The interface is used as the default route (non-VPC traffic) from the Pod, and additional routes are created on each side to direct traffic between the node IP and the Pod IP over the link. For traffic sent over the interface, the Linux kernel borrows the IP address from the IPvlan interface for the Pod side and the boot ENI interface for the Kubelet side. Kubernetes Pods and nodes communicate using the same well-known addresses regardless of which interface (IPvlan or veth) is used for communication. This particular trick of “IP unnumbered configuration” is documented in RFC5309.
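
To make the unnumbered point-to-point trick concrete, the following is a rough, hypothetical Go sketch of the equivalent host-side plumbing using the vishvananda/netlink library. The interface names and IP addresses are invented, and this is a simplified illustration rather than the actual cni-ipvlan-vpc-k8s implementation (which also moves the peer interface into the Pod’s network namespace and adds the mirror-image route there).

package main

import (
    "log"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    podIP := net.ParseIP("10.0.42.15") // assumed Pod IP (lives on the IPvlan interface)
    nodeIP := net.ParseIP("10.0.0.10") // assumed node IP (boot ENI primary private IP)

    // Create an unnumbered veth pair; a real plugin would move the peer end
    // into the Pod's network namespace.
    veth := &netlink.Veth{
        LinkAttrs: netlink.LinkAttrs{Name: "vethhost0"},
        PeerName:  "vethpod0",
    }
    if err := netlink.LinkAdd(veth); err != nil {
        log.Fatalf("create veth: %v", err)
    }
    host, err := netlink.LinkByName("vethhost0")
    if err != nil {
        log.Fatal(err)
    }
    if err := netlink.LinkSetUp(host); err != nil {
        log.Fatal(err)
    }

    // Host side: route the Pod's /32 over the veth, borrowing the node IP as
    // the source address so the link itself stays unnumbered (RFC5309 style).
    podDst := &net.IPNet{IP: podIP, Mask: net.CIDRMask(32, 32)}
    route := &netlink.Route{
        LinkIndex: host.Attrs().Index,
        Dst:       podDst,
        Src:       nodeIP,
        Scope:     netlink.SCOPE_LINK,
    }
    if err := netlink.RouteAdd(route); err != nil {
        log.Fatalf("add /32 route: %v", err)
    }
    // Inside the Pod namespace, a mirror-image /32 route to the node IP would
    // be added over vethpod0 (omitted here for brevity).
}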

Internet egress

For applications where Pods need to directly communicate with the Internet, our stack can source NAT traffic from the Pod over the primary private IP of the boot ENI by setting the default route to the unnumbered point-to-point interface; this, in turn, enables making use of Amazon’s Public IPv4 addressing attribute feature. When enabled, Pods can egress to the Internet without needing to manage Elastic IPs or NAT Gateways.

Internet egress w/ SNAT over boot ENI

Host namespace interconnect

Kubelets and Daemon Sets have high-bandwidth, host-local access to all Pods running on the instance — traffic doesn’t transit ENI devices. Source and destination IPs are the well-known Kubernetes addresses on either side of the connection.

  • kube-proxy: We use kube-proxy in iptables mode and it functions as expected, but with the caveat that Kubernetes Services see connections from a node’s source IP instead of the Pod’s source IP as the netfilter rules are processed in the default namespace. This side effect is similar to Kubernetes behavior under the userspace proxy. Since we’re optimizing for services connecting via the Envoy mesh, this particular tradeoff hasn’t been a significant issue.
  • kube2iam: Traffic from Pods to the AWS Metadata service transits over the unnumbered point-to-point interface to reach the default namespace before being redirected via destination NAT. The Pod’s source IP is maintained as kube2iam runs as a normal Daemon Set.

VPC optimizations

Our design is heavily optimized for intra-VPC traffic where IPvlan is the only overhead between the instance’s ethernet interface and the Pod network namespace. We bias toward traffic remaining within the VPC and not transiting the IPv4 Internet where veth and NAT overhead is incurred. Unfortunately, many AWS services require transiting the Internet; however, both DynamoDB and S3 offer VPC gateway endpoints.

While we have not yet implemented IPv6 support in our CNI stack, we have plans to do so in the near future. IPv6 can make use of the IPvlan interface for both VPC traffic as well as Internet traffic, due to AWS’s use of public IPv6 addressing within VPCs and support for egress-only Internet Gateways. NAT and veth overhead will not be required for this traffic.

We’re planning to migrate to a VPC endpoint for DynamoDB and use native IPv6 support for communication to S3. Biasing toward extremely low overhead IPv6 traffic with higher overhead for IPv4 Internet traffic seems like the right future direction.

Ongoing work and next steps

Our stack is composed of a slightly modified upstream IPvlan CNI plugin, an unnumbered point-to-point CNI plugin, and an IPAM plugin that does the bulk of the heavy lifting. We’ve opened a pull request against the CNI plugins repo with the hope that we can unify the upstream IPvlan plugin functionality with our additional change that permits the IPAM plugin to communicate back to the IPvlan driver the interface (ENI device) containing the allocated Pod IP address.

Short of adding IPv6 support, we’re close to being feature complete with our initial design. We’re very interested in hearing feedback on our CNI stack, and we’re hopeful the community will find it a useful addition that encourages Kubernetes adoption on AWS. Please reach out to us via GitHub, email, or Gitter.

Thanks

cni-ipvlan-vpc-k8s is a team effort combining engineering resources from Lyft’s Infrastructure and Security teams. Special thanks to Yann Ramin who coauthored much of the code and Mike Cutalo who helped get the testing infrastructure into shape.

Interested in working on Kubernetes? Lyft is hiring! Drop me a note on Twitter or at pfisher@lyft.com.