
The European Parliament has approved budget for VLC bug bounty program



Debugging an evil Go runtime bug


Preface

I’m a big fan of Prometheus and Grafana. As a former SRE at Google I’ve learned to appreciate good monitoring, and this combination has been a winner for me over the past year. I’m using them for monitoring my personal servers (both black-box and white-box monitoring), for the Euskal Encounter external and internal event infra, for work I do professionally for clients, and more. Prometheus makes it very easy to write custom exporters to monitor your own data, and there’s a good chance you’ll find an exporter that already works for you out of the box. For example, we use sql_exporter to make a pretty dashboard of attendee metrics for the Encounter events.

Event dashboard for Euskal Encounter (fake staging data)

Since it’s so easy to throw node_exporter onto any random machine and have a Prometheus instance scrape it for basic system-level metrics (CPU, memory, network, disk, filesystem usage, etc), I figured, why not also monitor my laptop? I have a Clevo “gaming” laptop that serves as my primary workstation, mostly pretending to be a desktop at home but also traveling with me to big events like the Chaos Communication Congress. Since I already have a VPN between it and one of my servers where I run Prometheus, I can just emerge prometheus-node_exporter, bring up the service, and point my Prometheus instance at it. This automatically configures alerts for it, which means my phone will make a loud noise whenever I open way too many Chrome tabs and run out of my 32GB of RAM. Perfect.
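
For reference, scraping the new target is just one more entry in the Prometheus config, along these lines (a minimal sketch; the job name and VPN address are placeholders, and 9100 is node_exporter's default port):

scrape_configs:
  - job_name: 'laptop'
    static_configs:
      - targets: ['10.8.0.2:9100']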

Trouble on the horizon

Barely an hour after setting this up, though, my phone did get a page: my newly added target was inaccessible. I could still SSH into the laptop just fine, so the machine itself was definitely up, but node_exporter had crashed.

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc41ffc7fff pc=0x41439e]

goroutine 2395 [running]:
runtime.throw(0xae6fb8, 0x2a)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0xc4203e8be8 sp=0xc4203e8bc8 pc=0x42c815
runtime.sigpanic()
        /usr/lib64/go/src/runtime/signal_unix.go:351 +0x2b8 fp=0xc4203e8c38 sp=0xc4203e8be8 pc=0x443318
runtime.heapBitsSetType(0xc4204b6fc0, 0x30, 0x30, 0xc420304058)
        /usr/lib64/go/src/runtime/mbitmap.go:1224 +0x26e fp=0xc4203e8c90 sp=0xc4203e8c38 pc=0x41439e
runtime.mallocgc(0x30, 0xc420304058, 0x1, 0x1)
        /usr/lib64/go/src/runtime/malloc.go:741 +0x546 fp=0xc4203e8d38 sp=0xc4203e8c90 pc=0x411876
runtime.newobject(0xa717e0, 0xc42032f430)
        /usr/lib64/go/src/runtime/malloc.go:840 +0x38 fp=0xc4203e8d68 sp=0xc4203e8d38 pc=0x411d68
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.NewConstMetric(0xc42018e460, 0x2, 0x3ff0000000000000, 0xc42032f430, 0x1, 0x1, 0x10, 0x9f9dc0, 0x8a0601, 0xc42032f430)
        /var/tmp/portage/net-analyzer/prometheus-node_exporter-0.15.0/work/prometheus-node_exporter-0.15.0/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/value.go:165 +0xd0 fp=0xc4203e8dd0 sp=0xc4203e8d68 pc=0x77a980

node_exporter, like many Prometheus components, is written in Go. Go is a relatively safe language: while it allows you to shoot yourself in the foot if you so wish, and it doesn’t have nearly as strong safety guarantees as, say, Rust does, it is still not too easy to accidentally cause a segfault in Go. Moreover, node_exporter is a relatively simple Go app with mostly pure-Go dependencies. Therefore, this was an interesting crash to get, especially since the crash was inside mallocgc, which should never crash under normal circumstances.

Things got more interesting after I restarted it a few times:

2017/11/07 06:32:49 http: panic serving 172.20.0.1:38504: runtime error: growslice: cap out of range
goroutine 41 [running]:
net/http.(*conn).serve.func1(0xc4201cdd60)
        /usr/lib64/go/src/net/http/server.go:1697 +0xd0
panic(0xa24f20, 0xb41190)
        /usr/lib64/go/src/runtime/panic.go:491 +0x283
fmt.(*buffer).WriteString(...)
        /usr/lib64/go/src/fmt/print.go:82
fmt.(*fmt).padString(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:110 +0x110
fmt.(*fmt).fmt_s(0xc42053a040, 0xc4204e6800, 0xc4204e6850)
        /usr/lib64/go/src/fmt/format.go:328 +0x61
fmt.(*pp).fmtString(0xc42053a000, 0xc4204e6800, 0xc4204e6850, 0xc400000073)
        /usr/lib64/go/src/fmt/print.go:433 +0x197
fmt.(*pp).printArg(0xc42053a000, 0x9f4700, 0xc42041c290, 0x73)
        /usr/lib64/go/src/fmt/print.go:664 +0x7b5
fmt.(*pp).doPrintf(0xc42053a000, 0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2)
        /usr/lib64/go/src/fmt/print.go:996 +0x15a
fmt.Sprintf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0x10, 0x9f4700)
        /usr/lib64/go/src/fmt/print.go:196 +0x66
fmt.Errorf(0xae7c2d, 0x2c, 0xc420475670, 0x2, 0x2, 0xc420410301, 0xc420410300)
        /usr/lib64/go/src/fmt/print.go:205 +0x5a

Well, that’s interesting. A crash in Sprintf this time. What?

runtime: pointer 0xc4203e2fb0 to unallocated span idx=0x1f1 span.base()=0xc4203dc000 span.limit=0xc4203e6000 span.state=3
runtime: found in object at *(0xc420382a80+0x80)
object=0xc420382a80 k=0x62101c1 s.base()=0xc420382000 s.limit=0xc420383f80 s.spanclass=42 s.elemsize=384 s.state=_MSpanInUse<snip>
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0xaee4fe, 0x3e)
        /usr/lib64/go/src/runtime/panic.go:605 +0x95 fp=0x7f0f19ffab90 sp=0x7f0f19ffab70 pc=0x42c815
runtime.heapBitsForObject(0xc4203e2fb0, 0xc420382a80, 0x80, 0xc41ffd8a33, 0xc400000000, 0x7f0f400ac560, 0xc420031260, 0x11)
        /usr/lib64/go/src/runtime/mbitmap.go:425 +0x489 fp=0x7f0f19ffabe8 sp=0x7f0f19ffab90 pc=0x4137c9
runtime.scanobject(0xc420382a80, 0xc420031260)
        /usr/lib64/go/src/runtime/mgcmark.go:1187 +0x25d fp=0x7f0f19ffac90 sp=0x7f0f19ffabe8 pc=0x41ebed
runtime.gcDrain(0xc420031260, 0x5)
        /usr/lib64/go/src/runtime/mgcmark.go:943 +0x1ea fp=0x7f0f19fface0 sp=0x7f0f19ffac90 pc=0x41e42a
runtime.gcBgMarkWorker.func2()
        /usr/lib64/go/src/runtime/mgc.go:1773 +0x80 fp=0x7f0f19ffad20 sp=0x7f0f19fface0 pc=0x4580b0
runtime.systemstack(0xc420436ab8)
        /usr/lib64/go/src/runtime/asm_amd64.s:344 +0x79 fp=0x7f0f19ffad28 sp=0x7f0f19ffad20 pc=0x45a469
runtime.mstart()
        /usr/lib64/go/src/runtime/proc.go:1125 fp=0x7f0f19ffad30 sp=0x7f0f19ffad28 pc=0x430fe0

And now the garbage collector had stumbled upon a problem: yet another kind of crash.

At this point, there are two natural conclusions: either I have a severe hardware issue, or there is a wild memory corruption bug in the binary. I initially considered the former unlikely, as this machine has a very heavily mixed workload and no signs of instability that can be traced back to hardware (I have my fair share of crashing software, but it’s never random). Since Go binaries like node_exporter are statically linked and do not depend on any other libraries, I can download the official release binary and try that, which would eliminate most of the rest of my system as a variable. Yet, when I did so, I still got a crash.

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x76b998]

goroutine 13 [running]:
runtime.throw(0xabfb11, 0x5)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420060c40 sp=0xc420060c20 pc=0x42c725
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc420060c90 sp=0xc420060c40 pc=0x443197
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go.(*LabelPair).GetName(...)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_model/go/metrics.pb.go:85
github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*Desc).String(0xc4203ae010, 0xaea9d0, 0xc42045c000)
        /go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/client_golang/prometheus/desc.go:179 +0xc8 fp=0xc420060dc8 sp=0xc420060c90 pc=0x76b998

Yet another completely different crash. At this point there was a decent chance that there was truly an upstream problem with node_exporter or one of its dependencies, so I filed an issue on GitHub. Perhaps the developers had seen this before? It’s worth bringing this kind of issue to their attention and seeing if they have any ideas.

Unsurprisingly, upstream’s first guess was that it was a hardware issue. This isn’t unreasonable: after all, I’m only hitting the problem on one specific machine. All my other machines are happily running node_exporter. While I had no other evidence of hardware-linked instability on this host, I also had no other explanation as to what was so particular about this machine that would make node_exporter crash. A Memtest86+ run never hurt anyone, so I gave it a go.

And then this happened:

This is what I get for using consumer hardware

Whoops! Bad RAM. Well, to be more specific, one bit of bad RAM. After letting the test run for a full pass, all I got was that single bad bit, plus a few false positives in test 7 (which moves blocks around and so can amplify a single error).

Further testing showed that Memtest86+ test #5 in SMP mode would quickly detect the error, but usually not on the first pass. The error was always the same bit at the same address. This suggests that the problem is a weak or leaky RAM cell. In particular, one which gets worse with temperature. This is quite logical: a higher temperature increases leakage in the RAM cells and thus makes it more likely that a somewhat marginal cell will actually cause a bit flip.

To put this into perspective, this is one bad bit out of 274,877,906,944. That’s actually a very good error rate! Hard disks and Flash memory have much higher error rates - it’s just that those devices have bad blocks marked at the factory that are transparently swapped out without the user knowing, and can transparently mark newly discovered weak blocks as bad and relocate them to a spare area. RAM has no such luxury, so a bad bit sticks forever.
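
For reference, that denominator is simply the laptop's 32 GiB of RAM counted in bits: 32 × 2^30 bytes × 8 bits/byte = 274,877,906,944 bits.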

Alas, this is vanishingly unlikely to be the cause of my node_exporter woes. That app uses very little RAM, and so the chances of it hitting the bad bit (repeatedly, at that) are extremely low. This kind of problem would be largely unnoticeable, perhaps causing a pixel error in some graphics, a single letter to flip in some text, an instruction to be corrupted that probably won’t ever be run, and perhaps the rare segfault when something actually important does land on the bad bit. Nonetheless, it does cause long-term reliability issues, and this is why servers and other devices intended to be reliable must use ECC RAM, which can correct this kind of error.

I don’t have the luxury of ECC RAM on this laptop. What I do have, though, is the ability to mark the bad block of RAM as bad and tell the OS not to use it. There is a little-known feature of GRUB 2 which allows you to do just that, by changing the memory map that is passed to the booted kernel. It’s not worth buying new RAM just for a single bad bit (especially since DDR3 is already obsolete, and there’s a good chance new RAM would have weak cells anyway), so this is a good option.

However, there’s one more thing I can do. Since the problem gets worse with temperature, what happens if I heat up the RAM?

🔥🔥🔥memtest86+🔥🔥🔥

A cozy 100°C

Using a heat gun set at a fairly low temperature (130°C) I warmed up two modules at a time (the other two modules are under the rear cover, as my laptop has four SODIMM slots total). Playing around with module order, I found three additional weak bits only detectable at elevated temperature, and they were spread around three of my RAM sticks.

I also found that the location of the errors stayed roughly consistent even as I swapped modules around: the top bits of the address remained the same. This is because the RAM is interleaved: data is spread over all four sticks, instead of each stick being assigned a contiguous quarter of the available address space. This is convenient, because I can just mask a region of RAM large enough to cover all possible addresses for each error bit, and not have to worry that I might swap sticks in the future and mess up the masking. I found that masking a contiguous 128KiB area should cover all possible permutations of addresses for each given bad bit, but, for good measure, I rounded up to 1MiB. This gave me three 1MiB aligned blocks to mask out (one of them covers two of the bad bits, for a total of four bad bits I wanted masked):

  • 0x36a700000 - 0x36a7fffff
  • 0x460e00000 - 0x460efffff
  • 0x4ea000000 - 0x4ea0fffff

This can be specified using the address/mask syntax required by GRUB as follows, in /etc/default/grub:

GRUB_BADRAM="0x36a700000,0xfffffffffff00000,0x460e00000,0xfffffffffff00000,0x4ea000000,0xfffffffffff00000"
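
Regenerating the GRUB configuration then applies the new memory map on the next boot; the usual invocation looks like this (the output path varies by distro and boot setup):

grub-mkconfig -o /boot/grub/grub.cfg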

One quick grub-mkconfig later, I am down 3MiB of RAM and four dodgy bits with it. It’s not ECC RAM, but this should increase the effective reliability of my consumer-grade RAM, since now I know the rest of the memory is fine up to at least 100°C.

Needless to say, node_exporter still crashed. But then, we already knew this wasn’t the real problem, didn’t we?

Digging deeper

The annoying thing about this kind of bug is that it clearly is caused by some kind of memory corruption that breaks code that runs later. This makes it very hard to debug, because we can’t predict what will be corrupted (it varies), and we can’t catch the bad code in the act of doing so.

First I tried some basic bisecting of available node_exporter releases and enabling/disabling different collectors, but that went nowhere. I also tried running an instance under strace. This seemed to stop the crashes, which strongly points to a race-condition kind of problem. strace will usually wind up serializing execution of apps to some extent, by intercepting all system calls run by all threads. I would later find that the strace instance crashed too, but it took much longer to do so. Since this seemed to be related to concurrency, I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.
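
For the record, both workarounds amount to nothing more exotic than this (illustrative invocations, not the exact commands from my shell history):

strace -f ./node_exporter        # tracing every thread partly serializes execution; crashes became much rarer
GOMAXPROCS=1 ./node_exporter     # only one OS thread runs Go code at a time; no crashes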

By now I had gathered quite a considerable number of crash logs, and I was starting to notice some patterns. While there was a lot of variation in the parts that were crashing and how, ultimately the error messages could be categorized into different types and the same kind of error showed up more than once. So I started Googling these errors, and this is how I stumbled upon Go issue #20427. This was an issue in a seemingly unrelated part of Go, but one that had caused similar segfaults and random issues. The issue was closed with no diagnosis after it couldn’t be reproduced with Go 1.9. Nobody knew what the root cause was, just that it had stopped happening.

So I grabbed this sample code from the issue, which claimed to reproduce the problem, and ran it on my machine. Lo and behold, it crashed within seconds. Bingo. This is a lot better than waiting hours for node_exporter to crash.

That doesn’t get me any closer to debugging the issue from the Go side, but it gives me a much faster way to test for it. So let’s try another angle.

Bisecting machines

I know the problem happens on my laptop, but doesn’t happen on any of my other machines. I tried the reproducer on every other machine I have easy access to, and couldn’t get it to crash on any of them. This tells me there’s something special about my laptop. Since Go statically links binaries, the rest of userspace doesn’t matter. This leaves two relevant parts: the hardware, and the kernel.

I don’t have any easy way to test with various hardware other than the machines I own, but I can play with kernels. So let’s try that. First order of business: will it crash in a VM?

To test for this, I built a minimal initramfs that will allow me to very quickly launch the reproducer in a QEMU VM without having to actually install a distro or boot a full Linux system. My initramfs was built with Linux’s scripts/gen_initramfs_list.sh and contained the following files:

dir /dev 755 0 0
nod /dev/console 0600 0 0 c 5 1
nod /dev/null 0666 0 0 c 1 3
dir /bin 755 0 0
file /bin/busybox busybox 755 0 0
slink /bin/sh busybox 755 0 0
slink /bin/true busybox 755 0 0
file /init init.sh 755 0 0
file /reproducer reproducer 755 0 0

/init is the entry point of a Linux initramfs, and in my case was a simple shell script to start the test and measure time:

#!/bin/sh
export PATH=/bin

start=$(busybox date +%s)

echo "Starting test now..."
/reproducer
ret=$?
end=$(busybox date +%s)
echo "Test exited with status $ret after $((end-start)) seconds"

/bin/busybox is a statically linked version of BusyBox, often used in minimal systems like this to provide all basic Linux shell utilities (including a shell itself).

The initramfs can be built like this (from a Linux kernel source tree), where list.txt is the file list above:

scripts/gen_initramfs_list.sh -o initramfs.gz list.txt

And QEMU can boot the kernel and initramfs directly:

qemu-system-x86_64 -kernel /boot/vmlinuz-4.13.9-gentoo -initrd initramfs.gz -append 'console=ttyS0' -smp 8 -nographic -serial mon:stdio -cpu host -enable-kvm

This resulted in no output at all to the console… and then I realized I hadn’t even compiled 8250 serial port support into my laptop’s kernel. D’oh. I mean, it doesn’t have a physical serial port, right? Anyway, after a quick detour to rebuild the kernel with serial support (while crossing my fingers that that didn’t change anything important), I tried again, and it successfully booted and ran the reproducer.

Did it crash? Yup. Good, this means the problem is reproducible on a VM on the same machine. I tried the same QEMU command on my home server, with its own kernel, and… nothing. Then I copied the kernel from my laptop and booted that and… it crashed. The kernel is what matters. It’s not a hardware issue.

Juggling kernels

At this point, I knew I was going to be compiling lots of kernels to try to narrow this down. So I decided to move to the most powerful machine I had lying around: a somewhat old 12-core, 24-thread Xeon (now defunct, sadly). I copied the known-bad kernel source to that machine, built it, and tested it.

It didn’t crash.

What?

Some head-scratching later, I made sure the original bad kernel binary crashed (it did). Are we back to hardware? Does it matter which machine I build the kernel on? So I tried building the kernel on my home server, and that one promptly triggered the crash. Building the same kernel on two machines yields crashing kernels, while building it on a third doesn’t. What’s the difference?

Well, these are all Gentoo boxes, and all Gentoo Hardened at that. But my laptop and my home server are both ~amd64 (unstable), while my Xeon server is amd64 (stable). That means GCC is different. My laptop and home server were both on gcc (Gentoo Hardened 6.4.0 p1.0) 6.4.0, while my Xeon was on gcc (Gentoo Hardened 5.4.0-r3 p1.3, pie-0.6.5) 5.4.0.

But my home server’s kernel, which was nearly the same version as my laptop (though not exactly), built with the same GCC, did not reproduce the crashes. So now we have to conclude that both the compiler used to build the kernel and the kernel itself (or its config?) matter.

To narrow things down further, I compiled the exact kernel tree from my laptop on my home server (linux-4.13.9-gentoo), and confirmed that it indeed crashed. Then I copied over the .config from my home server and compiled that, and found that it didn’t. This means we’re looking at a kernel config difference and a compiler difference:

  • linux-4.13.9-gentoo + gcc 5.4.0-r3 p1.3 + laptop .config - no crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + laptop .config - crash
  • linux-4.13.9-gentoo + gcc 6.4.0 p1.0 + server .config - no crash

Two .configs, one good, and one bad. Time to diff them. Of course, the two configs were vastly different (since I tend to tailor my kernel config to only include the drivers I need on any particular machine), so I had to repeatedly rebuild the kernel while narrowing down the differences.

I decided to start with the “known bad” .config and start removing things. Since the reproducer takes a variable amount of time to crash, it’s easier to test for “still crashes” (just wait for it to crash) than for “doesn’t crash” (how long do I have to wait to convince myself that it doesn’t?). Over the course of 22 kernel builds, I managed to simplify the config so much that the kernel had no networking support, no filesystems, no block device core, and didn’t even support PCI (still works fine on a VM though!). My kernel builds now took less than 60 seconds and the kernel was about 1/4th the size of my regular one.

Then I moved on to the “known good” .config and removed all the unnecessary junk while making sure it still didn’t crash the reproducer (which was trickier and slower than the previous test). I hit a few false branches, where a change I made actually caused the reproducer to start crashing (though I didn’t know which one) yet I misidentified the result as “no crash”; when a later build crashed, I had to walk back through the previous kernels I’d built and pin down exactly where the crashing behavior had been introduced. I ended up doing 7 kernel builds.

Eventually, I narrowed it down to a small handful of different .config options. A few of them stood out, in particular CONFIG_OPTIMIZE_INLINING. After carefully testing them I concluded that, indeed, that option was the culprit. Turning it off produced kernels that crash the reproducer testcase, while turning it on produced kernels that didn’t. This option, when turned on, allows GCC to better determine which inline functions really must be inlined, instead of forcing it to inline them unconditionally. This also explains the GCC connection: inlining behavior is likely to change between GCC versions.

/*
 * Force always-inline if the user requests it so via the .config,
 * or if gcc is too old.
 * GCC does not warn about unused static inline functions for
 * -Wunused-function.  This turns out to avoid the need for complex #ifdef
 * directives.  Suppress the warning in clang as well by using "unused"
 * function attribute, which is redundant but not harmful for gcc.
 */
#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) ||                \
    !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
#define inline inline           __attribute__((always_inline,unused)) notrace
#define __inline__ __inline__   __attribute__((always_inline,unused)) notrace
#define __inline __inline       __attribute__((always_inline,unused)) notrace
#else
/* A lot of inline functions can cause havoc with function tracing */
#define inline inline           __attribute__((unused)) notrace
#define __inline__ __inline__   __attribute__((unused)) notrace
#define __inline __inline       __attribute__((unused)) notrace
#endif

So what next? We know that CONFIG_OPTIMIZE_INLINING makes the difference, but that potentially changes the behavior of every single inline function across the whole kernel. How to pinpoint the problem?

I had an idea.

Hash-based differential compilation

The basic premise is to compile part of the kernel with the option turned on, and part of the kernel with the option turned off. By testing the resulting kernel and checking whether the problem appears or not, we can deduce which subset of the kernel compilation units contains the problem code.

Instead of trying to enumerate all object files and doing some kind of binary search, I decided to go with a hash-based approach. I wrote this wrapper script for GCC:

#!/bin/bash
args=("$@")

doit=
while [ $# -gt 0 ]; do
        case "$1" in
                -c)
                        doit=1
                        ;;
                -o)
                        shift
                        objfile="$1"
                        ;;
        esac
        shift
done

extra=
if [ ! -z "$doit" ]; then
        sha="$(echo -n "$objfile" | sha1sum - | cut -d" " -f1)"
        echo "${sha:0:8} $objfile" >> objs.txt
        if [ $((0x${sha:0:8} & (0x80000000 >> $BIT))) = 0 ]; then
                echo "[n]" "$objfile" 1>&2
        else
                extra=-DCONFIG_OPTIMIZE_INLINING
                echo "[y]" "$objfile" 1>&2
        fi
fi

exec gcc $extra "${args[@]}"

This hashes the object file name with SHA-1, then checks a given bit of the hash out of the first 32 bits (identified by the $BIT environment variable). If the bit is 0, it builds without CONFIG_OPTIMIZE_INLINING. If it is 1, it builds with CONFIG_OPTIMIZE_INLINING. I found that the kernel had around 685 object files at this point (my minimization effort had paid off), which requires about 10 bits for a unique identification. This hash-based approach also has one neat property: I can choose to only worry about crashing outcomes (where the bit is 0), since it is much harder to prove that a given kernel build does not crash (as the crashes are probabilistic and can take quite a while sometimes).
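
Hooked into the kernel build, the wrapper simply replaces the compiler; one plausible invocation (assuming the script above is saved as gccwrap.sh and made executable) is:

BIT=0 make -j24 CC="$PWD/gccwrap.sh"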

I built 32 kernels, one for each bit of the SHA-1 prefix, which only took 29 minutes. Then I started testing them, and every time I got a crash, I narrowed down a regular expression of possible SHA-1 hashes to only those with zero bits at those specific positions. At 8 crashes (and thus zero bits), I was down to 4 object files, and a couple were looking promising. Once I hit the 10th crash, there was a single match.

$ grep '^[0246][012389ab][0189][014589cd][028a][012389ab][014589cd]' objs_0.txt
6b9cab4f arch/x86/entry/vdso/vclock_gettime.o

vDSO code. Of course.

vDSO shenanigans

The kernel’s vDSO is not actually kernel code. vDSO is a small shared library that the kernel places in the address space of every process, and which allows apps to perform certain special system calls without ever leaving user mode. This increases performance significantly, while still allowing the kernel to change the implementation details of those system calls as needed.
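
You can see it mapped into any running process (a quick illustration, not from the original writeup):

grep vdso /proc/self/maps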

In other words, vDSO is GCC-compiled code, built with the kernel, that ends up being linked with every userspace app. It’s userspace code. This explains why the kernel and its compiler mattered: it wasn’t about the kernel itself, but about a shared library provided by the kernel! And Go uses the vDSO for performance. Go also happens to have a (rather insane, in my opinion) policy of reinventing its own standard library, so it does not use any of the standard Linux glibc code to call vDSO, but rather rolls its own calls (and syscalls too).

So what does flipping CONFIG_OPTIMIZE_INLINING do to the vDSO? Let’s look at the assembly.

With CONFIG_OPTIMIZE_INLINING=n:

arch/x86/entry/vdso/vclock_gettime.o.no_inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <vread_tsc>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	0f 31                	rdtsc  
   9:	48 c1 e2 20          	shl    $0x20,%rdx
   d:	48 09 d0             	or     %rdx,%rax
  10:	48 8b 15 00 00 00 00 	mov    0x0(%rip),%rdx        # 17 <vread_tsc+0x17>
  17:	48 39 c2             	cmp    %rax,%rdx
  1a:	77 02                	ja     1e <vread_tsc+0x1e>
  1c:	5d                   	pop    %rbp
  1d:	c3                   	retq   
  1e:	48 89 d0             	mov    %rdx,%rax
  21:	5d                   	pop    %rbp
  22:	c3                   	retq   
  23:	0f 1f 00             	nopl   (%rax)
  26:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  2d:	00 00 00 

0000000000000030 <__vdso_clock_gettime>:
  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp
  47:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 4e <__vdso_clock_gettime+0x1e>
  4e:	83 ff 01             	cmp    $0x1,%edi
  51:	74 66                	je     b9 <__vdso_clock_gettime+0x89>
  53:	0f 8e dc 00 00 00    	jle    135 <__vdso_clock_gettime+0x105>
  59:	83 ff 05             	cmp    $0x5,%edi
  5c:	74 34                	je     92 <__vdso_clock_gettime+0x62>
  5e:	83 ff 06             	cmp    $0x6,%edi
  61:	0f 85 c2 00 00 00    	jne    129 <__vdso_clock_gettime+0xf9>
[...]

With CONFIG_OPTIMIZE_INLINING=y:

arch/x86/entry/vdso/vclock_gettime.o.inline_opt:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__vdso_clock_gettime>:
   0:	55                   	push   %rbp
   1:	4c 8d 0d 00 00 00 00 	lea    0x0(%rip),%r9        # 8 <__vdso_clock_gettime+0x8>
   8:	83 ff 01             	cmp    $0x1,%edi
   b:	48 89 e5             	mov    %rsp,%rbp
   e:	74 66                	je     76 <__vdso_clock_gettime+0x76>
  10:	0f 8e dc 00 00 00    	jle    f2 <__vdso_clock_gettime+0xf2>
  16:	83 ff 05             	cmp    $0x5,%edi
  19:	74 34                	je     4f <__vdso_clock_gettime+0x4f>
  1b:	83 ff 06             	cmp    $0x6,%edi
  1e:	0f 85 c2 00 00 00    	jne    e6 <__vdso_clock_gettime+0xe6>
[...]

Interestingly, CONFIG_OPTIMIZE_INLINING=y, which is supposed to allow GCC to inline less, actually resulted in it inlining more: vread_tsc is inlined in that version, but not in the CONFIG_OPTIMIZE_INLINING=n version. But vread_tsc isn’t marked inline at all, so GCC is perfectly within its rights to behave like this, as counterintuitive as it may be.

But who cares if a function is inlined? Where’s the actual problem? Well, looking closer at the non-inline version…

  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	48 81 ec 20 10 00 00 	sub    $0x1020,%rsp
  3b:	48 83 0c 24 00       	orq    $0x0,(%rsp)
  40:	48 81 c4 20 10 00 00 	add    $0x1020,%rsp

Why is GCC allocating over 4KiB of stack? That’s not a stack allocation, that’s a stack probe, or more specifically, the result of the -fstack-check GCC feature.

Gentoo Linux enables -fstack-check by default on its hardened profile. This is a mitigation for theStack Clash vulnerability. While -fstack-check is an old GCC feature and not intended for this, it turns out it effectively mitigates the issue (I’m told proper Stack Clash protection will be in GCC 8). As a side-effect, it causes some fairly silly behavior, where every non-leaf function (that is, a function that makes function calls) ends up probing the stack 4 KiB ahead of the stack pointer. In other words, code compiled with -fstack-check potentially needs at least 4 KiB of stack space, unless it is a leaf function (or a function where every call was inlined).

Go loves small stacks.

TEXT runtime·walltime(SB),NOSPLIT,$16
	// Be careful. We're calling a function with gcc calling convention here.
	// We're guaranteed 128 bytes on entry, and we've taken 16, and the
	// call uses another 8.
	// That leaves 104 for the gettime code to use. Hope that's enough!

Turns out 104 bytes aren’t enough for everybody. Certainly not for my kernel.

It’s worth pointing out that the vDSO specification makes no mention of maximum stack usage guarantees, so this is squarely Go’s fault for making invalid assumptions.

Conclusion

This perfectly explains the symptoms. The stack probe is an orq, a logical OR with 0. Semantically that is a no-op, but it effectively probes the target address (if the address is unmapped, it will segfault). But we weren’t seeing segfaults in vDSO code, so how was this breaking Go? Well, OR with 0 isn’t really a no-op: since orq is not an atomic instruction, the CPU actually reads the memory address and then writes the value back. This creates a race condition. If other threads are running in parallel on other CPUs, the probe can effectively wind up undoing a memory write that occurs simultaneously. Since the write was outside Go’s stack bounds, it was likely landing on other threads’ stacks or other data, and, when the stars lined up, silently reverting a store another thread had just made there. This is also why GOMAXPROCS=1 works around the issue: it prevents two threads from running Go code at the same time.
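
Spelled out as a timeline (my own sketch of the failure mode described above, not from the original post):

# CPU 0: stack probe                CPU 1: another Go thread
# -------------------------------   -----------------------------------
# read  [addr]  -> old value
#                                   write [addr] <- new value
# write [addr] <- old value         (CPU 1's store is silently undone)
#
# [addr] sits 4KiB below the vDSO caller's stack pointer, i.e. memory
# that belongs to someone else on Go's tiny stacks.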

What’s the fix? I left that up to the Go devs. Their solution ultimately was to pivot to a larger stack before calling vDSO functions. This introduces a small speed penalty (nanoseconds), but it’s acceptable. After building node_exporter with the fixed Go toolchain, the crashes went away.

2017-12-05 01:20

SEC Emergency Action Halts PlexCoin ICO


Washington D.C., Dec. 4, 2017 —

The Securities and Exchange Commission today announced it obtained an emergency asset freeze to halt a fast-moving Initial Coin Offering (ICO) fraud that raised up to $15 million from thousands of investors since August by falsely promising a 13-fold profit in less than a month.

The SEC filed charges against a recidivist Quebec securities law violator, Dominic Lacroix, and his company, PlexCorps. The Commission's complaint, filed in federal court in Brooklyn, New York, alleges that Lacroix and PlexCorps marketed and sold securities called PlexCoin on the internet to investors in the U.S. and elsewhere, claiming that investments in PlexCoin would yield a 1,354 percent profit in less than 29 days. The SEC also charged Lacroix's partner, Sabrina Paradis-Royer, in connection with the scheme.

Today's charges are the first filed by the SEC's new Cyber Unit. The unit was created in September to focus the Enforcement Division's cyber-related expertise on misconduct involving distributed ledger technology and initial coin offerings, the spread of false information through electronic and social media, hacking and threats to trading platforms.

"This first Cyber Unit case hits all of the characteristics of a full-fledged cyber scam and is exactly the kind of misconduct the unit will be pursuing," said Robert Cohen, Chief of the Cyber Unit. "We acted quickly to protect retail investors from this initial coin offering's false promises."

Based on its filing, the SEC obtained an emergency court order to freeze the assets of PlexCorps, Lacroix, and Paradis-Royer.

The SEC’s complaint charges Lacroix, Paradis-Royer and PlexCorps with violating the anti-fraud provisions, and Lacroix and PlexCorps with violating the registration provision, of the U.S. federal securities laws.  The complaint seeks permanent injunctions, disgorgement plus interest and penalties.  For Lacroix, the SEC also seeks an officer-and-director bar and a bar from offering digital securities against Lacroix and Paradis-Royer.

The Commission's investigation was conducted by Daphna A. Waxman, David H. Tutor, and Jorge G. Tenreiro of the New York Regional Office and the Cyber Unit, with assistance from the agency's Office of International Affairs. The case is being supervised by Valerie A. Szczepanik and Mr. Cohen. The Commission appreciates the assistance of Quebec's Autorité Des Marchés Financiers.

The SEC's Office of Investor Education and Advocacy issued an Investor Alert in August 2017 warning investors about scams of companies claiming to be engaging in initial coin offerings: https://www.investor.gov/additional-resources/news-alerts/alerts-bulletins/investor-alert-public-companies-making-ico-related.

Pride, Prejudice


Generated output

What it does

The problem isn't generating over 50,000 words. The problem is existing books are too long. Pride and Prejudice is 130,000 words, Moby Dick is 215,136 words (or 215,136 meows). And we all know 50,000 is the gold standard for a novel! So how can we reduce the word count?

These tactics (deboilerplatify, remove_quote_things, deveryify, decontractify and dehonorify, as seen in the example run below) reduce Pride and Prejudice by about 15% to 111,000 words.

Next we work out the ratio of our word count to 50k, count how many sentences we have, work out how many sentences to keep to approach 50k, and use a text summariser to chop out the dead wood.
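
A rough sketch of that arithmetic, inferred from the sample run shown below (not necessarily the exact code in reducifier.py):

import math

words = 111_633      # word count after the reduction passes
sentences = 4_588    # sentences in the reduced text
target = 50_000      # the NaNoGenMo gold standard

ratio = math.ceil(words / target)   # -> 3
keep = sentences // ratio           # -> 1529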

How to do it

Run:

pip install -r requirements.txt

python reducifier.py

Example:

python reducifier.py
open
word count: 130,000
word count: 126,936	diff: 97.643%	deboilerplatify
word count: 125,438	diff: 96.491%	remove_quote_things
word count: 121,549	diff: 93.499%	deveryify
word count: 121,018	diff: 93.091%	decontractify
word count: 111,633	diff: 85.872%	dehonorify
Ratio (words/50k):	 3
Number of sentences:	 4588
Number to keep:		 1529
word count: 54,273	diff: 41.748%	summarise

This produces output.txt before the summariser, and output2.txt after the summariser.

Works at least on macOS High Sierra with Python 3.6.3.

Example

Here's a diff of Pride and Prejudice and the first pass output.txt:

'tis a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

Source code

https://github.com/hugovk/NaNoGenMo-2017/tree/master/03-reducifier

Apple Agrees to Deal with Ireland Over $15B Unpaid Tax Issue


BRUSSELS—Ireland will begin collecting €13 billion ($15.46 billion) in back taxes from Apple Inc. as soon as early next year after both sides agreed to the terms of an escrow fund for the money, Ireland’s finance chief said Monday.

The European Union in 2016 ordered Dublin to retrieve the billions of euros from Apple in uncollected taxes, which the EU said Apple avoided paying with the help of sweetheart tax deals from Ireland.

A...

How Jet Built a GPU-Powered Fulfillment Engine with F# and CUDA


Jet.com Fulfillment

Have you ever looked at your shopping list and tried to optimize your trip based on things like distance to store, price, and number of items you can buy at each store? The quest for a smarter shopping cart is never-ending, and the complexity of finding even a sub-optimal solution to this problem can quickly get out of hand. This is especially true of online shopping, which expands the set of fulfillment possibilities from local to national scale. Ideally, you could shop online for all items from your list and the website would do all the work to find you the most savings.

That is exactly what Jet.com does for you! Jet.com is an e-commerce company (acquired by Walmart in 2016) known for its innovative pricing engine that finds an optimal cart and the most savings for the customer in real time.

In this post I discuss how Jet tackles the fulfillment optimization problem using GPUs with F#, Azure and microservices. We implemented our solutions in F# via AleaGPU, a natural choice for coding CUDA solutions in .NET. I will also cover relevant aspects of our microservice architecture.

The Merchant Selection Problem

Jet.com provides value by finding efficiencies in the system and passing them through to customers in the form of lower prices. At the center of this is Jet’s smart merchant selection algorithm. When a customer orders several items at once, they can usually be fulfilled by multiple merchants with different warehouses. The goal is to find the combination of merchants and warehouses that minimizes the total order cost, including shipment costs and commissions. The bigger the shopping cart, the higher the potential savings, but also the more time-consuming the search for the optimal combination.

Take, for example, a cart of four items (retail SKUs, or “Stock Keeping Units”) and a local market of four merchants.

Figure 1 shows each merchant’s price for each of the four SKUs and the shipping cost by merchant for some possible fulfillment combinations those merchants can provide. The table of shipping costs shows the cost for either individual SKUs or multiple SKUs packaged together (delimited by a comma or a plus sign, respectively). For example, the cost to ship any individual SKU from Merchant 3 is $3.50; however, SKU 3 and SKU 4 can be shipped together for only $4.50.

Figure 1: Initial cart of four items with a total of 3*3*2*4=72 combinations.

The naive approach of choosing the merchant with the cheapest offer for each SKU results in a total of $95 for the cart (Figure 2). This approach neglects shipping costs and fails to discover any savings from shipping multiple items from the same merchant.

Figure 2: Pricing the cart by taking the cheapest net price allocation.

For example, Merchant 2 can fulfill three of the four SKUs for only $8.50 in shipping costs, so fulfilling the order via Merchant 2 and Merchant 4 brings the total down to $94 (Figure 3). Is this the optimal combination?

Figure 3: We can do better by packing items together.

The only way to find the optimal combination (the most savings for the customer, as Figure 4 shows) is by pricing every one of the 72 combinations for this cart.

Figure 4: The optimal allocation.

The Complexity of Full Search

Pricing every possible fulfillment combination to find the optimal solution is an exhaustive brute-force search of the entire solution space. The full search approach to the merchant selection problem is embarrassingly parallel: each enumerated fulfillment combination can be priced independently. This natural parallelism led us to initially use the full search approach on the GPU, but it turns out that full search is prohibitive (even with GPU acceleration) due to the exponential complexity of the merchant selection problem. Since a genetic algorithm (discussed in the next section) does not guarantee a fully optimal solution, it’s important to use full search when possible. We developed a metric called the “full search year” which we use to measure the computation time required for merchant selection.

A full search year is the number of combinations that can be priced within a year. We use GPU full search year or CPU full search year to indicate the time required when running either implementation. Full search seconds, minutes, hours, and days are derived from full search year.

Cart complexity, or merchant selection complexity, is the total number of fulfillment combinations that must be priced in order to find the optimal fulfillment and lowest price.

Cart complexity = number of combinations = (# offers for item 1) × (# offers for item 2) × … × (# offers for item k).
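
To make the arithmetic concrete, here is a minimal F# sketch (my own illustration, not Jet's production code) that computes cart complexity from the per-SKU offer counts:

let cartComplexity (offersPerSku: int list) =
    offersPerSku |> List.fold (fun acc n -> acc * int64 n) 1L

// The cart in Figure 1: four SKUs with 3, 3, 2 and 4 offers each.
let figure1 = cartComplexity [3; 3; 2; 4]   // 72L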

To illustrate both the complexity of the merchant selection problem and the application of the full search year metric, we dug through our logs to find a cart which timed out when a customer tried to place an order. These timeouts tend to occur most often when a large number of merchants can fulfill each item, as is common with electronics. Figure 5 shows an example in which the customer was ordering components to build a computer.

Figure 5: Example of a cart that required 70,442,237,952,000 (70 trillion!) combinations, leading to a timeout.

The chart in Figure 5 shows the number of offers for each of the 11 retail SKUs in the customer’s cart. The cart complexity is 32 × 17 × 19 × 16 × 29 × 9 × 25 × 10 × 16 × 17 × 24 = 70,442,237,952,000 ≈ 10^13.85 combinations. The time required to find the optimal fulfillment for this cart is 8.6 CPU full search years or 11 GPU full search days! This is well outside our target response time of a few seconds. Given this constraint, the problem quickly grows larger than what GPU full search can handle within the target time. We needed a better, more scalable solution.

A Genetic Algorithm

To address the scalability issues, we decided to apply genetic algorithms (GA) to solve the problem. Two important points to consider with the GA approach:

  1. GA can only find approximately optimal solutions.
  2. A standard GA will not work because our search space is astronomically large and we need a reliable approximate solution in near real-time.

Jet.com’s constraints on response time limit how long we can spend producing consecutive generations of the population. Moreover, since each generation depends on the previous one, iterating generations is an inherently serial process and we can’t reduce computation time by parallelizing across iterations.

We used four methods to address these issues:

  1. Dramatically increasing the population size to reduce the number of iterations required for convergence.
  2. Improving the initial population by including merchant combinations which are likely to be good. For example, rather than starting with an initial population of fully randomized combinations, we include combinations in which single merchants fulfill multiple retail SKUs since this tends to reduce shipping costs in most cases.
  3. Identifying parts of the population which can be used to guide the behavior of mutation and crossover operators.
  4. Leveraging AI/ML to choose the appropriate configuration for the GA.

The fourth method was non-trivial and introduced some of its own unique challenges which I hope to cover in more detail in a future blog post.

Implementation

I’m going to take you through some implementation details of the GPU full search algorithm (GPUFS), but first I want to briefly discuss algorithm selection. There are three merchant selection algorithms we can use: the CPU full search (CPUFS), the GPU full search, and the GPU genetic algorithm (GPUGA). Each approach has its own strengths and weaknesses depending on the situation.

The implementation uses a decision function to choose between the three algorithms based on cart complexity and load. For example, consider the case of a single-item checkout. Pricing a single-item cart on the GPU would take longer than pricing it on the CPU due to the cost of data transfer to the GPU. For more complex carts, if we can perform merchant selection within an acceptable timeframe using GPUFS, we should choose it over GPUGA, since GAs do not guarantee optimal results.

Choosing the best algorithm for the task is unfortunately not as simple as “if cart = small then do CPUFS elif cart = big then do GPUFS elif cart = huge then do GPUGA.” We needed a smart decision function to address these issues so we used machine learning to train a model based on cart features and used it to improve the function’s results. In order to train this model appropriately and employ multiple algorithms we had to develop a way to validate, appraise, and compare multiple algorithms on production data. I hope to share details of this work in future blog posts as they are out of scope of our current discussion. For now, let’s focus on the details of the GPUFS algorithm and the microservice that invokes it.

Figure 6: An example local market supply for a cart of three items.

First I need to explain two core concepts: local market supply and allocation. The local market supply is an array of mappings of fulfillment nodes to offers for each retail SKU in a cart. Figure 6 shows an example LocalMarketSupply data structure; you can see (for example) that Node 0 can fulfill SKU 0 from two possible offers.

The goal of the search algorithms is to find the cheapest fulfillment combination for the items in the customer’s cart. Consider a scenario in which a customer initiates a checkout with three items: SKU 0, SKU 1, and SKU 2. Jet.com’s microservice queries a database to find all offers for these three retail SKUs and then uses this information to build the LocalMarketSupply structure for the cart.

The actual search space of the merchant selection problem is the set of all possible fulfillment combinations, which we refer to as allocations. A single allocation is an array of integers representing a combination of fulfillment options for the set of retail SKUs being priced. The indices of this array represent the IDs of the retail SKUs, and the value at each index is the offer ID chosen for that retail SKU.

A = [o_{r_0} \ldots o_{r_i}]

Therefore the set of allocations for the local market supply defined in Figure 6 would resemble the table in Figure 7.

Figure 7: The full search space for the hypothetical local market supply shown in Figure 6.

Full Search

Jet.com’s full search implementation is straightforward. One kernel function processes all possible allocations, pricing each one and performing a min reduction to find the cheapest.

To aid explanation I’ll divide the full search implementation into two conceptual parts: search and pricing. The search part essentially refers to the kernel and the code responsible for finding the cheapest allocation. The pricing part then refers to code that actually prices an individual allocation.  Let’s look at the search part first.

The kernel has a main while loop that follows a familiar strided access pattern.

// Because we need to repeatedly refer to our local market supply to
// determine which retail skus a fulfillment node can fulfill we store the
// compressed supply in shared memory.
let shared = __shared__.ExternArray(8) |> __address_of_array
let supply = CompressedSupply.LoadToShared(supply, shared.Reinterpret(), numRetailSkus)

let mutable minAllocation = -1L
let mutable minPrice = RealMax
let start = blockIdx.x * blockDim.x + threadIdx.x
let stride = gridDim.x * blockDim.x
let mutable localLinear = int64 start

while localLinear < numElements do
    let allocation = localLinear + linearStart
    let price = price allocation

    if price < minPrice then
        minPrice <- price
        minAllocation <- allocation

    localLinear <- localLinear + (int64 stride)

We perform a warp reduce to find the best price/allocation for each warp.

let mutable pricedAllocation = PricedAllocation(minPrice, minAllocation)

for i = 0 to Util.WarpSizeLog - 1 do
    let offset = 1 <<< i
    let peer = DeviceFunction.ShuffleDown(pricedAllocation, offset, Util.WarpSize)
    pricedAllocation <- PricedAllocation.Min pricedAllocation peer

// Synchronize threads because we reuse the shared memory
__syncthreads()

Then we prepare for a block reduce by adding the warp reduce result to shared memory.

let mutable shared = shared.Reinterpret()

if threadIdx.x &&& Util.WarpSizeMask = 0 then
    let warpId = threadIdx.x >>> Util.WarpSizeLog
    shared.[warpId] <- pricedAllocation

__syncthreads()

Next we perform a block reduce to find the best price and allocation among the warps within each block.

if threadIdx.x = 0 then
    for warpId = 1 to (blockDim.x / Util.WarpSize) - 1 do
        pricedAllocation <- PricedAllocation.Min pricedAllocation shared.[warpId]
    output.Prices.[blockIdx.x] <- pricedAllocation.Price
    output.Allocations.[blockIdx.x] <- pricedAllocation.Allocation

Now we need to retrieve the results from the GPU, find the price/allocation pair with the minimum price, and then decode the respective allocation from int64 back to an array of int.

let decodeAllocation =
    // A jagged array of offers by retail sku id, the first dimension
    // runs over all **sorted** retail sku ids and the second dimension
    // are the offers for the given retail sku id
    let offerIdsByRetailSkuId =
        LocalMarket.localMarketSupplyToOfferIdsByRetailSkuId pr.LocalMarketSupply
    let dims = offerIdsByRetailSkuId |> Array.map (fun fids -> fids.Length)
    let indexer = RowMajor(dims)

    fun allocation ->
        let indices = allocation |> indexer.ToIndex
        indices |> Array.mapi (fun rsid idx -> offerIdsByRetailSkuId.[rsid].[idx])

Finally the getResults function copies the array of grid minima back to the host and performs a min by price to find the global minimum along with its accompanying allocation.

let getResults() =
    let prices = Gpu.CopyToHost(prices)
    let allocations = Gpu.CopyToHost(allocations)
    let price, allocation = Array.zip prices allocations |> Array.minBy fst
    let allocation = decodeAllocation allocation
    price, allocation

Now that we’ve seen the general flow of the search kernel, let’s look into the pricing code. The allocation price function is defined within the kernel function.

let price (allocation:int64) =
    generalPrice 
        numRetailSkus numFulfillmentNodes
        (fun i -> fulfillmentNodes.[i])
        (fun i -> shippingRules.[i])
        (fun i -> commissionRules.[i])
        (loopOffers allocation)

The generalPrice function prices an allocation by calling priceFulfillmentNode on an incrementing fulfillmentNodeId within a while loop. The priceFulfillmentNode function returns the number of SKUs computed in the allocation so the loop can exit early if possible. Note that generalPrice and loopOffers are both higher order functions. price uses partial function application and passes the partially applied loopOffers function along to generalPrice.
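
If the partial-application pattern is unfamiliar, the idea in miniature looks like this (illustrative F# only, not code from the engine):

// Supplying only the first argument of a two-argument function yields a
// new function that waits for the rest; this is how price pins the
// allocation argument onto loopOffers before handing it to generalPrice.
let scale (factor: float) (x: float) = factor * x
let twice = scale 2.0       // factor is fixed, x is still pending
let answer = twice 21.0     // 42.0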

The loopOffers function loops over the offers of the allocation which are being fulfilled by fulfillment node fnid. Every time loopOffers is called, it loops over each retail SKU but only invokes the function f if the fulfillment node id of the offer for that retail SKU is equal to the current node being priced. Implementing loopOffers as a higher order function provides an abstraction over the various updateXYZ functions used within priceFulfillmentNode.

let loopOffers (allocation:int64) (fnid:int) (f:Offer -> Sku -> unit) =
    let offset = ref allocation
    let f (iter:AllocationIterator) (rsid:int) =
        let idx = AllocationIterator.Decode(iter, offset)
        let oid = supply.OfferIds.[iter.SupplyOffset + idx]
        let offer = offers.[oid]
        let sku = retailSkus.[rsid]
        if offer.Id = oid && offer.fulfillmentNodeId = fnid && offer.retailSkuId = rsid then
            f offer sku
    // The AllocationIterator type is used only by the GPU full 
    // search implementation where we encode allocations of int[]
    // into int64 to improve performance and use less memory.
    AllocationIterator.Iterate(supply, numRetailSkus, f)

The reduction in memory use provided by doing this int[] to int64 encoding is important for full search since it must enumerate all possible allocations. Using encoded allocations significantly increases the capabilities of the full search algorithm with regard to the level of cart complexity the algorithm can handle.
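
The encoding itself is plain mixed-radix (row-major) arithmetic; here is a minimal sketch of the idea (my own illustration, not the actual AllocationIterator implementation):

// Pack one offer index per retail SKU into a single int64, given the
// number of offers available for each SKU, and unpack it again.
let encode (dims: int[]) (allocation: int[]) =
    Array.fold2 (fun acc d i -> acc * int64 d + int64 i) 0L dims allocation

let decode (dims: int[]) (code: int64) =
    let indices = Array.zeroCreate dims.Length
    let mutable rest = code
    for k = dims.Length - 1 downto 0 do
        indices.[k] <- int (rest % int64 dims.[k])
        rest <- rest / int64 dims.[k]
    indices

// For example, encode [|3; 3; 2; 4|] [|2; 1; 0; 3|] = 59L, and decode inverts it.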

priceFulfillmentNode prices the set of offers being fulfilled by fulfillment node fnid, and updates various aspects of the order such as the total order price, total weight, and shipping cost.

let priceFulfillmentNode fnid
                         (fulfillmentNode:FulfillmentNode)
                         (shippingRules:int -> ShippingRule)
                         (commissionRules:int -> CommissionRule)
                         (loopOffers:int -> (Offer -> Sku -> unit) -> unit) =
    let mutable orderPrice = 0.0
    let mutable numSkusPriced = 0

    let order = ref (Order())
    let orderSums = ref (OrderSums())
    let updateOrderSums = updateOrderSums orderSums
    let updateShippingCost = updateShippingCost shippingRules orderSums order
    let updateOfferPrice = updateOfferPrice commissionRules orderSums order
    loopOffers fnid updateOrderSums
    if orderSums.contents.linesCount > 0 then
        numSkusPriced <- orderSums.contents.linesCount
        updateShippingCost fulfillmentNode
        loopOffers fnid (updateOfferPrice fulfillmentNode)
        let order = !order
        orderPrice <- order.totalNetPrice + order.totalShipping - order.commission
    orderPrice, numSkusPriced

Microservice Layer

At a high level, Jet.com’s microservices are just simple executables that listen to a route for an HTTP request and translate the body JSON to an F# record containing all the data necessary to perform the merchant selection operation. The microservices reference the pricing engine library and use it accordingly.

In addition to testing the core performance of the GPU algorithm, we also needed to address how the GPU microservice would handle high load. That is, what do we do when a request is received before the GPU is done with the previous calculation? Our solution uses a BlockingCollection of size 1 from System.Collections.Concurrent. The TryAdd member attempts to add an element to the collection within the specified wait time.

// 4 second wait time
let [<Literal>] private MAX_WAIT_TIME = 4000
// With a collection size of 1, once an element has been added to it 
// further adds are blocked until the element is removed.
let requestQueue = new BlockingCollection<_>(1)
let cheapestAllocation () =
    match pricingAlgorithm with
    | GpuFullSearch | PricingAlgorithm.GpuGenetic ->
        // We try for 4 seconds to add a new request
        if requestQueue.TryAdd(merchantSelectionRequest, MAX_WAIT_TIME) then
            try
                let result = pricingAlgorithm.Invoke pr
                // Once we have the result, remove the request from the
                // collection so we can process the next one
                requestQueue.Take() |> ignore
                Success(result)
            with exn ->
                requestQueue.Take() |> ignore
                Failure exn
        else
            // If we are unable to add a new request to the collection within 
            // 4 seconds, fail with GPU busy message
            Failure(new Exception("GPU busy"))
    | _ ->
        // CPU algorithms aren’t blocked
        pricingAlgorithm.Invoke pr
        |> Success

This simple solution works well. Due to the improved performance on medium to large-size carts, one Azure N-Series GPU machine is able to handle the request load of multiple Azure D3 (CPU-only) instances.

Conclusion: GPU-accelerated Fulfillment with F# in the Cloud

This post introduced Jet.com's exponentially complex merchant selection problem and our approach to solving it within an environment of microservices written in F# and running in the Azure cloud. I covered implementation details of a brute-force GPU search approach to merchant selection and how this algorithm is used from a RESTful microservice. In future posts I hope to expand on our genetic algorithm approach, how we validated and compared multiple algorithms on production data, and different ways we have used AI and machine learning to improve the performance of our pricing engine.

If you’d like to learn more about some of the GPU-related work going on at Jet.com be sure to check out these two GTC presentations given by Daniel Egloff:

  • “Prices Drop as You Shop: How Walmart is Using Jet’s GPU-based Smart Merchant Selection to Gain a Competitive Advantage” (MP4, PDF)
  • “Welcome to the Jet Age – How AI and Deep Learning Make Online Shopping Smarter at Walmart” (MP4, PDF)

Check out the Jet Technology Blog to learn more about Jet.com and the problems we are working on, and feel free to reach out to me directly or in the comments below.

Acknowledgements

I would like to thank Daniel Egloff, Neerav Kothari, Xiang Zhang, and Andrew Shaeffer. This work would not have been possible without them.

Evolution of <img>: Gif without the GIF

  • GIFs are awesome but terrible for quality and performance
  • Replacing GIFs with <video> is better but has perf. drawbacks: not preloaded, uses range requests
  • Now you can use <img src="*.mp4"> in Safari Technology Preview
  • Early results show mp4s in <img> tags display 20x faster and decode 7x faster than the GIF equivalent – in addition to being 1/14th the file size!
  • Background CSS video & Responsive Video can now be a “thing”.
  • Finally cinemagraphs without the downsides of GIFs!
  • Now we wait for the other browsers to catch-up: This post is 46MB on Chrome but 2MB in Safari TP

Special thanks to: Eric Portis, Jer Noble, Jon Davis, Doron Sherman, and Yoav Weiss.

I both love and hate animated GIFs. [Image: Ode to Geocities. Thanks Tim.]

Safari Tech Preview has changed all of this. Now I love and love animated “GIFs”.

Everybody loves animated Gifs!

Animated GIFs are a hack. To quote from the original GIF89a specification:

The Graphics Interchange Format is not intended as a platform for animation, even though it can be done in a limited way.

But they have become an awesome tool for cinemagraphs, memes, and creative expression. All of this awesomeness, however, comes at a cost. Animated GIFs are terrible for web performance. They are HUGE in size, impact cellular data bills, require more CPU and memory, cause repaints, and are battery killers. Typically GIFs are 12x larger files than H.264 videos, and take 2x the energy to load and display in a browser. And we’re spending all of those resources on something that doesn’t even look very good – the GIF 256 color limitation often makes GIF files look terrible (although there are some cool workarounds).

My daughter loves them – but she doesn’t understand why her battery is always dead.

GIFs have many advantages: they are requested immediately by the browser preloader, they play and loop automatically, and they are silent! Implicitly they are also shorter. Market research has shown that users have higher engagement with, and generally prefer, both micro-form video (< 1 minute) and cinemagraphs (stills with subtle movement) over longer-form videos and still images. Animated GIFs are great for user experience.

videos that are <30s have highest conversion

So how did I go from love/hating GIFs to love/loving “Gifs”? (capitalization change intentional)

In the latest Safari Tech Preview, thanks to some hard work by Jer Noble, we can now use MP4 files in <img> tags. The intended use case is not long-form video, but micro-form, muted, looping video – just like GIFs. Take a look for yourself:

<img src="rocky.mp4">
Rocky!

Cool! This is going to be awesome on so many fronts – for business, for usability, and particularly for web performance!

As many have already pointed out, using the <video> tag is much better for performance than using animated GIFs. That's why in 2014 Twitter famously added animated GIF support by not adding GIF support. Twitter instead transcodes GIFs to MP4s on-the-fly, and delivers them inside <video> tags. Since all browsers now support H.264, this was a very easy transition.

<video autoplay loop muted playsinline>
<source src="eye-of-the-tiger-video.webm" type="video/webm">
<source src="eye-of-the-tiger-video.mp4" type="video/mp4">
<img src="eye-of-the-tiger-fallback.gif"/>
</video>

Transcoding animated GIFs to MP4 is fairly straightforward. You just need to run ffmpeg -i source.gif output.mp4
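
In practice a couple of extra flags help. Here is a sketch (these options go beyond the one-liner above and are suggestions, not anything Twitter has documented): H.264 requires even pixel dimensions, yuv420p gives the widest playback compatibility, and +faststart moves the metadata to the front of the file so playback can begin before the download completes.

ffmpeg -i source.gif \
       -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" \
       -pix_fmt yuv420p -movflags +faststart \
       output.mp4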

However, not everyone can overhaul their CMS and convert <img> to <video>. Even if you can, there are three problems with this method of delivering GIF-like (Gif), micro-form video:

1. Browser performance is slow with <video>

As Doug Sillars recently pointed out in an HTTP Archive post, there is a huge performance penalty in the visual presentation when using the <video> tag.

Sites without video load about 28 percent faster than sites with video

Unlike <img> tags, browsers do not preload <video> content. Generally preloaders only preload JavaScript, CSS, and image resources because they are critical for the page layout. Since <video> content can be any length – from micro-form to long-form – <video> tags are skipped until the main thread is ready to parse its content. This delays the loading of <video> content by many hundreds of milliseconds.


For example, the hero video at the top of the Velocity conference page is only requested 5 full seconds into the page load. It’s the 27th requested resource and it isn’t even requested until after Start Render, after webfonts are loaded.

Worse yet, many browsers assume that <video> tags contain long-form content. Instead of downloading the whole video file at once, which would waste your cell data plan in cases where you do not end up watching the whole video, the browser will first perform a 1-byte request to test if the server supports HTTP Range Requests. Then it will follow with multiple range requests in various chunk sizes to ensure that the video is adequately (but not over-) buffered. The consequence is multiple TCP round trips before the browser can even start to decode the content and significant delays before the user sees anything. On high-latency cellular connections, these round trips can set video loads back by hundreds or thousands of milliseconds.
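
The exchange looks roughly like this (illustrative only; the exact ranges, chunk sizes, and file length are made up, and the pattern varies by browser and server):

GET /hero.mp4 HTTP/1.1
Range: bytes=0-1

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-1/4287530
Content-Length: 2

GET /hero.mp4 HTTP/1.1
Range: bytes=0-524287

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-524287/4287530
Content-Length: 524288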


And what performs even worse than the native <video> element? The typical JavaScript video player. Often, the easiest way to embed a video on a site is to use a hosted service like YouTube or Vimeo and avoid the complexities of video encoding, hosting, and UX. This is normally a great idea, but for micro-form video, or critical content like hero videos, it just adds to the delay because of the JavaScript players and supporting resources these hosting services inject (CSS/JS/JPG/WOFF). In addition to the <video> markup, you are forcing the browser to download, evaluate, and execute the JavaScript player, and only then can the video start to load.


As many people know, I love my Loki jacket because of its built in mitts, balaclava, and a hood that is sized for helmets. But take a look at the Loki USA homepage – which uses a great hero-video, hosted on Vimeo:

lokiusa.com filmstrip
lokiusa.com video

If you look closely, you can see that the JavaScript for the player is actually requested soon after DOM Complete. But it isn’t fully loaded and ready to start the video stream until much later.

lokiusa.com waterfall

WPT Results

2. You can’t right click and save video

Most long-form video content – vlogs, TV, movies – is delivered via JavaScript-based players. Usually these players provide users with a convenient “share now” link or bookmark tool, so they can come back to YouTube (or wherever) and find the video again. In contrast, micro-form content – like memes and cinemagraphs – usually doesn’t come via a player, and users expect to be able to download GIFs and send them to friends, like they can with any image on the web. That meme of the dancing cat was sooo funny – I have to share it with all my friends!

If you use <video> tags to deliver micro-form video, users can’t right-click, click-and-drag, or force touch, and save. And their dancing-cat joy becomes a frustrating UX surprise.

3. Autoplay abuse

Finally, using <video> tags and MP4s instead of <img> tags and GIFs brings you into the middle of an ongoing cat-and-mouse game between browsers and unconscionable ad vendors, who abuse the <video autoplay> attribute in order to get users' attention. Historically, mobile browsers have ignored the autoplay attribute and/or refused to play videos inline, requiring them to go full screen. Over the last couple of years, Apple and Google have both relaxed their restrictions on inline, autoplaying videos, allowing for Gif-like experiences with the <video> tag. But again, ad networks have abused this, causing further restrictions: if you want to autoplay <video> tags, you need to mark the content with muted or remove the audio track altogether.

The GIF format isn’t the only animation-capable, still-image format. WebP and PNG have animation support, too. But, like GIF, they were not designed for animation and result in much larger files, compared to dedicated video codecs like H.264, H.265, VP9, and AV1.

Animated PNG is now widely supported across all browsers, and while it addresses the color palette limitation of GIF, it is still an inefficient file format for compressing video.

Animated WebP is better, but compared to true video formats it's still problematic. Aside from not having a formal standard, animated WebP lacks chroma subsampling and wide-gamut support. Further, the ecosystem of support is fragmented. Not even all versions of Android, Chrome, and Opera support animated WebP, even though those browsers advertise support via the Accept: image/webp request header. You need Chrome 42+, Opera 15+ or Android 5+.

So while animated WebPs compress much better than animated GIFs or aPNGs, we can do better. (See file size comparisons below)

By enabling true video formats (like MP4) to be included in <img> tags, Safari Technology Preview has fixed these performance and UX problems. Now our micro-form videos can be small and efficient (like MP4s delivered via the <video> tag), and they can be easily preloaded, autoplayed, and shared (like our old friend, the GIF).

<img src="ottawa-river.mp4">

So how much faster is this going to be? Pull up the developer tools and see the difference in Safari Technology Preview and other browsers:

Take a look at this!

Unfortunately Safari doesn’t play nice with WebPageTest, and creating reliable benchmark tests is complicated. Likewise, Tech Preview’s usage is fairly low, so comparing performance with RUM tools is not yet practical.

We can, however, do two things. First, compare raw byte sizes, and second, use the Image.decode() promise to measure the device impact of different resources.

Byte Savings

First, the byte size savings. To compare, I took the top 100 trending animated GIFs from giphy.com and transcoded them into VP8, VP9, animated WebP, H.264, and H.265.

NB: These results should be taken as directional only! Each codec could be tuned much further; as you can see, VP9 fares worse here than the default VP8 output. A more comprehensive study should be done that considers SSIM.
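
For reference, a conversion script along these lines reproduces the comparison (a sketch with default encoder settings, not the exact commands used):

# Batch-convert GIFs to each format with ffmpeg's default encoder settings.
# Note: H.264/H.265 need even pixel dimensions (see the scale filter shown earlier).
for f in *.gif; do
  ffmpeg -i "$f" -c:v libvpx                   "${f%.gif}-vp8.webm"    # WebM/VP8
  ffmpeg -i "$f" -c:v libvpx-vp9               "${f%.gif}-vp9.webm"    # WebM/VP9
  ffmpeg -i "$f" -c:v libx264 -pix_fmt yuv420p "${f%.gif}-h264.mp4"    # MP4/H.264
  ffmpeg -i "$f" -c:v libx265                  "${f%.gif}-h265.mp4"    # MP4/H.265
  ffmpeg -i "$f" -c:v libwebp -loop 0          "${f%.gif}.webp"        # animated WebP
done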

Below are the median (p50) results of the conversion:

Format        Bytes (p50)    % change (p50)
GIF           1,713 KB
WebP          310 KB         -81%
WebM/VP8      57 KB          -97%
WebM/VP9      66 KB          -96%
WebM/AV1      TBD
MP4/H.264     102 KB         -93%
MP4/H.265     43 KB          -97%

Yes, animated WebP is smaller than GIF, but any true video format is much smaller still. This shouldn't surprise anyone, since these modern video codecs are highly optimized for online video streaming. H.265 fares very well, as I expect AV1 will too.

The benefits here will not only be faster transit but also substantial $$ savings for end users.

Net-Net, using video in <img> tags is going to be much faster on a cellular connection.

Decode and Visual Performance Improvements

Next, let's consider the impact of decode and display on the browsing experience. H.264 (and H.265) has the notable advantage of being hardware decoded rather than tying up the primary CPU cores for decode.

How can we measure this? Since browsers haven't yet implemented the proposed hero image API, we can use Steve Souders' User Timing and Custom Metric strategy as a good approximation of when the image starts to display to the user. It doesn't measure frame rate, but it tells us roughly when the first frame is displayed. Better yet, we can also use the newly adopted Image.decode() promise to measure decode performance. In the test page below, I inject a unique GIF and MP4 in an <img> tag 100 times and compare the decode and paint performance.

// Wrapped in a Promise so both first paint (onload) and decode() can be timed
function loadAndDecode(src) {
  return new Promise((resolve) => {
    let image = new Image();
    t_startReq = new Date().getTime();                 // start-of-request timestamp
    document.getElementById("testimg").appendChild(image);
    image.onload = timeOnLoad;                         // records time to first frame
    image.src = src;
    image.decode().then(() => { resolve(image); });    // resolves once decoding completes
  });
}

The results are quite impressive! Even on my powerful 2017 MacBook Pro, running the test locally, with no network throttling, we can see GIFs taking 20x longer than MP4s to draw the first frame (signaled by the onload event), and 7x longer to decode!

Localhost test on 2017 i7 MacBook Pro

Curious? Clone the repo and test for yourself. I will note that adding network conditions to the transit of the GIF vs. MP4 will disproportionately skew the test results. Specifically, since decoding can start before the last byte arrives, the delta between transfer, display, and decode becomes much smaller. What this really tells us is that the byte savings alone will substantially improve the user experience. However, factoring out the network as I've done in this localhost run, you can see that using video also has substantial benefits for energy consumption.

So now that Safari Technology Preview supports this design pattern, how can you actually take advantage of it, without serving broken images to non-supporting browsers? Good news! It’s relatively easy.

Option 1: Use Responsive Images

Ideally, the simplest way is to use the type attribute of the <source> element inside the HTML5 <picture> tag.

<picture>
<source type="video/mp4" srcset="cats.mp4">
<source type="image/webp" srcset="cats.webp">
<img src="cats.gif">
</picture>

I'd like to say we can stop there. However, there is a nasty WebKit bug in Safari that causes the preloader to download the first <source> regardless of the mime-type declaration. The main DOM loader realizes the error and selects the correct one, but by then the damage is done: the preloader squanders its opportunity to download the image early and, on top of that, downloads the wrong version, wasting bytes. The good news is that I've patched this bug and it should land in Safari TP 45.

In short, using <picture> and <source type> for mime-type selection is not advisable until a fixed version of Safari reaches 90%+ of the Safari user base.

Option 2: Use MP4, animated WebP and Fallback to GIF

If you don't want to change your HTML markup, you can use HTTP content negotiation to send MP4s to Safari. To do so, you must generate multiple copies of your cinemagraphs (just like before) and Vary responses based on both the Accept and User-Agent headers.

This will get a bit cleaner once WebKit BUG 179178 is resolved and you can add a test for the Accept: video/* header (much like the way you can test for Accept: image/webp). But the end result is that each browser gets the best format for <img>-based micro-form videos that it supports:

Browser          Accept Header          Response
Safari TP 41+                           H.264 MP4
                 Accept: video/mp4      H.264 MP4
Chrome 42+       Accept: image/webp     aWebP
Opera 15         Accept: image/webp     aWebP
                 Accept: image/apng     aPNG
Default                                 aGIF

In nginx this would look something like:


map $http_user_agent $mp4_suffix {
    default   "";
    "~*Safari/605"  ".mp4";
}

location ~* \.(gif)$ {
      add_header Vary "Accept, User-Agent";
      try_files $uri$mp4_suffix $uri =404;
}

Of course, don’t forget the Vary: Accept, User-Agent to tell coffee-shop proxies and your CDN to cache each response differently. In fact, you should probably mark the Cache-Control as private and use TLS to ensure that the less sophisticated ISP Performance-Enhancing-Proxies don’t cache the content.

GET /example.gif HTTP/1.1
Accept: image/png, video/*, */*
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/605.1.13 (KHTML, like Gecko) Version/11.1 Safari/605.1.13

…

HTTP/1.1 200 OK
Content-Type: video/mp4
Content-Length: 22378567
Vary: Accept, User-Agent

Option 3: Use RESS and Fall Back to the <video> tag

If you can manipulate your HTML, you can adopt the Responsive-Server-Side (RESS) technique. This option moves the browser detection logic into your HTML output.

For example, you could do it like this with PHP:

<?php if (strpos($_SERVER['HTTP_USER_AGENT'], "Safari/605") !== false) { // Safari TP supports MP4 in <img> ?>
<img src="example.mp4">
<?php } else { ?>
<img src="example.gif">
<?php } ?>

As above, be sure to emit a Vary: User-Agent response to inform your CDN that there are different versions of your HTML to cache. Some CDNs automatically honour the Vary headers while others can support this with a simple update to the CDN configuration.

Bonus: Don’t forget to remove the audio track

Now, since your original is an MP4 and you are converting MP4s to GIFs (not the other way around), we should also remember to strip the audio track for extra byte savings. (Please tell me you aren't using GIFs as your originals. Right?!) Audio tracks add extra bytes to the file size that we can quickly strip off, since we know the video will be played on mute anyway. The simplest way with ffmpeg is:

ffmpeg -i cats.mp4 -vcodec copy -an cats-muted.mp4

As I’m writing this, Safari will blindly download whatever video you specify in the <img> tag, no matter how long it is. On the one hand, this is expected because it helps improve the performance of the browser. Yet, this can be deadly if you push down a 120-minute video to the user. I’ve tested multiple sizes and all were downloaded as long as the user hung around. So, be courteous to your users. If you want to push longer form video content, use the <video> tag for better performance.

Now that we can deliver MP4s via <img> tags, doors are opening to many new use cases. Two that come to mind: responsive video, and background videos. Now that we can put MP4s in srcsets, vary our responses for them using Client Hints and Content-DPR, art direct them with <picture media>, well – think of the possibilities!

<img src="cat.mp4" alt="cat"
  srcset="cat-160.mp4 160w, cat-320.mp4 320w, cat-640.mp4 640w, cat-1280.mp4 1280w"
  sizes="(max-width: 480px) 100vw, (max-width: 900px) 33vw, 254px">

Video in CSS background-image: url(.mp4) works, too!

<div style="width:800px; height:200px; background-image:url(colin.mp4)"></div>

By enabling video content in <img> tags, Safari Technology Preview is paving the way for awesome Gif-like experiences, without the terrible performance and quality costs associated with GIF files. This functionality will be fantastic for users, developers, designers, and the web. Besides the enormous performance wins that this change enables, it opens up many new use cases that media and ecommerce businesses have been yearning to implement for years. Here’s hoping the other browsers will soon follow. Google? Microsoft? Mozilla? Samsung? Your move!

Books I read this year


Turtles and jazz chickens

December 4, 2017

Reading is my favorite way to indulge my curiosity. Although I’m lucky that I get to meet with a lot of interesting people and visit fascinating places through my work, I still think books are the best way to explore new topics that interest you.

This year I picked up books on a bunch of diverse subjects. I really enjoyed Black Flags: The Rise of ISIS by Joby Warrick. I recommend it to anyone who wants a compelling history lesson on how ISIS managed to seize power in Iraq.

On the other end of the spectrum, I loved John Green’s new novel, Turtles All the Way Down, which tells the story of a young woman who tracks down a missing billionaire. It deals with serious themes like mental illness, but John’s stories are always entertaining and full of great literary references.

Another good book I read recently is The Color of Law by Richard Rothstein. I’ve been trying to learn more about the forces preventing economic mobility in the U.S., and it helped me understand the role federal policies have played in creating racial segregation in American cities.

I’ve written longer reviews about some of the best books I read this year. They include a memoir by one of my favorite comedians, a heartbreaking tale of poverty in America, a deep dive into the history of energy, and not one but two stories about the Vietnam War. If you’re looking to curl up by the fireplace with a great read this holiday season, you can’t go wrong with one of these.

The Best We Could Do, by Thi Bui. This gorgeous graphic novel is a deeply personal memoir that explores what it means to be a parent and a refugee. The author’s family fled Vietnam in 1978. After giving birth to her own child, she decides to learn more about her parents’ experiences growing up in a country torn apart by foreign occupiers.

Evicted: Poverty and Profit in the American City, by Matthew Desmond. If you want a good understanding of how the issues that cause poverty are intertwined, you should read this book about the eviction crisis in Milwaukee. Desmond has written a brilliant portrait of Americans living in poverty. He gave me a better sense of what it is like to be poor in this country than anything else I have read.

Believe Me: A Memoir of Love, Death, and Jazz Chickens, by Eddie Izzard. Izzard’s personal story is fascinating: he survived a difficult childhood and worked relentlessly to overcome his lack of natural talent and become an international star. If you’re a huge fan of him like I am, you’ll love this book. His written voice is very similar to his stage voice, and I found myself laughing out loud several times while reading it.

The Sympathizer, by Viet Thanh Nguyen. Most of the books I’ve read and movies I’ve seen about the Vietnam War focused on the American perspective. Nguyen’s award-winning novel offers much-needed insight into what it was like to be Vietnamese and caught between both sides. Despite how dark it is, The Sympathizer is a gripping story about a double agent and the trouble he gets himself into.

Energy and Civilization: A History, by Vaclav Smil. Smil is one of my favorite authors, and this is his masterpiece. He lays out how our need for energy has shaped human history—from the era of donkey-powered mills to today’s quest for renewable energy. It’s not the easiest book to read, but at the end you’ll feel smarter and better informed about how energy innovation alters the course of civilizations.


Apple is sharing your facial wireframe with apps


Poop that mimics your facial expressions was just the beginning.

It’s going to hit the fan when the face-mapping tech that powers the iPhone X’s cutesy “Animoji” starts being used for creepier purposes. And Apple just started sharing your face with lots of apps.

Beyond a photo, the iPhone X’s front sensors scan 30,000 points to make a 3D model of your face. That’s how the iPhone X unlocks and makes animations that might have once required a Hollywood studio.

Now that a phone can scan your mug, what else might apps want to do with it? They could track your expressions to judge if you’re depressed. They could guess your gender, race and even sexuality. They might combine your face with other data to observe you in stores—or walking down the street.

Apps aren’t doing most of these things, yet. But is Apple doing enough to stop it? After I pressed executives this week, Apple made at least one change—retroactively requiring an app tapping into face data to publish a privacy policy.

“We take privacy and security very seriously,” Apple spokesman Tom Neumayr said. “This commitment is reflected in the strong protections we have built around Face ID data—protecting it with the Secure Enclave in iPhone X—as well as many other technical safeguards we have built into iOS.”

Indeed, Apple—which makes most of its money from selling us hardware, not selling our data—may be our best defense against a coming explosion in facial recognition. But I also think Apple rushed into sharing face maps with app makers that may not share its commitment, and it isn’t being paranoid enough about the minefield it just entered.

“I think we should be quite worried,” said Jay Stanley, a senior policy analyst at the American Civil Liberties Union. “The chances we are going to see mischief around facial data is pretty high—if not today, then soon—if not on Apple then on Android.”

Your face is open for business

Apple’s face tech sets some good precedents—and some bad ones. It won praise for storing the face data it uses to unlock the iPhone X securely on the phone, instead of sending it to its servers over the Internet.

Less noticed was how the iPhone lets other apps now tap into two eerie views from the so-called TrueDepth camera. There’s a wireframe representation of your face and a live read-out of 52 unique micro-movements in your eyelids, mouth and other features. Apps can store that data on their own computers.

To see for yourself, use an iPhone X to download an app called MeasureKit. It exposes the face data Apple makes available. The app’s maker, Rinat Khanov, tells me he’s already planning to add a feature that lets you export a model of your face so you can 3D print a mini-me.

The Post's Geoffrey A. Fowler shows MotionKit, an app that shows users what facial data is being sent to other apps. (The Washington Post)

“Holy cow, why is this data available to any developer that just agrees to a bunch of contracts?” said Fatemeh Khatibloo, an analyst at Forrester Research.

Being careful is in Apple’s DNA—it has been slow in opening home and health data with outsiders. But it also views the face camera as a differentiator, helping position Apple as a leader in artificial intelligence and augmented reality.

Apple put some important limits on apps. It requires “that developers ask a user’s permission before accessing the camera, and that apps must explain how and where this data will be used,” Apple's Neumayr said.

And Apple’s rules say developers can’t sell face data, use it to identify anonymous people or use it for advertising. They’re also required to have privacy policies.

“These are all very positive steps,” said Clare Garvey, an associate at Georgetown University’s Center on Privacy & Technology.

Privacy holes

Still, it wasn’t hard for me to find holes in Apple’s protections.

The MeasureKit app’s maker told me he wasn’t sensing much extra scrutiny from Apple for accessing face data.

“There were no additional terms or contracts. The app review process is quite regular as well—or at least it appears to be, on our end,” Khanov said. When I noticed his app didn’t have a privacy policy, Khanov said Apple didn’t require it because he wasn’t taking face data off the phone.

After I asked Apple about this, it called Khanov and told him to post a privacy policy.

“They said they noticed a mistake and this should be fixed immediately,” Khanov said. “I wish Apple were more specific in their App Review Guidelines."

The bigger concern: “How realistic is it to expect Apple to adequately police this data?” Georgetown’s Garvey told me. Apple might spot violations from big apps like Facebook, but what about gazillions of smaller ones?

Apple hasn’t said how many apps it has kicked out of its store for privacy issues.

Then there’s a permission problem. Apps are supposed to make clear why they’re accessing your face and seek “conspicuous consent,” according to Apple’s policies. But when it comes time for you to tap OK, you get a pop-up that asks to “access the camera.” It doesn’t say, “HEY, I’M NOW GOING TO MAP YOUR EVERY TWITCH.”

The iPhone’s settings don’t differentiate between the back camera and all those front face-mapping sensors. Once you give it permission, an active app keeps on having access to your face until you delete it or dig into advanced settings. There’s no option that says, “Just for the next five minutes.”

Overwhelming people with notifications and choices is a concern, but the face seems like a sufficiently new and sensitive data source that it warrants special permission. Unlike a laptop webcam, it’s hard to put a privacy sticker over the front of the iPhone X—without a fingerprint reader, it’s the main mechanism to unlock the thing.

Android phones have had face-unlock features for years, but most haven’t offered 3D face mapping like the iPhone. Like iOS, Android doesn’t make a distinction between front and back cameras. Google’s Play Store doesn’t prohibit apps from using the face camera for marketing or building databases, so long as they ask permission.

The value of your face

Facial detection can, of course, be used for good and for bad. Warby Parker, the online glasses purveyor, uses it to fit frames to faces, and a Snapchat demo uses it to virtually paint on your face. Companies have touted face tech as a solution to distracted driving, or a way to detect pain in children who have trouble expressing how they’re feeling.

It’s not clear how Apple’s TrueDepth data might change the kinds of conclusions software can draw about people. But from years of covering tech, I’ve learned this much: Given the opportunity to be creepy, someone will take it.

Using artificial intelligence, face data “may tell an app developer an awful lot more than the human eye can see,” said Forrester’s Khatibloo. For example, she notes researchers recently used AI to more accurately determine people’s sexuality just from regular photographs. That study had limitations, but still “the tech is going to leapfrog way faster than consumers and regulators are going to realize,” said Khatibloo.

Our faces are already valuable. Half of all American adults have their images stored in at least one database that police can search, typically with few restrictions.

Facebook and Google use AI to identify faces in pictures we upload to their photo services. (They’re being sued in Illinois, one of the few states with laws that protect biometric data.) Facebook has a patent for delivering content based on emotion, and in 2016, Apple bought a startup called Emotient that specializes in detecting emotions.

Using regular cameras, companies such as Kairos make software to identify gender, ethnicity and age as well as the sentiment of people. In the last 12 months, Kairos said it has read 250 million faces for clients looking to improve commercials and products.

Apple’s iPhone X launch was “the primal scream of this new industry, because it democratized the idea that facial recognition exists and works,” said Kairos CEO Brian Brackeen. His company gets consent from volunteers whose faces it reads, and sometimes even pays them—but he said the field is wide open. “What rights do people have? Are they being somehow compensated for the valuable data they are sharing?” he said.

What keeps privacy advocates up at night is that the iPhone X will make face scanning seem normal. Will makers of other phones, security cameras or drones be as careful as Apple? We don’t want to build a future where we become numb to a form of surveillance that goes far beyond anything we’ve known before.

You’ve only got one face, so we’d better not screw this up.

Read more about the iPhone X: 

The iPhone X-factor: Don’t buy a phone you don’t need

What happens if a cop forces you to unlock your iPhone X with your face?

If you want an iPhone X for the holidays, start planning now

Using Rust in Mercurial


This page describes the plan and status for leveraging the Rust programming language in Mercurial.

Why use Rust?

Today, Mercurial is a Python application. It uses Python C extensions in various places to achieve better performance.

There are many advantages to being a Python application. But, there are significant disadvantages.

Performance is a significant pain point with Python. There are multiple facets to the performance problem:

  • Startup overhead
  • General performance overhead compared to *native* code
  • GIL interfering with parallel execution

It takes several dozen milliseconds to start a Python interpreter and load the Mercurial Python modules. If you have many extensions loaded, it could take well over 100ms just to effectively get to a Mercurial command's main function. Reports of over 250ms are known. While the command itself may complete in mere milliseconds, Python overhead has already made hg seem non-instantaneous to end-users.

A few years ago, we measured that CPython interpreter startup overhead amounted to 10-18% of the run time of Mercurial's test harness. 100ms may not sound like a lot. But it is enough to give the perception that Mercurial is slower than tools like Git (which can run commands in under 10ms).

There are also situations like querying hg for shell prompts that require near-instantaneous execution.

Mercurial is also heavily scripted by tools like IDEs. We want these tools to provide results near instantaneously. If people are waiting over 100ms for results from hg, it makes these other tools feel sluggish.

There are workarounds for startup overhead problems: the CommandServer (start a persistent process and issue multiple commands to it) and chg (a C binary that speaks with a Mercurial command server and enables hg commands to execute without Python startup overhead). chg's very existence is because we need hg to be a native binary in order to avoid Python startup overhead. If hg weren't a Python script, we wouldn't need chg to be a separate program.

Python is also substantially slower than native code. PyPy can deliver substantially better performance than CPython. And some workloads with PyPy might even be faster than native code due to JIT. But overall, Python is slower than native code.

But even with PyPy's magical performance, we still have the GIL. Python doesn't allow you to execute CPU-bound Python code on multiple threads. If you are CPU bound, you need to offload that work to an extension (which releases the GIL when it executes hot code) or you spawn multiple processes. Since Mercurial needs to run on Windows (where new process overhead is ~10x worse than POSIX and is a platform optimized for spawning threads - not processes), many of the potential speedups we can realize via concurrency are offset on Windows by new process overhead and Python startup overhead. We need thread-level concurrency on Windows to help with shorter-lived CPU-bound workloads. This includes things like revlog reading (which happens on nearly every Mercurial operation).

In addition to performance concerns, Python is also hindering us because it is a dynamic programming language. Mercurial is a large project by Python standards. Large projects are harder to maintain. Using a statically typed programming language that finds bugs at compile time will enable us to make wide-sweeping changes more fearlessly. This will improve Mercurial's development velocity.

Today, when performance is an issue, Mercurial developers currently turn to C. But we treat C as a measure of last resort because it is just too brittle. It is too easy to introduce security vulnerabilities, memory leaks, etc. On top of vanilla C, the Python C API is somewhat complicated. It takes significantly longer to develop C components because the barrier to writing bug-free C is much higher.

Furthermore, Mercurial needs to run on multiple platforms, including Windows. The nice things we want to do in native code are complicated to implement in C because cross-platform C is hard. The standard library is inadequate compared to modern languages. While modern versions of C++ are nice, we still support Python 2.7 and thus need to build with MSVC 2008 on Windows. It doesn't have any of the nice features that modern versions of C++ have. Things like introducing a thread pool in our current C code would be hard. But with Rust, that support is in the standard library and "just works." Having Rust's standard library is a pretty compelling advantage over C/C++ for any project, not just Mercurial.

For Mercurial, Rust is all around a better C. It is much safer, about the same speed, and has a usable standard library and modules system for easily pulling in 3rd party code.

Desired End State

  • hg is a Rust binary that embeds and uses a Python interpreter when appropriate (hg is a Python script today)

  • Python code seamlessly calls out to functionality implemented in Rust
  • Fully self-contained Mercurial distributions are available (Python is an implementation detail / Mercurial sufficiently independent from other Python presence on system)
  • The customizability and extensibility of Mercurial through extensions is not significantly weakened.
  • chg functionality is rolled into hg

Current Status

(last updated December 4 2017)

Priorities for Oxidation

All existing C code is a priority for oxidation because we don't like maintaining C code for safety and compatibility reasons. Existing C code includes:

In addition, the following would be good candidates for oxidation:

  • All revlog I/O (reading is more important than writing)
  • Working directory I/O (extracting content from revlogs/store and writing to filesystem)
  • bundle2 reading and writing
  • changelog reading
  • revsets
  • All filesystem I/O (allows us to use Windows APIs and properly handle filenames on Windows)

Problems

CRT Mismatch on Windows

Mercurial still uses Python 2.7. Python 2.7 is officially compiled with MSVC 2008 and links against msvcr90.dll. Rust and its standard library don't support MSVC 2008. They are likely linked with something newer, like MSVC 2015 or 2017.

If we want compatibility with other binary Python extensions, we need to use a Python built with MSVC 2008 and linked against msvcr90.dll.

So, our options are:

  1. Build a custom Python 2.7 distribution with modern MSVC and drop support for 3rd party binary Python 2.7 extensions.
  2. Switch Mercurial to Python 3 and build Rust code with same toolchain as Python we target.
  3. Mix the CRTs.

#1 significantly undermines Mercurial's extensibility. Plus, Python 2.7 built with anything other than MSVC 2008 isn't officially supported.

#2 is in progress. However, the timeline for officially supporting Python 3 to the point where we can transition the official distribution for it is likely too far out (2H 2018) and would hinder Rust adoption efforts.

That leaves mixing the CRTs. This would work by having the Rust components statically link a modern CRT while having Python dynamically load msvcr90.dll.
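
Concretely, statically linking the CRT into the Rust pieces is just a build flag. A minimal sketch, assuming the 64-bit MSVC target and a Cargo-based build:

# .cargo/config -- statically link the MSVC C runtime into the Rust components
[target.x86_64-pc-windows-msvc]
rustflags = ["-C", "target-feature=+crt-static"]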

Mixing CRTs is dangerous because if you attempt to perform a multipart operation with multiple CRTs, things could blow up. e.g. if you malloc() in CRT A and free() in CRT B. Or attempt to operate on FILE instances across CRTs. More info at https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries. See also https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/crt-alphabetical-function-reference for a full list of CRT functions.

Fortunately, our exposure to the multiple CRT problem is significantly reduced because:

  • Rust and its standard library don't make heavy use of CRT primitives.
  • Memory managed by Rust and Python is already being kept separate by the Python API. In Rust speak, we won't be transferring ownership of raw pointers between Rust and Python. Python's refcounting mechanism ensures all PyObjects are destroyed by Python. The only time ownership of memory crosses the bridge is when we create something in Rust and pass it to Python. But that object will be a PyObject, and its backing memory will have been managed with the Python APIs.

  • We shouldn't be using FILE anywhere. And I/O on an open file descriptor would likely be limited to its created context. e.g. if we open a file from Rust, we're likely not reading it from Python.

We would have to keep a close eye out for CRT objects spanning multiple CRTs. We can mitigate exposure for bad patterns by establishing static analysis rules on source code. We can also examine the produced Rust binaries for symbol references and raise warnings when unwanted CRT functions are used by Rust code.
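
A crude version of that binary check (a sketch, not an agreed-upon tool) is to dump the produced binary's dependencies and imports with MSVC's dumpbin and flag CRT DLLs or object-passing functions we want to keep off the boundary:

rem Which DLLs (and therefore which CRTs) does the binary link against?
dumpbin /DEPENDENTS hg.exe

rem Flag imports of CRT functions whose objects must not cross the boundary
dumpbin /IMPORTS hg.exe | findstr /i "fopen fclose setlocale strtok"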

Rust Support

Mercurial relies on other entities (like Linux distros) to package and distribute Mercurial. This means we have to consider their support for packaging programs that use Rust or else we risk losing packagers. This means we need to consider:

  • The minimum version of Rust to require
  • Whether we can use beta or nightly Rust features

For official Mercurial distributions, these considerations don't exist, as we'll be giving a binary to end-users. So this topic is all about our relationship with downstream packagers.

Packaging Overhaul Needed

If hg becomes a Rust binary and we want Mercurial to be a self-contained application, we'll need to overhaul our packaging mechanisms on all operating systems.

Distributing Python

Mercurial would need to distribute a copy of Python.

Python insists that embedded Python load a pythonXX shared library. e.g. python27.dll or libpython27.so.

We would also need to distribute a copy of the Python standard library (.py, .pyc, etc files). These could be distributed in flat form (hundreds of .py files) or in a zip file. (Python supports importing modules from zip files.) If we wanted to get creative, we could invent our own archive format / module loading mechanism (but this feels like unnecessary work).

We can't prune the Python standard library of unused modules because Mercurial extensions may make use of any feature in the standard library. So we'll be distributing the entire Python standard library.

But the distribution of Python is not required: various packagers (like operating systems) would want Mercurial to use a Python provided to it. So our Rust hg needs to support loading a bundled Python and a Python provided to it. This can likely be controlled with build-time flags.

Windows

Mercurial could conceptually be distributed as a .zip file. That archive would contain pre-built hg.exe, pythonXX.dll, any other shared library dependencies, a copy of the Python standard library, Mercurial Python files, and any support files.

Because zip files aren't user friendly, we'd likely provide a standalone .exe or .msi installer (like we do today).

Linux

We could provide a self-contained archive file containing the hg binary, libpython27.so, and any other dependencies. We could also provide rpm, deb, etc. packages for popular distributions. These would be self-contained and not dependent on many (any?) other packages. Our biggest concern here is libc compatibility. That can be solved by static linking, compiling against a sufficiently old (and compatible) libc, or providing distro-specific packages.

Of course, many distros will want to provide their own Mercurial package. And they will likely want Mercurial to make use of the system Python. We can and must support this.

An issue with a self-contained distribution is loading of shared libraries. Not all operating systems and loaders may support loading of binary-relative shared libraries. We may need to hack something together that uses dlopen() to explicitly specify which libpython27.so, etc to load.

MacOS

This is very similar to Linux. We may support the native application / installer mechanism to make things more user friendly. We don't have good support for this today. So it is likely most users will rely on Homebrew or MacPorts for installation.

BSDs / Solaris / Etc

Basically the same strategy as Linux.

PyPI / pip

We support installing Mercurial via pip today. We upload a source distribution to PyPI and anyone can pip install Mercurial to install Mercurial in their Python environment. On Windows (where users can't easily compile binary Python extensions), we provide Python wheels with pre-built Mercurial binaries.

The future of pip install Mercurial with an oxidized Mercurial is less clear.

pip is tailored towards Python applications. If Mercurial is a Rust application and Python is an implementation detail, does it make sense to use pip and PyPI as a distribution channel?

pip install Mercurial is very convenient (at least for the people that have pip installed and can run it). It is certainly easier than downloading and running an installer. So unless we bake an upgrade facility into Mercurial itself, pip install Mercurial is the next best thing for upgrading after the system package manager (apt, yum, brew, port, etc).

pip install Mercurial goes through a well-defined mechanism to take the artifact it downloaded from PyPI to install it. This mechanism could be abused to facilitate the use of PyPI/pip for distributing a self-contained Mercurial distribution. e.g. the user would end up with a Rust binary in PYTHONHOME/bin/hg that loads a custom version of Python and is fully self-contained and isolated from the Python it was pip installed into. This would be super hacky. It may not even be allowed by PyPI's hosting terms of service? But we could certainly abuse pip install if we needed to.

Support for PyPy / non-CPython Pythons

There exist Python distributions beyond the official CPython distribution. PyPy likely being the one of the most interest to us because of its performance advantages.

The cost to supporting non-CPython Pythons when hg is a Rust binary could be very high. That would likely significantly curtail the use of the CPython API. Instead, we'd have to do interop via ctypes or cffi or provide N ways to do interop.

It's worth noting that if Mercurial is a self-contained application, we could potentially swap out CPython for PyPy. We could go as far as to unsupport CPython completely.

Rust <=> Python Interop

Rust and Python code will need to call into each other. (Although it is anticipated that the bulk of the calling will be from Python into Rust code - at least initially.)

There are many options for us here.

python27-sys and python3-sys are low-level Rust bindings to the CPython API. Lots of unsafe {} code here.

rust-cpython and PyO3 are higher-level bindings to python27-sys and python3-sys. They are what you want to use for day-to-day Rust programming.

PyO3 is a fork of rust-cpython. It seems to be a bit nicer. But it requires Nightly Rust features.

Milksnake uses Rust's cbindgen crate to automatically generate Python cffi bindings to Rust libraries. Essentially, you write a Rust library that exports symbols and milksnake can generate a Python binding to it. There's a lot going on. But it is definitely an interesting approach. And some of the components are useful without the rest of milksnake. e.g. the idea of using cbindgen + cffi to generate low-level Python bindings. Because Milksnake uses cffi, the approach should work with both CPython and PyPy.

A major reason for adopting Rust (and C before that) is performance. We know from Mercurial's C extensions that the benefit of native code is often vastly undermined by a) crossing the Python<->native boundary and b) excessive use of the Python API from native code. For example, obsolescence marker parsing is ~100x faster in C. However, once you construct PyObjects for all the parsed markers, it is only 2-4x faster.

We know that using ctypes to call from Python into native code is significantly slower than binary Python extensions. Although if the number of function calls and data being transferred across the boundary is small, this difference isn't as pronounced. Rust will enable us to write more functionality in native code (we try to avoid writing C today for maintainability and security reasons). So the performance of the Python<->native bridge will be more important over time. Therefore, it seems prudent to rule out ctypes. That leaves us with extensions or CFFI.

Reconciling `hg` with Rust extensions

Initially, hg will be a minimal Rust binary that embeds a Python interpreter. It simply tells the interpreter to invoke Mercurial's main() function. In this world, other Rust functionality is likely loaded via shared libraries or Python extensions. In other words, we have multiple Rust contexts running from different binaries (an executable and a shared library). The executable handles very early process activity. The shared library handles business logic.
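For illustration, here is a minimal sketch of what such a launcher amounts to, written against the CPython embedding API in C rather than Rust (the mercurial.dispatch entry point name is an assumption):

#include <Python.h>

int main(int argc, char *argv[])
{
        int ret;

        Py_Initialize();
        PySys_SetArgv(argc, argv);   /* expose the command line as sys.argv */

        /* Hand control to Mercurial's Python entry point (assumed name). */
        ret = PyRun_SimpleString("from mercurial import dispatch\n"
                                 "dispatch.run()\n");

        Py_Finalize();
        return ret != 0;
}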

Over time, we'll likely want to expand the role of Rust for early process activity. For example, we'll need to implement some command line processing in Rust for chg functionality. We may also want to implement config file loading (we need to rewrite the config parser anyway to facilitate writing back config changes). And, if we could load a repo from disk and maybe even implement performance critical commands (like hg status) from pure Rust, this would likely be a massive performance win. (Although we have to consider how this will interact with extensibility.)

What this means is that we'll have multiple Rust binaries holding Mercurial state. This feels brittle. Ideally we'd have a single Rust binary. If Python needed to call into native/Rust code, it would get those symbols from the parent hg binary instead of from a shared library. It is unclear how this would work. It is obviously possible to resolve the address of a symbol in the current binary. But existing "call native code" mechanisms in Python seem to assume that symbols are coming from loaded libraries, not the current executable. This may require modifications to cffi or some custom code to generate the Python bindings to executable-local symbols.
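One plausible building block (a sketch, not how Mercurial works today): dlopen(NULL, ...) returns a handle for the main program, and dlsym() can then look up symbols the executable exports, which typically requires linking with -rdynamic. The hg_parse_config symbol below is made up:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
        /* A NULL path gives a handle for the main executable itself. */
        void *self = dlopen(NULL, RTLD_NOW);
        void (*hg_parse_config)(const char *);

        if (self == NULL) {
                fprintf(stderr, "%s\n", dlerror());
                return 1;
        }

        /* hg_parse_config is a hypothetical symbol exported by hg
           (requires the executable to export its symbols, e.g. -rdynamic). */
        *(void **)&hg_parse_config = dlsym(self, "hg_parse_config");
        if (hg_parse_config != NULL)
                hg_parse_config("~/.hgrc");
        return 0;
}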

Preserving Support for Extensions

Mercurial implemented in Python is good for extensibility because it means extensions can customize nearly every part of Mercurial - often via monkeypatching.

As we use more Rust, we no longer have the dynamic nature of Python and extensions will lose some of their power.

As more of hg is implemented in Rust before any Python is called, we could lose the ability for Python extensions to influence low-level and core operations. e.g. if we want to implement hg status such that it doesn't invoke Python and incur Python startup overhead, how do we enable extensions to still influence the behavior of hg status?

Presumably, hg will eventually implement config file loading and command line processing. So, Rust will be able to see which extensions are being loaded. Assuming hg can resolve the paths to loaded extensions, we could add a syntax to the main extensions file to declare their influence on various behavior. For example, if an extension influences behavior of hg status, its source code could contain something like: # hgext-influences: cmd-status. Rust would see this special syntax and know it needs to instantiate a Python interpreter in order to load the extension for the current hg status command. We could also imagine doing something similar for other functionality implemented in Rust, such as the core store interface.



Integrating “safe” languages into OpenBSD?

List:       openbsd-misc
Subject:    Re: Integrating "safe" languages into OpenBSD?
From:       "Theo de Raadt" <deraadt () openbsd ! org>
Date:       2017-12-03 20:37:07
Message-ID: 81638.1512333427 () cvs ! openbsd ! org

> As a response to this, Theo asked rhetorically "Where's ls, where's cat,
> where's grep, and where's sort?", implying that noone so far bothered to
> write implementations of even the basic unix utilities in such a
> language.

I wasn't implying.  I was stating a fact.  There has been no attempt
to move the smallest parts of the ecosystem, to provide replacements
for base POSIX utilities.

As a general trend the only things being written in these new
languages are new web-facing applications, quite often proprietary or
customized to narrow roles.  Not Unix parts.

Right now, there are zero usage cases in the source tree to require
those compiler tools.  We won't put a horse into the source tree when
society lacks cart builders.

> This brings me to the question, what if someone actually bothered?

So rather than bothering to begin, you wrote an email.

Awesome.

Yes, now I am implying something: you won't bother to rewrite the
utilities.

And I understand, why would anyone bother?  It took about 10 years for
gnu grep to be replaced sufficiently well in our tree.  This stuff
doesn't happen overnight.

However there is a rampant fiction that if you supply a new safer
method everyone will use it.  For gods sake, the simplest of concepts
like the stack protector took nearly 10 years for adoption, yet people
should switch languages?  DELUSION.

> Under what conditions would you consider replacing one of the
> current C implementations with an implementation written in another,
> "safer" language?

In OpenBSD there is a strict requirement that base builds base.

So we cannot replace any base utility, unless the toolchain to build
it is in the base.  Adding such a toolchain would take make build time
from 40 minutes to hours.  I don't see how that would happen.

> Note that with Cgrep and haskell-ls, there do in fact exist
> implementations/analogues of two of the mentioned utilities in a
> memory safe language (Haskell).

Are they POSIX compliant?  No.  They are completely different programs
that have borrowed the names.

By the way, this is how long it takes to compile our grep:

    0m00.62s real     0m00.63s user     0m00.53s system

Does Cgrep compile in less than 10 minutes?

Such ecosystems come with incredible costs.  For instance, rust cannot
even compile itself on i386 at present time because it exhausts the
address space.

Consider me a skeptic -- I think these compiler ecosystems face a grim
bloaty future.


A Mother’s Ninth-Century Manual on How to Be a Man


Albert Edelfelt, Queen Blanche of Norway and Sweden with Prince (later King) Hacon, 1877.

Being a red-blooded, blue-blooded male in the Carolingian Empire was a risky business. Those who grew up in Western Europe during the eighth and ninth centuries were frequently exposed to extreme violence. One adolescent royal from the period was struck so hard in a play fight that, in the words of a contemporary account, his playmate’s sword “penetrated almost as far as the brain, reaching from his left temple to his right cheekbone.”

The only thing the Carolingians valued as much as ruthlessness on the battlefield was proficiency with Biblical text. William of Septimania appears to have had a thorough education in both. He was barely in his twenties when he seized control of Barcelona in 848, but he had already spent four years warring against the crown. The city had been the old stomping ground of his father, Bernard. Bernard was an important figure in the reign of Louis the Pious, the Carolingian emperor who ruled a great swathe of territory from what is now northern Spain to the Czech Republic. But in recent times Bernard had endured a spectacular fall, toppled by intrigue and machination that ended in his death and devastated his family. When still in his teens, William became determined to win the battles his father couldn’t. He joined a rebellion against the ruling dynasty that had once been as close as kin. 

It was an audacious act, and in the long run it was destined to fail, but it was consistent with the moral education he’d received since childhood. His mother, Dhuoda, had drilled into him that there was only one true measure of nobility: “in every matter, be obedient to the interests of your father.”

She wrote those words of maternal wisdom in Liber Manualis, a handbook on how to be a nobleman that she composed for William when he was a teenager, and which she hoped would guide him through his adult life. Across eleven chapters, Dhuoda’s book outlines the subjects that should most concern a man of high birth, such as how to pray and read the Bible; how to distinguish vice from virtue; how best to honor his parents; how to serve God and the Crown; how to handle illness, affliction, and hardship. The work belongs to the tradition of “mirrors for princes,” an ancient literary genre that also proliferated during the Middle Ages. But, Dhuoda’s mirror, the only extant written work by a European woman from the ninth century, is one of a kind: not, as most were, a cleric’s tutorial but a mother’s gift of loving guidance through an uncertain future, with the thoughts, feelings, and personality of its author running through it. Like the Alfred Jewel, the Cross of Lothair, and so many of the most beautiful creations of early medieval Europe, the Liber Manualis beguiles with its intimacy and exquisite intricacy, a glittering portal to a culture that can seem entirely alien from our own.

*

Almost everything we know about Dhuoda comes from the Liber Manualis. According to her own account, she married Bernard on June 29, 824, at the Palace of Aachen, which was the center of power in the Carolingian Empire. She might’ve been as young as fifteen, she was certainly no older than twenty, and in all likelihood she saw in front of her a life of wealth and prominence as the wife of a noble servant of Europe’s most powerful ruling dynasty in centuries. Five years after the wedding, Bernard was rewarded for his loyal service with the positions of chamberlain at the imperial court and mentor to Charles, the emperor’s six year-old son by his new young wife, Judith. Judging by the lavish paeans written about her, Judith was adored by just about everyone save for her three adult stepsons who saw her and her little boy as a grave threat to their inheritances. When they heard that Charles was to be awarded lands that had formerly been given to them, the brothers struck out against their father and their stepmother, and plunged the empire into a decade of destructive factionalism and civil war.

Dhuoda’s young family was caught in the cross fire. Bernard was accused of sorcery and sexual impropriety with Judith, and fled to Barcelona. Throughout the 830s his relatives were preyed upon. One of his brothers was blinded; another was beheaded. His sister, a nun, was captured and drowned, ostensibly for being a witch. In 840, the emperor died, but it only exacerbated the conflict between his sons. Now aged sixteen, Charles asked Bernard to join forces with him. Bernard demurred, perhaps concerned that backing such a young pretender was too risky a gamble. He would soon regret his lack of conviction; the boy proved astoundingly resolute.

In 841, Charles recorded a surprise victory over his eldest sibling, which elevated him to a position of great strength. Scrambling to get back in Charles’s good books, Bernard made an offering: the fealty of his only child, William, who was sent to live and serve at Charles’s court. Such arrangements weren’t uncommon among the Frankish nobility, where adolescents were often sent to reside in a patron’s household as a sort of apprenticeship in noble living. But, this was not a usual situation, and Dhuoda was clearly concerned to see her beloved first born drawn into the “worsening turmoil of this wretched world,” as she described the events of her time. Thinking of his well-being in this life and the next, she set out to offer comfort and guidance in the only way she could.

Dhuoda began work on the handbook at her home in Uzès, on November 30, 841, the day after William’s fifteenth birthday. At the time, the world must’ve seemed a precarious and unmalleable place. While Bernard battled his enemies, the empire continued to be riven by a civil war, made all the more alarming by a succession of Viking raids from the north and Moorish incursions from the south. Just a few months before William was sent to Charles, Dhuoda’s newborn second child had been taken to live with his father a couple of hundred miles away in Aquitaine. At the moment of separation, she didn’t even know the baby’s name; it was Bernard’s privilege to decide that. Family life had slipped from her grasp. Composing this book of instruction for William was a means of exerting some control, setting down on the page her most valued truths and projecting a sense of order on a world that seemed bereft of it. “I am somewhat ill at ease,” she writes in her introduction, “and eager to be useful to you … Even though I am absent in body, this little book will be present.”

Responding to crises in this literary way was a thoroughly Carolingian trait. At the end of the previous century, Emperor Charlemagne fostered the so-called Carolingian Renaissance, a period of intense intellectual and cultural activity, and a rediscovery of classical learning. All manner of disciplines, from architecture to jurisprudence to metalworking, were patronized by the imperial court with the aim of sparking spiritual and moral reform across the continent and revivifying the vanished civilization of Rome. Central to these efforts was the preservation of knowledge through the written word. The Carolingians transcribed thousands of ancient papyrus texts to more robust parchment, safeguarding vital works that would otherwise have been lost. According to the scholars Costambeys, Innes, and MacLean, only eighteen hundred manuscripts survive from pre-800 continental Western Europe, yet, as a result of the Carolingians’ commitment, we have nine thousand manuscripts from the ninth century. The irrepressible power of books is a dominant theme of the Liber Manualis, and it is crammed with literary references, especially those from works of theology and philosophy, and, of course, Scripture. Like many of her contemporaries, and unlike many of ours, Dhuoda didn’t regard reading as escape from the “real” world but as a purposeful, pious deed, and the first step on the road to righteous action. In the opening section, she beseeches William to “willingly grasp [the book] in your own hand, and enfolding it, turning it over, and reading it, and studying it, you’ll strive to fulfill its teachings,” as though the mere sensation of its cover on his skin would help him act bravely and wisely.

The Liber Manualis does everything mirrors for princes are meant to do. It counsels and consoles its reader, and abases its author, stressing her unworthiness before God and her march toward death. But it is also something more than a dry, serious work of moral instruction: Dhuoda refuses to remove herself from the text. Barely a page goes by without a glimpse of the author somewhere—in her love of puns, acrostics, and numerology; her predilection for tossing in biographical details; the curious extended metaphors; her aside about how difficult she finds writing. When those moments hit, Liber Manualis doesn’t read like a book about how to be a nobleman but a book about what it’s like to be Dhuoda. Working within the conventions of the mirrors, she finds a way to create a self-portrait—not to indulge her ego, but to provide William with something that will keep her alive in his mind, a keepsake as intimate and evocative as a lock of hair. She all but admits as much herself:

“Dhuoda is always here to exhort you, my son, but in anticipation of the day when I shall no longer be with you, you have here as a memento of me this little book … You will have learned doctors to teach you many more examples, more eminent and of greater usefulness, but they are not of equal status with me, nor do they have a heart more ardent than I, your mother, have for you, my firstborn son!”

Some scholars go further and suggest that Dhuoda made herself so visible in the text as a rebuke to Bernard, whose political mistakes and alleged infidelities put the family at risk. According to M. A. Claussen, the message of the Liber Manualis is that “Bernard is a loser;” if William wants to survive the chaos, he’d better steer clear of his biological father’s example and follow that of the person who’s been doing Bernard’s job for him for the last fifteen years: Dhuoda.

*

On February 2, 843, the Liber Manualis was complete. Dhuoda may have feared, when she sent it to William, that this would be her last meaningful communication with him. “Despite the many cares that consume me,” she writes, “this anxiety is foremost in God’s established design—that I see you one day with my own eyes.” She urges him to share the book with his little brother when he is old enough to read. This may have been empty rhetoric, or else a sign of how bleakly she saw events in the outside world. There’s no record of when William first cast his eyes over the book. It’s possible he had it by the time Charles and his brothers finally came to a truce in August 843. They sealed the deal with the Treaty of Verdun, often referred to as “Europe’s birth certificate,” which formally divided the empire in three, establishing the boundaries of modern France and Germany.

The end of the dynastic strife would’ve done nothing to ease Dhuoda’s anxieties. In the spring of the following year, Bernard was captured by Charles and executed for treason. William might have witnessed the grisly event. With Dhuoda’s command to honor his father in his ears, that summer William joined a revolt against Charles’s rule. He now moved to claim what he considered his birthright, the title his father had once held: Count of Barcelona. A chronicler recorded that when William took control of the city in 848, he did so “by guile rather than by force.” But it would require more than cunning to keep his station. Within two years, Charles had him on the run. By 850, William was dead, killed by Charles’s supporters.

Exactly eleven hundred years later the historian André Vernet discovered a copy of the Liber Manualis in Barcelona. It wasn't the original, but it may well have derived from it, and therefore be evidence that William kept the handbook close to him. It's unknown when Dhuoda passed away, but it's unlikely that she saw William again before his death. There's a small chance that she lived long enough to see her younger son, a man best known to historians as Bernard "Hairy Paws," who also fought against Charles, become the Count of Auvergne and the Margrave of Aquitaine.

The turbulent tale of Dhuoda’s family caught the medieval imagination. The rumored affair between her husband and Empress Judith became the basis of The Erl of Toulouse, a famed fourteenth-century English chivalric romance. To the modern mind, however, always in search of a chink of insight into the interior life of the individual, it’s Dhuoda’s book that captivates. She was an avid reader, and a born writer, thinking and scribbling for her audience of one.

Edward White is the author of The Tastemaker: Carl Van Vechten and the Birth of Modern America.

Bat cave solves mystery of SARS virus


Researchers analysed strains of SARS virus circulating in horseshoe bats, such as this one (Rhinolophus sinicus), in a cave in Yunnan province, China. Credit: Libiao Zhang/Guangdong Institute of Applied Biological Resource

After a detective hunt across China, researchers chasing the origin of the deadly SARS virus have finally found their smoking gun. In a remote cave in Yunnan province, virologists have identified a single population of horseshoe bats that harbours virus strains with all the genetic building blocks of the one that jumped to humans in 2002, killing almost 800 people around the world. 

The killer strain could easily have arisen from such a bat population, the researchers report in PLoS Pathogens1 on 30 November. They warn that the ingredients are in place for a similar disease to emerge again. 

In late 2002, cases of a mystery pneumonia-like illness began occurring in Guangdong province, southeastern China. The disease, dubbed severe acute respiratory syndrome (SARS), triggered a global emergency as it spread around the world in 2003, infecting thousands of people. 

Scientists identified the culprit as a strain of coronavirus and found genetically similar viruses in masked palm civets (Paguma larvata) sold in Guangdong’s animal markets. Later surveys revealed large numbers of SARS-related coronaviruses circulating in China’s horseshoe bats (Rhinolophus)2— suggesting that the deadly strain probably originated in the bats, and later passed through civets before reaching humans. But crucial genes — for a protein that allows the virus to latch onto and infect cells — were different in the human and known bat versions of the virus, leaving room for doubt about this hypothesis. 

Bat hunt

To clinch the case, a team led by Shi Zheng-Li and Cui Jie of the Wuhan Institute of Virology in China sampled thousands of horseshoe bats in locations across the country3. “The most challenging work is to locate the caves, which usually are in remote areas,” says Cui. After finding a particular cave in Yunnan, southwestern China, in which the strains of coronavirus looked similar to human versions4,5, the researchers spent five years monitoring the bats that lived there, collecting fresh guano and taking anal swabs1.  

They sequenced the genomes of 15 viral strains from the bats and found that, taken together, the strains contain all the genetic pieces that make up the human version. Although no single bat had the exact strain of SARS coronavirus that is found in humans, the analysis showed that the strains mix often. The human strain could have emerged from such mixing, says Kwok-Yung Yuen, a virologist at the University of Hong Kong who co-discovered the SARS virus: “The authors should be congratulated for confirming what has been suspected.” 

But Changchun Tu, a virologist who directs the OIE Reference Laboratory for Rabies in Changchun, China, says the results are only “99%” persuasive. He would like to see scientists demonstrate in the lab that the human SARS strain can jump from bats to another animal, such as a civet. "If this could have been done, the evidence would be perfect,” he says. 

Travel trouble

Another outstanding question is how a virus from bats in Yunnan could travel to animals and humans around 1,000 kilometres away in Guangdong, without causing any suspected cases in Yunnan itself. That “has puzzled me a long time”, says Tu.

Cui and Shi are searching for other bat populations that could have produced strains capable of infecting humans. The researchers have now isolated some 300 bat coronavirus sequences, most not yet published, with which they will continue to monitor the virus’s evolution. 

And they warn that a deadly outbreak could emerge again: the cave where the elements of SARS were found is just 1 kilometre from the nearest village, and genetic mixing among the viral strains is fast. “The risk of spillover into people and emergence of a disease similar to SARS is possible,” the authors write in their paper.

Although many markets selling animals in China have already been closed or restricted following outbreaks of SARS and other infectious diseases, Yuen agrees that the latest results suggest the risk is still present. “It reinforces the notion that we should not disturb wildlife habitats and never put wild animals into markets,” says Yuen. Respecting nature, he argues, “is the way to stay away from the harm of emerging infections”.   


How We’re Designing Channels


This post is part of a series on how we’re making Channels, the thinking behind the product, and insight into the process. We’ve got smart people working on smart solutions, and continue looking to the community and alpha/beta testers as we iterate toward launch. Read “Why Channels”, the first post in the series for more background info.

Working on Channels has been like any remodeling project: You start out excited, then you pull down the wood paneling and suddenly realize you’ve got rewiring to do — solving one issue leads to even more. 

Discovering the problems

The team started by interviewing and observing developers to more deeply understand their needs, the benefits and shortcomings of the tools they're using for knowledge management, and their hopes for what Channels on Stack Overflow could be.

This research helped us center our approach around three main principles:

  • Channels provides a private & secure space for your team to store and share institutional knowledge,
  • Channels is a feature that exists right on Stack Overflow, a familiar place where developers know the systems and already go to ask and answer programming questions,
  • And deep integrations & notifications are essential to make sure that the right question gets in front of the right person, especially on a smaller community or team.

Next, we went broad by analyzing other products and prototyping many design approaches before converging on any solution. We pulled in designers from different product teams to quickly explore a wide range of solutions.

This is where we began to discover that the new tile might not match the countertops.

Here’s what we’ve learned so far

Needs in small, private spaces are very different than large, public ones

There are nuances to how people will use their private, work channel that differ from the public setting. Connections and relationships already exist in the physical world prior to the addition of a private Channel. Work problems are specific to coworkers who are domain experts. In public, anyone across the globe with the right knowledge can answer general programming questions. 

Not all features and rules created for public Stack Overflow are necessary in a smaller, private environment, and new features are needed that don’t yet exist. For instance on public Stack Overflow, you can’t mention anyone unless they’ve contributed in some way to the question (otherwise Jon Skeet would get pinged any time anyone had a C# question.) But in a private channel the need is different — “I want to make sure Jenny sees this question because I think she’s probably the only person in the company who can answer it well.”

[Image: comment]

Also, users need to be able to quickly identify what is private. There are updates to the UI that are necessary to ensure there's no possibility of posting private info into the public space.

[Image: ask-question-prompt]

Adding channels requires big changes to SO’s Information Architecture

A house with more rooms needs more doors. In order to help teams easily find, share, and store institutional knowledge, we need to make our search interface and navigation (finding things by clicking around) be more intuitive.

Putting navigation in the right place

Adding channels into the current navigation presents challenges that didn’t previously exist, and can only be solved by rethinking the underlying site structure. 

[Image: nav-sketches]

The team created several prototype navigations, narrowed to two, and tested with a group of users. This surfaced a few key issues:

  • Scalability: Horizontal navigation lacks the space for additional elements — especially multiple channels per user.
  • Gestalt: It’s confusing to place Channel navigation at the page top, where information is already dense. The hierarchy of information must be balanced and easy to comprehend.
  • Persistence / Consistency: Because navigation creates a mental map, any design iteration that didn’t expose the existence of the channel at a top nav level created confusion.

We need a system that solves these problems well regardless of whether you’re trying to find info to help onboard a new developer to your team, or just trying to figure out how to vertically align text in CSS.

While we haven’t solved all the edge cases, the design discovery gave us a clear direction to test during the alpha phase.

One search for all Channels

A key path to finding content on Stack Overflow (searching for a question on Google, then clicking on a Stack Overflow Question) doesn’t work for private Q&A because Google can’t index it. That means Channels users will often need to use SO’s search when they’re looking for content that might be private… and our search has been showing its age for some time now. If our goal is to quickly help users find the most relevant results, then we need a better search experience that considers the searcher’s intent and improves the quality of the results (but I’ll let an engineer speak to those changes in a future post.)

New integrations and notifications

Because Stack Overflow gets 10 million visits per day, getting questions answered hasn’t been a problem we’ve faced for a long time. The mechanics in smaller communities are different. With Channels, we had to think through how smaller teams interact and the workflows they’re already using in order to make sure the right people see the right questions.

Our primary learnings showed that we need to integrate with the tools teams already use (such as Slack) and enable users to create custom email notifications. We’re planning and building several integrations.

Balancing getting users in and iterating toward a vision

Much of design is iteratively gaining a deeper understanding of the problems we need to be solving. Achieving a higher fidelity of understanding means repeating the process of creating and testing over and over again. Throughout the design process there’s a healthy tension between what we can build today and ship fast, so that we can actually put something in front of users and make sure we’re meeting their greatest needs, and the longer-term vision that guides our trajectory as the product matures.

We split the work into phases, and even though we’re embarrassed by some of the things we are putting in front of people, there’s reassurance that we’re building toward a future state that we all believe in. This also means that we aren’t launching into long build phases on things that we think are important, but may not be essential to our users.

Next Steps

We’re launching Channels alpha in December and hope to be moving into public Beta in Early Spring. Look for future posts sharing our design, research, and product development process in the coming weeks.

In the meantime, we invite you to provide feedback or sign up to be part of our beta group as we test and grow our Channels product. If you have any questions for the team or me, feel free to drop them in the comments or ask a question about Channels on Meta. Let us know what you think. 

The Old-School Fire Effect and Bare-Metal Programming


Many years ago, probably around 1995 or so, my family was having dinner at some friends, and their son (who I think may have been a high-school senior then) showed me some cool DOS programs on the computer. One of the programs was a demo that drew animated flames on the screen. I was amazed! Asking what language it was written in, I was told it was Pascal.

[Image: Anders Haraldsson, 'Programming i Pascal' (2:a uppl.), Studentlitteratur, Lund 1979]

Until then I had only programmed in QBasic, but if one could make fire with Pascal, I knew I just had to learn it.

My uncle supplied me with his university text-book on the language (pictured on the right), and I stepped to it. Unfortunately, the book turned out to be extremely thin on the subject of making fire. Also it was mainly concerned with programming the PDP-10 as opposed to the IBM PC that I was using. And so, I never learned the skill.

(There's a lesson here about the social aspects of programming: I could have asked more and much better questions.)

I've wanted to revisit this for years. Having acquired better programming, English, and web search skills, it's time to fill this gap in my education. This post contains a walk-through of the classic MS-DOS firedemo, a port of it to SDL, and an implementation of the fire effect that runs on bare-metal.

Firedemo

According to the internet, Javier "Jare" Arévalo's firedemo from 1993 was the first implementation of this effect. He wrote a blog post about it for the 20th anniversary, which includes a version in Javascript (source on GitHub).

When I asked about the firedemo, here's what he told me:

It began when we bought a 80387 math coprocessor and, to enjoy it, played a lot with a famous fractal generator called Fractint. Then I wanted to make a kind of plasma style fractal, but animated in a more complex way than the color rotation typical of the time. I just started writing some code without thinking much. A few bugs later, I had something that looked like small blue explosions that quickly faded to black. Tweaking the code rather than fixing the bugs got me the fire effect. We did realize how and why it looked like a fire, and JCAB's implementation in Inconexia was fully intentional and correct, but I never sat down to truly understand all the subtle bits in the original (where did the initial white explosion come from? Why was there no apparent "real" random number generator, yet it looked random?) until I recreated it in Javascript. As far as I can tell, the Javascript version is pixel perfect, it shows the exact same animation as the original did.

(FractInt still exists and has been ported to Linux. Jare wrote a plasma effect demo, iris, a few days after the fire demo. Inconexia (YouTube) uses the fire effect in the final scene.)

I'm not sure whether this is the program I saw that night many years ago, or if I saw one of the many other implementations that followed. Kirk A. Baum has collected some of them in firecode.zip, including a version that is indeed written in Pascal called Flames by Mark D. Mackey.

Let's dissect the source code of the firedemo:

Data

; ------------------------------ FIRE.ASM ------------------------------
; Bye Jare of VangeliSTeam. Want more comments? Write'em. O:-)


        .MODEL SMALL
        .STACK 400
        DOSSEG
        LOCALS

This syntax suggests the code is written for Borland's Turbo Assembler. (I suppose this write-up serves as an answer to the call for more comments.)

        .DATA

FirePal LABEL BYTE
;  Fire palette, colors 0-63 ------------

        DB        0,   0,   0,   0,   1,   1,   0,   4,   5,   0,   7,   9
	DB	  0,   8,  11,   0,   9,  12,  15,   6,   8,  25,   4,   4
	DB	 33,   3,   3,  40,   2,   2,  48,   2,   2,  55,   1,   1
	DB	 63,   0,   0,  63,   0,   0,  63,   3,   0,  63,   7,   0
	DB	 63,  10,   0,  63,  13,   0,  63,  16,   0,  63,  20,   0
	DB	 63,  23,   0,  63,  26,   0,  63,  29,   0,  63,  33,   0
	DB	 63,  36,   0,  63,  39,   0,  63,  39,   0,  63,  40,   0
	DB	 63,  40,   0,  63,  41,   0,  63,  42,   0,  63,  42,   0
	DB	 63,  43,   0,  63,  44,   0,  63,  44,   0,  63,  45,   0
	DB	 63,  45,   0,  63,  46,   0,  63,  47,   0,  63,  47,   0
	DB	 63,  48,   0,  63,  49,   0,  63,  49,   0,  63,  50,   0
	DB	 63,  51,   0,  63,  51,   0,  63,  52,   0,  63,  53,   0
	DB	 63,  53,   0,  63,  54,   0,  63,  55,   0,  63,  55,   0
	DB	 63,  56,   0,  63,  57,   0,  63,  57,   0,  63,  58,   0
	DB	 63,  58,   0,  63,  59,   0,  63,  60,   0,  63,  60,   0
	DB	 63,  61,   0,  63,  62,   0,  63,  62,   0,  63,  63,   0

FirePal contains the first 64 colours of the palette that will be used, stored as (Red,Green,Blue) byte triplets where each value is between 0 and 63. The remaining 192 colours of the palette are all white and will get set separately when programming the VGA palette.

ByeMsg  DB 'FIRE was coded bye Jare of VangeliSTeam, 9-10/5/93', 13, 10
        DB 'Sayonara', 13, 10, 10
        DB 'ELYSIUM music composed by Jester of Sanity (an Amiga demo group, I think)', 13, 10
        DB 'The music system you''ve just been listening is the VangeliSTracker 1.2b', 13, 10
        DB 'VangeliSTracker is Freeware (no money required), and distributed in source code', 13, 10
        DB 'If you haven''t got your copy of the VangeliSTracker, please go to your', 13, 10
        DB 'nearest BBS and get it NOW', 13, 10
        DB 'Also, don''t forget that YOU can join the VangeliSTeam. Contact the', 13, 10
        DB 'VangeliSTeam in the following addresses: ', 13, 10, 10
        DB '  Mail:     VangeliSTeam                          ³ This demo is dedicated to', 13, 10
        DB '            Juan Carlos Arévalo Baeza             ³        Mark J. Cox', 13, 10
        DB '            Apdo. de Correos 156.405              ³            and', 13, 10
        DB '            28080 - Madrid (Spain)                ³      Michael Abrash', 13, 10
        DB '  Internet: jarevalo@moises.ls.fi.upm.es          ³ At last, the PC showed good', 13, 10
        DB '  Fidonet:  2:341/27.16, 2:341/15.16, 2:341/9.21  ³ for something.', 13, 10, 10
        DB 'Greetings to all demo groups and MOD dudes around.', 13, 10
        DB '$'

ByeMsg contains the text that's printed before exiting the program. 13 and 10 are the ASCII character codes for carriage return and newline, respectively. The dollar sign signals the end of the string.

        UDATASEG

Imagen  DB 80*50 DUP (?)
Imagen2 DB 80*50 DUP (?)

These two 4000-byte uninitialized arrays will be used for storing the intensity of the fire in each pixel. One array is for the current frame, and the other for the previous one.

Setting up the VGA

        .CODE
        .STARTUP

        CLD
        MOV     AX,13h
        INT     10h
        CLI

CLD clears the direction flag (the DF bit in FLAGS), which means the index registers SI and DI get incremented (as opposed to decremented) after each string operation, such as LODS and STOS below.

INT 10h raises an interrupt that's handled by the VGA BIOS, like a system call. The contents of the AH register (the high 8 bits of AX) specify the function, in our case 0 which means "Set video mode", and AL specifies which mode to set, in our case Mode 13h. This mode has a resolution of 320x200 pixels and 256 colors, specified with one byte per pixel in a linear address space starting at A000:0000. The BIOS configures this mode by writing specific values to the many VGA registers that control exactly how the contents of the VGA memory, the frame-buffer, is to be accessed and drawn on the screen.

The CLI instruction disables interrupts. This is in order to protect the code that writes directly to VGA registers below. The code will perform OUT operations in a certain order, and could break if an interrupt handler performed other I/O operations in the middle of it all.

        MOV     DX,3c4h
        MOV     AX,604h                 ; "Unchain my heart". And my VGA...
        OUT     DX,AX
        MOV     AX,0F02h                ; All planes
        OUT     DX,AX

Now the code starts to tweak the VGA registers, moving away from the standard Mode 13h. 3c4h, loaded into DX, is the I/O port number of the VGA Sequence Controller Index register (See The Graphics Programming Black Book Chapter 23 and FreeVGA).

By doing OUT DX,AX, the code writes the 16-bit value in AX to the port, which is effectively the same as writing the 8-bit value in AL to 3c4h (Sequence Controller Index Register) and the 8-bit value in AH to 3c5h (Sequence Controller Data Register). The Index Register selects an internal Sequence Controller register, and the Data Register provides the value to write into it.
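In C terms, the pattern looks something like the sketch below, using Linux's <sys/io.h> port helpers (this assumes ioperm() has granted access to the ports; it illustrates the index/data convention and is not code from the demo):

#include <sys/io.h>

/* Write 'value' into the VGA register selected by 'index', via an
   index/data register pair such as 3c4h/3c5h. */
static void write_vga_reg(unsigned short index_port, unsigned char index,
                          unsigned char value)
{
        outb(index, index_port);       /* e.g. 3c4h: select the register */
        outb(value, index_port + 1);   /* e.g. 3c5h: write its new value */
}

/* write_vga_reg(0x3c4, 0x04, 0x06) reproduces the first word-sized OUT
   above (index 04h, value 06h). */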

In our case, the code is writing 06h to register index 04h, which is the Sequencer Memory Mode Register. This disables the Chain 4 bit which is otherwise set in mode 13h. This is what the "Unchain" comment refers to: turning off Chain-4 addressing mode and entering normal mode.

The VGA RAM is split into four different "planes", which were often implemented by four different memory chips on the circuit board. One reason was to solve the frame-buffer memory-access problem: to output 70 high-resolution frames per second, the VGA's CRT controller would need to read bytes at a higher rate than was feasible for a byte-addressed DRAM chip at the time. But with the frame-buffer split into four planes, stored in four chips, the CRT controller could read four bytes in parallel at a time, enough to keep up with the CRT refresh rate.

Chain 4 is a mode for addressing the four memory planes. When enabled, it uses the two least significant bits of the address to select which plane to read or write to (and leaves those two bits clear when addressing inside the plane, if I understand correctly), allowing linear addressing of the four planes "chained together". For example, writes to A000:0004, A000:0005, and A000:0006 in Chain 4 mode would end up at address 4 in plane 0, 1, and 2 respectively.
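If my reading is right, the Chain-4 address mapping can be summarized with a small helper; this is a sketch of my understanding rather than authoritative VGA documentation:

/* Decode a linear Chain-4 offset into a plane number and an offset
   within that plane, per the description above. */
struct plane_addr {
        unsigned plane;    /* 0..3: which memory plane */
        unsigned offset;   /* byte offset inside that plane */
};

static struct plane_addr chain4_decode(unsigned linear)
{
        struct plane_addr a;

        a.plane  = linear & 3;    /* low two bits select the plane */
        a.offset = linear & ~3u;  /* same offset, low bits cleared */
        return a;
}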

With Chain 4 disabled, the programmer has to explicitly select which plane(s) to access by setting the VGA Sequence Controller's Map Mask Register (index 02h). The write of 0Fh to that register enables writes to all four planes at once, hence the "All planes" comment. This means that each byte written to the framebuffer will get written to all four planes at that address, effectively appearing as four consecutive identical pixels.

        MOV     DX,3D4h
        MOV     AX,14h                  ; Disable dword mode
        OUT     DX,AX
        MOV     AX,0E317h               ; Enable byte mode.
        OUT     DX,AX

The VGA Sequence Controller controls how the frame-buffer is accessed from the CPU, but it's the CRT Controller that decides how to access the frame-buffer when scanning it to produce the video signal. 3D4h addresses the CRT Controller's Index Register (immediately followed by the Data Register). Writing 0014h to that port sets the Underline Location Register to zero, clearing the DW and DIV4 bits which enabled the double-word addressing mode that is normally used for scanning when Chain-4 is enabled. The write of E3h (the leading 0 in 0E317h is required for the assembler to recognize it as a number) to index 17h sets the Byte Mode bit in the CRTC Mode Control Register.

If I understand correctly, the reason for scanning to be done with double-word addressing in mode 13h is that Chain-4 clears the lower two address bits when writing into a plane. This means that after the scanner has read a value from each plane, it needs to increment the address by four (the size in bytes of a 32-bit "double word") to get to the next set of values.

        MOV     AL,9
        OUT     DX,AL
        INC     DX
        IN      AL,DX
        AND     AL,0E0h                 ; Duplicate each scan 8 times.
        ADD     AL,7
        OUT     DX,AL

The first two instructions above write 09h to the CRT Controller Index Register, which is the index of the Maximum Scan Line Register. Then DX is incremented to address the port of the CRT Controller Data Register, after which a byte is read, masked, added with 7, and written back, resulting in the Maximum Scan Line field of the register being set to 7, which means each scan line will be repeated eight (7+1) times.

Regular mode 13h produces 400 scan lines, with each scan line repeated twice for a vertical resolution of 200 pixels. With the operation above, the vertical resolution becomes 50 pixels instead. Mode 13h has a horizontal resolution of 320 pixels, but with our "unchaining" and writing to all four planes at once above, we now have a horizontal resolution of 80 pixels instead. In summary, these operations have changed from the 256-color 320-by-200 pixel mode 13h to a custom 256-color 80-by-50 mode.

Why is this lower resolution desirable? Aren't the pixels chunky enough in 320x200 mode? The reason was probably to make the program run faster. Computing the values for 80x50 pixels is much less work than for 320x200, so the lower resolution allows for producing more frames per second on a slow machine.

        MOV     DX,3c8h                 ; Setup palette.
        XOR     AL,AL
        OUT     DX,AL
        INC     DX
        MOV     CX,64*3
        MOV     SI,OFFSET FirePal       ; Prestored...
@@pl1:
         LODSB
         OUT    DX,AL
         LOOP   @@pl1

The DAC (Digital-to-Analog Converter) is the part of the video adapter responsible for converting the bits coming out of memory to an analog video signal that can be fed to a monitor. It contains 256 registers, mapping each possible byte value to an 18-bit color representation: 6 bits for red, green, and blue intensity, respectively. (The VGA also has something called the Palette RAM, which is different and used for EGA compatibility.)

To program the DAC, our program first writes a zero to 3c8h, the DAC Address Write Mode Register, signalling that it wishes to set the value of DAC register zero. It then writes repeatedly to port 3c9h, the DAC Data Register, three byte-sized writes for each of the 64 colours in FirePal (LODSB reads a byte from DS:SI and then increments SI, LOOP jumps to a label and decrements CX until it's zero).

        MOV     AL,63
        MOV     CX,192*3                ; And white heat.
@@pl2:
         OUT    DX,AL
         LOOP   @@pl2

The code above fills the remaining 192 DAC registers with "white heat": all-white (red, green and blue all 63) color values.
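Sketched in C with Linux port I/O (again assuming ioperm() access; firepal stands for the 64-colour FirePal table above):

#include <sys/io.h>

static void load_fire_palette(const unsigned char *firepal /* 64*3 bytes */)
{
        int i;

        outb(0, 0x3c8);                  /* start writing at DAC register 0 */
        for (i = 0; i < 64 * 3; i++)
                outb(firepal[i], 0x3c9); /* R, G, B for colours 0..63 */
        for (i = 0; i < 192 * 3; i++)
                outb(63, 0x3c9);         /* "white heat" for colours 64..255 */
}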

An STI instruction then turns interrupts back on, now that the code for setting up the VGA is done.

Main Loop

        MOV     AX,DS
        MOV     ES,AX
        MOV     DI,OFFSET Imagen        ; Cleanup both Images.
        MOV     CX,80*50
        XOR     AX,AX
        REP STOSW

Before we enter the main loop, the code above clears the Imagen and Imagen2 arrays using REP STOSW which performs a word-sized write (of AX, which is zero) to ES:DI, increments DI, and repeats 4000 (CX) times. Using word-sized writes means the code writes 8000 bytes in total, clearing both arrays.

MainLoop:
        MOV     DX,3DAh                 ; Retrace sync.
@@vs1:
        IN      AL,DX
        TEST    AL,8
        JZ      @@vs1
@@vs2:
        IN      AL,DX
        TEST    AL,8
        JNZ     @@vs2

The main loop starts by reading from the 3DAh I/O port, which is the VGA's Input Status #1 Register, and checking the VRetrace bit. It loops first while the bit is zero and then while it's one, effectively waiting for it to go from one to zero, thus synchronizing the loop with the VGA refresh cycle.
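The same busy-wait, sketched in C with inb() from <sys/io.h> (assuming raw port access as in the bare-metal setting):

#include <sys/io.h>

static void wait_for_vertical_retrace(void)
{
        while (!(inb(0x3da) & 8))
                ;                /* wait for the VRetrace bit to go high */
        while (inb(0x3da) & 8)
                ;                /* ...and then wait for it to drop again */
}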

        PUSH    DS
        POP     ES
        MOV     SI,81+OFFSET Imagen     ; Funny things start here. 8-P
        MOV     DI,81+OFFSET Imagen2
        MOV     CX,48*80-2
        XOR     BH,BH

Treating Imagen and Imagen2 as 80-by-50 two-dimensional arrays (matching the screen resolution), SI and DI are set up to point to the second element on the second row (counting from the top-left corner) of Imagen and Imagen2, respectively. CX will be used for the loop count, and BH is cleared to be used as a zero below.

@@lp:
        XOR     AX,AX
        ADD     AL,-1[SI]
        ADC     AH,BH
        ADD     AL,-80[SI]
        ADC     AH,BH
        ADD     AL,-79[SI]
        ADC     AH,BH
        ADD     AL,-81[SI]
        ADC     AH,BH
        ADD     AL,1[SI]
        ADC     AH,BH
        ADD     AL,80[SI]
        ADC     AH,BH
        ADD     AL,79[SI]
        ADC     AH,BH
        ADD     AL,81[SI]
        ADC     AH,BH

The code above sums together the values of all eight pixels neighbouring SI in Imagen into AX (-1[SI] is the pixel to the left of SI, -80[SI] is the pixel just above, etc.). First the low bits are added to AL, then any carry bit is added to AH using ADC.

It is because the code accesses neighbours of SI that it was set up to start at the second element of the second row in Imagen, and why the loop count in CX was chosen so the process will stop after the second-last element of the second-last row.

        ROR     AX,1
        ROR     AX,1
        ROR     AX,1

Rotating the bits in AX three steps to the right leaves AL containing the previous sum divided by eight, in other words it contains the average of the eight values surrounding SI. This is the core idea in the fire effect: computing the "heat" of each pixel as an average of its neighbours.

        TEST    AH,60h                  ; Wanna know why 60h? Me too.
        JNZ     @@nx                    ; This is pure experience.

After the ROR instructions, the three least significant bits of the sum of neighbours have ended up as the three highest bits of AH. This means that the TEST instruction effectively checks whether the two low bits of the sum were set. If they were not, we fall through to the code below. As the comment suggests, this was probably chosen somewhat randomly.

         CMP    DI,46*80+OFFSET Imagen2 ; And this was a bug.
         JNC    @@dec                   ; This one's by my cat.
          OR    AL,AL                   ; My dog coded here too.
          JZ    @@nx                    ; I helped my sister with this one.
@@dec:
           DEC  AL                      ; Yeah! Cool a bit, please.

The code above checks whether DI is past the first 46 rows of Imagen2, and if so jumps straight to @@dec. Otherwise, the code checks whether AL is greater than zero, and only proceeds to @@dec if so.

All this is effectively to decide whether to decrement AL, thereby "cooling" that pixel. If no cooling occurred, the screen would eventually fill with a single colour. Instead, the code cools pixels given the semi-random condition that the two low bits of the neighbour sum are zero (so roughly 25% of the time).

If AL is already zero however, decrementing doesn't "cool" it, but rather "re-ignites" it since the value wraps around to 255. The code only allows this for pixels in the lower four rows, which is how it "feeds the fire" from below. Note that when the program starts, all pixels are initially zero, so the low bits of the sum will be zero, and all pixels on the lower rows will ignite, causing the initial burst of flame.
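Restated as a C helper (a sketch; the 80-pixel row width and the 46*80 bottom-row threshold follow the assembly above):

#include <stdint.h>

/* Compute the new value for pixel i from the previous frame. The caller
   is expected to keep i at least one row/column away from the edges. */
static uint8_t update_pixel(const uint8_t *prev, int i)
{
        uint32_t sum = prev[i - 81] + prev[i - 80] + prev[i - 79] +
                       prev[i - 1]                + prev[i + 1] +
                       prev[i + 79] + prev[i + 80] + prev[i + 81];
        uint8_t val = (uint8_t)(sum / 8);

        /* Cool the pixel when the two low bits of the sum are clear;
           in the bottom four rows, a zero pixel wraps to 255 and ignites. */
        if ((sum & 3) == 0 && (val > 0 || i >= 46 * 80))
                val--;
        return val;
}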

@@nx:
        INC     SI
        STOSB
        LOOP    @@lp                    ; New image stored in Imagen2.

With the final value of AL computed, STOSB writes it to the address pointed to by DI which it also increments. SI is also incremented, and the loop repeats with the next pixel.

        MOV     SI,80+OFFSET Imagen2    ; Scrolling copy. :-)
        MOV     DI,OFFSET Imagen
        MOV     CX,40*48
        REP     MOVSW

With all the new pixel values in Imagen2, the program now copies them back to Imagen for next time. By starting the source pointer (SI) 80 bytes into the array, the copy effectively scrolls the contents up one line. The actual copying is done with REP MOVSW which performs 40*48 (CX) word-sized moves from DS:SI to ES:DI, incrementing SI and DI after each one. Only 40 moves are needed per line because they are word-sized, and only 48 lines are copied because the top line is discarded (by starting at offset 80) and the bottom line is all zeros.

        MOV     SI,80*43+OFFSET Imagen2 ; Get rid of some ashes.
        MOV     CX,6*80
        MOV     AH,22
@@rcl:
         MOV    AL,[SI]
         CMP    AL,15
         JNC    @@rcn
          SUB   AL,AH
          NEG   AL
          MOV   [SI],AL
@@rcn:
         INC    SI
         LOOP   @@rcl

By "ashes", the code means pixels with low heat values. Such pixels look a bit unsightly in the bottom lines, so to smooth things over, the code above loops over the pixels in the bottom six lines, looking for pixels with values lower than 15. For such pixels, the code subtracts 22 (AH), and negates the result (effectively computing 22 minus the pixel value), which brightens them up a bit.

        MOV     SI,80+OFFSET Imagen2    ; And show it.
        MOV     DI,0
        MOV     AX,0A000h
        MOV     ES,AX
        MOV     CX,40*48
        REP     MOVSW

With all the pixel values ready in Imagen2, the program copies them over to the 80x50 linearly addressed framebuffer at A000:0000 using the same "scrolling copy" technique as before. The frame will be displayed the next time the monitor refreshes.

        MOV     AH,1
        INT     16h
        JNZ     Bye
        JMP     MainLoop

After the frame has been copied to the graphics memory, the code invokes Int 16/AH=01h to check whether there's a keystroke in the keyboard buffer. If there's not, the MainLoop continues, otherwise it jumps to the code below.

Epilogue

Bye:
        XOR     AH,AH
        INT     16h
        MOV     AX,3
        INT     10h
        MOV     DX,OFFSET ByeMsg
        MOV     AH,9
        INT     21h

First, Int 16/AH=00h is invoked to retrieve the keystroke from the keyboard buffer (the result, in AX, is ignored). Then Int 10/AH=00h is used to reset the video mode back to 03h, which is the regular 80x25 16-color text mode. Finally, Int 21/AH=09h is used to write the goodbye message to the screen.

        MOV     AX,4C00h
        INT     21h

        END
; ------------------------------ End of FIRE.ASM ---------------------------

At the very end, Int 21/AH=4Ch terminates the program.

That's it: 200 lines of assembly and the rest is history.

Firedemo in SDL

After reading through the original firedemo code above, I wanted to re-implement it to run on modern operating systems. In the Othello project, we did some graphical programming by using the native libraries (Xlib, Win32 GDI, Cocoa, etc.), but in this case we're not trying to build a graphical user interface, we just want to paint pixels on the screen. One popular cross-platform library for doing that, often used in games programming, is SDL (Simple Directmedia Layer).

The code below (available in fire.c) is a pixel-perfect port of the firedemo to SDL2. (It mainly follows this guidance from the SDL2 migration guide.) Hopefully it's a little easier to read than the assembly version.

#include <SDL.h>
#include <stdbool.h>
#include <stdio.h>

#define WIDTH 80
#define HEIGHT 50
#define WIN_WIDTH 640
#define WIN_HEIGHT 400
#define FPS 30

static const uint32_t palette[256] = {
        /* Jare's original FirePal. */
#define C(r,g,b) ((((r) * 4) << 16) | ((g) * 4 << 8) | ((b) * 4))
        C( 0,  0,  0), C( 0,  1,  1), C( 0,  4,  5), C( 0,  7,  9),
        C( 0,  8, 11), C( 0,  9, 12), C(15,  6,  8), C(25,  4,  4),
        C(33,  3,  3), C(40,  2,  2), C(48,  2,  2), C(55,  1,  1),
        C(63,  0,  0), C(63,  0,  0), C(63,  3,  0), C(63,  7,  0),
        C(63, 10,  0), C(63, 13,  0), C(63, 16,  0), C(63, 20,  0),
        C(63, 23,  0), C(63, 26,  0), C(63, 29,  0), C(63, 33,  0),
        C(63, 36,  0), C(63, 39,  0), C(63, 39,  0), C(63, 40,  0),
        C(63, 40,  0), C(63, 41,  0), C(63, 42,  0), C(63, 42,  0),
        C(63, 43,  0), C(63, 44,  0), C(63, 44,  0), C(63, 45,  0),
        C(63, 45,  0), C(63, 46,  0), C(63, 47,  0), C(63, 47,  0),
        C(63, 48,  0), C(63, 49,  0), C(63, 49,  0), C(63, 50,  0),
        C(63, 51,  0), C(63, 51,  0), C(63, 52,  0), C(63, 53,  0),
        C(63, 53,  0), C(63, 54,  0), C(63, 55,  0), C(63, 55,  0),
        C(63, 56,  0), C(63, 57,  0), C(63, 57,  0), C(63, 58,  0),
        C(63, 58,  0), C(63, 59,  0), C(63, 60,  0), C(63, 60,  0),
        C(63, 61,  0), C(63, 62,  0), C(63, 62,  0), C(63, 63,  0),
        /* Followed by "white heat". */
#define W C(63,63,63)
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W,
        W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W
#undef W
#undef C
};

static uint8_t fire[WIDTH * HEIGHT];
static uint8_t prev_fire[WIDTH * HEIGHT];
static uint32_t framebuf[WIDTH * HEIGHT];

int main()
{
        SDL_Window *window;
        SDL_Renderer *renderer;
        SDL_Texture *texture;
        SDL_Event event;
        int i;
        uint32_t sum;
        uint8_t avg;
        bool full_screen = false;
        bool keep_running = true;

        if (SDL_Init(SDL_INIT_VIDEO) < 0) {
                fprintf(stderr, "Failed SDL_Init: %s\n", SDL_GetError());
                return 1;
        }

        window = SDL_CreateWindow("SDL2 firedemo (www.hanshq.net/fire.html)",
                                  SDL_WINDOWPOS_UNDEFINED,
                                  SDL_WINDOWPOS_UNDEFINED,
                                  WIN_WIDTH, WIN_HEIGHT,
                                  SDL_WINDOW_SHOWN | SDL_WINDOW_RESIZABLE);
        if (window == NULL) {
                fprintf(stderr, "Failed CreateWindow: %s\n", SDL_GetError());
                return 1;
        }

        renderer = SDL_CreateRenderer(window, -1, 0);
        if (renderer == NULL) {
                fprintf(stderr, "Failed CreateRenderer: %s\n", SDL_GetError());
                return 1;
        }

        texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ARGB8888,
                                    SDL_TEXTUREACCESS_STREAMING,
                                    WIDTH, HEIGHT);
        if (texture == NULL) {
                fprintf(stderr, "Failed CreateTexture: %s\n", SDL_GetError());
                return 1;
        }

        while (keep_running) {
                while (SDL_PollEvent(&event)) {
                        if (event.type == SDL_QUIT) {
                                keep_running = false;
                        } else if (event.type == SDL_KEYDOWN) {
                                if (event.key.keysym.sym == SDLK_f) {
                                        full_screen = !full_screen;
                                        SDL_SetWindowFullscreen(window,
                                                full_screen ?
                                                SDL_WINDOW_FULLSCREEN_DESKTOP : 0);
                                } else if (event.key.keysym.sym == SDLK_q) {
                                        keep_running = false;
                                }
                        }
                }

                for (i = WIDTH + 1; i < (HEIGHT - 1) * WIDTH - 1; i++) {
                        /* Average the eight neighbours. */
                        sum = prev_fire[i - WIDTH - 1] +
                              prev_fire[i - WIDTH] +
                              prev_fire[i - WIDTH + 1] +
                              prev_fire[i - 1] +
                              prev_fire[i + 1] +
                              prev_fire[i + WIDTH - 1] +
                              prev_fire[i + WIDTH] +
                              prev_fire[i + WIDTH + 1];
                        avg = (uint8_t)(sum / 8);

                        /* "Cool" the pixel if the two bottom bits of the
                           sum are clear (somewhat random). For the bottom
                           rows, cooling can overflow, causing "sparks". */
                        if (!(sum & 3) &&
                            (avg > 0 || i >= (HEIGHT - 4) * WIDTH)) {
                                avg--;
                        }
                        fire[i] = avg;
                }

                /* Copy back and scroll up one row.
                   The bottom row is all zeros, so it can be skipped. */
                for (i = 0; i < (HEIGHT - 2) * WIDTH; i++) {
                        prev_fire[i] = fire[i + WIDTH];
                }

                /* Remove dark pixels from the bottom rows (except again the
                   bottom row which is all zeros). */
                for (i = (HEIGHT - 7) * WIDTH; i < (HEIGHT - 1) * WIDTH; i++) {
                        if (fire[i] < 15) {
                                fire[i] = 22 - fire[i];
                        }
                }

                /* Copy to framebuffer and map to RGBA, scrolling up one row. */
                for (i = 0; i < (HEIGHT - 2) * WIDTH; i++) {
                        framebuf[i] = palette[fire[i + WIDTH]];
                }

                /* Update the texture and render it. */
                SDL_UpdateTexture(texture, NULL, framebuf,
                                  WIDTH * sizeof(framebuf[0]));
                SDL_RenderClear(renderer);
                SDL_RenderCopy(renderer, texture, NULL, NULL);
                SDL_RenderPresent(renderer);

                SDL_Delay(1000 / FPS);
        }

        SDL_DestroyTexture(texture);
        SDL_DestroyRenderer(renderer);
        SDL_DestroyWindow(window);
        SDL_Quit();

        return 0;
}

To build and run the program on Debian GNU/Linux (or Ubuntu):

$ sudo apt-get install libsdl2-dev
$ gcc -O3 -o fire `sdl2-config --cflags --libs` fire.c
$ ./fire

To install SDL2 from MacPorts and build on Mac:

$ sudo port install libsdl2
$ clang -O3 -o fire `sdl2-config --cflags --libs` fire.c
$ ./fire

To build on Windows, download the latest "Visual C++ 32/64-bit" development library from the SDL 2.0 download page (currently the latest version is SDL2-devel-2.0.7-VC.zip). Extract that somewhere (I used C:\), and build in a Visual Studio Developer Command Prompt:

cl /Ox /DSDL_MAIN_HANDLED /Ic:\SDL2-2.0.7\include c:\SDL2-2.0.7\lib\x86\SDL2.lib fire.c
copy c:\SDL2-2.0.7\lib\x86\SDL2.dll .
fire.exe

The /DSDL_MAIN_HANDLED flag is to prevent SDL from replacing the main function. The copy is to make sure the SDL2.dll can be found when running the program.

(The program may not work in VirtualBox if video acceleration is not set up correctly. In that case, pass SDL_RENDERER_SOFTWARE instead of the 0 argument in the call to SDL_CreateRenderer.)

A New Fire Demo for DOS

I don't think the firedemo above was actually the program I saw that evening in the nineties. The way I remember it, the flames were just along the bottom of the screen. What I remember resembles much more what's described in Lode Vandevenne's Fire Effect tutorial.

One important difference in how that tutorial creates the fire is that it only averages pixel values on rows below the current one. This means the computation can be performed on a single buffer, in other words, there is no need to have separate buffers for the current and previous frame.

That makes things easier, and since the fire is located mostly along the bottom of the screen, it should be no problem running this in 320x200 resolution, even on a slow machine.
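
To make the single-buffer idea concrete before diving into the assembly, here is a minimal C sketch of just the propagation step. The buffer size, names and the rand() seeding are mine, for illustration; the real logic is in fire.asm below:

/* A minimal C sketch of the single-buffer propagation step used by fire.asm
   below. Buffer sizes, names, main() and the rand() seeding are illustrative
   only; the real thing is the assembly listing that follows. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH  320
#define HEIGHT 200

/* One padding row below the screen so that reading "two rows below" from the
   bottom row stays in bounds (fire.asm just reads whatever happens to follow
   the framebuffer in its segment). */
static uint8_t fire[WIDTH * (HEIGHT + 1)];

static void fire_step(void)
{
        int x, y, i, sum;

        /* Random "heat" along the bottom visible row. */
        for (x = 0; x < WIDTH; x++) {
                fire[(HEIGHT - 1) * WIDTH + x] = (uint8_t)rand();
        }

        /* Work upwards through the 50 flame rows, as fire.asm does. Each pixel
           depends only on rows below it, which were already written this
           frame, so a single buffer is enough. */
        for (y = HEIGHT - 2; y >= HEIGHT - 51; y--) {
                for (x = 1; x < WIDTH - 1; x++) {
                        i = y * WIDTH + x;
                        sum = fire[i + WIDTH - 1]     /* below-left      */
                            + fire[i + WIDTH]         /* below           */
                            + fire[i + WIDTH + 1]     /* below-right     */
                            + fire[i + 2 * WIDTH];    /* two rows below  */
                        fire[i] = (uint8_t)(sum * 15 / 64); /* average and cool */
                }
        }
}

int main(void)
{
        int frame, x, total = 0;

        for (frame = 0; frame < 100; frame++) {
                fire_step();
        }
        /* Crude sanity check: average intensity of a row near the bottom. */
        for (x = 0; x < WIDTH; x++) {
                total += fire[(HEIGHT - 10) * WIDTH + x];
        }
        printf("average intensity near the bottom: %d\n", total / WIDTH);
        return 0;
}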

I've used this technique to make a little fire demo of my own (fire.asm):

        org 0x100       ; For .com file.

section .text
start:
        ; Enter mode 13h: 320x200, 1 byte (256 colors) per pixel.
        mov ax, 0x13
        int 0x10

        ; Make sure es and ds point to our segment (cs).
        push cs
        push cs
        pop ds
        pop es

        ; Write string.
        mov ax, 0x1300          ; ah=13h, al=write mode
        mov bx, 0xf             ; bh=page number (0), bl=attribute (white)
        mov cx, (msg_end - msg) ; cx=length
        mov dx, ((10 << 8) + (40 / 2 - (msg_end - msg) / 2)) ; dh=row, dl=column
        mov bp, msg             ; es:bp=string address
        int 0x10

        ; Set up the palette.
        ; Jare's original FirePal:
        cli             ; No interrupts while we do this, please.
        mov dx, 0x3c8   ; DAC Address Write Mode Register
        xor al, al
        out dx, al      ; Start setting DAC register 0
        inc dx          ; DAC Data Register
        mov cx, (firepal_end - firepal)
        mov si, firepal
setpal1:
        lodsb
        out dx, al      ; Set DAC register (3 byte writes per register)
        loop setpal1
        mov al, 63
        mov cx, (256 * 3 - (firepal_end - firepal))
setpal2:
        out dx, al      ; Set remaining registers to "white heat".
        loop setpal2
        sti             ; Re-enable interrupts.

        ; A buffer at offset 0x1000 from our segment will be used for preparing
        ; the frames. Copy the current framebuffer (the text) there.
        push 0xa000
        pop ds
        push cs
        pop ax
        add ax, 0x1000
        mov es, ax
        xor si, si
        xor di, di
        mov cx, (320 * 200 / 2)
        cld
        rep movsw       ; Copy two bytes at a time.

        push es
        pop ds
mainloop:
        ; On entry to the loop, es and ds should point to the scratch buffer.

        ; Since we'll be working "backwards" through the framebuffer, set the
        ; direction flag, meaning stosb etc. will decrement the index registers.
        std

        ; Let di point to the pixel to be written.
        mov di, (320 * 200 - 1)

        ; Write random values to the bottom row.
        ; For random numbers, use "x = 181 * x + 359" from
        ; Tom Dickens "Random Number Generator for Microcontrollers"
        ; http://home.earthlink.net/~tdickens/68hc11/random/68hc11random.html
        mov cx, 320
        xchg bp, ax     ; Fetch the seed from bp.
bottomrow:
        imul ax, 181
        add ax, 359
        xchg al, ah     ; It's the high 8 bits that are random.
        stosb
        xchg ah, al
        loop bottomrow
        xchg ax, bp     ; Store the seed in bp for next time.

        ; For the next 50 rows, propagate the fire upwards.
        mov cx, (320 * 50)
        mov si, di
        add si, 320     ; si points at the pixel below di.
propagate:
        ; Add the pixel below, below-left, below-right and two steps below.
        xor ax, ax
        mov al, [si]
        add al, [si - 1]
        adc ah, 0
        add al, [si + 1]
        adc ah, 0
        add al, [si + 320]
        adc ah, 0
        imul ax, 15
        shr ax, 6       ; Compute floor(sum * 15 / 64), averaging and cooling.
        stosb
        dec si
        loop propagate

        ; Mirror some of the fire onto the text.
        mov dx, 15              ; Loop count, decrementing.
        mov di, (90 * 320)      ; Destination pixel.
        mov si, (178 * 320)     ; Source pixel.
mirrorouter:
        mov cx, 320     ; Loop over each pixel in the row.
mirrorinner:
        mov al, [di]    ; Load destination pixel.
        test al, al     ; Check if it's zero.
        lodsb           ; Load the source pixel into al.
        jnz mirrorwrite ; For non-zero destination pixel, don't zero al.
        xor al, al
mirrorwrite:
        stosb           ; Write al to the destination pixel.
        loop mirrorinner
        add si, 640     ; Bump si to the row below the one just processed.
        dec dx
        jnz mirrorouter

        ; Sleep for one system clock tick (about 1/18.2 s).
        xor ax, ax
        int 0x1a        ; Returns nbr of clock ticks in cx:dx.
        mov bx, dx
sleeploop:
        xor ax, ax
        int 0x1a
        cmp dx, bx
        je sleeploop

        ; Copy from the scratch buffer to the framebuffer.
        cld
        push 0xa000
        pop es
        mov cx, (320 * (200 - 3) / 2)
        xor si, si
        mov di, (320 * 3)       ; Scroll down three rows to avoid noisy pixels.
        rep movsw

        ; Restore es to point to the scratch buffer.
        push ds
        pop es

        ; Check for key press.
        mov ah, 1
        int 0x16
        jz mainloop

done:
        ; Fetch key from buffer.
        xor ah, ah
        int 0x16

        ; Return to mode 3.
        mov ax, 0x3
        int 0x10

        ; Exit with code 0.
        mov ax, 0x4c00
        int 0x21

; Data.
msg: db 'www.hanshq.net/fire.html'
msg_end:

firepal:
        db     0,   0,   0,   0,   1,   1,   0,   4,   5,   0,   7,   9
        db     0,   8,  11,   0,   9,  12,  15,   6,   8,  25,   4,   4
        db    33,   3,   3,  40,   2,   2,  48,   2,   2,  55,   1,   1
        db    63,   0,   0,  63,   0,   0,  63,   3,   0,  63,   7,   0
        db    63,  10,   0,  63,  13,   0,  63,  16,   0,  63,  20,   0
        db    63,  23,   0,  63,  26,   0,  63,  29,   0,  63,  33,   0
        db    63,  36,   0,  63,  39,   0,  63,  39,   0,  63,  40,   0
        db    63,  40,   0,  63,  41,   0,  63,  42,   0,  63,  42,   0
        db    63,  43,   0,  63,  44,   0,  63,  44,   0,  63,  45,   0
        db    63,  45,   0,  63,  46,   0,  63,  47,   0,  63,  47,   0
        db    63,  48,   0,  63,  49,   0,  63,  49,   0,  63,  50,   0
        db    63,  51,   0,  63,  51,   0,  63,  52,   0,  63,  53,   0
        db    63,  53,   0,  63,  54,   0,  63,  55,   0,  63,  55,   0
        db    63,  56,   0,  63,  57,   0,  63,  57,   0,  63,  58,   0
        db    63,  58,   0,  63,  59,   0,  63,  60,   0,  63,  60,   0
        db    63,  61,   0,  63,  62,   0,  63,  62,   0,  63,  63,   0
firepal_end:

To assemble the program and run it with Dosbox on Linux:

$ sudo apt-get install nasm dosbox
$ nasm fire.asm -fbin -o fire.com
$ dosbox fire.com

(fire.com can also be downloaded here.)

On Mac:

$ sudo port install nasm dosbox
$ nasm fire.asm -fbin -o fire.com
$ dosbox fire.com

For Windows, the idea is the same, but you have to download the programs from the nasm and Dosbox web sites manually.

Running on Bare Metal

While the fire.com demo above runs under MS-DOS, the program doesn't actually use DOS for anything. In fact, it's not so much a DOS program as an IBM PC-compatible program: it's just 16-bit x86 code, some BIOS calls and fiddling with the VGA.

The exciting thing is that while PCs have gotten much faster and more capable in the last 20 years, the old stuff is still there. It should be possible to run my program on a modern PC, without the help of any operating system.

Running a program without an operating system is sometimes referred to as running on bare metal. This is most common in embedded systems, but it's possible on PCs as well.

When a PC starts, it first performs power-on self tests (POST), and then proceeds to load the operating system. Typically it loads it from the hard drive, but it can also boot from other devices such as a CD-ROM, USB stick or floppy disk.

The way a PC traditionally decides if it can boot from some medium is by reading the first sector (512 bytes) of it and checking whether that ends with the two magic bytes 0x55 0xAA, the Master Boot Record boot signature. If so, it loads that sector into memory at address 0000:7c00 and runs it.
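
If you want to convince yourself that an image really ends with that signature, a tiny checker along these lines will do (this snippet is mine, not part of the original article):

/* Tiny checker (not from the article): verifies that an image file is at
   least one sector long and that its first sector ends with 0x55 0xAA. */
#include <stdio.h>

int main(int argc, char **argv)
{
        unsigned char sector[512];
        FILE *f;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: %s <image>\n", argv[0]);
                return 1;
        }
        if (fread(sector, 1, sizeof(sector), f) != sizeof(sector)) {
                fprintf(stderr, "image is shorter than one sector\n");
                fclose(f);
                return 1;
        }
        fclose(f);
        printf("boot signature %s\n",
               (sector[510] == 0x55 && sector[511] == 0xAA) ? "OK" : "missing");
        return 0;
}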

Luckily, our program fits in well under 512 bytes, so to make it run as a Master Boot Record, we just have to make it expect to be loaded at 0000:7c00:
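
        org 0x7c00      ; For a boot sector loaded at 0000:7c00 (replacing the "org 0x100" used for the .com build).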

and insert padding and the magic bytes at the end:

        times (510 - ($ - $$)) db 0      ; Pad to 510 bytes
        db 0x55                          ; MBR boot signature.
        db 0xaa

We assemble it as before:

$ nasm fire.asm -fbin -o fire.img

and end up with fire.img which contains our program and functions as a Master Boot Record.

An easy way to test this is with VirtualBox. Configure a new virtual machine, load the .img file as a virtual floppy disk, start the machine and watch it boot into the fire demo.

To create a bootable USB stick with our demo from a Linux machine, insert a USB stick and check dmesg to see what device ID it gets assigned:

$ dmesg
...
[23722.398774] usb-storage 3-1.2:1.0: USB Mass Storage device detected
[23722.400366] scsi7 : usb-storage 3-1.2:1.0
[23723.402196] scsi 7:0:0:0: Direct-Access              USB DISK 2.0
[23723.402883] sd 7:0:0:0: Attached scsi generic sg4 type 0
[23726.611204] sd 7:0:0:0: [sdc] 15138816 512-byte logical blocks: (7.75 GB/7.21 GiB)
[23726.613778] sd 7:0:0:0: [sdc] Write Protect is off
[23726.613783] sd 7:0:0:0: [sdc] Mode Sense: 23 00 00 00
[23726.615824] sd 7:0:0:0: [sdc] No Caching mode page found
[23726.615829] sd 7:0:0:0: [sdc] Assuming drive cache: write through
[23726.629461]  sdc: sdc1
[23726.638104] sd 7:0:0:0: [sdc] Attached SCSI removable disk

Note: don't try this at home if you don't know what you're doing. Also don't try it at work.

To write the image to the USB stick (it will effectively delete all existing data on the USB device; make sure you got the right device ID and don't have anything important on it):

$ sudo dd if=fire.img of=/dev/sdc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000129039 s, 4.0 MB/s

Restart the computer, boot from the USB stick (you might have to enter a BIOS menu to select boot device) and watch it run on a modern computer just like it would twenty years ago!

Further Reading

  • Michael Abrash's Graphics Programming Black Book is full of information about the VGA, including techniques like "unchaining". The full text (web friendly version) is available online.
  • Fabien Sanglard's Game Engine Black Book: Wolfenstein 3D has excellent explanations of the PC hardware of the early nineties and provided significant inspiration for this post.

Fewer toys at once may help toddlers to focus better and play more creatively


Highlights

An abundance of toys present in the environment reduced the quality of toddlers’ play.

Fewer toys at once may help toddlers to focus better and play more creatively.

This can be done in many settings to support development and promote healthy play.

Abstract

We tested the hypothesis that an environment with fewer toys will lead to higher quality of play for toddlers. Each participant (n = 36) engaged in supervised, individual free play sessions under two conditions: Four Toy and Sixteen Toy. With fewer toys, participants had fewer incidences of toy play, longer durations of toy play, and played with toys in a greater variety of ways (Z = −4.448, p < 0.001, r = −0.524; Z = 2.828, p = 0.005, r = 0.333; and Z = 4.676, p < 0.001, r = 0.55, respectively). This suggests that when provided with fewer toys in the environment, toddlers engage in longer periods of play with a single toy, allowing better focus to explore and play more creatively. This can be offered as a recommendation in many natural environments to support children’s development and promote healthy play.

Andrew Wiles on the struggle and beauty of mathematics


Roger Highfield explores the beauty of mathematics at a recent event at the Science Museum

One of the world’s greatest mathematicians, Sir Andrew Wiles, made a rare public appearance in the Science Museum this week to discuss his latest research, his belief in the value of struggle, and how to inspire the next generation.

Sir Andrew made global headlines in 1994 when he reported that he had cracked Fermat’s Last Theorem, so named because it was first formulated by the French mathematician Pierre de Fermat in 1637.

His triumph while working in Princeton marked the end of a long gruelling struggle for Sir Andrew, who first became entranced by the theorem in the early sixties, when he was 10 years old.

Why did Fermat exert such a tight grip on him? The romance of this mathematical story, ‘captivated me’, he said. ‘Fermat wrote down this problem in a copy of a book of Greek mathematics. It was only found after his death by his son.’

Last year, in recognition of his towering achievement, Sir Andrew was awarded the Abel Prize, mathematics’ equivalent of the Nobel Prize, and today the Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute continues to explore new horizons in mathematics that have been opened up by his work on Fermat.

Sir Andrew Wiles, Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute in conversation with mathematician and broadcaster Dr Hannah Fry of University College London

After a brief lecture, Sir Andrew was joined by mathematician and broadcaster Dr Hannah Fry of University College London, a familiar face in the Science Museum, having this year presented Britain’s Greatest Invention from the museum’s storage facility near Swindon, and been one of the faces associated with the museum’s Tomorrow’s World partnership with the BBC, Royal Society, Open University and Wellcome.

Despite his years of struggle, and the scepticism of his peers, did he always believe it was possible to prove Fermat’s Last Theorem? ‘Oh yes,’ said Sir Andrew. ‘I am always quite encouraged when people say something like: "You can’t do it that way."’

Did Fermat himself really have a proof? Based on the mathematics of his day, ‘the probability is almost zero,’ though he added: ‘it is just conceivable.’

Sir Andrew realised early on in his attempt on the problem that conventional mathematical approaches had been exhausted but became intrigued once again in 1986 when he realised a new route to crack the problem had opened up in mainstream mathematics, through the study of what are called elliptic curves.

During years of intense study, he placed his faith in the ‘three Bs’: Bus, bath and bed. In other words, the power of the subconscious, when his mind could relax and was given rein to wander.

Sir Andrew is a specialist in number theory, a branch of mathematics dedicated to the study of integers. So, asked Dr Fry, are there other areas of pure mathematics that he wished he had more time to study? ‘I confess that I was addicted to number theory from the time I was ten years old,’ he said. ‘I have never found anything else in mathematics that appealed quite as much.’

Broadcaster Dr Hannah Fry of University College London

Dr Fry pointed out that the implication was, of course, that there were other fields of undergraduate mathematics where the Abel prize winner felt he was weaker. ‘Definitely true’, he said.

Even so, he found a way to sate his addiction as a student. ‘There was not much number theory in undergraduate mathematics’, so he would ‘sneak off to the library to try and read Fermat’. ‘But Fermat had this really irritating habit of writing in Latin,’ said Sir Andrew. Even today, his grasp of Latin remains ‘minimal’.

Terms such as ‘elegance’ and ‘beauty’ are bandied around by many mathematicians. What do they mean? They are hard to explain but Sir Andrew likened the mathematical equivalent of experiencing the rapture of beauty to walking down a path to explore a garden by the great landscape architect Capability Brown, when a breathtaking vista suddenly beckons. In other words, elegance in mathematics ‘is this surprise element of suddenly seeing everything clarified and beautiful.’

But you should ‘not stare at it non-stop’, he warned, else the majesty will fade, as is also the case with great paintings and music.

Today he is still walking through the great garden of mathematics, ‘the language of science,’ he said. Another way Sir Andrew described his lifelong passion to the rapt audience was as a ‘beautiful edifice…the most permanent thing there is.’

Industry and government realise that mathematicians are the lifeblood of a modern economy but are concerned by the lack of uptake of maths. Most young people ‘do have a real appetite for mathematics’, said Sir Andrew, but they are put off because, he believes, their teachers are not viscerally interested in the subject.

It is in primary schools that teachers need to kindle sparks of interest in the subject but many of them aren’t actually mathematicians because so many maths graduates end up in better paid careers.

Young people need to learn from someone who truly enjoys the subject, and shows their enjoyment, he said. When the teachers don’t truly care about mathematics, ‘that gets passed on.’ One solution to attracting better teachers, he added, is to ‘pay them more’. His comment was greeted with warm applause.

Is skill at mathematics more a matter of nature than nurture? Sir Andrew disagrees with the depiction of mathematicians in the movie Good Will Hunting, which suggests success means being born with an aptitude for mathematics so that ‘it is easy.’

Sir Andrew Wiles, Royal Society Research Professor of Mathematics at Oxford’s Mathematical Institute

He told the audience that there are some things you are born with that might make it easier but, he stressed, ‘it’s never easy.’

‘Mathematicians struggle with mathematics even more than the general public does,’ said Sir Andrew. ‘We really struggle. It’s hard.’

But, he added, ‘we learn how to adapt to that struggle.’ Intriguingly, he said that some young, bright PhD mathematicians might find it hard to adapt to a life with less instant gratification from solving problems, and ‘can’t cope with being stuck for more than 24 hours’.

To be a great research mathematician, it takes character more than just technical skill. ‘You need a particular kind of personality that will struggle with things, will focus, won’t give up.’

Paradoxically, he suggested that those who are not so good at mathematics are able to cope better with research and the frustration of being stuck.

What is the next great challenge in mathematics? Sir Andrew referred to the Millennium Prize Problems, seven problems in mathematics that were highlighted by the Clay Mathematics Institute in 2000, each with a $1 million prize.

One, the Poincaré conjecture, was solved in 2003. The most famous of the remaining six, and the one that he would bet on to be cracked next, is the Riemann hypothesis, a great unsolved problem posed by Bernhard Riemann in 1859 and singled out in 1900 by the highly influential German mathematician David Hilbert. ‘It says something about the way prime numbers are distributed,’ he said.

He encouraged young mathematicians to attempt these ‘impossible problems’ while they are teens or undergraduates, to give them a taste for research, but to set them to one side when starting a career ‘to be responsible.’

The special event was introduced by Martin Bridson, Whitehead Professor of Pure Mathematics, Head of the Mathematical Institute, and by Dame Mary Archer, Chair of the Science Museum Group, who listed various mathematics initiatives in the museum.

Dame Mary pointed out that the museum’s Wonderlab interactive gallery has launched a new mathematics show for young visitors, called Primetime, to celebrate the remarkable impact of mathematics on everyday life.

Thanks to the help of Prof Marcus du Sautoy of Oxford, also one of the museum’s advisors, the Bodleian commissioned a carbon dating project of the ‘Bakhshali manuscript’, part of which is on display in the museum, which revealed that the first written record of zero, a highly influential number, dates back four centuries further than most scholars had thought.

L-R: Dame Mary Archer, Roger Highfield, Sir Andrew Wiles, Dr Hannah Fry and Martin Bridson

Since it opened last December, Mathematics: the Winton Gallery, has welcomed 1.2 million visitors and won two awards. In the audience was David Harding, who with his wife Claudia donated £5m to fund the gallery, designed by the late Dame Zaha Hadid, to inspire future generations of mathematicians.

Also in the packed IMAX theatre was the TV presenter Dara O’Briain, William Shawcross, Chairman of the Charity Commission for England and Wales, Ilyas Khan of Cambridge Quantum Computing, school teachers and many young mathematicians – perhaps even a future Abel prize-winner or Fields medalist.

High-Speed Trading: Lines, Radios, and Cables


TABB FORUM: where capital markets speak

Matt Hurd
10 May 2017

Even savvy traders, such as Getco, make mistakes and invest millions of dollars inappropriately in the wrong communication technologies in pursuit of speed. Low latency may be worth millions of dollars to your trade, but capital and recurrent expenditures may give you pause as you toss around modern HFT technology and potential ROIs. Tech can be expensive. You’d better understand it well before choosing your preferred cost and profile.

Spread Networks blew a lazy few hundred million dollars on a white elephant straighter optical fibre between Chicago and New York. Not all traders were wise enough to dodge the Spread Networks bullet, with Getco the most famous customer to spend an unjustifiable, inordinate amount. Microwave had been on that route for more than 50 years, was faster, and was already being used for trading.

Don’t make the same mistake as Spread. Be careful with your link choices and your cable choices.

Look at these cables. Decide the order of speed of propagation of signal in them. Many traders, but not so many engineers, may be surprised:

Basic cabling test: Put these in order from slowest to fastest.

The correct order from slowest to fastest, by velocity of propagation, is d, a, c, b, then e. There is faster, though. If you’re geek enough, like me, to get a kick out of this kind of thing, you may find this interesting. Most people would prefer to meander elsewhere, I suspect. I’m not the guy you want to invite to your dinner party ;-)

Latency misconceptions

Even savvy traders, such as Getco, do make mistakes and invest millions of dollars inappropriately in the wrong communication technologies. Don’t do that.

Latency may be worth millions of dollars to your trade, but capital and recurrent expenditures may give you pause as you toss around modern HFT technology and potential ROIs. Tech can be expensive. You’d better understand it well before choosing your preferred cost and profile. Let’s have a look at some of the poorly understood and interesting, to me, misunderstandings and developments that may be important to both your latency-critical and latency-sensitive trading. Let’s meander through some of the points.

Is fibre transmission faster than transmission using electrical wires? 

The answer is: It depends.

Is radio frequency transmission always faster than fibre?

The answer is: It depends.

The new low earth orbit (LEO) satellite service in pre-sales from LeoSat Enterprise LLC has reportedly snared a high-speed trading customer. Could LeoSat really be faster than terrestrial communication?

The answer is: It depends (but unlikely).

Back in the day, when Getco released some S1s & S4s, there was a bit of trading community comment regarding notes in the accounts where it was disclosed that millions of dollars had been spent on Spread Networks fibre capacity between Chicago and New York:

“Colocation and data line expenses increased $18.9 million (52.0%) to $55.2 million in 2010 from $36.3 million in 2009 primarily due to the introduction of Spread Networks, which is a fiber optic line that transmits exchange and market data between Chicago and New York, and the build out of GETCO’s Asia-Pacific colocations and data lines.” [Knight Holdco, Inc., SEC S-4, 12 Feb 2013, page 227].

Investing in Spread Networks was wasted money. Microwave links are faster and were already being used on that route. In fact, the first microwave link was built in 1949 for that route.

September 1949 Long Lines publication regarding New York to Chicago microwave link.

Later, poor old Getco had its traders’ frustrations aired in public with the disclosure of an internal complaint regarding their internal microwave network being higher latency than a third-party network available for use. I expect that was either the McKay Bros./Quincy Data or Tradeworx people. They do good work:

McKay Bros round trip microwave latency. Optical fibre is ~12ms on same path.

I use Getco as an example here not because it is incompetent, but rather because it is very good at what it does. Even Getco, now KCG, now Virtu, as good as it is, had missteps in low-latency path development.

Wired 2012. Not so secret, hey, Michael Lewis?

If you haven’t read Michael Lewis’s “Flash Boys,” and you really shouldn’t, you may have missed the low-latency narrative centered around Spread Networks’ fibre roll-out that stitched together the book. The literary device used to end the book was the hook of a tower hosting a microwave network. This ending was left as evil hanging in the air like a brick doesn’t. To me, such narrative abuse represents some very poor journalism. Such RF links, and vendors offering them, had been widely discussed, such as in Wired (2012) (see charts at right) and the Chicago Tribune (2012). They had weighed in on the microwave discussion, publishing vendors’ names and even prices. This is a snippet from the Chicago Tribune in 2012, years earlier than “Flash Boys”:

“He [Benti] said the microwave network starts at 350 E. Cermak, ends at another telecom hotel at 165 Halsey St. in Newark, N.J., and went live in the fourth quarter of 2009.”

It was hardly a big secret, and I found the presentation in "Flash Boys" somewhat scandalous. Barksdale and Clark, whom Lewis had written a book about, "The New New Thing," are investors in Spread Networks and IEX. They remain friends of Lewis. That looks material to the objectivity, or lack thereof, of "Flash Boys."

Latency matters. Latency can be expensive. Latency technology has risks. Let’s expose some latency matters that matter.

Trading at the speed of light

Let’s meander through a little physics and then some of the history of some communication links.

The speed of light is 299,792.458 km/s, near enough to 300,000 km/s, which is how I usually round it off. This is the speed of light in a vacuum, and it is the constant that nowadays defines the metre in terms of the second. Light’s speed is commonly referred to as ‘c.’ When you force light, which is just another form of electromagnetic radiation like RF, into a medium (a fibre for light, a wire for RF), it goes a bit slower. You’ll have to remember that electrical transmission is different to photonic transmission, but related.

The speed of light in a standard optical fibre, either single mode fibre or multi-mode fibre, is around two-thirds the speed of light in a vacuum. This is normally written as 0.66c.

The atmosphere isn’t a vacuum, thankfully for life on earth, but it doesn’t slow RF, including light, much at all. It’s close enough to c that we don’t bother with a discount and just say it’s 1c.

This is why microwaves make a big difference. If you could use point-to-point transmission for the roughly 1,200 kilometres from Chicago to New York, you’d get roughly 6 milliseconds for light in standard fibre and 4 milliseconds for direct RF transmission. Two million nanoseconds of difference is quite a big difference to a trader.
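
As a back-of-the-envelope check of those figures, here is a tiny C sketch; the 1,200 km distance and the velocity factors are the same round numbers as above, not surveyed path lengths, and the snippet itself is mine, just to make the arithmetic explicit:

/* Back-of-the-envelope one-way Chicago-New York latency. The distance and
   velocity factors are the round figures from the text, not measured paths. */
#include <stdio.h>

int main(void)
{
        const double c_km_per_ms = 299792.458 / 1000.0;   /* ~299.8 km/ms */
        const double path_km = 1200.0;
        const struct { const char *medium; double vf; } media[] = {
                { "RF through air (~1c)",    1.00 },
                { "standard fibre (~0.66c)", 0.66 },
        };

        for (int i = 0; i < 2; i++) {
                double ms = path_km / (c_km_per_ms * media[i].vf);
                printf("%-26s %.2f ms one way\n", media[i].medium, ms);
        }
        /* Prints roughly 4.0 ms vs 6.1 ms: about 2 ms, i.e. two million
           nanoseconds, between RF and standard fibre on that route. */
        return 0;
}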

Spread Networks dug very straight lines and managed to beat other fibre networks to achieve the lowest fibre latency known on that link. Spread’s latency was indeed around 6ms one way, as expected. It’s a shame they didn’t fully appreciate the benefits of microwave comms on that link before they started digging. Let’s stick to just the tech for now. Here is the start of a table:

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • Twisted pair ~0.67c
  • Standard fibre ~0.66c

That’s pretty rough, and I’ve taken a few liberties which I’ll explain later. A very important and interesting thing about electrical transmission in wire is that the construction of the wire, and, perhaps even more important, the insulation, matters. Not just a bit, but a lot.

LMR-1700 coaxial cable specs. Note: The Velocity of Propagation is 0.89c.

If the “wire” was a coaxial cable, then the RF would enjoy travelling along the outside of those wires’ surfaces and burn rubber to achieve up to 0.89c [LMR-1700 low loss coax – Foam PE and 0.87c with Commscope 875 coax also with Foam PE].

Remarkably, older coaxial undersea cables used from around the 1930s might have been faster than some modern fibre cables. Not many people understand that. Then again, if you chose Neoprene as the dielectric in your wire cable, as earlier cables did, you’d only chug along at around 0.45c. The dielectric performance of the wire limits the speed. A high dielectric constant in your wire is bad news for latency. In the nineteenth century, most cables used Gutta-percha compounds which had dielectric constants in the range of 2.4 to 3.4, with more than 4 when wet, likely resulting in speeds significantly less than 0.66c. In the 1930s the coaxial submarine cables around the world started using polyethylene, which has a typical dielectric constant of 2.26, giving a speed of around 0.66c. However, the construction matters a lot. A foam polyethylene has a dielectric constant of around 1.55, resulting in typical speeds of around 0.8c.
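
(For the curious, the rule of thumb behind those numbers is that a cable's velocity factor is roughly 1/√εr, where εr is the dielectric constant of the insulation: 1/√2.26 ≈ 0.67 for solid polyethylene and 1/√1.55 ≈ 0.80 for foam polyethylene, matching the figures above, while Gutta-percha's 2.4 to 3.4 works out to only about 0.54c to 0.65c.)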

Old tech can be fun. You can get more than 0.95c out of a simple open wire ladder line – 0.95c to 0.99c is the typical range for open ladder lines. You might remember an open wire ladder line if you cast your mind back to that really old-school two parallel wire TV antenna cable with rectangular cut outs in the polyethylene webbing every inch or so. Who’d have thunk it? Ancient technology faster than fibre! Details matter.

Printed circuit board (PCB) design is both a science and an art. Standard PCB layers use a medium called FR4, basically fibreglass, which is probably the most common PCB filler. PCB transmission with FR4 is positively glacial with 0.5c typical. Various other layers, such as Rogers, are used for high-speed channel and RF design, which has different properties again and is typically faster for latency, too.

Let’s look at a revised table:

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • Open wire ladder ~0.95c
  • Coaxial cable ~0.8c
  • Twisted pair ~0.67c
  • Standard fibre ~0.66c
  • PCB FR4 ~0.5c

Now you can probably imagine building a twisted pair cable that is a bit rounder, more like coax, and not so flat, less of a PCB, and that cable may be a little faster. So the revised cable might be faster than light over fibre. Again, details matter.

There are standards for CAT twisted pair cables. Those standards also specify minimum propagation speeds and variations within the cable. For example, here is the standard specification for Cat-6 cable:

The velocities are minimums, so don’t panic yet about the 0.585c to 0.621c. If I look at the specifications for a couple of real-world cables, I see Draka SuperCat 5/5E at 0.64c, and Prysmian M@XLAN cat 5E/6/6A cables claim 0.67c. These are specification claims, not guarantees. Siemon reports in its cable guide:

“NVP varies according to the dielectric materials used in the cable and is expressed as a percentage of the speed of light. For example, most category 5 polyethylene (FRPE) constructions have NVP ranges from 0.65c to 0.70c. ... Teflon (FEP) cable constructions range from 0.69c to 0.73c, whereas cables made of PVC are in the 0.60c to 0.64c range.”

Those electrons must find Teflon just as slippery as we do.

The other notable thing in the cable specifications is the maximum delay skew. That refers to the fact that the different wires in the cables may propagate faster or slower. In an old-school coaxial submarine cable, the core wires could be 5% shorter than the outer wires. That is a big deal. In a twisted pair CAT 6, the 45ns skew per 100m could be around 8% variation within the wire. This can matter a good deal, as you may only be as fast as your slowest-arriving bit.
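
As a rough check of that 8% figure: 100 m at 0.6c takes about 100 / (0.6 × 299,792,458 m/s) ≈ 556 ns, so a 45 ns skew is indeed roughly 8% of the propagation time over that run.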

Can you quite believe you are still reading about cables and propagation? This is the kind of detail a good trader may have to worry about.

My friends at Metamako measured some common fibre and wire cables using their latency measuring MetaApp. They found the copper cables they tested were indeed faster than the fibre ones by just a bit. I’ve reproduced Metamako’s chart below with permission:

Metamako cable comparison: ‘Copper is faster than fibre!’

This chart comes from the following data:

Here, copper is faster than fibre. The direct-attach copper cables come in at 4.60ns per metre and single mode fibre at 4.95ns per metre, with multi-mode fibre at 4.98 ns per metre. You always have to be careful looking at this, as it is not just about transmission but also about the latency cost of amplifying, cleaning, and propagating the signal. Notably, the fibre has a little more endpoint overhead as you can see a larger constant in the fibre equations’ fits.
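
Converting those slopes back into velocity factors: 4.60 ns per metre is about 0.217 m/ns, or roughly 0.73c for the copper, while 4.95 to 4.98 ns per metre works out to about 0.67c for the fibres, consistent with the ~0.66c figure for standard fibre in the table above.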

As you now know, cables can vary a great deal, so caveat emptor.

Another note about cables is that often longer copper interfaced cables in the data center aren’t really copper but active optical cables (AOC). Such cables transmute the electrical signal into optical and back to electrical (EOE) as part of the cable to improve range. The constants, especially from media changes, can matter in these equations. For example, with 10G Ethernet over CAT-6, you might use a nice Teflon cable and expect some fast propagation. You will be disappointed to learn the 10G twisted pair codec is really twisted and likely to cost you microseconds – yes, thousands of nanoseconds – before you even get onto the cable. Whilst the rise time of a 10G laser in an SFP+ may be less than 30 picoseconds, organizing the rise time from the electrical signal takes some gymnastics, even if quite a bit less than a 10G twisted pair. A fast cable, or plane, will not always help if your boarding procedures are slow.

There are some more obscure and exciting cables, such as few-mode fibre, that we’ll save for later.

A little comms history

Let’s segue and meander through a bit of history.

Did you know Great Britain’s Pound is commonly called “cable” in trading and financial circles?

This is because when that first cross-Atlantic telegraph cable briefly sprang into life in 1858, information sped up. An obvious and important use was for trading and financial information, hence the US Dollar to Great British Pound cross rate became colloquially named after its primary Atlantic transmission medium. The nickname underscores the importance of cable.

When did these trade latency wars start? Perhaps thousands of years ago, but certainly hundreds of years ago. There are records relating to coffee merchants in Africa and the Middle East suggesting a trader knowing quickly about production in Africa could make significant profits in the Middle East. Kipp Rogers pointed me to a letter from a silk merchant from around 1066 worrying about time being wasted waiting for such tradeable information:

“The price in Ramle of the Cyprus silk, which I carry with me, is 2 dinars per little pound. Please inform me of its price and advise me whether I should sell it here or carry it with me to you in Misr (Fustat), in case it is fetching a good price there. By God, answer me quickly, I have no other business here in Ramle except awaiting answers to my letters. … I need not stress the urgency of a reply concerning the price of silk.”

Latency trades are no “New New Thing.”

You’ve probably heard the stories of Rothschild’s consol trade, where news of Waterloo was transmitted to London, most likely by fast boats rather than by the pigeons he is famous for, earning a considerable profit. Reuter’s empire was started by being at the crossroads of information flow, to participate in and speed up information flows. Alexandre Laumonier pointed out to me the old semaphore and light signaling used by the French as an early optical network. It’s also fun to know there were various frauds, delays, and covert embeddings in early semaphore and telegraph networks with profit motives, even making it into the tale of "The Count of Monte Cristo." Chappe’s optical telegraph in France covered 556 stations with 4,800 kilometres of 1c transmission media, the air between the stations, from 1792.

Speaking of Chappe’s optical telegraph, you may find it intriguing that even in the 1830s stock market speculators were abusing communications for profit:

“On another topic, and like Internet outstripping the lawmakers, optical telegraph asked for new laws and regulations: a fraud related to the introduction, into regular messages, of information about the stock market, was discovered in 1836. The telegraph operators involved were using steganography (a specific pattern of errors added to the message) to transmit the stock market data to Bordeaux. In 1837, they were tried but acquitted, having not violated any explicit law. The criminals were one step ahead.”

Many people argue that High-Frequency Trading (HFT) is a new phenomenon, perhaps as little as a decade in age. Some argue it goes back to the 1980s. The wise Kipp Rogers also passed on a nice book reference to me which noted HFT, in the modern sense, from 1964’s Bankers Monthly, Volume 81, page 49:

“This is an important aspect of bank stocks and leads us to a clearer view of the market. To begin, let’s define a broad line of bank stock house as one range. There are few professionals who will insist that high-frequency trading occurs in more than 20 bank stock names.”

The use in the text quite clearly talks of it in a style that suggests common usage, so perhaps the term is decades older? HFT is not a “New New Thing.”

Indeed, there is not much new under the sun, even in clichés. The power of compound interest was argued in cuneiform some 5,000 years ago. The code of Hammurabi dealt with trade and liability, amongst many other things, more than 3,800 years ago. Your friendly Ancient Greek philosopher, Thales, was challenged to show how philosophy could be practical in a financial sense, so he made money in times BC by using options on presses to leverage olive forecasts and cornered the market. Ancient Rome used corporate structures.

Just as HFT is probably older than you and I think, history shows the importance of latency is also not a “New New Thing.”

Retransmission

One of the problems with the old semaphores and telegraphs is that humans were used as repeaters. The early telegraph couldn’t cross the US continent with its electrics and thus people rekeyed the messages. Semaphore networks are optical and transmit at the speed of light, but the on-boarding, off-boarding, and retransmission of messages relies on people, flags, and the like. Such retransmissions were not measured in picoseconds.

This is also an issue for modern microwave networks. Lower cost microwave or millimetre wave devices often have tens or hundreds of microseconds in their on-boarding and retransmission latencies. For much of the world, wasting a few microseconds is not so much of a big deal. The telecom carriers are usually more concerned with bandwidth as their optimization point.

The very best Chicago-to-New York links have single-digit microsecond differences between them through aggressive path and device optimizations, so retransmission and on-boarding latency is a big deal. One way of cutting down latency, or of making devices simpler, is to talk to the device with a signal it understands, cutting out any unnecessary conversions. This led to radio over fibre (RoF), where the RF signal is represented directly in the fibre to feed microwaves. A significant development, more relevant to the Chicago – New York and London – Frankfurt links, was clever repeaters that analyze the signal and only minimally process it if it is of sufficient quality, rather than requiring a full digital cleansing, or clock data recovery (CDR). Such repeating takes nanoseconds instead of microseconds. Most microwave traders now use such repeaters.

The first trans-Atlantic telegraph cable in 1858 was a stupendously expensive and brave undertaking that only briefly worked. Cables had been used across water before, with the English Channel being crossed in 1851, but nothing quite so ambitious as a whole ocean. The Atlantic cable briefly worked thanks to sensitive receivers rather than by an understanding of amplification or repeating. That came later. Brave souls, newer cables, amplification, and repeating drove improvements and commerce to an ever-increasing frenzy. Messages were very expensive to send, but the financial world became a virtually instant world to onlooking humans in the 1850s. The path for the rise of the machines was laid.

Satellites

Geostationary communication satellites came to enable anywhere-to-anywhere communication covering the entire planet. Telephone systems and faxes were hooked up. If you are old like me and have talked on an old telephone link, you’ll remember the satellites’ biggest problem – the nasty delays inherent to the lines. Hearing your own voice delayed, or just an awkwardly pausing conversation, would drive you nuts. Latency was the issue.

Geostationary orbit is a high orbit. A really, really high orbit. The circumference of the world is about 40,000 kilometres. Using C = 2πr you can work out that the distance from the center of the earth to the surface is about 6,400 kilometres. A geostationary orbit is 42,164 kilometres from the center of the earth. More than six times the Earth’s radius. Pause and visualize that for a second. A tiny, little satellite dot far from Earth. Several Earths away. That’s a long way. So sending a signal to a satellite and receiving it somewhere else is nearly the same as going around the Earth’s equator twice! That geostationary 72,000 km journey – there and back again, in Hobbit speak – at the speed of light is around 240 milliseconds. Add some processing overhead and you get a very annoying delay. Geostationary satellites suck for latency. Don’t use satellites for trading. That is why we have a bunch of spaghetti surrounding the earth in the guise of under-sea cables. Latency begone.

Current submarine cabling.

Microwave

Microwave has been around longer than many people think. A microwave link was put over the English Channel in 1931. Today’s HFTs are fighting over space at Richborough to make straighter lines with taller towers and fewer repeater hops over similar ground. The first Chicago-to-New York link was created with 34 hops in September 1949, just beating Spread Networks’ straighter fibre by some 60 years and two milliseconds.

President Truman made a USA coast-to-coast microwave TV transmission in September 1951 after it was opened for telephone use in August.

Microwave is not a new thing, and don’t let Michael Lewis lead you astray into thinking otherwise.

In my homeland of Tasmania, some amateurs set a record using standard astronomy telescopes on a couple of mountains with more than 100 miles between them to modulate a voice call.

This example shows that whilst light transmission, including lasers, typically has less of a range than microwave, it doesn’t always have to be that way if you want to get creative. Microwave bandwidth and distances are continually improving in leaps and bounds.

RF, such as microwave, does not have to be hideously expensive. My Toronto-to-NY regular old telco cable link for the interlisted arb was, from memory, about $15k rent per month. With microwave, you could buy a couple of end points for your link and then you don’t have to pay the recurring costs for cables. However, the towers and real estate become an expensive proposition, which grows as the links get longer and the repeater count goes up. Lots of HFTs are fighting over similar paths and towers and there is a certain element of land and license grabbing that takes place. Alexandre Laumonier, via his blog, SniperInMahwah, has been documenting such links in Europe and the recent battle to get a couple of large towers approved in Richborough in the UK for the channel crossing. It’s an expensive game when you want to build large towers.

The actual microwave RF bit is expensive, but it’s not outrageous. The real-estate access and towers can be very expensive. All the HFTs have been knocking each other around with one-upmanship to gain an edge. Public spectrum and council records, such as those used by Sniper, make it difficult to use shell games to hide your capabilities. In that spirit, a recently announced consortium of sorts is joining forces to create a “Go West” project to share the burden, as they know they’ll just compete each other out to create very similar links. That makes a good deal of sense, as the technical cost is nearly out of control with such networks due to the land and towers. IMC has invested in McKay Bros. to facilitate improvements to their networks for the benefit of all traders. Tower Research has joined them. KCG and Jump Trading also work together as New Line Networks. These joint venture approaches are sensible cry outs to the gods of cost control. HFTs are realizing that being fastest is not so good if you can’t afford a trade’s transaction cost, including your depreciation.

Radio is light is radio

Marconi was obsessed with crossing the Atlantic with radio. He succeeded in a bit of a scary way. He basically built a huge amplifier that generated enough of a current, or spark, to bludgeon his way across the Atlantic with brute force.

Marconi using a kite to lift his antenna to 150m for the first Atlantic transmission in 1901.

Click. Kaboom.

What frequency was it?

All of them!

Well, pretty much. Perhaps around 850kHz. That spark gap transmitter was quite quickly replaced with more nuanced hardware so that different people could use different frequencies and thus the planet was not just restricted to one giant broadcaster. We then found that certain frequencies, 2-70MHz, the high frequency or HF band, would bounce around the world thanks to the Ionosphere acting as a bit of a trampoline, sometimes, to those frequencies. Shortwave radio is not so popular anymore but still active, even for number stations.

Microwave and millimetre radio is a bit of a misnomer. Microwave, in the normal literature, actually covers frequencies from 300MHz to 300GHz, or wavelengths from 1 metre down to 1 millimetre. Millimetre bands are part of the microwave spectrum and do indeed have wavelengths of millimetres. I find it a little interesting that microwaves don’t have micrometre wavelengths; micrometre wavelengths are part of the infrared. Typical microwave networks are in the 2GHz to 7GHz range. 60GHz is a popular, usually free, millimetre wave band. It is free as the atmosphere kicks it around and limits its usefulness. Many countries have lightweight regulations for 80GHz links so you can use them more easily. There is much for the trader to choose from.

Light is radio. Radio is light. The wavelength for your standard data center fibre light over MMF is 850nm, or ~350 THz. In the data center, 1310 nm is typically used over single-mode fibre. 1550 nm is often used for longer-distance links thanks to its kind transmission properties in long strands of fibre. Note that visible light is usually considered to range from violet at around 380nm/789THz to red at around 750nm/400THz. The common data center light borders those visible frequencies. When we put different colors of light, or wavelengths, onto a single fibre, we call it Wavelength Division Multiplexing (WDM), which is a complicated way of saying a pretty rainbow.

We have standards for the colors, sometimes called channels, so we can talk to each other thanks to the International Telecommunication Union. We mix those colors up and separate them out after to make better use of the holes we dig in the ground or sea. If the colors are close together we call it Dense WDM (DWDM), and when the colors have a bit more space and there are fewer of them, it is called Coarse WDM (CWDM). Fancy names for pretty simple stuff. International trading is powered by rainbows, literally.

Often the photo sensor receivers are wide ranging enough to allow just about any frequency to trigger them, which can make your network design a little easier. We can often interchange short run MMF and SMF cables without noticing too much, as they mainly make a difference in the large runs or over n-way splits. SMF cables used to be expensive cables but they aren’t too different in cost to MMF today in the volume a trader may buy them.

Erbium-doped fibre takes a little light injection and reinvigorates the existing light signal as it travels, which is pretty clever. There is no real latency cost here if you consider the erbium-doped length as part of the cable. You need a bit of distance in the doped fibre, so this is a slow way of doing amplification if you only have short cables, such as in a trading co-lo facility. For short distances, you’ll be better off doing OEO, which is not so different to the EOE we talked about with AOC cables.

Faster fibre

There has been some interesting work on making much faster fibre cables. The idea had its seed in thinking about point-to-point laser systems, often called free space optics, that have been used for links, including for HFT in New Jersey. Imagine you do your lasering underground. Carefully add some mirrors to bounce the lasers around. That is not too far from the concept of a Hollow Core Fibre (HCF) or Few-Mode Fibre (FMF). HCF speeds are around 0.997c. Pretty good, no? So why aren’t they everywhere?

Cost has been an issue. I looked recently and cables were about $500 a metre. Yikes! Perhaps HCF cables are cheaper now or in bulk. HCF repeating is an issue as the signal dissipates pretty quickly. That is, the mirror bouncy wouncy timey wimey thing, to paraphrase The Doctor, is not so super-efficient. Attenuation kills. We are used to having big distances between our fibre repeaters in modern times. Strangely enough, though, the HCF repeater requirements are not too different to the old coaxial-cable requirements in terms of spacing. Hmmm, perhaps expensive long distance HCF is really possible for a trader if we go back to the future? A demonstration a couple of years ago changed this thinking when a greater than 1000km repeating FMF cable was demonstrated in Europe. Perhaps we’ll see Spread Networks replace its Chicago-to-New York link’s SMF with HCF?

Medium / Speed of Transmission

  • Vacuum 1c
  • Atmosphere ~1c
  • RF (including laser) ~1c
  • Hollow-core fibre ~0.997c
  • Open wire ladder ~0.95-0.99c
  • Coaxial cable ~0.45-0.89c
  • Twisted pair ~0.58-0.75c
  • Standard fibre ~0.66c
  • PCB FR4 ~0.5c

HFT radio tuning

A few years ago now, the small HFT I founded bought a couple of the original Ettus Research FPGA GNU Radio boxes to play with. We got a little RF signal to go a short distance in the room. Well, they were sitting on the same workbench. Digital FPGA pin in to digital FPGA pin out was 880ns on the oscilloscope. That’s pretty fast. The experiment was to see what kind of overhead the RF stack, including the IF, encoding, MAC, etc., was causing. This experiment showed that with such modern software-defined radio (SDR), this kind of RF comms hackery has become wide open to all types and sizes of trading firms.

Why doesn’t an HFT just use a HAM radio to send a signal across the Atlantic to compete? Well, maybe they are. If HFTs are, it definitely requires some custom thinking, as commodity appliances for this do not exist with the right characteristics. The MIL-spec stuff, say for non-satellite warship communication, may use HF radio but the packets can take seconds to get through. Ouch. That is slow. Why is it so slow?

Email on warships with HF is slow because the MIL-spec packets are heavily encoded with error correction and are spread out over time to handle disturbances. Now Marconi didn’t do this. His brute force grunt was sent at 1c over the horizon with little processing overhead. Click. Kaboom. It may be possible to spatially encode a signal instead of doing the redundancy over time to instantly deliver small messages from continent to continent that are leading-edge triggered. HF Multiple-In-Multiple-Out (MIMO) may also be a thing for those purposes. Just as you have little groups of MIMO antennae with centimetre, or so, spacing on the back of your wi-fi router, HF MIMO can have groups of antennae doing their thing. However, even though the research I’ve seen looks promising, their encoding was still slower than a terrestrial equivalent. One experiment was going from Europe to the Canary Islands, but even though the net result for the experiment is encouraging, it was still slower than cable speed due to those pesky encoding and hardware overheads. Such speed was not the point in that particular case, though. Just getting HF MIMO working is quite a feat. There is much potential to explore this area even if HF MIMO has somewhat huge spacing for the antennae. Awkwardly, spacing for HF MIMO antennas is not measured in centimetres but hundreds or thousands of metres. The antennas don’t fit in my workshop but they may fit in your trading farm down in Cornwall.

Another RF alternative is line of sight with balloons (Google’s Loon) or planes (Facebook and Google). This is not new. Balloons were used for their height advantage as far back as the American Civil War, and more recently a US company used cheapish balloons to carry small transponders that helped track trucks and other freight in the US, mainly in the South. To keep costs under control, if you found a fallen balloon, read the plaque, and sent it back to the company, you got a reward. That way they kept recycling RF stations. That same company was also awarded a DoD contract for enlarging the RF footprint in Iraq via balloons. Google’s RF balloon trials in New Zealand have been working well. RF balloon comms are no longer a “New, New Thing.”

Before I knew all of this, in the dark annals of history, I was interested in looking at the Toronto, Chicago, NJ triangle to see what height might be practical for direct line of sight. The Toronto to New York distance is about 800km. For that distance, platforms at each end would have to sit at around 12,500 metres to see each other. Not so different if you just put a balloon in the middle. YouTube tells me this is clearly possible and not so high if you consider all of the high school kids sending weather balloons 100,000 feet up to get pretty pictures and videos of the curvature of the Earth. If The Register can send their Lego man up in a paper airplane to such heights, surely an HFT can do something cute too?
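
As a sanity check on that 12,500 metre figure, here is a rough Go sketch using the simple spherical-Earth horizon formula d ≈ √(2Rh), ignoring atmospheric refraction; the 800km path length is the assumption from above.

package main

import (
    "fmt"
    "math"
)

func main() {
    const earthRadiusM = 6371000.0
    const pathM = 800000.0 // assumed Toronto to New York distance

    // Horizon distance from height h (no refraction): d ≈ sqrt(2*R*h).
    // Two equal platforms must each cover half the path, so h = (path/2)^2 / (2R).
    half := pathM / 2
    h := (half * half) / (2 * earthRadiusM)
    fmt.Printf("Required platform height at each end: ~%.0f m\n", h)

    // Cross-check: how far can a platform at 12,500 m see?
    d := math.Sqrt(2 * earthRadiusM * 12500)
    fmt.Printf("Horizon from 12,500 m: ~%.0f km\n", d/1000)
}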

The Register’s Paper Aircraft Released Into Space (PARIS).

It is also worth considering HF radio bouncing around the ionosphere. A relatively small transmitter can cover the entire planet. The ionosphere’s bounce height varies widely. For simplicity, let’s assume it is at 60km and plug it into a bounce-path equation for a long link: the total distance variation is surprisingly small. That is, an HF signal bouncing its way from NY to Tokyo doesn’t add that much to the total distance, due to the shallow angles involved. If you can find a way to encode the signal sufficiently well, your trading latency could be onto a winner.
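
Here is a crude flat-chord sketch in Go of how little those bounces add. The ~10,800km NY-to-Tokyo ground distance, the 60km reflection height, and the four-hop split are all assumptions for illustration.

package main

import (
    "fmt"
    "math"
)

func main() {
    const groundKm = 10800.0 // assumed NY-Tokyo great-circle distance
    const bounceKm = 60.0    // reflection height assumed above
    const hops = 4.0         // assumed number of ionospheric hops
    const c = 299792.458     // km/s

    hopGround := groundKm / hops
    // Treat each hop as two straight chords up to the reflection point.
    hopPath := 2 * math.Sqrt(math.Pow(hopGround/2, 2)+bounceKm*bounceKm)
    extraKm := hopPath*hops - groundKm
    extraMs := extraKm / c * 1000

    fmt.Printf("Extra path: ~%.1f km (~%.3f ms, ~%.2f%% longer)\n",
        extraKm, extraMs, extraKm/groundKm*100)
}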

For some years, there has been talk of using balloons across the Atlantic for trading. Hobbyists with model airplanes have flown the distance. Maybe you could use a continuous line of UAVs to act as relays? Now that would be a fun project. An HCF undersea cable seems more practical.

LEO Satellites

There was an article in the Wall Street Journal about LeoSat recruiting HFTs for low-latency links. The WSJ reported one HFT taker. We saw previously that height is an issue for geostationary satellites, as the latency is a killer. So how low is low for LEO? Is the height a latency killer? The company is planning on laser-based comms between the satellites, but you still have to get up there. Low cannot be too low for satellites, as otherwise there is a bit of pull and drag that sucks them into the atmosphere to a fiery death. LEO orbits start at 160km, might be 300km, but really need to be 600km, or higher, to last a while. That is, it is usually better to have a bit more height and live longer. O3b medium orbit satellites sit at 8,000km. Iridium LEO satellites are at 780km.

The WSJ article reported LeoSat could do Tokyo to NY in less than 130ms, which LeoSat claimed was twice as fast as existing 260ms links. This claim rings a little hollow, as publicly known Chicago-Tokyo links are already similar in speed to the figure quoted by LeoSat. Hibernia offers a link from JPX in Tokyo to the CME at Aurora, Chicago, at 121.190ms, and we know Chicago to NY is just under 4ms with current offerings such as McKay Bros. 130ms from LeoSat is already not competitive. The article quoted the company as saying satellite-to-ground latency was 20ms. It’s not clear if that is one way or a round trip, but either way it is well above the roughly 2ms light-speed equivalent of 600km. It’s not fast. Low orbit, not so low latency in this case, yet.
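
For context on those quoted figures, here is a back-of-envelope Go sketch of the physical floor for a Tokyo-to-New York round trip. The ~10,850km great-circle distance and the single up/down leg per direction are assumptions for illustration, not LeoSat’s actual design.

package main

import "fmt"

func main() {
    const c = 299792.458          // km/s
    const greatCircleKm = 10850.0 // assumed Tokyo to New York distance
    const leoAltitudeKm = 600.0

    vacuumRTTms := 2 * greatCircleKm / c * 1000
    // Assume one up-leg and one down-leg per direction, with inter-satellite
    // hops staying at roughly the same altitude.
    leoRTTms := 2 * (greatCircleKm + 2*leoAltitudeKm) / c * 1000

    fmt.Printf("Vacuum round trip along the surface: ~%.0f ms\n", vacuumRTTms)
    fmt.Printf("LEO floor with 600 km up/down legs:  ~%.0f ms\n", leoRTTms)
    // Compare with the ~121 ms Tokyo-Chicago plus ~4 ms Chicago-NY terrestrial
    // figures, and LeoSat's quoted <130 ms.
}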

Neutrinos and long waves

I hope this has provided some colour on the thoughts a trader may have about trading links.

I’ll leave you with one further thought: neutrinos. Hold up your hand. You have hundreds of billions of neutrinos travelling through it each second. Around 65 billion solar neutrinos pass through each square centimetre on Earth every second. Trillions are passing through your entire body. Near the South Pole there is a cubic kilometre of clearish ice with special sensors hot-drilled in. They lie in wait for neutrinos travelling through the Earth from the North Pole. Those neutrinos occasionally – very, very rarely – bump into something and provide a little blue flash.

IceCube: South Pole Neutrino Observatory

A trader might think: Why go around the Earth or its crust when you can go through it? Nice.

Remember the fuss that started in September 2011 regarding neutrinos travelling faster than light as part of the European OPERA experiment? It was thrown out to the community to solve the puzzle. Eventually it was traced to a measurement error. To me the interesting part was that someone was firing neutrinos from Geneva, Switzerland, to Gran Sasso in Italy, through the planet, and detecting them! Neutrino communication is already a thing. You need to send an awful lot of neutrinos to get a lucky hit, so message lengths would be short and delivery times probabilistically long; but you gotta start somewhere. Don’t let the detector’s required 300,000 bricks weighing 8.3kg per brick daunt you. What’s 300–400 GeV between friends? Who wants to build and improve on a few tonnes of neutrino detection for HFT?

Submarines can use long waves for water-penetrating RF comms. Slow packets with big waves. There are patent papers for turning a whole submarine into a neutrino detector for comms or navigation as an alternative to long waves. Would it work? It seems very unlikely but, tantalizingly, not completely crazy. A few hundred tonnes would not be a problem for a submariner. Such answers are beyond my pay grade, but an HFT has gotta ask.

What about ground-penetrating long waves? Long waves are slow, as they are very long in metres; you have to wait a long time for your bits. Then again, I do remember when sub-wavelength imaging was thought to be “proven” impossible. Super-resolution imaging came along despite rigorous math suggesting ye olde wavelength-limiting thingamabobs prevented us diving deeper. We can now see molecules and atoms inside cells by thinking a little outside that wavelength-limiting box. That is, in a short while, the impossible became possible. The neural-network community has weathered two large winters of more than a decade each to survive as a bright deep learning star doing the seemingly impossible, despite what Minsky and Papert had you believe in the 1960s. Scientific winters sometimes pass. So you never know. Perhaps long waves that hug the Earth’s curves and penetrate water and soil can do something sub-wavelength, signalling that the British Pound is a buy with one or two bits of secret signal, and a new “cable” for cable may be born? Maybe some kind of neutrino or neutrino-like particle can be practically enabled. There is a good movie in there somewhere for Matt Damon to follow up on.

Final word

I’m not holding my breath for the “Go Through” consortium, or cartel, to replace today’s “Go West” venture. That said, I’d be surprised but not shocked. Once Musk gets his Mars trading outpost functioning, I hope we don’t repeat the mistakes of the past and build too much duplicate infrastructure for trading Martian Renminbi against the Earth’s Rupiah.

Back in our real world: UAV-based comms, hollow core fibres, and HF-based HFT low-latency signaling may be happening whether you like it or not. Learn from Getco and don’t buy into the “New New Thing” that is just another Spread Networks. Be aware and beware.

This article originally was published on Matt Hurd’s blog, Meanderful.


Motel Living and Slowly Dying

BY TRADE AND BY SELF-IDENTITY, I am a novelist. But to keep the groceries coming, I am also an oil pipeline worker. They call me a “pig tracker,” which means I monitor the location of cleaning and diagnostic tools traveling through pipelines, and when I’m not in the field, I’m in a hotel somewhere along the line, sleeping my way toward my next shift.

The particular rhythms of what I do — track the pig in its journey beneath the prairies, hand off the job to my counterpart on the other shift, find a hotel near where I’ll rejoin the line, sleep, lather, rinse, repeat — have made me something of an unintentional expert on hotel living and on the America nobody dreams about seeing on vacation.

I travel by secondary and tertiary roads, skulking around the pipeline on 12-hour shifts, either midnight to noon or noon to midnight. I work alone, mostly. And when the shift is done, I catch my rest in places like Harrisonville, Missouri, and Iola, Kansas. Lapeer, Michigan, and Amherst, New York. Toledo, Ohio, and Thief River Falls, Minnesota. I’ve learned that Super 8s are not always super, and Comfort Inns sometimes afflict the comfortable. I rack up IHG points and Wyndham Rewards and Choice Privileges. I may never have to pay for a personal car rental again, so fulsome are my Enterprise points.

Sometimes I lose track of what day it is, or when, exactly, I’m going home again. But the places where I set my head stand ready to reorient me with a comfortable sameness. There’s the antiseptic smell of a well-cleaned lobby, the paper coffee cups in my room wrapped in plastic, the rattle and hum of the air-conditioning unit. When I arrive in the wee hours, the lonely night auditor is often all too happy to talk. When I leave at 11 p.m. for my next night shift, I often have to talk the clerk through my reasons for arriving and departing in the same 12-hour period. Twenty-four hours a day, there’s coffee of varying age and quality.

I’ve come to value the simple things: a clean room, reliable hot water, and a staff that respects a do-not-disturb sign. And learned the sublime wisdom of a song called “My Favourite Chords” by a Canadian band called the Weakerthans. 

I want to fall asleep / to the beat of you breathing / in a room near a truck stop / on a highway somewhere …

The words are a concise demonstration of language’s power to inspire cinema in our heads. They also form a picture of my life. Because the truth is, I’ve been living in motels since I was a child.

¤

My father was an exploratory well digger, and I traveled with him every summer through the American West, far off the interstates, in a nomadic way that was worlds different from my life at home with my mother in Fort Worth, Texas.

In the summer of 1981, when I was 11 years old, I lived with him and my then-stepmother in a bottom-floor room at the Park Plaza Motel in Sidney, Montana. Dad was working near Watford City, North Dakota, about 50 miles east, but he used Montana and its lack of a sales tax as a home base. I’d ride out to the fields with him during the day, peeling around on my motorcycle, then return with him and his crew to Sidney in the evening, reuniting with my stepmother. The three of us would have dinner, watch some TV — it was the summer of Fernando Valenzuela’s miracle stint with the Los Angeles Dodgers — and then start the cycle again.

I catalog my boyhood summers by where dad was working at the time — Elko, Nevada (1976); Baggs, Wyoming (1977); Montpelier, Idaho (1978); and so on. But that particular summer in Sidney stands out from the others in all sorts of ways. For one thing, the festering anger between dad and his wife, making their second failed attempt at marriage, was too near and threatening. I tried to escape by hanging out with dad’s helpers when I could. Once those guys helped me score a bag of Beechnut, which I chomped happily until I got desperately sick. They teased me, almost to the point of cruelty but not so badly that they’d provoke dad. They introduced me to Aerosmith. So, yeah, it had its upsides.

I returned to Montana at age 36 for the last leg of my career as a journalist. My first wife was a woman who grew up just north of Sidney, and thus the next several years were spent driving into and out of that town, passing the Park Plaza coming and going, watching it age right along with me. And now in this era of the Bakken Shale, my frequent trips to pipeline jobs in North Dakota — Williston, Minot, points beyond and between — also carry me through Sidney. Each one sends the summer of 1981 a little deeper into the memories and brings the place I know now a little more to the fore.

It’s still oil-soaked. Still a little rough-edged. Still pleasant in its own way. Probably still a place where a little guy can swallow too much tobacco juice if he’s not careful.

Some people have Paris. I have Sidney. I’m okay with that.

¤

My life in motels doesn’t bear much resemblance to what I’ve read or seen; it’s too ordinary, too predictable. I’m not Humbert Humbert, dragging his Lolita through the West, a step ahead of Clare Quilty. I’ve never met someone like Juan Chicoy, the impromptu innkeeper from John Steinbeck’s The Wayward Bus, or the precocious little girl Moonee and her crazy mom from the recent movie about permanent motel existence, The Florida Project. My stays are straight credit-card transactions, reimbursed by my employer, and generally last just a few hours before I move along again. My intimacy with these places runs no deeper than: “Welcome back, Mr. Lancaster.”

There is, of course, a darker, seemingly hopeless side to these homes away from home. In left-behind precincts of cities and towns — indeed, on the main drag that connects my comfortable suburban neighborhood in Billings, Montana, with downtown — you can find bedraggled motels where single-room occupancy often means a family of five sharing a sink, a shower, and maybe a kitchenette. These are the working destitute, or the pensioned-off. These folks are able to scrape together several hundred dollars for rent, but not the first and last months’ rent, the water deposit, and the credit score required for a less expensive apartment, let alone the three percent down on an FHA mortgage.

We tend to think of homelessness in terms of cardboard boxes on street grates and cars that double as living spaces, but that’s a small aperture of the overall problem. These past-their-glory motels house people who work hard — and who face crushing odds of ever getting ahead of their circumstances. And as we run the average rents in places like Seattle and San Francisco to the stratosphere, without an attendant increase in affordable housing, we’re falling deeper into crisis.

In Billings, where I live when I’m not in a motel, we have 110,000 people, and 621 of them are homeless kids in the public school system. That’s the total from the most recent full school year. Of those 600-plus, 104 live the peculiar form of it at motels with names like the Lazy K-T. By any measure, it’s a shameful number. Teachers and administrators at the schools write grant proposals for supplemental breakfast programs. My friends who oversee classrooms have mastered the subtle art of pulling a kid aside and, without shaming him, learning whether he has a winter coat. When the answer is no, they find a way to get him one.

Elizabeth Lloyd Fladung, who has photographed American families on the margins for the past two decades, told The Nation in 2015: “The sight of these iconic structures now serving as home to scores of destitute people who don’t seem to have any chance at the American Dream really shows just how little infrastructure there is to help poor people in need, and how much damage decades of wage stagnation has done.”

It’s the what-might-have-been scenario for my father, the formerly nouveau riche exploratory driller whose fortunes crashed in 1983 along with that wave of the oil economy. He’s now on Social Security and a small VA disability, mostly blind, although stubbornly semi-independent. A decade ago, I persuaded him to move closer to me, found him a one-bedroom condo he could afford, and have helped him when I’m able and when he’s needed it. I drive him to doctor’s appointments and weekly grocery store visits. When I’m at home, I spend an hour or so with him a day, watching TV, playing backgammon, listening in the rare instances when he wants to talk.

A good deal of my mental energy is expended on making sure I see him off this mortal coil with love, and on hoping nothing happens to me before he’s gone.

¤

I’m writing this from a Microtel in Williston, North Dakota, on a pleasant mid-October day when the air carries a hint of what’s coming to the northern corridor. I’m here for a pipeline run, of course, and I’m feeling the uncertainty of what’s ahead. I might be here three days and then drive back home, five hours away, before returning next week. But there’s talk of stringing together the three jobs ahead of us into one unchecked block of work. If that’s the case, I’ll be here 10 days, maybe 12. Fourteen hours ago, I kissed my wife back in Billings and said, “I’ll be back when I’m back.” The gaping unknown is something we’ve learned to manage.

However long I’m here, the Microtel is a humane enough facsimile of home. The room rate includes a hot breakfast, fresh cookies await on the front desk in the afternoon, the towels are plentiful and clean, and the bed is comfortable. I have a generous per diem. If I take care of my responsibilities out in the field and drive prudently three or 12 days from now, I’ll make it back safely to my wife, my cat and dogs, and my father. And that’s the whole point. The Microtel is where I live when I’m here. But I’m going to leave it behind, shed it like a skin, and I don’t ever want the impermanence to be permanent.

¤

Craig Lancaster is the author of seven novels, including the recently published Julep Street.


IPvlan overlay-free Kubernetes Networking in AWS

Lyft is pleased to announce the initial open source release of our IPvlan-based CNI networking stack for running Kubernetes at scale in AWS.

cni-ipvlan-vpc-k8s provides a set of CNI and IPAM plugins implementing a simple, fast, and low latency networking stack for running Kubernetes within Virtual Private Clouds (VPCs) on AWS.

Background

Today Lyft runs in AWS with Envoy as our service mesh but without using containers in production. We use a home-grown, somewhat bespoke stack to deploy our microservice architecture onto service-assigned EC2 instances with auto-scaling groups that dynamically scale instances based on load.

While this architecture has served us well for a number of years, there are significant benefits to moving toward a reliable and scalable open source container orchestration system. Given our previous work with Google and IBM to bring Envoy to Kubernetes, it should be no surprise that we’re rapidly moving Lyft’s base infrastructure substrate to Kubernetes.

We’re handling this change as a two phase migration — initially deploying Kubernetes clusters for native Kubernetes applications such as TensorFlow and Apache Flink, followed by a migration of Lyft-native microservices where Envoy is used to unify a mesh that spans both the legacy infrastructure as well as Lyft services running on Kubernetes. It’s critical that both Kubernetes-native services as well as Lyft-native services be able to communicate and share data as first class citizens. Networking these environments together must be low latency, high throughput, and easy to debug if issues arise.

Kubernetes networking in AWS: historically a study in tradeoffs

Deploying Kubernetes at scale on AWS is not a simple or straightforward task. While much work in the community has been done to easily and quickly spin up small clusters in AWS, until recently, there hasn’t been an immediate and obvious path to mapping Kubernetes networking requirements onto AWS VPC network primitives.

The simplest path meeting Kubernetes’ network requirement is to assign a /24 subnet to every node, providing an excess of the 110 Pod IPs needed to reach the default maximum of schedulable Pods per node. As nodes join and leave the cluster, a central VPC route table is updated. Unfortunately, AWS’s VPC product has a default maximum of 50 non-propagated routes per route table, which can be increased up to a hard limit of 100 routes at the cost of potentially reducing network performance. This means you’re effectively limited to 50 Kubernetes nodes per VPC using this method.
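
A tiny Go sketch of the ceiling this implies, using the 110 Pods-per-node default and the 50/100 route limits from the paragraph above; the multiplication is the only thing added here.

package main

import "fmt"

func main() {
    const podsPerNode = 110 // Kubernetes default maximum of schedulable Pods per node

    for _, routeLimit := range []int{50, 100} {
        // One /24 route per node means the route limit caps the node count.
        fmt.Printf("%d routes -> %d nodes -> ~%d schedulable Pods per VPC\n",
            routeLimit, routeLimit, routeLimit*podsPerNode)
    }
}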

While considering clusters larger than 50 nodes in AWS, you’ll quickly find recommendations to use more exotic networking techniques such as overlay networks (IP in IP) and BGP for dynamic routing. All of these approaches add massive complexity to your Kubernetes deployment, effectively requiring you to administer and debug a custom software defined network stack running on top of Amazon’s native VPC software defined network stack. Why would you run an SDN on top of an SDN?

Simpler solutions

After staring at the AWS VPC documentation, the CNI spec, Kubernetes networking requirement documents, kube-proxy iptables magic, along with all the various Linux network driver and namespace options, it’s possible to create simple and straightforward CNI plugins which drive native AWS network constructs to provide a compliant Kubernetes networking stack.

Lincoln Stoll’s k8s-vpcnet and, more recently, Amazon’s amazon-vpc-cni-k8s CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs to achieve overlay-free, AWS VPC-native solutions for Kubernetes networking. While both of these solutions achieve the same base goal of drastically simplifying the network complexity of deploying Kubernetes at scale on AWS, they do not focus on minimizing network latency and kernel overhead as part of implementing a compliant networking stack.

A simple and low-latency solution

We developed our solution using IPvlan, bypassing the cost of forwarding packets through the default namespace to connect host ENI adapters to their Pod virtual adapters. We directly tie host ENI adapters to Pods.

Network flow to/from VPC over IPvlan

In IPVLAN — The Beginning, Mahesh Bandewar and Eric Dumazet discuss needing an alternative to forwarding as a motivation for writing IPvlan:

Though this solution [forwarding packets from and to the default namespace] works on a functional basis, the performance / packet rate expected from this setup is much lesser since every packet that is going in or out is processed 2+ times on the network stack (2x Ingress + Egress or 2x Egress + Ingress). This is a huge cost to pay for.

We also wanted the system to be host-local with minimal moving components and state; our network stack contains no network services or daemons. As AWS instances boot, CNI plugins communicate with AWS networking APIs to provision network resources for Pods.

Lyft’s network architecture for Kubernetes, a low level overview

The primary EC2 boot ENI with its primary private IP is used as the IP address for the node. Our CNI plugins manage additional ENIs and private IPs on those ENIs to assign IP addresses to Pods.

ENI assignment

Each Pod contains two network interfaces, a primary IPvlan interface and an unnumbered point-to-point virtual ethernet interface. These interfaces are created via a chained CNI execution.

CNI chained execution
  • IPvlan interface: The IPvlan interface with the Pod’s IP is used for all VPC traffic and provides minimal overhead for network packet processing within the Linux kernel. The master device is the ENI of the associated Pod IP. IPvlan is used in L2 mode with isolation provided from all other ENIs, including the boot ENI handling traffic for the Kubernetes control plane.
  • Unnumbered point-to-point interface: A pair of virtual ethernet interfaces (veth) without IP addresses is used to interconnect the Pod’s network namespace to the default network namespace. The interface is used as the default route (non-VPC traffic) from the Pod, and additional routes are created on each side to direct traffic between the node IP and the Pod IP over the link. For traffic sent over the interface, the Linux kernel borrows the IP address from the IPvlan interface for the Pod side and the boot ENI interface for the Kubelet side. Kubernetes Pods and nodes communicate using the same well-known addresses regardless of which interface (IPvlan or veth) is used for communication. This particular trick of “IP unnumbered configuration” is documented in RFC5309.
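
To make the unnumbered point-to-point trick concrete, the following is a rough, hypothetical Go sketch of the equivalent host-side plumbing using the vishvananda/netlink library. The interface names and IP addresses are invented, and this is a simplified illustration rather than the actual cni-ipvlan-vpc-k8s implementation (which also moves the peer interface into the Pod’s network namespace and adds the mirror-image route there).

package main

import (
    "log"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    podIP := net.ParseIP("10.0.42.15") // assumed Pod IP (lives on the IPvlan interface)
    nodeIP := net.ParseIP("10.0.0.10") // assumed node IP (boot ENI primary private IP)

    // Create an unnumbered veth pair; a real plugin would move the peer end
    // into the Pod's network namespace.
    veth := &netlink.Veth{
        LinkAttrs: netlink.LinkAttrs{Name: "vethhost0"},
        PeerName:  "vethpod0",
    }
    if err := netlink.LinkAdd(veth); err != nil {
        log.Fatalf("create veth: %v", err)
    }
    host, err := netlink.LinkByName("vethhost0")
    if err != nil {
        log.Fatal(err)
    }
    if err := netlink.LinkSetUp(host); err != nil {
        log.Fatal(err)
    }

    // Host side: route the Pod's /32 over the veth, borrowing the node IP as
    // the source address so the link itself stays unnumbered (RFC5309 style).
    podDst := &net.IPNet{IP: podIP, Mask: net.CIDRMask(32, 32)}
    route := &netlink.Route{
        LinkIndex: host.Attrs().Index,
        Dst:       podDst,
        Src:       nodeIP,
        Scope:     netlink.SCOPE_LINK,
    }
    if err := netlink.RouteAdd(route); err != nil {
        log.Fatalf("add /32 route: %v", err)
    }
    // Inside the Pod namespace, a mirror-image /32 route to the node IP would
    // be added over vethpod0 (omitted here for brevity).
}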

Internet egress

For applications where Pods need to directly communicate with the Internet, our stack can source NAT traffic from the Pod over the primary private IP of the boot ENI by setting the default route to the unnumbered point-to-point interface; this, in turn, enables making use of Amazon’s Public IPv4 addressing attribute feature. When enabled, Pods can egress to the Internet without needing to manage Elastic IPs or NAT Gateways.

Internet egress w/ SNAT over boot ENI

Host namespace interconnect

Kubelets and Daemon Sets have high-bandwidth, host-local access to all Pods running on the instance — traffic doesn’t transit ENI devices. Source and destination IPs are the well-known Kubernetes addresses on either side of the connection.

  • kube-proxy: We use kube-proxy in iptables mode and it functions as expected, but with the caveat that Kubernetes Services see connections from a node’s source IP instead of the Pod’s source IP as the netfilter rules are processed in the default namespace. This side effect is similar to Kubernetes behavior under the userspace proxy. Since we’re optimizing for services connecting via the Envoy mesh, this particular tradeoff hasn’t been a significant issue.
  • kube2iam: Traffic from Pods to the AWS Metadata service transits over the unnumbered point-to-point interface to reach the default namespace before being redirected via destination NAT. The Pod’s source IP is maintained as kube2iam runs as a normal Daemon Set.

VPC optimizations

Our design is heavily optimized for intra-VPC traffic where IPvlan is the only overhead between the instance’s ethernet interface and the Pod network namespace. We bias toward traffic remaining within the VPC and not transiting the IPv4 Internet where veth and NAT overhead is incurred. Unfortunately, many AWS services require transiting the Internet; however, both DynamoDB and S3 offer VPC gateway endpoints.

While we have not yet implemented IPv6 support in our CNI stack, we have plans to do so in the near future. IPv6 can make use of the IPvlan interface for both VPC traffic as well as Internet traffic, due to AWS’s use of public IPv6 addressing within VPCs and support for egress-only Internet Gateways. NAT and veth overhead will not be required for this traffic.

We’re planning to migrate to a VPC endpoint for DynamoDB and use native IPv6 support for communication to S3. Biasing toward extremely low overhead IPv6 traffic with higher overhead for IPv4 Internet traffic seems like the right future direction.

Ongoing work and next steps

Our stack is composed of a slightly modified upstream IPvlan CNI plugin, an unnumbered point-to-point CNI plugin, and an IPAM plugin that does the bulk of the heavy lifting. We’ve opened a pull request against the CNI plugins repo with the hope that we can unify the upstream IPvlan plugin functionality with our additional change that permits the IPAM plugin to communicate back to the IPvlan driver the interface (ENI device) containing the allocated Pod IP address.

Short of adding IPv6 support, we’re close to being feature complete with our initial design. We’re very interested in hearing feedback on our CNI stack, and we’re hopeful the community will find it a useful addition that encourages Kubernetes adoption on AWS. Please reach out to us via GitHub, email, or Gitter.

Thanks

cni-ipvlan-vpc-k8s is a team effort combining engineering resources from Lyft’s Infrastructure and Security teams. Special thanks to Yann Ramin who coauthored much of the code and Mike Cutalo who helped get the testing infrastructure into shape.

Interested in working on Kubernetes? Lyft is hiring! Drop me a note on Twitter or at pfisher@lyft.com.